Getting your results
Extraction is asynchronous: POST /v1/extract returns immediately with one or
more job_ids, and the actual text shows up later. There are two ways to
collect it.
| Pattern | Best for | What you do |
|---|---|---|
| Polling | Quick scripts, sandbox experiments, batches of dozens | Call GET /v1/jobs/{id} until status is terminal. |
| Webhooks | Production, batches of hundreds+, always-on integrations | Register a URL once; we POST signed payloads when jobs finish. |
If in doubt, start with polling: every SDK ships an extract_and_wait
helper that does it for you in one line. Switch to webhooks once you no
longer want to keep a poller process alive.
Polling

```text
POST /v1/extract   → { "job_ids": ["..."], "status": "pending" }
GET  /v1/jobs/{id} → { "id", "filename", "status", "result"? }
```
Statuses
| status | Meaning | result.content present |
|---|---|---|
| pending | Queued, not yet picked up | no |
| processing | Worker is extracting | no |
| completed | Done, full text available | yes |
| partial_success | Some pages/files failed, partial text available | yes |
| failed | Unrecoverable error | no (error in error_message) |
Treat completed, partial_success, and failed as terminal. Anything else
means "poll again."
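As a minimal sketch of that rule, the helpers below classify a job response by its status. The function names are hypothetical and the dict shape is assumed from the GET /v1/jobs/{id} fields listed above; neither is part of the SDK.

```python
from typing import Optional

# Statuses the table above marks as final.
TERMINAL_STATUSES = {"completed", "partial_success", "failed"}


def is_terminal(job: dict) -> bool:
    """Return True once polling can stop (hypothetical helper)."""
    return job.get("status") in TERMINAL_STATUSES


def extracted_text(job: dict) -> Optional[str]:
    """Return result.content for statuses that carry text, else None."""
    if job.get("status") in {"completed", "partial_success"}:
        return (job.get("result") or {}).get("content")
    return None
```

Checking for `partial_success` separately from `completed` lets a caller decide whether partial text is acceptable before using it.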
Cadence
The SDKs default to a 1-second poll interval with exponential backoff (×2, capped at
30 s) and a 5-minute total timeout. Tune these via the poll_interval,
timeout, and backoff options on wait_for_job / waitForJob. Don't poll
more often than once per second; rate limits apply.
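To make the default schedule concrete, here is a rough sketch of the sleep intervals those numbers imply. The generator name and parameters are illustrative, and it assumes the timeout counts only sleep time; the SDKs may also count request time or add jitter.

```python
from typing import Iterator


def poll_delays(initial: float = 1.0, factor: float = 2.0,
                cap: float = 30.0, timeout: float = 300.0) -> Iterator[float]:
    """Yield sleep durations until the next one would exceed `timeout` in total."""
    elapsed = 0.0
    delay = initial
    while elapsed + delay <= timeout:
        yield delay
        elapsed += delay
        # Double the delay each round, but never beyond the cap.
        delay = min(delay * factor, cap)


delays = list(poll_delays())
print(delays[:7])   # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
print(len(delays))  # 13 polls fit in the 5-minute budget
```

Under these assumptions the cap is reached on the sixth poll, so a long-running job costs about one request every 30 seconds.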
Code

```python
from kreuzberg_cloud import KreuzbergCloud

with KreuzbergCloud(api_key="...") as client:
    # One call: submit, poll, return when terminal.
    job = client.extract_and_wait(file="invoice.pdf")
    print(job.result.content)

    # Or split it if you need to do other work in between:
    accepted = client.extract(file="invoice.pdf")
    job = client.wait_for_job(accepted.job_ids[0], timeout=60)
```
```typescript
import { KreuzbergCloud } from "@kreuzberg/cloud";
import { readFile } from "node:fs/promises";

const client = new KreuzbergCloud({ apiKey: process.env.KREUZBERG_API_KEY! });
const data = await readFile("invoice.pdf");
const job = await client.extractAndWait({
  file: { name: "invoice.pdf", data, mimeType: "application/pdf" },
});
console.log(job.result?.content);
```
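```shell
JOB=$(curl -sX POST https://api.kreuzberg.cloud/v1/extract \
  -H "Authorization: Bearer $KREUZBERG_API_KEY" \
  -F "file=@invoice.pdf" | jq -r '.job_ids[0]')

# Poll until any terminal status; waiting only for "completed" hangs on failure.
while :; do
  STATUS=$(curl -s "https://api.kreuzberg.cloud/v1/jobs/$JOB" \
    -H "Authorization: Bearer $KREUZBERG_API_KEY" | jq -r .status)
  case "$STATUS" in completed|partial_success|failed) break ;; esac
  sleep 1
done

# Print the text, or the error message if the job failed.
curl -s "https://api.kreuzberg.cloud/v1/jobs/$JOB" \
  -H "Authorization: Bearer $KREUZBERG_API_KEY" | jq -r '.result.content // .error_message'
```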
Webhooks
Register a URL on your project, and Kreuzberg Cloud POSTs a notification to it as soon as the job reaches a terminal status: no poller, no keep-alive process.
Set up a webhook
The fastest path: open the dashboard and add a webhook to your project. You'll need:
- URL: must be https://. Should respond with 2xx quickly (we time out at 30 s).
- Events: pick from job.completed, job.failed, job.cancelled.
- Secret: leave blank to have one generated, or paste a 32+ byte random string. Save it when it is created; we store only a hash.
Payload
We POST JSON like this:

```json
{
  "event_id": "01HZQ...",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "project_id": "1cbb9d72-660a-4df2-ba3d-66d83b6afaff",
  "status": "completed",
  "error_message": null,
  "timestamp": 1747038551,
  "attempt_count": 1
}
```
Headers:
| Header | Value |
|---|---|
| Content-Type | application/json |
| User-Agent | kreuzberg-webhook/<version> |
| X-Webhook-Signature | sha256=<hex> (only if you set a secret) |
| X-Idempotency-Key | the event_id; use it to deduplicate retries |
Then call GET /v1/jobs/{job_id} once with your API key to fetch the actual
extracted text. Webhook payloads are intentionally small.
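The two-step flow (small payload, then one authenticated fetch) might be sketched as follows. The dataclass and function names are hypothetical; the base URL and Bearer header are taken from the curl example earlier on this page. The request is only constructed here, not sent.

```python
import json
import urllib.request
from dataclasses import dataclass, fields
from typing import Optional

API_BASE = "https://api.kreuzberg.cloud"  # as used in the curl example


@dataclass
class WebhookEvent:
    event_id: str
    job_id: str
    project_id: str
    status: str
    error_message: Optional[str]
    timestamp: int
    attempt_count: int


def parse_event(raw_body: bytes) -> WebhookEvent:
    """Parse a webhook body, ignoring any fields this sketch does not know."""
    data = json.loads(raw_body)
    known = {f.name for f in fields(WebhookEvent)}
    return WebhookEvent(**{k: v for k, v in data.items() if k in known})


def job_request(event: WebhookEvent, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the follow-up GET /v1/jobs/{job_id} request."""
    return urllib.request.Request(
        f"{API_BASE}/v1/jobs/{event.job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Passing the result of `job_request` to `urllib.request.urlopen` (or an equivalent call in your HTTP client) would perform the actual fetch.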
Verify the signature
X-Webhook-Signature is HMAC-SHA256 of the raw request body with your
webhook secret, hex-encoded, prefixed with sha256=. Verify it before
trusting the payload.
```python
import hashlib
import hmac

from fastapi import FastAPI, Header, HTTPException, Request

SECRET = b"..."  # your webhook secret
app = FastAPI()


@app.post("/webhooks/kreuzberg")
async def receive(request: Request,
                  x_webhook_signature: str = Header(...)):
    # Sign the raw bytes, not re-serialized JSON.
    body = await request.body()
    expected = "sha256=" + hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(x_webhook_signature, expected):
        raise HTTPException(401, "bad signature")
    # ... fetch GET /v1/jobs/{job_id} and process
    return {"ok": True}
```
```typescript
import crypto from "node:crypto";
import express from "express";

const SECRET = process.env.KREUZBERG_WEBHOOK_SECRET!;
const app = express();

app.post("/webhooks/kreuzberg", express.raw({ type: "application/json" }),
  (req, res) => {
    const sig = Buffer.from(req.header("x-webhook-signature") ?? "");
    const expected = Buffer.from("sha256=" + crypto
      .createHmac("sha256", SECRET)
      .update(req.body)
      .digest("hex"));
    // timingSafeEqual throws on a length mismatch, so compare lengths first.
    if (sig.length !== expected.length || !crypto.timingSafeEqual(sig, expected)) {
      return res.status(401).end();
    }
    // ... fetch GET /v1/jobs/{job_id} and process
    res.json({ ok: true });
  });
```
```go
import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"io"
	"net/http"
)

var secret = []byte("...")

func receive(w http.ResponseWriter, r *http.Request) {
	// Read the raw body; the signature covers these exact bytes.
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "cannot read body", http.StatusBadRequest)
		return
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	expected := "sha256=" + hex.EncodeToString(mac.Sum(nil))
	sig := r.Header.Get("X-Webhook-Signature")
	if !hmac.Equal([]byte(sig), []byte(expected)) {
		http.Error(w, "bad signature", http.StatusUnauthorized)
		return
	}
	// ... fetch GET /v1/jobs/{job_id} and process
	w.WriteHeader(http.StatusOK)
}
```
Retries
We deliver each event at least once. If your endpoint returns non-2xx or times out:
- Up to 5 attempts total.
- Backoff: 5 s → 30 s → 5 min (then dead-letter).
- 4xx other than 429 is treated as permanent: we stop retrying. 429, 5xx, timeouts, and connection errors are retried until the cap.
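When deciding which status code your handler should return, it can help to see the rules above as a small decision function. This is only an illustration of the documented behavior, not the delivery service's code; the function name is hypothetical, and the handling of 3xx is an assumption because the rules do not mention it.

```python
from typing import Optional


def delivery_outcome(status_code: Optional[int]) -> str:
    """Classify one delivery attempt; None stands for a timeout or connection error."""
    if status_code is None:
        return "retry"
    if 200 <= status_code < 300:
        return "delivered"
    if status_code == 429:
        return "retry"
    if 400 <= status_code < 500:
        return "permanent_failure"
    # 5xx is retried per the rules above; treating 3xx the same way is an assumption.
    return "retry"
```

In practice this suggests returning a 5xx or 429 for transient problems on your side, and reserving other 4xx codes for payloads you never want to see again.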
Use the event_id (also in X-Idempotency-Key) to deduplicate: the same
event may arrive more than once if your endpoint responds slowly.
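A minimal sketch of that deduplication, assuming a single-process handler: the class and function names are hypothetical, and the in-memory set would need to be replaced by a shared store (for example a database table with a unique key) once more than one worker receives webhooks.

```python
import threading


class EventDeduplicator:
    """In-memory record of event_ids that have already been handled."""

    def __init__(self) -> None:
        self._seen = set()
        self._lock = threading.Lock()

    def first_delivery(self, event_id: str) -> bool:
        """Return True the first time an event_id is seen, False on repeats."""
        with self._lock:
            if event_id in self._seen:
                return False
            self._seen.add(event_id)
            return True


dedup = EventDeduplicator()


def handle(payload: dict) -> str:
    if not dedup.first_delivery(payload["event_id"]):
        return "duplicate"  # still acknowledge with 2xx, but do no further work
    # ... fetch GET /v1/jobs/{job_id} and process
    return "processed"
```

Acknowledging a duplicate with 2xx rather than an error keeps the sender from retrying an event you have already handled.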
Testing
The dashboard has a Send test button that fires a synthetic payload at your URL with a real signature — use it to verify your handler before going live.
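For unit tests that run without the dashboard, you could also sign a synthetic body yourself using the scheme described under "Verify the signature" (HMAC-SHA256 of the raw body, hex-encoded, prefixed with sha256=). The helper names, secret, and sample body below are made up for illustration.

```python
import hashlib
import hmac


def sign(secret: bytes, raw_body: bytes) -> str:
    """Compute the X-Webhook-Signature value for a raw request body."""
    return "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Constant-time check of a received signature header."""
    return hmac.compare_digest(sign(secret, raw_body).encode(), header_value.encode())


# Hypothetical test fixture: a fake secret and a trimmed-down payload.
secret = b"local-test-secret-not-for-production"
body = b'{"event_id": "test-1", "job_id": "test-job", "status": "completed"}'
headers = {
    "Content-Type": "application/json",
    "X-Webhook-Signature": sign(secret, body),
}
```

Posting `body` with `headers` to a locally running handler exercises the same verification path as a real delivery, as long as the handler is configured with the same test secret.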