Agent Playbook, build tutorial
A walkthrough of how to build a document-processing agent that turns the messy PDFs, scans, and photos your business receives (invoices, purchase orders, delivery challans, KYC forms) into clean, validated, structured data. It works out what each document is, extracts the fields a typed schema asks for straight from the image, checks them against real rules, and sends only the uncertain fields to a person. Every code step is shown in three frameworks, Google ADK, the Claude SDK, and LangGraph, so you can build it on the stack your team already knows. The code here is the agent's core logic; standing it up in production also means adding the setup, integrations, and hardening around it.
Every business runs on documents it did not create: a supplier's invoice, a customer's purchase order, a signed delivery challan, a KYC form. Somebody keys each one into your system by hand, and at any real volume that is slow, expensive, and quietly error-prone: a transposed digit here, a skipped line there, a GSTIN typed wrong that breaks a tax claim later. Templates and old-style OCR help until the next vendor uses a different layout, and then they break.
A document agent reads the image directly. It works out what each document is, extracts the fields into a known shape, validates them against the same rules a careful clerk would apply, and only asks a person about the fields it is unsure of. The layout can change every time; the schema does not. If you want to see the idea on a single file first, the free invoice reader turns one photo into a table in your browser.
You do not build the agent loop from scratch. A framework gives you the moving parts, and you supply the logic: a step is a model plus an instruction plus a typed schema or a few tools. The code tabs on the build steps show the same classify, extract, and validate flow three ways, so you can read it in the stack you already use.
Google ADK composes the steps with a SequentialAgent and a Runner, the Claude SDK drives a tool-use loop, and LangGraph builds an explicit graph. The intake, the validation rules, and the routing are deterministic and identical in all three; only the orchestration and the model calls differ.
Every document flows through one loop: pulled in and classified, extracted to a typed schema, validated, then either exported when confident or sent to a person when not. Corrections feed back as examples, and guardrails for idempotency, confidence, and audit wrap the whole thing.
The document sources (a mailbox, a Drive or S3 folder, a scanner), a few sample documents per type, and the target system you post to.
A model API key (Gemini, Claude, or both), your ERP, DMS, or accounting API for export, and storage for the files and the audit log.
A schema and validation rules per document type, the confidence threshold, DPDP Act 2023 handling for KYC, and who reviews and approves.
Install the packages and set your Gemini key. That is enough to run the reference file below, which extracts and validates a sample document locally. For production you also connect, with provider-specific credentials, your document sources, your ERP or DMS for export, and storage for the files and the audit trail.
# Python 3.10 or newer
python -m venv venv && source venv/bin/activate
# The runnable path below uses Google ADK + Gemini:
pip install google-adk google-genai pydantic
# Your Gemini key from aistudio.google.com/apikey:
export GEMINI_API_KEY="your-key-here"

| Layer | Pick | Why |
|---|---|---|
| Intake | A watched mailbox, a cloud folder (Drive, S3), a scanner, or an upload form | Documents arrive as PDFs, scans, and phone photos. Pull them from where they land so nothing is keyed by hand, and hash each file so a re-run never processes it twice. |
| Classification | A vision model (Gemini 2.5 Flash, Claude) over a page image | Decide what each document is, an invoice, a purchase order, a delivery challan, a KYC form, so the right extraction schema runs. Mixed batches sort themselves. |
| Extraction | A vision model with a typed schema (structured output) | Pull exactly the fields the schema asks for, straight from the image, with no template per vendor. The code tabs show the same call three ways. |
| Validation | Deterministic Python rules and cross-checks | Check formats (GSTIN, PAN, dates), the arithmetic (line totals add up), and required fields. This is testable code, not a model guess. |
| Agent framework | Google ADK, the Claude SDK, or LangGraph | Drives classify, extract, validate, route, and export, with a review branch. Pick the one your team runs; the logic is identical. |
| Output and review | Your ERP or DMS API, plus a human review queue | Confident documents flow straight to your system; anything below threshold goes to a person, field by field, with the image beside it, and every correction is logged. |
The classify, extract, and orchestration steps show the same logic in Google ADK, the Claude SDK, and LangGraph. The intake, validation, and routing steps are deterministic Python and identical everywhere. These are the building blocks, and they call your own systems (the folder, the ERP) by name; the complete file you can run today is in "Put it together" below.
The work starts with plumbing, not intelligence. Documents arrive in a mailbox, a shared Drive or S3 folder, from a scanner, or through an upload form, so the agent watches those sources and pulls each new file in. It hashes every file as it arrives, which makes the whole pipeline idempotent: the same PDF dropped twice, or a job re-run after a crash, is recognized and skipped rather than processed again. Deterministic, and the same in every framework.
import hashlib
import os

def pull_new(folder: str, seen: set) -> list[dict]:
    docs = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):  # skip sub-folders, which cannot be hashed
            continue
        with open(path, "rb") as f:  # close the file handle once it is read
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:  # same file, already processed
            continue
        seen.add(digest)
        docs.append({"path": path, "hash": digest, "name": name})
    return docs
# Hashing each file makes re-runs idempotent: a document is never processed twice.

A real intake is mixed: invoices, purchase orders, delivery challans, and KYC forms arrive in the same pile, and each needs a different extraction schema. So before extracting, the agent shows the page image to a vision model and asks it to pick one label from a fixed list. That label routes the document to the right schema in the next step. PDFs are rasterized to page images first so every model sees the same input. The code below shows the same classification three ways.
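A plain-text reply to a "pick one label" prompt is not guaranteed to match the list exactly; it may come back capitalized, quoted, or with a trailing period. A minimal sketch of mapping such a reply onto the fixed list, assuming a hypothetical helper named `normalize_label` (it is not part of any SDK) that falls back to `other` when the reply is ambiguous:

```python
import re

TYPES = ["invoice", "purchase_order", "delivery_challan", "kyc_form", "other"]

def normalize_label(reply: str, allowed: list = TYPES) -> str:
    """Map a free-text model reply onto one allowed label (hypothetical helper)."""
    # Lowercase and collapse every run of other characters into "_",
    # so "Purchase Order." becomes "purchase_order".
    cleaned = re.sub(r"[^a-z0-9]+", "_", reply.lower()).strip("_")
    if cleaned in allowed:
        return cleaned
    # Otherwise accept the reply only if exactly one label appears inside it.
    hits = [label for label in allowed if label != "other" and label in cleaned]
    return hits[0] if len(hits) == 1 else "other"

print(normalize_label("Invoice."))                   # invoice
print(normalize_label("This is a KYC form"))         # kyc_form
print(normalize_label("invoice or purchase order"))  # other
```

Variants that constrain the output with an enum or a typed schema should not need this step; it matters only where the model answers in free text.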
# Google ADK path: Gemini through the google-genai client
from google import genai
from google.genai import types

client = genai.Client()
TYPES = ["invoice", "purchase_order", "delivery_challan", "kyc_form", "other"]

def classify(image: bytes) -> str:
    part = types.Part.from_bytes(data=image, mime_type="image/jpeg")
    return client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[f"What kind of document is this? Reply with one of {TYPES}.", part]
    ).text.strip()

# Claude SDK: a forced tool call returns the label as structured input
from anthropic import Anthropic
import base64

client = Anthropic()

def classify(image: bytes) -> str:
    b64 = base64.standard_b64encode(image).decode()
    res = client.messages.create(model="claude-opus-4-8", max_tokens=64,
        tools=[{"name": "set_type", "input_schema": {"type": "object",
            "properties": {"doc_type": {"type": "string", "enum":
                ["invoice","purchase_order","delivery_challan","kyc_form","other"]}},
            "required": ["doc_type"]}}],
        tool_choice={"type": "tool", "name": "set_type"},
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {"type": "base64",
                "media_type": "image/jpeg", "data": b64}},
            {"type": "text", "text": "Classify this document."}]}])
    return res.content[0].input["doc_type"]

# LangGraph path: a LangChain Gemini model with structured output
from langchain_google_genai import ChatGoogleGenerativeAI
from pydantic import BaseModel
from typing import Literal
import base64

class DocType(BaseModel):
    doc_type: Literal["invoice","purchase_order","delivery_challan","kyc_form","other"]

vision = ChatGoogleGenerativeAI(model="gemini-2.5-flash").with_structured_output(DocType)

def classify(image: bytes) -> str:
    b64 = base64.b64encode(image).decode()
    msg = {"role": "user", "content": [
        {"type": "text", "text": "Classify this document."},
        {"type": "image_url", "image_url": f"data:image/jpeg;base64,{b64}"}]}
    return vision.invoke([msg]).doc_type

This is the heart of it. Each document type has a schema, the exact fields you want and their types, and the agent asks the vision model to fill that schema from the image. Because the schema is typed, the model returns structured data, not prose: a vendor string, a total as a number, a list of line items. There is no per-vendor template to maintain; when a supplier changes their layout, the same schema still applies. The code below shows structured extraction in all three frameworks.
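Since each document type has its own schema, the label from the classify step has to select one. A minimal sketch of that lookup, assuming a hypothetical registry called `REQUIRED_FIELDS`; only the invoice entry mirrors the fields used in this tutorial, the other entries are illustrative placeholders, and plain field lists stand in for the typed schema classes:

```python
from typing import Optional

# Hypothetical registry: which fields each document type must yield.
# Only the invoice entry mirrors the tutorial; the rest are placeholders.
REQUIRED_FIELDS = {
    "invoice": ["vendor", "invoice_no", "gstin", "total", "items"],
    "purchase_order": ["buyer", "po_no", "total", "items"],
    "delivery_challan": ["consignor", "challan_no", "items"],
    "kyc_form": ["name", "pan", "date_of_birth"],
}

def schema_for(doc_type: str) -> Optional[list]:
    """Return the field list for a label, or None when no schema exists."""
    return REQUIRED_FIELDS.get(doc_type)

def missing_fields(doc_type: str, extracted: dict) -> list:
    """List required fields that are absent or empty in an extraction."""
    fields = schema_for(doc_type)
    if fields is None:
        raise KeyError(f"no schema registered for {doc_type!r}")
    return [f for f in fields if extracted.get(f) in (None, "", [])]
```

A label with no registered schema, such as `other`, returns `None` from `schema_for`, which is a natural point to send the document to review instead of guessing at fields.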
# Google ADK path: Gemini with a Pydantic response schema
from pydantic import BaseModel

class LineItem(BaseModel):
    desc: str
    qty: float
    rate: float
    amount: float

class Invoice(BaseModel):
    vendor: str
    invoice_no: str
    gstin: str
    total: float
    items: list[LineItem]

def extract(image: bytes) -> Invoice:
    part = types.Part.from_bytes(data=image, mime_type="image/jpeg")
    res = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=["Extract the invoice fields from this document.", part],
        config=types.GenerateContentConfig(
            response_mime_type="application/json", response_schema=Invoice))
    return res.parsed  # a typed Invoice object, not a string

# Claude SDK: the schema is the input schema of a forced tool call
SCHEMA = {"type": "object", "properties": {
    "vendor": {"type": "string"}, "invoice_no": {"type": "string"},
    "gstin": {"type": "string"}, "total": {"type": "number"},
    "items": {"type": "array", "items": {"type": "object", "properties": {
        "desc": {"type": "string"}, "qty": {"type": "number"},
        "rate": {"type": "number"}, "amount": {"type": "number"}}}}},
    "required": ["vendor", "invoice_no", "total"]}

def extract(image: bytes) -> dict:
    b64 = base64.standard_b64encode(image).decode()
    res = client.messages.create(model="claude-opus-4-8", max_tokens=1024,
        tools=[{"name": "save_invoice", "input_schema": SCHEMA}],
        tool_choice={"type": "tool", "name": "save_invoice"},
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {"type": "base64",
                "media_type": "image/jpeg", "data": b64}},
            {"type": "text", "text": "Extract the invoice fields."}]}])
    return res.content[0].input  # matches SCHEMA

# LangGraph path: a LangChain Gemini model with structured output
from pydantic import BaseModel

class LineItem(BaseModel):
    desc: str
    qty: float
    rate: float
    amount: float

class Invoice(BaseModel):
    vendor: str
    invoice_no: str
    gstin: str
    total: float
    items: list[LineItem]

extractor = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash").with_structured_output(Invoice)

def extract(image: bytes) -> Invoice:
    b64 = base64.b64encode(image).decode()
    return extractor.invoke([{"role": "user", "content": [
        {"type": "text", "text": "Extract the invoice fields."},
        {"type": "image_url", "image_url": f"data:image/jpeg;base64,{b64}"}]}])

A model can misread a digit, so nothing is trusted until it is checked. The agent runs deterministic rules over the extracted fields: identifiers like the GSTIN match their format, dates are real dates, and the line items actually add up to the stated total. Each rule that fails becomes a flag on that document. This is plain Python, testable and auditable, and it is what turns a confident-looking extraction into one you can actually post.
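Two of the rule families named here, PAN format and real calendar dates, can be sketched with the standard library alone. The flag names `bad_pan` and `bad_date` and the accepted date formats are assumptions for illustration, modeled on the flag style this tutorial uses:

```python
import re
from datetime import datetime

PAN = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")  # five letters, four digits, one letter
DATE_FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")  # assumed input formats

def is_valid_pan(value) -> bool:
    """True when the text has the PAN shape; this checks format, not existence."""
    return bool(PAN.fullmatch(value or ""))

def parse_date(value):
    """Return a date if the text is a real calendar date in a known format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime((value or "").strip(), fmt).date()
        except ValueError:
            continue  # wrong format or impossible date, try the next format
    return None

def check_formats(pan: str, doc_date: str) -> list:
    """Collect hypothetical flags for a PAN and a document date."""
    flags = []
    if not is_valid_pan(pan):
        flags.append("bad_pan")
    if parse_date(doc_date) is None:
        flags.append("bad_date")
    return flags
```

Because `strptime` rejects impossible dates such as 31 February, a date that merely looks right still gets flagged.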
import re

GSTIN = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]")

def validate(inv) -> list[str]:
    flags = []
    calc = round(sum(li.qty * li.rate for li in inv.items), 2)
    if abs(calc - inv.total) > 1:
        flags.append("total_mismatch")
    if inv.gstin and not GSTIN.fullmatch(inv.gstin):
        flags.append("bad_gstin")
    if not inv.invoice_no:
        flags.append("missing_invoice_no")
    return flags  # an empty list means every rule passed

Now decide what is safe to pass through. The agent combines the validation flags with an extraction confidence, and routes the document: a clean, high-confidence document goes straight through, while anything with a failed rule or a low score goes to review. The gate is deliberately conservative, because a wrong value posted automatically is far more expensive than a human glance. Tune the threshold to your own risk appetite. Deterministic, and identical everywhere.
def route(flags: list[str], confidence: float, min_conf: float = 0.85) -> str:
    if flags or confidence < min_conf:
        return "review"  # a person checks the flagged or low-confidence doc
    return "straight_through"  # clean docs flow on with no human touch
# Confidence can come from the model's own signal or from how cleanly it parsed.
# Any failed rule sends the document to a person, no matter how confident the model is.

Act on the routing decision. Documents that pass go straight to your system, the ERP, the DMS, or the accounting tool, written through its API with the source file linked for traceability. Everything else lands in a review queue with the extracted fields, the flags, and the original image side by side, so a reviewer corrects a couple of fields instead of typing the whole document. Nothing uncertain is ever posted automatically. Deterministic, and the same in every framework.
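For the reviewer to correct a couple of fields rather than the whole document, each flag has to point at the fields it concerns. A minimal sketch of that mapping, assuming the three flag names raised by the validation step; the `FLAG_FIELDS` table and the `"*"` convention for "show everything" are hypothetical:

```python
# Hypothetical mapping from validation flags to the fields a reviewer
# should look at; adjust it to your own rules.
FLAG_FIELDS = {
    "total_mismatch": ["total", "items"],
    "bad_gstin": ["gstin"],
    "missing_invoice_no": ["invoice_no"],
}

def fields_to_review(flags: list) -> list:
    """Return the fields to highlight, in first-seen order without repeats."""
    fields = []
    for flag in flags:
        # An unknown flag points at nothing specific, so fall back to "*",
        # meaning "show the whole document".
        for field in FLAG_FIELDS.get(flag, ["*"]):
            if field not in fields:
                fields.append(field)
    return fields

print(fields_to_review(["bad_gstin", "total_mismatch"]))  # ['gstin', 'total', 'items']
```

The result can be stored with the review-queue entry so the review screen highlights only those fields.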
# erp, audit, and review_queue stand for your own system clients.
def settle(doc: dict, inv, flags: list[str], decision: str) -> dict:
    if decision == "straight_through":
        erp.create(doc["type"], inv)  # posted, source file linked
        audit.log("auto_export", doc["hash"], inv)
        return {"status": "exported"}
    review_queue.add(doc, extracted=inv, flags=flags, image=doc["path"])
    return {"status": "in_review", "flags": flags}  # a human finishes it
# A clean run posts the obvious and hands a person only the fields that need eyes.

Review is where the system gets better, not just corrected. The reviewer sees the document beside the extracted fields, fixes only what was flagged, and approves. The agent logs who changed what, exports the corrected record, and keeps the correction as an example for that document type, so the next batch of the same layout extracts more cleanly. Over time the straight-through rate can climb, as the hard cases become the examples that teach it. Deterministic, and the same in every framework.
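Keeping a correction "as an example" can be as simple as a small per-type buffer whose contents are prepended to the next extraction prompt. A minimal in-memory sketch, assuming a hypothetical `ExampleStore` class whose `add` method has the same shape as the `examples.add(doc["type"], corrected)` call in this step; the prompt wording is illustrative only:

```python
import json
from collections import defaultdict, deque

class ExampleStore:
    """Keeps the most recent corrections per document type (in memory)."""

    def __init__(self, per_type: int = 3):
        # A bounded deque drops the oldest correction once it is full.
        self._by_type = defaultdict(lambda: deque(maxlen=per_type))

    def add(self, doc_type: str, corrected: dict) -> None:
        self._by_type[doc_type].append(corrected)

    def prompt_prefix(self, doc_type: str) -> str:
        """Render stored corrections as few-shot text to prepend to a prompt."""
        shots = self._by_type.get(doc_type, ())
        if not shots:
            return ""
        lines = [f"Example {i}: {json.dumps(s, sort_keys=True)}"
                 for i, s in enumerate(shots, 1)]
        return "Previously corrected records of this type:\n" + "\n".join(lines)
```

Capping the buffer keeps the prompt short; a production store would persist the examples and probably select them by vendor or layout rather than by recency alone.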
# audit, erp, and examples stand for your own system clients.
def on_review(doc: dict, corrected) -> None:
    audit.log("correction", doc["hash"], corrected)  # who fixed what, and when
    erp.create(doc["type"], corrected)  # the fixed record flows on
    examples.add(doc["type"], corrected)  # a few-shot example for next time
# Every correction makes the next batch of that document type a little more accurate.

The pieces run on a schedule over each batch. ADK composes classify, extract, and validate under a SequentialAgent and a Runner, the Claude SDK runs them in a tool-use loop, and LangGraph builds an explicit classify-extract-validate graph. Whichever you pick, the same guardrails apply: the file hash makes every run idempotent, the confidence threshold and the schemas live in config rather than the code, personal data on KYC forms is handled under the DPDP Act 2023, and every extraction, correction, and export is written to an immutable audit log while you track the straight-through rate.
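Keeping the threshold in config rather than code might look like the sketch below. The JSON layout, the function names, and the stricter 0.95 value for KYC forms are all assumptions for illustration; only the 0.85 default comes from this tutorial, and the sketch covers the threshold but not the schemas:

```python
import json

DEFAULT_MIN_CONF = 0.85  # the default threshold used in this tutorial

def _checked(value, label: str) -> float:
    """Coerce a threshold to float and reject values outside 0..1."""
    conf = float(value)
    if not 0.0 <= conf <= 1.0:
        raise ValueError(f"{label} must be between 0 and 1")
    return conf

def load_config(text: str) -> dict:
    """Parse a JSON config and resolve a threshold for each listed type."""
    raw = json.loads(text)
    default = _checked(raw.get("min_conf", DEFAULT_MIN_CONF), "min_conf")
    per_type = {
        doc_type: _checked(cfg.get("min_conf", default), f"min_conf for {doc_type}")
        for doc_type, cfg in raw.get("types", {}).items()
    }
    return {"min_conf": default, "types": per_type}

def threshold_for(config: dict, doc_type: str) -> float:
    """Per-type threshold, falling back to the global default."""
    return config["types"].get(doc_type, config["min_conf"])

# Illustrative config text; in practice this would be read from a file.
EXAMPLE = '{"min_conf": 0.85, "types": {"kyc_form": {"min_conf": 0.95}, "invoice": {}}}'
config = load_config(EXAMPLE)
print(threshold_for(config, "kyc_form"))        # 0.95
print(threshold_for(config, "purchase_order"))  # 0.85
```

Validating the values at load time means a typo in the config fails at startup rather than silently loosening the gate.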
# Google ADK
from google.adk.agents import SequentialAgent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService

# classifier, extractor, and validator are each an LlmAgent with the tools
doc_pipeline = SequentialAgent(name="doc_pipeline",
    sub_agents=[classifier, extractor, validator])
runner = Runner(agent=doc_pipeline, app_name="docs",
    session_service=InMemorySessionService())

def guard(doc: dict, confidence: float, min_conf: float = 0.85) -> str | None:
    if store.seen(doc["hash"]):  # idempotent by file hash
        return "duplicate"
    if confidence < min_conf:  # below the threshold
        return "needs_human"
    return None
# Run on each batch from the watched folder via cron / Cloud Scheduler.

# Claude SDK
from anthropic import Anthropic

client = Anthropic()
TOOLS = [classify_tool, extract_tool, validate_tool, export_tool]  # JSON schemas

def run_docs(messages: list):
    while True:
        resp = client.messages.create(model="claude-opus-4-8", max_tokens=2048,
            tools=TOOLS, messages=messages)
        if resp.stop_reason != "tool_use":
            return resp
        messages.append({"role": "assistant", "content": resp.content})
        # Answer every tool call from this turn in a single user message.
        results = []
        for b in resp.content:
            if b.type == "tool_use":
                out = dispatch(b.name, b.input)
                results.append({"type": "tool_result", "tool_use_id": b.id,
                                "content": str(out)})
        messages.append({"role": "user", "content": results})

# LangGraph
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

# A TypedDict state lets LangGraph merge each node's partial update key by key,
# so "image" is still available to the extract node after classify has run.
class DocState(TypedDict, total=False):
    image: bytes
    type: str
    doc: object
    flags: list[str]

g = StateGraph(DocState)
g.add_node("classify", lambda s: {"type": classify(s["image"])})
g.add_node("extract", lambda s: {"doc": extract(s["image"])})
g.add_node("validate", lambda s: {"flags": validate(s["doc"])})
g.add_edge(START, "classify")
g.add_edge("classify", "extract")
g.add_edge("extract", "validate")
g.add_edge("validate", END)
doc_pipeline = g.compile()  # invoke per batch via your scheduler
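The ADK guard in this step calls `store.seen(doc["hash"])` without defining `store`. For the hash check to survive a crash or a restart, the seen hashes need to live outside process memory. A minimal sketch using SQLite from the standard library; the `SeenStore` class, its `mark` method, and the `seen.db` path are hypothetical names, not part of any framework:

```python
import sqlite3

class SeenStore:
    """Remembers processed file hashes across runs (hypothetical helper)."""

    def __init__(self, path: str = "seen.db"):
        self._db = sqlite3.connect(path)
        self._db.execute("CREATE TABLE IF NOT EXISTS seen (hash TEXT PRIMARY KEY)")
        self._db.commit()

    def seen(self, digest: str) -> bool:
        row = self._db.execute(
            "SELECT 1 FROM seen WHERE hash = ?", (digest,)).fetchone()
        return row is not None

    def mark(self, digest: str) -> None:
        # INSERT OR IGNORE keeps a repeated mark from failing on the primary key.
        self._db.execute("INSERT OR IGNORE INTO seen (hash) VALUES (?)", (digest,))
        self._db.commit()

store = SeenStore(":memory:")  # use a file path to persist between runs
store.mark("abc123")
print(store.seen("abc123"), store.seen("zzz"))  # True False
```

One design choice worth making deliberately is when to call `mark`: marking only after a document has been exported or queued for review means a crash mid-document leads to a retry rather than a silent skip.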
Here is a runnable file. It extracts invoice fields from a sample document, validates them against the GSTIN format and the line-item total, and decides straight-through or review. The sample is passed as text so it runs with no image; production reads the image with the vision model, as in the build steps. Save it as main.py and run python main.py.
# main.py -- run: python main.py
# A runnable reference: extracts invoice fields from a sample document (passed as
# text so it runs with no image), validates them, and decides straight-through vs
# review. Production reads the image with the vision model, as in the build steps.
import asyncio, json, re
from google import genai
from google.genai import types
from google.adk.agents import LlmAgent
from google.adk.runners import InMemoryRunner

client = genai.Client()  # reads GEMINI_API_KEY

# ---- a sample document (replace with a real image in production) ----
SAMPLE_DOC = """ACME STEEL CO Tax Invoice INV-2026-0441 GSTIN 29ABCDE1234F1Z5
Item Qty Rate Amount
Steel Pipe 2in 40 1200 48000
Brass Valve 15 2000 30000
Total 78000"""
# --------------------------------------------------------------------

def extract(doc_text: str) -> dict:
    """Ask the model to fill a typed JSON schema from the document text."""
    out = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=["Extract invoice fields as JSON with keys vendor, invoice_no, "
                  "gstin, total (number), and items (each desc, qty, rate, amount). "
                  "Document:\n" + doc_text],
        config=types.GenerateContentConfig(response_mime_type="application/json")).text
    return json.loads(out)

def validate(inv: dict) -> list[str]:
    """Deterministic rules: GSTIN format and line items summing to the total."""
    flags = []
    items = inv.get("items", [])
    calc = round(sum(float(i["qty"]) * float(i["rate"]) for i in items), 2)
    if abs(calc - float(inv.get("total", 0))) > 1:
        flags.append("total_mismatch")
    # "or" guards against a null gstin, which re.fullmatch would reject.
    if not re.fullmatch(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]",
                        inv.get("gstin") or ""):
        flags.append("bad_gstin")
    return flags

def route(flags: list[str]) -> str:
    return "review" if flags else "straight_through"

agent = LlmAgent(
    name="doc_agent", model="gemini-2.5-flash",
    instruction=("Call extract on the sample document, then validate the result, "
                 "then route it. Report the extracted fields, any validation flags, "
                 "and whether the document goes straight through or to review."),
    tools=[extract, validate, route])

async def main():
    runner = InMemoryRunner(agent=agent, app_name="docs")
    session = await runner.session_service.create_session(app_name="docs", user_id="demo")
    msg = types.Content(role="user",
                        parts=[types.Part(text=f"Process this document:\n{SAMPLE_DOC}")])
    async for event in runner.run_async(user_id="demo", session_id=session.id, new_message=msg):
        # Print only the final answer, and only if it carries text.
        if event.is_final_response() and event.content and event.content.parts:
            print(event.content.parts[0].text)

if __name__ == "__main__":
    asyncio.run(main())

Because the sample's line items add up and the GSTIN is well-formed, validation returns no flags, so the document is routed straight through (the exact wording is model-generated, so it varies run to run). Break the total or the GSTIN in the sample and validation flags it, so it flips to review, which is the whole point of the gate.
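The gate itself can be exercised without a model call or an API key, because validation and routing are plain Python. The sketch below repeats the two rules from main.py and runs them on hand-written dicts that stand in for the model's extraction; the dict contents are taken from the sample document, and the two broken variants are made up for the test:

```python
import re

GSTIN = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]")

def validate(inv: dict) -> list:
    """Same two rules as main.py: line-item total and GSTIN format."""
    flags = []
    calc = round(sum(float(i["qty"]) * float(i["rate"]) for i in inv.get("items", [])), 2)
    if abs(calc - float(inv.get("total", 0))) > 1:
        flags.append("total_mismatch")
    if not GSTIN.fullmatch(inv.get("gstin") or ""):
        flags.append("bad_gstin")
    return flags

def route(flags: list) -> str:
    return "review" if flags else "straight_through"

# Hand-written stand-in for what extraction would return on the sample.
GOOD = {"vendor": "ACME STEEL CO", "invoice_no": "INV-2026-0441",
        "gstin": "29ABCDE1234F1Z5", "total": 78000,
        "items": [{"desc": "Steel Pipe 2in", "qty": 40, "rate": 1200, "amount": 48000},
                  {"desc": "Brass Valve", "qty": 15, "rate": 2000, "amount": 30000}]}
BAD_TOTAL = {**GOOD, "total": 87000}             # transposed digits
BAD_GSTIN = {**GOOD, "gstin": "29ABCDE1234F1Y5"}  # "Z" marker replaced

for name, inv in [("good", GOOD), ("bad total", BAD_TOTAL), ("bad gstin", BAD_GSTIN)]:
    flags = validate(inv)
    print(name, flags, route(flags))
# good [] straight_through
# bad total ['total_mismatch'] review
# bad gstin ['bad_gstin'] review
```

Checks like these make a natural unit-test suite for the rules, independent of which framework drives the model calls.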
A reviewer sees one screen: the queue of flagged documents, the original image, the extracted fields with the uncertain ones highlighted, and a single action to approve.
| Approach | Best for | Effort | Cost | Control |
|---|---|---|---|---|
| Custom agent (build in-house) | You handle many document types, want your own schemas and validation rules, and need the data in-house | High | Build cost + API and model usage | Full |
| IDP platform (Nanonets, Docsumo, Azure Document Intelligence, AWS Textract) | Common document types, fast start, you accept per-page pricing | Low to medium | Per-page or per-document subscription | Medium |
| No-code (n8n or Make with an OCR node) | Low volume, a single document type, a quick pilot | Low | Cheap, but brittle once layouts and rules grow | Low |
Documents carry sensitive data, KYC forms most of all, so control of the data is part of the design. Personal data is handled in line with the DPDP Act 2023, whose core obligations phase in through 2027, and you can run the whole pipeline in your own cloud or on-premise so identity documents and financial records never leave infrastructure you control. Access to the review queue and the stored files is limited to who needs it.
Every extraction, every correction, and every export is written to an immutable audit log with the file hash, the fields, the confidence, and who approved. That trail is what makes automated document handling defensible, and it is why nothing below your confidence threshold is ever posted without a person.
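One common way to approximate an immutable log in application code is hash chaining: each entry stores the hash of the previous one, so an edit to any past entry breaks the chain. A minimal in-memory sketch under that assumption; the `AuditLog` class and its field names are hypothetical, with `log` shaped like the `audit.log(event, file_hash, record)` calls in the build steps, and it makes tampering detectable rather than impossible:

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only, hash-chained log (tamper-evident, not tamper-proof)."""

    GENESIS = "0" * 64  # stands in for "previous hash" on the first entry

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(entry: dict) -> str:
        # Hash every field except the entry's own hash, in a stable key order.
        core = {k: v for k, v in entry.items() if k != "entry_hash"}
        return hashlib.sha256(json.dumps(core, sort_keys=True).encode()).hexdigest()

    def log(self, event: str, file_hash: str, payload: dict, actor: str = "agent") -> dict:
        prev = self.entries[-1]["entry_hash"] if self.entries else self.GENESIS
        entry = {"ts": time.time(), "event": event, "file_hash": file_hash,
                 "payload": payload, "actor": actor, "prev": prev}
        entry["entry_hash"] = self._digest(entry)
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry returns False."""
        prev = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["entry_hash"] != self._digest(entry):
                return False
            prev = entry["entry_hash"]
        return True
```

In production the same idea is usually backed by write-once storage or a database with restricted permissions, so the log cannot simply be rewritten end to end.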
The comparisons below are directional, not promises, and depend on your document types and volume. The point is the shape of the change: less time per document, fewer errors, more throughput, and more documents handled with no human touch.
| Metric | Manual | Agent |
|---|---|---|
| Time per document | Minutes of keying per document, more for multi-page or messy scans | Seconds to extract, and a person only checks the fields that were flagged |
| Error rate | Typos and transposed digits creep in at volume, found later or never | Every document validated against format and arithmetic rules, with low-confidence fields flagged |
| Throughput | Capped by how fast people can type | Scales with compute, so a backlog of thousands can clear overnight |
| Straight-through rate | Every document is touched by a human | Clean documents flow through untouched; people handle only the exceptions |
Accuracy and straight-through rates depend on document quality and type. Figures are ranges that vary by business, not guarantees.
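The straight-through rate in the table is the share of documents exported without review. A minimal sketch for tracking it per batch, assuming each document ends with one of the status strings the export step returns (`exported` or `in_review`); the function name is hypothetical:

```python
def straight_through_rate(statuses: list) -> float:
    """Share of documents exported with no human touch (0.0 for an empty batch)."""
    if not statuses:
        return 0.0
    return sum(s == "exported" for s in statuses) / len(statuses)

batch = ["exported", "in_review", "exported", "exported"]
print(straight_through_rate(batch))  # 0.75
```

Tracked per document type and over time, this number shows whether corrections are actually improving extraction.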
Galific designs, builds, and runs document-processing agents like this on ADK, the Claude SDK, or LangGraph, wired into your document sources and your ERP or DMS. Or explore the ready-made versions, from the Invoice Reader to Document Extraction, in our agent suite.