Agent Playbook, build tutorial
A complete, technical build for an accounts payable agent that reads every invoice, matches it to the purchase order and goods receipt, applies your rules, and posts it to your accounting system, escalating to a human only on exceptions. This covers the architecture, real code for each stage, three-way matching, exception handling, security, the numbers, and how to roll it out without it breaking.
Accounts payable looks simple and is not. Every vendor formats invoices differently, totals and taxes need checking, each one must be matched to a purchase order and what actually arrived, approvals get chased over email, and only then is it keyed into the accounting system. The result is slow and error-prone. Industry benchmarks put manual processing at roughly $15 to $40 per invoice (Ardent Partners), while top-quartile automated teams run around $10 or below (APQC), and manual AP teams capture only 20 to 30 percent of available early-payment discounts (Ardent Partners).
Template and RPA tools work until a vendor changes its layout; then they break. An agent is different: it reads any layout, checks the numbers against your records, decides under rules you set, and asks a human only when something genuinely does not add up.
An agent follows an observe, decide, act loop: it reads the invoice, checks it against your data, decides what to do under your policy, and takes the action (post, or escalate). A fixed pipeline runs the same steps every time with no branching. Think of it as eyes (OCR), a brain (the model plus your rules), and hands (the ERP and email tools it can call).
Be honest about which you need. If your invoices are uniform and your rules are simple, a straight pipeline is cheaper and easier to trust. The agent earns its extra complexity when invoices vary a lot, exceptions are common, and the right next step genuinely depends on what the document says. Most real AP sits in that second case, which is why this guide builds an agent with a deterministic matching core.
Every invoice flows through one loop driven by a state machine: ingested, extracted, validated, matched, then either auto-approved and posted, or sent to a human as an exception. Corrections feed back into the agent's memory, and guardrails wrap the whole thing.
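A minimal sketch of that state machine, assuming the states named above. The `State` enum, the `TRANSITIONS` table, and the `advance` helper are illustrative names, and the two exits from `EXCEPTION` (retry extraction, or human approval) are an assumption about how you wire the review step.

```python
from enum import Enum


class State(Enum):
    INGESTED = "ingested"
    EXTRACTED = "extracted"
    VALIDATED = "validated"
    MATCHED = "matched"
    APPROVED = "approved"
    POSTED = "posted"
    EXCEPTION = "exception"  # waiting on a human


# Allowed moves only; anything else is a bug, not a judgment call.
TRANSITIONS = {
    State.INGESTED: {State.EXTRACTED, State.EXCEPTION},
    State.EXTRACTED: {State.VALIDATED, State.EXCEPTION},
    State.VALIDATED: {State.MATCHED, State.EXCEPTION},
    State.MATCHED: {State.APPROVED, State.EXCEPTION},
    State.EXCEPTION: {State.EXTRACTED, State.APPROVED},  # retry, or human approves
    State.APPROVED: {State.POSTED},
    State.POSTED: set(),
}


def advance(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Because every move goes through `advance`, an invoice cannot reach `POSTED` without passing through `APPROVED`, which is the property an auditor will ask about first.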
Documents to handle: digital and scanned PDFs, images, multi-page invoices, and credit and debit notes. Know your messiest formats up front, because they decide your OCR choice.
Systems to connect: your accounting or ERP system (Tally, Zoho, QuickBooks, NetSuite, SAP), the PO and goods-receipt source, and the vendor master.
Controls to define: approval rules and limits, an audit log, data retention and PII handling, and separation of duties between who approves and who pays.
| Layer | Pick | Why |
|---|---|---|
| Intake trigger | Gmail / Outlook push API, or a poller | Fire on every new invoice email. Push is real-time; polling every few minutes is simpler to run. |
| OCR / parse | Google Document AI, AWS Textract, or LlamaParse | Turns scans and PDFs into text plus layout. Tesseract is fine for clean digital files only. |
| Extraction | An LLM with structured output (GPT, Claude, Gemini) | Reads any layout and returns typed fields. This is what survives vendor format drift that breaks templates. |
| Orchestration | LangGraph or a small state machine | Drives the invoice through states (ingested, extracted, matched, approved, posted) and the exception branch. |
| Matching + state | Your ERP API (Tally, Zoho, SAP) + Postgres | Pulls POs and receipts to compare against, and stores runs, decisions, and corrections. |
| Review UI | A web app, a sheet, or Slack approval buttons | Where a human clears exceptions. Start with Slack buttons; graduate to a console as volume grows. |
Subscribe to the accounts-payable inbox and fire the pipeline on every new email that has an attachment. The Gmail and Outlook push APIs give you real-time events; a poller that checks every few minutes is the simpler fallback. Normalise everything to a job the moment it arrives so the handler returns fast and the heavy work happens in the queue. Uploads and vendor-portal exports feed the exact same queue, so there is one path to maintain. Capture the raw file and a content hash up front: the hash is what later stops the same invoice being processed twice.

    from hashlib import sha256

    # Watch the AP inbox; enqueue each attachment, return fast.
    # `gmail` and `jobs` stand in for your mail-push client and job queue.
    @gmail.on_message(label="ap", has_attachment=True)
    def on_invoice(msg):
        for att in msg.attachments:  # pdf, scan, or image
            raw = att.download()
            jobs.enqueue("process_invoice", {
                "file": raw,
                "hash": sha256(raw).hexdigest(),  # used for dedup later
                "source": msg.sender,
            })

Not everything in an AP inbox is an invoice. Before extraction, a quick classification step sorts invoices from credit notes, statements, reminders, and spam, and routes non-invoices out. This one cheap step keeps your extraction accurate and your costs down, because you are not running a full extraction on a marketing email. Tag the vendor here too, so later steps can apply vendor-specific rules and memory.

    # `llm.classify` is a placeholder for a cheap model call on a short OCR preview
    kind = llm.classify(ocr_preview, labels=[
        "invoice", "credit_note", "statement", "reminder", "other",
    ])
    if kind != "invoice":
        route_elsewhere(kind)  # not our job, stop here

A scanned PDF or photo is just an image, so OCR first converts it to text and layout (Document AI, Textract, or LlamaParse). Then an LLM maps that text into a typed schema, not free prose. Forcing structured output against a schema is what makes the result usable downstream and is the single biggest quality lever. Every field carries a confidence; anything low-confidence is flagged rather than trusted. Define the schema once and reuse it everywhere.
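The confidence rule can be sketched without any model call. This assumes your extractor returns a per-field confidence between 0 and 1; the `ExtractedField` type, the `low_confidence_fields` helper, and the 0.85 threshold are hypothetical and should be tuned against your own corrections.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be in [0, 1]


def low_confidence_fields(fields: list[ExtractedField],
                          threshold: float = 0.85) -> list[str]:
    """Return the names of fields that should go to review."""
    return [f.name for f in fields if f.confidence < threshold]


fields = [
    ExtractedField("invoice_no", "INV-1042", 0.99),
    ExtractedField("total", "18400.00", 0.97),
    ExtractedField("po_number", "PO-77", 0.61),  # e.g. smudged on the scan
]
flagged = low_confidence_fields(fields)
print(flagged)  # ['po_number']
```

Returning field names rather than a single pass/fail lets the review surface highlight exactly what the human needs to look at.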

    from datetime import date

    from pydantic import BaseModel, Field

    class LineItem(BaseModel):
        description: str
        quantity: float
        unit_price: float
        amount: float

    class Invoice(BaseModel):
        vendor: str
        invoice_no: str
        invoice_date: date
        po_number: str | None = Field(None, description="PO number if present")
        currency: str = "INR"
        tax: float = 0
        total: float
        line_items: list[LineItem]

    # Force the model to fill exactly this shape
    def extract(ocr_text: str) -> Invoice:
        return llm.parse(ocr_text, response_format=Invoice)

Now check what you read against your own records. Re-add the line items and compare to the stated total to catch arithmetic and OCR errors. Look up the PO and the goods receipt by PO number through your ERP API (or a nightly export) and compare totals and quantities within a tolerance you set. Verify the vendor against your master, and confirm this invoice number has not already been paid. When invoice, PO, and receipt agree, that is the three-way match, and it is deterministic comparison code, so it is testable and auditable, unlike asking the model to decide.

    def three_way_match(inv: Invoice, po, grn, tolerance=100) -> dict:
        # Assumes line amounts are pre-tax; drop `inv.tax` if your vendors quote tax-inclusive lines
        lines_total = sum(li.amount for li in inv.line_items) + inv.tax
        return {
            "math": abs(lines_total - inv.total) < 1,
            "po_total": abs(inv.total - po.total) <= tolerance,
            # .get(..., 0): an item missing from the goods receipt counts as not received
            "receipt": all(grn.qty.get(li.description, 0) >= li.quantity
                           for li in inv.line_items),
            "vendor": vendor_master.is_approved(inv.vendor),
            "not_dup": not ledger.exists(inv.vendor, inv.invoice_no),
        }

Turn business rules into explicit conditions you own. Clean, in-policy invoices auto-approve; everything else routes to a human with the exact failing check named, so the reviewer knows why at a glance. The model never decides on its own here; it applies your policy, and every decision is logged with the rule that fired. Keep the thresholds in config, not code, so finance can change them without a deploy.
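Keeping thresholds in config might look like the sketch below. The `Policy` fields, the JSON layout, and the numbers are assumptions for illustration (the 100 mirrors the default tolerance used in the matching step); in practice the text would come from a file or settings table that finance owns.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    auto_approve_limit: float  # invoices at or above this always go to review
    po_tolerance: float        # allowed gap between invoice and PO totals


def load_policy(text: str) -> Policy:
    """Parse policy JSON and reject missing, unknown, or negative values."""
    data = json.loads(text)
    policy = Policy(**data)  # raises TypeError on missing or unexpected keys
    if policy.auto_approve_limit < 0 or policy.po_tolerance < 0:
        raise ValueError("policy thresholds must be non-negative")
    return policy


# Hypothetical values; replace with your own limits.
policy = load_policy('{"auto_approve_limit": 50000, "po_tolerance": 100}')
```

Failing loudly on a bad config matters here: a silently defaulted limit is how an agent ends up auto-approving invoices nobody intended it to.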

    def decide(inv: Invoice, checks: dict) -> tuple[str, list]:
        failed = [name for name, ok in checks.items() if not ok]
        if inv.total >= policy.auto_approve_limit:
            failed.append("over_auto_approve_limit")  # so the reviewer always sees a reason
        if not failed:
            return "auto_approve", []
        return "exception", failed  # e.g. ["po_total"] -> PO mismatch

Exceptions are a first-class feature, not an afterthought, because they are where money is saved or lost. A flagged invoice lands on a review surface (a web console, a sheet, or a Slack message with buttons) showing the invoice, the extracted fields, and the specific failing check. The reviewer approves or corrects in seconds. Two things make this strong: a retry budget that re-extracts with a stricter prompt before bothering a human, and storing every correction keyed to the vendor so the agent stops repeating that mistake. Low-confidence fields and unseen vendors should default to review until they have a track record.
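One way to give "track record" a concrete meaning is a streak of clean invoices per vendor. This is a sketch under assumptions: the `VendorTrackRecord` class is hypothetical, it keeps state in memory where a real build would use the Postgres store, and the required streak length is yours to choose.

```python
class VendorTrackRecord:
    """Counts consecutive clean invoices per vendor (in memory, for illustration)."""

    def __init__(self, required_clean: int = 5):
        self.required_clean = required_clean
        self._clean_streak: dict[str, int] = {}

    def record(self, vendor: str, corrected: bool) -> None:
        # A human correction resets the streak; a clean pass extends it.
        if corrected:
            self._clean_streak[vendor] = 0
        else:
            self._clean_streak[vendor] = self._clean_streak.get(vendor, 0) + 1

    def trusted(self, vendor: str) -> bool:
        # Vendors never seen before have a streak of 0, so they go to review.
        return self._clean_streak.get(vendor, 0) >= self.required_clean


history = VendorTrackRecord(required_clean=2)
history.record("Acme Supplies", corrected=False)
history.record("Acme Supplies", corrected=False)
print(history.trusted("Acme Supplies"))   # True
print(history.trusted("New Vendor Ltd"))  # False
```

Resetting on any correction is deliberately strict: a vendor that just changed its layout drops back to review until it proves out again.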

    def on_exception(inv, failed):
        # `inv.retries` and `inv.confidence` are run metadata kept alongside the extracted fields
        if inv.retries < 2 and "math" in failed:
            return reextract(inv, stricter=True)  # try again before a human
        review_queue.add(inv, reason=failed,
                         fields=inv.dict(), confidence=inv.confidence)

    # Learn from the fix so it does not recur for this vendor
    on_correction(lambda inv, fix: memory.save(inv.vendor, fix))

On approval, write the bill through your accounting tool's API and attach the source PDF and the decision log, so finance can defend every entry. Wrap the agent's actions as typed tools (here, a LangChain StructuredTool) so the orchestrator can call them safely with validated arguments. Where a tool has no API, emit a validated import file instead of re-keying. After posting, reconcile the payment back against the invoice so the loop closes.
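For the no-API case, a validated import file can be as simple as the sketch below. The column layout is hypothetical (every accounting tool defines its own import format), and the positive-total rule assumes credit notes have already been routed elsewhere.

```python
import csv
import io

COLUMNS = ["vendor", "invoice_no", "invoice_date", "currency", "total"]  # hypothetical layout


def to_import_csv(bills: list[dict]) -> str:
    """Validate every bill first, then return the CSV text; emit nothing if any row is bad."""
    for i, bill in enumerate(bills):
        missing = [c for c in COLUMNS if bill.get(c) in (None, "")]
        if missing:
            raise ValueError(f"row {i}: missing {missing}")
        if float(bill["total"]) <= 0:
            raise ValueError(f"row {i}: total must be positive")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(bills)
    return out.getvalue()


csv_text = to_import_csv([{
    "vendor": "Acme Supplies", "invoice_no": "INV-1042",
    "invoice_date": "2024-04-02", "currency": "INR", "total": 18400.0,
}])
print(csv_text)
```

Validating the whole batch before writing anything means finance never imports half a file and then has to work out which rows made it in.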

    from langchain_core.tools import StructuredTool

    # StructuredTool passes the validated schema fields to `func` as keyword arguments
    def _post_bill(**fields):
        return erp.create_bill(  # Tally / Zoho / QuickBooks
            vendor=fields["vendor"], total=fields["total"],
            line_items=fields["line_items"],
            attachment=fields.get("source_file"),  # needs a source_file field on the schema
        )

    post_bill = StructuredTool.from_function(
        func=_post_bill,
        name="post_bill_to_erp",
        description="Create a bill in the ERP after approval; returns bill id.",
        args_schema=Invoice,
    )

Production is the hard part. Block duplicates on vendor plus invoice number plus the content hash before anything posts. Keep an immutable, append-only audit log of every step, who or what did it, and when, so finance and auditors can trust it. Track extraction accuracy, auto-approval rate, and exception rate on a dashboard, and alert when accuracy slips. Watch for fraud signals such as a vendor's bank details changing between invoices. Feed corrections back so accuracy climbs over time. This is the difference between a weekend demo and something a finance team will sign off on.
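One common way to make an append-only log verifiable is to chain entries by hash, so that editing or deleting an earlier entry breaks every hash after it. The `AuditLog` class below is a hypothetical in-memory sketch of that idea; it makes tampering detectable rather than impossible, so a real deployment would still pair it with write-once storage.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    # Stable serialisation so the same entry always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the one before it."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def append(self, actor: str, action: str, invoice_no: str, at: str) -> dict:
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {"actor": actor, "action": action,
                "invoice_no": invoice_no, "at": at, "prev": prev}
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; False means an entry was altered or removed."""
        prev = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.append("agent", "auto_approve", "INV-1042", "2024-04-02T10:00:00Z")
log.append("j.doe", "approve_exception", "INV-1043", "2024-04-02T10:05:00Z")
print(log.verify())  # True
```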

    def before_posting(inv, run):
        if ledger.exists(inv.vendor, inv.invoice_no) or ledger.seen(inv.hash):
            return flag("duplicate")  # never pay the same bill twice
        if vendor_master.bank_changed(inv.vendor, inv.bank_account):
            return flag("bank_detail_change")  # classic fraud vector -> hold
        audit.append(run)  # append-only: who/what/when
        metrics.track(extraction_accuracy, auto_approve_rate, exception_rate)
        alert_if(extraction_accuracy < 0.95)

The human only sees exceptions. The console shows the invoice, the extracted fields, the match result, and one decision to make.
| Approach | Best for | Effort | Cost | Control |
|---|---|---|---|---|
| Custom agent (this guide) | You need control, your own rules, and tight ERP fit | High | Build cost + LLM usage | Full |
| IDP platform (Nanonets, Rossum) | Standard AP, fast start, less engineering | Low | Per-page or seat subscription | Medium |
| No-code (n8n, Make) | Low volume, a quick pilot, simple rules | Low | Cheap, but brittle at scale | Low |
AP touches money, so trust is the product. Run the pipeline in your own cloud or on-premises if SOC 2 or GDPR requires it, with the OCR and model self-hosted where needed. Enforce separation of duties so the system that approves is not the one that pays, and keep an immutable audit log of every action for auditors.
Two fraud vectors matter most. Duplicate payments: blocked by the vendor, invoice-number, and content-hash check before posting. Changed bank details: any account that differs from the vendor master is held for human confirmation, since redirecting payments to a new account is the classic invoice fraud.
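Both checks are plain lookups, which is what makes them easy to test. The sketch below is self-contained and uses in-memory sets and a dict where a real build would query the ledger and vendor master; the `fraud_flags` helper, the sample account numbers, and the shortened hashes are all hypothetical.

```python
def fraud_flags(invoice: dict, paid_keys: set, seen_hashes: set,
                bank_on_file: dict) -> list[str]:
    """Return the holds that apply; an empty list means neither check fired."""
    flags = []
    key = (invoice["vendor"], invoice["invoice_no"].strip().upper())
    if key in paid_keys or invoice["hash"] in seen_hashes:
        flags.append("duplicate")
    # A vendor with no account on file also fails this comparison and is held.
    if bank_on_file.get(invoice["vendor"]) != invoice["bank_account"]:
        flags.append("bank_detail_change")
    return flags


paid_keys = {("Acme Supplies", "INV-1042")}
seen_hashes = {"9f2c"}  # shortened stand-in for a SHA-256 digest
bank_on_file = {"Acme Supplies": "IN00-1111"}

resent = {"vendor": "Acme Supplies", "invoice_no": "inv-1042 ",
          "hash": "ab41", "bank_account": "IN00-1111"}
redirected = {"vendor": "Acme Supplies", "invoice_no": "INV-1050",
              "hash": "77de", "bank_account": "IN00-9999"}

print(fraud_flags(resent, paid_keys, seen_hashes, bank_on_file))      # ['duplicate']
print(fraud_flags(redirected, paid_keys, seen_hashes, bank_on_file))  # ['bank_detail_change']
```

The first example shows why the invoice-number key and the content hash are both needed: a re-sent invoice that was re-scanned has a new hash, and only the vendor plus invoice number catches it.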
Benchmarks vary by source and should be treated as ranges, not promises. These are from neutral industry bodies, not vendor marketing.
| Metric | Manual | Automated / agent |
|---|---|---|
| Cost per invoice | ~$15 to $40 (Ardent Partners) | Top-quartile teams run ~$10 or below (APQC) |
| Invoices per person / year | ~4,200 (IOFM average) | ~6,900 at top performers (IOFM) |
| Early-payment discounts captured | 20 to 30% (Ardent Partners) | Higher, because nothing is paid late by accident |
| Touchless rate | Low; most invoices are keyed by hand | Climbs as the auto-approve threshold proves out |
Sources: Ardent Partners AP Metrics That Matter; APQC AP benchmarking; IOFM. Figures are industry ranges.
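To see what these ranges could mean for one team, the arithmetic is short. The 2,000 invoices per month is a hypothetical volume chosen for illustration, and the automated figure treats the roughly $10 top-quartile number as a ceiling. The result ignores build cost and LLM usage, so read it as a rough bracket, not a forecast.

```python
def cost_range(invoices: int, per_invoice_low: float, per_invoice_high: float):
    """Return (low, high) monthly processing cost."""
    return invoices * per_invoice_low, invoices * per_invoice_high


MONTHLY_INVOICES = 2000  # hypothetical volume; replace with your own

manual = cost_range(MONTHLY_INVOICES, 15, 40)     # Ardent Partners range
automated = cost_range(MONTHLY_INVOICES, 10, 10)  # APQC ~$10, treated as a ceiling

# Smallest gap: cheapest manual vs automated. Largest gap: priciest manual vs automated.
savings = (manual[0] - automated[1], manual[1] - automated[0])
print(manual, automated, savings)
# (30000, 80000) (20000, 20000) (10000, 60000)
```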
Galific designs, builds, and runs agents like this, integrated with your ERP and your rules. Or explore the ready-made versions in our agent suite.