
How to write a good extraction schema in plain language

The schema is just a description of what you want pulled out of a document. You don't need to learn a syntax to write one — but a few habits make the difference between output you trust and output you have to babysit. Here's how to describe what you want so the engine gets it right the first time.

The Ztract team · 7 min read
Tags: tutorial, schema design
[Image: a hand writing notes in a notebook beside a laptop, describing in plain words what you want from a document.]

The first thing most people do in Ztract is create a project. The second thing, and the one that decides how good your results are, is to tell us what to pull out of your documents. That description is your schema.

Here’s the part that surprises people: you don’t write it in code. There’s no syntax to learn, no field types to declare, no template to draw on the page. You describe what you want the way you’d explain it to a new colleague: “For each invoice, get me the vendor, the total, and every line item.” The engine does the rest.

That freedom is the whole point — but it also means the quality of your output tracks the quality of your description. A vague request gets vague results. A precise one gets clean data you can trust on the first pass. This post is about the handful of habits that make your plain-language schema precise, whatever kind of document you’re working with.

What a good schema looks like

Before the rules, here’s the shape of a description that works well. Say you’re extracting invoices:

“For each invoice, extract the invoice number, the issue date, the vendor name, and the total amount due. Also extract each line item with its description, quantity, unit price, and line total. If a field isn’t present on the document, leave it blank rather than guessing.”

Notice what that does. It names specific fields, not “the important stuff.” It separates the things there’s one of per document (invoice number, total) from the things there are many of (line items). And it says what to do when something is missing. None of that is technical — it’s just precise. The five habits below are how you get there.

Five habits that make a schema precise

1. Be specific about which value you mean

Documents are full of numbers and dates that look similar. “The amount” on an invoice could be the subtotal, the tax, the total before discount, or the final amount due. If you write “extract the amount,” you’re leaving the choice to chance.

Name the exact one:

  • “the date” → ✅ “the invoice issue date (not the due date)”
  • “the amount” → ✅ “the total amount due, after tax and discounts”
  • “the name” → ✅ “the vendor’s company name (not the contact person)”

The more a document has lookalike fields, the more this matters.

2. Say what format you want it in

The engine reads what’s on the page, but you often want it normalized into something consistent — especially for dates, numbers, and currency. If you care about the format, ask for it:

  • “Format all dates as YYYY-MM-DD.”
  • “Express amounts as plain numbers without currency symbols or thousands separators.”
  • “Capture debits as negative numbers.”
  • “Include the currency code (USD, EUR, etc.) as a separate field.”

Without this, you get whatever the document shows — $1,250.00, 1.250,00, (1,250.00) — and you’ll be cleaning it up in the spreadsheet afterwards. One sentence up front saves that work.
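To make that cleanup work concrete, here is a minimal Python sketch of what you would otherwise write yourself to tidy the three amounts above. It is not part of Ztract; the function name `normalize_amount` and its "last separator is the decimal mark" rule are assumptions made for illustration.

```python
import re
from decimal import Decimal


def normalize_amount(raw: str) -> Decimal:
    """Turn a printed amount into a plain signed number (heuristic sketch)."""
    text = raw.strip()
    # Accounting style wraps negatives in parentheses; also accept a leading minus.
    negative = (text.startswith("(") and text.endswith(")")) or text.startswith("-")
    # Drop currency symbols, spaces, and brackets; keep digits and separators.
    digits = re.sub(r"[^\d.,]", "", text)
    # Assumption: whichever separator appears last is the decimal mark.
    if digits.rfind(",") > digits.rfind("."):
        digits = digits.replace(".", "").replace(",", ".")
    else:
        digits = digits.replace(",", "")
    value = Decimal(digits)
    return -value if negative else value


for printed in ["$1,250.00", "1.250,00", "(1,250.00)"]:
    print(printed, "->", normalize_amount(printed))
```

Even this small helper has to guess: given "1,250" with no decimals, it reads the comma as a decimal mark and returns 1.250. That kind of ambiguity is the reason to state the format in the schema rather than repair it afterwards.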

3. Separate “one per document” from “one per row”

This is the single thing that trips people up most, and it’s worth slowing down on. Some fields appear once per document — an invoice number, a bank statement period, an account holder. Others repeat — every line item, every transaction, every passenger on a ticket.

If you don’t distinguish them, you can end up with one value where you wanted a list, or a flattened mess where you wanted structure. The fix is to say it out loud:

“Extract the statement-level fields once: account holder, account number, opening balance, closing balance. Then extract each transaction as its own row, with date, description, and amount.”

The words “for each” are your friend. “For each line item…”, “for each transaction…” — they tell the engine to expect a list and give you back clean, repeating rows instead of a jumble.
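If it helps to picture the result, the bank-statement description above corresponds to a shape like the one in this Python sketch. The class names, field names, and sample values are hypothetical and are not Ztract's actual output format; the point is only that statement-level fields appear once while transactions form a list.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    # Row-level fields: one Transaction per line on the statement.
    date: str
    description: str
    amount: str


@dataclass
class Statement:
    # Statement-level fields: each appears once per document.
    account_holder: str
    account_number: str
    opening_balance: str
    closing_balance: str
    # The repeating part stays a list instead of being squeezed into one cell.
    transactions: list[Transaction] = field(default_factory=list)


def to_rows(statement: Statement) -> list[dict[str, str]]:
    """Flatten to spreadsheet rows, repeating the account number on each row."""
    return [
        {
            "account_number": statement.account_number,
            "date": t.date,
            "description": t.description,
            "amount": t.amount,
        }
        for t in statement.transactions
    ]


# Made-up sample values, for illustration only.
statement = Statement(
    account_holder="Example Holder",
    account_number="000-111",
    opening_balance="100.00",
    closing_balance="60.00",
    transactions=[
        Transaction("2024-01-03", "Coffee", "-15.00"),
        Transaction("2024-01-09", "Books", "-25.00"),
    ],
)
rows = to_rows(statement)
```

Keeping the two levels separate means you can still flatten to one row per transaction later, as `to_rows` does, without losing track of which values belong to the whole statement.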

4. Add a word of disambiguation for confusable fields

Some fields are genuinely ambiguous and no amount of reading the page resolves them — only your intent does. A customs document might carry both an invoice number and a purchase-order number, both a ship-to and a bill-to address, both a gross and a net weight.

When two fields could be confused, add a short clarifier:

  • “the invoice number (the seller’s, labeled ‘INV’ — not the PO number)”
  • “the ship-to address (where goods are delivered, not the billing address)”

You know which one you actually need. Saying so removes the guesswork.

5. Decide what happens when a field is missing

Real documents are inconsistent. One invoice has a PO number, the next doesn’t. If you don’t say what to do, you’re leaving the choice to the engine, and for extraction the safe default you almost always want is “don’t invent anything”:

“If a field isn’t present on the document, leave it blank. Never guess or fill in a placeholder.”

This one line is especially worth it for financial, legal, and medical documents, where a confidently-wrong value is far more dangerous than an empty cell you can see and follow up on.
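A small Python sketch shows why a blank is easier to work with than a placeholder. The record layout and the helper name `blank_fields` are assumptions for illustration, not Ztract's export format: a blank can be found mechanically, while a placeholder such as "N/A" passes for a real value.

```python
from typing import Optional


def blank_fields(record: dict[str, Optional[str]]) -> list[str]:
    """Return the names of fields that came back blank."""
    return [
        name
        for name, value in record.items()
        if value is None or value.strip() == ""
    ]


# Made-up records, for illustration only.
with_blank = {"invoice_number": "INV-1042", "po_number": None, "total_due": "1250.00"}
with_placeholder = {"invoice_number": "INV-1042", "po_number": "N/A", "total_due": "1250.00"}

print(blank_fields(with_blank))        # ['po_number']
print(blank_fields(with_placeholder))  # []
```

The second record hides the gap: nothing flags the PO number for follow-up, because the placeholder looks like data.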

Three ways to create a schema — and when to use each

Ztract gives you three starting points. The habits above apply to all of them; the question is just where you begin.

  • Start from a ready-made schema. For common documents — invoices, receipts, bank statements, IDs, resumes, contracts, lab reports, customs paperwork — there’s a template that already knows the usual fields. Best when your document is a standard type and you want to start fast, then tweak.
  • Describe the fields yourself. Write the plain-language description from scratch. Best when your document is unusual, or when you want exactly these fields and nothing else. This is where the five habits earn their keep.
  • Infer from a sample. Drop in one representative document and let the engine propose a schema from what it sees. Best when you don’t yet know what fields a document contains. Review the proposal, then refine it in plain language.

Most people end up combining them: start from a template or a sample, then sharpen the description by hand using the habits above.

For more information, check out our “Designing your schema” documentation page.

A quick troubleshooting table

When the output isn’t what you expected, the cause is usually in the description. The common ones:

What you see | Likely cause | The fix
Wrong number pulled for “amount” | Field wasn’t specific | Name the exact value: “total due after tax”
Dates in mixed formats | No format requested | Add “format all dates as YYYY-MM-DD”
One value where you wanted a list | Repeating field not marked | Use “for each … extract …”
List flattened into one cell | Same as above | Same fix; name the per-row fields explicitly
A field invented out of nowhere | No missing-field rule | Add “leave blank if not present, never guess”
Two similar fields swapped | No disambiguation | Add a clarifier: “the PO number, not the invoice number”

The schema is only half the loop

Even a well-written schema benefits from a second look, and Ztract is built around that. Every extracted value is anchored to its position on the source document — click a value and you see exactly where it came from. You scan for the ones that look off, fix them in one click, and you’re done. Corrections don’t cost anything; only extraction counts against your pages, not the editing afterward.

So the goal of a good schema isn’t perfection on the first try — it’s getting close enough that the review step is a quick scan rather than a re-do. The five habits above are what get you there.

Try it on a real document

The fastest way to get a feel for this is to write a description for a document you actually work with and see what comes back. New accounts get 30 free pages, no credit card — plenty to draft a schema, refine it, and watch the output tighten up as you do.

And if a document type or a schema you tried to describe tripped you up — if you couldn’t find the words to get the result you wanted — that’s exactly the feedback we’re after. Tell us; making schema design feel obvious to people who aren’t engineers is the part of the product we care most about getting right.
