Data Schema
What is a Data Schema?
An Extraction Data Schema is a structured definition that tells the system which data points to extract from a given type of document. It guides the extraction process and ensures the extracted data is returned in a consistent format that is easy for the user to access. A Data Schema is composed of one or more Fields (or data points), each with a type and an example.
Field Types
There are six different field types:
string
A sequence of characters representing textual data.
classification
A predefined list of categories or types.
date
A specific year, month, and day, formatted as YYYY-MM-DD.
number
Numeric data which could be an integer or a floating-point value.
boolean
Represents two possible values: true or false.
nested_object
A complex data type that contains multiple sub-fields or properties. Only one level of nesting is allowed.
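The six field types and the one-level nesting rule can be sketched in code. This is a minimal illustration, not the product's actual data model: the `FieldType` and `Field` classes and their attribute names (`categories`, `sub_fields`) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, List


class FieldType(str, Enum):
    """The six field types listed above."""
    STRING = "string"
    CLASSIFICATION = "classification"
    DATE = "date"
    NUMBER = "number"
    BOOLEAN = "boolean"
    NESTED_OBJECT = "nested_object"


@dataclass
class Field:
    """Hypothetical in-memory model of one schema field."""
    name: str
    type: FieldType
    example: Any = None
    # Allowed categories, meaningful for classification fields only.
    categories: List[str] = field(default_factory=list)
    # Sub-fields, meaningful for nested_object fields only.
    sub_fields: List["Field"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.sub_fields and self.type is not FieldType.NESTED_OBJECT:
            raise ValueError("only nested_object fields may have sub-fields")
        # Enforce the single level of nesting described above.
        for sub in self.sub_fields:
            if sub.type is FieldType.NESTED_OBJECT:
                raise ValueError("only one level of nesting is allowed")
```

With this sketch, a nested object whose sub-field is itself a nested object is rejected at construction time, which mirrors the one-level limit.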
Example of a Basic Invoice Extraction Data Schema
| Field | Type | Example |
| --- | --- | --- |
| Supplier Name | string | Acme Supplies Ltd. |
| Supplier Company Registration | object | (see sub-fields below) |
| Supplier Company Registration.Number | string | CRN-20250123 |
| Supplier Company Registration.Type | classification | VAT NUMBER |
| Date | date | 2025-06-10 |
| Total Amount | number | 1540.75 |
| Taxes | object | (see sub-fields below) |
| Taxes.Rate | number | 0.185 |
| Taxes.Base | number | 1300.00 |
| Taxes.Amount | number | 240.75 |
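To make the structure of this example concrete, here is the same invoice schema written as plain Python data, together with a helper that assembles the example values into the shape an extracted record might take. The dictionary keys (`title`, `type`, `example`, `fields`) and the `example_record` helper are hypothetical and chosen for illustration; they are not the product's storage or export format.

```python
import json

# Hypothetical representation of the invoice schema shown above.
invoice_schema = [
    {"title": "Supplier Name", "type": "string", "example": "Acme Supplies Ltd."},
    {"title": "Supplier Company Registration", "type": "object", "fields": [
        {"title": "Number", "type": "string", "example": "CRN-20250123"},
        {"title": "Type", "type": "classification", "example": "VAT NUMBER"},
    ]},
    {"title": "Date", "type": "date", "example": "2025-06-10"},
    {"title": "Total Amount", "type": "number", "example": 1540.75},
    {"title": "Taxes", "type": "object", "fields": [
        {"title": "Rate", "type": "number", "example": 0.185},
        {"title": "Base", "type": "number", "example": 1300.00},
        {"title": "Amount", "type": "number", "example": 240.75},
    ]},
]


def example_record(fields):
    """Build a sample record from the example values, recursing into sub-fields."""
    record = {}
    for f in fields:
        if "fields" in f:
            record[f["title"]] = example_record(f["fields"])
        else:
            record[f["title"]] = f["example"]
    return record


print(json.dumps(example_record(invoice_schema), indent=2))
```

The two object fields each contain only simple sub-fields, so the schema stays within the single level of nesting.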
How to Build a Top-Performing Data Schema?
To get the best possible extraction results from a model, you can of course use the Continuous Learning (RAG) feature, but the very first step is to ensure the Data Schema you're using is optimized.
If the foundation is solid, the house is solid
Field Name and Title
The field name is automatically generated from the field title.
While the title is purely for humans, the name has an influence on how the AI system performs the extraction operation.
Try to use simple names that will precisely describe the field you want to extract. The goal is to avoid any possible confusion between data points present in the document.
Suppose you want to extract the name of the company that issued an invoice.
In our model, we've used the field name supplier_name. It tells the AI to extract only the name, and specifically the name of the invoice supplier.
✅ vendor_name also works: it means the same thing with the same level of precision.
⚠️ supplier might work but is imprecise: which information about the supplier do you need?
⚠️ company_name might work but is imprecise: it asks for the name of a company, but does "company" mean the supplier or the customer?
❌ company will likely not work as expected: it specifies neither which information you need nor which company is concerned.
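Because the field name is derived from the title, the title you choose determines the name the AI sees. The exact generation rule is not documented here, so the sketch below assumes a simple snake_case conversion; `title_to_name` is a hypothetical helper, not the product's function.

```python
import re


def title_to_name(title: str) -> str:
    """Assumed rule: lowercase the title and join its words with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


print(title_to_name("Supplier Name"))  # supplier_name
print(title_to_name("Company"))        # company
```

Under this assumption, a precise title such as "Supplier Name" yields a precise name, while a vague title such as "Company" yields the vague name shown in the last case above.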
Field Type
Choose the Field Type that best suits the field you need.
For due_date, you could use a string, but a date field is definitely a better solution.
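One reason a date field is preferable: a value in YYYY-MM-DD form can be parsed, validated, and compared, whereas a free-form string cannot. The sketch below shows this on the consumer side using only the Python standard library; `parse_date_field` is a hypothetical helper, and how the platform itself validates dates is not covered here.

```python
from datetime import date


def parse_date_field(raw: str) -> date:
    """Parse a YYYY-MM-DD value; raises ValueError if the text is not a valid date."""
    return date.fromisoformat(raw)


due_date = parse_date_field("2025-06-10")
print(due_date.year, due_date.month, due_date.day)

# Typed dates compare chronologically, which free-form strings do not guarantee.
print(due_date < date(2025, 7, 1))

# A free-form string that a string field would accept is rejected here.
try:
    parse_date_field("10 June 2025")
except ValueError:
    print("not a YYYY-MM-DD date")
```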
Extraction Guidelines
Sometimes changing the field name and type is not enough to explain what you need for a field. In that case, you can add Extraction Guidelines to the field. Use natural language to explain how to properly extract the data.
For instance, for supplier_phone_number, adding the following extraction guidelines could be useful:
If you find several phone numbers in the document, I need the phone number of the supplier headquarters. Also, I want you to reformat the data to match the international phone number format, as follows: +1-212-456-7890
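A field carrying guidelines can be pictured as the field definition plus a free-text instruction. In the sketch below, the `guidelines` key and the dictionary layout are assumptions for illustration. The regular expression is a deliberately simple check that only matches the shape of the example number above (country code followed by three hyphen-separated digit groups); it is not a general phone number validator.

```python
import re

# Hypothetical field definition; the "guidelines" key is an assumed name.
supplier_phone_number = {
    "name": "supplier_phone_number",
    "type": "string",
    "guidelines": (
        "If you find several phone numbers in the document, I need the phone "
        "number of the supplier headquarters. Also, I want you to reformat the "
        "data to match the international phone number format, as follows: "
        "+1-212-456-7890"
    ),
}

# Matches values shaped like the example, e.g. +1-212-456-7890.
PHONE_SHAPE = re.compile(r"^\+\d{1,3}(-\d{2,4}){3}$")


def follows_guideline_format(value: str) -> bool:
    """Return True if the extracted value has the shape requested in the guidelines."""
    return PHONE_SHAPE.fullmatch(value) is not None


print(follows_guideline_format("+1-212-456-7890"))
print(follows_guideline_format("(212) 456-7890"))
```

A check like this, run on extracted values, is one way to confirm that the reformatting instruction in the guidelines was followed.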