Data Schema
Overview
An Extraction Data Schema is a structured definition that tells the system which data points to extract from a given document.
A Data Schema guides the processing the system applies and ensures the extracted data is returned in a consistent format that is easy for the user to access. It is composed of Fields (or data points), each with a type and an example.
Field Types
Base Types
Text
A sequence of characters representing textual data.
Number
Numeric data which could be an integer or a floating-point value.
Date
A specific year, month, and day, formatted as a YYYY-MM-DD date or YYYY-MM-DD HH:mm:ss for a date-time.
Classification
A predefined list of categories or types.
Boolean
Represents one of two possible values: true or false.
Nested Object
A complex data type containing multiple subfields or properties, allowing one level of nesting.
Object Detection
Detect the location of a document feature, such as a logo, signature, photo, etc.
Barcode
Detect the location of a 1D barcode (e.g. UPC, EAN) or a 2D barcode (e.g. QR Code, Data Matrix).
Additionally, attempt to decode the contents of the barcode as a string value.
Array Types
Any field can be made into an array, that is, a list of values.
Simply enable "Multiple items can be extracted" when creating or modifying the field.
The return type will be an array of the base type, for example a list of text values.
It is possible to have a list of nested objects, but not a list of lists.
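The field types and nesting rules above can be modeled in a few lines. The following is a minimal sketch, not the platform's actual data model: the `FieldType`, `SchemaField`, and `check_nesting` names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class FieldType(Enum):
    """The base types listed above (identifiers are illustrative)."""
    TEXT = "text"
    NUMBER = "number"
    DATE = "date"
    CLASSIFICATION = "classification"
    BOOLEAN = "boolean"
    NESTED_OBJECT = "nested_object"
    OBJECT_DETECTION = "object_detection"
    BARCODE = "barcode"


@dataclass
class SchemaField:
    name: str
    type: FieldType
    # Mirrors the "Multiple items can be extracted" option. A single boolean
    # cannot express a list of lists, which matches the rule above.
    multiple: bool = False
    subfields: List["SchemaField"] = field(default_factory=list)


def check_nesting(f: SchemaField, inside_object: bool = False) -> None:
    """Raise ValueError if a field breaks the one-level nesting rule."""
    if inside_object and f.type is FieldType.NESTED_OBJECT:
        raise ValueError(f"{f.name}: only one level of nesting is allowed")
    if f.subfields and f.type is not FieldType.NESTED_OBJECT:
        raise ValueError(f"{f.name}: only nested objects can have subfields")
    for sub in f.subfields:
        check_nesting(sub, inside_object=True)


# A list of nested objects is allowed.
taxes = SchemaField(
    "taxes",
    FieldType.NESTED_OBJECT,
    multiple=True,
    subfields=[SchemaField("rate", FieldType.NUMBER)],
)
check_nesting(taxes)  # passes silently
```

Keeping "array" as a flag on the field, rather than as a separate type, reflects how the option is described above: any base type can be turned into a list of that type.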
Example
An example of the field types for a basic invoice extraction Data Schema:
| Field | Type | Example |
| --- | --- | --- |
| Supplier Name | Text | Acme Supplies Ltd. |
| Supplier Logo | Object Detection | Polygon around the logo |
| Supplier Company Registration | Nested Object | See sub-fields below |
| Supplier Company Registration.Number | Text | CRN-20250123 |
| Supplier Company Registration.Type | Classification | VAT NUMBER |
| Date | Date | 2025-06-10 |
| Total Amount | Number | 1540.75 |
| Taxes | Nested Object (array) | See sub-fields below |
| Taxes[0].Rate | Number | 0.185 |
| Taxes[0].Base | Number | 1300.00 |
| Taxes[0].Amount | Number | 240.75 |
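To make the example more concrete, here is a rough sketch of how the extracted values might look once loaded into Python. The key names and the dictionary layout are assumptions made for illustration, not the platform's documented response format. The logo polygon is left out because its coordinate format is not described here.

```python
import math
from datetime import date

# Hypothetical result shape: keys are illustrative snake_case versions of
# the field titles in the example above.
invoice = {
    "supplier_name": "Acme Supplies Ltd.",
    "supplier_company_registration": {
        "number": "CRN-20250123",
        "type": "VAT NUMBER",
    },
    "date": "2025-06-10",  # YYYY-MM-DD, as described for the Date type
    "total_amount": 1540.75,
    "taxes": [  # an array of nested objects
        {"rate": 0.185, "base": 1300.00, "amount": 240.75},
    ],
}


def summarize(result: dict) -> dict:
    """Parse the date and compare the tax lines against the total."""
    invoice_date = date.fromisoformat(result["date"])
    base_total = sum(t["base"] for t in result["taxes"])
    tax_total = sum(t["amount"] for t in result["taxes"])
    return {
        "year": invoice_date.year,
        "tax_total": tax_total,
        "adds_up": math.isclose(base_total + tax_total, result["total_amount"]),
    }


summary = summarize(invoice)
```

Using typed fields (Date, Number) rather than plain text is what makes this kind of downstream check straightforward.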
Performance Optimization
To get the best possible extraction results from a model, you can use the Continuous Learning (RAG) feature, but the very first step is to make sure the Data Schema you're using is optimized.
If the foundation is solid, the house is solid.
Each property of a field has a role to play in getting the best possible performance.
Field Name and Title
The field Name is automatically generated from the field Title. You can, however, modify the Title afterwards.
Both the Name and Title are used during processing (inference), but the Name is more important.
Try to use clear, simple names that will precisely describe the field you want to extract. The goal is to avoid any possible confusion between data points present in the document.
Suppose you want to extract the name of the company that issued an invoice.
In our model, we've used the field Name: supplier_name
It clearly tells the AI to extract only the name, and specifically the name of the invoice supplier.
✅ vendor_name also works: it means the same thing with the same level of precision.
⚠️ supplier might work but is imprecise: which information about the supplier do you need?
⚠️ company_name might work but is imprecise: it is clear that you need the name of a company, but not whether that company is the supplier or the customer.
❌ company will likely not work as expected: it specifies neither which information you need nor which company is concerned.
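Since the Name is derived from the Title, it can help to see what such a derivation might look like. The platform generates the Name for you; the sketch below is only an illustration under assumed rules (lowercase, accents stripped, other characters collapsed to underscores) and is not the platform's actual algorithm. The `title_to_name` function is hypothetical.

```python
import re
import unicodedata


def title_to_name(title: str) -> str:
    """Derive a lowercase, ASCII, underscore-separated name from a title."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of characters other than a-z and 0-9 with "_".
    name = re.sub(r"[^a-z0-9]+", "_", ascii_title.lower())
    return name.strip("_")


print(title_to_name("Supplier Name"))  # supplier_name
```

A precise Title such as "Supplier Name" therefore tends to yield a precise Name, which is why it is worth choosing the Title carefully.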
Field Type
Use the Field Type that best suits the field you need.
For example, while you could use a text field for due_date, a date field type is a better fit.
Field Description
The field's Description also has an impact on the model's performance.
Use it to describe what the field represents and/or what you use it for.
For example, the supplier_name field could have:
The name of the supplier.
Used in internal processing to match our supplier ID with the name found on the document.
Field Extraction Guidelines
Sometimes the field name and type are not enough to explain what you need for a field. In that case, you can add extraction Guidelines to the field.
Use natural language to explain how to properly extract the data, and/or any extra steps such as formatting.
For instance, with supplier_phone_number, adding the following extraction guidelines could be useful:
If you find several phone numbers in the document, use the phone number of the supplier headquarters.
Always reformat the data to match the international phone number format, as follows: +1-212-456-7890
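Guidelines are instructions to the model, so you may still want to check on your side that a returned value follows the requested format. The following is a minimal sketch of such a check. The pattern is an assumption generalized from the single example +1-212-456-7890 (a 1 to 3 digit country code followed by groups of 3, 3, and 4 digits), and `matches_guideline_format` is a hypothetical helper.

```python
import re

# Assumed shape: "+<country code>-ddd-ddd-dddd", inferred from the example
# "+1-212-456-7890" in the guidelines above.
PHONE_PATTERN = re.compile(r"\+\d{1,3}-\d{3}-\d{3}-\d{4}")


def matches_guideline_format(value: str) -> bool:
    """Return True if the whole string follows the assumed phone format."""
    return PHONE_PATTERN.fullmatch(value) is not None


print(matches_guideline_format("+1-212-456-7890"))  # True
print(matches_guideline_format("(212) 456-7890"))   # False
```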
Relative Importance of Field Properties
Not all field properties have the same importance or weight when it comes to how the models process files.
Additionally, not all types of fields are handled the same way.
In the following table, "Normal Fields" are those that extract textual information from the document (text, dates, numbers, etc), whether they are simple fields, lists, or nested object fields.
"Object Detection" refers to specific processing to extract polygons of various elements on the document, such as signatures, ID photos, etc.
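| Property | Normal Fields | Object Detection |
| --- | --- | --- |
| Name | Most important | Not used |
| Title | Important | Most important |
| Description | Complementary | Not used |
| Guidelines | Complementary | Not used |
| Classification Values | Very important (only for classification fields) | Not used |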
Language
You can specify a field's Title, Name, Description, and Guidelines in most languages. Note that the Name can only contain ASCII characters.
This includes, but is not limited to:
European languages: English, French, Spanish, German, Italian, Portuguese, Russian, Greek, etc
Asian languages: Hindi, Bengali, Turkish, Urdu, Farsi, Armenian, etc
East Asian languages: Japanese, Mandarin, Korean, Vietnamese, etc
Semitic languages: Arabic, Hebrew, Amharic, etc
African languages: Swahili, Yoruba, Zulu, etc
Note: while the models can understand these languages, we are not able to provide in-depth support for all of them.
Technical Limitations
Number of Fields in the Data Schema
For your data schema, the recommended maximum number of fields is 25.
Beyond this limit, performance will be drastically reduced.
Names of Fields
The field Name must only contain:
lowercase Latin letters without accents (a-z)
numbers (0-9)
underscores (_)
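The limits above are easy to check before creating a schema. The following is a minimal sketch: `check_schema` is a hypothetical helper, and it counts only the names passed to it, since whether sub-fields of nested objects count toward the recommended maximum is not stated here.

```python
import re
from typing import List

MAX_RECOMMENDED_FIELDS = 25
# Lowercase unaccented Latin letters, digits, and underscores only.
NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def check_schema(names: List[str]) -> List[str]:
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    if len(names) > MAX_RECOMMENDED_FIELDS:
        problems.append(
            f"{len(names)} fields exceeds the recommended maximum "
            f"of {MAX_RECOMMENDED_FIELDS}"
        )
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems.append(f"invalid field name: {name!r}")
    return problems


print(check_schema(["supplier_name", "total_amount"]))  # []
print(check_schema(["Supplier Name"]))  # ["invalid field name: 'Supplier Name'"]
```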

