Data Schema Overview
General description of an extraction model's Data Schema.
An Extraction Data Schema defines which data should be extracted, and in what way, from documents sent to the model.
You can think of the Data Schema as a map to the documents you send to the model for extraction.
This map specifies which data to extract, how to format it, pitfalls to avoid, and so on. To do this, the Data Schema is composed of Fields (or data points), each with its own configuration.
In this way the Data Schema is the foundation of your model. All other features will use and depend on the Data Schema.
Use the Live Test when working on your Data Schema to quickly validate changes.
Fields Overview
A Data Schema is primarily composed of fields.
A field describes a single data point to extract in the document.
Each field has the following properties:
Title - human-readable
Name - machine-readable, used as the field's key in the API response
Type - the type of data to extract, more details in Field Types
Description (optional) - provides extra context on how the field is used
Guidelines (optional) - provides instructions to better extract the field
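As a rough illustration, the five properties above can be pictured as a small record. This is a sketch only: the `FieldDefinition` class, the lowercase type label, and the sample guideline text are hypothetical and are not part of the product's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldDefinition:
    """Hypothetical in-memory model of one Data Schema field."""

    title: str                         # human-readable label
    name: str                          # machine-readable key used in the API response
    type: str                          # illustrative label such as "text" or "date"
    description: Optional[str] = None  # optional extra context on how the field is used
    guidelines: Optional[str] = None   # optional instructions to improve extraction


# Example field: the two optional properties may simply be left out.
invoice_date = FieldDefinition(
    title="Invoice Date",
    name="invoice_date",
    type="date",
    guidelines="Use the issue date, not the due date.",
)
```

Only Title, Name, and Type are required here, mirroring the list above.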
Field Types
A field's type determines how it will be formatted when returned by the API.
The type also gives an indication of what to look for in the input file.
Base Types
Text
A sequence of characters representing textual data.
Number
Numeric data which could be an integer or a floating-point value.
Date
A specific year, month, and day, formatted as YYYY-MM-DD for a date, or as YYYY-MM-DD HH:mm:ss for a date-time.
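On the client side, both documented formats can be read with Python's standard library. The helper below is a minimal sketch; the function name `parse_date_value` is hypothetical, and it assumes the value arrives as a plain string in one of the two formats described above.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"               # YYYY-MM-DD
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD HH:mm:ss


def parse_date_value(value: str) -> datetime:
    """Parse a returned date string, trying the date-time format first."""
    for fmt in (DATETIME_FORMAT, DATE_FORMAT):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError(f"Unrecognized date format: {value!r}")
```

A date-only value is returned as a `datetime` at midnight; call `.date()` on the result if only the calendar date is needed.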
Classification
A defined list of categories or types to match.
Boolean
Represents two possible values: true or false.
Hint: boolean field names should start with "is" or "has".
Nested Object
A complex data type containing multiple subfields or properties, allowing one level of nesting.
Object Detection
Detect the location of a document feature, such as a logo, signature, photo, etc.
Barcode
Detect the location of a 1D barcode (e.g. UPC, EAN) or a 2D barcode (e.g. QR Code, Data Matrix, 2D-DOC, PDF417).
Additionally, attempt to decode the contents of the barcode as a string value.
Array Types
Any field type can be made into an array, a list of values.
Simply enable "Multiple items can be extracted" when creating or modifying the field.
The return type will be an array of the base type, for example a list of text values, or a list of numbers.
It is possible to have a list of nested objects, but not a list of lists.
In some cases, there can be duplicate items, for example when the same value appears on several pages.
Enable "Filter out duplicates from the list of items" to fix this.
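To picture the effect of filtering duplicates, the sketch below removes repeated values from a list while keeping the order in which they first appear. It is a client-side illustration only: the function name is hypothetical, and the platform's own matching rules (for example, any normalization before comparison) are not described in this page and may differ.

```python
from typing import Hashable, Iterable, List, TypeVar

T = TypeVar("T", bound=Hashable)


def filter_duplicates(items: Iterable[T]) -> List[T]:
    """Return items with exact repeats removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# The same reference number found on two pages is kept only once.
references = filter_duplicates(["A-1", "B-2", "A-1"])
```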
Field Examples
Examples of the best field types to use in a basic invoice extraction Data Schema.
Field | Type | Example value
Supplier Name | String | Acme Supplies Ltd.
Supplier Logo | Object Detection | Polygon around the logo
Supplier Company Registration | Nested Object | See sub-fields below
Supplier Company Registration.Number | String | CRN-20250123
Supplier Company Registration.Type | Classification | VAT NUMBER
Invoice Date | Date | 2025-06-10
Is Past Due | Boolean | false
Total Amount | Number | 1540.75
Taxes | Nested Objects Array | See sub-fields below
Taxes[0].Rate | Number | 0.185
Taxes[0].Base | Number | 1300.00
Taxes[0].Amount | Number | 240.75
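The sketch below shows how such values might be read once extracted, with each field's Name used as the key. The dictionary shape and the snake_case names are assumptions made for illustration and do not reproduce the actual API response format. The Supplier Logo field is left out because the structure of a polygon value is not described here.

```python
# Hypothetical extraction result, keyed by field Name (assumed shape).
result = {
    "supplier_name": "Acme Supplies Ltd.",
    "supplier_company_registration": {
        "number": "CRN-20250123",
        "type": "VAT NUMBER",
    },
    "invoice_date": "2025-06-10",
    "is_past_due": False,
    "total_amount": 1540.75,
    "taxes": [
        {"rate": 0.185, "base": 1300.00, "amount": 240.75},
    ],
}

# Nested object: read a subfield through its parent's key.
registration_type = result["supplier_company_registration"]["type"]

# Array of nested objects: iterate over the items and read a subfield of each.
tax_total = sum(tax["amount"] for tax in result["taxes"])
```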
Global Guidelines
In addition to individual field guidelines, a global guideline can be used in your Data Schema.
The global guideline text will apply to all or some fields, depending on your instructions.
Use global guidelines when you want to:
Generalize instructions to several specific fields. For example:
"Number fields related to amounts should always have 3 decimal places."
"Country fields should return the ISO alpha-3 code of the country."
Provide general instructions or context for all fields. For example:
"Ensure ASCII compliance by removing all diacritics from return values."
You may put any number of unrelated guidelines in the text, for example all of the samples above.
For best results, separate each different guideline with a new line.
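If the guideline text is assembled programmatically, one line per guideline can be produced as follows. This is a minimal sketch; the variable names are hypothetical and the sample sentences are the ones quoted above.

```python
guidelines = [
    "Number fields related to amounts should always have 3 decimal places.",
    "Country fields should return the ISO alpha-3 code of the country.",
    "Ensure ASCII compliance by removing all diacritics from return values.",
]

# One guideline per line, as recommended for best results.
global_guidelines = "\n".join(guidelines)
```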
Language
You can specify a field's Title, Name, Description, and Guidelines in most languages.
The same applies to the Data Schema's Global Guidelines.
Supported languages include, but are not limited to:
European languages: English, French, Spanish, German, Italian, Portuguese, Russian, Greek, etc.
Asian languages: Hindi, Bengali, Turkish, Urdu, Farsi, Armenian, etc.
East Asian languages: Japanese, Mandarin, Korean, Vietnamese, etc.
Semitic languages: Arabic, Hebrew, Amharic, etc.
African languages: Swahili, Yoruba, Zulu, etc.
Note: while the models can understand these languages, we are not able to provide in-depth support for all languages.
Technical Limitations
Number of Fields in the Data Schema
The recommended maximum is 25 fields per Data Schema.
Exceeding this number does not cause errors, but response times will increase.
Names of Fields
The field name must only contain:
lowercase Latin letters without accents (a-z)
numbers (0-9)
underscores (_), but neither the first nor the last character can be an underscore.
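These naming rules can be checked before a field is created. The validator below is a sketch under the rules as stated: the function name is hypothetical, and it assumes a name may start with a digit, since the rules only exclude a leading or trailing underscore.

```python
import re

# First and last characters: a-z or 0-9. Middle characters may also be "_".
FIELD_NAME_PATTERN = re.compile(r"[a-z0-9](?:[a-z0-9_]*[a-z0-9])?")


def is_valid_field_name(name: str) -> bool:
    """Return True if the name satisfies the field naming rules above."""
    # fullmatch requires the whole string to match, so no anchors are needed.
    return FIELD_NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_field_name("total_amount")` returns `True`, while `"Total_Amount"`, `"_total"`, and `"montant_payé"` are all rejected.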
Next Steps
Now that you're familiar with the different components of the Data Schema, you'll want to take a look at Data Schema Best Practices for tips on building an accurate and well-performing Data Schema.