Data Schema Overview

General description of an extraction model's Data Schema.

An Extraction Data Schema defines which data should be extracted, and in what way, from documents sent to the model.

You can think of the Data Schema as a map to the documents you send to the model for extraction.

This map specifies which data to extract, how to format it, and which pitfalls to avoid. To do this, the Data Schema is composed of fields (or data points), each with its own configuration.

The Data Schema is therefore the foundation of your model: all other features use and depend on it.

Use the Live Test when working on your Data Schema to quickly validate changes.

Fields Overview

A Data Schema is primarily composed of fields.

A field describes a single data point to extract in the document.

Each field has the following properties:

  • Title - human-readable

  • Name - machine-readable, used as the field's key in the API return

  • Type - the type of data to extract, more details in Field Types

  • Description (optional) - provides extra context on how the field is used

  • Guidelines (optional) - provides instructions to better extract the field
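
As an illustration, the properties above could be modelled as follows. This is a minimal sketch: the `Field` class, its attribute names, and the sample guideline text are hypothetical and not part of any official client library. Only the five properties themselves come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    """Hypothetical in-memory representation of one Data Schema field."""

    title: str                         # human-readable label
    name: str                          # machine-readable key used in the API return
    type: str                          # one of the field types, e.g. "Text" or "Date"
    description: Optional[str] = None  # optional context on how the field is used
    guidelines: Optional[str] = None   # optional extraction instructions


# Example field; the guideline wording is invented for illustration.
invoice_date = Field(
    title="Invoice Date",
    name="invoice_date",
    type="Date",
    guidelines="Use the issue date, not the due date.",
)
```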

Field Types

A field's type determines how it will be formatted when returned by the API.

The type also gives an indication of what to look for in the input file.

Base Types

| Field Type | Description |
| --- | --- |
| Text | A sequence of characters representing textual data. |
| Number | Numeric data, either an integer or a floating-point value. |
| Date | A specific year, month, and day, formatted as YYYY-MM-DD, or as YYYY-MM-DD HH:mm:ss for a date-time. |
| Classification | A defined list of categories or types to match. |
| Boolean | Represents two possible values: true or false. Hint: boolean field names should start with "is" or "has". |
| Nested Object | A complex data type containing multiple subfields or properties, allowing one level of nesting. |
| Object Detection | Detects the location of a document feature, such as a logo, signature, or photo. |
| Barcode | Detects the location of a 1D barcode (e.g. UPC, EAN) or a 2D barcode (e.g. QR Code, Data Matrix, 2D-DOC, PDF417). Additionally, attempts to decode the contents of the barcode as a string value. |
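
Because Date values are formatted as YYYY-MM-DD or YYYY-MM-DD HH:mm:ss, client code may want to convert them to native date objects. The helper below is a sketch using only the Python standard library; the function name `parse_date_value` is hypothetical, and it assumes the value arrives as a plain string in one of those two formats.

```python
from datetime import datetime


def parse_date_value(value: str) -> datetime:
    """Parse a Date field value in either documented format."""
    # Try the date-time format first, then fall back to the date-only format.
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unsupported date format: {value!r}")
```

A date-only value is returned as a `datetime` at midnight; call `.date()` on the result if only the calendar date is needed.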

Array Types

Any field type can be made into an array, that is, a list of values.

To do so, enable "Multiple items can be extracted" when creating or modifying the field.

The return type will be an array of the base type, for example a list of text values, or a list of numbers.

It is possible to have a list of nested objects, but not a list of lists.

In some cases, there can be duplicate items, for example when the same value appears on several pages.

Enable "Filter out duplicates from the list of items" to fix this.
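
The effect of filtering duplicates can be pictured with the sketch below. It is only a client-side illustration that assumes exact-match, order-preserving removal; how the platform itself compares items is not specified here, and the function name and sample values are hypothetical.

```python
def drop_duplicates(items: list) -> list:
    """Return items with exact duplicates removed, keeping first occurrences."""
    unique = []
    for item in items:
        # An equality check also works for dicts, i.e. lists of nested objects.
        if item not in unique:
            unique.append(item)
    return unique


# The same value found on two pages, plus one distinct value.
page_values = ["INV-001", "INV-001", "INV-002"]
print(drop_duplicates(page_values))  # ['INV-001', 'INV-002']
```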

Field Examples

The following examples show the best field types to use in a basic invoice extraction Data Schema.

| Field Name | Field Type | Example Return Value |
| --- | --- | --- |
| Supplier Name | Text | Acme Supplies Ltd. |
| Supplier Logo | Object Detection | Polygon around the logo |
| Supplier Company Registration | Nested Object | See sub-fields below |
| Supplier Company Registration.Number | Text | CRN-20250123 |
| Supplier Company Registration.Type | Classification | VAT NUMBER |
| Invoice Date | Date | 2025-06-10 |
| Is Past Due | Boolean | false |
| Total Amount | Number | 1540.75 |
| Taxes | Array of Nested Objects | See sub-fields below |
| Taxes[0].Rate | Number | 0.185 |
| Taxes[0].Base | Number | 1300.00 |
| Taxes[0].Amount | Number | 240.75 |
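
Since each field's Name is used as its key in the API return, the example values above might map to a structure like the one sketched below. The key names (`supplier_name`, `number`, `rate`, and so on) and the flat dictionary layout are assumptions made for illustration; the actual response may wrap these values in additional metadata. The Object Detection field is left out because its polygon format is not described here.

```python
# Hypothetical extracted values, keyed by field Name.
invoice = {
    "supplier_name": "Acme Supplies Ltd.",
    "supplier_company_registration": {  # Nested Object
        "number": "CRN-20250123",
        "type": "VAT NUMBER",
    },
    "invoice_date": "2025-06-10",
    "is_past_due": False,
    "total_amount": 1540.75,
    "taxes": [  # array of nested objects
        {"rate": 0.185, "base": 1300.00, "amount": 240.75},
    ],
}

# Nested objects are read by key, arrays by index.
registration_type = invoice["supplier_company_registration"]["type"]
first_tax_amount = invoice["taxes"][0]["amount"]
```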

Global Guidelines

In addition to individual field guidelines, a global guideline can be used in your Data Schema.

The global guideline text will apply to all or some fields, depending on your instructions.

Use global guidelines when you want to:

  • Generalize instructions to several specific fields. For example:

    • "Number fields related to amounts should always have 3 decimal places."

    • "Country fields should return the ISO alpha-3 code of the country."

  • Provide general instructions or context for all fields. For example:

    • "Ensure ASCII compliance by removing all diacritics from return values."

You may put any number of unrelated guidelines in the text, for example all of the samples above.

For best results, separate each different guideline with a new line.
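
For instance, the three sample guidelines above could be combined into a single global guideline text, one per line. The variable names are illustrative only.

```python
guidelines = [
    "Number fields related to amounts should always have 3 decimal places.",
    "Country fields should return the ISO alpha-3 code of the country.",
    "Ensure ASCII compliance by removing all diacritics from return values.",
]

# Separate each guideline with a new line, as recommended.
global_guidelines = "\n".join(guidelines)
print(global_guidelines)
```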

Language

You can specify a field's Title, Name, Description, and Guidelines in most languages.

This also applies to the Data Schema's Global Guidelines.

Supported languages include, but are not limited to:

  • European languages: English, French, Spanish, German, Italian, Portuguese, Russian, Greek, etc.

  • Asian languages: Hindi, Bengali, Turkish, Urdu, Farsi, Armenian, etc.

  • East Asian languages: Japanese, Mandarin, Korean, Vietnamese, etc.

  • Semitic languages: Arabic, Hebrew, Amharic, etc.

  • African languages: Swahili, Yoruba, Zulu, etc.

Note: while the models can understand these languages, we are not able to provide in-depth support for all languages.

Technical Limitations

Number of Fields in the Data Schema

The recommended maximum number of fields in a Data Schema is 25.

Exceeding this number will not cause errors, but response times will increase.

Names of Fields

A field name may only contain:

  • lowercase Latin letters without accents (a-z)

  • numbers (0-9)

  • underscores (_), but neither the first nor the last character can be an underscore
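
These naming rules can be checked before creating a field. The sketch below expresses them as a regular expression using Python's standard `re` module; the function name `is_valid_field_name` is hypothetical, and the check covers only the three rules listed above.

```python
import re

# First and last characters: a-z or 0-9. Underscores are allowed in between.
FIELD_NAME_PATTERN = re.compile(r"[a-z0-9](?:[a-z0-9_]*[a-z0-9])?")


def is_valid_field_name(name: str) -> bool:
    """Return True if the name satisfies the three rules listed above."""
    return FIELD_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_field_name("total_amount"))  # True
print(is_valid_field_name("_total"))        # False: leading underscore
print(is_valid_field_name("Total"))         # False: uppercase letter
```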

Next Steps

Now that you're familiar with the different components of the Data Schema, you'll want to take a look at Data Schema Best Practices for tips on building an accurate and well-performing Data Schema.
