Data Schema

Overview

An Extraction Data Schema is a type of structured data that helps the system identify which data points to extract from specific documents.

A Data Schema tells the system which technical processes to apply and ensures the extracted data is returned in a format that is easy for the user to access. It is composed of a set of Fields (or data points), each with a type and an example.

Field Types

Base Types

| Field Type | Description |
| --- | --- |
| Text | A sequence of characters representing textual data. |
| Number | Numeric data, which can be an integer or a floating-point value. |
| Date | A specific year, month, and day, formatted as YYYY-MM-DD for a date or YYYY-MM-DD HH:mm:ss for a date-time. |
| Classification | A value chosen from a predefined list of categories or types. |
| Boolean | Represents two possible values: true or false. |
| Nested Object | A complex data type containing multiple subfields or properties, allowing one level of nesting. |
| Object Detection | Detects the location of a document feature, such as a logo, signature, or photo. |
| Barcode | Detects the location of a 1D barcode (e.g. UPC, EAN) or a 2D barcode (e.g. QR Code, Data Matrix). Additionally, attempts to decode the contents of the barcode as a string value. |
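As an illustration of the two formats given for the Date type, here is a minimal Python sketch that parses a returned value on the client side. The function name `parse_date_value` is hypothetical, and the sketch assumes Date values arrive as plain strings; it is not part of the product.

```python
from datetime import datetime

# The two formats described for the Date type.
DATE_FORMAT = "%Y-%m-%d"
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_date_value(raw: str) -> datetime:
    """Parse a Date field value given in either documented format."""
    for fmt in (DATETIME_FORMAT, DATE_FORMAT):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue  # Try the next format.
    raise ValueError(f"Unrecognized date value: {raw!r}")


print(parse_date_value("2025-06-10"))
print(parse_date_value("2025-06-10 14:30:00"))
```

A plain date is returned as a `datetime` at midnight, so both formats can be compared or sorted together.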

Array Types

Any field can be made into an array, a list of values.

Simply enable "Multiple items can be extracted" when creating or modifying the field.

The return type will be an array of the base type, for example a list of text values.

It is possible to have a list of nested objects, but not a list of lists.

In some cases, there can be duplicate items, for example when the same value appears on several pages.

Enable "Filter out duplicates from the list of items" to fix this.
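The exact logic behind the duplicate filter is not described here. As a rough client-side equivalent, the following Python sketch removes repeated values from a list while keeping the first occurrence of each. The function name `filter_duplicates` is hypothetical, and the sketch only covers hashable values such as text or numbers, not nested objects.

```python
from typing import Hashable, Iterable, List, TypeVar

T = TypeVar("T", bound=Hashable)


def filter_duplicates(items: Iterable[T]) -> List[T]:
    """Drop repeated values, keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(items))


# The same reference found on several pages is kept only once.
print(filter_duplicates(["INV-001", "INV-002", "INV-001"]))
```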

Example

An example of the field types for a basic invoice extraction Data Schema:

| Field | Type | Example |
| --- | --- | --- |
| Supplier Name | Text | Acme Supplies Ltd. |
| Supplier Logo | Object Detection | Polygon around the logo |
| Supplier Company Registration | Nested Object | See sub-fields below |
| Supplier Company Registration.Number | Text | CRN-20250123 |
| Supplier Company Registration.Type | Classification | VAT NUMBER |
| Date | Date | 2025-06-10 |
| Total Amount | Number | 1540.75 |
| Taxes | Nested Object (array) | See sub-fields below |
| Taxes[0].Rate | Number | 0.185 |
| Taxes[0].Base | Number | 1300.00 |
| Taxes[0].Amount | Number | 240.75 |
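To show how these types might map onto a result, here is a hedged Python sketch that holds the example values in a dictionary and runs a simple sanity check. The key names (such as `supplier_company_registration`) and the dictionary layout are assumptions for illustration, not the actual response format. The Supplier Logo polygon is left out because its coordinate format is not described here. The check assumes the total equals the tax bases plus the tax amounts, which holds for the example values.

```python
import math

# Hypothetical result layout; the values are the examples listed above.
example_result = {
    "supplier_name": "Acme Supplies Ltd.",
    "supplier_company_registration": {  # Nested Object
        "number": "CRN-20250123",
        "type": "VAT NUMBER",  # Classification value
    },
    "date": "2025-06-10",
    "total_amount": 1540.75,
    "taxes": [  # Array of Nested Objects
        {"rate": 0.185, "base": 1300.00, "amount": 240.75},
    ],
}


def taxes_match_total(result: dict, tolerance: float = 0.005) -> bool:
    """Return True if tax bases plus tax amounts add up to the total."""
    computed_total = sum(tax["base"] + tax["amount"] for tax in result["taxes"])
    return math.isclose(computed_total, result["total_amount"], abs_tol=tolerance)


print(taxes_match_total(example_result))
```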

Performance Optimization

To get the best possible extraction data from a model, you can of course use the Continuous Learning (RAG) feature, but the very first step is to ensure the Data Schema you're using is optimized.

If the foundation is solid, the house is solid.

The various properties of the field all have a role to play in getting the best possible performance.

Field Name and Title

The field Name is automatically generated from the field Title. You can however modify the Title afterwards.

Both the Name and Title are used during processing (inference), but the Name is more important.

Try to use clear, simple names that will precisely describe the field you want to extract. The goal is to avoid any possible confusion between data points present in the document.

Suppose you want to extract the name of the company that issued an invoice.

In our model, we've used the field Name supplier_name. It clearly tells the AI to extract only the name, and specifically that of the invoice supplier.

You could also use vendor_name, which means the same thing with the same level of precision.

⚠️ supplier might work but is imprecise: which information about the supplier do you need?

⚠️ company_name might work but is imprecise: we know you need the name of a company, but not whether that company is the supplier or the customer.

company will likely not work as expected: we know neither which information you need nor which company is concerned.

Field Type

Use the Field Type that best suits the data you need to extract.

For example, while you could use a string for due_date, a date field type is definitely better.

Field Description

The field's Description also has an impact on the model's performance.

Use it to describe what the field represents, and/or what you use it for.

For example, the supplier_name field could have:

The name of the supplier.

Used in internal processing to match our supplier ID with the name found on the document.

Field Extraction Guidelines

Sometimes changing the field name and type is not enough to explain what you need for a field. In that case, you can add extraction Guidelines to the field.

Use natural language to explain how to properly extract the data, and/or any extra steps like formatting.

For instance, with supplier_phone_number, adding the following extraction guidelines could be useful:

If you find several phone numbers in the document, use the phone number of the supplier headquarters.

Always reformat the data to match the international phone number format, as follows: +1-212-456-7890
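To tie the field properties together, here is a minimal Python sketch of a field definition that carries a name, title, type, description, and guidelines, plus a check that an extracted value follows the requested phone format. The `FieldDefinition` class, its attribute names, the Title and Description wording, and the pattern generalized from the single example number are all assumptions for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class FieldDefinition:
    """Hypothetical container for the properties of one field."""
    name: str
    title: str
    type: str
    description: str = ""
    guidelines: str = ""


supplier_phone_number = FieldDefinition(
    name="supplier_phone_number",
    title="Supplier Phone Number",
    type="Text",
    description="The phone number of the supplier.",
    guidelines=(
        "If you find several phone numbers in the document, use the phone "
        "number of the supplier headquarters. Always reformat the data to "
        "match the international phone number format, as follows: "
        "+1-212-456-7890"
    ),
)

# Assumed pattern, generalized from the example +1-212-456-7890.
PHONE_PATTERN = re.compile(r"\+\d{1,3}-\d{3}-\d{3}-\d{4}")


def looks_like_expected_phone(value: str) -> bool:
    """Return True if the value matches the format asked for in the guidelines."""
    return PHONE_PATTERN.fullmatch(value) is not None


print(looks_like_expected_phone("+1-212-456-7890"))
print(looks_like_expected_phone("(212) 456-7890"))
```

A check like this can flag results where the guidelines were not followed, so they can be reviewed.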

Relative Importance of Field Properties

Not all field properties have the same importance or weight when it comes to how the models process files.

Additionally, not all types of fields are handled the same way.

In the following table, "Normal Fields" are those that extract textual information from the document (text, dates, numbers, etc), whether they are simple fields, lists, or nested object fields.

"Object Detection" refers to specific processing to extract polygons of various elements on the document, such as signatures, ID photos, etc.

| Property | Normal Field Usage | Object Detection Usage |
| --- | --- | --- |
| Name | Most important | Not used |
| Title | Important | Most important |
| Description | Complementary | Not used |
| Guidelines | Complementary | Not used |
| Classification Values | Very important (only for classification fields) | Not used |

Language

You can specify a field's Title, Name, Description, and Guidelines in most languages. Note that the Name can only contain ASCII characters (see Names of Fields under Technical Limitations).

This includes, but is not limited to:

  • European languages: English, French, Spanish, German, Italian, Portuguese, Russian, Greek, etc

  • Asian languages: Hindi, Bengali, Turkish, Urdu, Farsi, Armenian, etc

  • East Asian languages: Japanese, Mandarin, Korean, Vietnamese, etc

  • Semitic languages: Arabic, Hebrew, Amharic, etc

  • African languages: Swahili, Yoruba, Zulu, etc

Note: while the models can understand these languages, we are not able to provide in-depth support for all of them.

Technical Limitations

Number of Fields in the Data Schema

For your data schema, the recommended maximum number of fields is 25.

Beyond this limit, performance will be drastically reduced.

Names of Fields

The field Name must contain only:

  • lowercase Latin letters without accents (a-z)

  • numbers (0-9)

  • underscores (_)
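The limits in this section can be checked before creating a schema. The following Python sketch validates a field Name against the allowed characters, suggests a Name from a Title, and flags schemas above the recommended size. The function names are hypothetical, the suggestion logic is only one possible approach (the platform's own Name generation is not described here), and how nested sub-fields count toward the 25-field recommendation is not specified, so the sketch simply counts the names it is given.

```python
import re
import unicodedata
from typing import List

# Allowed characters: lowercase a-z, digits 0-9, and underscores.
FIELD_NAME_PATTERN = re.compile(r"[a-z0-9_]+")
RECOMMENDED_MAX_FIELDS = 25


def is_valid_field_name(name: str) -> bool:
    """Return True if the name uses only the allowed characters."""
    return FIELD_NAME_PATTERN.fullmatch(name) is not None


def suggest_field_name(title: str) -> str:
    """Derive a candidate Name from a Title (assumed approach)."""
    # Decompose accented letters, then drop the accents (é -> e).
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercase, and turn each run of other characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")


def exceeds_recommended_size(field_names: List[str]) -> bool:
    """Return True if the schema has more fields than recommended."""
    return len(field_names) > RECOMMENDED_MAX_FIELDS


print(is_valid_field_name("supplier_name"))
print(is_valid_field_name("Supplier Name"))
print(suggest_field_name("Numéro de TVA"))
```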
