Data Schema

Overview

An Extraction Data Schema is a type of structured data that helps the system identify which data points to extract from specific documents.

A Data Schema tells the system what to extract and ensures that the extracted data is returned in a consistent format that is easy for the user to access. It is composed of Fields (or data points), each with a type and an example.

Field Types

There are seven different field types:

| Field Type | Description |
| --- | --- |
| String | A sequence of characters representing textual data. |
| Classification | A value chosen from a predefined list of categories or types. |
| Date | A specific year, month, and day, formatted as YYYY-MM-DD. |
| Number | Numeric data, which can be an integer or a floating-point value. |
| Boolean | One of two possible values: true or false. |
| Object Detection | Detects the location of a document feature, such as a logo, signature, or photo. |
| Nested Object | A complex data type that contains multiple sub-fields or properties, allowing one level of nesting. |

Example

An example of the field types for a basic invoice extraction Data Schema:

| Field | Type | Example |
| --- | --- | --- |
| Supplier Name | String | Acme Supplies Ltd. |
| Supplier Logo | Object Detection | Polygon (bounding box) around the logo |
| Supplier Company Registration | Nested Object | (see sub-fields below) |
| Supplier Company Registration.Number | String | CRN-20250123 |
| Supplier Company Registration.Type | Classification | VAT NUMBER |
| Date | Date | 2025-06-10 |
| Total Amount | Number | 1540.75 |
| Taxes | Nested Object | (see sub-fields below) |
| Taxes.Rate | Number | 0.185 |
| Taxes.Base | Number | 1300.00 |
| Taxes.Amount | Number | 240.75 |
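To make the structure concrete, here is a minimal Python sketch of the invoice schema from the table above. It is not the platform's actual schema format, which this page does not show: the SchemaField class, the lowercase type spellings, and every snake_case field name other than supplier_name are assumptions made for illustration. The sketch also enforces the "one level of nesting" rule for Nested Objects.

```python
from dataclasses import dataclass, field
from typing import Any, List

# The seven field types listed above. The spellings are assumptions.
FIELD_TYPES = {
    "string", "classification", "date", "number",
    "boolean", "object_detection", "nested_object",
}


@dataclass
class SchemaField:
    """Hypothetical in-memory model of one Data Schema field."""
    name: str
    type: str
    example: Any = None
    sub_fields: List["SchemaField"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.type not in FIELD_TYPES:
            raise ValueError(f"{self.name}: unknown field type {self.type!r}")
        if self.sub_fields and self.type != "nested_object":
            raise ValueError(f"{self.name}: only nested objects have sub-fields")
        # Nested Objects allow a single level of nesting.
        for sub in self.sub_fields:
            if sub.type == "nested_object":
                raise ValueError(f"{self.name}: sub-fields cannot be nested objects")


# The invoice example from the table, with assumed snake_case names.
invoice_schema = [
    SchemaField("supplier_name", "string", "Acme Supplies Ltd."),
    SchemaField("supplier_logo", "object_detection"),
    SchemaField("supplier_company_registration", "nested_object", sub_fields=[
        SchemaField("number", "string", "CRN-20250123"),
        SchemaField("type", "classification", "VAT NUMBER"),
    ]),
    SchemaField("date", "date", "2025-06-10"),
    SchemaField("total_amount", "number", 1540.75),
    SchemaField("taxes", "nested_object", sub_fields=[
        SchemaField("rate", "number", 0.185),
        SchemaField("base", "number", 1300.00),
        SchemaField("amount", "number", 240.75),
    ]),
]
```

Each sub-field in the table (for example Taxes.Rate) becomes an entry in the sub_fields list of its parent, which keeps the one-level nesting explicit.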

Building a Top-Performing Data Schema

The Continuous Learning (RAG) feature can improve extraction results, but the very first step toward the best possible extraction is to make sure the Data Schema you're using is optimized.

If the foundation is solid, the house is solid.

Field Name and Title

The field name is automatically generated from the field title.

While the title is purely for humans, the name has an influence on how the AI system performs the extraction operation.

Try to use simple names that will precisely describe the field you want to extract. The goal is to avoid any possible confusion between data points present in the document.

Suppose you want to extract the name of the company that issued an invoice.

In our model, we've used the field name supplier_name. It clearly tells the AI to extract only the name, and specifically the name of the invoice supplier.

You could also use vendor_name, which means the same thing with the same level of precision.

⚠️ supplier might work but is imprecise: which information about the supplier do you need?

⚠️ company_name might work but is imprecise: we know you need the name of a company, but not whether that company is the supplier or the customer.

company will likely not work as expected: it tells us neither which information you need nor which company is concerned.
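The exact rule used to generate a field name from its title is not described on this page. As a rough illustration only, the following sketch shows one plausible derivation that yields supplier_name from the title "Supplier Name". The function title_to_name is hypothetical and is not the platform's implementation.

```python
import re
import unicodedata


def title_to_name(title: str) -> str:
    """Derive a snake_case name from a human-readable title (assumed rule)."""
    # Split accented letters into base letter + accent, then drop non-ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of other characters with a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_title.lower()).strip("_")


print(title_to_name("Supplier Name"))  # prints supplier_name
```

Whatever the real rule is, the practical consequence is the same: a precise title produces a precise name.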

Field Type

Try to use the field type that best suits the field you need.

For due_date, for example, you could use a String, but a Date field is definitely a better solution.
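One reason a Date field is preferable: a value in YYYY-MM-DD form can be parsed and used in date arithmetic directly, while free-form text cannot. The sketch below uses only the Python standard library; the days_between helper and the due_date value are hypothetical, chosen for illustration.

```python
from datetime import date


def days_between(start: str, end: str) -> int:
    """Number of days from start to end; both must be YYYY-MM-DD strings."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


invoice_date = "2025-06-10"  # Date example from the invoice table
due_date = "2025-07-10"      # hypothetical due date for illustration

print(days_between(invoice_date, due_date))  # prints 30
```

A value such as "10/06/2025", which a String field might return, would make date.fromisoformat raise a ValueError, and it is also ambiguous between June 10 and October 6.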

Extraction Guidelines

Sometimes the field name and type alone are not enough to explain what you need for a field. In that case, you can add Extraction Guidelines to the field.

Use natural language to explain how to properly extract the data.

For instance, for supplier_phone_number, adding the following extraction guidelines could be useful:

If you find several phone numbers in the document, I need the phone number of the supplier headquarters. Also, I want you to reformat the data to match the international phone number format, as follows: +1-212-456-7890
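If you want to verify downstream that extracted values follow a formatting guideline like the one above, a simple pattern check can help. This is a sketch under assumptions: the pattern is inferred only from the single example +1-212-456-7890, and the function name follows_phone_guideline is hypothetical.

```python
import re

# Assumed pattern, inferred from the one example "+1-212-456-7890":
# "+", a 1 to 3 digit country code, then 3-3-4 digit groups split by hyphens.
PHONE_FORMAT = re.compile(r"\+[0-9]{1,3}-[0-9]{3}-[0-9]{3}-[0-9]{4}")


def follows_phone_guideline(value: str) -> bool:
    """Return True when value matches the assumed guideline format."""
    return PHONE_FORMAT.fullmatch(value) is not None


print(follows_phone_guideline("+1-212-456-7890"))  # prints True
print(follows_phone_guideline("(212) 456-7890"))   # prints False
```

Phone numbers in other countries use different group lengths, so a real check would need a broader pattern than this one.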

You can specify guidelines in most languages, including, but not limited to:

  • European languages: English, French, Spanish, German, Italian, Portuguese, Russian, Greek, etc.

  • Asian languages: Hindi, Bengali, Turkish, Urdu, Farsi, Armenian, etc.

  • East Asian languages: Japanese, Mandarin, Korean, Vietnamese, etc.

  • Semitic languages: Arabic, Hebrew, Amharic, etc.

  • African languages: Swahili, Yoruba, Zulu, etc.

Note: while the models can understand guidelines written in these languages, we are not able to provide in-depth support for all of them.

Technical Limitations

Number of Fields in the Data Schema

The recommended maximum number of fields (properties) in a Data Schema is 25.

Beyond this limit, extraction performance will be drastically reduced.

Names of Fields

A field name may contain only:

  • lowercase Latin letters without accents (a-z)

  • numbers (0-9)

  • underscores (_)
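Both limitations can be checked before a schema is submitted. The following is a minimal sketch: the check_schema function is hypothetical, and it assumes the schema is given as a flat list of field names. Whether sub-fields of Nested Objects count toward the recommended maximum is not stated here, so the sketch does not decide that.

```python
import re
from typing import List

# Allowed characters: lowercase unaccented Latin letters, digits, underscores.
FIELD_NAME_PATTERN = re.compile(r"[a-z0-9_]+")
RECOMMENDED_MAX_FIELDS = 25


def is_valid_field_name(name: str) -> bool:
    """Return True when name uses only the allowed characters."""
    return FIELD_NAME_PATTERN.fullmatch(name) is not None


def check_schema(field_names: List[str]) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = [
        f"invalid field name: {name!r}"
        for name in field_names
        if not is_valid_field_name(name)
    ]
    if len(field_names) > RECOMMENDED_MAX_FIELDS:
        problems.append(
            f"{len(field_names)} fields exceeds the recommended "
            f"maximum of {RECOMMENDED_MAX_FIELDS}"
        )
    return problems


print(check_schema(["supplier_name", "Supplier Name", "société"]))
```

In this usage example, supplier_name passes, while "Supplier Name" (uppercase letters and a space) and "société" (accented letters) are reported as invalid.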
