Data Schema

What is a Data Schema?

An Extraction Data Schema is a structured definition that tells the system which data points to extract from a given kind of document. It guides how the system performs the extraction and ensures the extracted data is returned in a format that is easy for the user to access. A Data Schema is made up of Fields (or data points), each with a type and an example.

Field Types

There are six different field types:

| Field Type | Description |
| --- | --- |
| string | A sequence of characters representing textual data. |
| classification | A predefined list of categories or types. |
| date | A specific year, month, and day, formatted as YYYY-MM-DD. |
| number | Numeric data, which can be an integer or a floating-point value. |
| boolean | Represents two possible values: true or false. |
| nested_object | A complex data type that contains multiple sub-fields or properties, allowing one level of nesting. |
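To make the six types more concrete, here is a minimal Python sketch of how an extracted value could be checked against each of them. It is only an illustration: the `FieldType` enum, the `matches_type` helper, and the `categories` parameter are hypothetical names and are not part of the product's API.

```python
from datetime import date
from enum import Enum


class FieldType(Enum):
    """The six field types listed above (hypothetical enum, for illustration)."""
    STRING = "string"
    CLASSIFICATION = "classification"
    DATE = "date"
    NUMBER = "number"
    BOOLEAN = "boolean"
    NESTED_OBJECT = "nested_object"


def matches_type(value, field_type, categories=()):
    """Return True if a value is plausible for the given field type."""
    if field_type is FieldType.STRING:
        return isinstance(value, str)
    if field_type is FieldType.CLASSIFICATION:
        # The value must be one of the predefined categories.
        return value in categories
    if field_type is FieldType.DATE:
        try:
            date.fromisoformat(value)  # accepts "YYYY-MM-DD"
            return True
        except (TypeError, ValueError):
            return False
    if field_type is FieldType.NUMBER:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type is FieldType.BOOLEAN:
        return isinstance(value, bool)
    if field_type is FieldType.NESTED_OBJECT:
        # One level of nesting: sub-fields may not be objects themselves.
        return isinstance(value, dict) and not any(
            isinstance(sub_value, dict) for sub_value in value.values()
        )
    return False


print(matches_type("2025-06-10", FieldType.DATE))
print(matches_type(1540.75, FieldType.NUMBER))
```

The check for `nested_object` reflects the "one level of nesting" rule: a dictionary of plain values passes, while a dictionary that contains another dictionary does not.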

Example of a Basic Invoice Extraction Data Schema

| Field | Type | Example |
| --- | --- | --- |
| Supplier Name | string | Acme Supplies Ltd. |
| Supplier Company Registration | object | (see sub-fields below) |
| Supplier Company Registration.Number | string | CRN-20250123 |
| Supplier Company Registration.Type | classification | VAT NUMBER |
| Date | date | 2025-06-10 |
| Total Amount | number | 1540.75 |
| Taxes | object | (see sub-fields below) |
| Taxes.Rate | number | 0.185 |
| Taxes.Base | number | 1300.00 |
| Taxes.Amount | number | 240.75 |
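The same example schema can be written down as plain data. The sketch below uses Python dictionaries; the key names (`title`, `type`, `example`, `fields`) and the `flatten` helper are assumptions made for illustration and do not reflect the product's actual storage or export format.

```python
# Hypothetical representation of the invoice schema shown above.
# The nested type is written "object", as in the example.
INVOICE_SCHEMA = [
    {"title": "Supplier Name", "type": "string", "example": "Acme Supplies Ltd."},
    {
        "title": "Supplier Company Registration",
        "type": "object",
        "fields": [
            {"title": "Number", "type": "string", "example": "CRN-20250123"},
            {"title": "Type", "type": "classification", "example": "VAT NUMBER"},
        ],
    },
    {"title": "Date", "type": "date", "example": "2025-06-10"},
    {"title": "Total Amount", "type": "number", "example": 1540.75},
    {
        "title": "Taxes",
        "type": "object",
        "fields": [
            {"title": "Rate", "type": "number", "example": 0.185},
            {"title": "Base", "type": "number", "example": 1300.00},
            {"title": "Amount", "type": "number", "example": 240.75},
        ],
    },
]


def flatten(schema):
    """Yield (dotted_title, type) pairs, one per field and sub-field."""
    for field in schema:
        yield field["title"], field["type"]
        for sub_field in field.get("fields", []):
            yield f'{field["title"]}.{sub_field["title"]}', sub_field["type"]


for title, field_type in flatten(INVOICE_SCHEMA):
    print(f"{title}: {field_type}")
```

Sub-fields are listed under their parent and addressed with a dotted name such as `Taxes.Rate`, which matches the single level of nesting that the nested type allows.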

How to Build a Top-Performing Data Schema?

The Continuous Learning (RAG) feature can improve a model's extraction results, but the very first step is to make sure the Data Schema you're using is optimized.

If the foundation is solid, the house is solid

Field Name and Title

The field name is automatically generated from the field title.

While the title is purely for humans, the name has an influence on how the AI system performs the extraction operation.
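The exact rule used to generate the name is not described here. As a rough sketch, assuming names are snake_case versions of titles (as the `supplier_name` example suggests), the conversion could look like this. `title_to_name` is a hypothetical helper, not a product function.

```python
import re


def title_to_name(title: str) -> str:
    """Lowercase the title and join its alphanumeric words with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    return "_".join(word.lower() for word in words)


print(title_to_name("Supplier Name"))  # supplier_name
```

Since the name is derived from the title, choosing a precise title is the practical way to get a precise name.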

Use simple names that precisely describe the field you want to extract. The goal is to avoid any possible confusion between the data points present in the document.

Suppose you want to extract the name of the company that issued an invoice.

In our model, we've used the field name supplier_name. It clearly tells the AI to extract only the name, and specifically the name of the invoice supplier.

✅ You could also use vendor_name: it means the same thing with the same level of precision.

⚠️ supplier might work but is imprecise: which information about the supplier do you need?

⚠️ company_name might work but is imprecise: we know you need the name of a company, but not whether that company is the supplier or the customer.

❌ company will likely not work as expected: we know neither which information you need nor which company is concerned.
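The reasoning behind these four ratings can be approximated with a small check. This is a deliberately naive sketch: the `AMBIGUOUS_PARTIES` word list and the `review_field_name` helper are hypothetical, chosen only to reproduce the examples above, and the extraction system does not run anything like it.

```python
# Hypothetical list of words that do not say which party is meant.
AMBIGUOUS_PARTIES = {"company", "organization", "party"}


def review_field_name(name: str) -> list[str]:
    """Return a list of warnings for a snake_case field name."""
    tokens = name.split("_")
    warnings = []
    if len(tokens) < 2:
        warnings.append("single word: state both the party and the attribute")
    if any(token in AMBIGUOUS_PARTIES for token in tokens):
        warnings.append("ambiguous party: say supplier or customer instead")
    return warnings


for candidate in ("supplier_name", "vendor_name", "supplier", "company_name", "company"):
    print(candidate, review_field_name(candidate))
```

A name with no warnings corresponds to ✅, one warning to ⚠️, and two warnings to ❌.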

Field Type

Use the Field Type that best suits the field you need.

For due_date, you could use a string, but a date field is a better choice because the value is then returned in the YYYY-MM-DD format.
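To see why the type matters, consider what a plain string field leaves to you. Documents write dates in many layouts, so with a string you would have to normalize them yourself, roughly as in the sketch below. The list of input formats and the `to_iso_date` helper are assumptions for illustration; this is not how the product parses dates.

```python
from datetime import datetime

# Hypothetical layouts a due date might appear in on a document.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def to_iso_date(raw: str):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None


print(to_iso_date("10/06/2025"))  # 2025-06-10
print(to_iso_date("2025-06-10"))  # 2025-06-10
```

With a date field, this normalization step is expected from the extraction itself, so every document yields the same YYYY-MM-DD layout.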

Extraction Guidelines

Sometimes the field name and type are not enough to explain what you need for a field. In that case, you can add Extraction Guidelines to the field. Use natural language to explain how to extract the data properly.

For instance, for supplier_phone_number, adding the following extraction guidelines could be useful:

If you find several phone numbers in the document, I need the phone number of the supplier headquarters. Also, I want you to reformat the data to match the international phone number format, as follows: +1-212-456-7890
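Putting the three levers together (name, type, guidelines), a field definition could be sketched as follows. The `Field` class and its attribute names are hypothetical, and the regular expression is only a narrow reading of the single example format given in the guidelines.

```python
import re
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical field definition: name, type, and optional guidelines."""
    name: str
    type: str
    guidelines: str = ""


supplier_phone_number = Field(
    name="supplier_phone_number",
    type="string",
    guidelines=(
        "If you find several phone numbers in the document, I need the phone "
        "number of the supplier headquarters. Also, I want you to reformat the "
        "data to match the international phone number format, as follows: "
        "+1-212-456-7890"
    ),
)

# Layout of the example: +<country code>-<3 digits>-<3 digits>-<4 digits>.
PHONE_PATTERN = re.compile(r"\+\d{1,3}-\d{3}-\d{3}-\d{4}")


def follows_guideline(value: str) -> bool:
    """Check that an extracted value has the layout requested in the guidelines."""
    return PHONE_PATTERN.fullmatch(value) is not None


print(follows_guideline("+1-212-456-7890"))
print(follows_guideline("(212) 456-7890"))
```

A check like `follows_guideline` can be useful when reviewing results: if extracted values often fail it, the guidelines probably need a clearer example.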
