Load and Adjust a File

Reference documentation on loading and manipulating files using Mindee client libraries.

This is reference documentation. Looking for a quick TL;DR?

  • Take a look at the Integrating Mindee page.

  • Use the search bar above to ask our documentation AI to write code samples for you.

Requirements

You'll need to have your Mindee client configured correctly, as described in the Configure the Client page.

Overview

Sending a file takes three steps:

  1. Load a source file.

  2. Optional: adjust the source file before sending.

  3. Use the Mindee client instance to send the file.

Load a Source File

You can load a source file from a path, from raw bytes, from a bytes stream, or from a language-specific object. Choose the appropriate type based on your application requirements.

If you're unsure of which to use, we recommend loading from a path.

You'll need to use the mindee_client instance created in Configure the Client.

To load a path string, use source_from_path.

input_path = "/path/to/the/file.ext"
input_source = mindee_client.source_from_path(input_path)

To load a Path instance, use source_from_path.

from pathlib import Path

input_path = Path("/path/to/the/file.ext")
input_source = mindee_client.source_from_path(input_path)

To load raw bytes, use source_from_bytes.

from pathlib import Path

input_path = Path("/path/to/the/file.ext")
with input_path.open("rb") as fh:
    input_bytes = fh.read()

input_source = mindee_client.source_from_bytes(
    input_bytes,
    filename="file.ext",
)

To load a base-64 string, use source_from_b64string. The string will be decoded into bytes internally.

input_base64 = "iVBORw0KGgoAAAANSUhEUgAAABgAAA ..."

input_source = mindee_client.source_from_b64string(
    input_base64,
    filename="base64_file.txt",
)
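If your application holds raw bytes but needs to pass a base-64 string, the Python standard library can produce one. The following is a minimal sketch using only the standard `base64` module; the helper name `bytes_to_b64string` is hypothetical and not part of the Mindee client library.

```python
import base64


def bytes_to_b64string(data: bytes) -> str:
    """Encode raw bytes as an ASCII base-64 string (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")


# The 8-byte signature that starts every PNG file, used here as sample data.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = bytes_to_b64string(png_signature)

# Decoding the string gives back the original bytes.
decoded = base64.b64decode(encoded)
```

The resulting string can then be passed to source_from_b64string in the same way as input_base64 above.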

To load a file handle, use source_from_file. It must be opened in binary mode, as a BinaryIO.

from pathlib import Path

input_path = Path("/path/to/the/file.ext")
with input_path.open("rb") as fh:
    input_source = mindee_client.source_from_file(fh)
    # IMPORTANT:
    # Continue all operations inside the 'with' statement,
    # while the file handle is still open.
    # 'params' is assumed to be defined elsewhere in your code.
    mindee_client.enqueue_and_get_inference(
        input_source, params
    )
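The reason for staying inside the 'with' statement is general Python behavior: a file handle is closed when the block exits, and reading from a closed handle fails. The sketch below illustrates this with an in-memory `io.BytesIO` handle from the standard library. It does not use the Mindee client, and the function name is hypothetical.

```python
import io


def read_inside_and_after_with(payload: bytes):
    """Read a handle inside its 'with' block, then try again after it closes."""
    handle = io.BytesIO(payload)
    with handle:
        # The handle is open here, so reading works.
        inside = handle.read()
    try:
        # The handle was closed when the 'with' block exited.
        handle.read()
        failed_after_close = False
    except ValueError:
        failed_after_close = True
    return inside, failed_after_close


inside, failed_after_close = read_inside_and_after_with(b"example bytes")
```

Any code that still needs the handle, including sending the source, therefore belongs inside the block.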

Adjust the Source File

Optionally make changes and adjustments to the source file before sending.

All file adjustments are applied in-memory to the source file instance.

If loaded from disk, the original file is not modified.

Fixing PDF Headers

In some cases, PDFs have corrupt or invalid headers. Sending such a file results in a 4xx HTTP error because the server is unable to process it.

You can try to fix the headers using the provided functions.

Note: this feature is not yet available for all languages.

Use the input_source instance created above.

input_source.fix_pdf()
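To decide whether a file might need fixing, you can inspect its first bytes yourself. The sketch below is an illustration only and is not a description of what fix_pdf does internally: it looks for the `%PDF-` marker near the start of the data. The helper name and the 1024-byte search window are assumptions chosen for this example.

```python
PDF_MARKER = b"%PDF-"


def find_pdf_header_offset(data: bytes, search_window: int = 1024) -> int:
    """Return the offset of the PDF marker within the window, or -1 if absent."""
    return data[:search_window].find(PDF_MARKER)


clean_pdf = b"%PDF-1.7\n(rest of file)"
prefixed_pdf = b"\x00\x00junk" + clean_pdf  # stray bytes before the header

clean_offset = find_pdf_header_offset(clean_pdf)
prefixed_offset = find_pdf_header_offset(prefixed_pdf)
```

An offset of 0 means the header is where it should be. A positive offset means extra bytes precede the header, and -1 means no marker was found in the window.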

File Compression

There is no need to send excessively large files to the Mindee API.

Many modern smartphones, however, take very high-resolution images.

You can compress files before sending them to the API.

Use the input_source instance created above.

Basic usage applies to both images and PDFs:

input_source.compress(quality=85)

For images, you can also set a maximum height and/or width. The aspect ratio will always be preserved.

For example, to compress and resize to no greater than 1920x1920 pixels:

input_source.compress(
    quality=85, max_width=1920, max_height=1920
)
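To see what preserving the aspect ratio means for the final dimensions, the arithmetic can be sketched as follows. This illustrates the general technique rather than the library's implementation: the function name `fit_within` is hypothetical, and the rounding and the choice never to upscale are assumptions.

```python
from typing import Optional, Tuple


def fit_within(
    width: int,
    height: int,
    max_width: Optional[int] = None,
    max_height: Optional[int] = None,
) -> Tuple[int, int]:
    """Scale (width, height) down to fit the limits, keeping the aspect ratio."""
    scale = 1.0  # never upscale (assumption for this sketch)
    if max_width is not None:
        scale = min(scale, max_width / width)
    if max_height is not None:
        scale = min(scale, max_height / height)
    # One scale factor is applied to both sides, so the ratio is preserved.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = fit_within(4000, 3000, max_width=1920, max_height=1920)
already_small = fit_within(800, 600, max_width=1920, max_height=1920)
```

Here a 4000x3000 photo becomes 1920x1440: the wider side hits the limit and the other side shrinks by the same factor.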

PDF Page Manipulations

In some cases, PDFs contain superfluous pages, for example a cover page or terms and conditions, which are not useful for the desired data extraction.

These extra pages count towards your billing and slow down processing.

It is therefore in your best interest to remove them before sending.

Parameters:

  • "Page Indexes" is required and is a list of 0-based page indexes. Use negative values to count from the end, e.g. -1 for the last page.

  • "Operation" specifies whether to keep only specified pages or remove specified pages. One of "Keep Only" or "Remove".

  • "On Min Pages" is optional and specifies the minimum number of pages a document must have for the operation to take place. A value of 0 applies the operation to documents of any length.

The exact parameter names depend on the language; see below.

Use the input_source instance created above.

from mindee import PageOptions

# Set the options as follows:
# For all documents, keep only the first page
page_options = PageOptions(
    operation="KEEP_ONLY",
    page_indexes=[0],
)

# Apply in-memory
input_source.apply_page_options(page_options)

Some other examples:

# Only for documents having 3 or more pages:
# Keep only these pages: first, penultimate, last
PageOptions(
    operation="KEEP_ONLY",
    on_min_pages=3,
    page_indexes=[0, -2, -1],
)

# For all documents:
# Remove the first page
PageOptions(
    operation="REMOVE",
    page_indexes=[0],
)

# Only for documents having 10 or more pages:
# Remove the first 5 pages
PageOptions(
    operation="REMOVE",
    on_min_pages=10,
    page_indexes=list(range(5)),
)
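To make the semantics of these parameters concrete, the sketch below computes which 0-based pages would remain for a given page count. It is a plain-Python illustration, not the library's implementation: the function name `select_pages` is hypothetical, and skipping out-of-range indexes and returning kept pages in document order are assumptions.

```python
from typing import List


def select_pages(
    page_count: int,
    page_indexes: List[int],
    operation: str,
    on_min_pages: int = 0,
) -> List[int]:
    """Return the 0-based indexes of the pages left after the operation."""
    all_pages = list(range(page_count))
    # Documents shorter than the minimum are left untouched.
    if page_count < on_min_pages:
        return all_pages
    resolved = set()
    for index in page_indexes:
        # Negative indexes count from the end: -1 is the last page.
        absolute = index + page_count if index < 0 else index
        if 0 <= absolute < page_count:  # skip out-of-range (assumption)
            resolved.add(absolute)
    if operation == "KEEP_ONLY":
        return sorted(resolved)
    if operation == "REMOVE":
        return [page for page in all_pages if page not in resolved]
    raise ValueError(f"Unknown operation: {operation}")


kept = select_pages(5, [0, -2, -1], "KEEP_ONLY", on_min_pages=3)
too_short = select_pages(2, [0, -2, -1], "KEEP_ONLY", on_min_pages=3)
trimmed = select_pages(12, list(range(5)), "REMOVE", on_min_pages=10)
```

For a 5-page document, keeping the first, penultimate, and last pages leaves pages 0, 3, and 4. A 2-page document is below the minimum of 3, so it keeps both pages.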

Send the File

Now that your file is ready, you'll want to send it to the Mindee servers for processing.

Head on over to the Send for Processing section for details on the next step.
