How to set up a document ingestion flow - PDFs

This mini-guide provides a quick overview of how to upload, process, and use PDF files of varying length and content in mAIstro.

PDF with text

Text-based PDFs are the most straightforward to work with in mAIstro. These files contain embedded digital text (not scanned images), which allows the AI to directly parse, analyze, and reuse the content.

To process a PDF that contains only text, follow these steps:

  1. Upload the PDF as a local document using the Upload Document node under Upload Data.
  2. Use a context loop to split the document into smaller, readable sections.
  3. Save the content into a variable to be used by the AI agent.
  4. End the loop.

EXAMPLE NTL:

{{ doc  | name: "example.pdf" }}
{{ contextLoop  | tokens: "3000" | overlap: "10" }}
{{ variable  | name: "exampleContent" | mode: "append" }}
{{ endLoop  }}
<< name: exampleContent, prompt: false >>
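To make the effect of the tokens and overlap settings concrete, the Python sketch below imitates a context loop outside of mAIstro. It is only an illustration under stated assumptions: tokens are approximated by whitespace-separated words, overlap is treated as a count of repeated tokens, and the function name chunk_with_overlap is hypothetical. mAIstro's own tokenizer and its exact interpretation of overlap may differ.

```python
def chunk_with_overlap(text, max_tokens=3000, overlap=10):
    """Split text into windows of at most max_tokens whitespace tokens.

    Assumption: a "token" is a whitespace-separated word, and each window
    after the first starts `overlap` tokens before the previous one ended.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    tokens = text.split()
    step = max_tokens - overlap  # how far the window advances each pass
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Imitates a variable in "append" mode: each chunk is added to the end.
example_content = ""
for chunk in chunk_with_overlap("one two three four five six seven",
                                max_tokens=4, overlap=1):
    example_content += chunk + "\n"

print(example_content)
```

With a window of 4 tokens and an overlap of 1, the word "four" closes the first segment and opens the second, which is how overlap carries context across a segment boundary.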

PDF with images

When a PDF consists of scanned pages or embedded images (e.g., from a physical book or printed document), it must be converted into readable text using Optical Character Recognition (OCR).

To process an image-based PDF for use in mAIstro:

  1. Upload the document using the Upload & OCR node under Upload Data.
  2. Once OCR is applied, you can proceed with looping and variable setup just like with a text-based PDF.

For more information on applying OCR to an image, see Upload & OCR.

Long PDF with images and text

Complex PDFs such as technical manuals, equipment documentation, or compliance guides often contain both unstructured and structured content (e.g., narrative text, tables, flowcharts).

To process a long PDF that contains both images and text in mAIstro, follow these steps:

  1. Set up a PDF loop, which splits a PDF by page and loops through the pages as both text and images.
  2. Send each page's text and image to the LLM and append the output to a variable.
  3. End the PDF loop.
  4. Run a context loop on the variable, then end the loop.

EXAMPLE NTL:

{{ pdfLoop  | file: "longexample.pdf" | pagesPerLoop: "1" }}
{{ LLM  | prompt: "Your job is to turn everything in this business document into text. \n\nIf there are any images, describe them in detail.\n \nIf there are charts or diagrams, explain all of the data points.\nHere is the text:\n<< name: pdfLoopText0, prompt: false >>\n" | cache: "true" | images: "<< name: pdfLoopImage0, prompt: false >>" }}=>{{ variable  | name: "accumulator" | mode: "append" }}
{{ endLoop  }}
<< name: accumulator, prompt: false >>
{{ contextLoop  | tokens: "3000" | overlap: "10" }}
{{ endLoop  }}
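The accumulate step of this flow can be pictured with the short Python sketch below. It is not mAIstro code: describe_pages is a hypothetical stand-in for the LLM node, run_pdf_loop is an invented name, and the page data is made up for illustration. The sketch only shows how pages are walked in groups of pagesPerLoop and how each result is appended to a single accumulator.

```python
def describe_pages(texts, images):
    """Hypothetical stand-in for the LLM node; a real flow would call a model."""
    body = " ".join(t.strip() for t in texts)
    n_images = sum(1 for img in images if img is not None)
    return f"{body} [{n_images} image(s) described]"


def run_pdf_loop(pages, pages_per_loop=1):
    """Walk (text, image) page pairs in groups and append each result."""
    if pages_per_loop < 1:
        raise ValueError("pages_per_loop must be at least 1")
    accumulator = ""
    for start in range(0, len(pages), pages_per_loop):
        group = pages[start:start + pages_per_loop]
        texts = [text for text, _ in group]
        images = [image for _, image in group]
        # "append" mode: the new output goes after what is already stored.
        accumulator += describe_pages(texts, images) + "\n"
    return accumulator


# Made-up sample pages: extracted text plus placeholder image bytes.
pages = [
    ("Intro text", None),
    ("Wiring diagram", b"png-bytes"),
    ("Specs table", b"png-bytes"),
]
print(run_pdf_loop(pages, pages_per_loop=1))
```

Once the accumulator holds the text for every page, it can be split into token-sized segments in a second pass, which corresponds to the context loop at the end of the NTL example.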

Final Tips

  • You can check if your PDF is text-based by trying to highlight/select text. If you can’t, it likely needs OCR.
  • Token overlap repeats some content from the end of one segment at the start of the next, which preserves sentence continuity across segments.
  • Test small samples first when working with large PDFs to fine-tune your prompts or loop settings.