Document Ingestion Control

Before AI Can Reason Over Documents, the Documents Have to Survive Extraction

AI systems do not consume contracts, statements, policies, correspondence, and case files directly.

They consume whatever the ingestion pipeline produces.

If that pipeline flattens structure, loses tables, detaches annotations, removes page evidence, or breaks source traceability, the model is reasoning over damaged material.

AUD 9,500 to 12,500 · 1 week

Who It Is For

Document Workflows Where Structure Matters

This review is for organisations whose AI or retrieval workflows depend on documents in which tables, layout, sequence, and source position carry meaning.

01

Legal correspondence and discovery bundles

02

Bank statements and financial tracing

03

Contracts and clauses

04

Board papers and policy libraries

05

Internal knowledge bases

06

Regulatory or compliance document sets

07

Legacy RTF or PDF archives

08

Private RAG pipelines

09

AI search over sensitive business material

The Problem

Flattening Is Not Neutral

A document can look orderly to a human and become disorderly the moment it enters a pipeline.

A PDF is often a set of positioned drawing instructions, not a semantic document. An RTF file can carry structure that disappears when the file is treated as plain text.

Once structure is lost, later AI stages try to infer what the ingestion layer already destroyed.

  • Tables become unreliable text blocks
  • Amounts separate from their row labels
  • Dates lose sequence and context
  • Clauses detach from headings
  • Page numbers and source positions disappear
  • Annotations are ignored or misattached
  • Reading order is guessed without evidence
  • Metadata is discarded
  • Retrieval returns plausible fragments without auditability
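
To make the failure concrete, the following is a minimal Python sketch, not Sourcetrace code. It assumes a hypothetical `Fragment` record standing in for text that a PDF places at page coordinates, and the sample values are invented. `flatten` joins text in draw order and discards position; `rows_with_evidence` groups fragments that share a vertical position, so each amount stays with its row label and page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical stand-in for one piece of positioned text on a page."""
    page: int
    x: float   # distance from the left edge, in points
    y: float   # distance from the top edge, in points
    text: str


# Assumed draw order: the label column was emitted first, then the amounts.
FRAGMENTS = [
    Fragment(1, 40, 100, "Rent"),
    Fragment(1, 40, 120, "Power"),
    Fragment(1, 40, 140, "Insurance"),
    Fragment(1, 300, 100, "1,200.00"),
    Fragment(1, 300, 120, "90.50"),
    Fragment(1, 300, 140, "310.00"),
]


def flatten(fragments):
    """Join text in draw order and throw the positions away."""
    return " ".join(f.text for f in fragments)


def rows_with_evidence(fragments, tolerance=2.0):
    """Group fragments by page and vertical bucket, keeping page and position."""
    buckets = defaultdict(list)
    for f in fragments:
        # Crude bucketing: fragments within roughly `tolerance` points share a row.
        buckets[(f.page, round(f.y / tolerance))].append(f)
    rows = []
    for (page, _), cells in sorted(buckets.items()):
        cells.sort(key=lambda f: f.x)  # left-to-right within the row
        rows.append({"page": page, "y": cells[0].y, "cells": [c.text for c in cells]})
    return rows


if __name__ == "__main__":
    print(flatten(FRAGMENTS))
    for row in rows_with_evidence(FRAGMENTS):
        print(row)
```

In this sample the flattened string lists every label before any amount, so nothing ties 90.50 to Power. The row grouping is deliberately crude (fixed vertical buckets); real layouts need more careful baseline and column handling.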

What We Assess

Extraction Reliability Before AI Scale

01

Whether documents are being flattened too early

02

Whether tables, rows, columns, and merged cells are preserved or lost

03

Whether page, position, and source evidence survive extraction

04

Whether annotations, links, metadata, and form fields are captured

05

Whether retrieval chunks reflect document structure

06

Whether sensitive documents require local-first processing

07

Whether current tooling is suitable for the document class

08

Whether failure modes are visible through diagnostics
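
As an illustration of items 03 and 05, here is a minimal Python sketch of heading-aware chunking that keeps page evidence. The `Block` and `Chunk` types, the `chunk_by_heading` function, and the sample clauses are assumptions made for this example; they do not describe the output of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical extracted block: a heading or a paragraph with its page."""
    kind: str   # "heading" or "paragraph"
    text: str
    page: int


@dataclass(frozen=True)
class Chunk:
    """A retrieval chunk that still knows its heading and source pages."""
    heading: str
    text: str
    pages: tuple


def chunk_by_heading(blocks):
    """Start a new chunk at each heading; record the pages the body came from."""
    chunks = []
    heading, body, pages = "", [], []
    for block in blocks:
        if block.kind == "heading":
            if body:
                chunks.append(Chunk(heading, " ".join(body), tuple(sorted(set(pages)))))
            heading, body, pages = block.text, [], []
        else:
            body.append(block.text)
            pages.append(block.page)
    if body:  # emit the final chunk
        chunks.append(Chunk(heading, " ".join(body), tuple(sorted(set(pages)))))
    return chunks


# Invented sample: a clause that runs across a page break.
BLOCKS = [
    Block("heading", "7. Termination", 4),
    Block("paragraph", "Either party may terminate on 30 days' notice.", 4),
    Block("paragraph", "Notice must be given in writing.", 5),
    Block("heading", "8. Liability", 5),
    Block("paragraph", "Liability is capped at fees paid in the prior 12 months.", 5),
]

if __name__ == "__main__":
    for chunk in chunk_by_heading(BLOCKS):
        print(chunk.heading, chunk.pages, chunk.text)
```

A chunk built this way can be cited back to a clause heading and page range. A fixed-size text window over flattened output cannot, which is the difference the review looks for.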

What You Receive

One Week, Focused Findings, Clear Next Steps

01

Document-ingestion risk report

02

Extraction failure map

03

Representative examples of structure loss

04

Table and layout reliability assessment

05

Source traceability assessment

06

Privacy and local-processing recommendations

07

Tooling recommendation

08

Remediation priorities

09

Next-step control recommendation

Sourcetrace Behind the Review

Infrastructure Proof, Not a Preset Conclusion

Lumen & Lever maintains Sourcetrace, a local-first document-structure layer used to inspect how document meaning survives extraction.

Sourcetrace RTF is powered by rtfstruct, a free open-source parser for reading RTF as structure, not just text.

Sourcetrace PDF is powered by pdfstruct, a source-available commercial parser for traceable PDF extraction.

The review remains tool-agnostic in conclusion. If a better existing tool fits the document set, the recommendation will say so.

When Not To Choose This

Choose the Narrowest Useful Entry Point

  • If you only need a broad register of AI use across the business, start with the AI Usage Control Baseline.
  • If you need board-grade readiness across governance, lifecycle control, cost behaviour, override capability, and capital gates, start with the Structural AI Architecture Sprint.
  • If your documents are mostly scanned images and OCR quality is the central issue, this review may identify the problem but is not an OCR implementation project.

Next Step

Assess the Documents Before Scaling the AI System

If document structure is unreliable at ingestion, retrieval and model reasoning inherit the damage.

Start the intake