Document Ingestion Control

Before AI Can Reason Over Documents, the Documents Have to Survive Extraction

AI systems do not consume contracts, statements, policies, correspondence, and case files directly.

They consume whatever the ingestion pipeline produces.

If that pipeline flattens structure, loses tables, detaches annotations, removes page evidence, or breaks source traceability, the model is reasoning over damaged material.

AUD 9,500 to 12,500 · 1 week

Start the intake · View Sourcetrace

Who It Is For

Document Workflows Where Structure Matters

This review is for organisations using documents in AI or retrieval workflows where structure matters.

Legal correspondence and discovery bundles

Bank statements and financial tracing

Contracts and clauses

Board papers and policy libraries

Internal knowledge bases

Regulatory or compliance document sets

Legacy RTF or PDF archives

Private RAG pipelines

AI search over sensitive business material

The Problem

Flattening Is Not Neutral

A document can look orderly to a human and become disorderly the moment it enters a pipeline.

A PDF is often a set of positioned drawing instructions rather than a semantic document. RTF files can carry structure that disappears when they are treated as plain text.

Once structure is lost, later AI stages try to infer what the ingestion layer already destroyed.

• Tables become unreliable text blocks
• Amounts separate from their row labels
• Dates lose sequence and context
• Clauses detach from headings
• Page numbers and source positions disappear
• Annotations are ignored or misattached
• Reading order is guessed without evidence
• Metadata is discarded
• Retrieval returns plausible fragments without auditability
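
To make the first two failure modes concrete, here is a minimal sketch in plain Python. It assumes a hypothetical three-column statement table; the `Cell` type, the sample values, and both function names are invented for illustration and are not part of Sourcetrace, rtfstruct, or pdfstruct. It contrasts a flattening step that reads cells in column order with a step that keeps each row together along with its page evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One table cell with the position evidence an extractor could keep (hypothetical)."""
    text: str
    page: int
    row: int
    col: int


# Hypothetical bank-statement table: a header row plus two transactions.
CELLS = [
    Cell("Date", 1, 0, 0), Cell("Description", 1, 0, 1), Cell("Amount", 1, 0, 2),
    Cell("2024-03-01", 1, 1, 0), Cell("Rent", 1, 1, 1), Cell("-2,400.00", 1, 1, 2),
    Cell("2024-03-02", 1, 2, 0), Cell("Salary", 1, 2, 1), Cell("5,100.00", 1, 2, 2),
]


def flatten_column_major(cells):
    """Emit text column by column, one way a flattening step can order cells."""
    ordered = sorted(cells, key=lambda c: (c.col, c.row))
    return " ".join(c.text for c in ordered)


def rows_with_evidence(cells):
    """Rebuild each data row as a record keyed by header, keeping page and row index."""
    header = {c.col: c.text for c in cells if c.row == 0}
    rows = {}
    for c in cells:
        if c.row == 0:
            continue  # header cells only supply the field names
        record = rows.setdefault(c.row, {"page": c.page, "row": c.row})
        record[header[c.col]] = c.text
    return [rows[r] for r in sorted(rows)]


if __name__ == "__main__":
    # Flattened: amounts end up far from the labels they belong to.
    # Date 2024-03-01 2024-03-02 Description Rent Salary Amount -2,400.00 5,100.00
    print(flatten_column_major(CELLS))
    # Structured: each amount stays attached to its date, label, and page.
    for record in rows_with_evidence(CELLS):
        print(record)
```

In the flattened string nothing links "-2,400.00" to "Rent", so a later stage has to guess. The structured records keep that link and the page reference, which is the property the review looks for.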

What We Assess

Extraction Reliability Before AI Scale

Whether documents are being flattened too early

Whether tables, rows, columns, and merged cells are preserved or lost

Whether page, position, and source evidence survive extraction

Whether annotations, links, metadata, and form fields are captured

Whether retrieval chunks reflect document structure

Whether sensitive documents require local-first processing

Whether current tooling is suitable for the document class

Whether failure modes are visible through diagnostics
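
As an illustration of what "visible through diagnostics" could look like, the sketch below checks retrieval chunks for missing traceability fields. It is an assumption-laden example rather than the review's actual method: the `Chunk` fields (`source_file`, `page`, `bbox`, `heading_path`) and the function names are hypothetical, and a real pipeline would use its own schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Chunk:
    """A retrieval chunk plus the evidence needed to trace it to its source (hypothetical schema)."""
    text: str
    source_file: Optional[str] = None
    page: Optional[int] = None
    bbox: Optional[Tuple[float, float, float, float]] = None  # x0, y0, x1, y1 on the page
    heading_path: Tuple[str, ...] = ()  # e.g. ("Schedule 2", "Clause 4.1")


def traceability_gaps(chunk):
    """Return the names of the evidence fields this chunk is missing."""
    gaps = []
    if not chunk.source_file:
        gaps.append("source_file")
    if chunk.page is None:
        gaps.append("page")
    if chunk.bbox is None:
        gaps.append("bbox")
    if not chunk.heading_path:
        gaps.append("heading_path")
    return gaps


def audit(chunks):
    """Map chunk index to its missing fields, skipping fully traceable chunks."""
    report = {}
    for index, chunk in enumerate(chunks):
        gaps = traceability_gaps(chunk)
        if gaps:
            report[index] = gaps
    return report


if __name__ == "__main__":
    chunks = [
        Chunk("The tenant shall pay rent monthly.", "lease.pdf", 4,
              (72.0, 510.0, 523.0, 540.0), ("Part B", "Clause 4.1")),
        Chunk("5,100.00", "statement.pdf"),  # flattened fragment with no page or position
    ]
    print(audit(chunks))
```

A report like this does not repair anything, but it makes the loss of page, position, and heading evidence countable before the chunks reach a retrieval index.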

What You Receive

One Week, Focused Findings, Clear Next Steps

Document-ingestion risk report

Extraction failure map

Representative examples of structure loss

Table and layout reliability assessment

Source traceability assessment

Privacy and local-processing recommendations

Tooling recommendation

Remediation priorities

Next-step control recommendation

Sourcetrace Behind the Review

Infrastructure Proof, Not a Preset Conclusion

Lumen & Lever maintains Sourcetrace, a local-first document-structure layer used to inspect how document meaning survives extraction.

Sourcetrace RTF is powered by rtfstruct, a free open-source parser for reading RTF as structure, not just text.

Sourcetrace PDF is powered by pdfstruct, a source-available commercial parser for traceable PDF extraction.

The review's conclusions remain tool-agnostic. If a better existing tool fits the document set, the recommendation will say so.

When Not To Choose This

Choose the Narrowest Useful Entry Point

• If you only need a broad register of AI use across the business, start with the AI Usage Control Baseline.
• If you need board-grade readiness across governance, lifecycle control, cost behaviour, override capability, and capital gates, start with the Structural AI Architecture Sprint.
• If your documents are mostly scanned images and OCR quality is the central issue, this review may identify the problem but is not an OCR implementation project.

Next Step

Assess the Documents Before Scaling the AI System

If document structure is unreliable at ingestion, retrieval and model reasoning inherit the damage.

Start the intake