Source-Available Tool

Local-First PDF Structure Extraction for AI Pipelines

Sourcetrace PDF is powered by pdfstruct, a source-available PDF extraction layer for converting born-digital PDFs into traceable layout-aware JSON and Markdown.

It is designed for private AI pipelines where page evidence, text position, reading order, tables, annotations, metadata, and diagnostics matter before retrieval or model reasoning.

Powered by pdfstruct · Source-available · Free local evaluation · Commercial licence required for production use

What It Preserves

Traceable Layout-Aware JSON and Markdown

  01. Page boundaries
  02. Text runs
  03. Glyph positions where enabled
  04. Lines and paragraph candidates
  05. Reading-order candidates
  06. Table candidates with confidence and evidence
  07. Links
  08. Annotations
  09. Images and image metadata
  10. Form fields where supported
  11. Metadata
  12. Diagnostics
  13. JSON and Markdown export

What It Does Not Pretend

PDF Structure Is Inferred

PDF is not a semantic document format.

A PDF page is often a set of positioned drawing operations. Tables, headings, columns, and reading order may not exist explicitly inside the file.
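To make the point concrete, here is a minimal sketch of why reading order has to be inferred from positions. It does not use pdfstruct; the `TextRun` type, the coordinates, and the `gutter_x` parameter are all hypothetical and exist only for illustration. Sorting positioned runs top-to-bottom and left-to-right interleaves a two-column page, while a column-aware pass recovers the intended order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRun:
    """Hypothetical positioned text run (not pdfstruct's actual type)."""
    text: str
    x: float  # left edge, in points
    y: float  # distance from the top of the page (a simplification;
              # native PDF coordinates start at the bottom-left)


# Illustrative two-column page, stored in arbitrary drawing order.
runs = [
    TextRun("Right column, line 1", 320, 100),
    TextRun("Left column, line 1", 50, 100),
    TextRun("Left column, line 2", 50, 115),
    TextRun("Right column, line 2", 320, 115),
]


def naive_order(items):
    """Sort top-to-bottom, then left-to-right, ignoring columns."""
    return sorted(items, key=lambda r: (r.y, r.x))


def column_aware_order(items, gutter_x):
    """Read everything left of the gutter first, then everything right of it."""
    left = [r for r in items if r.x < gutter_x]
    right = [r for r in items if r.x >= gutter_x]
    return naive_order(left) + naive_order(right)


if __name__ == "__main__":
    print("Naive order:")
    for run in naive_order(runs):
        print("  ", run.text)
    print("Column-aware order:")
    for run in column_aware_order(runs, gutter_x=200):
        print("  ", run.text)
```

The naive sort alternates between the two columns on every line, which is the kind of silent error that corrupts downstream retrieval. Real layouts are harder than a single fixed gutter, which is why any recovered order is best treated as a candidate.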

Sourcetrace PDF exposes confidence, evidence, and diagnostics rather than pretending layout recovery is certain.
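As a rough illustration of how a pipeline might act on that confidence and evidence, the sketch below routes table candidates either to automatic ingestion or to review. The JSON shape, the field names (`tables`, `page`, `confidence`, `evidence`), and the 0.8 threshold are assumptions made for this example, not the documented pdfstruct output schema.

```python
import json

# Hypothetical extraction output; field names are assumed for illustration
# and are not taken from the pdfstruct schema.
SAMPLE = json.loads("""
{
  "tables": [
    {"page": 3, "confidence": 0.92,
     "evidence": ["ruled_lines", "aligned_columns"]},
    {"page": 7, "confidence": 0.41,
     "evidence": ["aligned_columns"]}
  ]
}
""")


def split_by_confidence(candidates, threshold=0.8):
    """Separate candidates into (accepted, needs_review) by confidence."""
    accepted = [c for c in candidates if c["confidence"] >= threshold]
    needs_review = [c for c in candidates if c["confidence"] < threshold]
    return accepted, needs_review


if __name__ == "__main__":
    accepted, needs_review = split_by_confidence(SAMPLE["tables"])
    for c in accepted:
        print(f"accept  page {c['page']}: {', '.join(c['evidence'])}")
    for c in needs_review:
        print(f"review  page {c['page']}: {', '.join(c['evidence'])}")
```

Keeping the page number and evidence alongside each candidate means a reviewer can go back to the source page instead of trusting the extraction blindly.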

Fit

Best Fit

  • Legal document bundles
  • Financial statements
  • Contracts
  • Policy documents
  • Board packs
  • Born-digital PDFs
  • Private RAG pipelines
  • Local-first AI ingestion
  • Document auditability workflows

Not Best Fit

  • Scanned documents where OCR is the primary problem
  • Image-only PDFs
  • Generic cheap document conversion at cents-per-page pricing
  • Use cases where cloud extraction is already acceptable and sufficient
  • Workflows that do not need traceability or layout evidence

Licence

Commercial Source-Available Licence

Sourcetrace PDF is intended to be source-available.

Free permitted use

  • Local evaluation
  • Personal use
  • Academic research
  • Non-production internal testing
  • Proof-of-concept assessment

Commercial licence required

  • Production use
  • Client work
  • Regulated workflows
  • SaaS or hosted API use
  • Redistribution
  • Embedding in another commercial product
  • Processing documents for third parties
  • Usage above the free evaluation threshold

Recommended licence model: Business Source License 1.1 with later conversion to Apache-2.0 after the defined change date.

Deployment Model

Local First, Not Hosted First

Sourcetrace PDF is intended to run where the documents already live.

  • Python library
  • Command-line tool
  • Docker container
  • Client-hosted API wrapper
  • Integration inside private AI ingestion pipelines

A hosted API is not the default model because sensitive document workflows often require local control, private processing, and clear custody of source material.

Commercial Use Through Lumen & Lever

Commercial Use Is Handled Through Lumen & Lever

Available commercial paths include a professional licence, a team licence, an embedded/OEM licence, a Document Structure Review, a custom extractor pack, and a Structural AI Architecture Sprint for cases where the issue extends beyond extraction.
