PDF → JSON Data Extraction System

A Python-based system for extracting structured data from PDF documents into a strict JSON schema using layout-aware parsing.

Key Features

Layout-based extraction using PDF text coordinates
Fields are matched by position, without regular expressions or heuristic guessing
Supports minor layout variations via configurable tolerance
Strict JSON schema validation
Built-in logging for traceability
Modular and extensible design
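
The validation and logging features can be illustrated with a minimal sketch. It uses only the standard library as a stand-in for a real JSON Schema validator, and the field names (`invoice_number`, `total`), the `INVOICE_FIELDS` mapping, and the `validate_strict` helper are all hypothetical, since the repository's actual schemas are not shown here. "Strict" is taken to mean that missing fields, wrong types, and unexpected fields are all rejected.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("extractor")

# Hypothetical schema for illustration only: field name -> required Python type.
INVOICE_FIELDS = {"invoice_number": str, "total": float}


def validate_strict(record, fields):
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    # Every declared field must be present and have the declared type.
    for name, expected in fields.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Strict mode: fields that the schema does not declare are rejected.
    for name in record:
        if name not in fields:
            errors.append(f"unexpected field: {name}")
    # Log each problem so a failed extraction can be traced afterwards.
    for error in errors:
        log.warning("validation: %s", error)
    if not errors:
        log.info("record valid, fields: %s", sorted(record))
    return errors


problems = validate_strict({"invoice_number": "INV-001"}, INVOICE_FIELDS)
print(problems)
```

In a real pipeline the same role would more likely be played by a JSON Schema file and a dedicated validator library; the point of the sketch is only that invalid output is reported and logged rather than silently written.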

Approach

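The system uses the positions (x/y coordinates) of text in the PDF to pair each label with its value. Because matching depends on where text sits on the page rather than on how it is worded, extraction stays consistent for complex or semi-structured PDFs such as invoices, reports, and statements, provided the layout stays within the configured tolerance.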
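
As a rough illustration of coordinate-based matching, the sketch below pairs a label with the text on the same line to its right. It is not the repository's implementation: the `find_value_right_of` helper, the `y_tolerance` parameter, and the sample words are assumptions made for this example. The word records are shaped like the dictionaries returned by pdfplumber's `Page.extract_words()` (`text`, `x0`, `x1`, `top`, `bottom`), but are written by hand here so the example runs without a PDF.

```python
def find_value_right_of(words, label, y_tolerance=3.0):
    """Return the text on the same line as `label` and to its right, or None.

    Two words count as being on the same line when their `top`
    coordinates differ by at most `y_tolerance` points.
    """
    # Locate the label by exact string comparison (no regex).
    anchor = next((w for w in words if w["text"] == label), None)
    if anchor is None:
        return None
    # Keep words that are vertically aligned with the label and start
    # at or after the label's right edge.
    same_line = [
        w for w in words
        if w is not anchor
        and abs(w["top"] - anchor["top"]) <= y_tolerance
        and w["x0"] >= anchor["x1"]
    ]
    if not same_line:
        return None
    # Read the matched words left to right.
    same_line.sort(key=lambda w: w["x0"])
    return " ".join(w["text"] for w in same_line)


# Hand-written sample data; a real run would take these from the PDF.
words = [
    {"text": "Invoice:", "x0": 50.0, "x1": 95.0, "top": 100.0, "bottom": 110.0},
    {"text": "INV-001", "x0": 120.0, "x1": 165.0, "top": 101.2, "bottom": 111.2},
    {"text": "Total:", "x0": 50.0, "x1": 80.0, "top": 130.0, "bottom": 140.0},
    {"text": "250.00", "x0": 120.0, "x1": 150.0, "top": 129.5, "bottom": 139.5},
]

print(find_value_right_of(words, "Invoice:"))
print(find_value_right_of(words, "Total:"))
```

The tolerance is what absorbs minor layout variation: the value "INV-001" sits 1.2 points lower than its label and still matches with the default of 3.0, whereas a tolerance of 0.1 would reject it.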

Typical Use Cases

Invoice processing
Financial documents
Business reports
Enterprise or government PDFs

Tech Stack

Python
pdfplumber / pdfminer
JSON Schema validation
Logging

Notes

Sample PDFs are not included for privacy reasons. The repository demonstrates the extraction architecture and workflow.

Repository Contents

extractor
output
schemas
LICENSE
README.md
main.py
requirements.txt
