HTML Prototype
Interactive UI shell with component architecture
Document Extraction Engine
Docling + Tika parsers with failover & metadata
Table Extraction Pipeline
Camelot lattice/stream + accuracy filtering + Excel output
OCR & Scanned Documents
PaddleOCR PP-OCRv5 + DocTR ensemble + auto-detection
AI Cross-Reference
LlamaIndex semantic indexing + verification + audit trails
Excel Integration
xlwings live connector + workbook gen + 4 templates
Full Pipeline & Dashboard
Orchestrator + production dashboard + API gateway
| Job ID | Source | Status | Stages | Duration | Output |
|---|---|---|---|---|---|
| No jobs yet. Use Pipeline to process documents. | | | | | |
Run Pipeline
Drop source document here or click to upload
Pipeline Profile
⚡ Quick
Parse + table extraction only. Fastest processing for simple document reads.
🔍 Standard
Full pipeline: parse, tables, OCR if scanned, cross-reference, workpaper generation.
🧠 Thorough
All stages + LLM-powered claim verification. Maximum accuracy for critical audits.
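The three profiles can be read as ordered stage lists. The sketch below is one way to express that mapping; the `Stage` names, the `PROFILES` dictionary, and `plan_stages` are hypothetical and not taken from the orchestrator. It also assumes that OCR stays conditional on scanned input under the Thorough profile, which the profile descriptions do not state.

```python
from enum import Enum


class Stage(str, Enum):
    # Hypothetical stage identifiers, named after the profile descriptions.
    PARSE = "parse"
    TABLES = "tables"
    OCR = "ocr"
    CROSS_REFERENCE = "cross_reference"
    WORKPAPER = "workpaper"
    LLM_VERIFY = "llm_verify"


_STANDARD = [
    Stage.PARSE,
    Stage.TABLES,
    Stage.OCR,
    Stage.CROSS_REFERENCE,
    Stage.WORKPAPER,
]

PROFILES: dict[str, list[Stage]] = {
    "quick": [Stage.PARSE, Stage.TABLES],
    "standard": _STANDARD,
    "thorough": _STANDARD + [Stage.LLM_VERIFY],
}


def plan_stages(profile: str, is_scanned: bool) -> list[Stage]:
    """Return the stages to run, dropping OCR for text-based documents."""
    stages = PROFILES[profile]  # raises KeyError for an unknown profile
    return [s for s in stages if s is not Stage.OCR or is_scanned]
```

For example, `plan_stages("standard", is_scanned=False)` yields parse, tables, cross-reference, and workpaper generation, skipping OCR.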
Pipeline Stages
Job History
| Job ID | Source | Profile | Status | Evidence | Findings | Duration | Created |
|---|---|---|---|---|---|---|---|
| No jobs recorded | |||||||

Document Extraction
| Parser | Formats | Features | Status |
|---|---|---|---|
| Docling | PDF, DOCX, PPTX, HTML, Markdown | Layout analysis, table detection, metadata | Primary |
| Apache Tika | 1000+ formats | Metadata extraction, content detection | Fallback |
| Camelot | PDF tables | Lattice/Stream/Hybrid, accuracy scoring | Active |
| PaddleOCR | Images, scanned PDFs | PP-OCRv5, 100+ languages, angle classification | Active |
| DocTR | Images, PDFs | Transformer detection, ensemble partner | Ensemble |
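The Primary/Fallback relationship between Docling and Tika can be sketched as a generic failover loop. This is a minimal illustration, not the engine's actual code: `ParseResult` and `parse_with_failover` are hypothetical names, and each parser is represented by a plain callable that takes a file path and returns text, so no real Docling or Tika API is assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

# A parser here is any callable: file path in, extracted text out (assumption).
Parser = Callable[[str], str]


@dataclass
class ParseResult:
    parser: str                      # name of the parser that succeeded
    text: str                        # extracted text
    errors: list[str] = field(default_factory=list)  # failures seen before success


def parse_with_failover(
    path: str, parsers: Sequence[tuple[str, Parser]]
) -> ParseResult:
    """Try each parser in order and return the first non-empty result."""
    errors: list[str] = []
    for name, parse in parsers:
        try:
            text = parse(path)
        except Exception as exc:  # any parser failure triggers the fallback
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return ParseResult(parser=name, text=text, errors=errors)
        errors.append(f"{name}: empty output")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))
```

Passing the parsers as an ordered list, for example `[("docling", docling_fn), ("tika", tika_fn)]` with your own wrapper functions, keeps the priority explicit and records why the primary parser was skipped.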
AI Cross-Reference Engine
| Check | Type | Description |
|---|---|---|
| Numeric Comparison | Rule-based | Extracts and compares numeric values with 0.01% tolerance |
| Date Matching | Rule-based | Normalizes and compares date formats across documents |
| Reference Matching | Rule-based | PO/INV/REF number pattern detection and matching |
| Text Overlap | Rule-based | Word-level Jaccard similarity with stop-word filtering |
| Semantic Search | AI (embeddings) | LlamaIndex vector search with BGE-small-en-v1.5 |
| LLM Verification | AI (optional) | Contextual claim verification via Ollama/llama3.2 |
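Three of the rule-based checks above can be sketched with the standard library alone. The code is illustrative only: it assumes the 0.01% tolerance is relative (so `rel_tol=1e-4`), uses a small made-up stop-word list, and guesses at a `PO`/`INV`/`REF` reference pattern. None of the function names come from the engine itself.

```python
import math
import re

# Tiny illustrative stop-word list; the engine's real list is not documented here.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "for", "on", "is", "are", "with"}
)

# Hypothetical pattern: prefix, optional separator, digits (e.g. PO-4471, inv 0099).
REF_PATTERN = re.compile(r"\b(PO|INV|REF)[-\s#]?(\d+)\b", re.IGNORECASE)


def numbers_match(a: float, b: float, rel_tol: float = 1e-4) -> bool:
    """Compare two values within a relative tolerance (1e-4 = 0.01%)."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0)


def tokens(text: str) -> set[str]:
    """Lower-case word tokens with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def extract_refs(text: str) -> set[str]:
    """Return normalized references such as 'PO-4471' found in the text."""
    return {f"{p.upper()}-{n}" for p, n in REF_PATTERN.findall(text)}
```

With these definitions, `jaccard("net amount due", "gross amount due")` is 0.5 (two shared words out of four distinct ones), and two documents reference the same order when their `extract_refs` sets intersect.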
Audit Workpaper Templates
🧾 Invoice Verification
3-way match: Invoice ↔ PO ↔ GRN. Pre-built variance formulas and exception tracking.
📊 Financial Statement Review
Trial balance, GL detail, cross-references, and variance analysis sheets.
📜 Contract Review
Terms verification, compliance tracking, and findings documentation.
📋 General Audit
Generic evidence workpaper with findings, data, and change log sheets.
| Feature | Description |
|---|---|
| 7 Standard Sheets | Cover, Evidence, Findings, Extracted Data, Source Index, Change Log, Charts |
| Conditional Formatting | Green/amber/red confidence cells, opinion highlighting |
| Quality Charts | Match quality distribution bar chart with exact/high/partial/weak breakdown |
| Change Tracking | Full audit trail of every cell modification with timestamp and provenance |
| Auto-Filters | Frozen headers, column filters, alternating row colours |
| Evidence IDs | Unique traceable IDs (E-ENGAGEMENT-NNNN) for every evidence item |
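The `E-ENGAGEMENT-NNNN` scheme can be illustrated with a small generator. This sketch assumes `ENGAGEMENT` is an upper-cased engagement code and `NNNN` is a zero-padded sequence number starting at 1; neither detail is confirmed by the feature table, and `EvidenceIdGenerator` is a hypothetical name.

```python
import itertools
import re


class EvidenceIdGenerator:
    """Issue sequential evidence IDs of the form E-<ENGAGEMENT>-<NNNN>."""

    def __init__(self, engagement: str, start: int = 1) -> None:
        # Keep only letters and digits so the ID stays easy to parse (assumption).
        code = re.sub(r"[^A-Z0-9]", "", engagement.upper())
        if not code:
            raise ValueError("engagement code must contain letters or digits")
        self._prefix = f"E-{code}-"
        self._counter = itertools.count(start)

    def next_id(self) -> str:
        # Zero-pad to four digits; larger numbers simply grow wider.
        return f"{self._prefix}{next(self._counter):04d}"
```

A generator created with `EvidenceIdGenerator("acme-2024")` would issue `E-ACME2024-0001`, then `E-ACME2024-0002`, and so on.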
System Configuration
| Method | Endpoint | Phase | Description |
|---|---|---|---|
| GET | /api/health | Core | Health check & version |
| POST | /api/extract | 2 | Parse document (auto-detect parser) |
| POST | /api/tables/extract | 3 | Extract tables (JSON with quality metrics) |
| POST | /api/tables/extract/excel | 3 | Extract tables → formatted .xlsx |
| POST | /api/tables/extract/csv | 3 | Extract tables → zipped CSVs |
| POST | /api/tables/report | 3 | Parsing debug report |
| POST | /api/ocr/recognize | 4 | OCR single image |
| POST | /api/ocr/recognize/pdf | 4 | OCR scanned PDF |
| POST | /api/ocr/detect-scan | 4 | Detect scanned vs text PDF |
| POST | /api/ocr/extract-fields | 4 | OCR + invoice field extraction |
| POST | /api/ai/index | 5 | Index target documents |
| POST | /api/ai/cross-reference | 5 | Cross-reference source vs targets |
| POST | /api/ai/cross-reference/audit | 5 | Full audit trail (JSON/MD/XLSX) |
| POST | /api/ai/verify | 5 | Verify single claim |
| POST | /api/ai/search | 5 | Semantic search indexed docs |
| GET | /api/excel/templates | 6 | List workpaper templates |
| POST | /api/excel/templates/create | 6 | Generate template .xlsx |
| POST | /api/excel/read | 6 | Read workbook data |
| POST | /api/excel/write | 6 | Write with change tracking |
| POST | /api/pipeline/index-targets | 7 | Index reference documents |
| POST | /api/pipeline/run | 7 | Execute full pipeline |
| GET | /api/pipeline/jobs | 7 | List all pipeline jobs |
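The two GET endpoints in the table can be called with the standard library. The sketch below assumes the gateway listens on `http://localhost:8000` and returns JSON; both are assumptions, as are the helper names `build_request` and `call_json`. Request bodies for the POST endpoints are not documented here, so they are left out.

```python
import json
from urllib import request
from urllib.parse import urljoin

BASE_URL = "http://localhost:8000"  # assumed host and port, adjust as needed


def build_request(method: str, path: str, base_url: str = BASE_URL) -> request.Request:
    """Build a request for an endpoint path such as '/api/health'."""
    return request.Request(urljoin(base_url, path), method=method)


def call_json(req: request.Request, timeout: float = 10.0):
    """Send the request and decode the response body as JSON (assumed format)."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running gateway; the response fields are not specified here.
    print(call_json(build_request("GET", "/api/health")))
    print(call_json(build_request("GET", "/api/pipeline/jobs")))
```

Separating request construction from sending makes the URL and method easy to inspect or log before any network call is made.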
| Package | Version | Phase | Purpose |
|---|---|---|---|
| fastapi | ≥0.111 | Core | API framework |
| docling | ≥2.15 | 2 | Primary document parser |
| tika-client | ≥0.5 | 2 | Fallback parser (1000+ formats) |
| camelot-py[cv] | ≥1.0.9 | 3 | PDF table extraction |
| paddleocr | ≥3.0 | 4 | Primary OCR (PP-OCRv5) |
| python-doctr | ≥0.8 | 4 | Alternative/ensemble OCR |
| pymupdf | ≥1.24 | 4 | PDF rendering + scan detection |
| llama-index-core | ≥0.12 | 5 | Semantic indexing & retrieval |
| sentence-transformers | ≥3.0 | 5 | BGE-small-en-v1.5 embeddings |
| xlwings | ≥0.33 | 6 | Live Excel automation |
| openpyxl | ≥3.1 | 3,6 | Excel workbook generation |
| pandas | ≥2.0 | All | Data manipulation |