Pipeline Ready — Phase 7 Production
Documents Processed: 0 (across all jobs)
Cross-References: 0 (evidence items found)
Discrepancies: 0 (requiring review)
Workpapers: 0 (generated)
Architecture — All 7 Phases Complete
1. HTML Prototype: Interactive UI shell with component architecture (Complete)
2. Document Extraction Engine: Docling + Tika parsers with failover & metadata (Complete)
3. Table Extraction Pipeline: Camelot lattice/stream + accuracy filtering + Excel output (Complete)
4. OCR & Scanned Documents: PaddleOCR PP-OCRv5 + DocTR ensemble + auto-detection (Complete)
5. AI Cross-Reference: LlamaIndex semantic indexing + verification + audit trails (Complete)
6. Excel Integration: xlwings live connector + workbook gen + 4 templates (Complete)
7. Full Pipeline & Dashboard: Orchestrator + production dashboard + API gateway (Complete)

Recent Jobs (0 jobs)
Job ID | Source | Status | Stages | Duration | Output
No jobs yet. Use Pipeline to process documents.

Run Pipeline

Upload a source document: PDF, DOCX, XLSX, or image (PNG, JPG, TIFF).

Pipeline Profile

⚡ Quick

Parse + table extraction only. Fastest processing for simple document reads.

🔍 Standard

Full pipeline: parse, tables, OCR if scanned, cross-reference, workpaper generation.

🧠 Thorough

All stages + LLM-powered claim verification. Maximum accuracy for critical audits.

Pipeline Stages

1. Parse
2. Tables
3. OCR
4. Cross-Ref
5. Verify
6. Workpaper
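
As a rough illustration of how a profile could select stages, the sketch below maps each profile to the stages its description names. The stage identifiers, the `PROFILE_STAGES` mapping, and the `plan_stages` function are hypothetical, not the orchestrator's actual code. Two assumptions are made: Standard omits the Verify stage because its description does not list it, and OCR is skipped in every profile unless the document was detected as scanned.

```python
from typing import List

# Stage identifiers mirror the six pipeline stages (names are illustrative).
ALL_STAGES = ["parse", "tables", "ocr", "cross_ref", "verify", "workpaper"]

# Hypothetical mapping, read literally from the profile descriptions.
PROFILE_STAGES = {
    "quick": ["parse", "tables"],
    "standard": ["parse", "tables", "ocr", "cross_ref", "workpaper"],
    "thorough": list(ALL_STAGES),
}


def plan_stages(profile: str, is_scanned: bool) -> List[str]:
    """Return the ordered stages to run for a profile."""
    try:
        stages = PROFILE_STAGES[profile]
    except KeyError:
        raise ValueError(f"unknown profile: {profile!r}") from None
    # Assumption: OCR only runs when the document was detected as scanned.
    return [stage for stage in stages if stage != "ocr" or is_scanned]
```

For a text-based PDF under the Standard profile, `plan_stages("standard", is_scanned=False)` would drop the OCR stage and keep the other four.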

Job History

Job ID | Source | Profile | Status | Evidence | Findings | Duration | Created
No jobs recorded

Document Extraction

Supported Parsers
Parser | Formats | Features | Status
Docling | PDF, DOCX, PPTX, HTML, Markdown | Layout analysis, table detection, metadata | Primary
Apache Tika | 1000+ formats | Metadata extraction, content detection | Fallback
Camelot | PDF tables | Lattice/Stream/Hybrid, accuracy scoring | Active
PaddleOCR | Images, scanned PDFs | PP-OCRv5, 100+ languages, angle classification | Active
DocTR | Images, PDFs | Transformer detection, ensemble partner | Ensemble
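
The primary/fallback arrangement in the table can be pictured as an ordered failover loop. The sketch below is a minimal illustration under stated assumptions: each parser is represented by a plain callable that takes a path and returns text, and `parse_with_failover` and `ParseError` are hypothetical names. It does not call the real Docling or Tika APIs.

```python
from typing import Callable, List, Sequence, Tuple


class ParseError(Exception):
    """Raised when every parser in the chain fails (hypothetical type)."""


def parse_with_failover(
    path: str,
    parsers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, parse_fn) in order; return (parser_name, text)."""
    errors: List[str] = []
    for name, parse in parsers:
        try:
            text = parse(path)
        except Exception as exc:  # any parser failure triggers the fallback
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        # Assumption: empty output is treated as a failure as well.
        errors.append(f"{name}: empty output")
    raise ParseError("; ".join(errors))
```

In use, the list would hold the primary parser first and the fallback second, for example `[("docling", docling_fn), ("tika", tika_fn)]`, where both functions are wrappers you supply.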

AI Cross-Reference Engine

Match Quality: average confidence score
Indexed Chunks: across target documents
Verification: claims verified
Verification Methods
Check | Type | Description
Numeric Comparison | Rule-based | Extracts and compares numeric values with 0.01% tolerance
Date Matching | Rule-based | Normalizes and compares date formats across documents
Reference Matching | Rule-based | PO/INV/REF number pattern detection and matching
Text Overlap | Rule-based | Word-level Jaccard similarity with stop-word filtering
Semantic Search | AI (embeddings) | LlamaIndex vector search with BGE-small-en-v1.5
LLM Verification | AI (optional) | Contextual claim verification via Ollama/llama3.2
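
Three of the rule-based checks are simple enough to sketch in a few lines. The code below is an illustration, not the engine's implementation: the function names, the stop-word list, and the reference regex are assumptions. The 0.01% tolerance is interpreted as a relative tolerance of 1e-4 measured against the larger of the two values, which the table does not specify.

```python
import math
import re
from typing import Set

# Small illustrative stop-word list; the real list is not documented here.
STOP_WORDS = frozenset({"a", "an", "and", "the", "is", "of", "on", "to", "in", "for"})

# Assumed pattern: PO/INV/REF, optional separator characters, then digits.
REF_PATTERN = re.compile(r"\b(PO|INV|REF)[-\s#:]*(\d+)\b", re.IGNORECASE)


def numbers_match(a: float, b: float, rel_tol: float = 1e-4) -> bool:
    """Numeric comparison with a 0.01% relative tolerance (1e-4)."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0)


def content_words(text: str) -> Set[str]:
    """Lower-cased alphanumeric words with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    words_a, words_b = content_words(a), content_words(b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def extract_refs(text: str) -> Set[str]:
    """Normalise detected references to the form KIND-digits."""
    return {f"{kind.upper()}-{num}" for kind, num in REF_PATTERN.findall(text)}
```

Two documents would then count as referring to the same transaction when `extract_refs` returns overlapping sets, and their amounts would agree when `numbers_match` holds.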

Audit Workpaper Templates

🧾 Invoice Verification

3-way match: Invoice ↔ PO ↔ GRN. Pre-built variance formulas and exception tracking.

📊 Financial Statement Review

Trial balance, GL detail, cross-references, and variance analysis sheets.

📜 Contract Review

Terms verification, compliance tracking, and findings documentation.

📋 General Audit

Generic evidence workpaper with findings, data, and change log sheets.

Workpaper Features
Feature | Description
7 Standard Sheets | Cover, Evidence, Findings, Extracted Data, Source Index, Change Log, Charts
Conditional Formatting | Green/amber/red confidence cells, opinion highlighting
Quality Charts | Match quality distribution bar chart with exact/high/partial/weak breakdown
Change Tracking | Full audit trail of every cell modification with timestamp and provenance
Auto-Filters | Frozen headers, column filters, alternating row colours
Evidence IDs | Unique traceable IDs (E-ENGAGEMENT-NNNN) for every evidence item
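
The evidence ID format and the confidence colouring lend themselves to a short sketch. Everything below is an assumption made for illustration: `evidence_id` and `confidence_colour` are hypothetical names, NNNN is taken to be a zero-padded four-digit sequence number, and the 0.85 and 0.60 thresholds are placeholders because the actual cut-offs are not stated.

```python
import re


def evidence_id(engagement: str, sequence: int) -> str:
    """Build an ID shaped like E-ENGAGEMENT-NNNN."""
    # Assumption: the engagement code is upper-cased and stripped of
    # anything that is not a letter or digit.
    code = re.sub(r"[^A-Z0-9]", "", engagement.upper())
    if not code:
        raise ValueError("engagement code is empty")
    if not 0 <= sequence <= 9999:
        raise ValueError("sequence must fit in four digits")
    return f"E-{code}-{sequence:04d}"


def confidence_colour(score: float) -> str:
    """Map a 0..1 confidence score to a cell colour band."""
    # Placeholder thresholds; the real cut-offs are not documented here.
    if score >= 0.85:
        return "green"
    if score >= 0.60:
        return "amber"
    return "red"
```

A workbook generator could call `confidence_colour` per evidence row and translate the band into a cell fill.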

System Configuration

API Endpoints (22 total)
Method | Endpoint | Phase | Description
GET | /api/health | Core | Health check & version
POST | /api/extract | 2 | Parse document (auto-detect parser)
POST | /api/tables/extract | 3 | Extract tables (JSON with quality metrics)
POST | /api/tables/extract/excel | 3 | Extract tables → formatted .xlsx
POST | /api/tables/extract/csv | 3 | Extract tables → zipped CSVs
POST | /api/tables/report | 3 | Parsing debug report
POST | /api/ocr/recognize | 4 | OCR single image
POST | /api/ocr/recognize/pdf | 4 | OCR scanned PDF
POST | /api/ocr/detect-scan | 4 | Detect scanned vs text PDF
POST | /api/ocr/extract-fields | 4 | OCR + invoice field extraction
POST | /api/ai/index | 5 | Index target documents
POST | /api/ai/cross-reference | 5 | Cross-reference source vs targets
POST | /api/ai/cross-reference/audit | 5 | Full audit trail (JSON/MD/XLSX)
POST | /api/ai/verify | 5 | Verify single claim
POST | /api/ai/search | 5 | Semantic search indexed docs
GET | /api/excel/templates | 6 | List workpaper templates
POST | /api/excel/templates/create | 6 | Generate template .xlsx
POST | /api/excel/read | 6 | Read workbook data
POST | /api/excel/write | 6 | Write with change tracking
POST | /api/pipeline/index-targets | 7 | Index reference documents
POST | /api/pipeline/run | 7 | Execute full pipeline
GET | /api/pipeline/jobs | 7 | List all pipeline jobs
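
For the body-less GET endpoints, a client can be sketched with the standard library alone. The base URL `http://localhost:8000` and the assumption that responses are JSON are not stated in the table, and `build_request` and `fetch_json` are hypothetical helpers. The POST endpoints are left out because their request bodies are not documented here.

```python
import json
import urllib.request
from typing import Any

BASE_URL = "http://localhost:8000"  # assumption: local development server


def build_request(method: str, path: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Create a request for one of the listed endpoints without sending it."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method=method,
        headers={"Accept": "application/json"},
    )


def fetch_json(request: urllib.request.Request, timeout: float = 10.0) -> Any:
    """Send the request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example calls (require a running server, so they are left commented out):
# health = fetch_json(build_request("GET", "/api/health"))
# jobs = fetch_json(build_request("GET", "/api/pipeline/jobs"))
```

Keeping request construction separate from sending makes the URL and method easy to check without a live server.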
Dependencies
Package | Version | Phase | Purpose
fastapi | ≥0.111 | Core | API framework
docling | ≥2.15 | 2 | Primary document parser
tika-client | ≥0.5 | 2 | Fallback parser (1000+ formats)
camelot-py[cv] | ≥1.0.9 | 3 | PDF table extraction
paddleocr | ≥3.0 | 4 | Primary OCR (PP-OCRv5)
python-doctr | ≥0.8 | 4 | Alternative/ensemble OCR
pymupdf | ≥1.24 | 4 | PDF rendering + scan detection
llama-index-core | ≥0.12 | 5 | Semantic indexing & retrieval
sentence-transformers | ≥3.0 | 5 | BGE-small-en-v1.5 embeddings
xlwings | ≥0.33 | 6 | Live Excel automation
openpyxl | ≥3.1 | 3,6 | Excel workbook generation
pandas | ≥2.0 | All | Data manipulation
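
The minimum versions above can be checked against an environment with a short script. This is a sketch under assumptions: the helper names are hypothetical, and version comparison uses only the leading numeric part of each dotted component, so pre-release suffixes are ignored. The `[cv]` extra on camelot-py is dropped because extras are not part of the installed distribution name.

```python
import re
from importlib import metadata
from typing import Dict, Mapping, Optional, Tuple

# Minimum versions copied from the dependency table.
MINIMUMS = {
    "fastapi": "0.111", "docling": "2.15", "tika-client": "0.5",
    "camelot-py": "1.0.9", "paddleocr": "3.0", "python-doctr": "0.8",
    "pymupdf": "1.24", "llama-index-core": "0.12",
    "sentence-transformers": "3.0", "xlwings": "0.33",
    "openpyxl": "3.1", "pandas": "2.0",
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Leading integers of each dotted component, e.g. '2.15.1' -> (2, 15, 1)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions after padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def find_problems(installed: Mapping[str, Optional[str]]) -> Dict[str, str]:
    """Report packages that are missing or older than the table requires."""
    problems = {}
    for name, minimum in MINIMUMS.items():
        have = installed.get(name)
        if have is None:
            problems[name] = "missing"
        elif not meets_minimum(have, minimum):
            problems[name] = f"{have} < {minimum}"
    return problems


def installed_versions() -> Dict[str, Optional[str]]:
    """Look up each package in the current environment."""
    found: Dict[str, Optional[str]] = {}
    for name in MINIMUMS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

Calling `find_problems(installed_versions())` would return an empty dict when the environment satisfies every minimum.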