DocSnip OSS — Command Centre

Documents Processed

0

Across all jobs

Cross-References

0

Evidence items found

Discrepancies

0

Requiring review

Workpapers

0

Generated

Architecture — All 7 Phases Complete

Phase	Name	Description	Status
1	HTML Prototype	Interactive UI shell with component architecture	Complete
2	Document Extraction Engine	Docling + Tika parsers with failover & metadata	Complete
3	Table Extraction Pipeline	Camelot lattice/stream + accuracy filtering + Excel output	Complete
4	OCR & Scanned Documents	PaddleOCR PP-OCRv5 + DocTR ensemble + auto-detection	Complete
5	AI Cross-Reference	LlamaIndex semantic indexing + verification + audit trails	Complete
6	Excel Integration	xlwings live connector + workbook gen + 4 templates	Complete
7	Full Pipeline & Dashboard	Orchestrator + production dashboard + API gateway	Complete

Recent Jobs (0 jobs)

Job ID	Source	Status	Stages	Duration	Output
No jobs yet — use Pipeline to process documents

Run Pipeline

Drop source document here or click to upload

Supports PDF, DOCX, XLSX, and images (PNG, JPG, TIFF)

Pipeline Profile

⚡ Quick

Parse + table extraction only. Fastest processing for simple document reads.

🔍 Standard

Full pipeline: parse, tables, OCR if scanned, cross-reference, workpaper generation.

🧠 Thorough

All stages + LLM-powered claim verification. Maximum accuracy for critical audits.

Pipeline Stages

Stage	Name
1	Parse
2	Tables
3	OCR
4	Cross-Ref
5	Verify
6	Workpaper
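
The relationship between the three profiles and the six stages can be sketched as a small planning function. This is one possible reading of the profile descriptions, not the orchestrator's actual code: the names `Profile` and `plan_stages` are hypothetical, and it assumes that Standard and Thorough run the same stages (with OCR only for scanned input) and differ only in whether LLM verification is switched on.

```python
from enum import Enum

# Stage names as listed under "Pipeline Stages".
STAGES = ["Parse", "Tables", "OCR", "Cross-Ref", "Verify", "Workpaper"]


class Profile(Enum):
    QUICK = "quick"
    STANDARD = "standard"
    THOROUGH = "thorough"


def plan_stages(profile: Profile, is_scanned: bool) -> tuple[list[str], bool]:
    """Return (stages to run, whether LLM verification is enabled).

    Assumption: OCR is skipped for non-scanned documents in every profile,
    and only the Thorough profile enables the optional LLM check.
    """
    if profile is Profile.QUICK:
        # "Parse + table extraction only."
        return ["Parse", "Tables"], False
    stages = [s for s in STAGES if s != "OCR" or is_scanned]
    return stages, profile is Profile.THOROUGH


if __name__ == "__main__":
    for p in Profile:
        print(p.value, plan_stages(p, is_scanned=False))
```

Keeping the plan as plain data makes it easy to show the selected stages in the UI before a job starts.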

Job History

Job ID	Source	Profile	Status	Evidence	Findings	Duration	Created
No jobs recorded

Document Extraction

Supported Parsers

Parser	Formats	Features	Status
Docling	PDF, DOCX, PPTX, HTML, Markdown	Layout analysis, table detection, metadata	Primary
Apache Tika	1000+ formats	Metadata extraction, content detection	Fallback
Camelot	PDF tables	Lattice/Stream/Hybrid, accuracy scoring	Active
PaddleOCR	Images, scanned PDFs	PP-OCRv5, 100+ languages, angle classification	Active
DocTR	Images, PDFs	Transformer detection, ensemble partner	Ensemble
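
The Primary/Fallback arrangement in the table above amounts to trying parsers in order and moving on when one fails. The sketch below illustrates that pattern only; `parse_with_failover` and the two stub parsers are hypothetical stand-ins and do not call Docling or Tika.

```python
from typing import Callable, Sequence

ParseFn = Callable[[str], dict]


def parse_with_failover(path: str, parsers: Sequence[tuple[str, ParseFn]]) -> dict:
    """Try each (name, parser) pair in order and return the first success.

    The result records which parser succeeded and why earlier ones failed,
    so the failover itself is visible in the job metadata.
    """
    errors: list[str] = []
    for name, parse in parsers:
        try:
            result = parse(path)
        except Exception as exc:  # any parser error triggers the next parser
            errors.append(f"{name}: {exc}")
            continue
        return {**result, "parser": name, "failed_parsers": errors}
    raise RuntimeError("all parsers failed: " + "; ".join(errors))


# Hypothetical stand-ins for the real parsers, for illustration only.
def primary_stub(path: str) -> dict:
    raise ValueError("unsupported layout")


def fallback_stub(path: str) -> dict:
    return {"text": "extracted text", "metadata": {"source": path}}


if __name__ == "__main__":
    chain = [("docling", primary_stub), ("tika", fallback_stub)]
    print(parse_with_failover("report.pdf", chain))
```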

AI Cross-Reference Engine

Match Quality

—

Average confidence score

Indexed Chunks

—

Across target documents

Verification

—

Claims verified

Verification Methods

Check	Type	Description
Numeric Comparison	Rule-based	Extracts and compares numeric values with 0.01% tolerance
Date Matching	Rule-based	Normalizes and compares date formats across documents
Reference Matching	Rule-based	PO/INV/REF number pattern detection and matching
Text Overlap	Rule-based	Word-level Jaccard similarity with stop-word filtering
Semantic Search	AI (embeddings)	LlamaIndex vector search with BGE-small-en-v1.5
LLM Verification	AI (optional)	Contextual claim verification via Ollama/llama3.2
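
Three of the rule-based checks listed above are simple enough to sketch with the standard library. The code below is an illustration under stated assumptions, not the engine's implementation: the function names, the regular expressions, and the tiny stop-word list are all hypothetical, and the 0.01% tolerance is interpreted as a relative tolerance of 1e-4.

```python
import math
import re

# Hypothetical, deliberately small stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "on", "to", "in", "for"}

NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
REFERENCE_RE = re.compile(r"\b(PO|INV|REF)[-\s#:]*(\d+)\b", re.IGNORECASE)


def extract_numbers(text: str) -> list[float]:
    """Pull numeric values out of free text, ignoring thousands separators."""
    return [float(m.replace(",", "")) for m in NUMBER_RE.findall(text)]


def numbers_match(a: float, b: float, rel_tol: float = 1e-4) -> bool:
    """Numeric comparison with 0.01% relative tolerance (assumed meaning)."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0)


def extract_references(text: str) -> set[str]:
    """Find PO/INV/REF numbers and normalise them to PREFIX-digits."""
    return {f"{p.upper()}-{d}" for p, d in REFERENCE_RE.findall(text)}


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity after stop-word filtering."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower())) - STOP_WORDS
    tb = set(re.findall(r"[a-z0-9]+", b.lower())) - STOP_WORDS
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


if __name__ == "__main__":
    print(extract_numbers("Total due: 1,250.00 USD"))
    print(numbers_match(1000.00, 1000.05))
    print(extract_references("See PO #4471 and inv-0092"))
    print(jaccard("The invoice total is due", "Invoice total due on receipt"))
```

A relative tolerance scales with the size of the amounts compared, which suits ledgers that mix small and large values; an absolute tolerance would be an equally plausible design.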

Audit Workpaper Templates

🧾 Invoice Verification

3-way match: Invoice ↔ PO ↔ GRN. Pre-built variance formulas and exception tracking.

📊 Financial Statement Review

Trial balance, GL detail, cross-references, and variance analysis sheets.

📜 Contract Review

Terms verification, compliance tracking, and findings documentation.

📋 General Audit

Generic evidence workpaper with findings, data, and change log sheets.

Workpaper Features

Feature	Description
7 Standard Sheets	Cover, Evidence, Findings, Extracted Data, Source Index, Change Log, Charts
Conditional Formatting	Green/amber/red confidence cells, opinion highlighting
Quality Charts	Match quality distribution bar chart with exact/high/partial/weak breakdown
Change Tracking	Full audit trail of every cell modification with timestamp and provenance
Auto-Filters	Frozen headers, column filters, alternating row colours
Evidence IDs	Unique traceable IDs (E-ENGAGEMENT-NNNN) for every evidence item
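
The evidence ID pattern `E-ENGAGEMENT-NNNN` could be produced by a small counter per engagement. This sketch assumes that `ENGAGEMENT` stands for an engagement code and `NNNN` for a zero-padded running number starting at 1; neither detail is confirmed above, and `evidence_id_factory` is a hypothetical name.

```python
from itertools import count
from typing import Callable


def evidence_id_factory(engagement: str, start: int = 1) -> Callable[[], str]:
    """Return a function that yields sequential IDs for one engagement."""
    code = engagement.strip().upper()
    counter = count(start)

    def next_id() -> str:
        # Four-digit zero padding mirrors the NNNN placeholder.
        return f"E-{code}-{next(counter):04d}"

    return next_id


if __name__ == "__main__":
    next_id = evidence_id_factory("acme2024")
    print(next_id())
    print(next_id())
```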

System Configuration

API Endpoints (22 total)

Method	Endpoint	Phase	Description
GET	/api/health	Core	Health check & version
POST	/api/extract	2	Parse document (auto-detect parser)
POST	/api/tables/extract	3	Extract tables (JSON with quality metrics)
POST	/api/tables/extract/excel	3	Extract tables → formatted .xlsx
POST	/api/tables/extract/csv	3	Extract tables → zipped CSVs
POST	/api/tables/report	3	Parsing debug report
POST	/api/ocr/recognize	4	OCR single image
POST	/api/ocr/recognize/pdf	4	OCR scanned PDF
POST	/api/ocr/detect-scan	4	Detect scanned vs text PDF
POST	/api/ocr/extract-fields	4	OCR + invoice field extraction
POST	/api/ai/index	5	Index target documents
POST	/api/ai/cross-reference	5	Cross-reference source vs targets
POST	/api/ai/cross-reference/audit	5	Full audit trail (JSON/MD/XLSX)
POST	/api/ai/verify	5	Verify single claim
POST	/api/ai/search	5	Semantic search indexed docs
GET	/api/excel/templates	6	List workpaper templates
POST	/api/excel/templates/create	6	Generate template .xlsx
POST	/api/excel/read	6	Read workbook data
POST	/api/excel/write	6	Write with change tracking
POST	/api/pipeline/index-targets	7	Index reference documents
POST	/api/pipeline/run	7	Execute full pipeline
GET	/api/pipeline/jobs	7	List all pipeline jobs
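
The two read-only endpoints can be exercised with the standard library alone. The sketch below assumes the gateway listens on `http://localhost:8000` and returns JSON; both are assumptions, as are the helper names `endpoint` and `get_json`. POST endpoints are left out because their request bodies are not described here.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: local development server


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an API path without doubling slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def get_json(path: str, base_url: str = BASE_URL, timeout: float = 10.0):
    """GET an endpoint and decode the body as JSON (response shape unknown)."""
    with urlopen(endpoint(path, base_url), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running server; the paths come from the endpoint table.
    print(get_json("/api/health"))
    print(get_json("/api/pipeline/jobs"))
```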

Dependencies

Package	Version	Phase	Purpose
fastapi	≥0.111	Core	API framework
docling	≥2.15	2	Primary document parser
tika-client	≥0.5	2	Fallback parser (1000+ formats)
camelot-py[cv]	≥1.0.9	3	PDF table extraction
paddleocr	≥3.0	4	Primary OCR (PP-OCRv5)
python-doctr	≥0.8	4	Alternative/ensemble OCR
pymupdf	≥1.24	4	PDF rendering + scan detection
llama-index-core	≥0.12	5	Semantic indexing & retrieval
sentence-transformers	≥3.0	5	BGE-small-en-v1.5 embeddings
xlwings	≥0.33	6	Live Excel automation
openpyxl	≥3.1	3,6	Excel workbook generation
pandas	≥2.0	All	Data manipulation
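
The minimum versions above can be checked against the current environment before a job runs. This is a rough sketch, assuming the distribution names match the Package column (with `camelot-py[cv]` installed as `camelot-py`); `as_tuple` and `check` are hypothetical helpers that compare only the leading numeric components, not full PEP 440 versions.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the Dependencies table.
MINIMUMS = {
    "fastapi": "0.111", "docling": "2.15", "tika-client": "0.5",
    "camelot-py": "1.0.9", "paddleocr": "3.0", "python-doctr": "0.8",
    "pymupdf": "1.24", "llama-index-core": "0.12",
    "sentence-transformers": "3.0", "xlwings": "0.33",
    "openpyxl": "3.1", "pandas": "2.0",
}


def as_tuple(v: str) -> tuple[int, ...]:
    """Reduce a version string to at most three leading integers."""
    return tuple(int(n) for n in re.findall(r"\d+", v.split("+")[0])[:3])


def check(minimums: dict[str, str] = MINIMUMS) -> dict[str, str]:
    """Report each package as 'missing', '<installed> ok', or too old."""
    report: dict[str, str] = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = as_tuple(installed) >= as_tuple(minimum)
        report[name] = f"{installed} ok" if ok else f"{installed} < {minimum}"
    return report


if __name__ == "__main__":
    for package, status in check().items():
        print(f"{package}: {status}")
```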