Optical Character Recognition (OCR) Systems process scanned text into text usable by computers. We observe that different OCRs make independent mistakes. This example uses a simple Logistic Regression encoded in our system, to select between OCR outputs when they differ.
This example uses outputs from two open-source OCRs for a dataset of 620 words, whose features are already extracted. The dataset is hand- labeled.
- PostreSQL
- Python
- Matplotlib (
pip install matplotlib)
- If necessary, modify
db.urlto fill in your database connection details. - Execute
deepdive initdbto initialize database. - Execute
deepdive run.
- Execute
python feature-analysis.py. - Feature analysis and system calibration result are in
output/andrun/LATEST/calibration/, respectively. - For details, run
deepdive sqlto examine the result relations.