# Wishlist

This document lists, in no particular order of priority, possible improvements to Dawsonia.

## Table detection and OCR for printed tables without lines

The current approach is specialized for tables with lines and for HTR (handwritten text recognition). For printed tables without lines, the columns and rows have to be distinguished from the structure of the table itself. For this application Microsoft's Table Transformer (both the -Detection and -Structure-Recognition models) and TrOCR / [EasyOCR](https://github.com/JaidedAI/EasyOCR) would be handy. See this working demo for inspiration

## Extend training dataset

To get handwriting samples similar to the material being digitized, some training data should be generated from the same set of journals we intend to digitize. Right now we have ~10000 samples of handwritten text. To get more, either

- manually use the `dawsonia label` command, or
- make use of the internal ObsMORA dataset to generate more training data for the journals SMHI is digitizing.

### Public handwriting training datasets

- CVL Digit Strings dataset
- [ORAND-CAR | ICFHR 2014 Competition on Handwritten Digit String Recognition in Challenging Datasets](https://www.orand.cl/icfhr2014-hdsr/)
- [IAM handwriting database](https://fki.tic.heia-fr.ch/databases/iam-handwriting-database) both historical and handwriting-db-v3
- [DIDA dataset](https://didadataset.github.io/): see also

```{note}
For internal use these have been downloaded to SMHI's long-term storage project directory.
```

## Alternative models for OCR and HTR

### OCR

- https://huggingface.co/microsoft/trocr-base-printed
- https://huggingface.co/microsoft/trocr-small-printed
- https://huggingface.co/microsoft/trocr-large-printed

### HTR

- https://huggingface.co/microsoft/trocr-base-handwritten
- https://huggingface.co/microsoft/trocr-small-handwritten
- https://huggingface.co/microsoft/trocr-large-handwritten
- https://huggingface.co/Riksarkivet/satrn_htr

## Dashboard for ML training: MLops

Most dashboards are proprietary, but some free ones exist, such as MLflow (flexible but limited customization?) and TensorBoard (easy but no customization?). Here is a (perhaps biased) write-up on the options available.

## Dashboard for digitizing, verification and correction

For simple digitization and verification there is a [Gradio](https://gradio.app) dashboard. See `scripts/data_verify.py`

```{literalinclude} ../../scripts/data_verify.py
:language: py
```

Something more advanced can be built using Gradio. See for inspiration. Streamlit and Plotly Dash are other viable dashboarding options. Another possibility is to build a Jupyter Notebook with widgets for interaction.

## Hyperparameter tuning

To some extent this can be done manually. A more structured approach would be to use a framework such as [Optuna](https://optuna.org/).

## Pipeline for flagging failed digitizations

While running in production, only the most certain digitizations should pass through. The rest should be flagged and their images preserved for debugging.

## Deskewing images

Right now simple rotations and filtering are done to pre-process the images. This requires a high-quality scan with pages that are not skewed or warped, which may not always be the case. To correct this, deskewing can be performed.
An attempt was made [in this notebook](./notebooks/2023-11-03-deskew.ipynb), using OpenCV and scikit-image with [Hough transforms](https://muthu.co/skew-detection-and-correction-of-document-images-using-hough-transform/) and contour detection. It did not give promising results. Some other techniques are mentioned here:

- https://pyimagesearch.com/2017/02/20/text-skew-correction-opencv-python/
- https://safjan.com/tools-for-doc-deskewing-and-dewarping/
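
To make the idea of skew estimation concrete, here is a minimal sketch of a projection-profile search. This is a different technique from the Hough-transform and contour approach tried in the notebook, and it is not claimed to be what the links above describe. It assumes the coordinates of the foreground (ink) pixels have already been extracted from a binarized page. The function `estimate_skew_degrees`, its parameters `max_angle` and `step`, and the synthetic data are all hypothetical and not part of Dawsonia.

```python
import math
from collections import Counter


def estimate_skew_degrees(points, max_angle=5.0, step=0.25):
    """Estimate page skew from foreground pixel coordinates (hypothetical helper).

    points: iterable of (x, y) ink-pixel coordinates, with y growing downwards.
    Returns the candidate angle in degrees (within +/- max_angle) at which
    text lines of the form y = y0 + x * tan(angle) collapse best onto rows.
    """
    points = list(points)
    best_angle, best_score = 0.0, -1
    n_steps = int(round(2 * max_angle / step))
    for i in range(n_steps + 1):
        angle = -max_angle + i * step
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        # Row index of every pixel after undoing a skew of `angle` degrees.
        profile = Counter(round(y * cos_t - x * sin_t) for x, y in points)
        # A sharp profile (few, heavily populated rows) gives a high score.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic check: three idealized "text lines" skewed by 2 degrees.
slope = math.tan(math.radians(2.0))
demo_points = [(x, y0 + x * slope) for y0 in (20, 40, 60) for x in range(200)]
print(estimate_skew_degrees(demo_points))
```

The estimated angle could then be handed to an image rotation routine to straighten the page. Note that this only addresses a global rotation; warped pages would need a dewarping step, and a brute-force search like this one may be too slow for full-resolution scans without downsampling.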