# Challenges for Hackathon 2024

```{dropdown} Some potential "hacks" which can improve Dawsonia
These are merely suggestions. Choose one challenge which suits your interest. If you don't find anything suitable, [chat with us](https://matrix.to/#/#aiforobs:matrix.org) and perhaps we can discover something else.
```

```{note}
A **starting point** is mentioned for every challenge.
```

## 1. Improve table detection method `OPENCV_CONTOURS`

{bdg-primary}`image-processing`

Dawsonia has a new method for table detection. It relies on OpenCV to detect table cell contours and on a scikit-learn clustering algorithm to group the table cells. An advantage is that the method should still work when scanned pages are slightly unstructured or warped. It works in certain cases but needs improvement before it can be considered robust.

**Starting point**: See the demo of the [](./getting_started/table_detect_opencv_contours.myst.ipynb) method.

:::{aside} An Optional Title
This is an aside. It is not entirely relevant to the main article.
:::

## 2. Integrate an AI model to handle printed text

{bdg-info}`AI / ML`

[EasyOCR](https://github.com/JaidedAI/EasyOCR), [MMOCR](https://github.com/open-mmlab/mmocr), or one of the [TrOCR](https://huggingface.co/docs/transformers/v4.41.3/en/model_doc/trocr) models ([base](https://huggingface.co/microsoft/trocr-base-printed), [small](https://huggingface.co/microsoft/trocr-small-printed) or [large](https://huggingface.co/microsoft/trocr-large-printed) variants) could be used to infer strings from images of printed text.

**Starting point**: One will have to refactor and/or create an analogue of the functions {func}`dawsonia.digitize.digitize_page_and_write_output` and {func}`dawsonia.digitize.digitize_table_with_model` to allow for other OCR models.

## 3. Active learning through semi-supervised training of the HTR model

{bdg-info}`AI / ML`

By saving the detected table cell image samples together with the inferred digitized text, we can expand the training data. The expanded data can then be used to retrain the HTR neural network.

Considerations:

- Does this method improve the performance of the model?
- Can we have a workflow where we maintain isolated validation and test datasets?

**Starting point**: Within the function {func}`dawsonia.digitize.digitize_table_with_model`, after inferring the text in a table cell, we can save a labelled image to the filesystem. Reuse it to expand the current training data stored in the `data/interim/label_old` directory.

## 4. Use image-processing based table detection to train an AI-based table detection model

{bdg-info}`AI / ML`

Table transformer and table structure recognition models exist, pretrained on digital documents. To adapt them to older documents, some transfer learning may be required. The first step is to create a workflow that generates a training dataset for table detection. The format of the dataset should match that of the [pubtables-1m] dataset so that it can be used to train [table-transformer](https://github.com/microsoft/table-transformer).

**Starting point**: See the [source code](https://huggingface.co/spaces/nielsr/tatr-demo) of the `tatr-demo` Space to see in practice how such a method may work. Also take a look at the structure of the [pubtables-1m] dataset. Download the pretrained table-transformer and table-structure recognition models. See {func}`dawsonia.label.label_page` to obtain the coordinates of the table cells and save them in a form that complies with the [pubtables-1m] format.

[pubtables-1m]: https://huggingface.co/datasets/bsmock/pubtables-1m

## 5. Extend HTR training dataset using public datasets and make it compatible with Dawsonia

{bdg-info}`AI / ML`

The homebrewed training dataset can be easily extended with the CVL Digit Strings and DIDA datasets.

Considerations:

- The resulting dataset will be imbalanced. Does it affect the performance of the model?
- If so, can we balance the dataset between training, test and validation splits?

**Starting point**: Download public handwriting training datasets and reorganize them to match the training data stored in the `data/interim/label_old` directory.

- [CVL Digit Strings dataset](https://cvl.tuwien.ac.at/research/cvl-databases/icdar2013-handwritten-digit-and-digit-string-recognition-competition/) | [Zenodo archive](https://doi.org/10.5281/zenodo.1492173)
- [ORAND-CAR](https://www.orand.cl/icfhr2014-hdsr/)
- [DIDA dataset](https://didadataset.github.io/) | [Kaggle](https://www.kaggle.com/datasets/ayavariabdi/didadataset)

## 6. Extend training dataset using different image augmentation methods

{bdg-primary}`image-processing` {bdg-info}`AI / ML`

Choose an image augmentation library and write a script to generate augmented HTR image samples. Augmentation should preferably be applied after splitting into training, test and validation datasets.

**Starting point**: One of the following implementations could be used:

- https://git.smhi.se/ai-for-obs/dawsonia/-/blob/main/src/dawsonia/ml/data/preproc.py?ref_type=heads#L45
- https://github.com/microsoft/unilm/tree/master/trocr/augmentation
- https://github.com/albumentations-team/albumentations/

## 7. Create a dashboard for digitization and deploy it to Hugging Face

{bdg-success}`visualization`

The dashboard can be implemented using one of the following libraries:

- Gradio: https://www.gradio.app/demos/
- Streamlit: https://streamlit.io/gallery

**Starting point**: A basic Gradio dashboard is implemented in [scripts/data_verify.py](../../scripts/data_verify.py). Once implemented, the dashboard can be uploaded to a [Hugging Face Space](https://huggingface.co/docs/hub/spaces-overview). For an advanced example of a similar HTR application, and to see what is possible, look at [Riksarkivet/htr_demo](https://huggingface.co/spaces/Riksarkivet/htr_demo).
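
As a rough illustration of challenge 7, the sketch below wires a placeholder function into a Gradio interface. It is not taken from `scripts/data_verify.py`: `summarize_page` and `build_demo` are hypothetical names, and `summarize_page` only reports the image size where a real dashboard would call Dawsonia's digitization code. Gradio is imported inside `build_demo` so that the placeholder logic can be tried without Gradio installed.

```python
"""Minimal Gradio dashboard sketch; the digitization step is a placeholder."""


def summarize_page(image):
    """Hypothetical stand-in for a Dawsonia digitization call.

    Returns table rows describing the uploaded image, so the dashboard
    has something to display. Replace the body with real digitization.
    """
    if image is None:  # Gradio passes None when no image has been uploaded
        return [["status", "no image uploaded"]]
    # Array-like images expose (height, width, ...) through .shape
    height, width = image.shape[:2]
    return [["height (px)", str(height)], ["width (px)", str(width)]]


def build_demo():
    """Build the Gradio interface (assumes the gradio package is installed)."""
    import gradio as gr  # imported lazily; only needed to build the UI

    return gr.Interface(
        fn=summarize_page,
        inputs=gr.Image(type="numpy", label="Scanned page"),
        outputs=gr.Dataframe(headers=["field", "value"], label="Result"),
        title="Dawsonia digitization (sketch)",
    )
```

Calling `build_demo().launch()` should start a local server for trying the dashboard in a browser. For a Gradio-based Hugging Face Space, the same call would typically live in the Space's `app.py`.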