Challenges for Hackathon 2024#

Note

A starting point is mentioned for every challenge.

1. Improve table detection method `OPENCV_CONTOURS`#

image-processing

Dawsonia has a new method for table detection. It relies on OpenCV to detect table cell contours and on a scikit-learn clustering algorithm to group the table cells. An advantage is that the method should still work when scanned pages are slightly unstructured or warped. It currently works in certain cases but needs improvement before it can be considered robust.

Starting point: See the demo of the Table detection with OPENCV_CONTOURS method.
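To make the clustering step concrete, here is a minimal sketch of grouping detected cell boxes into table rows. It is not Dawsonia's implementation: it swaps the scikit-learn clustering for a simple gap-based grouping of vertical cell centers, and the function name `group_cells_into_rows` and the `max_gap` parameter are hypothetical. The boxes use the `(x, y, width, height)` layout that `cv2.boundingRect` returns.

```python
from typing import List, Tuple

# (x, y, width, height), the layout returned by cv2.boundingRect
Box = Tuple[int, int, int, int]


def group_cells_into_rows(boxes: List[Box], max_gap: float) -> List[List[Box]]:
    """Group cell boxes into rows by clustering their vertical centers.

    Hypothetical helper: a gap-based stand-in for a real clustering algorithm.
    """
    if not boxes:
        return []
    # Sort by vertical center so cells of the same row become neighbours.
    ordered = sorted(boxes, key=lambda b: b[1] + b[3] / 2)
    rows = [[ordered[0]]]
    for box in ordered[1:]:
        prev = rows[-1][-1]
        gap = (box[1] + box[3] / 2) - (prev[1] + prev[3] / 2)
        if gap <= max_gap:
            rows[-1].append(box)   # close enough: same row
        else:
            rows.append([box])     # large jump: start a new row
    # Within each row, order the cells from left to right.
    return [sorted(row, key=lambda b: b[0]) for row in rows]


# Slightly warped page: cells of one row do not share the exact same y.
cells = [(10, 12, 40, 20), (60, 10, 40, 20), (110, 14, 40, 20),
         (12, 52, 40, 20), (62, 50, 40, 20), (112, 55, 40, 20)]
rows = group_cells_into_rows(cells, max_gap=10)
```

A fixed `max_gap` is the weak point of this sketch: strongly warped pages are where a proper clustering algorithm is likely to be more robust.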

2. Integrate AI model to handle printed text#

AI / ML

EasyOCR, MMOCR, or one of the TrOCR models (base, small or large variants) could be used to infer strings from images of printed text.

Starting point: One will have to refactor and/or create an analogue of the functions dawsonia.digitize.digitize_page_and_write_output() and dawsonia.digitize.digitize_table_with_model() to allow for other OCR models.
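One possible shape for such a refactoring is a small interface that every OCR model is wrapped behind. The sketch below is only an illustration: `TextRecognizer`, `UppercaseDummyRecognizer` and `digitize_cells` are hypothetical names, not part of Dawsonia, and the dummy backend merely decodes bytes so that the example runs without any OCR library installed.

```python
from typing import Dict, Protocol


class TextRecognizer(Protocol):
    """Hypothetical interface that any OCR/HTR backend would satisfy."""

    def recognize(self, cell_image: bytes) -> str:
        ...


class UppercaseDummyRecognizer:
    """Stand-in backend; a real one would wrap EasyOCR, MMOCR or TrOCR."""

    def recognize(self, cell_image: bytes) -> str:
        # Pretend the image bytes are the text, to keep the sketch runnable.
        return cell_image.decode("ascii").upper()


def digitize_cells(cells: Dict[str, bytes],
                   recognizer: TextRecognizer) -> Dict[str, str]:
    """Run the chosen recognizer on every table cell, keyed by cell id."""
    return {cell_id: recognizer.recognize(image)
            for cell_id, image in cells.items()}


result = digitize_cells({"r0c0": b"12.5", "r0c1": b"nw"},
                        UppercaseDummyRecognizer())
```

With this design the digitization code only depends on the `recognize` method, so switching models would not require touching the table handling.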

3. Active learning through semi-supervised training of the HTR model#

AI / ML

By saving the detected table cell image samples and the inferred digitized text we can expand the training data. This can be used to retrain the HTR neural network.

Considerations:

Does this method improve the performance of the model?
Can we have a workflow, where we maintain an isolated validation and test dataset?

Starting point: Within the function dawsonia.digitize.digitize_table_with_model(), after inferring the text in a table cell, we can save a labelled image to the filesystem. Reuse it to expand the current training data stored in the data/interim/label_old directory.
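A minimal sketch of the saving step follows. The layout is an assumption, since the format of data/interim/label_old is not described here: the hypothetical helper `save_pseudo_label` writes each cell image under a content-hash filename and appends a `filename,text` row to a `labels.csv` file. Writing the inferred samples to their own directory is one way to keep them out of the validation and test datasets.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def save_pseudo_label(image_bytes: bytes, text: str, out_dir: Path) -> Path:
    """Write one cell image and append its inferred text to labels.csv.

    Hypothetical helper; the on-disk layout is an assumption.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Name the file after its content hash so the same cell is stored once.
    name = hashlib.sha256(image_bytes).hexdigest()[:16] + ".png"
    image_path = out_dir / name
    if not image_path.exists():
        image_path.write_bytes(image_bytes)
        with open(out_dir / "labels.csv", "a", newline="",
                  encoding="utf-8") as f:
            csv.writer(f).writerow([name, text])
    return image_path


with tempfile.TemporaryDirectory() as tmp:
    pseudo_dir = Path(tmp) / "pseudo_labels"   # kept apart from val/test data
    first = save_pseudo_label(b"fake-png-bytes", "12.5", pseudo_dir)
    save_pseudo_label(b"fake-png-bytes", "12.5", pseudo_dir)  # duplicate
    with open(pseudo_dir / "labels.csv", encoding="utf-8") as f:
        saved_rows = list(csv.reader(f))
    saved_name = first.name
```

Since the labels are model predictions rather than ground truth, adding them only to the training split keeps the evaluation numbers comparable before and after retraining.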

4. Use image-processing based table detection to train an AI-based table detection model#

AI / ML

Pretrained table transformer and table structure recognition models exist, but they were trained on digital documents. Adapting them to older documents may require some transfer learning.

To do so, the first step is to create a workflow that generates a training dataset for table detection. The format of the dataset should match that of the pubtables-1m dataset so that it can be used to train table-transformer.

Starting point: See https://huggingface.co/spaces/nielsr/tatr-demo and its source code to see how such a method may work in practice. Also take a look at the structure of the pubtables-1m dataset. Download the pretrained table-transformer and table-structure recognition models. Use dawsonia.label.label_page() to obtain the coordinates of the table cells and save them in a form that complies with the pubtables-1m format.
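As a rough sketch of the conversion step, the code below writes bounding boxes as PASCAL VOC-style XML. This rests on assumptions that should be checked against the dataset itself: that pubtables-1m annotations follow this XML layout, that `table` is a valid class name, and that cell coordinates are available as `(xmin, ymin, xmax, ymax)` pixel tuples. The helpers `union_box` and `boxes_to_voc_xml` are hypothetical; the table box is derived here as the union of its cell boxes.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax) in pixels


def union_box(boxes: List[Box]) -> Box:
    """Smallest box enclosing all cell boxes, used as the table box."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def boxes_to_voc_xml(filename: str, width: int, height: int,
                     boxes: List[Box], label: str) -> str:
    """Serialize boxes as a PASCAL VOC-style annotation (assumed layout)."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for box in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), box):
            ET.SubElement(bndbox, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


cells = [(10, 10, 50, 30), (60, 12, 150, 32), (10, 50, 50, 75)]
xml_text = boxes_to_voc_xml("page_001.jpg", 800, 1200,
                            [union_box(cells)], label="table")
```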

5. Extend HTR training dataset using public datasets and make it compatible with Dawsonia#

AI / ML

The homebrewed training dataset can be extended with the CVL strings and DIDA datasets.

Considerations:

The resulting dataset will be imbalanced. Does it affect the performance of the model?
If so, can we balance the dataset between training, test and validation splits?

Starting point: Obtain public handwriting training datasets and reorganize them to match the training data stored in the data/interim/label_old directory.
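One way to keep an imbalanced dataset comparable across splits is a stratified split, where each label is divided with the same ratios. The sketch below assumes that every sample is a `(filename, label)` pair and that the imbalance is between labels; `stratified_split` and its default ratios are hypothetical choices for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, str]  # (image filename, label)


def stratified_split(samples: Sequence[Sample],
                     ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
                     seed: int = 0) -> Dict[str, List[Sample]]:
    """Split each label separately so all splits share the label mix."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    splits: Dict[str, List[Sample]] = {"train": [], "val": [], "test": []}
    rng = random.Random(seed)          # fixed seed for reproducible splits
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]   # remainder
    return splits


# Imbalanced toy dataset: label "0" is ten times more common than "7".
samples = ([(f"zero_{i}.png", "0") for i in range(100)]
           + [(f"seven_{i}.png", "7") for i in range(10)])
splits = stratified_split(samples)
```

Stratifying does not remove the imbalance; it only ensures that the rare labels appear in the validation and test splits in the same proportion as in training.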

6. Extend training dataset using different image augmentation methods#

image-processing AI / ML

Choose an image augmentation library and write a script to generate augmented HTR image samples. Augmentation should preferably be applied after splitting into training, test and validation datasets.
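The ordering matters because augmented copies of one sample should not end up in both the training and the evaluation data. The sketch below illustrates the idea without any augmentation library: images are plain nested lists of grayscale values, and the only transformation is a horizontal shift. The names `shift_horizontally` and `augment_training_split` are hypothetical, and a real script would use the transformations of the chosen library instead.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale pixels, row-major, values 0-255


def shift_horizontally(image: Image, offset: int, fill: int = 255) -> Image:
    """Translate the image by `offset` pixels, padding with background."""
    width = len(image[0])
    offset = max(-width, min(width, offset))  # clamp to the image width
    shifted = []
    for row in image:
        if offset >= 0:
            shifted.append([fill] * offset + row[:width - offset])
        else:
            shifted.append(row[-offset:] + [fill] * (-offset))
    return shifted


def augment_training_split(train: List[Image], copies: int,
                           max_shift: int, seed: int = 0) -> List[Image]:
    """Return the originals plus `copies` shifted variants of each image.

    Only the training split is passed in; validation and test images
    are left untouched.
    """
    rng = random.Random(seed)
    augmented = list(train)
    for image in train:
        for _ in range(copies):
            offset = rng.randint(-max_shift, max_shift)
            augmented.append(shift_horizontally(image, offset))
    return augmented


sample = [[255, 0, 0, 255],
          [255, 0, 255, 255]]
shifted_right = shift_horizontally(sample, 1)
train_augmented = augment_training_split([sample], copies=3, max_shift=1)
```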

Starting point: One of the following implementations could be used:

7. Create a dashboard for digitization and deploy it to Hugging Face#

visualization

The dashboard can be implemented using one of the following libraries:

Gradio: https://www.gradio.app/demos/
Streamlit: https://streamlit.io/gallery

Starting point: A basic Gradio dashboard is implemented in scripts/data_verify.py. Once implemented, the dashboard can be uploaded to a Hugging Face Space. For an advanced example of a similar HTR application that shows what is possible, see Riksarkivet/htr_demo.
