Wishlist#

This document lists possible improvements to Dawsonia, in no particular order of priority.

Table detection and OCR for printed tables without lines#

The current approach is specialized for tables with ruled lines and HTR (handwritten text recognition). For printed tables without lines, the columns and rows have to be distinguished from the table's structure. For this application, Microsoft's Table Transformer (both the -Detection and -Structure-Recognition models) together with TrOCR / EasyOCR would be handy. See this working demo for inspiration:

https://huggingface.co/spaces/nielsr/tatr-demo
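A rough sketch of how such a pipeline could be wired together. The checkpoint names are real Hugging Face models, but the helper functions and the detection threshold are illustrative assumptions, and the imports are done lazily because the model weights are large downloads:

```python
def detect_tables(image, threshold: float = 0.7):
    """Detect table bounding boxes in a PIL image with Table Transformer.

    Imports are inside the function because the weights are large
    downloads; the function name and threshold are illustrative.
    """
    import torch
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    processor = AutoImageProcessor.from_pretrained(
        "microsoft/table-transformer-detection"
    )
    model = TableTransformerForObjectDetection.from_pretrained(
        "microsoft/table-transformer-detection"
    )
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Convert raw logits into labelled boxes in image coordinates
    return processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=[image.size[::-1]]
    )[0]


def recognize_cell(cell_image) -> str:
    """Read the text of a single cropped cell with TrOCR (printed model)."""
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")
    pixel_values = processor(images=cell_image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

The structure-recognition step would use microsoft/table-transformer-structure-recognition in the same way to find row and column boxes, whose intersections give the cell crops fed to the recognition step.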

Extend training dataset#

To get handwriting samples similar to those in the target material, some training data should be generated from the same set of journals we intend to digitize. Right now we have ~10000 samples of handwritten text. To get more, either

  • manually use the dawsonia label command, or

  • make use of the internal ObsMORA dataset to generate more training data for the journals SMHI is digitizing.

Public handwriting training datasets#

Note

For internal use these have been downloaded to SMHI’s long-term storage project directory.

Alternative models for OCR and HTR#

OCR#

HTR#

Dashboard for ML training: MLops#

Most dashboards are proprietary, but some free ones exist, such as MLflow (flexible but limited customization?) and TensorBoard (easy but no customization?). Here is a (perhaps biased) write-up on the options available: https://neptune.ai/blog/best-mlops-tools.

Dashboard for digitizing, verification and correction#

For simple digitization and verification there is a Gradio dashboard. See scripts/data_verify.py

#!/usr/bin/env python3
import json
from datetime import datetime
from pathlib import Path

import gradio as gr
import pandas as pd
from dawsonia.config import read_configfile
from dawsonia.io import read_book

# config = read_configfile(Path("cfg/c21921.toml"))
config = read_configfile(Path("cfg/dawsonia.toml"))
cfg_digitize = config["dawsonia"]["digitize"]

model_path = Path(cfg_digitize["model_path"]).expanduser()
output_path = Path(cfg_digitize["output_path"]).expanduser()
raw_path = model_path.parent.parent / "raw_zarr"


def show_digitized(station_name: str, year: int, first_page: int, date_page: str):
    page_number = (
        int(first_page) + (pd.to_datetime(date_page) - datetime(int(year), 1, 1)).days
    )
    pdf = next((raw_path / station_name).glob(f"*{year}.zarr.zip"))
    first, last, book = read_book(pdf)
    image = book.read_page(page_number)
    # image = next(get_pages(pdf, page_number, page_number + 1))

    stem = pdf.name.split(".")[
        0
    ]  # to get the stem when 2 extensions are used like .zarr.zip
    output_path_page = output_path / station_name.lower() / stem / str(date_page)

    df = pd.read_parquet(output_path_page.with_suffix(".parquet"))
    df_prob = pd.read_parquet(
        (output_path_page.parent / "probablities" / output_path_page.name).with_suffix(
            ".parquet"
        )
    )
    stats = json.loads(
        (output_path_page.parent / "statistics" / output_path_page.name)
        .with_suffix(".json")
        .read_text()
    )
    meta = json.loads(
        (output_path_page.parent / "table_meta" / output_path_page.name)
        .with_suffix(".json")
        .read_text()
    )

    return image, df, df_prob, stats, meta


all_stations = [path_dir.name for path_dir in raw_path.iterdir()]


app = gr.Interface(
    show_digitized,
    inputs=[
        gr.Textbox(label="station name"),
        gr.Textbox(label="year"),
        gr.Textbox(label="page corresponding to 1st January"),
        gr.Textbox(label="date"),
    ],
    outputs=[
        gr.Image(label="raw"),
        gr.Dataframe(label="digitized"),
        gr.Dataframe(label="probabilities"),
        gr.JSON(label="statistics"),
        gr.JSON(label="metadata"),
    ],
    examples=[
        [all_stations[0], 1927, 3, "1927-01-02"],
        ["KALMAR", 1932, 3, "1932-01-11"],
    ],
)
app.launch(server_name="0.0.0.0", share=True)

Something more advanced can be built using Gradio; see https://huggingface.co/spaces/Riksarkivet/htr_demo for inspiration. Streamlit and Plotly Dash are other viable dashboarding options. Another possibility is to build a Jupyter Notebook with widgets for interaction.

Hyperparameter tuning#

To some extent this can be done manually. A more structured approach would be to use a framework such as Optuna.

Pipeline for flagging failed digitizations#

While running in production, only the most certain digitizations should pass through. The rest should be flagged, and the corresponding images should be preserved for debugging.
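A sketch of such a gate, assuming per-cell probabilities like those in the parquet files written alongside each digitized page; the function name and threshold are illustrative:

```python
import pandas as pd


def split_by_confidence(
    df: pd.DataFrame, df_prob: pd.DataFrame, threshold: float = 0.9
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split digitized rows into (accepted, flagged) sets.

    A row is flagged if any of its cells has a probability below the
    threshold; flagged rows would be queued for manual verification and
    their source images preserved.
    """
    uncertain = (df_prob < threshold).any(axis=1)
    return df.loc[~uncertain], df.loc[uncertain]


# Toy example with one uncertain row
values = pd.DataFrame({"temperature": [1.2, 2.4, 3.1]})
probs = pd.DataFrame({"temperature": [0.99, 0.42, 0.95]})
accepted, flagged = split_by_confidence(values, probs)
```

The threshold could also be set per column, since some fields (e.g. signatures) tolerate more error than the measured values.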

Deskewing images#

Right now, simple rotations and filtering are done to pre-process the images. This requires a high-quality scan with pages that are not skewed or warped, which may not always be the case. To correct this, deskewing can be performed. An attempt was made in this notebook using OpenCV and scikit-image, based on Hough transforms and contour detection, but it did not give promising results. Some other techniques are mentioned here:
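One such alternative is projection-profile deskewing: rotate the page over a range of candidate angles and pick the one that maximizes the variance of the horizontal projection, which peaks when text rows and ruled lines align with the image axes. A minimal sketch, where the angle range and step are illustrative:

```python
import numpy as np
from scipy import ndimage


def estimate_skew(
    image: np.ndarray, max_angle: float = 5.0, step: float = 0.5
) -> float:
    """Estimate the skew of a binarized page image, in degrees.

    Tries rotations in [-max_angle, max_angle] and scores each by the
    variance of the row-wise projection profile; the variance is highest
    when dark rows (text lines, ruled lines) fall onto raster rows.
    Returns the correction angle to apply.
    """
    angles = np.arange(-max_angle, max_angle + step, step)
    scores = []
    for angle in angles:
        rotated = ndimage.rotate(image, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)
        scores.append(np.var(profile))
    return float(angles[int(np.argmax(scores))])
```

Applying `ndimage.rotate(page, estimate_skew(page), reshape=False)` then straightens the page. This is robust for lined tables but slower than Hough-based estimation, since every candidate angle requires a full rotation.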