Miscellaneous implementation details#

Input data and neural network models#

All large data files are organized in a separate git repository (https://git.smhi.se/ai-for-obs/data). It also includes the trained model files, organized under data/processed/dawsonia_model*. This path is passed as the --model-path argument of dawsonia digitize; see below.

Configuration#

For ease of use on the command line, and to encode the table formats of the documents, we use TOML configuration files.

Command line parameters#

Some parameters specific to a machine or setup are often repeated. These can be saved in a configuration file with sections named [dawsonia.<command>]. The files are typically saved in the directory cfg, but can be placed anywhere and passed to the commands as:

dawsonia <command> -c cfg/dawsonia.toml ...

This saves some keystrokes and avoids repeatedly entering --model-path, --output-path, etc.

Note

Hyphens (-) in command-line options become underscores (_) in the TOML file; for example, --model-path becomes model_path.

Table formats#

Table format files are typically saved in the directory table_formats, but a different directory can be specified with the --table-fmt-dir command-line argument. The name of each file should correspond to what is returned by dawsonia.io.get_station_name(). A file can contain:

  • many [versions.<version_name>] sections to encode aliases for different versions of table formats,

  • a [default] section to encode the most common table format version, and

  • many [YYYY] sections to specify a particular version of the table format for a particular year.

See dawsonia.label.read_specific_table_format(), which returns the format specific to a given PDF file.

Preprocessing#

The configuration file may include a section [default.preproc] that defines the preprocessing operations. [YYYY.preproc] sections are also permissible. These sections are parsed and converted into dawsonia.typing.PreprocConfig, which is then used in dawsonia.image_preproc.

Transforms#

Image transformations can be specified in a section [default.transforms]; see, for example, table_formats/öland.toml. The section should follow the specification in dawsonia.typing.Transforms.

Versions: Columns, rows and tables#

Each [versions.<version_name>] section should contain three keys:

  • columns: headings of the columns

  • rows: indices corresponding to the rows, often in the first column of the table

  • tables: the shapes of the tables, i.e. how many rows and columns we expect to contain handwritten text. These should be listed in left-to-right, top-to-bottom order, as they appear on the page.

Command-line interface#

The primary way of using dawsonia is from the command line. A typical pipeline executes the following commands in order:

  1. dawsonia label

  2. dawsonia prepare

  3. dawsonia ml --train

  4. dawsonia ml --test

  5. dawsonia digitize
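The ordering above can be captured in a small driver script. The sketch below only reproduces the command names listed in this section; real invocations need further options (for example -c or --model-path) that are omitted here, and the helper run_pipeline is hypothetical. By default it performs a dry run and prints the commands instead of executing them.

```python
import shlex
import subprocess

# Pipeline stages in the order given in the list above.
PIPELINE = [
    "dawsonia label",
    "dawsonia prepare",
    "dawsonia ml --train",
    "dawsonia ml --test",
    "dawsonia digitize",
]


def run_pipeline(commands=PIPELINE, dry_run=True):
    """Run each stage in order; stop at the first stage that fails."""
    argvs = [shlex.split(command) for command in commands]
    for argv in argvs:
        if dry_run:
            print(shlex.join(argv))
        else:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(argv, check=True)
    return argvs


run_pipeline()
```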

The commands are also sensitive to the following environment variables:

  • DAWSONIA_DEBUG_TABLE_DETECT

  • DAWSONIA_DEBUG_DIGITIZE

  • DAWSONIA_DEBUG (which activates all of the above debug options)
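The relationship between the specific debug variables and the catch-all DAWSONIA_DEBUG can be sketched as follows. How dawsonia actually interprets the values is not documented here, so this example assumes that any non-empty value switches a flag on; the helper debug_enabled is hypothetical.

```python
import os

SPECIFIC_FLAGS = ("DAWSONIA_DEBUG_TABLE_DETECT", "DAWSONIA_DEBUG_DIGITIZE")


def debug_enabled(flag: str, environ=os.environ) -> bool:
    """True if the specific flag or the catch-all DAWSONIA_DEBUG is set.

    Assumption: any non-empty value counts as "enabled".
    """
    return bool(environ.get(flag) or environ.get("DAWSONIA_DEBUG"))


for name in SPECIFIC_FLAGS:
    print(name, debug_enabled(name))
```

In a shell, the variable would typically be set for a single invocation by prefixing the command, e.g. DAWSONIA_DEBUG=1 followed by the dawsonia command.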

dawsonia#

DAWSONIA: Digitize hAndWritten obServatiONs In weather journAls (version 0.1.0b0)

This is the main command, which provides subcommands for the different stages of the pipeline.

dawsonia [OPTIONS] COMMAND [ARGS]...

Options

--install-completion#

Install completion for the current shell.

--show-completion#

Show completion for the current shell, to copy it or customize the installation.

digitize#

Digitize PDF / ZARR into a dataframe using a trained ML model and write it out.

dawsonia digitize [OPTIONS] COMMAND [ARGS]...

label#

Label PDF images to create raw training data.

dawsonia label [OPTIONS] COMMAND [ARGS]...

ml#

MLops: transform data, train, test

dawsonia ml [OPTIONS] COMMAND [ARGS]...

prepare#

Creates new train, validation, test and ground truth text files in “washington” format for the HTR network, and copies the images in as input data for the model.

Parameters


n_train: int

Number of images in training set

n_val: int

Number of images in validation set

n_test: int

Number of images in test set. NOTE: If n_test == -1, all the files from label_path are used for testing.

label_path: Path

Path to label directory (where the pictures are located).

dawsonia prepare [OPTIONS] COMMAND [ARGS]...
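The meaning of n_train, n_val and n_test, including the special value n_test == -1, can be illustrated with a plain Python split. This is not the implementation behind dawsonia prepare: the helper split_labels, the seeded shuffle, and the choice to leave the training and validation sets empty when n_test == -1 are assumptions made for the sketch.

```python
import random


def split_labels(files, n_train, n_val, n_test, seed=0):
    """Split file names into (train, val, test) lists.

    Assumption: n_test == -1 puts every file in the test set and leaves
    the training and validation sets empty.
    """
    files = sorted(files)
    if n_test == -1:
        return [], [], files
    needed = n_train + n_val + n_test
    if needed > len(files):
        raise ValueError(f"requested {needed} images, only {len(files)} available")
    shuffled = files[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:needed]
    return train, val, test


example_files = [f"img_{i:02d}.png" for i in range(10)]
train, val, test = split_labels(example_files, n_train=6, n_val=2, n_test=2)
print(len(train), len(val), len(test))
```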

Logging#

See dawsonia.log.init_logger() for the logging configuration. It includes handlers for logging into the console and filesystem.
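For readers unfamiliar with this pattern, a logger with one console handler and one file handler can be set up with the standard logging module as shown below. This is a generic sketch, not the code of dawsonia.log.init_logger(); the function name init_example_logger, the log file name, and the message format are all illustrative.

```python
import logging


def init_example_logger(name="dawsonia_example", log_file="dawsonia_example.log",
                        level=logging.INFO):
    """Return a logger that writes to both the console and a file."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        formatter = logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s: %(message)s"
        )
        console_handler = logging.StreamHandler()  # writes to stderr
        file_handler = logging.FileHandler(log_file, encoding="utf-8")
        for handler in (console_handler, file_handler):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

Calling init_example_logger() once at program start and then using logger.info(...) sends each message to both destinations.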