User guide#

Input data and neural network models#

All large data files are kept in a separate Git repository, https://git.smhi.se/ai-for-obs/data, which also includes the trained model files, organized under data/processed/dawsonia_model*. This path is passed to dawsonia digitize with the --model-path argument, see below.

Configuration#

For ease of use with command-line operations, and to encode the table formats of documents, we use TOML configuration files.

Command line parameters#

Some parameters that are specific to a machine or setup are often repeated. These can be saved in a configuration file, in sections named [dawsonia.<command>]. The files are typically saved in the directory cfg, but they can be placed anywhere and passed to the commands as follows:

dawsonia <command> -c cfg/dawsonia.toml ...

This saves some keystrokes and avoids repeatedly entering --model-path, --output-path, etc.

Note

Hyphens (-) in command-line options become underscores (_) in the TOML file. For example, --model-path is written as model_path.

Table formats#

Table format files are typically saved in the directory table_formats, but a different directory can be specified with the --table-fmt-dir command-line argument. The name of each file should correspond to what is returned by dawsonia.label.station_name(). A file can contain:

  • many [versions.<version_name>] sections to encode aliases for different versions of table formats,

  • a [default] section to encode the most common table format version, and

  • many [YYYY] sections to specify the version of the table format used in a particular year.

See dawsonia.label.read_specific_table_format(), which returns the format specific to a given PDF file.

Preprocessing#

The configuration file may include a section [default.preproc] that defines the preprocessing operations. [YYYY.preproc] sections are also permitted. These sections are parsed and converted into dawsonia.typing.PreprocConfig, which is then used in dawsonia.image_preproc.

Transforms#

Image transformations can be specified in a section [default.transforms]; see, for example, table_formats/öland.toml. The section should follow the specification in dawsonia.typing.Transforms.

Versions: Columns, rows and tables#

Each [versions.<version_name>] section should contain three keys:

  • columns: headings of the columns

  • rows: indices corresponding to the rows, often in the first column of the table

  • tables: the shapes of the tables, i.e. how many rows and columns are expected to contain handwritten text. The tables should be listed in left-to-right, top-to-bottom order, as they appear on the page.

Command-line interface#

The primary way of using dawsonia is from the command line. A typical pipeline runs the following commands in order:

  1. dawsonia label

  2. dawsonia prepare

  3. dawsonia ml --train

  4. dawsonia ml --test

  5. dawsonia digitize
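The order above can be written down as a small dry-run script that only assembles and prints the command lines. It assumes a shared configuration file at cfg/dawsonia.toml, passed with -c as described earlier, and it is likely that each command needs further arguments that are not shown here. The helper pipeline_commands is hypothetical.

```python
import shlex

# Subcommands in pipeline order, each with its extra options.
PIPELINE = [
    ["label"],
    ["prepare"],
    ["ml", "--train"],
    ["ml", "--test"],
    ["digitize"],
]


def pipeline_commands(config: str = "cfg/dawsonia.toml") -> list[list[str]]:
    """Build one argument list per pipeline step, without running anything.

    Hypothetical helper, not part of dawsonia.
    """
    commands = []
    for command, *options in PIPELINE:
        commands.append(["dawsonia", command, "-c", config, *options])
    return commands


for argv in pipeline_commands():
    # Print only; pass argv to subprocess.run(argv, check=True) to execute.
    print(shlex.join(argv))
```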

The commands are also sensitive to the following environment variables:

  • DAWSONIA_DEBUG_TABLE_DETECT

  • DAWSONIA_DEBUG_DIGITIZE

  • DAWSONIA_DEBUG (which activates all of the debug options above)
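One way to enable a debug variable for a single run, without changing the surrounding shell, is to pass a modified copy of the environment to a subprocess. The sketch below only builds that environment. The value "1" is an assumption, since the values dawsonia expects are not stated here, and debug_environment is a hypothetical helper.

```python
import os


def debug_environment(*names: str, value: str = "1") -> dict[str, str]:
    """Return a copy of the current environment with debug variables set.

    Hypothetical helper; the value "1" is an assumption of this sketch.
    """
    env = dict(os.environ)
    for name in names:
        env[name] = value
    return env


env = debug_environment("DAWSONIA_DEBUG_DIGITIZE")
# The copy can then be handed to a child process, for example
# subprocess.run([...], env=env), leaving os.environ itself untouched.
print(env["DAWSONIA_DEBUG_DIGITIZE"])
```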