User guide#
Input data and neural network models#
All large data files are organized in a separate Git repository:
https://git.smhi.se/ai-for-obs/data. It also includes trained model files,
organized under data/processed/dawsonia_model*. This path is passed as the
--model-path argument to dawsonia digitize; see below.
Configuration#
For ease of use on the command line, and to encode the table formats of the documents, we use TOML configuration files.
Command line parameters#
Some parameters specific to a machine or setup are often repeated. These can be
saved in a configuration file with sections named [dawsonia.<command>]. The
files are typically saved in the directory cfg, but can be placed anywhere
and passed to the commands as
dawsonia <command> -c cfg/dawsonia.toml ...
This saves some keystrokes and avoids repeatedly entering --model-path,
--output-path, etc.
Note
Hyphens (-) in command-line options become underscores (_) in the TOML file.
Table formats#
Table format files are typically saved in the directory table_formats, but the directory can be specified with the
--table-fmt-dir command-line argument. The name of the file should correspond
to what is returned by dawsonia.label.station_name(). There can be:
many [versions.<version_name>] sections to encode aliases for different versions of table formats,
a [default] section to encode the most common table format version, and
many [YYYY] sections to specify a particular version of the table format for a particular year.
See dawsonia.label.read_specific_table_format(), which returns
the format specific to a given PDF file.
Preprocessing#
The configuration file may include a section [default.preproc] for defining
the preprocessing operations. [YYYY.preproc] sections are also permissible.
These sections are parsed and converted into dawsonia.typing.PreprocConfig
and are used in dawsonia.image_preproc.
Transforms#
Image transformations can be specified in a section [default.transforms].
See for example table_formats/öland.toml. The section should follow the
specification in dawsonia.typing.Transforms.
Versions: Columns, rows and tables#
Each [versions.<version_name>] section should contain three keys:
columns: headings of the columns
rows: indices corresponding to the rows, often found in the first column of the table
tables: the shapes of the tables, i.e. how many rows and columns are expected to contain handwritten text. These should be listed in left-to-right, top-to-bottom order as they appear on the page.
Command-line interface#
The primary way of using dawsonia is from the command line. A typical pipeline executes the following commands in order:
dawsonia label
dawsonia prepare
dawsonia ml --train
dawsonia ml --test
dawsonia digitize
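The sequence above can also be driven from a script. The sketch below only builds the argument lists, in the order given, and places the -c option directly after the command as in the configuration example earlier. It assumes that every command accepts -c and needs no further arguments, which may not hold in practice; pipeline_commands is a hypothetical helper.

```python
# Commands of the typical pipeline, in execution order.
PIPELINE = [
    ["label"],
    ["prepare"],
    ["ml", "--train"],
    ["ml", "--test"],
    ["digitize"],
]


def pipeline_commands(config_path: str) -> list[list[str]]:
    """Build one argument list per pipeline step, all sharing one config file."""
    return [
        ["dawsonia", step[0], "-c", config_path, *step[1:]]
        for step in PIPELINE
    ]


for command in pipeline_commands("cfg/dawsonia.toml"):
    print(" ".join(command))
```

Each list could then be handed to subprocess.run(command, check=True), so that a failing step stops the pipeline before the next one starts.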
The commands are also sensitive to the following environment variables:
DAWSONIA_DEBUG_TABLE_DETECT
DAWSONIA_DEBUG_DIGITIZE
DAWSONIA_DEBUG (which activates all of the above debug options)
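One way to set these variables for a single run is to build a modified environment and pass it to a subprocess. In the sketch below, the value "1" is an assumption, since the guide does not state which values enable debugging, and with_debug is a hypothetical helper.

```python
import os

DEBUG_VARIABLES = (
    "DAWSONIA_DEBUG_TABLE_DETECT",
    "DAWSONIA_DEBUG_DIGITIZE",
    "DAWSONIA_DEBUG",
)


def with_debug(variable: str, base_env=None) -> dict:
    """Return a copy of the environment with one debug variable set."""
    if variable not in DEBUG_VARIABLES:
        raise ValueError(f"unknown debug variable: {variable}")
    env = dict(os.environ if base_env is None else base_env)
    env[variable] = "1"  # assumed value; the accepted values are not documented here
    return env


env = with_debug("DAWSONIA_DEBUG_DIGITIZE")
print(env["DAWSONIA_DEBUG_DIGITIZE"])
```

The returned dictionary can be passed as the env argument of subprocess.run, which leaves the environment of the calling process unchanged.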