Miscellaneous implementation details#
Input data and neural network models#
All large data files are kept in a separate git repository,
https://git.smhi.se/ai-for-obs/data, which also includes trained model files
organized under data/processed/dawsonia_model*. This path is passed as the
--model-path argument to dawsonia digitize, see below.
Configuration#
For ease of use with command-line operations, and to encode the table formats of the documents, we use TOML configuration files.
Command line parameters#
Some parameters specific to a machine or setup are often repeated. These can be
saved in a configuration file with sections named [dawsonia.<command>]. The
files are typically saved in the directory cfg, but can be placed anywhere
and passed to the commands as
dawsonia <command> -c cfg/dawsonia.toml ...
This saves some keystrokes and avoids repeatedly entering --model-path,
--output-path etc.
Note
Hyphens (-) in command-line options become underscores (_) in the TOML file.
Table formats#
Table formats are typically saved in the directory table_formats, but the
location can be specified with the --table-fmt-dir command-line argument. The
name of the file should correspond to what is returned by
dawsonia.io.get_station_name(). A file can contain:
- many [versions.<version_name>] sections to encode aliases for different versions of table formats,
- a [default] section to encode the most common table format version, and
- many [YYYY] sections to specify the particular version of the table format for a particular year.
See dawsonia.label.read_specific_table_format(), which returns
the format specific to a given PDF file.
Preprocessing#
The configuration file may include a section [default.preproc] for defining
the preprocessing operations. [YYYY.preproc] sections are also permissible.
These sections are parsed and converted into
dawsonia.typing.PreprocConfig, which is used in
dawsonia.image_preproc.
Transforms#
Image transformations can be specified in a section [default.transforms].
See for example table_formats/öland.toml. The section should follow the
specification in dawsonia.typing.Transforms.
Versions: Columns, rows and tables#
Each [versions.<version_name>] section should contain three keys:
- columns: headings of the columns
- rows: indices corresponding to the rows, often in the first column of the table
- tables: the shapes of the tables, i.e. how many rows and columns we expect to contain handwritten text. These should be listed in left-to-right and top-to-bottom order, as they appear on the page.
Command-line interface#
The primary mode of using dawsonia is from the command line. A typical pipeline executes the following commands in order:
- dawsonia label
- dawsonia prepare
- dawsonia ml --train
- dawsonia ml --test
- dawsonia digitize
The commands are also sensitive to the following environment variables:
- DAWSONIA_DEBUG_TABLE_DETECT
- DAWSONIA_DEBUG_DIGITIZE
- DAWSONIA_DEBUG (which activates all of the above debug modes)
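The relationship between the general and the stage-specific variables can be sketched as below. How dawsonia actually interprets the values is not documented here, so the sketch assumes that any non-empty value switches debugging on; debug_enabled is a hypothetical helper.

```python
import os
from collections.abc import Mapping


def debug_enabled(stage: str, environ: Mapping[str, str] = os.environ) -> bool:
    """Return True if debugging is on for a stage such as "DIGITIZE".

    Assumption: any non-empty value enables the flag, and DAWSONIA_DEBUG
    switches on every stage-specific debug mode.
    """
    return bool(
        environ.get("DAWSONIA_DEBUG") or environ.get(f"DAWSONIA_DEBUG_{stage}")
    )


# Example with an explicit mapping instead of the real environment.
print(debug_enabled("DIGITIZE", {"DAWSONIA_DEBUG": "1"}))
print(debug_enabled("DIGITIZE", {"DAWSONIA_DEBUG_TABLE_DETECT": "1"}))
```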
dawsonia#
DAWSONIA: Digitize hAndWritten obServatiONs In weather journAls (version 0.1.0b0)
This is the main command, which provides subcommands for the different stages of the pipeline.
dawsonia [OPTIONS] COMMAND [ARGS]...
Options
- --install-completion#
Install completion for the current shell.
- --show-completion#
Show completion for the current shell, to copy it or customize the installation.
digitize#
Digitize PDF / ZARR into a dataframe using a trained ML model and write it out.
dawsonia digitize [OPTIONS] COMMAND [ARGS]...
label#
Label PDF images to create raw training data.
dawsonia label [OPTIONS] COMMAND [ARGS]...
ml#
MLops: transform data, train, test
dawsonia ml [OPTIONS] COMMAND [ARGS]...
prepare#
Creates new train, validation, test and ground truth text files in “washington” format for the HTR network, and copies the images to serve as input data for the model.
Parameters
- n_train: int
Number of images in training set
- n_val: int
Number of images in validation set
- n_test: int
Number of images in test set. NOTE: If n_test == -1, all the files from label_path are used for testing.
- label_path: Path
Path to label directory (where the pictures are located).
dawsonia prepare [OPTIONS] COMMAND [ARGS]...
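The meaning of the three counts, including the special value n_test == -1, can be illustrated with the sketch below. It is not the implementation behind dawsonia prepare: split_files is a hypothetical helper, the files are taken in sorted order rather than shuffled, and returning empty training and validation sets when n_test == -1 is an assumption.

```python
from pathlib import Path


def split_files(files, n_train: int, n_val: int, n_test: int):
    """Split image files into training, validation and test sets.

    Assumption: with n_test == -1 every file goes to the test set and the
    training and validation sets are left empty.
    """
    files = sorted(files)
    if n_test == -1:
        return [], [], files
    if n_train + n_val + n_test > len(files):
        raise ValueError("not enough files for the requested split")
    train = files[:n_train]
    val = files[n_train:n_train + n_val]
    test = files[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Made-up file names standing in for the contents of label_path.
images = [Path(f"label_{i:02d}.png") for i in range(10)]
train, val, test = split_files(images, n_train=6, n_val=2, n_test=2)
print(len(train), len(val), len(test))
```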
Logging#
See dawsonia.log.init_logger() for the logging configuration. It includes handlers for logging into the console and filesystem.
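For readers unfamiliar with this setup, a logger with one console handler and one file handler can be built with the standard logging module as sketched below. This is a generic example, not the code of dawsonia.log.init_logger(); the logger name, file name and message format are assumptions.

```python
import logging
import tempfile
from pathlib import Path


def make_logger(name: str, log_file: Path) -> logging.Logger:
    """Create a logger that writes to both the console and a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Write the log file into a temporary directory for this demonstration.
log_dir = Path(tempfile.mkdtemp())
logger = make_logger("dawsonia_sketch", log_dir / "dawsonia.log")
logger.info("digitizing started")
```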