`dawsonia.digitize`#

Pipeline to digitize PDFs

Setting the environment variable DEBUG=1 shows the values that get discarded.

Module Contents#

Classes#

`_Metadata`
`Statistics`	Stores information about the tables detected and aggregated statistics of predictions as a JSON file.
`Page`	Stores a copy of the preprocessed image as a `*.webp` file.
`TableMetadata`	Stores sizes (number of rows and columns) and positions (coordinates) of the tables which were detected.
`SequentialPool`

Functions#

`make_executor`
`all_file_paths`
`digitize_book`	Digitize PDF / ZARR into a dataframe using a trained ML model and write it out.
`digitize_page_and_write_output`	Read page from book, load model, run model prediction and write result to filesystem.
`init_tokenizer`	Tokenizer for the ML model
`load_model`	Load checkpoint and initialize ML Model
`digitize_table_with_model`	Digitize table cell-by-cell by using the callback function `predictor`.
`check_probability_thresh`	Check if the first prediction is within the warning threshold

Data#

`logger`
`DAWSONIA_DEBUG_DIGITIZE`
`all_statistics_paths`
`all_probablities_paths`
`all_table_meta_paths`
`app`

API#

dawsonia.digitize.logger#: ‘getLogger(…)’

dawsonia.digitize.DAWSONIA_DEBUG_DIGITIZE#: None

class dawsonia.digitize._Metadata#

_subdir: ClassVar[str] = <Multiline-String>#

_ext: ClassVar[str]#: ‘.json’

classmethod _metadata_relative_to_output_path(output_path: pathlib.Path) → pathlib.Path#

classmethod ensure_output_path(path) → pathlib.Path#

to_json(path: pathlib.Path)#: Writes JSON file to the path pointing to the file or relative to the output path of the main result.

classmethod from_json(path: Path | str)#
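
The `to_json` / `from_json` pair, together with the `_subdir` and `_ext` class variables, suggests a dataclass-style round trip through a JSON file. The following is a minimal sketch of that pattern only, not the library's implementation: `MetadataSketch` and `StatisticsSketch` are hypothetical stand-ins, and the file name `page-001` and the way the path is built are assumptions.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import ClassVar


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for `_Metadata`; not the library's code."""

    _subdir: ClassVar[str] = ""
    _ext: ClassVar[str] = ".json"

    def to_json(self, path: Path) -> Path:
        # Assumption: the file extension is taken from the class variable.
        path = Path(path).with_suffix(self._ext)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(asdict(self)))
        return path

    @classmethod
    def from_json(cls, path):
        # Rebuild the instance from the stored field values.
        return cls(**json.loads(Path(path).read_text()))


@dataclass
class StatisticsSketch(MetadataSketch):
    """Hypothetical subset of the fields listed for `Statistics`."""

    _subdir: ClassVar[str] = "statistics"
    tables_detected: int = 0
    predictions_total: int = 0


with tempfile.TemporaryDirectory() as tmp:
    stats = StatisticsSketch(tables_detected=2, predictions_total=40)
    written = stats.to_json(Path(tmp) / StatisticsSketch._subdir / "page-001")
    restored = StatisticsSketch.from_json(written)
```

Because `_subdir` and `_ext` are declared as `ClassVar`, they are not dataclass fields and do not end up in the JSON payload.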

class dawsonia.digitize.Statistics#

Bases: dawsonia.digitize._Metadata

Stores information about the tables detected and aggregated statistics of predictions as a JSON file.

tables_detected: int#: 0

predictions_total: int#: 0

predictions_above_thresh: int#: 0

predictions_empty_value: int#: 0

unset_values: int#: 0

_subdir: ClassVar[str]#: ‘statistics’

compute(result: pandas.DataFrame, probablities: pandas.DataFrame, prob_thresh: float)#
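
To illustrate how counters such as `predictions_total`, `predictions_above_thresh` and `predictions_empty_value` could be aggregated, here is a rough sketch that uses nested lists in place of the `pandas.DataFrame` arguments. Everything in it is an assumption: `StatsSketch` is hypothetical, the comparison is taken to be `>=`, and an empty prediction is taken to be an empty string. The actual `compute` may count differently.

```python
from dataclasses import dataclass


@dataclass
class StatsSketch:
    """Hypothetical counters mirroring some fields of `Statistics`."""

    predictions_total: int = 0
    predictions_above_thresh: int = 0
    predictions_empty_value: int = 0

    def compute(self, result, probabilities, prob_thresh):
        # Walk both grids in parallel, one cell at a time.
        for row_values, row_probs in zip(result, probabilities):
            for value, prob in zip(row_values, row_probs):
                self.predictions_total += 1
                if prob >= prob_thresh:  # assumption: inclusive threshold
                    self.predictions_above_thresh += 1
                if value == "":  # assumption: empty string marks an empty cell
                    self.predictions_empty_value += 1
        return self


result = [["12.5", ""], ["3.1", "7.0"]]
probabilities = [[0.95, 0.40], [0.81, 0.79]]
stats = StatsSketch().compute(result, probabilities, prob_thresh=0.8)
```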

class dawsonia.digitize.Page#

Bases: dawsonia.digitize._Metadata

Stores a copy of the preprocessed image as a `*.webp` file.

image: numpy.typing.NDArray#: ‘field(…)’

_subdir: ClassVar[str]#: ‘pages’

_ext: ClassVar[str]#: ‘.webp’

to_image(path)#

class dawsonia.digitize.TableMetadata#

Bases: dawsonia.digitize._Metadata

Stores sizes (number of rows and columns) and positions (coordinates) of the tables which were detected.

table_sizes: list[list[int]]#: ‘field(…)’

table_positions: list[list[float]]#: ‘field(…)’

_subdir: ClassVar[str]#: ‘table_meta’

set(table_sizes: dawsonia.typing.TableSizes, table_pos_arrays: dawsonia.typing.TablePosArrays)#

class dawsonia.digitize.SequentialPool(max_workers=None)#

Bases: concurrent.futures.Executor

submit(func, *args, **kwargs)#
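
An `Executor` subclass that only overrides `submit` is a common way to run tasks in the calling process while keeping the `concurrent.futures` interface, which helps when debugging. The sketch below shows that general pattern under the assumption that `SequentialPool` works along these lines; `SequentialPoolSketch` is a hypothetical name and the real class may differ.

```python
from concurrent.futures import Executor, Future


class SequentialPoolSketch(Executor):
    """Hypothetical executor that runs every task immediately, in order."""

    def __init__(self, max_workers=None):
        # Accepted for signature compatibility with real pools; unused here.
        self.max_workers = max_workers

    def submit(self, func, *args, **kwargs):
        future = Future()
        try:
            # Run the task right away and store its result in the future.
            future.set_result(func(*args, **kwargs))
        except Exception as exc:
            # Store the error so it surfaces through the future, as in real pools.
            future.set_exception(exc)
        return future


with SequentialPoolSketch() as pool:
    # Executor.map is inherited and is built on top of submit().
    squares = list(pool.map(pow, [1, 2, 3], [2, 2, 2]))
    failed = pool.submit(int, "not a number")
```

Code written against `concurrent.futures.Executor` can then switch between this and a process or thread pool without changes.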

dawsonia.digitize.make_executor(jobs: int) → concurrent.futures.Executor#

dawsonia.digitize.all_file_paths(output_path: pathlib.Path, subdir, suffix)#

dawsonia.digitize.all_statistics_paths#: ‘partial(…)’

dawsonia.digitize.all_probablities_paths#: ‘partial(…)’

dawsonia.digitize.all_table_meta_paths#: ‘partial(…)’
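
The three `all_*_paths` objects are shown as `partial(…)`, which points to `functools.partial` applied to `all_file_paths`. A minimal sketch of that arrangement follows. The directory layout, the non-recursive glob, the sorted return value and the `_sketch` names are all assumptions; only the `subdir` value `statistics` and the `.json` extension appear in this reference.

```python
import tempfile
from functools import partial
from pathlib import Path


def all_file_paths_sketch(output_path, subdir, suffix):
    """Hypothetical: list files with `suffix` directly under output_path/subdir."""
    return sorted((Path(output_path) / subdir).glob(f"*{suffix}"))


# Pre-fill subdir and suffix so callers only pass the output path.
all_statistics_paths_sketch = partial(
    all_file_paths_sketch, subdir="statistics", suffix=".json"
)

with tempfile.TemporaryDirectory() as tmp:
    stats_dir = Path(tmp) / "statistics"
    stats_dir.mkdir()
    for name in ("b.json", "a.json", "notes.txt"):
        (stats_dir / name).touch()
    found = [p.name for p in all_statistics_paths_sketch(Path(tmp))]
```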

dawsonia.digitize.digitize_book(path_file: pathlib.Path, first_date: str, last_date: str, size_cell: tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0), first_page: typing.Annotated[int, typer.Option('-f', '--first-page', help='the page number corresponding to first_date')] = 0, last_page: typing.Annotated[int, typer.Option('-l', '--last-page', help='the page number corresponding to last_date')] = 0, page_middle: typing.Annotated[int, typer.Option('-m', '--page-middle', help='X coordinate of middle of page to help the rotation correction')] = -1, table_fmt_dir: pathlib.Path = Path('table_formats'), model_path: pathlib.Path = Path('/local_disk', 'data', 'ai-for-obs', 'processed', 'dawsonia_model_2022-12-19'), prob_thresh: float = 0.8, output_path: pathlib.Path = Path('output', 'digitized'), output_text_fmt: bool = False, jobs: typing.Annotated[int, typer.Option(help='parallel jobs over pages in the book (default: max workers in the system)')] = -1, verbose: bool = False, config: typing.Annotated[pathlib.Path, typer.Option(*config_cli_names, **config_kwargs)] = Path('dawsonia.toml'))#: Digitize PDF / ZARR into a dataframe using a trained ML model and write it out.

dawsonia.digitize.digitize_page_and_write_output(book: dawsonia.io.Book, init_data: list[dict[str, numpy.typing.NDArray]], page_number: int, date_str: str, model_path: pathlib.Path, model_predict: collections.abc.Callable, prob_thresh: float, output_path_page: pathlib.Path, output_text_fmt: bool, debug: bool) → tuple[int, str, dawsonia.digitize.Statistics]#: Read page from book, load model, run model prediction and write result to filesystem.

dawsonia.digitize.init_tokenizer()#: Tokenizer for the ML model

dawsonia.digitize.load_model(model_path: pathlib.Path, vocab_size: int, source: str = 'washington', arch: str = 'flor') → dawsonia.ml.ml.HTRModel#: Load checkpoint and initialize ML Model

dawsonia.digitize.digitize_table_with_model(book: dawsonia.io.Book, predictor: collections.abc.Callable[[numpy.typing.NDArray], tuple[dawsonia.typing.Prediction, dawsonia.typing.Probability]], image_page: numpy.typing.NDArray, table_pos_array: numpy.typing.NDArray, table_size: collections.abc.Iterable[int], init_data: dict[str, numpy.typing.NDArray], row_start: int = 0, col_start: int = 0, debug: bool = False) → tuple[pandas.DataFrame, pandas.DataFrame]#: Digitize table cell-by-cell by using the callback function `predictor`.
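
The callback design can be pictured as a loop that hands each cell to `predictor` and collects the returned prediction and probability into two grids of the same shape, matching the pair of DataFrames in the return type. This is only a sketch under stated assumptions: cells are plain strings rather than image arrays, `digitize_cells_sketch` and `fake_predictor` are hypothetical, and the real function also handles the page image, table positions and `init_data`.

```python
def digitize_cells_sketch(cells, predictor):
    """Hypothetical: apply `predictor` to every cell of a 2D grid."""
    predictions, probabilities = [], []
    for row in cells:
        row_predictions, row_probabilities = [], []
        for cell in row:
            # The callback returns a (prediction, probability) pair per cell.
            text, prob = predictor(cell)
            row_predictions.append(text)
            row_probabilities.append(prob)
        predictions.append(row_predictions)
        probabilities.append(row_probabilities)
    return predictions, probabilities


def fake_predictor(cell):
    # Stand-in for a model: echoes the cell and reports a made-up confidence.
    return cell.strip(), 0.9 if cell.strip() else 0.1


cells = [[" 12.5 ", ""], ["3.1", " 7.0"]]
predicted, probs = digitize_cells_sketch(cells, fake_predictor)
```

Passing the model as a callback keeps the table traversal independent of how predictions are produced, so a stub like `fake_predictor` can replace the model in tests.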

dawsonia.digitize.check_probability_thresh(predict, probablities, debug, err_threshold=80, prob_diff_warn_threshold=50)#: Check if the first prediction is within the warning threshold

dawsonia.digitize.app#: ‘Typer(…)’
