dawsonia.digitize#

Pipeline to digitize PDFs

Setting the environment variable DEBUG=1 shows the values that get discarded.

Module Contents#

Classes#

_Metadata

Statistics

Stores information about the tables detected and aggregated statistics of predictions as a JSON file

Page

Stores a copy of the preprocessed image as a *.webp file.

TableMetadata

Stores sizes (number of rows and columns) and positions (coordinates) of the tables which were detected.

SequentialPool

Functions#

make_executor

all_file_paths

digitize_book

Digitize PDF / ZARR into a dataframe using a trained ML model and write it out.

digitize_page_and_write_output

Read page from book, load model, run model prediction and write result to filesystem.

init_tokenizer

Tokenizer for the ML model

load_model

Load checkpoint and initialize ML Model

digitize_table_with_model

Digitize table cell-by-cell by using the callback function predictor.

check_probability_thresh

Check if the first prediction is within the warning threshold

Data#

logger

DAWSONIA_DEBUG_DIGITIZE

all_statistics_paths

all_probabilities_paths

all_table_meta_paths

app

API#

dawsonia.digitize.logger#

‘getLogger(…)’

dawsonia.digitize.DAWSONIA_DEBUG_DIGITIZE#

None

class dawsonia.digitize._Metadata#
_subdir: ClassVar[str] = <Multiline-String>#
_ext: ClassVar[str]#

‘.json’

classmethod _metadata_relative_to_output_path(output_path: pathlib.Path) → pathlib.Path#
classmethod ensure_output_path(path) → pathlib.Path#
to_json(path: pathlib.Path)#

Writes JSON file to the path pointing to the file or relative to the output path of the main result.

classmethod from_json(path: Path | str)#
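
The `to_json` / `from_json` pair above suggests a JSON round trip of the metadata fields. The following is a minimal sketch of that pattern, not the dawsonia implementation: `ExampleStats` is a hypothetical stand-in class, the file name `page-001.json` is made up, and the behaviour of resolving a path relative to the output path of the main result is not reproduced here.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ExampleStats:
    """Hypothetical stand-in for a _Metadata subclass (not dawsonia code)."""

    predictions_total: int = 0
    predictions_above_thresh: int = 0

    def to_json(self, path: Path) -> None:
        # Create the parent directory if needed, then serialise all fields.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def from_json(cls, path: "Path | str") -> "ExampleStats":
        # Rebuild an instance from the stored field values.
        return cls(**json.loads(Path(path).read_text()))


with tempfile.TemporaryDirectory() as tmp:
    # 'statistics' and '.json' mirror the _subdir and _ext values listed on this page.
    target = Path(tmp) / "statistics" / "page-001.json"
    ExampleStats(predictions_total=12, predictions_above_thresh=9).to_json(target)
    restored = ExampleStats.from_json(target)
```
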
class dawsonia.digitize.Statistics#

Bases: dawsonia.digitize._Metadata

Stores information about the tables detected and aggregated statistics of predictions as a JSON file

predictions_total: int#

0

predictions_above_thresh: int#

0

predictions_empty_value: int#

0

unset_values: int#

0

_subdir: ClassVar[str]#

‘statistics’

compute(result: pandas.DataFrame, probabilities: pandas.DataFrame, prob_thresh: float)#
class dawsonia.digitize.Page#

Bases: dawsonia.digitize._Metadata

Stores a copy of the preprocessed image as a *.webp file.

image: numpy.typing.NDArray#

‘field(…)’

_subdir: ClassVar[str]#

‘pages’

_ext: ClassVar[str]#

‘.webp’

to_image(path)#
class dawsonia.digitize.TableMetadata#

Bases: dawsonia.digitize._Metadata

Stores sizes (number of rows and columns) and positions (coordinates) of the tables which were detected.

table_sizes: list[list[int]]#

‘field(…)’

table_positions: list[list[float]]#

‘field(…)’

_subdir: ClassVar[str]#

‘table_meta’

set(table_sizes: dawsonia.typing.TableSizes, table_pos_arrays: dawsonia.typing.TablePosArrays)#
class dawsonia.digitize.SequentialPool(max_workers=None)#

Bases: concurrent.futures.Executor

submit(func, *args, **kwargs)#
dawsonia.digitize.make_executor(jobs: int) → concurrent.futures.Executor#
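
`SequentialPool` and `make_executor` are undocumented here, so the following is only a sketch of how a sequential `concurrent.futures.Executor` and a factory for it could look. The names `SequentialPoolSketch` and `make_executor_sketch` are hypothetical, and the rule for choosing between a sequential and a process pool is an assumption (the `jobs` help text of `digitize_book` only states that the default means the maximum number of workers in the system).

```python
from concurrent.futures import Executor, Future, ProcessPoolExecutor


class SequentialPoolSketch(Executor):
    """Hypothetical executor that runs every task inline in the caller."""

    def __init__(self, max_workers=None):
        # max_workers is accepted for interface compatibility and ignored.
        pass

    def submit(self, func, *args, **kwargs):
        # Run the task immediately and wrap the outcome in a completed Future.
        future = Future()
        try:
            future.set_result(func(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


def make_executor_sketch(jobs: int) -> Executor:
    # Assumption: one job means "no parallelism"; non-positive means "use all workers".
    if jobs == 1:
        return SequentialPoolSketch()
    return ProcessPoolExecutor(max_workers=jobs if jobs > 0 else None)


with make_executor_sketch(1) as pool:
    squares = [pool.submit(pow, n, 2).result() for n in range(4)]
```

Running tasks inline keeps tracebacks and debugger sessions in a single process, which is a common reason to provide such a pool alongside a parallel one.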
dawsonia.digitize.all_file_paths(output_path: pathlib.Path, subdir, suffix)#
dawsonia.digitize.all_statistics_paths#

‘partial(…)’

dawsonia.digitize.all_probabilities_paths#

‘partial(…)’

dawsonia.digitize.all_table_meta_paths#

‘partial(…)’
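
The three `partial(…)` values above are presumably specialisations of `all_file_paths` with `subdir` and `suffix` bound in advance. A minimal sketch of that idea follows; `all_file_paths_sketch` is a hypothetical helper, and the directory layout `<output_path>/<subdir>/*<suffix>` as well as the sorted result are assumptions, not confirmed behaviour of the library.

```python
import tempfile
from functools import partial
from pathlib import Path


def all_file_paths_sketch(output_path: Path, subdir, suffix):
    # Assumption: files live in <output_path>/<subdir>/ and share one suffix.
    return sorted((output_path / subdir).glob(f"*{suffix}"))


# Bind subdir and suffix once; callers then only pass the output path.
all_statistics_paths_sketch = partial(
    all_file_paths_sketch, subdir="statistics", suffix=".json"
)

with tempfile.TemporaryDirectory() as tmp:
    stats_dir = Path(tmp) / "statistics"
    stats_dir.mkdir()
    for name in ("page-002.json", "page-001.json", "notes.txt"):
        (stats_dir / name).touch()
    found = [p.name for p in all_statistics_paths_sketch(Path(tmp))]
```
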

dawsonia.digitize.digitize_book(path_file: pathlib.Path, size_cell: tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0), size_cell_alt: typing.Optional[tuple[float, float, float, float]] = None, first_page: typing.Annotated[int, typer.Option('-f', '--first-page', help='the page number corresponding to the first page you want digitized')] = 0, last_page: typing.Annotated[int, typer.Option('-l', '--last-page', help='the page number corresponding to the last page you want digitized')] = 0, page_middle: typing.Annotated[int, typer.Option('-m', '--page-middle', help='X coordinate of middle of page to help the rotation correction')] = -1, table_fmt_dir: pathlib.Path = Path('table_formats'), model_path: pathlib.Path = Path('/local_disk', 'data', 'ai-for-obs', 'processed', 'dawsonia_model_2022-12-19'), prob_thresh: float = 0.8, output_path: pathlib.Path = Path('output', 'digitized'), output_text_fmt: bool = False, jobs: typing.Annotated[int, typer.Option(help='parallel jobs over pages in the book (default: max workers in the system)')] = -1, verbose: bool = False, config: typing.Annotated[pathlib.Path, typer.Option(*config_cli_names, **config_kwargs)] = Path('dawsonia.toml'))#

Digitize PDF / ZARR into a dataframe using a trained ML model and write it out.

dawsonia.digitize.digitize_page_and_write_output(book: dawsonia.io.Book, init_data: list[dict[str, numpy.typing.NDArray]], page_number: int, page_str: str, model_path: pathlib.Path, model_predict: collections.abc.Callable, prob_thresh: float, output_path_page: pathlib.Path, output_text_fmt: bool, debug: bool, size_cell_alt: Optional[tuple[float, float, float, float]] = None) → tuple[int, str, dawsonia.digitize.Statistics]#

Read page from book, load model, run model prediction and write result to filesystem.

dawsonia.digitize.init_tokenizer()#

Tokenizer for the ML model

dawsonia.digitize.load_model(model_path: pathlib.Path, vocab_size: int, source: str = 'washington', arch: str = 'flor') → dawsonia.ml.ml.HTRModel#

Load checkpoint and initialize ML Model

dawsonia.digitize.digitize_table_with_model(book: dawsonia.io.Book, predictor: collections.abc.Callable[[numpy.typing.NDArray], tuple[dawsonia.typing.Prediction, dawsonia.typing.Probability]], image_page: numpy.typing.NDArray, table_pos_array: numpy.typing.NDArray, table_size: collections.abc.Iterable[int], init_data: dict[str, numpy.typing.NDArray], row_start: int = 0, col_start: int = 0, debug: bool = False, save_cells_dir: Optional[pathlib.Path] = None, cell_prefix: Optional[str] = None, size_cell_alt: Optional[tuple[float, float, float, float]] = None) → tuple[pandas.DataFrame, pandas.DataFrame]#

Digitize table cell-by-cell by using the callback function predictor.

dawsonia.digitize.check_probability_thresh(predict, probabilities, debug, err_threshold=80, prob_diff_warn_threshold=50)#

Check if the first prediction is within the warning threshold

dawsonia.digitize.app#

‘Typer(…)’