dawsonia.digitize
Pipeline to digitize PDFs
Setting the environment variable DEBUG=1 shows the values that get discarded.
Module Contents
Classes
Statistics | Stores information about the tables detected and aggregated statistics of predictions as a JSON file
Page | Stores a copy of the preprocessed image as a *.webp file.
TableMetadata | Stores sizes (number of rows and columns) and positions (coordinates) of the tables which were detected.
Functions
digitize_book | Digitize PDF / ZARR into a dataframe using a trained ML model and write it out.
digitize_page_and_write_output | Read page from book, load model, run model prediction and write result to filesystem.
init_tokenizer | Tokenizer for the ML model
load_model | Load checkpoint and initialize ML Model
digitize_table_with_model | Digitize table cell-by-cell by using the callback function predictor.
check_probability_thresh | Check if the first prediction is within warning threshold
Data
API
- dawsonia.digitize.logger
  ‘getLogger(…)’
- dawsonia.digitize.DAWSONIA_DEBUG_DIGITIZE
  None
- class dawsonia.digitize._Metadata
  - classmethod _metadata_relative_to_output_path(output_path: pathlib.Path) → pathlib.Path
  - classmethod ensure_output_path(path) → pathlib.Path
  - to_json(path: pathlib.Path)
    Writes JSON file to the path pointing to the file or relative to the output path of the main result.
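The JSON-writing behavior described for `to_json` can be sketched with a dataclass and the standard library. This is a minimal illustration under stated assumptions, not dawsonia's implementation: the class `ExampleMetadata`, its fields, the example file path, and the choice to create parent directories are all hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ExampleMetadata:
    """Hypothetical stand-in for a metadata record; not part of dawsonia."""

    page_number: int
    n_tables: int

    def to_json(self, path: Path) -> Path:
        # Assumption: create the parent directory before writing.
        path.parent.mkdir(parents=True, exist_ok=True)
        # Serialize the dataclass fields as a JSON object.
        path.write_text(json.dumps(asdict(self), indent=2))
        return path


with tempfile.TemporaryDirectory() as tmp:
    # The subdirectory and file name are placeholders for this example.
    written = ExampleMetadata(page_number=3, n_tables=2).to_json(
        Path(tmp) / "metadata" / "page-3.json"
    )
    loaded = json.loads(written.read_text())
```

Reading the file back yields the same field names and values that the dataclass held.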
- class dawsonia.digitize.Statistics
  Bases: dawsonia.digitize._Metadata
  Stores information about the tables detected and aggregated statistics of predictions as a JSON file
  - compute(result: pandas.DataFrame, probablities: pandas.DataFrame, prob_thresh: float)
- class dawsonia.digitize.Page
  Bases: dawsonia.digitize._Metadata
  Stores a copy of the preprocessed image as a *.webp file.
  - image: numpy.typing.NDArray
    ‘field(…)’
  - to_image(path)
- class dawsonia.digitize.TableMetadata
  Bases: dawsonia.digitize._Metadata
  Stores sizes (number of rows and columns) and positions (coordinates) of the tables which were detected.
  - set(table_sizes: dawsonia.typing.TableSizes, table_pos_arrays: dawsonia.typing.TablePosArrays)
- class dawsonia.digitize.SequentialPool(max_workers=None)
  Bases: concurrent.futures.Executor
  - submit(func, *args, **kwargs)
- dawsonia.digitize.make_executor(jobs: int) → concurrent.futures.Executor
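An `Executor` subclass that only overrides `submit`, paired with a factory that takes a job count, suggests a common pattern: run tasks inline when no parallelism is wanted, and hand them to a real pool otherwise. The sketch below illustrates that pattern and is not the dawsonia source. The names `InlinePool` and `pick_executor`, the rule that `jobs == 1` means sequential, and the use of a thread pool for the parallel case are all assumptions.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor


class InlinePool(Executor):
    """Hypothetical sequential executor: runs each task immediately."""

    def __init__(self, max_workers=None):
        # max_workers is accepted for interface compatibility and ignored.
        self.max_workers = max_workers

    def submit(self, func, /, *args, **kwargs):
        future = Future()
        try:
            future.set_result(func(*args, **kwargs))
        except Exception as exc:
            # Store the error so future.result() re-raises it, as real pools do.
            future.set_exception(exc)
        return future


def pick_executor(jobs: int) -> Executor:
    # Assumption: jobs == 1 means "no parallelism"; a value below 1 lets
    # the pool choose its own default worker count.
    if jobs == 1:
        return InlinePool()
    return ThreadPoolExecutor(max_workers=None if jobs < 1 else jobs)


with pick_executor(1) as pool:
    # Executor.map is inherited and calls submit() for every item.
    squares = list(pool.map(lambda x: x * x, [1, 2, 3]))
```

Because the inline variant returns ordinary `Future` objects, calling code can stay the same whether it runs sequentially or in parallel, which also makes debugging with tracebacks simpler.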
- dawsonia.digitize.all_file_paths(output_path: pathlib.Path, subdir, suffix)
- dawsonia.digitize.all_statistics_paths
  ‘partial(…)’
- dawsonia.digitize.all_probablities_paths
  ‘partial(…)’
- dawsonia.digitize.all_table_meta_paths
  ‘partial(…)’
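The three `partial(…)` entries above are presumably specializations of `all_file_paths` with `subdir` and `suffix` bound in advance. The sketch below shows that general `functools.partial` pattern only. The helper `find_paths`, its globbing behavior, and the `"reports"` / `".json"` values are hypothetical and say nothing about the arguments dawsonia actually binds.

```python
import tempfile
from functools import partial
from pathlib import Path


def find_paths(output_path: Path, subdir: str, suffix: str) -> list[Path]:
    """Hypothetical helper: list files ending in `suffix` under output_path/subdir."""
    return sorted((output_path / subdir).glob(f"*{suffix}"))


# Placeholder subdir and suffix, chosen for this example only.
find_json_reports = partial(find_paths, subdir="reports", suffix=".json")

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "reports").mkdir()
    (root / "reports" / "a.json").write_text("{}")
    (root / "reports" / "b.txt").write_text("")
    # Only output_path remains to be supplied at call time.
    found = [p.name for p in find_json_reports(root)]
```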
- dawsonia.digitize.digitize_book(
    path_file: pathlib.Path,
    first_date: str,
    last_date: str,
    size_cell: tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0),
    first_page: typing.Annotated[int, typer.Option('-f', '--first-page', help='the page number corresponding to first_date')] = 0,
    last_page: typing.Annotated[int, typer.Option('-l', '--last-page', help='the page number corresponding to last_date')] = 0,
    page_middle: typing.Annotated[int, typer.Option('-m', '--page-middle', help='X coordinate of middle of page to help the rotation correction')] = -1,
    table_fmt_dir: pathlib.Path = Path('table_formats'),
    model_path: pathlib.Path = Path('/local_disk', 'data', 'ai-for-obs', 'processed', 'dawsonia_model_2022-12-19'),
    prob_thresh: float = 0.8,
    output_path: pathlib.Path = Path('output', 'digitized'),
    output_text_fmt: bool = False,
    jobs: typing.Annotated[int, typer.Option(help='parallel jobs over pages in the book (default: max workers in the system)')] = -1,
    verbose: bool = False,
    config: typing.Annotated[pathlib.Path, typer.Option(*config_cli_names, **config_kwargs)] = Path('dawsonia.toml'),
  )
  Digitize PDF / ZARR into a dataframe using a trained ML model and write it out.
- dawsonia.digitize.digitize_page_and_write_output(
    book: dawsonia.io.Book,
    init_data: list[dict[str, numpy.typing.NDArray]],
    page_number: int,
    date_str: str,
    model_path: pathlib.Path,
    model_predict: collections.abc.Callable,
    prob_thresh: float,
    output_path_page: pathlib.Path,
    output_text_fmt: bool,
    debug: bool,
  ) → tuple[int, str, dawsonia.digitize.Statistics]
  Read page from book, load model, run model prediction and write result to filesystem.
- dawsonia.digitize.init_tokenizer()
  Tokenizer for the ML model
- dawsonia.digitize.load_model(model_path: pathlib.Path, vocab_size: int, source: str = 'washington', arch: str = 'flor') → dawsonia.ml.ml.HTRModel
  Load checkpoint and initialize ML Model
- dawsonia.digitize.digitize_table_with_model(
    book: dawsonia.io.Book,
    predictor: collections.abc.Callable[[numpy.typing.NDArray], tuple[dawsonia.typing.Prediction, dawsonia.typing.Probability]],
    image_page: numpy.typing.NDArray,
    table_pos_array: numpy.typing.NDArray,
    table_size: collections.abc.Iterable[int],
    init_data: dict[str, numpy.typing.NDArray],
    row_start: int = 0,
    col_start: int = 0,
    debug: bool = False,
  ) → tuple[pandas.DataFrame, pandas.DataFrame]
  Digitize table cell-by-cell by using the callback function predictor.
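The cell-by-cell callback idea can be illustrated without the real data types. The sketch below is a simplified stand-in, not dawsonia's implementation: it uses nested lists instead of NumPy arrays and DataFrames, and the names `digitize_cells` and `toy_predictor`, as well as the row and column edge lists, are hypothetical. It crops each cell from a grid, passes the crop to a predictor callback that returns a prediction and a probability, and collects both into parallel tables.

```python
def digitize_cells(image, row_edges, col_edges, predictor):
    """Hypothetical sketch: call `predictor` on each cell crop of a grid."""
    values, probs = [], []
    for top, bottom in zip(row_edges, row_edges[1:]):
        row_values, row_probs = [], []
        for left, right in zip(col_edges, col_edges[1:]):
            # Crop the cell as a nested list of pixel rows.
            cell = [line[left:right] for line in image[top:bottom]]
            value, prob = predictor(cell)
            row_values.append(value)
            row_probs.append(prob)
        values.append(row_values)
        probs.append(row_probs)
    return values, probs


def toy_predictor(cell):
    # Stand-in for a model: "predicts" the pixel sum with a fixed confidence.
    return str(sum(map(sum, cell))), 0.9


image = [[1, 2, 3, 4], [5, 6, 7, 8]]
values, probs = digitize_cells(image, [0, 1, 2], [0, 2, 4], toy_predictor)
```

Passing the predictor in as a callable keeps the grid traversal independent of any particular model, so a stub like `toy_predictor` can replace the model in tests.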
- dawsonia.digitize.check_probability_thresh(predict, probablities, debug, err_threshold=80, prob_diff_warn_threshold=50)
  Check if the first prediction is within warning threshold
- dawsonia.digitize.app
  ‘Typer(…)’