Digitization#

Once we are sure that all or some of the tables can be detected, we are ready to execute the whole digitization pipeline. The command dawsonia digitize processes a PDF / Zarr file in dawsonia.digitize.digitize_book().

Note

Pre-trained models for HTR are available at https://git.smhi.se/ai-for-obs/data. For example, data/models/dawsonia/2024-03-29 is one such model.

Demo: digitization#

Pandas 3.0 deprecation warning

Due to an API change in the upcoming Pandas 3.0, some deprecation warnings are emitted. These will have to be fixed in the future, but as long as we use Pandas 2.x they are harmless, so we suppress them below.

import warnings
warnings.simplefilter("ignore", category=FutureWarning, lineno=0, append=False)
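
The filter above applies to the rest of the session. If we only want to silence warnings around a single call, a context manager keeps the suppression local. This is a minimal sketch using only the standard library; noisy_call is a hypothetical stand-in for any function that emits a FutureWarning and is not part of dawsonia.

```python
import warnings


def noisy_call():
    """Hypothetical stand-in for a call that emits a FutureWarning."""
    warnings.warn("behaviour will change in a future release", FutureWarning)
    return "done"


# Filters changed inside the block are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_call()  # the FutureWarning is not shown here

print(result)
```

Note that warning filters are process-local state, so this approach only affects code running in the current Python process.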

Now we can digitize using the data repository and table_formats directory that we have prepared in the previous notebooks. We could either run the command:

dawsonia digitize \
    -f 3 -l 7 \
    --model-path data/models/dawsonia/2024-03-29 \
    --output-path digitized \
    --table-fmt-dir table_formats \
    --verbose \
    data/raw_zarr/bjuröklubb_example/bjuröklubb_1927.zarr 1927-01-01 1927-12-31

or call the Python function with the same arguments:

from dawsonia.digitize import digitize_book
from pathlib import Path

cwd = Path.cwd()

digitize_book(
    cwd / "data/raw_zarr/bjuröklubb_example/bjuröklubb_1927.zarr",
    "1927-01-01", "1927-12-31",
    first_page=3, last_page=7,
    model_path=cwd / "data/models/dawsonia/2024-03-29",
    output_path=cwd / "digitized", 
    table_fmt_dir=cwd / "table_formats",
    verbose=True,
)
INFO     2024-06-05 12:56:23,082 - dawsonia.io._zarr - INFO - table_format = TableFormat(name_idx='tid',           
         columns=[['term_på_baro', 'barom', 'torra_term', 'våta_term', 'moln_slag_lägre', 'moln_mängd_lägre',      
         'moln_slag_högre', 'moln_mängd_total'], ['vind_riktning', 'vind_beaufort', 'vind_m_sek', 'sikt',          
         'sjögang', 'maximi_term', 'minimi_term', 'nederbörd_mängd', 'nederbörd_slag']], rows=(datetime.time(2, 0),
         datetime.time(8, 0), datetime.time(14, 0), datetime.time(19, 0), datetime.time(21, 0)), tables=[[5, 8],   
         [5, 9], [3, 1], [4, 2], [4, 5]], preproc=PreprocConfig(table_modif=True, corr_rotate=True,                
         row_idx_unit=<TimeUnits.HOURS: 1>, method=<PreprocMethods.SCIPY_PROJ: 1>, idx_tables_size_verify=[0, 1]), 
         transforms=None, version='1', station='bjuröklubb_example')                                               
INFO     2024-06-05 12:56:23,104 - dawsonia.io._zarr - INFO - Setting first_page = 3                               
INFO     2024-06-05 12:56:23,109 - dawsonia.digitize - INFO - Digitizing Book(file=<zarr.hierarchy.Group '/'       
         read-only>, page_middle=None, table_format=TableFormat(name_idx='tid', columns=[['term_på_baro', 'barom', 
         'torra_term', 'våta_term', 'moln_slag_lägre', 'moln_mängd_lägre', 'moln_slag_högre', 'moln_mängd_total'], 
         ['vind_riktning', 'vind_beaufort', 'vind_m_sek', 'sikt', 'sjögang', 'maximi_term', 'minimi_term',         
         'nederbörd_mängd', 'nederbörd_slag']], rows=(datetime.time(2, 0), datetime.time(8, 0), datetime.time(14,  
         0), datetime.time(19, 0), datetime.time(21, 0)), tables=[[5, 8], [5, 9], [3, 1], [4, 2], [4, 5]],         
         preproc=PreprocConfig(table_modif=True, corr_rotate=True, row_idx_unit=<TimeUnits.HOURS: 1>,              
         method=<PreprocMethods.SCIPY_PROJ: 1>, idx_tables_size_verify=[0, 1]), transforms=None, version='1',      
         station='bjuröklubb_example'), size_cell=[1.0, 1.0, 1.0, 1.0], preprocessor=Preprocessor())               
INFO     2024-06-05 12:56:26,589 - dawsonia.digitize - INFO - Writing                                              
         /home/a002487/Sources/ai-for-obs/dawsonia/docs/source/getting_started/digitized/bjuröklubb_example/bjurökl
         ubb_1927/1927-01-01                                                                                       
INFO     2024-06-05 12:56:26,702 - dawsonia.digitize - INFO - Writing                                              
         /home/a002487/Sources/ai-for-obs/dawsonia/docs/source/getting_started/digitized/bjuröklubb_example/bjurökl
         ubb_1927/1927-01-02                                                                                       
INFO     2024-06-05 12:56:26,713 - dawsonia.digitize - INFO - Writing                                              
         /home/a002487/Sources/ai-for-obs/dawsonia/docs/source/getting_started/digitized/bjuröklubb_example/bjurökl
         ubb_1927/1927-01-03                                                                                       
INFO     2024-06-05 12:56:26,720 - dawsonia.digitize - INFO - Writing                                              
         /home/a002487/Sources/ai-for-obs/dawsonia/docs/source/getting_started/digitized/bjuröklubb_example/bjurökl
         ubb_1927/1927-01-04                                                                                       
INFO     2024-06-05 12:57:12,298 - dawsonia.digitize - INFO - page_number = 6 date_str = '1927-01-04'              
INFO     2024-06-05 12:57:12,303 - dawsonia.digitize - INFO - output_stats = Statistics(tables_detected=2,         
         predictions_total=85, predictions_above_thresh=80, predictions_empty_value=50, unset_values=0)            
INFO     2024-06-05 12:57:12,308 - dawsonia.digitize - INFO - page_number = 5 date_str = '1927-01-03'              
INFO     2024-06-05 12:57:12,313 - dawsonia.digitize - INFO - output_stats = Statistics(tables_detected=2,         
         predictions_total=85, predictions_above_thresh=85, predictions_empty_value=48, unset_values=0)            
INFO     2024-06-05 12:57:12,318 - dawsonia.digitize - INFO - page_number = 3 date_str = '1927-01-01'              
INFO     2024-06-05 12:57:12,323 - dawsonia.digitize - INFO - output_stats = Statistics(tables_detected=2,         
         predictions_total=85, predictions_above_thresh=80, predictions_empty_value=50, unset_values=0)            
INFO     2024-06-05 12:57:12,329 - dawsonia.digitize - INFO - page_number = 4 date_str = '1927-01-02'              
INFO     2024-06-05 12:57:12,334 - dawsonia.digitize - INFO - output_stats = Statistics(tables_detected=2,         
         predictions_total=85, predictions_above_thresh=78, predictions_empty_value=55, unset_values=0)            

Tip

Python: A script such as scripts/digitize.py or scripts/digitize_api.py may be used.

CLI: Some of these options are used repeatedly and are best stored in a cfg file such as cfg/lumi.toml, which is present in the dawsonia repository. Once created, this file can be passed on the command line as dawsonia digitize -c cfg/myconfig.toml.

!tail -n 6 ../../../cfg/lumi.toml
[dawsonia.digitize]
first_page = 3
last_page = 100000
verbose = true
model_path = "~/scratch/data/processed/dawsonia_model_2022-12-19"
output_path = "~/scratch/data/processed/digitized_2022-12-19"

Results#

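As mentioned earlier, several files are generated for every page: the digitized text, the preprocessed page image, the prediction probabilities, the statistics, and the table metadata. Let’s take a look at these.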

ls digitized/bjuröklubb_example/bjuröklubb_1927/*
digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-01.parquet
digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-02.parquet
digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-03.parquet
digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-04.parquet

digitized/bjuröklubb_example/bjuröklubb_1927/pages:
1927-01-01.webp  1927-01-02.webp  1927-01-03.webp  1927-01-04.webp

digitized/bjuröklubb_example/bjuröklubb_1927/probablities:
1927-01-01.parquet  1927-01-02.parquet  1927-01-03.parquet  1927-01-04.parquet

digitized/bjuröklubb_example/bjuröklubb_1927/statistics:
1927-01-01.json  1927-01-02.json  1927-01-03.json  1927-01-04.json

digitized/bjuröklubb_example/bjuröklubb_1927/table_meta:
1927-01-01.json  1927-01-02.json  1927-01-03.json  1927-01-04.json

import json
from pandas import read_parquet
from PIL import Image

def read_json(path):
    return json.loads(Path(path).read_text())
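
Since every page produces the same set of files, it can be convenient to build all paths for a given date in one place. The helper below is a hypothetical convenience function, not part of dawsonia; it only mirrors the directory layout in the listing above, including the spelling of the probablities directory.

```python
from pathlib import Path


def page_outputs(book_dir, date):
    """Return the expected output paths for one page, keyed by kind.

    Hypothetical helper that follows the directory listing shown above.
    """
    book_dir = Path(book_dir)
    return {
        "text": book_dir / f"{date}.parquet",
        "page": book_dir / "pages" / f"{date}.webp",
        # Directory name spelled as it appears on disk.
        "probabilities": book_dir / "probablities" / f"{date}.parquet",
        "statistics": book_dir / "statistics" / f"{date}.json",
        "table_meta": book_dir / "table_meta" / f"{date}.json",
    }


outputs = page_outputs("digitized/bjuröklubb_example/bjuröklubb_1927", "1927-01-01")
for kind, path in outputs.items():
    print(f"{kind:<14} {path}")
```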

Preprocessed Image#

im = Image.open("digitized/bjuröklubb_example/bjuröklubb_1927/pages/1927-01-01.webp")
im
[Image: preprocessed page for 1927-01-01]

Digitized Text#

read_parquet("digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-01.parquet")
term_på_baro barom torra_term våta_term moln_slag_lägre moln_mängd_lägre moln_slag_högre moln_mängd_total vind_riktning vind_beaufort vind_m_sek sikt sjögang maximi_term minimi_term nederbörd_mängd nederbörd_slag
tid
02:00:00
08:00:00 +11.0 747.0 -15.6 7 10 10 5 9 9 x 0.6 5.
14:00:00 +10.0 50 -16.8 7 10 10 19. 5 9 9
19:00:00 +9.4 753. -15.8 6 3 3 15. 3 5 9 x 0.9 5.
21:00:00
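
The digitized values appear to be raw transcriptions (for example "753." or "x") rather than cleaned numbers. If we want numeric values for further analysis, a conversion step is needed. The following is a minimal sketch assuming the cells are stored as text; to_number is a hypothetical helper and the sample cells are taken from the output above.

```python
def to_number(cell):
    """Return the cell as a float, or None if it is empty or not numeric."""
    text = (cell or "").strip()
    try:
        return float(text)
    except ValueError:
        return None


# A few transcribed cells as they appear in the output above.
cells = ["+11.0", "753.", "x", ""]
numbers = [to_number(cell) for cell in cells]

for cell, number in zip(cells, numbers):
    print(f"{cell!r:>8} -> {number}")
```

Returning None instead of raising keeps unreadable cells visible as missing values, so they can be reviewed later rather than silently dropped.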

Probabilities#

read_parquet("digitized/bjuröklubb_example/bjuröklubb_1927/probablities/1927-01-01.parquet")
term_på_baro barom torra_term våta_term moln_slag_lägre moln_mängd_lägre moln_slag_högre moln_mängd_total vind_riktning vind_beaufort vind_m_sek sikt sjögang maximi_term minimi_term nederbörd_mängd nederbörd_slag
tid
02:00:00 0.999486 0.999484 0.999494 0.999471 0.999463 0.999468 0.999445 0.999480 0.999513 0.999450 0.999487 0.999486 0.999495 0.922424 0.938132 0.999491 0.999493
08:00:00 0.986166 0.873290 0.787986 0.999477 0.998900 0.998792 0.999478 0.999507 0.997447 0.999035 0.999489 0.999445 0.999150 0.999483 0.999475 0.641884 0.994220
14:00:00 0.994033 0.989950 0.695166 0.999494 0.999613 0.999524 0.999486 0.998935 0.128251 0.999284 0.999475 0.975948 0.999498 0.999462 0.999462 0.999461 0.999475
19:00:00 0.990853 0.807118 0.944315 0.999499 0.903842 0.999241 0.999482 0.988703 0.539789 0.999513 0.999260 0.999186 0.999362 0.999443 0.999478 0.998484 0.991648
21:00:00 0.999528 0.999487 0.999502 0.999514 0.999522 0.999486 0.999497 0.999517 0.999523 0.999502 0.999559 0.999493 0.999530 0.999529 0.999447 0.999522 0.999465
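
The probabilities are most useful for finding cells that deserve a manual check. The sketch below flags cells under a cutoff using an excerpt of the values above. The cutoff of 0.8 is a hypothetical choice for illustration; the threshold dawsonia applies internally is not stated in this tutorial.

```python
# Hypothetical cutoff chosen for illustration only.
THRESHOLD = 0.8

# Excerpt of the probabilities shown above: {time: {column: probability}}
probabilities = {
    "08:00:00": {"torra_term": 0.787986, "vind_riktning": 0.997447, "nederbörd_mängd": 0.641884},
    "14:00:00": {"torra_term": 0.695166, "vind_riktning": 0.128251, "nederbörd_mängd": 0.999461},
    "19:00:00": {"torra_term": 0.944315, "vind_riktning": 0.539789, "nederbörd_mängd": 0.998484},
}


def low_confidence_cells(probs, threshold):
    """Return (time, column, probability) for cells below the threshold, lowest first."""
    flagged = [
        (time, column, p)
        for time, row in probs.items()
        for column, p in row.items()
        if p < threshold
    ]
    return sorted(flagged, key=lambda item: item[2])


for time, column, p in low_confidence_cells(probabilities, THRESHOLD):
    print(f"{time}  {column:<16} {p:.3f}")
```

Applied to all 85 values printed above, a cutoff of 0.8 leaves 5 cells below it, which is consistent with predictions_above_thresh = 80 reported for this page, although that alone does not confirm the value dawsonia uses.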

Statistics#

read_json("digitized/bjuröklubb_example/bjuröklubb_1927/statistics/1927-01-01.json")
{'tables_detected': 2,
 'predictions_total': 85,
 'predictions_above_thresh': 80,
 'predictions_empty_value': 50,
 'unset_values': 0}
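
The per-page statistics can be combined into a summary for a whole book. The following sketch uses the four output_stats entries from the log above; summarize is a hypothetical helper, not a dawsonia function.

```python
def summarize(stats_per_page):
    """Aggregate per-page statistics into totals and shares."""
    total = sum(s["predictions_total"] for s in stats_per_page)
    above = sum(s["predictions_above_thresh"] for s in stats_per_page)
    empty = sum(s["predictions_empty_value"] for s in stats_per_page)
    return {
        "pages": len(stats_per_page),
        "predictions_total": total,
        # Guard against division by zero when no predictions were made.
        "share_above_thresh": above / total if total else 0.0,
        "share_empty": empty / total if total else 0.0,
    }


# Values copied from the output_stats lines in the log above.
pages = [
    {"predictions_total": 85, "predictions_above_thresh": 80, "predictions_empty_value": 50},
    {"predictions_total": 85, "predictions_above_thresh": 78, "predictions_empty_value": 55},
    {"predictions_total": 85, "predictions_above_thresh": 85, "predictions_empty_value": 48},
    {"predictions_total": 85, "predictions_above_thresh": 80, "predictions_empty_value": 50},
]

summary = summarize(pages)
print(summary)
```

In practice the list could be built by calling read_json on each file in the statistics directory.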

Table Meta#

table_meta = read_json("digitized/bjuröklubb_example/bjuröklubb_1927/table_meta/1927-01-01.json")
table_meta
{'table_sizes': [[5, 8], [5, 9], [], [], []],
 'table_positions': [[[[146, 168, 50, 73],
    [146, 245, 50, 81],
    [146, 324, 50, 77],
    [146, 400, 50, 76],
    [146, 472, 50, 68],
    [146, 533, 50, 53],
    [146, 591, 50, 63],
    [146, 646, 50, 47]],
   [[196, 168, 49, 73],
    [196, 245, 49, 81],
    [196, 324, 49, 77],
    [196, 400, 49, 76],
    [196, 472, 49, 68],
    [196, 533, 49, 53],
    [196, 591, 49, 63],
    [196, 646, 49, 47]],
   [[245, 168, 50, 73],
    [245, 245, 50, 81],
    [245, 324, 50, 77],
    [245, 400, 50, 76],
    [245, 472, 50, 68],
    [245, 533, 50, 53],
    [245, 591, 50, 63],
    [245, 646, 50, 47]],
   [[295, 168, 49, 73],
    [295, 245, 49, 81],
    [295, 324, 49, 77],
    [295, 400, 49, 76],
    [295, 472, 49, 68],
    [295, 533, 49, 53],
    [295, 591, 49, 63],
    [295, 646, 49, 47]],
   [[345, 168, 52, 73],
    [345, 245, 52, 81],
    [345, 324, 52, 77],
    [345, 400, 52, 76],
    [345, 472, 52, 68],
    [345, 533, 52, 53],
    [345, 591, 52, 63],
    [345, 646, 52, 47]]],
  [[[156, 745, 50, 85],
    [156, 816, 50, 57],
    [156, 872, 50, 55],
    [156, 934, 50, 70],
    [156, 1002, 50, 65],
    [156, 1067, 50, 66],
    [156, 1132, 50, 64],
    [156, 1194, 50, 60],
    [156, 1258, 50, 67]],
   [[206, 745, 49, 85],
    [206, 816, 49, 57],
    [206, 872, 49, 55],
    [206, 934, 49, 70],
    [206, 1002, 49, 65],
    [206, 1067, 49, 66],
    [206, 1132, 49, 64],
    [206, 1194, 49, 60],
    [206, 1258, 49, 67]],
   [[254, 745, 48, 85],
    [254, 816, 48, 57],
    [254, 872, 48, 55],
    [254, 934, 48, 70],
    [254, 1002, 48, 65],
    [254, 1067, 48, 66],
    [254, 1132, 48, 64],
    [254, 1194, 48, 60],
    [254, 1258, 48, 67]],
   [[303, 745, 50, 85],
    [303, 816, 50, 57],
    [303, 872, 50, 55],
    [303, 934, 50, 70],
    [303, 1002, 50, 65],
    [303, 1067, 50, 66],
    [303, 1132, 50, 64],
    [303, 1194, 50, 60],
    [303, 1258, 50, 67]],
   [[354, 745, 51, 85],
    [354, 816, 51, 57],
    [354, 872, 51, 55],
    [354, 934, 51, 70],
    [354, 1002, 51, 65],
    [354, 1067, 51, 66],
    [354, 1132, 51, 64],
    [354, 1194, 51, 60],
    [354, 1258, 51, 67]]]]}
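
In the output above, table_positions holds one nested list per detected table: 5 rows of 8 cells and 5 rows of 9 cells, matching the first two entries of table_sizes. Judging from how the numbers change along rows and columns, each cell looks like [y, x, height, width] in pixels; this ordering is an assumption inferred from the values, not something stated by dawsonia. Under that assumption, the sketch below checks table shapes and converts a cell into a (left, upper, right, lower) box. Both functions are hypothetical helpers.

```python
def table_shapes(table_positions):
    """Return [n_rows, n_cols] for each table in table_positions."""
    return [[len(rows), len(rows[0]) if rows else 0] for rows in table_positions]


def cell_to_box(cell):
    """Convert a cell, assumed to be [y, x, height, width], to (left, upper, right, lower)."""
    y, x, height, width = cell
    return (x, y, x + width, y + height)


# Two-by-two excerpt from the first table shown above.
excerpt = [
    [
        [[146, 168, 50, 73], [146, 245, 50, 81]],
        [[196, 168, 49, 73], [196, 245, 49, 81]],
    ]
]

print(table_shapes(excerpt))
box = cell_to_box(excerpt[0][0][0])
print(box)
```

A box in this form is what Pillow's Image.crop() expects, so im.crop(box) would cut out the cell, provided the coordinates refer to the preprocessed page image.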