Digitization
Once we are sure that all or some of the tables can be detected, we are ready to run the whole digitization pipeline. The command dawsonia digitize processes a PDF or Zarr file by calling dawsonia.digitize.digitize_book(), which works as follows:
1. Read the table formats file.
2. Create a dawsonia.io.Book instance using dawsonia.io.read_book().
3. For each page, in dawsonia.digitize.digitize_page_and_write_output() using parallel processes:
   - load the image
   - detect tables
   - for each table, iterate over the table cells within digitize_table_with_model():
     - infer text from the table cell and save it to an intermediate pandas.DataFrame
   - save the DataFrame to the filesystem as a parquet / markdown file along with extra metadata such as the preprocessed page image, prediction probabilities, statistics and table metadata.
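The control flow of these steps can be pictured with a toy sketch. Everything in it is hypothetical: `infer_text` stands in for the HTR model, pages are plain nested lists rather than images, and threads replace the parallel processes mentioned above, purely to keep the example self-contained. It is not dawsonia's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def infer_text(cell):
    """Hypothetical stand-in for the HTR model: here it only strips whitespace."""
    return cell.strip()


def digitize_table(table):
    """Visit every cell of one table, row by row, and collect the inferred text."""
    return [[infer_text(cell) for cell in row] for row in table]


def digitize_page(page):
    """Digitize all tables of one page; a page is a (date, tables) pair here."""
    date, tables = page
    return date, [digitize_table(table) for table in tables]


def digitize_pages(pages, max_workers=2):
    """Process independent pages concurrently and key the results by date."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(digitize_page, pages))


# Two made-up pages, each holding one 2x2 table of raw cell strings.
pages = [
    ("1927-01-01", [[[" +11.0", "747.0 "], ["", " 50"]]]),
    ("1927-01-02", [[["+9.4 ", "753."], [" ", "3"]]]),
]
results = digitize_pages(pages)
```

Because each page is handled independently of the others, the per-page function is the natural unit to distribute over workers.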
Note
Pre-trained models for handwritten text recognition (HTR) are available at https://git.smhi.se/ai-for-obs/data. For example, data/models/dawsonia/2024-03-29 is one such model.
Demo: digitization
Pandas 3.0 deprecation warning
Due to an API change in the upcoming Pandas 3.0, some deprecation warnings are emitted. These will have to be fixed in the future, but as long as we use Pandas 2.x it should be fine, so we suppress them below.
import warnings
warnings.simplefilter("ignore", category=FutureWarning, lineno=0, append=False)
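The filter above stays active for the rest of the session. As an alternative, a minimal sketch that limits the suppression to a single call uses `warnings.catch_warnings()` from the standard library. The helper `run_quietly` and the function `noisy` are hypothetical names introduced only for this illustration.

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func while ignoring FutureWarning only for the duration of the call."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        return func(*args, **kwargs)


def noisy():
    # Hypothetical stand-in for a call that emits a FutureWarning.
    warnings.warn("this API will change", FutureWarning)
    return "done"


result = run_quietly(noisy)
```

The previous warning filters are restored as soon as the `with` block exits, so warnings raised elsewhere remain visible.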
Now we can digitize using the data repository and the table_formats directory that we prepared in the previous notebooks. We can either run the command below or call the equivalent Python function digitize_book() that follows it:
dawsonia digitize \
-f 3 -l 7 \
--model-path data/models/dawsonia/2024-03-29 \
--output-path digitized \
--table-fmt-dir table_formats \
--verbose \
data/raw_zarr/bjuröklubb_example/bjuröklubb_1927.zarr 1927-01-01 1927-12-31
from pathlib import Path

from dawsonia.digitize import digitize_book

cwd = Path.cwd()
digitize_book(
    cwd / "data/raw_zarr/bjuröklubb_example/bjuröklubb_1927.zarr",
    "1927-01-01", "1927-12-31",
    first_page=3, last_page=7,
    model_path=cwd / "data/models/dawsonia/2024-03-29",
    output_path=cwd / "digitized",
    table_fmt_dir=cwd / "table_formats",
    verbose=True,
)
INFO 2024-06-05 12:56:23,082 - dawsonia.io._zarr - INFO - table_format = TableFormat(name_idx='tid', columns=[['term_på_baro', 'barom', 'torra_term', 'våta_term', 'moln_slag_lägre', 'moln_mängd_lägre', 'moln_slag_högre', 'moln_mängd_total'], ['vind_riktning', 'vind_beaufort', 'vind_m_sek', 'sikt', 'sjögang', 'maximi_term', 'minimi_term', 'nederbörd_mängd', 'nederbörd_slag']], rows=(datetime.time(2, 0), datetime.time(8, 0), datetime.time(14, 0), datetime.time(19, 0), datetime.time(21, 0)), tables=[[5, 8], [5, 9], [3, 1], [4, 2], [4, 5]], preproc=PreprocConfig(table_modif=True, corr_rotate=True, row_idx_unit=<TimeUnits.HOURS: 1>, method=<PreprocMethods.SCIPY_PROJ: 1>, idx_tables_size_verify=[0, 1]), transforms=None, version='1', station='bjuröklubb_example')
INFO 2024-06-05 12:56:23,104 - dawsonia.io._zarr - INFO - Setting first_page = 3
INFO 2024-06-05 12:56:23,109 - dawsonia.digitize - INFO - Digitizing Book(file=<zarr.hierarchy.Group '/' read-only>, page_middle=None, table_format=TableFormat(name_idx='tid', columns=[['term_på_baro', 'barom', 'torra_term', 'våta_term', 'moln_slag_lägre', 'moln_mängd_lägre', 'moln_slag_högre', 'moln_mängd_total'], ['vind_riktning', 'vind_beaufort', 'vind_m_sek', 'sikt', 'sjögang', 'maximi_term', 'minimi_term', 'nederbörd_mängd', 'nederbörd_slag']], rows=(datetime.time(2, 0), datetime.time(8, 0), datetime.time(14, 0), datetime.time(19, 0), datetime.time(21, 0)), tables=[[5, 8], [5, 9], [3, 1], [4, 2], [4, 5]], preproc=PreprocConfig(table_modif=True, corr_rotate=True, row_idx_unit=<TimeUnits.HOURS: 1>, method=<PreprocMethods.SCIPY_PROJ: 1>, idx_tables_size_verify=[0, 1]), transforms=None, version='1', station='bjuröklubb_example'), size_cell=[1.0, 1.0, 1.0, 1.0], preprocessor=Preprocessor())
INFO 2024-06-05 12:56:26,589 - dawsonia.digitize - INFO - Writing /home/a002487/Sources/ai-for-obs/dawsonia/docs/source/getting_started/digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-01
INFO 2024-06-05 12:56:26,702 - dawsonia.digitize - INFO - Writing /home/a002487/Sources/ai-for-obs/dawsonia/docs/source/getting_started/digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-02
INFO 2024-06-05 12:56:26,713 - dawsonia.digitize - INFO - Writing /home/a002487/Sources/ai-for-obs/dawsonia/docs/source/getting_started/digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-03
INFO 2024-06-05 12:56:26,720 - dawsonia.digitize - INFO - Writing /home/a002487/Sources/ai-for-obs/dawsonia/docs/source/getting_started/digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-04
INFO 2024-06-05 12:57:12,298 - dawsonia.digitize - INFO - page_number = 6 date_str = '1927-01-04'
INFO 2024-06-05 12:57:12,303 - dawsonia.digitize - INFO - output_stats = Statistics(tables_detected=2, predictions_total=85, predictions_above_thresh=80, predictions_empty_value=50, unset_values=0)
INFO 2024-06-05 12:57:12,308 - dawsonia.digitize - INFO - page_number = 5 date_str = '1927-01-03'
INFO 2024-06-05 12:57:12,313 - dawsonia.digitize - INFO - output_stats = Statistics(tables_detected=2, predictions_total=85, predictions_above_thresh=85, predictions_empty_value=48, unset_values=0)
INFO 2024-06-05 12:57:12,318 - dawsonia.digitize - INFO - page_number = 3 date_str = '1927-01-01'
INFO 2024-06-05 12:57:12,323 - dawsonia.digitize - INFO - output_stats = Statistics(tables_detected=2, predictions_total=85, predictions_above_thresh=80, predictions_empty_value=50, unset_values=0)
INFO 2024-06-05 12:57:12,329 - dawsonia.digitize - INFO - page_number = 4 date_str = '1927-01-02'
INFO 2024-06-05 12:57:12,334 - dawsonia.digitize - INFO - output_stats = Statistics(tables_detected=2, predictions_total=85, predictions_above_thresh=78, predictions_empty_value=55, unset_values=0)
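When the command line is driven from another script, the argument list can be assembled programmatically. The sketch below uses only the flags that appear in the command shown earlier. `build_digitize_command` is a hypothetical helper written for this illustration and is not part of dawsonia.

```python
import shlex


def build_digitize_command(book, first_date, last_date, *, first_page, last_page,
                           model_path, output_path, table_fmt_dir, verbose=False):
    """Assemble the argument list for `dawsonia digitize` (hypothetical helper)."""
    cmd = [
        "dawsonia", "digitize",
        "-f", str(first_page), "-l", str(last_page),
        "--model-path", str(model_path),
        "--output-path", str(output_path),
        "--table-fmt-dir", str(table_fmt_dir),
    ]
    if verbose:
        cmd.append("--verbose")
    # Positional arguments come last: the book, then the first and last date.
    cmd += [str(book), first_date, last_date]
    return cmd


command = build_digitize_command(
    "data/raw_zarr/bjuröklubb_example/bjuröklubb_1927.zarr",
    "1927-01-01", "1927-12-31",
    first_page=3, last_page=7,
    model_path="data/models/dawsonia/2024-03-29",
    output_path="digitized",
    table_fmt_dir="table_formats",
    verbose=True,
)
print(shlex.join(command))  # shell-quoted form, for logging or copy-pasting
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues with non-ASCII paths.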
Tip
Python: A script such as scripts/digitize.py or scripts/digitize_api.py may be used.
CLI:
Some of these options are repeated from run to run and are best stored in a config file such as cfg/lumi.toml, which is present in the dawsonia repository. Once created, this file can be supplied on the command line as dawsonia digitize -c cfg/myconfig.toml.
!tail -n 6 ../../../cfg/lumi.toml
[dawsonia.digitize]
first_page = 3
last_page = 100000
verbose = true
model_path = "~/scratch/data/processed/dawsonia_model_2022-12-19"
output_path = "~/scratch/data/processed/digitized_2022-12-19"
Results
As mentioned earlier, several files are generated for every page. Let’s take a look at these.
ls digitized/bjuröklubb_example/bjuröklubb_1927/*
digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-01.parquet
digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-02.parquet
digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-03.parquet
digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-04.parquet
digitized/bjuröklubb_example/bjuröklubb_1927/pages:
1927-01-01.webp 1927-01-02.webp 1927-01-03.webp 1927-01-04.webp
digitized/bjuröklubb_example/bjuröklubb_1927/probablities:
1927-01-01.parquet 1927-01-02.parquet 1927-01-03.parquet 1927-01-04.parquet
digitized/bjuröklubb_example/bjuröklubb_1927/statistics:
1927-01-01.json 1927-01-02.json 1927-01-03.json 1927-01-04.json
digitized/bjuröklubb_example/bjuröklubb_1927/table_meta:
1927-01-01.json 1927-01-02.json 1927-01-03.json 1927-01-04.json
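The layout in this listing is regular enough that the files belonging to one page can be derived from its date. A minimal sketch, assuming the layout is exactly as listed above; `page_outputs` is a hypothetical helper and not part of dawsonia.

```python
from pathlib import Path


def page_outputs(book_dir, date):
    """Map each kind of output to its expected path for one page (hypothetical helper)."""
    book_dir = Path(book_dir)
    return {
        "text": book_dir / f"{date}.parquet",
        "page": book_dir / "pages" / f"{date}.webp",
        # The directory name is spelled "probablities" in the listing above.
        "probabilities": book_dir / "probablities" / f"{date}.parquet",
        "statistics": book_dir / "statistics" / f"{date}.json",
        "table_meta": book_dir / "table_meta" / f"{date}.json",
    }


outputs = page_outputs("digitized/bjuröklubb_example/bjuröklubb_1927", "1927-01-01")
for kind, path in outputs.items():
    print(f"{kind:14s} {path}")
```

Checking `path.exists()` for each entry is a quick way to spot pages whose output is incomplete.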
import json
from pandas import read_parquet
from PIL import Image
def read_json(path):
    # Path was imported from pathlib in the digitization step above
    return json.loads(Path(path).read_text())
Preprocessed Image
im = Image.open("digitized/bjuröklubb_example/bjuröklubb_1927/pages/1927-01-01.webp")
im
Digitized Text
read_parquet("digitized/bjuröklubb_example/bjuröklubb_1927/1927-01-01.parquet")
| term_på_baro | barom | torra_term | våta_term | moln_slag_lägre | moln_mängd_lägre | moln_slag_högre | moln_mängd_total | vind_riktning | vind_beaufort | vind_m_sek | sikt | sjögang | maximi_term | minimi_term | nederbörd_mängd | nederbörd_slag | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tid | |||||||||||||||||
| 02:00:00 | |||||||||||||||||
| 08:00:00 | +11.0 | 747.0 | -15.6 | 7 | 10 | 10 | 5 | 9 | 9 | x | 0.6 | 5. | |||||
| 14:00:00 | +10.0 | 50 | -16.8 | 7 | 10 | 10 | 19. | 5 | 9 | 9 | |||||||
| 19:00:00 | +9.4 | 753. | -15.8 | 6 | 3 | 3 | 15. | 3 | 5 | 9 | x | 0.9 | 5. | ||||
| 21:00:00 |
Probabilities
read_parquet("digitized/bjuröklubb_example/bjuröklubb_1927/probablities/1927-01-01.parquet")
| tid | term_på_baro | barom | torra_term | våta_term | moln_slag_lägre | moln_mängd_lägre | moln_slag_högre | moln_mängd_total | vind_riktning | vind_beaufort | vind_m_sek | sikt | sjögang | maximi_term | minimi_term | nederbörd_mängd | nederbörd_slag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 02:00:00 | 0.999486 | 0.999484 | 0.999494 | 0.999471 | 0.999463 | 0.999468 | 0.999445 | 0.999480 | 0.999513 | 0.999450 | 0.999487 | 0.999486 | 0.999495 | 0.922424 | 0.938132 | 0.999491 | 0.999493 |
| 08:00:00 | 0.986166 | 0.873290 | 0.787986 | 0.999477 | 0.998900 | 0.998792 | 0.999478 | 0.999507 | 0.997447 | 0.999035 | 0.999489 | 0.999445 | 0.999150 | 0.999483 | 0.999475 | 0.641884 | 0.994220 |
| 14:00:00 | 0.994033 | 0.989950 | 0.695166 | 0.999494 | 0.999613 | 0.999524 | 0.999486 | 0.998935 | 0.128251 | 0.999284 | 0.999475 | 0.975948 | 0.999498 | 0.999462 | 0.999462 | 0.999461 | 0.999475 |
| 19:00:00 | 0.990853 | 0.807118 | 0.944315 | 0.999499 | 0.903842 | 0.999241 | 0.999482 | 0.988703 | 0.539789 | 0.999513 | 0.999260 | 0.999186 | 0.999362 | 0.999443 | 0.999478 | 0.998484 | 0.991648 |
| 21:00:00 | 0.999528 | 0.999487 | 0.999502 | 0.999514 | 0.999522 | 0.999486 | 0.999497 | 0.999517 | 0.999523 | 0.999502 | 0.999559 | 0.999493 | 0.999530 | 0.999529 | 0.999447 | 0.999522 | 0.999465 |
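A probability table like this one is convenient for flagging cells that deserve a manual check. The sketch below rebuilds a small excerpt of the table (two rows and two columns, with values copied from above) and lists the cells below a cut-off. The value of `THRESHOLD` is an assumption chosen for illustration; it is not the threshold dawsonia uses for `predictions_above_thresh`.

```python
import pandas as pd

# Excerpt of the probabilities shown above, indexed by observation time.
probs = pd.DataFrame(
    {
        "torra_term": [0.787986, 0.695166],
        "vind_riktning": [0.997447, 0.128251],
    },
    index=pd.Index(["08:00:00", "14:00:00"], name="tid"),
)

THRESHOLD = 0.75  # hypothetical cut-off, chosen only for this example

# Reshape to one row per cell, keeping the time index, then filter.
long = probs.melt(ignore_index=False, var_name="column", value_name="probability")
flagged = long[long["probability"] < THRESHOLD].reset_index()
print(flagged)
```

With the full table, `probs` would instead come from `read_parquet` on the file in the probablities directory.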
Statistics
read_json("digitized/bjuröklubb_example/bjuröklubb_1927/statistics/1927-01-01.json")
{'tables_detected': 2,
'predictions_total': 85,
'predictions_above_thresh': 80,
'predictions_empty_value': 50,
'unset_values': 0}
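These counts are easier to compare across pages when expressed as shares. A minimal sketch using the dictionary shown above, assuming the field names mean what they suggest; `summarize` is a hypothetical helper.

```python
def summarize(stats):
    """Turn the raw counts of one page into shares of all predictions."""
    total = stats["predictions_total"]
    if total == 0:
        # Nothing was predicted on this page, so shares are undefined.
        return {"share_above_thresh": None, "share_empty": None, "non_empty": 0}
    return {
        "share_above_thresh": stats["predictions_above_thresh"] / total,
        "share_empty": stats["predictions_empty_value"] / total,
        "non_empty": total - stats["predictions_empty_value"],
    }


stats = {
    "tables_detected": 2,
    "predictions_total": 85,
    "predictions_above_thresh": 80,
    "predictions_empty_value": 50,
    "unset_values": 0,
}
summary = summarize(stats)
print(summary)
```

For this page, 80 of 85 predictions are above the threshold and 35 cells are predicted to be non-empty.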
Table Meta
table_meta = read_json("digitized/bjuröklubb_example/bjuröklubb_1927/table_meta/1927-01-01.json")
table_meta
{'table_sizes': [[5, 8], [5, 9], [], [], []],
'table_positions': [[[[146, 168, 50, 73],
[146, 245, 50, 81],
[146, 324, 50, 77],
[146, 400, 50, 76],
[146, 472, 50, 68],
[146, 533, 50, 53],
[146, 591, 50, 63],
[146, 646, 50, 47]],
[[196, 168, 49, 73],
[196, 245, 49, 81],
[196, 324, 49, 77],
[196, 400, 49, 76],
[196, 472, 49, 68],
[196, 533, 49, 53],
[196, 591, 49, 63],
[196, 646, 49, 47]],
[[245, 168, 50, 73],
[245, 245, 50, 81],
[245, 324, 50, 77],
[245, 400, 50, 76],
[245, 472, 50, 68],
[245, 533, 50, 53],
[245, 591, 50, 63],
[245, 646, 50, 47]],
[[295, 168, 49, 73],
[295, 245, 49, 81],
[295, 324, 49, 77],
[295, 400, 49, 76],
[295, 472, 49, 68],
[295, 533, 49, 53],
[295, 591, 49, 63],
[295, 646, 49, 47]],
[[345, 168, 52, 73],
[345, 245, 52, 81],
[345, 324, 52, 77],
[345, 400, 52, 76],
[345, 472, 52, 68],
[345, 533, 52, 53],
[345, 591, 52, 63],
[345, 646, 52, 47]]],
[[[156, 745, 50, 85],
[156, 816, 50, 57],
[156, 872, 50, 55],
[156, 934, 50, 70],
[156, 1002, 50, 65],
[156, 1067, 50, 66],
[156, 1132, 50, 64],
[156, 1194, 50, 60],
[156, 1258, 50, 67]],
[[206, 745, 49, 85],
[206, 816, 49, 57],
[206, 872, 49, 55],
[206, 934, 49, 70],
[206, 1002, 49, 65],
[206, 1067, 49, 66],
[206, 1132, 49, 64],
[206, 1194, 49, 60],
[206, 1258, 49, 67]],
[[254, 745, 48, 85],
[254, 816, 48, 57],
[254, 872, 48, 55],
[254, 934, 48, 70],
[254, 1002, 48, 65],
[254, 1067, 48, 66],
[254, 1132, 48, 64],
[254, 1194, 48, 60],
[254, 1258, 48, 67]],
[[303, 745, 50, 85],
[303, 816, 50, 57],
[303, 872, 50, 55],
[303, 934, 50, 70],
[303, 1002, 50, 65],
[303, 1067, 50, 66],
[303, 1132, 50, 64],
[303, 1194, 50, 60],
[303, 1258, 50, 67]],
[[354, 745, 51, 85],
[354, 816, 51, 57],
[354, 872, 51, 55],
[354, 934, 51, 70],
[354, 1002, 51, 65],
[354, 1067, 51, 66],
[354, 1132, 51, 64],
[354, 1194, 51, 60],
[354, 1258, 51, 67]]]]}
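One simple consistency check on this structure is that each detected table's grid of cell boxes matches the recorded table size. The sketch below runs on a trimmed-down, made-up excerpt (the first two rows and three columns of the first table above). It assumes that empty entries in `table_sizes` have no counterpart in `table_positions`, as in the output shown; `table_shape` and `check_table_meta` are hypothetical helpers.

```python
def table_shape(cells):
    """Return [n_rows, n_cols] for one table's nested list of cell boxes."""
    if not cells:
        return []
    return [len(cells), len(cells[0])]


def check_table_meta(meta):
    """Compare each table's grid of boxes with the non-empty recorded sizes."""
    sizes = [size for size in meta["table_sizes"] if size]
    shapes = [table_shape(cells) for cells in meta["table_positions"]]
    return sizes == shapes


# Trimmed-down excerpt: one table of 2 rows x 3 columns, one empty size entry.
example_meta = {
    "table_sizes": [[2, 3], []],
    "table_positions": [
        [
            [[146, 168, 50, 73], [146, 245, 50, 81], [146, 324, 50, 77]],
            [[196, 168, 49, 73], [196, 245, 49, 81], [196, 324, 49, 77]],
        ]
    ],
}
is_consistent = check_table_meta(example_meta)
print(is_consistent)
```

Applied to the full `table_meta` above, the same check compares the sizes [5, 8] and [5, 9] with the two grids of boxes.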