Table detection with `SCIPY_PROJ` method#

This is the default method for table detection.

The algorithm behind this method can be approximately described as follows:

We start from a grayscale pre-processed image from dawsonia.image_preproc.Preprocessor.preprocess() which undergoes the following:
- colors transformed to grayscale
- background noise removed
- corrective rotation applied (optional)
- text removed, grid lines filtered out and thickened
Distinct tables are identified using scipy.ndimage.label() which assigns a label to a group of connected lines. See dawsonia.table_detect.scipy_proj.table_detect_scipy_proj() for the implementation.
The function dawsonia.table_detect.scipy_proj.projections_of_label_tables() applies projection along horizontal and vertical directions.
For different values of a factor called sensibility, the function dawsonia.table_detect.scipy_proj.get_table_structure() identifies peaks to get x and y coordinates for every row and column.
Row heights and column widths are calculated using dawsonia.table_detect.scipy_proj.get_position().
Table cells are sorted in left-to-right, top-to-bottom order using dawsonia.table_detect.scipy_proj.sort_insertion().
Leading row and columns are removed (optional)
If the detected table structure matches the configuration specified in the divided_tables and section definitions of the .toml file, the result is saved to the output lists.
The functions dawsonia.table_detect.utils.apply_row_manipulation() and dawsonia.table_detect.utils.apply_col_manipulation() are called to handle tables that are divided into multiple row or column sections.

%matplotlib inline
import matplotlib.pyplot as plt

We start by setting the environment variable in Bash,

export DAWSONIA_DEBUG_TABLE_DETECT=1

or in the Python console:

import os
os.environ["DAWSONIA_DEBUG_TABLE_DETECT"] = "1"

Now, you need some data to get started. The HTR model weights and data file strucuture are stored in another git repository.

Note

You may notice that that the example journal files used in this page are missing from the git repository. These are not released openly as of now, but it could be available in the future. Please contact us if you are interested.

!git clone https://git.smhi.se/ai-for-obs/data.git

Input data and table formats configuration#

The data repository contains a PDF (we support both PDFs and Zarr files of all formats) with scans of 2 pages each from the weather station Bjuröklubb.

%ls -1 data/raw/bjuroklubb_example/

bjuroklubb_1955-01-01_1955-02-31.pdf

Before we start we need to at least create a table formats configuration file. We do that by manually viewing the files. Visually we know that there are:

3 tables in first page and 2 tables in the second page
each have a distinct number of rows and columns
3 tables that contains numeric values that are of interest to us (first page)
the column and row names which are static values
Some columns and rows only being separated by dots and not continuous lines. This requires some configuration for column sections and row sections.
row sections in all tables
column sections in the third table

The name of the file is important. Notice that the stem of the configuration file is the same as the parent directory which holds the books, which is bjuroklubb_example.

%mkdir -p table_formats

%%file table_formats/bjuroklubb_example.toml
# NOTE: that the file name matches the directory containing the files.
[default]
version = 0

[default.preproc]
corr_rotate = false
idx_tables_size_verify = [0, 1, 2]     # table_modif below needs to describe 
                                       # whether leading rows and columns should be removed from these tables
method = "SCIPY_PROJ"
row_idx_unit = "NONE"
table_modif = [[1, 1], [1, 1], [1, 1]] # This denotes how many 
                                       # leading rows and columns to remove from table 0,1 and 2.
                                       # Each element is constructed [nrows to remove, ncols to remove]

[version.0]
# column sections describe whether any columns (including leading columns) are separated by non-continuous lines 
# [cols,nsections]
col_sections = [
  [[1,30]],         # first table consists of 30 singular columns
  [[1,30]],         # second table consists of 30 singular columns
  [[1,11], [2,8]]   # third table consists of 11 singular columns 
                    # and then 8 sections that each need to be divided into two columns each
]
#describing column names left after contingent removal of leading columns and division of column sections
columns = [
  [
    "h_1",
    "h_2",
    "h_3",
    "h_4",
    "h_5",
    "h_6",
    "h_7",
    "h_8",
    "h_9",
    "h_10",
    "h_11",
    "h_12",
    "h_13",
    "h_14",
    "h_15",
    "h_16",
    "h_17",
    "h_18",
    "h_19",
    "h_20",
    "h_21",
    "h_22",
    "h_23",
    "h_24",
    "Dagssum",
    "Dagsmed",
    "Dagsmax",
    "Dagsmin",
    "Diff"
  ],
  [
    "h_1",
    "h_2",
    "h_3",
    "h_4",
    "h_5",
    "h_6",
    "h_7",
    "h_8",
    "h_9",
    "h_10",
    "h_11",
    "h_12",
    "h_13",
    "h_14",
    "h_15",
    "h_16",
    "h_17",
    "h_18",
    "h_19",
    "h_20",
    "h_21",
    "h_22",
    "h_23",
    "h_24",
    "Dagssum",
    "Dagsmed",
    "Dagsmax",
    "Dagsmin",
    "Diff"
  ],
  [
    "h_1",
    "h_4",
    "h_7",
    "h_10",
    "h_13",
    "h_16",
    "h_19",
    "h_22",
    "Dagssumma_rel_fukt",
    "Dagsmedel_rel_fukt",
    "h_1_rikt",
    "h_1_hast",
    "h_4_rikt",
    "h_4_hast",
    "h_7_rikt",
    "h_7_hast",
    "h_10_rikt",
    "h_10_hast",
    "h_13_rikt",
    "h_13_hast",
    "h_16_rikt",
    "h_16_hast",
    "h_19_rikt",
    "h_19_hast",
    "h_22_rikt",
    "h_22_hast"
  ]
]
# Table sizes after division according to col_sections and row_sections. Including leading columns and rows.
divided_tables = [
  [34, 30],
  [34, 30],
  [34, 27]
]
name_idx = ["x_idx_0", "x_idx_1", "x_idx_2"]
# row sections describe whether any rows (including leading rows) are separated by non-continuous lines 
# [rows,nsections]
row_sections = [
  [
    [1, 1],             # The first table has one leading row, then
    [5, 5],             # five sections that should be divided into 5 rows each, then
    [6, 1],             # one section that should be divided into 5 rows, then
    [1, 2],             # two regular rows
  ],
  [
    [1, 1],               # The second table has one leading row, then
    [5, 5],               # five sections that should be divided into 5 rows each
    [6, 1],               # one section that should be divided into 5 rows, then
    [1, 2],               # two regular rows
  ],
  [
    [1, 1],               # The second table has one leading row, then
    [5, 5],               # five sections that should be divided into 5 rows each
    [6, 1],               # one section that should be divided into 5 rows, then
    [1, 2],               # two regular rows
  ]
]
#row names after contingent removal of leading rows and division of row sections
rows = [["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "Timsumma", "Timmedel"], ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "Timsumma", "Timmedel"], ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "Timsumma", "Timmedel"]]
# Table sizes before division according to row_sections and col_sections 
# [nrows, ncols]
tables = [
  [9, 30],
  [9, 30],
  [9, 19]
]

Overwriting table_formats/bjuröklubb_example.toml

Demo: Table detection#

Now that we have all the prerequisites read, let’s execute the command:

DAWSONIA_DEBUG_TABLE_DETECT=1 dawsonia label --first-page 1 --last-page 1 --no-interactive data/raw/bjuroklubb_example/bjuroklubb_1955-01-01_1955-02-31.pdf

or its Python API equivalent:

from dawsonia import label

label.command("data/raw/bjuroklubb_example/bjuroklubb_1955-01-01_1955-02-31.pdf", 1, 1, interactive=False)

INFO     2025-10-07 13:50:57,720 - dawsonia.io._pdf - INFO - table_format = TableFormat(name_idx=['x_idx_0',       
         'x_idx_1', 'x_idx_2'], columns=[['h_1', 'h_2', 'h_3', 'h_4', 'h_5', 'h_6', 'h_7', 'h_8', 'h_9', 'h_10',   
         'h_11', 'h_12', 'h_13', 'h_14', 'h_15', 'h_16', 'h_17', 'h_18', 'h_19', 'h_20', 'h_21', 'h_22', 'h_23',   
         'h_24', 'Dagssum', 'Dagsmed', 'Dagsmax', 'Dagsmin', 'Diff'], ['h_1', 'h_2', 'h_3', 'h_4', 'h_5', 'h_6',   
         'h_7', 'h_8', 'h_9', 'h_10', 'h_11', 'h_12', 'h_13', 'h_14', 'h_15', 'h_16', 'h_17', 'h_18', 'h_19',      
         'h_20', 'h_21', 'h_22', 'h_23', 'h_24', 'Dagssum', 'Dagsmed', 'Dagsmax', 'Dagsmin', 'Diff'], ['h_1',      
         'h_4', 'h_7', 'h_10', 'h_13', 'h_16', 'h_19', 'h_22', 'Dagssumma_rel_fukt', 'Dagsmedel_rel_fukt',         
         'h_1_rikt', 'h_1_hast', 'h_4_rikt', 'h_4_hast', 'h_7_rikt', 'h_7_hast', 'h_10_rikt', 'h_10_hast',         
         'h_13_rikt', 'h_13_hast', 'h_16_rikt', 'h_16_hast', 'h_19_rikt', 'h_19_hast', 'h_22_rikt', 'h_22_hast']], 
         rows=[('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', 
         '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', 'Timsumma', 'Timmedel'),    
         ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 
         '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', 'Timsumma', 'Timmedel'), ('1',    
         '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', 'Timsumma', 'Timmedel')], tables=[[9,   
         30], [9, 30], [9, 19]], row_sections=[[[1, 1], [5, 5], [6, 1], [1, 1], [1, 1]], [[1, 1], [5, 5], [6, 1],  
         [1, 1], [1, 1]], [[1, 1], [5, 5], [6, 1], [1, 1], [1, 1]]], col_sections=[[[1, 30]], [[1, 30]], [[1, 11], 
         [2, 8]]], divided_tables=[[34, 30], [34, 30], [34, 27]], preproc=PreprocConfig(table_modif=[[1, 1], [1,   
         1], [1, 1]], corr_rotate=False, row_idx_unit=<TimeUnits.NONE: 3>, method=<PreprocMethods.SCIPY_PROJ: 1>,  
         idx_tables_size_verify=[0, 1, 2]), transforms=None, version='0', station='bjuroklubb_example')

INFO     2025-10-07 13:50:57,778 - dawsonia.io._pdf - INFO - pdf_info = {'Creator': 'pdftk-java 3.3.3', 'Producer':
         'itext-paulo-155 (itextpdf.sf.net - lowagie.com)', 'CreationDate': 'Mon Oct  6 10:46:04 2025 CEST',       
         'ModDate': 'Mon Oct  6 10:46:04 2025 CEST', 'Tagged': 'no', 'UserProperties': 'no', 'Suspects': 'no',     
         'Form': 'none', 'JavaScript': 'no', 'Pages': 4, 'Encrypted': 'no', 'Page size': '848.16 x 1805.64 pts',   
         'Page rot': '0', 'File size': '55857749 bytes', 'Optimized': 'no', 'PDF version': '1.6'}

INFO     2025-10-07 13:50:57,788 - dawsonia.io._pdf - INFO - Setting first_page = 1

../_images/732914f958a4043187035c40ce3f2f51d0b43f33d053a7e5852dcb9692a48c93.png

INFO     2025-10-07 13:51:06,821 - dawsonia.table_detect.scipy_proj - INFO - Detected nb_labels_pass = 3

INFO     2025-10-07 13:51:09,279 - dawsonia.table_detect.scipy_proj - INFO - with sensibility = 0.9: l_size = [[9, 
         29], [9, 30], [9, 19]]

../_images/25038a33d5eb8edc89af7fcb7a04593577c83d1b7b82bc166c98ed4b97734d60.png

../_images/f540e647d09b732dc63fe2c01fe0200bb8b3a978b8bdbc9fdb38a1326cf55b32.png

../_images/d64e979274ca429f08a524166ffb3a01697ff088c97effc78f3e53d89754c76d.png

INFO     2025-10-07 13:51:31,452 - dawsonia.table_detect.scipy_proj - INFO - saving table 1

INFO     2025-10-07 13:51:31,454 - dawsonia.table_detect.scipy_proj - INFO - saving table 2

INFO     2025-10-07 13:51:31,460 - dawsonia.table_detect.scipy_proj - INFO - with sensibility = 0.7: l_size = [[9, 
         30], [9, 30], [9, 19]]

../_images/530dee37c8aa9004cd5b3f4e864a591347954298e04633820a00958122d68815.png

../_images/23b0e8e5cfc7348122ad1534dc1d2f1873fa6628b8016494b0acd4403d81a94d.png

../_images/0b0f2fbd866d1ba8882b96465ac647b25ffa01a7b97a4c30a203b008d2e91b9e.png

INFO     2025-10-07 13:51:38,210 - dawsonia.table_detect.scipy_proj - INFO - saving table 0

INFO     2025-10-07 13:51:38,213 - dawsonia.table_detect.utils - INFO - 🌞 final size of tables: [[33, 29], [33,   
         29], [33, 26]]

../_images/7d18609a4511a97d79211a429c23ce90dcb2f3988e86f64fb860bf33d05da029.png

Table detection with SCIPY_PROJ method#

Input data and table formats configuration#

Demo: Table detection#

Table detection with `SCIPY_PROJ` method#