Table detection with SCIPY_PROJ method

This is the default method for table detection.

The algorithm behind this method can be approximately described as follows:
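As a rough illustration of the general idea behind projection-based detection, the sketch below finds ruling lines in a binarized image by averaging the ink along each image row and keeping the runs that exceed a threshold. It is an assumption based on the method name, not dawsonia's actual implementation; the function `find_horizontal_lines` and its `threshold` parameter are hypothetical.

```python
import numpy as np
from scipy import ndimage


def find_horizontal_lines(binary, threshold=0.7):
    """Return the centre row of every horizontal ruling line.

    Hypothetical helper for illustration only; not part of dawsonia.
    `binary` is a 2-D boolean array in which True marks ink.
    """
    # Projection profile: fraction of ink pixels in each image row.
    profile = binary.mean(axis=1)
    # Rows that are mostly ink are treated as part of a continuous line.
    is_line = profile >= threshold
    # Group consecutive line rows so that a thick line is counted once.
    labels, _ = ndimage.label(is_line)
    return [
        (sl[0].start + sl[0].stop - 1) / 2
        for sl in ndimage.find_objects(labels)
    ]


# Synthetic 21x30 "scan": continuous lines at rows 0, 10 and 20,
# plus a dotted line at row 5 (every other pixel is ink).
page = np.zeros((21, 30), dtype=bool)
page[[0, 10, 20], :] = True
page[5, ::2] = True

lines = find_horizontal_lines(page)
print(lines)  # the dotted row 5 is not reported
```

In this toy example the dotted line covers only half of the row, so it stays below the threshold and goes undetected. Scans with dotted separators pose the same kind of problem, which the table formats configuration has to compensate for.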

%matplotlib inline
import matplotlib.pyplot as plt

We start by setting the environment variable in Bash,

export DAWSONIA_DEBUG_TABLE_DETECT=1

or in the Python console:

import os
os.environ["DAWSONIA_DEBUG_TABLE_DETECT"] = "1"

Now you need some data to get started. The HTR model weights and the data file structure are stored in another Git repository.

!git clone https://git.smhi.se/ai-for-obs/data.git
fatal: destination path 'data' already exists and is not an empty directory.

Input data and table formats configuration

The data repository contains a PDF with scans from the weather station 144300 (both PDF and Zarr files are supported as input).

%ls -1 data/raw/144300/
144300_1957-01-01_1957-12-31.pdf*

Before we start, we need to create at least a table formats configuration file. We do that by inspecting the scans manually. From the scans we can see the following:

  • There are 3 tables on odd-numbered pages and 2 tables on even-numbered pages (only the first page is shown below). The two tables on even-numbered pages do not have the same row-section configuration as the first two tables on odd-numbered pages, so the even-numbered pages cannot be digitized correctly with a configuration that fits the odd-numbered pages. Dawsonia currently has no way of handling table configurations that vary from page to page.

  • Some columns and rows are separated only by dotted lines rather than continuous lines. This requires configuring column sections and row sections.

  • There are row sections in all tables.

  • There are column sections in the third table.

In conclusion, having inspected the 144300 station document, we can see that it is special in two ways:

  • The configuration of its tables varies from page to page. This means that we have to decide which type of page is more important to digitize if we want to create only one table_formats file and not run several pipelines to digitize all pages. In this example we will focus on the odd-numbered pages.

  • It has “row sections” and “column sections”. Dawsonia relies on continuous lines to separate rows, so it has difficulty telling apart rows that are separated only by dotted lines. What the human eye recognizes as, for example, 5 rows separated by dotted lines, dawsonia recognizes as a single row. We call such a group a “row section”: a section that dawsonia detects as one row even though it actually contains several. Row sections and column sections must always be described in the table_formats file so that dawsonia “knows” when to divide a detected row (actually a row section) into several rows. Dawsonia can only divide row or column sections into equal parts.
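The equal-parts rule can be pictured as follows. This is a minimal sketch, assuming a row section is bounded by two detected continuous lines at pixel positions `top` and `bottom`; the helper `split_section` is hypothetical and not part of dawsonia.

```python
import numpy as np


def split_section(top, bottom, n_rows):
    """Return (start, end) pixel bounds of `n_rows` equal rows in a section.

    Hypothetical helper for illustration only.
    """
    # n_rows equal rows need n_rows + 1 evenly spaced edges.
    edges = np.linspace(top, bottom, n_rows + 1)
    return [(float(a), float(b)) for a, b in zip(edges[:-1], edges[1:])]


# A section detected between y=100 and y=200 that really holds 5 rows.
rows = split_section(100, 200, 5)
print(rows)
```

Because the split is purely geometric, it only gives good cell boundaries when the rows inside a section really are of equal height.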

To break down the concept of row and column sections even further, let’s look at the row sections in the first table below. The table contains hourly measurements for January 1957 at station 144300. Columns represent hours and rows represent days. There are also some columns and rows for statistics.

The first row is the header row containing the timestamps of the measurements. It is completely bounded by continuous lines. The following five rows (dates 1-5) are separated by dotted lines, and a continuous line appears only after the 5th. This means dawsonia will detect these five rows as one single row: a row section that needs to be divided into several rows. The next five rows (dates 6-10) are likewise separated by dotted lines until the 10th. The pattern continues until a final row section with 6 rows for dates 26-31. After that come two statistics rows separated by continuous lines. These properties have to be described in the table formats file.
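The row counts implied by this description can be worked out with a few lines of Python. The sketch below uses the `[rows per section, number of sections]` pair convention from the configuration file shown further down; the helper names `count_detected` and `count_divided` are hypothetical.

```python
def count_detected(sections):
    """Rows (or columns) found by line detection: one per section."""
    return sum(n_sections for _, n_sections in sections)


def count_divided(sections):
    """True rows (or columns) once every section has been divided."""
    return sum(n_rows * n_sections for n_rows, n_sections in sections)


# First table: header row, five 5-row sections (dates 1-25),
# one 6-row section (dates 26-31), two statistics rows.
row_sections = [[1, 1], [5, 5], [6, 1], [1, 2]]

print(count_detected(row_sections))  # 9
print(count_divided(row_sections))   # 34
```

These two numbers are the row counts that appear in `tables` (before division) and `divided_tables` (after division) in the configuration file.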

from IPython.display import Image
Image(filename='images/144300_1957-01-01_1957-12-31.png') 
../_images/cf29781c52982c8665652a7abf6ae7502feed4052100a3e85323e7e5cd7229ab.png

The name of the file is important. Notice that the stem of the configuration file name is the same as the name of the parent directory that holds the books, which is 144300.

%mkdir -p table_formats
%%file table_formats/144300.toml
# NOTE: the file name matches the directory containing the files.
[default]
version = 0

[default.preproc]
corr_rotate = false
idx_tables_size_verify = [0, 1, 2]     # we want to digitize all three tables. table_modif below needs to describe 
                                       # whether leading rows and columns should be removed from these tables
method = "SCIPY_PROJ"
row_idx_unit = "NONE"                  # we don't want dawsonia to try to create timestamps for the rows.
table_modif = [[1, 1], [1, 1], [1, 1]] # This denotes how many 
                                       # leading rows and columns to remove from table 0,1 and 2.
                                       # Each element is constructed [nrows to remove, ncols to remove]

[version.0]
# column sections describe whether any columns (including leading columns) are separated by non-continuous lines 
# [cols,nsections]
col_sections = [
  [[1,30]],         # first table consists of 30 singular columns
  [[1,30]],         # second table consists of 30 singular columns
  [[1,11], [2,8]]   # third table consists of 11 singular columns 
                    # and then 8 column sections that each need to be divided into two columns each
]
# column names remaining after any removal of leading columns and division of column sections
columns = [
  [
    "h_1",
    "h_2",
    "h_3",
    "h_4",
    "h_5",
    "h_6",
    "h_7",
    "h_8",
    "h_9",
    "h_10",
    "h_11",
    "h_12",
    "h_13",
    "h_14",
    "h_15",
    "h_16",
    "h_17",
    "h_18",
    "h_19",
    "h_20",
    "h_21",
    "h_22",
    "h_23",
    "h_24",
    "Dagssum",
    "Dagsmed",
    "Dagsmax",
    "Dagsmin",
    "Diff"
  ],
  [
    "h_1",
    "h_2",
    "h_3",
    "h_4",
    "h_5",
    "h_6",
    "h_7",
    "h_8",
    "h_9",
    "h_10",
    "h_11",
    "h_12",
    "h_13",
    "h_14",
    "h_15",
    "h_16",
    "h_17",
    "h_18",
    "h_19",
    "h_20",
    "h_21",
    "h_22",
    "h_23",
    "h_24",
    "Dagssum",
    "Dagsmed",
    "Dagsmax",
    "Dagsmin",
    "Diff"
  ],
  [
    "h_1",
    "h_4",
    "h_7",
    "h_10",
    "h_13",
    "h_16",
    "h_19",
    "h_22",
    "Dagssumma_rel_fukt",
    "Dagsmedel_rel_fukt",
    "h_1_rikt",
    "h_1_hast",
    "h_4_rikt",
    "h_4_hast",
    "h_7_rikt",
    "h_7_hast",
    "h_10_rikt",
    "h_10_hast",
    "h_13_rikt",
    "h_13_hast",
    "h_16_rikt",
    "h_16_hast",
    "h_19_rikt",
    "h_19_hast",
    "h_22_rikt",
    "h_22_hast"
  ]
]
# Table sizes after division according to col_sections and row_sections. Including leading columns and rows.
divided_tables = [
  [34, 30],
  [34, 30],
  [34, 27]
]
name_idx = ["x_idx_0", "x_idx_1", "x_idx_2"]
# row sections describe whether any rows (including leading rows) are separated by non-continuous lines 
# [rows,nsections]
row_sections = [
  [
    [1, 1],             # The first table has one leading row, then
    [5, 5],             # five row sections that should be divided into 5 rows each, then
    [6, 1],             # one section that should be divided into 6 rows, then
    [1, 2],             # two regular rows
  ],
  [
    [1, 1],               # The second table has one leading row, then
    [5, 5],               # five sections that should be divided into 5 rows each then
    [6, 1],               # one section that should be divided into 6 rows, then
    [1, 2],               # two regular rows
  ],
  [
    [1, 1],               # The third table has one leading row, then
    [5, 5],               # five sections that should be divided into 5 rows each, then
    [6, 1],               # one section that should be divided into 6 rows, then
    [1, 2],               # two regular rows
  ]
]
# row names remaining after any removal of leading rows and division of row sections
rows = [["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "Timsumma", "Timmedel"], ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "Timsumma", "Timmedel"], ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "Timsumma", "Timmedel"]]
# Table sizes before division according to row_sections and col_sections 
# [nrows, ncols]
tables = [
  [9, 30],
  [9, 30],
  [9, 19]
]
Overwriting table_formats/144300.toml
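Before running the detection, it can help to check that the numbers in the file agree with each other. The sketch below assumes, following the comments in the file, that the `rows` and `columns` name lists describe each table after division and after the leading rows and columns given by `table_modif` have been removed. The helper `shapes_after_modif` is hypothetical, and the values are copied by hand from the file.

```python
def shapes_after_modif(divided_tables, table_modif):
    """Subtract the removed leading rows and columns from each divided table."""
    return [
        [n_rows - drop_rows, n_cols - drop_cols]
        for (n_rows, n_cols), (drop_rows, drop_cols) in zip(divided_tables, table_modif)
    ]


# Values copied from table_formats/144300.toml.
divided_tables = [[34, 30], [34, 30], [34, 27]]
table_modif = [[1, 1], [1, 1], [1, 1]]
# Number of entries in each list of `rows` and `columns`.
name_counts = [[33, 29], [33, 29], [33, 26]]

shapes = shapes_after_modif(divided_tables, table_modif)
for i, (shape, names) in enumerate(zip(shapes, name_counts)):
    status = "ok" if shape == names else "MISMATCH"
    print(f"table {i}: shape {shape}, names {names} -> {status}")
```

On Python 3.11 or later, the same values could be read from the file with the standard `tomllib` module instead of being copied by hand.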

Demo: Table detection

Now that we have all the prerequisites ready, let’s execute the command:

DAWSONIA_DEBUG_TABLE_DETECT=1 dawsonia label --first-page 1 --last-page 1 --no-interactive data/raw/144300/144300_1957-01-01_1957-12-31.pdf

or its Python API equivalent:

from dawsonia import label

label.command("data/raw/144300/144300_1957-01-01_1957-12-31.pdf", 1, 1, interactive=False)
INFO     2026-03-10 10:43:38,821 - dawsonia.io._pdf - INFO - table_format = TableFormat(name_idx=['x_idx_0',       
         'x_idx_1', 'x_idx_2'], columns=[['h_1', 'h_2', 'h_3', 'h_4', 'h_5', 'h_6', 'h_7', 'h_8', 'h_9', 'h_10',   
         'h_11', 'h_12', 'h_13', 'h_14', 'h_15', 'h_16', 'h_17', 'h_18', 'h_19', 'h_20', 'h_21', 'h_22', 'h_23',   
         'h_24', 'Dagssum', 'Dagsmed', 'Dagsmax', 'Dagsmin', 'Diff'], ['h_1', 'h_2', 'h_3', 'h_4', 'h_5', 'h_6',   
         'h_7', 'h_8', 'h_9', 'h_10', 'h_11', 'h_12', 'h_13', 'h_14', 'h_15', 'h_16', 'h_17', 'h_18', 'h_19',      
         'h_20', 'h_21', 'h_22', 'h_23', 'h_24', 'Dagssum', 'Dagsmed', 'Dagsmax', 'Dagsmin', 'Diff'], ['h_1',      
         'h_4', 'h_7', 'h_10', 'h_13', 'h_16', 'h_19', 'h_22', 'Dagssumma_rel_fukt', 'Dagsmedel_rel_fukt',         
         'h_1_rikt', 'h_1_hast', 'h_4_rikt', 'h_4_hast', 'h_7_rikt', 'h_7_hast', 'h_10_rikt', 'h_10_hast',         
         'h_13_rikt', 'h_13_hast', 'h_16_rikt', 'h_16_hast', 'h_19_rikt', 'h_19_hast', 'h_22_rikt', 'h_22_hast']], 
         rows=[('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', 
         '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', 'Timsumma', 'Timmedel'),    
         ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 
         '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', 'Timsumma', 'Timmedel'), ('1',    
         '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', 'Timsumma', 'Timmedel')], tables=[[9,   
         30], [9, 30], [9, 19]], row_sections=[[[1, 1], [5, 5], [6, 1], [1, 2]], [[1, 1], [5, 5], [6, 1], [1, 2]], 
         [[1, 1], [5, 5], [6, 1], [1, 2]]], col_sections=[[[1, 30]], [[1, 30]], [[1, 11], [2, 8]]],                
         divided_tables=[[34, 30], [34, 30], [34, 27]], preproc=PreprocConfig(table_modif=[[1, 1], [1, 1], [1, 1]],
         corr_rotate=False, row_idx_unit=<TimeUnits.NONE: 3>, method=<PreprocMethods.SCIPY_PROJ: 1>,               
         idx_tables_size_verify=[0, 1, 2]), transforms=None, version='0', station='144300')                        
INFO     2026-03-10 10:43:38,839 - dawsonia.io._pdf - INFO - PDF metadata: {'pages': 24, 'metadata':               
         {'CreationDate': "D:20250609083704+01'00'", 'ModDate': "D:20250613080108+02'00'", 'Producer': 'GS PDF LIB 
         v0.002', 'Title': ''}}                                                                                    
INFO     2026-03-10 10:43:38,841 - dawsonia.io._pdf - INFO - Setting first_page = 1                                
INFO     2026-03-10 10:43:38,842 - dawsonia.io._pdf - INFO - Opening PDF                                           
         data/raw/144300/144300_1957-01-01_1957-12-31.pdf to extract pages 12                                     
INFO     2026-03-10 10:43:38,849 - dawsonia.io._pdf - INFO - Total 24 pages detected; reading up to 2              
../_images/085f1a2e75cf6ac815fc52bfe24423ff65bd862bc19be56afaf70e6e6f3bccec.png
INFO     2026-03-10 10:43:46,842 - dawsonia.table_detect.scipy_proj - INFO - Detected nb_labels_pass = 3           
INFO     2026-03-10 10:43:46,845 - dawsonia.table_detect.scipy_proj - INFO - with sensibility = 0.9: l_size = [[9, 
         28], [9, 30], [9, 19]]                                                                                    
../_images/959ce6d1aa18186c243f8fe3e5f304290755b4b3fe2f1fc6a7265d63eda111f2.png ../_images/b07546ecb04753eb401893670dd733670b9fecaf505dee4af96f47de1dc101b5.png ../_images/33a58ea878c137b42b215c027ac5b2e4ea2b4e488a94dc0cf0b8a501d0063b29.png
INFO     2026-03-10 10:43:51,885 - dawsonia.table_detect.scipy_proj - INFO - saving table 1                        
INFO     2026-03-10 10:43:51,886 - dawsonia.table_detect.scipy_proj - INFO - saving table 2                        
INFO     2026-03-10 10:43:51,888 - dawsonia.table_detect.scipy_proj - INFO - with sensibility = 0.7: l_size = [[9, 
         30], [9, 30], [9, 19]]                                                                                    
../_images/4abdd99de90bfa7ee02efe6b9c6d84ff3e103d14285d0f1ba69b4b13a24522b0.png ../_images/2ee1756d1798e38d62fa32cd7eb55e6847ccdf2eab0851a55ff103f2339245e5.png ../_images/527a4e1f49429fddf2b0adfe3b3818c2607d952353307f2576ef55b8d2571a28.png
INFO     2026-03-10 10:43:57,894 - dawsonia.table_detect.scipy_proj - INFO - saving table 0                        
INFO     2026-03-10 10:43:57,896 - dawsonia.table_detect.utils - INFO - 🌞 final size of tables: [[33, 29], [33,   
         29], [33, 26]]                                                                                            
../_images/09df47910d3bbdcdd11cbc3eec027e14ca0a6b42cbf8a9c84a54cf917fbce162.png

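We can see that dawsonia was successful in finding all tables on the first page, since the log reports: final size of tables: [[33, 29], [33, 29], [33, 26]]. These are the full table sizes without the leading rows and columns, that is, the divided_tables sizes [[34, 30], [34, 30], [34, 27]] minus the one leading row and one leading column removed from each table by table_modif.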