2025-04-17
·World Athletics Scoring Tables PDF Parser
·Data Processing
·8 min read
Introduction
Recently, I came across the revised edition of the technical information and official documents from World Athletics (formerly known as the IAAF) regarding competition organisation. As I returned to athletics this winter, I thought it would be enjoyable to pursue a project on parsing this data, storing it in a SQL database, and finally developing an API.
Creating a database that can be accessed through an API will be useful for statistical data analysis or other interesting software projects.
The PDF parser and API were developed in Golang, Python, and Common Lisp. There was no particular reason for choosing each of these languages; I did so purely as an exercise and as a valuable learning experience. Today I will focus on the PDF parser.
World Athletics Scoring Table
In January 2025, World Athletics updated its athletics scoring tables. The tables, previously known as the IAAF Scoring Tables of Athletics, were created by Bojidar Spiriev in 1982 and are now maintained by his son Attila [1].
Extracting Data from PDF files
Extracting data from PDFs can be complex. The methodology utilised in this project is only applicable to the specific source material. It would certainly be beneficial to have a more universal approach for extracting tabular data from PDFs and converting it into the preferred format. At present, the code written is applicable only to this specific problem.
Considerations
Understanding how the PDF is formatted was the first step I took in order to identify how to parse it correctly. The initial implementation was done in Python using the pdfplumber library.
The pdfparser.py module will require further iteration, but it currently parses the technical documentation correctly into both .csv and .json formats.
Difficulties
The first hurdle encountered was extracting tables from pages that matched a particular pattern. As the tables could not be easily extracted using the extract_tables() method, I resorted to the extract_text() method to obtain the first few lines from the page and check for a match against a specific pattern.
If the pattern exists, we can infer that a table is present on the page. An additional function is then run to extract a table-like structure from the page using fuzzy grouping.
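To illustrate the idea, here is a minimal sketch of that detection step using pdfplumber; the helper name page_has_section and the five-line window are my own choices for this example, not the actual code from pdfparser.py:

import pdfplumber

def page_has_section(page, section_title: str, n_lines: int = 5) -> bool:
    """Return True if one of the first few text lines on the page names a known section."""
    text = page.extract_text() or ""
    first_lines = text.splitlines()[:n_lines]
    return any(section_title.lower() in line.lower() for line in first_lines)

with pdfplumber.open("tmp/individual_events_score_tables.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        if page_has_section(page, "MEN's Sprints - Part I"):
            print(f"Page {i}: candidate scoring table")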
Implementation (Python)
First, a virtual environment was created and activated using:
python3 -m venv venv; source venv/bin/activate
The folder structure looks like this:
.
├── LICENSE
├── README.md
├── data/
├── main.py
├── project.toml
├── requirements.txt
├── src/
│   ├── __init__.py
│   └── pdfparser.py
├── tests/
├── tmp/
├── .gitignore
└── venv/
The pdfplumber library was installed via pip install, together with:
python-dotenv
numpy
pandas
wget
orjson
An .env file was created to store important local environment variables, which are loaded by the load_dotenv() function when main.py is run.
As of now, the main environment variables required are the URLs pointing to the World Athletics Scoring Tables of Athletics and the IAAF Scoring Tables for Combined Events.
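For reference, a .env file along these lines would work; the variable names match the ones read in main.py, while the URLs shown here are placeholders rather than the real document links:

IAAF_SCORING_PDF_URL=https://example.com/scoring-tables-2025.pdf
IAAF_COMBINED_EVENTS_SCORING_PDF_URL=https://example.com/combined-events-scoring-tables.pdf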
I am also importing two functions from a local package, download_pdf and fetch_table_data_from_PDF.
import os
import shutil
from dotenv import load_dotenv

load_dotenv()  # load environment variables from the .env file

from src.pdfparser import download_pdf, fetch_table_data_from_PDF

combined_events_scores_tables = os.getenv("IAAF_COMBINED_EVENTS_SCORING_PDF_URL")
individual_events_score_tables = os.getenv("IAAF_SCORING_PDF_URL")

pdfs = [
    {
        "name": "individual_events_score_tables.pdf",
        "url": individual_events_score_tables,
    },
    {
        "name": "combined_events_score_tables.pdf",
        "url": combined_events_scores_tables,
    },
]
In the first section of the main program, the two variables pointing to the PDF URLs are retrieved using os.getenv("<ENV VAR>").
The program loops through the list of PDFs, each represented by a dictionary containing a name and a url key-value pair.
The download_pdf function is called to download the source PDF via wget into a temporary tmp folder, which is later cleared when the program successfully completes its task.
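For context, a minimal download_pdf could look like the sketch below, built on the wget package listed in the requirements; this is an illustration of the behaviour described above, not the exact implementation in pdfparser.py:

import os
import wget

def download_pdf(url: str, tmp_dir: str = "tmp", filename: str = "doc.pdf") -> str:
    """Download a PDF into tmp_dir and return its local path."""
    os.makedirs(tmp_dir, exist_ok=True)  # create the temporary folder if needed
    pdf_path = os.path.join(tmp_dir, filename)
    if not os.path.exists(pdf_path):  # avoid re-downloading an existing file
        wget.download(url, out=pdf_path)
    return pdf_path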
Finally, fetch_table_data_from_PDF() is called.
for pdf in pdfs:
    pdf_path = download_pdf(pdf["url"], tmp_dir="tmp", filename=pdf["name"])
    print(pdf_path)
    fetch_table_data_from_PDF(pdf_path, True)

# remove the contents of the ./tmp folder
for filename in os.listdir("tmp"):
    file_path = os.path.join("tmp", filename)
    try:
        if os.path.isfile(file_path) or os.path.islink(file_path):
            os.unlink(file_path)
        elif os.path.isdir(file_path):
            # remove subfolder(s) and their contents
            shutil.rmtree(file_path)
    except Exception as e:
        print(f"Failed to erase {file_path}, or {file_path} does not exist: {e}")
Note that os.unlink() is used to remove files; this method achieves the same result as os.remove() [2].
It is also important to check whether the entry is a subfolder, in which case shutil.rmtree() is used instead.
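If nothing else ever lives in tmp, a simpler alternative would be to drop and recreate the whole folder:

shutil.rmtree("tmp", ignore_errors=True)  # remove the folder and all of its contents
os.makedirs("tmp", exist_ok=True)  # recreate it empty for the next run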
fetch_table_data_from_PDF()
The function fetch_table_data_from_PDF() is the most important part of the program. It loops through the pages of the PDF and extracts text from the first few lines of each page, searching for a match with the sections specified in the SECTIONS array.
# IDs associated with men's disciplines
MEN_SECTION_IDS = {
    "MEN's Sprints - Part I":
        ["Points", "50m", "55m", "60m", "100m", "200m", "200m sh"],
    "MEN's Sprints - Part II":
        ["Points", "300m", "300m sh", "400m", "400m sh", "500m", "500m sh"],
    "MEN's Hurdles": ["Points", "50mH", "55mH", "60mH", "110mH", "400mH"],
    "MEN'S JUMPS, THROWS AND COMBINED EVENTS": [
        "Points", "HJ", "PV", "LJ", "TJ", "SP", "DT", "HT", "JT", "Hept. sh",
        "Dec"
    ],
}

# IDs associated with women's disciplines
WOMEN_SECTION_IDS = {
    "WOMEN's Sprints - Part I":
        ["Points", "50m", "55m", "60m", "100m", "200m", "200m sh"],
    "WOMEN's Sprints - Part II":
        ["Points", "300m", "300m sh", "400m", "400m sh", "500m", "500m sh"],
    "WOMEN's Hurdles": ["Points", "50mH", "55mH", "60mH", "100mH", "400mH"],
    "WOMEN's JUMPS, THROWS AND COMBINED EVENTS": [
        "Points", "HJ", "PV", "LJ", "TJ", "SP", "DT", "HT", "JT", "Pent.sh",
        "Hept."
    ],
}

SECTIONS = [MEN_SECTION_IDS, WOMEN_SECTION_IDS]
The keys in each dictionary correspond to the table titles being searched for, while the values are the expected headers (column names) of the table. If the section title matches, and at least 50% of the headers partially match, the page is assumed to contain a table, which is then extracted by arranging the words into rows using fuzzy y-grouping.
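The grouping itself lives in pdfparser.py; the sketch below shows the general idea, assuming pdfplumber's extract_words() output and an arbitrary tolerance of three points:

import pdfplumber

Y_TOLERANCE = 3  # words whose `top` values differ by at most this many points share a row

def rows_from_page(page):
    """Group a page's words into table rows by clustering their vertical positions."""
    words = sorted(page.extract_words(), key=lambda w: w["top"])
    rows, current_row, row_top = [], [], None
    for word in words:
        if row_top is not None and abs(word["top"] - row_top) > Y_TOLERANCE:
            rows.append(current_row)  # y-gap too large: close the current row
            current_row, row_top = [], None
        if row_top is None:
            row_top = word["top"]  # anchor the row at its first word
        current_row.append(word)
    if current_row:
        rows.append(current_row)
    # order each row left to right and keep only the text
    return [[w["text"] for w in sorted(row, key=lambda w: w["x0"])] for row in rows]

# usage:
# with pdfplumber.open(pdf_path) as pdf:
#     rows = rows_from_page(pdf.pages[0])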
A few checks are performed to ensure that the table satisfies specific requirements — i.e., if the table is None or its length is less than 2, the current iteration is skipped.
if not table or len(table) < 2:
    print(f"⚠️ Page {i}: No structured table found.")
    continue
There were some issues I noticed while debugging the code. Sometimes, the key-value pairs were misaligned due to poor column formatting on my part. Additionally, the page number was often identified as a value, leading to an extra entry of the type:
{
    "Points": "36",
    "50m": null,
    "55m": null,
    "60m": null,
    "100m": null,
    "200m": null,
    "200m sh": null
}
Therefore, prior to creating a DataFrame, I clean up the columns and ensure the size of the data is correct by selecting the appropriate indices.
columns = [col.strip() for col in table[1]]  # table[1] holds the header row
data = [row[:len(columns)] for row in table[2:]]  # trim each data row to match the column count
df = pd.DataFrame(data, columns=columns)
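One way to filter out the artefact rows mentioned above, assuming empty cells arrive as None or empty strings, is to drop any row where every event column is missing:

event_cols = [c for c in df.columns if c != "Points"]
df = df.replace("", None).dropna(how="all", subset=event_cols)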
Finally, the table is saved either as a .csv or .json file, depending on the values assigned to the boolean variables.
The default values are save_json=True and save_csv=False.
After running the main program with python main.py, the tabular data is saved as .csv and/or .json files, depending on how save_csv and save_json are set in main.py.
Note that the filenames are formatted using the information provided in the SECTIONS array, along with the page number where the table is identified.
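A hypothetical save helper along these lines would produce that layout; the function name and signature are mine rather than the actual code, but the directory and filename scheme mirrors the output tree shown below:

import os
import pandas as pd

def save_table(df: pd.DataFrame, section: str, page: int, idx: int = 0,
               save_json: bool = True, save_csv: bool = False) -> None:
    """Save one extracted table under data/<SECTION>/, named after its section and page."""
    out_dir = os.path.join("data", section)
    os.makedirs(out_dir, exist_ok=True)
    stem = f"{section.lower()}_page_{page}_table_{idx}"
    if save_json:
        df.to_json(os.path.join(out_dir, stem + ".json"), orient="records", indent=2)
    if save_csv:
        df.to_csv(os.path.join(out_dir, stem + ".csv"), index=False)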
Below is an example of the final folder structure when save_json=True and save_csv=False.
.
├── LICENSE
├── README.md
├── data/
│   ├── MENS_JUMPS_THROWS_AND_COMBINED_EVENTS/
│   │   ├── mens_jumps_throws_and_combined_events_page_368_table_0.json
│   │   ├── mens_jumps_throws_and_combined_events_page_371_table_0.json
│   │   └── ...
│   ├── MENs_Hurdles/
│   │   ├── mens_hurdles_page_68_table_0.json
│   │   ├── mens_hurdles_page_69_table_0.json
│   │   └── ...
│   ├── MENs_Sprints_Part_I/
│   │   ├── mens_sprints_part_i_page_10_table_0.json
│   │   ├── mens_sprints_part_i_page_11_table_0.json
│   │   └── ...
│   ├── MENs_Sprints_Part_II/
│   │   ├── mens_sprints_part_ii_page_38_table_0.json
│   │   ├── mens_sprints_part_ii_page_39_table_0.json
│   │   └── ...
│   ├── WOMENs_Hurdles/
│   │   ├── womens_hurdles_page_458_table_0.json
│   │   ├── womens_hurdles_page_459_table_0.json
│   │   └── ...
│   ├── WOMENs_JUMPS_THROWS_AND_COMBINED_EVENTS/
│   │   ├── womens_jumps_throws_and_combined_events_page_758_table_0.json
│   │   ├── womens_jumps_throws_and_combined_events_page_759_table_0.json
│   │   ├── womens_jumps_throws_and_combined_events_page_760_table_0.json
│   │   └── ...
│   ├── WOMENs_Sprints_Part_I/
│   │   ├── womens_sprints_part_i_page_398_table_0.json
│   │   ├── womens_sprints_part_i_page_399_table_0.json
│   │   ├── womens_sprints_part_i_page_400_table_0.json
│   │   └── ...
│   └── WOMENs_Sprints_Part_II/
│       ├── womens_sprints_part_ii_page_428_table_0.json
│       ├── womens_sprints_part_ii_page_429_table_0.json
│       ├── womens_sprints_part_ii_page_430_table_0.json
│       ├── womens_sprints_part_ii_page_431_table_0.json
│       ├── womens_sprints_part_ii_page_432_table_0.json
│       └── ...
├── main.py
├── project.toml
├── requirements.txt
├── src/
│   ├── __init__.py
│   └── pdfparser.py
├── tests/
└── venv/
References
[1] Spiriev, B. and Spiriev, A. (2025) Scoring Tables of Athletics / Tables de Cotation d’Athlétisme: 2025 Revised Edition. Monaco: World Athletics. Available at: https://worldathletics.org/about-iaaf/documents/technical-information (Accessed: 17 April 2025).
[2] Python Software Foundation (2023). os — Miscellaneous operating system interfaces. Python 3.9 documentation. Available at: https://docs.python.org/3/library/os.html (Accessed: 19 April 2025).
Topics
- PDF parsing and data extraction
Tech Stack
- Python