Discovering Docling

A couple of years back, I started using an RSS reader again. I find it really useful for compiling notes for the CAT newsletter I’ve been editing and publishing most weeks, but I also end up with loads of links that are interesting but don’t fit.

This is the first of what I hope will be a series of small “just a link” posts about stuff I find as I clear out my newsreader backlog, starting with an interesting-looking OSS project, Docling.

What is Docling?

Found via Simon Willison’s blog, Docling is a Python project from IBM that appears to use a series of small ML models working together to parse PDF documents more effectively, making it possible to pull meaningful information out of them. There’s a technical report explaining it in detail on arXiv, and it’s on GitHub too.

I’d most likely find this useful at work, where I maintain a platform to aggregate sustainability data from providers of managed and hosted digital services, like WordPress hosting, virtual machines, storage, and so on.

Here’s the blurb:

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

In addition to being able to ‘read’ PDFs, it can also output the content in helpful chunks, making it suitable for the RAG-style ‘asking questions of a document’ analysis that we now see everywhere.

Why I’m interested in it

I’m particularly interested in how well smarter PDF processing software like this handles working with published ‘CSRD’ reports from companies, which, for the first time ever, should present information in standardised, comparable formats, making it possible to make meaningful comparisons between companies.

I’ve written a bit already about how the CSRD, and more specifically the ESRS, make it possible to fetch specific datapoints out of compliant reports. But it looks like not every report in 2025 will be digitally tagged, or follow the standardised file format required by the law (the ESEF, a flavour of iXBRL, which itself is a flavour of XHTML), as the requirement appears to be phased in.

Until these reports are published in that ESEF format, I’m curious about whether it’s possible to parse PDF reports with Docling to make queries somewhat comparable to the ones I might make against an iXBRL file using a dedicated tool like Arelle.

Trying it out

I just tried running it from the command line in my terminal on my 2020 M1 MacBook with 16GB of RAM:

uvx docling Netcompany_Annual-Report_2024.pdf --to json --to md


The Netcompany Annual Report is apparently one of the first reports published in 2025 that follows the CSRD, so it seemed a good target report to try out.

This command took about 14 minutes on my machine to:

  1. run uv to download dependencies
  2. download the various ML models used to parse the PDF
  3. parse the 50MB PDF
  4. generate a 65MB markdown file with embedded base64-encoded images, along with a 325MB JSON file containing the same embedded images.
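Most of that file size is the embedded images. As a rough sketch (assuming the images are embedded as standard markdown data-URI images, which I haven’t verified against Docling’s exact output), a few lines of Python are enough to strip them out and get a file you can actually open in an editor:

```python
import re

# Markdown image syntax whose target is a base64 data URI, e.g.
# ![caption](data:image/png;base64,iVBORw0...)
DATA_URI_IMAGE = re.compile(r"!\[([^\]]*)\]\(data:image/[^)]+\)")


def strip_embedded_images(markdown: str) -> str:
    """Replace each embedded data-URI image with a small text placeholder."""
    return DATA_URI_IMAGE.sub(
        lambda m: f"*[image: {m.group(1) or 'untitled'}]*", markdown
    )


# e.g. to slim down the generated file:
# text = open("Netcompany_Annual-Report_2024.md").read()
# open("report-no-images.md", "w").write(strip_embedded_images(text))
```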

The second run took about 12 minutes to carry out steps 3 and 4. So the data is there. Can it be easily queried to look for ESRS-style datapoints?
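A naive first pass might just be keyword search over the JSON. This sketch assumes the Docling JSON exposes a top-level "texts" list whose items carry a "text" field — that’s my reading of the output, not a documented guarantee, so it’s worth checking against the real schema:

```python
import json


def find_mentions(doc: dict, keyword: str) -> list[str]:
    """Return the text items that mention a keyword, case-insensitively.

    Assumes a top-level "texts" list of objects with a "text" field
    (an assumption about the Docling JSON layout, not a documented API).
    """
    needle = keyword.lower()
    return [
        item["text"]
        for item in doc.get("texts", [])
        if needle in item.get("text", "").lower()
    ]


# e.g. against the real 325MB file:
# doc = json.loads(open("Netcompany_Annual-Report_2024.json").read())
# for hit in find_mentions(doc, "Scope 2"):
#     print(hit)
```

It’s a long way from pulling out tagged ESRS datapoints the way Arelle does with iXBRL, but it would at least show whether the parsed text is clean enough to be worth querying.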

That’s the next challenge.

