Data Science Collective

Advice, insights, and ideas from the Medium data science community

SmolDocling: A New Era in Document Processing — OCR

A model that outperforms competitors up to 27 times its size, using the DocTags format

Buse Şenol
9 min read · Mar 24, 2025

Document understanding and conversion have become some of the most critical components of today's digitalization efforts. SmolDocling, a new development in this field, stands out as an ultra-compact vision model designed for end-to-end document conversion.

The paper describing the model, written jointly by Hugging Face and IBM, was published on March 14, 2025. Let's look at what the paper says and how the model is implemented.

What is SmolDocling?

SmolDocling is an ultra-compact model derived from Hugging Face’s SmolVLM-256M and is 5–10 times smaller than other vision models. With only 256 million parameters, it performs at a level that competes with vision models up to 27 times larger.
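To put these figures in perspective, here is a rough back-of-the-envelope calculation. Only the 256 million parameter count and the 27x ratio come from the text above; the 2 bytes per parameter (16-bit weights) is an assumption, and the result covers the weights alone, not activations or any runtime overhead.

```python
# Back-of-the-envelope size comparison.
# Only the 256M and 27x figures come from the article.
SMOLDOCLING_PARAMS = 256_000_000  # parameter count stated in the article
SIZE_RATIO = 27                   # "27 times larger", as stated in the article
BYTES_PER_PARAM_16BIT = 2         # assumption: weights stored as fp16/bf16


def weight_memory_gib(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_16BIT) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Size of a model 27 times larger than SmolDocling.
larger_model_params = SMOLDOCLING_PARAMS * SIZE_RATIO

print(f"Comparison model: ~{larger_model_params / 1e9:.1f}B parameters")
print(f"SmolDocling weights: ~{weight_memory_gib(SMOLDOCLING_PARAMS):.2f} GiB")
print(f"Comparison weights:  ~{weight_memory_gib(larger_model_params):.2f} GiB")
```

Under these assumptions, a model 27 times larger has roughly 6.9 billion parameters, and its weights would need over ten GiB where SmolDocling needs about half a GiB.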

One of the most important features of SmolDocling is its ability to fully represent the content and structure of document pages. The model can capture not only the content, but also the document structure and the positioning of elements within the page.

DocTags Format

SmolDocling uses a dedicated output format for document conversion called “DocTags”. DocTags is an XML-like markup language that defines the key attributes of document elements. The format has the following key features:

Basic Structure of DocTags

DocTags define three basic properties of each document element. Element type identifies the kind of content component: text, image, table, code, title, footnote, and so on. Position on page gives the exact placement of the element on the page. Content represents the textual or structural content of the element and encompasses the actual information contained within the…
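As a minimal sketch of how these three properties could be read back out of XML-like markup, the snippet below parses a small hand-written sample. The sample string, its tag names, and the four `<loc_…>` bounding-box tokens are assumptions made for illustration, not quoted from the paper, and `DocElement` and `parse_elements` are hypothetical helpers, not part of any library.

```python
import re
from dataclasses import dataclass

# Hypothetical DocTags-like sample, written by hand for illustration.
# Tag names and the <loc_...> coordinate convention are assumptions.
SAMPLE = (
    "<title><loc_40><loc_20><loc_460><loc_48>Quarterly Report</title>"
    "<text><loc_40><loc_60><loc_460><loc_120>Revenue grew in the third quarter.</text>"
)


@dataclass
class DocElement:
    element_type: str                 # e.g. "title", "text"
    bbox: tuple[int, int, int, int]   # assumed order: x1, y1, x2, y2
    content: str                      # text carried by the element


# One element: opening tag, four location tokens, content, matching closing tag.
ELEMENT_RE = re.compile(
    r"<(?P<type>[a-z_]+)>"
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)><loc_(?P<x2>\d+)><loc_(?P<y2>\d+)>"
    r"(?P<content>.*?)"
    r"</(?P=type)>",
    re.DOTALL,
)


def parse_elements(markup: str) -> list[DocElement]:
    """Extract (type, position, content) triples from DocTags-like markup."""
    return [
        DocElement(
            element_type=m.group("type"),
            bbox=(int(m.group("x1")), int(m.group("y1")),
                  int(m.group("x2")), int(m.group("y2"))),
            content=m.group("content").strip(),
        )
        for m in ELEMENT_RE.finditer(markup)
    ]


for element in parse_elements(SAMPLE):
    print(element)
```

A regular expression is enough for this flat sample; nested elements such as tables would need a proper parser.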
