The fastest non-VLM parser that preserves document structure (tables, headings, lists) is OpenDataLoader PDF.


This content originally appeared on DEV Community and was authored by Julia 👩🏻‍💻

🚀 We saw room to improve latency, so we profiled the parser. We initially expected the sorting algorithm (XY-Cut++) to be the bottleneck, but it accounted for less than **1%** of the total time. The real cost was hiding in content filtering (55%) and preprocessing (25%).
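
A minimal sketch of how per-stage shares like these can be measured, assuming a pipeline split into named stages. The stage names and the placeholder bodies in `parse_page` are hypothetical and are not taken from the OpenDataLoader PDF code base; only the timing pattern is the point.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per pipeline stage.
stage_seconds = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the elapsed time of the wrapped block to `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


def stage_shares(totals):
    """Return each stage's share of the total time as a percentage."""
    total = sum(totals.values())
    if total == 0:
        return {name: 0.0 for name in totals}
    return {name: 100.0 * secs / total for name, secs in totals.items()}


def parse_page(page):
    # Hypothetical stand-ins for the real stages (assumption, not project code).
    with timed("preprocessing"):
        tokens = page.split()
    with timed("content_filtering"):
        kept = [t for t in tokens if t.strip()]
    with timed("sorting"):
        ordered = sorted(kept)
    return ordered


if __name__ == "__main__":
    for page in ["b a c", "z y x"] * 1000:
        parse_page(page)
    shares = stage_shares(stage_seconds)
    # Print the most expensive stage first.
    for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:20s} {pct:5.1f}%")
```

Measuring first is what redirected the effort here: a stage that takes under 1% of the time cannot yield a meaningful speedup, however well it is tuned.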

Benchmarks
🖇️3 fixes applied
💥Page-level parallel processing
💥Hidden text detection → opt-in
💥Text-only fast path
💢Output is byte-for-byte identical before and after optimization. Only the speed changed; the results stay the same.
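
The first fix and the identical-output guarantee can be sketched together, assuming pages can be parsed independently of each other. `parse_page` is a hypothetical placeholder, and none of this is the project's actual implementation. Threads are used only to keep the sketch short; in CPython, CPU-bound work would more likely need a process pool to see a real speedup.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def parse_page(page_text):
    """Hypothetical per-page parser; a placeholder for the real work."""
    return " ".join(page_text.split()).upper()


def parse_serial(pages):
    """Baseline: parse pages one after another."""
    return [parse_page(p) for p in pages]


def parse_parallel(pages, max_workers=4):
    """Parse pages concurrently while keeping the original page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, not completion order.
        return list(pool.map(parse_page, pages))


def digest(outputs):
    """SHA-256 over all page outputs, used to compare two runs byte for byte."""
    h = hashlib.sha256()
    for out in outputs:
        h.update(out.encode("utf-8"))
        h.update(b"\x00")  # separator so page boundaries affect the hash
    return h.hexdigest()


if __name__ == "__main__":
    pages = [f"page {i}   some   text" for i in range(100)]
    before = digest(parse_serial(pages))
    after = digest(parse_parallel(pages))
    print("identical output:", before == after)
```

Comparing digests of the old and new output is a cheap regression check: if an optimization changes even one byte, the hashes differ.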

🖇️OpenDataLoader PDF highlights
🚀#1 in latency 🥇(585 pages in 1.10s)
🗃️#1 in memory efficiency 🥇(7.4MB)
💢Java · Python · Node.js SDK
💢Multiple output formats (text, markdown, HTML, JSON, PDF)
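
For a sense of scale, the latency figure above converts to throughput and per-page time as follows. This is plain arithmetic on the numbers quoted in the post, not an additional measurement.

```python
pages = 585
seconds = 1.10

pages_per_second = pages / seconds
ms_per_page = seconds / pages * 1000

print(f"{pages_per_second:.0f} pages/s, {ms_per_page:.2f} ms/page")
# prints "532 pages/s, 1.88 ms/page"
```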

Check out the benchmark below for latency and memory usage results. See the PR for full details on what changed and how we got here. We'd love your feedback if you try it out!

GitHub: http://github.com/opendataloader-project/opendataloader-pdf?utm_source=x&utm_medium=social&utm_campaign=perf_update
Benchmark: http://github.com/opendataloader-project/opendataloader-bench?utm_source=x&utm_medium=social&utm_campaign=perf_update
PR: https://github.com/opendataloader-project/opendataloader-pdf/pull/362?utm_source=x&utm_medium=social&utm_campaign=perf_update




