FinePDFs: Liberating 3T of the finest tokens from PDFs - a Hugging Face Space by HuggingFaceFW

@kala ・ Jan 19, 2026

Hugging Face introduces FinePDFs, a large open dataset built by extracting and cleaning text from millions of PDF documents, reaching about 3 trillion tokens across many languages. The post explains how the pipeline handles messy PDF structure, layout noise, duplication, and low-quality content to produce training-ready text. The goal is to unlock high-value data from PDFs, such as scientific and technical documents, and make it openly available for model training and research.
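
To make the deduplication and low-quality filtering steps more concrete, here is a minimal Python sketch. It is not the FinePDFs pipeline: the function names, the thresholds, and the choice of exact hash-based deduplication are all assumptions made for illustration, and a real pipeline would likely rely on more robust methods.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def looks_low_quality(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: flag text that is very short or mostly non-letters."""
    words = text.split()
    if len(words) < min_words:
        return True
    non_space = [c for c in text if not c.isspace()]
    alpha = sum(c.isalpha() for c in non_space)
    return alpha / len(non_space) < min_alpha_ratio


def clean_corpus(docs):
    """Keep the first copy of each document that passes the quality heuristic."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if looks_low_quality(norm):
            continue  # drop layout noise such as page numbers or table debris
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept
```

Calling `clean_corpus` on a list of extracted page texts returns the surviving texts in their original order. Hashing the normalized text only catches exact duplicates, so near-duplicates with small wording differences would pass through this sketch.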
