Read and extract text and other content from PDFs in C# (port of PDFBox)
Updated Jun 9, 2026 - C#
OCR engine for all the languages
Document Layout Analysis resources for development with PdfPig.
Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
Page to PAGE Layout Analysis Tool
A powerful CLI tool for visualization and encoding of PAGE-XML files
OCR-D guidelines for Ground Truth production
Graph based Layout Analysis for Historical Manuscripts in data scarce settings
The repo gt_structure_1_4 is part of the OCR-D Ground Truth Structure corpus. Only the structure of the printed page is annotated. The corpus was created as a result of the DFG project OCR-D.
PaddleOCR-based base segmentation pipeline for vertically-written ancient Chinese manuscripts, producing PAGE-XML for eScriptorium.
The repo gt_structure_1_3 is part of the OCR-D Ground Truth Structure corpus. Only the structure of the printed page is annotated. The corpus was created as a result of the DFG project OCR-D.