Notice: This resource is provided by a third-party author. Please review the code with AI tools or manually before use to ensure security and compatibility.
Python · PaddlePaddle/PaddleOCR

PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

96.7/100
Stars: 72.4K · Forks: 10.0K

Similar Projects

docstrange

52

Extract and convert data from any document, image, PDF, Word file, PowerPoint file, or URL into multiple formats (Markdown, JSON, CSV, HTML), with intelligent structured data extraction and advanced OCR.

Python · 1.4K

MinerU

88

Transforms complex documents like PDFs into LLM-ready Markdown/JSON for your agentic workflows.

Python · 56.3K

text-extract-api

64

Document (PDF, Word, PPTX ...) extraction and parsing API using state-of-the-art OCR engines and Ollama-supported models. Anonymizes documents, removes PII, and converts any document or picture to structured JSON or Markdown.

Python · 3.0K

llm_aided_ocr

68

Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking, and Markdown formatting of scanned PDFs.

Python · 2.9K