Turn PDFs into Markdown

Private, local, and accurate OCR using Ollama and DeepSeek-OCR.

Get Started

What is OCR?

OCR is a command-line tool that converts PDF documents into formatted Markdown text. It works by rendering PDF pages as images and feeding them into the deepseek-ocr:latest model via Ollama.

Unlike cloud-based solutions, this runs entirely on your machine—keeping your documents private while leveraging state-of-the-art vision-language models.

Features

🔒 Privacy First

Everything runs locally through Ollama. No data is ever sent to the cloud.

📄 PDF to Markdown

Converts scanned documents or slides directly into clean, editable Markdown format.

🎯 Page Selection

Process specific pages, ranges, or exclude parts of the document easily with CLI flags.

🤖 AI-Powered

Uses deepseek-ocr, a specialized model for understanding layout and text in images.

Requirements

Ollama running locally with the model pulled:
```
ollama pull deepseek-ocr:latest
```
Poppler (required for pdf2image):
Debian/Ubuntu: sudo apt-get install poppler-utils
macOS: brew install poppler

Installation

The recommended way to install is via pipx:

pipx install git+https://github.com/arrase/OCR.git

Or with pip:

pip install git+https://github.com/arrase/OCR.git

Usage

Run the tool on any PDF file:

ocr document.pdf

This will create document.md in the same directory.

Page Selection

You can selectively process pages using --include and --exclude (1-based page numbers).

Process only the first page:

ocr --include 1 document.pdf

Process pages 1 through 5, skipping page 3:

ocr --include 1-5 --exclude 3 document.pdf

Complex combinations:

ocr --include 1,3,5-8 --exclude 6-7 document.pdf

Configuration

The tool supports a YAML configuration file. You can specify a custom path with the --config flag.

Configuration File

By default, it looks for ~/ocr_config.yaml. Example structure:

model: deepseek-ocr:latest
base_url: http://localhost:11434/v1
prompt: |
  Convert the document to markdown.

Environment Variables

Environment variables take precedence over the configuration file:

OLLAMA_BASE_URL
OLLAMA_MODEL