OCR for Images and Video

Features

Users & Accounts

The application lets users register and log in, manage their profile, and work within a credit-based and storage-quota model. Each processing operation consumes a configurable number of credits, and the system checks availability before starting work. Storage usage is tracked and limited per user across all resource types.

Image Processing

Image processing supports uploads in common formats (JPG, PNG, WEBP), either singly or in batches. Users can run OCR in three quality modes: Low (fast, PaddleOCR), Medium (balanced, with document structure via PaddleOCRVL), and High (highest quality, DeepSeek-OCR). In Medium and High modes the system can return structured output (tables, headings, paragraphs). Extraction can be limited to a user-selected region of the image.
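A minimal sketch of how mode selection and region limiting might look. The engine names come from the description above; the dictionary layout, the `structured` flag, and the `clamp_region` helper are illustrative assumptions, not the project's actual code.

```python
from typing import Optional, Tuple

Region = Tuple[int, int, int, int]  # (x, y, width, height) in pixels

# Engine names are taken from the text; keys and flags are hypothetical.
OCR_MODES = {
    "low": {"engine": "PaddleOCR", "structured": False},
    "medium": {"engine": "PaddleOCRVL", "structured": True},
    "high": {"engine": "DeepSeek-OCR", "structured": True},
}


def clamp_region(region: Region, width: int, height: int) -> Optional[Region]:
    """Clip a user-selected region to the image bounds.

    Returns None when the region lies entirely outside the image.
    """
    x, y, w, h = region
    left, top = max(0, x), max(0, y)
    right, bottom = min(width, x + w), min(height, y + h)
    if right <= left or bottom <= top:
        return None
    return (left, top, right - left, bottom - top)
```

Clamping before OCR means a selection dragged partly outside the image still yields a valid crop instead of an error.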

Video Processing

Video handling includes upload, metadata extraction (duration, resolution, FPS), and automatic generation of a preview thumbnail. Burned-in (hardcoded) subtitles are detected and extracted by analyzing only the frames where the subtitle content changes; the system produces SRT files and supports re-processing. Subtitle files can also be uploaded and managed independently, edited with versioning, and downloaded in original or translated versions.
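To illustrate how change-point frames can become an SRT file, here is a small sketch. It assumes the OCR step yields a sorted list of (timestamp, text) pairs, one per frame where the subtitle changed, with empty text meaning "no subtitle on screen". The function names are hypothetical.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def format_timestamp(seconds: float) -> str:
    """Render seconds as the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def frames_to_cues(frames: List[Tuple[float, str]], video_end: float) -> List[Cue]:
    """Each change point starts a cue that lasts until the next change point."""
    cues = []
    for i, (start, text) in enumerate(frames):
        end = frames[i + 1][0] if i + 1 < len(frames) else video_end
        if text.strip():  # skip intervals with no subtitle on screen
            cues.append((start, end, text.strip()))
    return cues


def build_srt(cues: List[Cue]) -> str:
    """Join numbered cue blocks, separated by a blank line."""
    blocks = [
        f"{n}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        for n, (start, end, text) in enumerate(cues, start=1)
    ]
    return "\n".join(blocks)
```

For example, `build_srt(frames_to_cues([(1.0, "Hello"), (2.5, ""), (4.0, "World")], 6.0))` yields two cues, the first ending when the subtitle disappears at 2.5 s.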

AI & Translation

AI features include translation of extracted text and subtitles to a chosen language, with optional context to improve quality, and integration with multiple providers (e.g. Gemini). The system can correct typical OCR errors while preserving structure, and can generate explanations of recognized text. Translation jobs support full lifecycle control (Stop, Resume, Retry), preserve already translated segments, and retry only failed parts.
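The Resume and Retry behaviour can be sketched as follows. This is an assumption about the general shape, not the project's implementation: `done` stands for whatever store keeps already translated segments, and `translate` stands for a call to a translation provider.

```python
from typing import Callable, Dict, List


def translate_pending(
    segments: List[str],
    done: Dict[int, str],
    translate: Callable[[str], str],
) -> List[int]:
    """Translate only segments missing from `done`; return indices that failed.

    Segments already present in `done` are preserved untouched, so calling
    this again after a failure retries only the failed parts.
    """
    failed = []
    for index, segment in enumerate(segments):
        if index in done:
            continue  # already translated in an earlier run
        try:
            done[index] = translate(segment)
        except Exception:
            failed.append(index)  # leave for the next retry
    return failed
```

Calling the function a second time with the same `done` mapping touches only the indices returned as failed by the first call.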

Outputs & History

Users can download processed outputs (text and SRT), and the system applies retention policies to remove files after a configured time. Activity history is recorded and shown for videos, images, and subtitles, with filtering and sorting by resource type and date. An administrative panel gives privileged users access to user management and system statistics.

Technical Stack

Orchestration Layer

The orchestration layer is implemented in Laravel (PHP). Laravel handles authentication, authorization, validation, routing, and storage abstraction, and cleanly separates request handling, business logic in services, and background job dispatching. MySQL is the primary durable store for users, file metadata, task results, and credit accounting; the relational model fits user–file relationships and the need for transactional integrity on credits and quotas.

Message Broker & Coordination

Redis is used as a multi-purpose in-memory component: as a message broker for job queues, as a fast store for temporary status keys, for real-time progress via publish/subscribe, and for distributed coordination of shared GPU resources. Workers consume tasks from Redis queues and publish status updates back through Redis and via HTTP callbacks to the backend.

Processing Layer

The processing layer is implemented in Python, where most OCR, computer vision, and model inference tooling lives. Python workers run as stateless processes that pull jobs from Redis, run OCR and video pipelines, and report results through the backend API. File exchange is done via controlled download and upload endpoints rather than shared disk, so workers can run on separate machines.

Frontend

The frontend is built with React, Inertia.js, and Tailwind CSS. Inertia keeps routing and controllers on the server in Laravel while rendering React components, giving a responsive UI without a fully separate SPA deployment and avoiding duplication of routing and authorization logic.

Technical Decisions

Asynchronous Architecture

The system uses a two-stage asynchronous workflow: the web backend accepts requests, persists metadata, and enqueues processing tasks; dedicated worker processes execute those tasks and report back. This microservice-inspired split keeps long-running OCR and video work out of the HTTP path and allows horizontal scaling of workers without changing core business logic.
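The two stages can be sketched in a few lines. In this illustration an in-process `queue.Queue` stands in for the Redis queue, a dictionary stands in for the MySQL task table, and `report_result` stands in for the HTTP callback; all names are hypothetical, and the real backend is Laravel rather than Python.

```python
import json
import queue
import uuid
from typing import Callable, Dict

task_queue: "queue.Queue[str]" = queue.Queue()  # stand-in for a Redis queue
tasks: Dict[str, dict] = {}                     # stand-in for persisted task rows


def accept_request(user_id: int, file_id: str, mode: str) -> str:
    """Stage 1 (web backend): persist metadata, enqueue, return immediately."""
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"user_id": user_id, "status": "queued", "result": None}
    task_queue.put(json.dumps({"task_id": task_id, "file_id": file_id, "mode": mode}))
    return task_id


def report_result(task_id: str, status: str, result) -> None:
    """Stand-in for the worker's callback to the backend API."""
    tasks[task_id].update(status=status, result=result)


def worker_step(process: Callable[[dict], object]) -> bool:
    """Stage 2 (worker): take one task, run it, report back. False if idle."""
    try:
        payload = json.loads(task_queue.get_nowait())
    except queue.Empty:
        return False
    try:
        report_result(payload["task_id"], "completed", process(payload))
    except Exception as exc:
        report_result(payload["task_id"], "failed", str(exc))
    return True
```

Because the request handler only writes a row and a queue entry, its latency stays independent of how long OCR or video processing takes.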

Security

Security is layered: session-based authentication for the web UI, fine-grained permissions enforced in policies so each operation checks resource ownership and required rights, and a shared secret for backend-to-backend calls so workers authenticate to the Laravel API without exposing user credentials. File access is restricted by disk and path; filenames are represented by UUIDs to avoid unsafe user-controlled names.
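Two of these measures are easy to sketch. The text only says that a shared secret is used; signing the request body with HMAC-SHA256, as below, is one common way to do that and is an assumption here, as are the function names and the extension allow-list.

```python
import hashlib
import hmac
import uuid
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}  # illustrative allow-list


def sign(secret: bytes, body: bytes) -> str:
    """Worker side: compute a signature over the request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Backend side: recompute and compare in constant time."""
    return hmac.compare_digest(sign(secret, body), signature)


def stored_name(original_name: str) -> str:
    """Replace a user-supplied filename with a UUID, keeping only a vetted extension."""
    extension = PurePosixPath(original_name).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {extension!r}")
    return f"{uuid.uuid4()}{extension}"
```

Since the stored name never contains user input beyond a checked extension, path traversal attempts such as `../../etc/passwd.PNG` collapse to a harmless `<uuid>.png`.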

File Handling

File handling is treated as a first-class concern. Storage is namespaced per user for quota and cleanup. Files are written to a temporary location and then moved atomically to the final path. A shared file cache with LRU-style eviction and reference counting avoids duplicate downloads and prevents removal of files still in use. Redis is used to coordinate download locks so the same file is not fetched multiple times by different workers.
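The write-then-move step and the reference-counted cache could look roughly like this. It is a single-process sketch: the real cache is shared between workers and coordinated through Redis locks, which is not modelled here, and the class and method names are hypothetical.

```python
import os
import tempfile
from collections import OrderedDict


def atomic_write(final_path: str, data: bytes) -> None:
    """Write to a temp file in the target directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, final_path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


class RefCountedCache:
    """LRU-style cache index that never evicts entries still in use."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._refs: "OrderedDict[str, int]" = OrderedDict()  # oldest first

    def acquire(self, key: str) -> bool:
        """Pin an entry. Returns True on a hit, False if the caller must download."""
        hit = key in self._refs
        self._refs[key] = self._refs.get(key, 0) + 1
        self._refs.move_to_end(key)  # mark as most recently used
        self._evict()
        return hit

    def release(self, key: str) -> None:
        """Unpin an entry; it becomes eligible for eviction at refcount 0."""
        self._refs[key] -= 1
        self._evict()

    def _evict(self) -> None:
        for key in list(self._refs):  # least recently used first
            if len(self._refs) <= self.max_entries:
                break
            if self._refs[key] == 0:
                del self._refs[key]
```

Creating the temporary file in the destination directory keeps the final rename on one filesystem, which is what makes `os.replace` atomic.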

GPU Resource Management

GPU capacity is managed by a distributed ResourceManager backed by Redis. A single integer counter tracks reserved GPU memory in MB; before starting GPU-heavy work a worker atomically increments this counter by the cost of its mode (low, medium, high). If the new total exceeds the configured budget, the increment is rolled back and the job is requeued with a short delay. When the job finishes or fails, the worker decrements the counter and removes its allocation record. Because the counter lives in Redis, workers can run on separate hosts, and stale allocations left behind by crashed workers can be cleared. Workers may also skip registering medium or high OCR services altogether when the budget is too small to ever fit them.
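The increment-then-check pattern can be sketched as below. To keep the example runnable without a Redis server, `InMemoryCounter` imitates the atomic increment and decrement that Redis provides (for instance via INCRBY and DECRBY); the per-mode costs and all names other than ResourceManager are made up for illustration.

```python
import threading
from typing import Dict

MODE_COST_MB = {"low": 2000, "medium": 6000, "high": 12000}  # hypothetical costs


class InMemoryCounter:
    """Stand-in for an atomic Redis counter."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def incrby(self, amount: int) -> int:
        with self._lock:
            self._value += amount
            return self._value

    def decrby(self, amount: int) -> int:
        return self.incrby(-amount)


class ResourceManager:
    def __init__(self, counter: InMemoryCounter, budget_mb: int) -> None:
        self.counter = counter
        self.budget_mb = budget_mb
        self.allocations: Dict[str, int] = {}  # job_id -> reserved MB

    def try_reserve(self, job_id: str, mode: str) -> bool:
        """Increment first, then check; roll back if the budget is exceeded."""
        cost = MODE_COST_MB[mode]
        if self.counter.incrby(cost) > self.budget_mb:
            self.counter.decrby(cost)  # roll back; caller requeues with a delay
            return False
        self.allocations[job_id] = cost
        return True

    def release(self, job_id: str) -> None:
        """Called when the job finishes or fails; safe to call twice."""
        cost = self.allocations.pop(job_id, None)
        if cost is not None:
            self.counter.decrby(cost)
```

Incrementing before checking avoids a read-then-write race: two workers can never both observe free capacity and then jointly exceed the budget.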

Worker Scheduling

Workers use a chameleon-style scheduler to reduce model warm-up overhead. Each worker has a state (e.g. cold, OCR-ready, heavy) and an inertia counter; when OCR work is ongoing and the image queue has pending jobs, the worker prefers to continue with OCR to reuse loaded models instead of switching to other task types. Task types are grouped so workers minimize context switching.
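One possible reading of this scheduler is sketched below. The text does not define the inertia counter precisely; here it is assumed to cap how many consecutive jobs a worker takes from the queue matching its current state before it considers switching. The queue names, the state mapping, and `max_inertia` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Which loaded-model state each queue leaves the worker in (illustrative).
QUEUE_STATE = {"image": "ocr_ready", "video": "heavy"}


@dataclass
class Worker:
    state: str = "cold"
    inertia: int = 0
    max_inertia: int = 5  # hypothetical cap on consecutive same-type jobs

    def pick_queue(self, pending: Dict[str, int]) -> Optional[str]:
        """Choose the next queue given the number of pending jobs per queue."""
        sticky = next((q for q, s in QUEUE_STATE.items() if s == self.state), None)
        # Prefer the queue whose models are already loaded.
        if sticky and pending.get(sticky, 0) > 0 and self.inertia < self.max_inertia:
            self.inertia += 1
            return sticky
        # Otherwise switch to the busiest other queue, if any.
        candidates = [q for q in QUEUE_STATE if q != sticky and pending.get(q, 0) > 0]
        if not candidates and sticky and pending.get(sticky, 0) > 0:
            candidates = [sticky]  # nothing else to do; stay put
        if not candidates:
            return None
        chosen = max(candidates, key=lambda q: pending[q])
        self.state = QUEUE_STATE[chosen]
        self.inertia = 1
        return chosen
```

The cap is a design choice of this sketch: without it, a steady stream of image jobs could starve the other queues.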

OCR & Video Pipelines

OCR and video processing are split by responsibility. Image jobs are dispatched with a processing mode and an optional extraction region; the worker resolves the input file (via URL or local path), selects the appropriate OCR service, and returns either flat text with confidence scores or structured blocks. Video metadata (including the thumbnail) is handled on a separate queue. Subtitle extraction uses an external tool to detect candidate frames, runs OCR in parallel on a reduced set of images, and builds an SRT file with timestamps. The pipeline can overlap frame discovery with streaming OCR and uses simple heuristics (e.g. file size stability) to avoid reading partially written frames. A tail-drain pass after the external process exits catches late frames.
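The file size stability heuristic might be implemented along these lines. This is a sketch under assumptions: the external tool is taken to write frames into a directory, and the check counts, interval, and function names are invented for the example.

```python
import os
import time
from typing import List, Set


def is_stable(path: str, checks: int = 2, interval: float = 0.05) -> bool:
    """True when the file is non-empty and its size stays unchanged across checks."""
    try:
        last = os.path.getsize(path)
    except OSError:
        return False
    for _ in range(checks):
        time.sleep(interval)
        try:
            current = os.path.getsize(path)
        except OSError:
            return False
        if current == 0 or current != last:
            return False  # still being written (or vanished content)
        last = current
    return True


def drain_ready_frames(directory: str, seen: Set[str]) -> List[str]:
    """Return paths of new, fully written frames and remember them in `seen`.

    Call repeatedly while the external tool runs, then once more after it
    exits (the tail-drain pass) to pick up frames written at the very end.
    """
    ready = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name not in seen and is_stable(path):
            seen.add(name)
            ready.append(path)
    return ready
```

Frames that fail the check are not marked as seen, so a later pass picks them up once the writer has finished.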

Credits & Quotas

Credits are deducted only when a resource first transitions to completed status, avoiding charges for failed jobs. Storage quota is enforced by a dedicated service that aggregates usage across images, videos, and subtitles and compares it to the user’s limit before accepting new uploads. All critical state changes and billing-related updates are wrapped in database transactions to preserve consistency under asynchronous execution.

Configuration & Observability

Configuration is driven by environment variables so the same codebase can be tuned per deployment without code changes. Logging is structured to support multiple processes, and job status is published in real time so the UI can show progress and completion without relying solely on polling. The design prioritizes responsiveness under load, horizontal scalability of workers, isolation of GPU-heavy work, and clear accountability for user data and credits. The cost is higher operational complexity and eventual consistency, which is addressed through explicit status fields and careful state modeling in the UI.
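As a sketch of environment-driven configuration on the worker side, the snippet below reads settings once at startup. The variable names (`REDIS_URL`, `GPU_BUDGET_MB`, `REQUEUE_DELAY_S`) and the defaults are hypothetical; the project's actual names are not given here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkerConfig:
    redis_url: str
    gpu_budget_mb: int
    requeue_delay_s: float


def load_config(env: Mapping[str, str] = os.environ) -> WorkerConfig:
    """Build the config from environment variables, falling back to defaults."""
    return WorkerConfig(
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        gpu_budget_mb=int(env.get("GPU_BUDGET_MB", "8000")),
        requeue_delay_s=float(env.get("REQUEUE_DELAY_S", "2.0")),
    )
```

Passing the mapping as a parameter keeps the loader easy to test, for example `load_config({"GPU_BUDGET_MB": "16000"})`.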