
Releases: anyparser/anyparser_core

anyparser-core@1.0.2

26 Feb 07:50
cae9428

Changes

User Agent

  • Added a User-Agent header.
  • Moved the version literal __version__ to a separate module to prevent circular imports (a minimal sketch of this layout follows).
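
A minimal sketch of this layout, assuming a dedicated version module (the actual filenames in the package may differ):

# anyparser_core/version.py (filename assumed for illustration)
__version__ = "1.0.2"

# anyparser_core/__init__.py re-exports it
from .version import __version__

# The HTTP client builds its User-Agent from the version module directly,
# so it never needs to import the package __init__ and no circular import
# can occur.
USER_AGENT = f"anyparser-core/{__version__}"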

Rename "OCRPreset" to "OcrPreset"

This pull request renames the OCRPreset class to OcrPreset across the codebase for consistency in naming conventions.

  • Renamed OCRPreset to OcrPreset in files like README.md, anyparser_core/__init__.py, and examples.
  • Updated variable names and documentation to reflect the new class name.
  • Modified test files to use the updated class.

This change is purely a refactor with no functional impact, aiming for consistency and improved readability.

Breaking Changes

The class OCRPreset has been renamed to OcrPreset to maintain consistency in naming conventions.

Migration Guide

Search and replace all instances of OCRPreset with OcrPreset in your codebase.
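
For example, code written against 1.0.1 or earlier changes as follows (the option name matches the OCR example later in this README):

# Before (anyparser-core <= 1.0.1)
from anyparser_core import AnyparserOption, OCRPreset

options = AnyparserOption(ocr_preset=OCRPreset.SCAN)

# After (anyparser-core >= 1.0.2)
from anyparser_core import AnyparserOption, OcrPreset

options = AnyparserOption(ocr_preset=OcrPreset.SCAN)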

anyparser-core@1.0.1

11 Feb 07:29

Changes

  • Updated branding from "AnyParser" to "Anyparser" throughout the codebase
  • Added empty path validation in path validator
  • Added version information to package
  • Updated documentation with AI-focused benefits and examples
  • Updated URLs from app.anyparser.com to studio.anyparser.com
  • Added type improvements for crawl directives
  • Added support for unavailable_after in crawl directives
  • Added text and images fields to AnyparserUrl (see the sketch below)
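
A hedged sketch of reading these new fields from a crawl result, reusing the crawler options shown in the 1.0.0 examples below; the attribute path for unavailable_after is an assumption beyond what this changelog states:

import asyncio

from anyparser_core import Anyparser, AnyparserOption

options = AnyparserOption(model="crawler", format="json")
parser = Anyparser(options)

result = asyncio.run(parser.parse("https://anyparser.com"))

for candidate in result:
    print("Robots directive :", candidate.robots_directive)

    for page in candidate.items:  # each crawled page is an AnyparserUrl
        print("URL    :", page.url)
        print("Text   :", page.text)    # new in 1.0.1
        print("Images :", page.images)  # new in 1.0.1

        # unavailable_after support was added to crawl directives in this
        # release; the exact attribute name is assumed here, so guard for it.
        if getattr(page, "unavailable_after", None) is not None:
            print("Unavailable after:", page.unavailable_after)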

Breaking Changes

None

Migration Guide

No migration needed for this release.

anyparser-core@1.0.0

11 Feb 06:48

Anyparser Core: Your Foundation for AI Data Preparation

https://anyparser.com

Unlock the potential of your AI models with Anyparser Core, the Python SDK designed for high-performance content extraction and format conversion. Built for developers, this SDK streamlines the process of acquiring clean, structured data from diverse sources, making it an indispensable tool for building cutting-edge applications in Retrieval Augmented Generation (RAG), Agentic AI, Generative AI, and robust ETL Pipelines.

Key Benefits for AI Developers:

  • Rapid Data Acquisition for RAG: Extract information up to 10x faster than traditional methods, accelerating the creation of your knowledge bases for efficient RAG implementations.
  • High-Accuracy Data for Generative AI: Achieve up to 10x improvement in extraction accuracy, ensuring your Generative AI models are trained and operate on reliable, high-quality data. Output in JSON or Markdown is directly consumable by AI processes.
  • Cost-Effective Knowledge Base Construction: Efficiently build and maintain knowledge bases from unstructured data, significantly reducing the overhead for RAG, Agentic AI, and other AI applications.
  • Developer-First Design: Unlimited local usage (fair use policies apply) allows for rapid experimentation and seamless integration into your existing AI workflows.
  • Optimized for ETL Pipelines: Provides a robust extraction layer for your ETL processes, handling a wide variety of file types and URLs to feed your data lakes and AI systems.

Get Started Quickly:

  1. Free Access: Obtain your API credentials and start building your AI data pipelines today at Anyparser Studio.
  2. Installation: Install the SDK with a simple pip command.
  3. Run Examples: Copy and paste the provided examples to see how easy it is to extract data for your AI projects.

Before starting, add a new API key in Anyparser Studio, then export your credentials:

export ANYPARSER_API_URL=https://anyparserapi.com
export ANYPARSER_API_KEY=<your-api-key>

or

export ANYPARSER_API_URL=https://eu.anyparserapi.com
export ANYPARSER_API_KEY=<your-api-key>

Installation

pip install anyparser-core

Core Usage Examples for AI Applications

These examples demonstrate how to use Anyparser Core for common AI tasks, arranged from basic to advanced usage.

Example 1: Quick Start with Single Document

When you're just getting started or prototyping, you can use this simplified approach with minimal configuration:

import asyncio

from anyparser_core import Anyparser

single_file = "docs/sample.docx"

# Instantiate with default settings, assuming API credentials are
# set as environment variables.
parser = Anyparser()

result = asyncio.run(parser.parse(single_file))
print(result)

Example 2: Building a RAG Knowledge Base from Local Documents

This example showcases how to extract structured data from local files with full configuration, preparing them for indexing in a RAG system. The JSON output is ideal for vector databases and downstream AI processing. Perfect for building your initial knowledge base with high-quality, structured data.

import os
import asyncio

from anyparser_core import Anyparser, AnyparserOption

single_file = "docs/sample.docx"

options = AnyparserOption(
    api_url=os.getenv("ANYPARSER_API_URL"),
    api_key=os.getenv("ANYPARSER_API_KEY"),
    format="json",
    image=True,
    table=True,
)

parser = Anyparser(options)

result = asyncio.run(parser.parse(single_file))

for item in result:
    print("-" * 100)
    print("File:", item.original_filename)
    print("Checksum:", item.checksum)
    print("Total characters:", item.total_characters)
    print("Markdown:", item.markdown)

Example 3: OCR Processing for Image-Based Documents

Extract text from images and scanned documents using our advanced OCR capabilities. This example shows how to configure language and preset options for optimal results, particularly useful for processing historical documents, receipts, or any image-based content:

import os
import asyncio

from anyparser_core import Anyparser, AnyparserOption, OcrLanguage, OCRPreset

single_file = "docs/document.png"

options = AnyparserOption(
    api_url=os.getenv("ANYPARSER_API_URL"),
    api_key=os.getenv("ANYPARSER_API_KEY"),
    model="ocr",
    format="markdown",
    ocr_language=[OcrLanguage.JAPANESE],
    ocr_preset=OCRPreset.SCAN,
)

parser = Anyparser(options)

result = asyncio.run(parser.parse(single_file))
print(result)

Example 4: Processing Multiple Documents for Batch RAG Updates

This example demonstrates how to process multiple documents in a single batch, ideal for updating your RAG knowledge base or processing document collections efficiently:

import os
import asyncio

from anyparser_core import Anyparser, AnyparserOption

multiple_files = ["docs/sample.docx", "docs/sample.pdf"]

options = AnyparserOption(
    api_url=os.getenv("ANYPARSER_API_URL"),
    api_key=os.getenv("ANYPARSER_API_KEY"),
    format="json",
    image=True,
    table=True,
)

parser = Anyparser(options)

result = asyncio.run(parser.parse(multiple_files))

for item in result:
    print("-" * 100)
    print("File:", item.original_filename)
    print("Checksum:", item.checksum)
    print("Total characters:", item.total_characters)
    print("Markdown:", item.markdown)

print("-" * 100)

Example 5: Web Crawling for Dynamic Knowledge Base Updates

Keep your knowledge base fresh with our powerful web crawling capabilities. This example shows how to crawl websites while respecting robots.txt directives and maintaining politeness delays:

import os
import asyncio

from anyparser_core import Anyparser, AnyparserOption

item = "https://anyparser.com"

options = AnyparserOption(
    api_url=os.getenv("ANYPARSER_API_URL"),
    api_key=os.getenv("ANYPARSER_API_KEY"),
    model="crawler",
    format="json",
    max_depth=50,
    max_executions=2,
    strategy="LIFO",
    traversal_scope="subtree",
)

parser = Anyparser(options)

result = asyncio.run(parser.parse(item))

for candidate in result:
    print("Start URL            :", candidate.start_url)
    print("Total characters     :", candidate.total_characters)
    print("Total items          :", candidate.total_items)
    print("Robots directive     :", candidate.robots_directive)
    print("\n")
    print("*" * 100)
    print("Begin Crawl")
    print("*" * 100)
    print("\n")

    for index, page in enumerate(candidate.items):
        if index > 0:
            print("-" * 100)
            print("\n")

        print("URL                  :", page.url)
        print("Title                :", page.title)
        print("Status message       :", page.status_message)
        print("Total characters     :", page.total_characters)
        print("Politeness delay     :", page.politeness_delay)
        print("Content:\n")
        print(page.markdown)

    print("*" * 100)
    print("End Crawl")
    print("*" * 100)
    print("\n")

Configuration for Optimized AI Workloads

The Anyparser class utilizes the AnyparserOption dataclass for flexible configuration, allowing you to fine-tune the extraction process for different AI use cases.

from dataclasses import dataclass
from typing import List, Literal, Optional, Union

from anyparser_core import OcrLanguage, OCRPreset

@dataclass
class AnyparserOption:
    """Configuration options for the Anyparser API."""
    
    # API Configuration
    api_url: Optional[str] = None  # API endpoint URL, defaults to environment variable ANYPARSER_API_URL
    api_key: Optional[str] = None  # API key, defaults to environment variable ANYPARSER_API_KEY
    
    # Output Format
    format: Literal["json", "markdown", "html"] = "json"  # Output format
    
    # Processing Model
    model: Literal["text", "ocr", "vlm", "lam", "crawler"] = "text"  # Processing model to use
    
    # Text Processing
    encoding: Literal["utf-8", "latin1"] = "utf-8"  # Text encoding
    
    # Content Extraction
    image: Optional[bool] = None  # Enable/disable image extraction
    table: Optional[bool] = None  # Enable/disable table extraction
    
    # Input Sources
    files: Optional[Union[str, List[str]]] = None  # Input files to process
    url: Optional[str] = None  # URL for crawler model
    
    # OCR Configuration
    ocr_language: Optional[List[OcrLanguage]] = None  # Languages for OCR processing
    ocr_preset: Optional[OCRPreset] = None  # Preset configuration for OCR
    
    # Crawler Configuration
    max_depth: Optional[int] = None  # Maximum crawl depth
    max_executions: Optional[int] = None  # Maximum number of pages to crawl
    strategy: Optional[Literal["LIFO", "FIFO"]] = None  # Crawling strategy
    traversal_scope: Optional[Literal["subtree", "domain"]] = None  # Crawling scope

Key Configuration Parameters:

  • api_url (Optional[str], default None): API endpoint URL. Defaults to the ANYPARSER_API_URL environment variable.
  • api_key (Optional[str], default None): API key for authentication. Defaults to the ANYPARSER_API_KEY environment variable.
  • format (str, default "json"): Output format: "json", "markdown", or "html".
  • model (str, default "text"): Processing model: "text", "ocr", "vlm", "lam", or "crawler".
  • encoding (str, default "utf-8"): Text encoding: "utf-8" or "latin1".
  • image (Optional[bool]): ...