# Aurelio Platform SDK: API Reference
## Installation

To install the Aurelio SDK, use pip or Poetry:

```bash
pip install aurelio-sdk
```
## Authentication

The SDK requires an API key for authentication. Get your key from the Aurelio Platform and set it as an environment variable:

```bash
export AURELIO_API_KEY=your_api_key_here
```
See the examples for more details.

```python
import os

from aurelio_sdk import AurelioClient

client = AurelioClient(api_key=os.environ["AURELIO_API_KEY"])
```
Or use the asynchronous client:

```python
from aurelio_sdk import AsyncAurelioClient

client = AsyncAurelioClient(api_key="your_api_key_here")
```
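The async client is awaited from inside a coroutine. A minimal usage sketch, assuming the async client's methods mirror the sync API as coroutines; a stub stands in for `AsyncAurelioClient` here so the snippet runs without the SDK installed:

```python
import asyncio

# Stub standing in for AsyncAurelioClient so this sketch is self-contained.
# Assumption: the real async client's methods mirror the sync API as coroutines.
class StubAsyncClient:
    def __init__(self, api_key: str):
        self.api_key = api_key

    async def chunk(self, content: str, processing_options=None):
        # The real client would call the Aurelio API here.
        return {"status": "completed", "num_chunks": 1}

async def main() -> dict:
    client = StubAsyncClient(api_key="your_api_key_here")
    # With the real SDK: response = await client.chunk(content="...")
    return await client.chunk(content="Your text here to be chunked")

result = asyncio.run(main())
print(result["status"])  # completed
```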
## Chunking

```python
from aurelio_sdk import ChunkingOptions, ChunkResponse

# All options are optional and have default values
chunking_options = ChunkingOptions(
    chunker_type="semantic", max_chunk_length=400, window_size=5
)

response: ChunkResponse = client.chunk(
    content="Your text here to be chunked", processing_options=chunking_options
)
```
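To make the length option concrete, here is a toy illustration of what a cap like `max_chunk_length` enforces. This is not the SDK's semantic chunker (which groups text by meaning within a sliding window); it is a hypothetical word-count splitter for intuition only:

```python
def naive_chunk(text: str, max_chunk_length: int = 400) -> list[str]:
    """Split text into chunks of at most `max_chunk_length` words.

    Toy stand-in for intuition only: the SDK's semantic chunker instead
    groups content by similarity rather than cutting at fixed word counts.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_chunk_length])
        for i in range(0, len(words), max_chunk_length)
    ]

chunks = naive_chunk("one two three four five six", max_chunk_length=4)
print(chunks)  # two chunks: 4 words, then the remaining 2
```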
## Extracting Text from Files

```python
from aurelio_sdk import ExtractResponse

# From a local file
file_path = "path/to/your/file.pdf"

response_pdf_file: ExtractResponse = client.extract_file(
    file_path=file_path, model="aurelio-base", chunk=True, wait=-1
)

# For higher accuracy on complex documents
response_pdf_file_high: ExtractResponse = client.extract_file(
    file_path=file_path, model="docling-base", chunk=True, wait=-1
)

# For state-of-the-art text extraction using a VLM (may be more expensive)
response_pdf_file_vlm: ExtractResponse = client.extract_file(
    file_path=file_path, model="gemini-2-flash-lite", chunk=True, wait=-1
)
```
```python
from aurelio_sdk import ExtractResponse

# From a local file
file_path = "path/to/your/file.mp4"

# Video files only support the aurelio-base model
response_video_file: ExtractResponse = client.extract_file(
    file_path=file_path,
    model="aurelio-base",
    chunk=True,
    wait=-1,
    processing_options={
        "chunking": {
            "chunker_type": "semantic"  # For better semantic chunking
        }
    },
)
```
## Extracting Text from URLs

```python
from aurelio_sdk import ExtractResponse

# From a URL
url = "https://arxiv.org/pdf/2408.15291"

response_pdf_url: ExtractResponse = client.extract_url(
    url=url, model="aurelio-base", chunk=True, wait=-1
)

# For more complex PDFs requiring higher accuracy
response_pdf_url_high: ExtractResponse = client.extract_url(
    url=url, model="docling-base", chunk=True, wait=-1
)
```
```python
from aurelio_sdk import ExtractResponse

# From a URL
url = "https://storage.googleapis.com/gtv-videos-bucket/sample/ForBiggerMeltdowns.mp4"

response_video_url: ExtractResponse = client.extract_url(
    url=url,
    model="aurelio-base",  # Only model supported for video
    chunk=True,
    wait=-1,
    processing_options={
        "chunking": {
            "chunker_type": "semantic"  # For better semantic chunking
        }
    },
)
```
```python
# Set a wait time for large files with more accurate models
# Here the wait time is set to 10 seconds
response_pdf_url: ExtractResponse = client.extract_url(
    url="https://arxiv.org/pdf/2408.15291", model="docling-base", chunk=True, wait=10
)
```
## Checking Document Status

```python
# Get document status and response
document_response: ExtractResponse = client.get_document(
    document_id=response_pdf_file.document.id
)
print("Status:", document_response.status)

# Use the pre-built helper, which avoids long-hanging requests (recommended)
document_response = client.wait_for(
    document_id=response_pdf_file.document.id, wait=300
)
```
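A helper like `wait_for` polls the document status until processing finishes or a timeout elapses. A minimal sketch of that polling pattern, using a stub status function in place of a real API call so it runs standalone:

```python
import time

def wait_for_document(get_status, document_id: str, wait: int = 300,
                      poll_interval: float = 1.0) -> str:
    """Poll `get_status` until the document completes or `wait` seconds pass.

    Sketch of the polling pattern behind a helper like `client.wait_for`;
    `get_status` stands in for a real status request to the API.
    """
    deadline = time.monotonic() + wait
    while time.monotonic() < deadline:
        status = get_status(document_id)
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"Document {document_id} still pending after {wait}s")

# Stub: pretend the document completes on the third poll.
calls = {"n": 0}
def fake_status(document_id: str) -> str:
    calls["n"] += 1
    return "completed" if calls["n"] >= 3 else "pending"

print(wait_for_document(fake_status, "doc_123", wait=30, poll_interval=0.01))  # completed
```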
## Embeddings

```python
from aurelio_sdk import EmbeddingResponse

response: EmbeddingResponse = client.embedding(
    input="Your text here to be embedded", model="bm25"
)

# Or with a list of texts
response: EmbeddingResponse = client.embedding(
    input=["Your text here to be embedded", "Your text here to be embedded"],
    model="bm25",
)
```
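The embedding endpoint accepts either a single string or a list of strings. When embedding many documents you may want to batch requests; a hypothetical batching helper (the batch size of 100 is an illustrative assumption, not a documented SDK limit, and `embed` stands in for a method like `client.embedding`):

```python
from typing import Callable, Iterable

def embed_in_batches(embed: Callable[[list[str]], list],
                     texts: Iterable[str], batch_size: int = 100) -> list:
    """Call `embed` on successive slices of `texts` and concatenate results.

    `embed` stands in for something like `client.embedding`; the default
    batch size is an assumption for illustration, not a documented limit.
    """
    texts = list(texts)
    results: list = []
    for i in range(0, len(texts), batch_size):
        results.extend(embed(texts[i : i + batch_size]))
    return results

# Stub embedder: returns one (fake) vector per input text.
fake_embed = lambda batch: [[len(t)] for t in batch]
vectors = embed_in_batches(fake_embed, ["a", "bb", "ccc"], batch_size=2)
print(len(vectors))  # 3
```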
## Response Structure

The `ExtractResponse` object contains the following key information:

- `status`: The current status of the extraction task
- `usage`: Information about token usage, pages processed, and processing time
- `message`: Any relevant messages about the extraction process
- `document`: The extracted document information, including its ID
- `chunks`: The extracted text, divided into chunks if chunking was enabled
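In practice you typically check `status` and then iterate over `chunks`. A sketch of that access pattern using plain dataclasses that mirror the fields listed above; the real SDK defines its own response models, so these classes are stand-ins for illustration only:

```python
from dataclasses import dataclass, field

# Stand-in models mirroring the documented fields; the real SDK
# defines its own ExtractResponse and chunk types.
@dataclass
class Chunk:
    id: str
    content: str

@dataclass
class ExtractResponseSketch:
    status: str
    chunks: list = field(default_factory=list)

response = ExtractResponseSketch(
    status="completed",
    chunks=[Chunk(id="c0", content="First chunk"),
            Chunk(id="c1", content="Second chunk")],
)

if response.status == "completed":
    for chunk in response.chunks:
        print(chunk.id, chunk.content)
```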
The `EmbeddingResponse` object contains the following key information:

- `message`: Any relevant messages about the embedding process
- `model`: The model name used for embedding
- `usage`: Information about token usage, pages processed, and processing time
- `data`: The embedded documents
## Best Practices

- Use appropriate wait times based on your use case and file sizes.
- Use the async client for better performance.
- For large files, or when processing might take longer, poll for the result rather than holding a long-hanging request open.
- Always handle potential exceptions and check the status of the response.
- Choose the appropriate model for your needs:
  - `aurelio-base`: Fastest and cheapest; good for clean PDFs (equivalent to the old "low" quality setting)
  - `docling-base`: More accurate for complex documents (equivalent to the old "high" quality setting)
  - `gemini-2-flash-lite`: State-of-the-art VLM-based extraction; highest accuracy, but with potential for rare hallucinations
- For videos, only `aurelio-base` is supported, but you can customize chunking via `processing_options`.
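The advice to handle exceptions and check response status can be sketched as a small retry wrapper. This is a generic pattern, not an SDK utility; `extract` stands in for a method like `client.extract_file`, and a broad `Exception` catch is used because the SDK's specific exception types are not shown in this document:

```python
def extract_safely(extract, *, retries: int = 2, **kwargs):
    """Call `extract(**kwargs)`, retrying on errors and checking status.

    Generic sketch of the error-handling advice above; `extract` stands in
    for a client method, and Exception is caught broadly because the SDK's
    exception types are not documented here.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            response = extract(**kwargs)
            if getattr(response, "status", None) == "failed":
                raise RuntimeError("extraction reported status 'failed'")
            return response
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"extraction failed after {retries + 1} attempts") from last_error

# Stub that fails once with a transient error, then succeeds.
state = {"calls": 0}
class _Resp:
    status = "completed"
def flaky_extract(**kwargs):
    state["calls"] += 1
    if state["calls"] == 1:
        raise ConnectionError("transient network error")
    return _Resp()

print(extract_safely(flaky_extract, file_path="path/to/your/file.pdf").status)  # completed
```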