feat(docling): add docling
hongbo-miao committed Dec 24, 2024
1 parent a8803e2 commit 8067f19
Showing 11 changed files with 1,663 additions and 12 deletions.
37 changes: 37 additions & 0 deletions .github/workflows/test.yml
@@ -56,6 +56,7 @@ jobs:
      national-instruments-hm-veristand: ${{ steps.filter.outputs.national-instruments-hm-veristand }}
      hm-autogluon: ${{ steps.filter.outputs.hm-autogluon }}
      hm-aws-parallelcluster: ${{ steps.filter.outputs.hm-aws-parallelcluster }}
      hm-docling: ${{ steps.filter.outputs.hm-docling }}
      hm-duckdb-query-duckdb: ${{ steps.filter.outputs.hm-duckdb-query-duckdb }}
      hm-duckdb-query-protobuf: ${{ steps.filter.outputs.hm-duckdb-query-protobuf }}
      hm-flax: ${{ steps.filter.outputs.hm-flax }}
@@ -240,6 +241,9 @@ jobs:
hm-aws-parallelcluster:
  - '.github/workflows/test.yml'
  - 'cloud-platform/aws/aws-parallelcluster/pcluster/**'
hm-docling:
  - '.github/workflows/test.yml'
  - 'machine-learning/hm-docling/**'
hm-duckdb-query-duckdb:
  - '.github/workflows/test.yml'
  - 'data-storage/hm-duckdb/query-duckdb/**'
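The new `hm-docling` filter means the Docling job should run only when the workflow file itself or anything under `machine-learning/hm-docling/` changes. The sketch below approximates that decision in Python. It assumes the filter step uses ordinary glob-style matching, and it substitutes `fnmatchcase` for whatever matcher the step really uses, so edge cases may differ. The names `HM_DOCLING_PATTERNS` and `job_should_run` are hypothetical and exist only for this illustration.

```python
from fnmatch import fnmatchcase

# Patterns copied from the hm-docling filter entry in the diff.
HM_DOCLING_PATTERNS = [
    ".github/workflows/test.yml",
    "machine-learning/hm-docling/**",
]


def job_should_run(changed_files: list[str], patterns: list[str]) -> bool:
    """Return True if any changed file matches any pattern.

    Approximation only: fnmatchcase lets `*` match across `/`,
    which is sufficient for the two patterns above.
    """
    return any(
        fnmatchcase(changed_file, pattern)
        for changed_file in changed_files
        for pattern in patterns
    )
```

Under this approximation, a change to `machine-learning/hm-docling/src/main.py` triggers the job, while a change to `README.md` alone does not.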
@@ -1915,6 +1919,39 @@ jobs:
        with:
          directory: machine-learning/hm-kubeflow/pipelines/classify-mnist

  docling-test:
    name: Docling | Test
    needs: detect-changes
    if: ${{ needs.detect-changes.outputs.hm-docling == 'true' }}
    runs-on: ubuntu-24.04
    environment: test
    timeout-minutes: 10
    steps:
      - name: Checkout
        uses: actions/checkout@v4.2.2
      - name: Install uv
        uses: astral-sh/setup-uv@v5.0.1
        with:
          version: 0.5.11
          enable-cache: true
          cache-dependency-glob: machine-learning/hm-docling/uv.lock
      - name: Set up Python
        uses: actions/setup-python@v5.3.0
        with:
          python-version-file: machine-learning/hm-docling/pyproject.toml
      - name: Install dependencies
        working-directory: machine-learning/hm-docling
        run: |
          uv sync --dev
      - name: Test
        working-directory: machine-learning/hm-docling
        run: |
          uv run poe test-coverage
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v5.1.2
        with:
          directory: machine-learning/hm-docling

  langchain-chat-pdf-test:
    name: LangChain (chat-pdf) | Test
    needs: detect-changes
3 changes: 3 additions & 0 deletions .mergify.yml
@@ -311,6 +311,9 @@ pull_request_rules:
- or:
    - check-success=Kubeflow (classify-mnist) | Test
    - check-skipped=Kubeflow (classify-mnist) | Test
- or:
    - check-success=Docling | Test
    - check-skipped=Docling | Test
- or:
    - check-success=LangChain (chat-pdf) | Test
    - check-skipped=LangChain (chat-pdf) | Test
1 change: 1 addition & 0 deletions README.md
@@ -424,6 +424,7 @@ The diagram illustrates the repository's architecture, which is considered overl

- **LlamaIndex** - LLM application framework
- **LangChain** - LLM application framework
- **Docling** - Document parsing and conversion
- **GPT4All** - Local LLM models
- **LiteLLM** - LLM gateway
- **Open WebUI** - AI chat interface
13 changes: 13 additions & 0 deletions machine-learning/hm-docling/Makefile
@@ -0,0 +1,13 @@
uv-install-python:
	uv python install
uv-update-lock-file:
	uv lock
uv-install-dependencies:
	uv sync --dev

uv-run-dev:
	uv run poe dev
uv-run-test:
	uv run poe test
uv-run-test-coverage:
	uv run poe test-coverage
27 changes: 27 additions & 0 deletions machine-learning/hm-docling/pyproject.toml
@@ -0,0 +1,27 @@
[project]
name = "hm-docling"
version = "1.0.0"
requires-python = "~=3.12.0"
dependencies = [
"docling==2.14.0",
]

[dependency-groups]
dev = [
"poethepoet==0.31.1",
"pytest==8.3.4",
"pytest-cov==6.0.0",
]

[tool.uv]
package = false

[[tool.uv.index]]
name = "pytorch-cu124"
url = "https://download.pytorch.org/whl/cu124"
explicit = true

[tool.poe.tasks]
dev = "python src/main.py"
test = "pytest --verbose --verbose"
test-coverage = "pytest --cov=. --cov-report=xml"
3 changes: 3 additions & 0 deletions machine-learning/hm-docling/src/dummy_test.py
@@ -0,0 +1,3 @@
class TestDummy:
    def test_dummy(self):
        assert 1 + 1 == 2
36 changes: 36 additions & 0 deletions machine-learning/hm-docling/src/main.py
@@ -0,0 +1,36 @@
import logging
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def main() -> None:
    data_dir = Path("data")
    pdf_paths = data_dir.glob("**/*.pdf")

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

    # Pass the pipeline options to the converter; a bare DocumentConverter()
    # would ignore them and use the defaults.
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    for pdf_path in pdf_paths:
        try:
            # Convert PDF to markdown
            res = converter.convert(pdf_path)
            markdown_content = res.document.export_to_markdown()

            # Write markdown to file
            markdown_path = pdf_path.with_suffix(".md")
            markdown_path.write_text(markdown_content, encoding="utf-8")
            logging.info(f"Converted {pdf_path.name}")
        except Exception:
            # Log failures at error level, including the traceback
            logging.exception(f"Error processing {pdf_path.name}")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()