Add support for HSBC non-OCR statements #164

Merged 27 commits on Sep 5, 2024
Commits
f1752b1
chore(pdf): remove old get_byte_stream function
benjamin-awd Sep 1, 2024
c85b614
refactor(pdf): make PdfDocument a child class of fitz.Document
benjamin-awd Sep 1, 2024
c6c5146
chore: remove old mock_document fixture
benjamin-awd Sep 1, 2024
4c2cb88
feat(banks/hsbc): add support for non-OCR credit statements
benjamin-awd Sep 2, 2024
dceffdd
chore(constants): remove case insensitive modifier from formats with …
benjamin-awd Sep 2, 2024
46a8020
refactor(pdf): use file_path as first arg to PdfDocument
benjamin-awd Sep 2, 2024
18b0b08
build(deps): add ocrmypdf as a system dependency
benjamin-awd Sep 2, 2024
b7274a3
chore(pdf): improve ocrmypdf performance
benjamin-awd Sep 3, 2024
472d823
chore(pipeline): shorten create_handler function signature
benjamin-awd Sep 3, 2024
c58d108
refactor(pipeline): move parser & handler creation logic to extract
benjamin-awd Sep 3, 2024
6960cf9
refactor: pass PdfPages instead of parser
benjamin-awd Sep 3, 2024
db743c1
chore(generic): move GenericBank to generic __init__
benjamin-awd Sep 3, 2024
fcae170
chore(pipeline): import Transaction from statements namespace
benjamin-awd Sep 3, 2024
7bc4ed1
chore: rename generic/generic_handler to generic/handler
benjamin-awd Sep 3, 2024
1fedd41
refactor(pipeline): move bank detection logic to CLI
benjamin-awd Sep 3, 2024
25a68bf
refactor(detector): move detector to banks namespace
benjamin-awd Sep 3, 2024
22fe99e
chore: import from pymupdf instead of fitz
benjamin-awd Sep 3, 2024
62ba9f2
refactor: remove unnecessary usage of pydantic dataclasses
benjamin-awd Sep 3, 2024
b7af37b
refactor(pdf): add metadata identifier attr to PdfDocument
benjamin-awd Sep 4, 2024
190510d
refactor(banks/base): fix type hint for identifiers
benjamin-awd Sep 4, 2024
4c67671
build(deps): move ocrmypdf to extras
benjamin-awd Sep 4, 2024
67943c3
refactor(pdf): lazily import ocrmypdf
benjamin-awd Sep 4, 2024
ec356ea
refactor(pdf): perform ocr based on metadata identifiers
benjamin-awd Sep 4, 2024
b7e4638
chore: linting for ocr changes
benjamin-awd Sep 4, 2024
88daff3
refactor(pipeline): move parser instantiation logic to CLI
benjamin-awd Sep 5, 2024
502b0aa
refactor(pipeline): allow custom document to be passed
benjamin-awd Sep 5, 2024
dd6ee26
docs(README): add note about OCR feature
benjamin-awd Sep 5, 2024
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -15,7 +15,7 @@ jobs:
- name: Install pdftotext
uses: daaku/gh-action-apt-install@v4
with:
packages: build-essential libpoppler-cpp-dev pkg-config
packages: build-essential libpoppler-cpp-dev pkg-config ocrmypdf

- name: Setup Python & Poetry
uses: ./.github/actions/setup-python-poetry
2 changes: 1 addition & 1 deletion .github/workflows/publish.yaml
@@ -23,7 +23,7 @@ jobs:
- name: Install pdftotext
uses: daaku/gh-action-apt-install@v4
with:
packages: build-essential libpoppler-cpp-dev pkg-config
packages: build-essential libpoppler-cpp-dev pkg-config ocrmypdf

- name: Setup Python & Poetry
uses: ./.github/actions/setup-python-poetry
2 changes: 1 addition & 1 deletion .github/workflows/tests.yaml
@@ -15,7 +15,7 @@ jobs:
- name: Install pdftotext
uses: daaku/gh-action-apt-install@v4
with:
packages: build-essential libpoppler-cpp-dev pkg-config
packages: build-essential libpoppler-cpp-dev pkg-config ocrmypdf

- name: Setup Python & Poetry
uses: ./.github/actions/setup-python-poetry
6 changes: 3 additions & 3 deletions README.md
@@ -26,13 +26,13 @@ Monopoly is a pip-installable Python package on [PyPI](https://pypi.org/project/
Since Monopoly uses `pdftotext`, you'll need to install additional dependencies:

```sh
apt-get install build-essential libpoppler-cpp-dev pkg-config
apt-get install build-essential libpoppler-cpp-dev pkg-config ocrmypdf
```

or

```sh
brew install gcc@11 pkg-config poppler
brew install gcc@11 pkg-config poppler ocrmypdf
```

Then install with pipx:
@@ -72,7 +72,7 @@ python3 src/monopoly/examples/single_statement.py
## Features
- Parses PDFs using predefined configuration classes per bank.
- Handles locked PDFs with credentials passed via environment variables.
- Supports a variety of date/number formats and determines if a transaction is debit or credit.
- Supports adding OCR for image-based bank statements.
- Provides a generic parser that can be used without any predefined configuration (caveat emptor).
- Includes a safety check (enabled by default) that validates totals for debit or credit statements.

1,147 changes: 1,005 additions & 142 deletions poetry.lock

Large diffs are not rendered by default.

10 changes: 7 additions & 3 deletions pyproject.toml
@@ -26,7 +26,7 @@ tabulate = "^0.9.0"
pydantic = "^2.5.2"
dateparser = "^1.2.0"
strenum = "^0.4.15"

ocrmypdf = { version = "^16.5.0", optional = true }

[tool.poetry.group.dev.dependencies]
black = ">=23.7,<25.0"
@@ -43,9 +43,11 @@ types-tabulate = "^0.9.0.20240106"
pytest-xdist = "^3.6.1"
flake8 = "^7.0.0"
ruff = ">=0.4.7,<0.7.0"
git-cliff = "^2.3.0"

[tool.poetry.extras]
ocr = ["ocrmypdf"]

git-cliff = "^2.3.0"
[tool.taskipy.tasks]
format = "isort . && black ."
lint = "flake8 src && pylint src && ruff check src"
@@ -85,7 +87,9 @@ disable_error_code = [

[[tool.mypy.overrides]]
module = [
"fitz",
"pymupdf",
"ocrmypdf",
"ocrmypdf.exceptions",
"pdftotext",
"pdf2john",
]
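
The `pyproject.toml` change above makes `ocrmypdf` an optional dependency behind the `ocr` extra, and a later commit in this PR imports it lazily. The actual import code is not part of this excerpt, so the following is only a minimal sketch of the general pattern; the helper name `import_optional` and its error message are hypothetical.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, extra: str) -> ModuleType:
    """Import a module only when it is needed (hypothetical helper).

    Raises a RuntimeError that names the extra to install when the
    optional dependency is missing.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise RuntimeError(
            f"'{module_name}' is not installed; "
            f"install the '{extra}' extra to enable this feature"
        ) from err


# The import cost is only paid when a code path actually needs the module.
json_module = import_optional("json", "ocr")
print(json_module.__name__)
```

Deferring the import this way keeps the base install usable for users who never process image-based statements.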
7 changes: 5 additions & 2 deletions src/monopoly/banks/__init__.py
@@ -1,16 +1,17 @@
import logging
from typing import Type

from ..examples.example_bank import ExampleBank
from .base import BankBase
from .citibank import Citibank
from .dbs import Dbs
from .detector import BankDetector
from .example_bank import ExampleBank
from .hsbc import Hsbc
from .maybank import Maybank
from .ocbc import Ocbc
from .standard_chartered import StandardChartered

banks: list[Type[BankBase]] = [
banks: list[Type["BankBase"]] = [
Citibank,
Dbs,
ExampleBank,
@@ -21,3 +22,5 @@
]

logger = logging.getLogger(__name__)

__all__ = ["BankDetector", "BankBase", *[bank.__name__ for bank in banks]]
2 changes: 2 additions & 0 deletions src/monopoly/banks/base.py
@@ -1,4 +1,5 @@
import logging
from typing import Any

from monopoly.config import PdfConfig, StatementConfig

@@ -15,6 +16,7 @@ class BankBase:

statement_configs: list[StatementConfig]
pdf_config: PdfConfig = PdfConfig()
identifiers: list[list[Any]]

def __init_subclass__(cls, **kwargs) -> None:
if not hasattr(cls, "statement_configs"):
18 changes: 11 additions & 7 deletions src/monopoly/bank_detector.py → src/monopoly/banks/detector.py
@@ -1,12 +1,14 @@
import logging
from dataclasses import Field, fields
from functools import cached_property
from typing import Any, Type
from typing import TYPE_CHECKING, Any, Type

from monopoly.banks import BankBase, banks
from monopoly.identifiers import Identifier, MetadataIdentifier, TextIdentifier
from monopoly.identifiers import Identifier, TextIdentifier
from monopoly.pdf import PdfDocument

if TYPE_CHECKING:
from .base import BankBase

logger = logging.getLogger(__name__)


@@ -20,20 +22,22 @@ def metadata_items(self) -> list[Any]:
Retrieves encryption and metadata identifiers from a bank statement PDF
"""
identifiers: list[Identifier] = []
if metadata := self.document.open().metadata:
metadata_identifier = MetadataIdentifier(**metadata)
if metadata_identifier := self.document.metadata_identifier:
identifiers.append(metadata_identifier)

if not identifiers:
raise ValueError("Could not get identifier")

return identifiers

def detect_bank(self) -> Type[BankBase] | None:
def detect_bank(self, banks: list[Type["BankBase"]]) -> Type["BankBase"] | None:
"""
Reads the encryption metadata or actual metadata (if the PDF is not encrypted),
and checks for a bank based on unique identifiers.
"""
if not banks:
banks = []

logger.debug("Found PDF properties: %s", self.metadata_items)

for bank in banks:
@@ -43,7 +47,7 @@ def detect_bank(self) -> Type[BankBase] | None:

def is_bank_identified(
self,
bank: Type[BankBase],
bank: Type["BankBase"],
) -> bool:
"""
Checks if a bank is identified based on a list of metadata items.
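
With this change `detect_bank` receives the list of candidate banks as an argument, and each bank declares `identifiers` as a list of identifier groups. The body of `is_bank_identified` is not visible in this excerpt, so the sketch below assumes one plausible rule: a bank is detected when every identifier in at least one of its groups is found in the document. Identifiers are modeled as plain strings here, and `BankA` and `BankB` are hypothetical.

```python
from typing import Optional, Type


class BankBase:
    # Each inner list is one group of identifiers (plain strings in this sketch).
    identifiers: list[list[str]] = []


class BankA(BankBase):
    identifiers = [["creator:ToolA", "text:Bank A"]]


class BankB(BankBase):
    identifiers = [["creator:ToolB"], ["producer:EngineB"]]


def detect_bank(
    banks: list[Type[BankBase]], found: set[str]
) -> Optional[Type[BankBase]]:
    """Return the first bank with a fully matched identifier group (assumed rule)."""
    for bank in banks:
        if any(all(item in found for item in group) for group in bank.identifiers):
            return bank
    return None


print(detect_bank([BankA, BankB], {"producer:EngineB"}))
```

Passing the candidates in explicitly means the detector no longer has to import the bank list itself, which is what allows the caller to fall back to a generic bank when nothing matches.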
@@ -1,10 +1,11 @@
from re import compile as regex

from monopoly.banks.base import BankBase
from monopoly.config import StatementConfig
from monopoly.constants import EntryType, InternalBankNames, SharedPatterns
from monopoly.identifiers import TextIdentifier

from .base import BankBase


class ExampleBank(BankBase):
"""Dummy class to help with reading the example PDF statement"""
29 changes: 18 additions & 11 deletions src/monopoly/banks/hsbc/hsbc.py
@@ -26,19 +26,26 @@ class Hsbc(BankBase):
multiline_transactions=True,
)

email_statement_identifier = [
MetadataIdentifier(
title="PRJ_BEAGLE_ST_CNS_SGH_APP_Orchid",
author="Registered to: HSBCGLOB",
creator="OpenText Exstream",
),
TextIdentifier("HSBC"),
]

web_and_mobile_statement_identifier = [
MetadataIdentifier(
format="PDF 1.7", producer="OpenText Output Transformation Engine"
)
]

pdf_config = PdfConfig(
page_bbox=(0, 0, 379, 842),
page_bbox=(0, 0, 379, 840),
ocr_identifiers=web_and_mobile_statement_identifier,
)

identifiers = [
[
MetadataIdentifier(
title="PRJ_BEAGLE_ST_CNS_SGH_APP_Orchid",
author="Registered to: HSBCGLOB",
creator="OpenText Exstream",
),
TextIdentifier("HSBC"),
],
]
identifiers = [email_statement_identifier, web_and_mobile_statement_identifier]

statement_configs = [credit_config]
17 changes: 13 additions & 4 deletions src/monopoly/cli.py
@@ -1,10 +1,10 @@
import traceback
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass, field
from pathlib import Path
from typing import Collection, Iterable, Optional, TypedDict

import click
from pydantic.dataclasses import Field, dataclass
from tabulate import tabulate
from tqdm import tqdm

@@ -41,7 +41,7 @@ class Result:

source_file_name: str
target_file_name: Optional[str] = None
error_info: dict[str, str] = Field(default_factory=dict)
error_info: dict[str, str] = field(default_factory=dict)


@dataclass
@@ -123,10 +123,19 @@ def process_statement(
information about the processed statement. If an error occurs during processing,
returns a Result object with error information.
"""
from monopoly.pipeline import Pipeline # pylint: disable=import-outside-toplevel
# pylint: disable=import-outside-toplevel, too-many-locals
from monopoly.banks import BankDetector, banks
from monopoly.generic import GenericBank
from monopoly.pdf import PdfDocument, PdfParser
from monopoly.pipeline import Pipeline

try:
pipeline = Pipeline(file)
document = PdfDocument(file)
analyzer = BankDetector(document)
bank = analyzer.detect_bank(banks) or GenericBank
parser = PdfParser(bank, document)
pipeline = Pipeline(parser)

statement = pipeline.extract(safety_check=safety_check)
transactions = pipeline.transform(statement)
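
The `Result` change in this file replaces pydantic's `Field` with the standard library's `dataclasses.field`. As a small illustration of the stdlib behavior only (not a statement about why the author made the change), `field(default_factory=dict)` gives every instance its own dictionary:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Result:
    source_file_name: str
    target_file_name: Optional[str] = None
    # default_factory builds a fresh dict per instance; a bare `= {}` default
    # would be rejected by dataclasses as a mutable default.
    error_info: dict[str, str] = field(default_factory=dict)


first = Result("a.pdf")
second = Result("b.pdf")
first.error_info["message"] = "could not parse statement"

print(first.error_info)
print(second.error_info)
```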

7 changes: 4 additions & 3 deletions src/monopoly/config.py
@@ -1,10 +1,9 @@
from dataclasses import field
from dataclasses import dataclass, field
from typing import Optional, Pattern

from pydantic.dataclasses import dataclass

from monopoly.constants import BankNames, EntryType, InternalBankNames
from monopoly.enums import RegexEnum
from monopoly.identifiers import MetadataIdentifier


@dataclass
@@ -67,7 +66,9 @@ class PdfConfig:
- `page_bbox`: A tuple representing the bounding box range for every
page. This is used to avoid weirdness like vertical text, and other
PDF artifacts that may affect parsing.
- `ocr_identifiers`: Applies OCR on PDFs with a specific metadata identifier.
"""

page_range: tuple[Optional[int], Optional[int]] = (None, None)
page_bbox: Optional[tuple[float, float, float, float]] = None
ocr_identifiers: Optional[list[MetadataIdentifier]] = None
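
The new `ocr_identifiers` field lets a bank opt in to OCR only for PDFs whose metadata matches a given identifier. How the match is performed is not shown in this excerpt, so the sketch below assumes that every non-empty field of a configured identifier must equal the corresponding field read from the document. The dataclasses are simplified stand-ins and `should_apply_ocr` is a hypothetical helper; the metadata values are the ones used in the HSBC config in this PR.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataIdentifier:
    # Simplified stand-in; only a few metadata fields are modeled.
    format: str = ""
    producer: str = ""
    creator: str = ""


@dataclass
class PdfConfig:
    ocr_identifiers: Optional[list[MetadataIdentifier]] = None


def should_apply_ocr(config: PdfConfig, found: MetadataIdentifier) -> bool:
    """Return True when the document metadata matches any OCR identifier."""
    if not config.ocr_identifiers:
        return False
    found_fields = asdict(found)
    return any(
        all(
            found_fields[name] == value
            for name, value in asdict(expected).items()
            if value  # empty fields are treated as "don't care" (assumption)
        )
        for expected in config.ocr_identifiers
    )


config = PdfConfig(
    ocr_identifiers=[
        MetadataIdentifier(
            format="PDF 1.7", producer="OpenText Output Transformation Engine"
        )
    ]
)
document_metadata = MetadataIdentifier(
    format="PDF 1.7",
    producer="OpenText Output Transformation Engine",
    creator="unknown",
)
print(should_apply_ocr(config, document_metadata))
```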
13 changes: 7 additions & 6 deletions src/monopoly/constants/date.py
@@ -14,20 +14,21 @@
class DateFormats(StrEnum):
"""Holds a case-insensitive list of common ISO 8601 date formats"""

D = r"(?i:1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)"
DD = r"(?i:01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)"
M = r"(?i:1|2|3|4|5|6|7|8|9|10|11|12)"
MM = r"(?i:01|02|03|04|05|06|07|08|09|10|11|12)"
D = r"(1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)"
DD = r"(01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)"
M = r"(1|2|3|4|5|6|7|8|9|10|11|12)"
MM = r"(01|02|03|04|05|06|07|08|09|10|11|12)"
MMM = r"(?i:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
MMMM = r"(?i:January|February|March|April|May|June|July|August|September|October|November|December)"
YY = r"(?i:[2-5][0-9]\b)"
YYYY = r"(?i:20\d{2}\b)"
YY = r"([2-5][0-9]\b)"
YYYY = r"(20\d{2}\b)"


class ISO8601(RegexEnum):
DD_MM = rf"\b({DateFormats.DD}[\/\-\s]{DateFormats.MM})"
DD_MM_YY = rf"\b({DateFormats.DD}[\/\-\s]{DateFormats.MM}[\/\-\s]{DateFormats.YY})"
DD_MMM = rf"\b({DateFormats.DD}[-\s]{DateFormats.MMM})"
DD_MMM_RELAXED = DD_MMM.replace(r"[-\s]", r"(?:[-\s]|)")
DD_MMM_YY = rf"\b({DateFormats.DD}[-\s]{DateFormats.MMM}[-\s]{DateFormats.YY})"
DD_MMM_YYYY = (
rf"\b({DateFormats.DD}[-\s]{DateFormats.MMM}[,\s]{{1,2}}{DateFormats.YYYY})"
4 changes: 2 additions & 2 deletions src/monopoly/constants/statement.py
@@ -103,8 +103,8 @@ class CreditTransactionPatterns(RegexEnum):
+ SharedPatterns.AMOUNT_EXTENDED
)
HSBC = (
rf"(?P<posting_date>{ISO8601.DD_MMM})\s+"
rf"(?P<transaction_date>{ISO8601.DD_MMM})\s+"
rf"(?P<posting_date>{ISO8601.DD_MMM_RELAXED})\s+"
rf"(?P<transaction_date>{ISO8601.DD_MMM_RELAXED})\s+"
+ SharedPatterns.DESCRIPTION
+ SharedPatterns.AMOUNT_EXTENDED
)
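
The HSBC pattern now uses `DD_MMM_RELAXED`, which is derived from `DD_MMM` by making the separator between day and month optional. This is presumably meant for text where the space is missing (for example `12Aug`), although the PR does not state the reason in this excerpt. The sketch below rebuilds the two patterns as plain strings outside the enum classes; `find_date` is a hypothetical helper.

```python
import re
from typing import Optional

# Same alternation as DateFormats.DD in the diff: 01|02|...|31
DD = "(" + "|".join(f"{day:02d}" for day in range(1, 32)) + ")"
MMM = r"(?i:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

DD_MMM = rf"\b({DD}[-\s]{MMM})"
# Swap the mandatory separator for "separator or nothing".
DD_MMM_RELAXED = DD_MMM.replace(r"[-\s]", r"(?:[-\s]|)")


def find_date(pattern: str, text: str) -> Optional[str]:
    """Return the first date-like match in the text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(find_date(DD_MMM, "12Aug COFFEE SHOP 4.50"))
print(find_date(DD_MMM_RELAXED, "12Aug COFFEE SHOP 4.50"))
```

Because the relaxed pattern still tries the separator first, it matches `12 Aug` and `12-Aug` exactly as the strict pattern does.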
3 changes: 0 additions & 3 deletions src/monopoly/examples/__init__.py

This file was deleted.

17 changes: 10 additions & 7 deletions src/monopoly/examples/single_statement.py
@@ -1,3 +1,5 @@
from monopoly.banks import ExampleBank
from monopoly.pdf import PdfDocument, PdfParser
from monopoly.pipeline import Pipeline


@@ -6,13 +8,11 @@ def example():
a single bank statement

You can pass in the bank class if you want to specify a specific bank,
or ignore the bank argument and let the Pipeline try to automatically
detect the bank.
or use the BankDetector class to try to detect the bank automatically.
"""
pipeline = Pipeline(
file_path="src/monopoly/examples/example_statement.pdf",
# bank=ExampleBank
)
document = PdfDocument(file_path="src/monopoly/examples/example_statement.pdf")
parser = PdfParser(ExampleBank, document)
pipeline = Pipeline(parser)

# This runs pdftotext on the PDF and
# extracts transactions as raw text
@@ -22,12 +22,15 @@ def example():
transactions = pipeline.transform(statement)

# Parsed transactions written to a CSV file in the "example" directory
pipeline.load(
file_path = pipeline.load(
transactions=transactions,
statement=statement,
output_directory="src/monopoly/examples",
)

with open(file_path, encoding="utf8") as file:
print(file.read()[0:248])


if __name__ == "__main__":
example()
8 changes: 2 additions & 6 deletions src/monopoly/generic/__init__.py
@@ -1,8 +1,4 @@
from .generic import DateMatch, DatePatternAnalyzer
from .generic_handler import GenericStatementHandler
from .handler import GenericBank, GenericStatementHandler

__all__ = [
"DatePatternAnalyzer",
"DateMatch",
"GenericStatementHandler",
]
__all__ = ["DatePatternAnalyzer", "DateMatch", "GenericStatementHandler", "GenericBank"]