Add table graph tutorial
Romuald Rousseau committed Apr 30, 2024
1 parent a6ccd05 commit 6b3ba74
Showing 14 changed files with 117 additions and 282 deletions.
Binary file added data/AG120-N-074.pdf
Binary file not shown.
12 changes: 6 additions & 6 deletions dependencies
@@ -33,12 +33,12 @@ com.github.romualdrousseau:shuju-jackson:jar:1.23
com.fasterxml.jackson.core:jackson-databind:jar:2.15.2
com.fasterxml.jackson.core:jackson-annotations:jar:2.15.2
com.fasterxml.jackson.core:jackson-core:jar:2.15.2
com.github.romualdrousseau:any2json:jar:20240422.102355-9:2.38-SNAPSHOT
com.github.romualdrousseau:any2json:jar:2.38
org.python:jython-standalone:jar:2.7.3
com.github.romualdrousseau:any2json-layex-parser:jar:20240314.021346-4:2.38-SNAPSHOT
com.github.romualdrousseau:any2json-net-classifier:jar:20240314.021330-4:2.38-SNAPSHOT
com.github.romualdrousseau:any2json-csv:jar:20240314.021455-2:2.38-SNAPSHOT
com.github.romualdrousseau:any2json-excel:jar:20240314.021620-3:2.38-SNAPSHOT
com.github.romualdrousseau:any2json-layex-parser:jar:2.38
com.github.romualdrousseau:any2json-net-classifier:jar:2.38
com.github.romualdrousseau:any2json-csv:jar:2.38
com.github.romualdrousseau:any2json-excel:jar:2.38
org.apache.poi:poi:jar:5.2.3
commons-codec:commons-codec:jar:1.15
org.apache.commons:commons-collections4:jar:4.4
@@ -51,7 +51,7 @@ org.apache.xmlbeans:xmlbeans:jar:5.1.1
org.apache.commons:commons-compress:jar:1.21
com.github.virtuald:curvesapi:jar:1.07
org.apache.poi:poi-scratchpad:jar:5.2.3
com.github.romualdrousseau:any2json-pdf:jar:20240419.045004-6:2.38-SNAPSHOT
com.github.romualdrousseau:any2json-pdf:jar:2.38
technology.tabula:tabula:jar:1.0.5
org.locationtech.jts:jts-core:jar:1.18.1
org.slf4j:slf4j-simple:jar:1.7.32
146 changes: 0 additions & 146 deletions docs/how_it_works.md

This file was deleted.

26 changes: 5 additions & 21 deletions docs/index.md
@@ -1,4 +1,4 @@
# Welcome to Any2Json Documents
# Welcome to PyAny2Json Documents

***Revolutionizing Data Management: The Transformative Potential of a Novel Framework for Semi-Structured Documents***

@@ -11,26 +11,10 @@
* [Tutorial 5 - Data extraction with pivot](tutorial_5.md)
* [Tutorial 6 - More complex noise reduction](tutorial_6.md)
* [Tutorial 7 - Data extraction from PDF](tutorial_7.md)
* [Tutorial 8 - Make a classifier from scratch](tutorial_8.md)
* [Tutorial 8 - Data extraction from paginated PDF](tutorial_8.md)
* [Tutorial 9 - Browse the table graph](tutorial_9.md)
* [Tutorial 10 - Make a classifier from scratch](tutorial_10.md)

## How it works

* Please find detailed explanations on how Any2json works and its unique features [here](how_it_works.md)

## Plugins

* [Any2Json Layex Parser](https://github.com/RomualdRousseau/Any2Json-Layex-Parser/)
* [Any2Json Net Classifier](https://github.com/RomualdRousseau/Any2Json-Net-Classifier/)
* [Any2Json Csv](https://github.com/RomualdRousseau/Any2Json-Csv/)
* [Any2Json Excel](https://github.com/RomualdRousseau/Any2Json-Excel/)
* [Any2Json Dbf](https://github.com/RomualdRousseau/Any2Json-Dbf/)
* [Any2Json Parquet](https://github.com/RomualdRousseau/Any2Json-Parquet/)
* [Any2Json Pdf](https://github.com/RomualdRousseau/Any2Json-Pdf/)

## Models

* [Models](https://github.com/RomualdRousseau/Any2Json-Models/)

## Resources

* [White Papers](white_papers.md)
* Please find detailed explanations of how Any2Json works and its unique features [here](https://romualdrousseau.github.io/Any2Json-Documents/)
58 changes: 0 additions & 58 deletions docs/objects.txt

This file was deleted.

12 changes: 0 additions & 12 deletions docs/patents.md

This file was deleted.

13 changes: 13 additions & 0 deletions docs/tutorial_10.md
@@ -0,0 +1,13 @@
# Tutorial 10 - Make a classifier from scratch

[View source on GitHub](https://github.com/RomualdRousseau/Any2Json-Examples).

This tutorial is a continuation of [Tutorial 9](tutorial_9.md).

***Coming soon***

## Conclusion

Congratulations! You have loaded documents using Any2Json.

For more examples of using Any2Json, check out the [tutorials](index.md).
2 changes: 1 addition & 1 deletion docs/tutorial_8.md
@@ -1,4 +1,4 @@
# Tutorial 8 - Make a classifier from scratch
# Tutorial 8 - Data extraction from paginated PDF

[View source on GitHub](https://github.com/RomualdRousseau/Any2Json-Examples).

13 changes: 13 additions & 0 deletions docs/tutorial_9.md
@@ -0,0 +1,13 @@
# Tutorial 9 - Browse the table graph

[View source on GitHub](https://github.com/RomualdRousseau/Any2Json-Examples).

This tutorial is a continuation of [Tutorial 8](tutorial_8.md).

***Coming soon***

## Conclusion

Congratulations! You have loaded documents using Any2Json.

For more examples of using Any2Json, check out the [tutorials](index.md).
24 changes: 0 additions & 24 deletions docs/white_papers.md

This file was deleted.

40 changes: 40 additions & 0 deletions examples/tutorial9.py
@@ -0,0 +1,40 @@
from pyany2json import ModelBuilder, LayexTableParser, DocumentFactory, INTELLI_LAYOUT
from pyany2json.document_factory import DataTable, TableGraph


REPO_BASE_URL = "https://raw.githubusercontent.com/RomualdRousseau/Any2Json-Models/main"
MODEL_NAME = "sales-english"
FILE_PATH = "data/AG120-N-074.pdf"
FILE_ENCODING = "UTF-8"


builder = ModelBuilder().fromURI("{0}/{1}/{1}.json".format(REPO_BASE_URL, MODEL_NAME))
parser = LayexTableParser(
    [""], ["((vv$)(v+$v+$))(()(.+$)())+()", "(()(.+$))(()(.+$)())+()"]
)
model = (
    builder.setTableParser(parser)
    .build()
)

def visitTable(parent: TableGraph):
    for c in parent.children():
        table = c.getTable()
        if isinstance(table, DataTable):
            for header in table.headers():
                print(header.getName(), end=" ")
            print()
            for row in table.rows():
                for cell in row.cells():
                    print(cell.getValue(), end=" ")
                print()
        if len(c.children()) > 0:
            visitTable(c)

with DocumentFactory.createInstance(FILE_PATH, FILE_ENCODING) as doc:
    doc.setModel(model)
    doc.setHints([INTELLI_LAYOUT])
    for sheet in doc.sheets():
        root = sheet.getTableGraph()
        if root.isPresent():
            visitTable(root.get())
5 changes: 2 additions & 3 deletions mkdocs.yml
@@ -1,5 +1,4 @@
site_name: Any2Json Documents
site_name: PyAny2Json Documents
nav:
- Home: index.md
- How it works: how_it_works.md
- White Papers: white_papers.md
