PDF Field Extraction with Transformers

Matthieu Hanania

This project uses Transformer models to automatically identify fields in PDF files, such as titles, names, or dates. It combines PDF text preprocessing with an NLP model to extract information from complex documents.

Features

Text extraction from PDF files.
Generation of sample sentences.
Configurable model to adapt to various document types.

Requirements and Installation

This project mainly needs the following requirements:

Python 3.8 or higher
The following libraries (installable via pip):
- transformers
- torch
- PyPDF2
- pandas
- numpy

Install the dependencies with:

pip install -r requirements.txt

Project Structure

main.ipynb: The main notebook that handles field identification and classification using Transformers.
jsonGenerer.ipynb: Generates JSON outputs for the extracted information. These outputs are used to train the model.
Pdf2Text.py: A module to convert PDF files into plain text.

Explanation of Code Files

`Pdf2Text.py`

This script converts PDF files into plain text using the PyPDF2 library. It reads the PDF, extracts the text from each page, and combines the content into a single string.
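A minimal sketch of this conversion, assuming a recent PyPDF2 release that exposes `PdfReader`. The function names `join_pages` and `pdf_to_text` and the file name `contract.pdf` are illustrative and may not match what `Pdf2Text.py` actually uses:

```python
def join_pages(page_texts):
    """Combine per-page texts into one string, skipping empty pages."""
    return "\n".join(t.strip() for t in page_texts if t and t.strip())


def pdf_to_text(path):
    """Extract the text of every page of a PDF and return a single string."""
    # Imported here so join_pages can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() may return an empty string for scanned or image-only pages.
    return join_pages(page.extract_text() for page in reader.pages)


# Hypothetical usage:
# text = pdf_to_text("contract.pdf")
# print(text[:500])
```

Pages that yield no text (for example scanned images) are dropped here; handling them would require OCR, which this sketch does not cover.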

`main.ipynb`

This notebook is the core of the project. It:

Loads and preprocesses text.
Uses CamemBERT, a Transformer model pre-trained on French text and available through the Hugging Face transformers library, to analyze and classify key fields within the text.

About Transformers:

Transformers are state-of-the-art models for natural language processing (NLP). They use attention mechanisms to understand the context and relationships between words in a sentence. In this project, the pre-trained model is fine-tuned or adapted to identify specific fields in the text, such as titles, dates, and names.
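To make the attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation inside a Transformer layer. It is illustrative only: the project relies on the Hugging Face implementation, and real models add learned projections, multiple heads, and tensor libraries.

```python
import math


def softmax(values):
    """Turn raw scores into weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

Each output vector mixes information from all positions, weighted by how well the query matches each key. This is how a token's representation comes to depend on its context.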

How Transformers Work in the Code:

Input: The text is tokenized into smaller units (tokens) using a tokenizer provided by the Transformer model.
Model Application: The tokenized input is passed through the Transformer model to extract features and context-aware representations of the text.
Classification: A classification head (e.g., a feedforward neural network) is used on top of the Transformer to categorize text segments into predefined field types.
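The three steps above can be sketched with the Hugging Face `Auto*` classes. This is a hedged illustration, not the notebook's actual code: the label set `FIELD_LABELS`, the function names, and the choice of the `camembert-base` checkpoint are assumptions. The classification head loaded this way is randomly initialized until the model is fine-tuned, so predictions are only meaningful after training.

```python
# Hypothetical label set; the notebook may use different field types.
FIELD_LABELS = ["title", "name", "date", "other"]


def best_label(scores, labels=FIELD_LABELS):
    """Return the label with the highest score."""
    index = max(range(len(scores)), key=lambda i: scores[i])
    return labels[index]


def classify_segment(text, model_name="camembert-base"):
    """Tokenize a text segment, run the model, and pick a field type."""
    # Imported lazily because torch and transformers are heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Step 1: tokenize the input text.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    # Steps 2 and 3: encoder plus a classification head on top.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(FIELD_LABELS)
    )
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_labels)

    return best_label(logits[0].tolist())


# Hypothetical usage (downloads the checkpoint on first call):
# print(classify_segment("Fait à Paris, le 12 mars 2021"))
```

In practice the tokenizer and model would be loaded once and reused across segments; they are loaded inside the function here only to keep the sketch short.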

`jsonGenerer.ipynb`

This notebook takes the extracted and classified information and structures it into a JSON format for easy integration with other applications. It ensures that the data is organized and accessible.
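As an illustration of the kind of structure this could produce, here is a minimal sketch using the standard `json` module. The schema (`source`, `fields`, `type`, `value`) and the function name `fields_to_json` are hypothetical; the notebook's actual output format may differ.

```python
import json


def fields_to_json(source_file, fields):
    """Serialize classified fields as a JSON string.

    `fields` is a list of (field_type, text) pairs.
    """
    record = {
        "source": source_file,
        "fields": [{"type": kind, "value": value} for kind, value in fields],
    }
    # ensure_ascii=False keeps accented French characters readable.
    return json.dumps(record, ensure_ascii=False, indent=2)


# Hypothetical usage:
# print(fields_to_json("contrat.pdf", [("name", "Hélène Dupont")]))
```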

Created with ❤️ by Matthieu Hanania.

MatthieuHanania/AI-transformers-for-contract-reading