forge

Email dataset for email signature parsing

Rationale

When you work on a machine learning algorithm, two things matter most: the algorithm itself and the dataset used to train and test it.

Ever since we open-sourced our signature parsing library, talon, we have been looking for a way to make the dataset publicly available so that developers can contribute to the core functionality more easily.

Another objective is to spark collaboration on the dataset itself to make it more representative and up to date.

Emails Source

The core of the dataset is a subset of Enron data, cleansed of private, health and financial information by EDRM.

Feel free to add your own emails, but be careful not to include any sensitive information. Extend / use the dataset at your own risk.

Format

Emails with signatures are in the dataset/P folder (P stands for positive). Emails without signatures are in the dataset/N folder (N stands for negative).

Each email is represented by two files: xyz_body has the email text, and xyz_sender has the sender name / address.
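The layout above can be walked with a few lines of Python. This is only a sketch: the function name iter_emails is hypothetical, and it assumes that every body file name ends in _body, that the matching sender file shares the same prefix, and that the files can be read as UTF-8 (undecodable bytes are replaced).

```python
from pathlib import Path


def iter_emails(dataset_dir):
    """Yield (has_signature, body, sender) for each email under dataset_dir.

    Hypothetical helper: assumes the P/N folders and the *_body / *_sender
    file naming described in this README.
    """
    for folder, has_signature in (("P", True), ("N", False)):
        for body_path in sorted(Path(dataset_dir, folder).glob("*_body")):
            # Assumption: the sender file shares the body file's prefix.
            prefix = body_path.name[: -len("_body")]
            sender_path = body_path.with_name(prefix + "_sender")

            body = body_path.read_text(encoding="utf-8", errors="replace")
            sender = ""
            if sender_path.exists():
                sender = sender_path.read_text(
                    encoding="utf-8", errors="replace"
                ).strip()
            yield has_signature, body, sender
```

Calling iter_emails("dataset") from the repository root would then give labeled examples that can be passed to a training or evaluation script.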

Signature lines are annotated with #sig#. For example:

Hi John,

Can we have a meeting tomorrow at 11 AM CST?

#sig#--
#sig#Mike Smith
#sig#555-243-0623
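
To use the annotation, the marked lines have to be separated from the rest of the message. A minimal sketch, assuming the #sig# marker always appears at the very start of a signature line; split_signature is a hypothetical helper and not part of talon:

```python
SIG_MARKER = "#sig#"


def split_signature(body):
    """Split an annotated body into (content_lines, signature_lines).

    Assumption: a line belongs to the signature only if it starts
    with the #sig# marker; the marker itself is stripped.
    """
    content, signature = [], []
    for line in body.splitlines():
        if line.startswith(SIG_MARKER):
            signature.append(line[len(SIG_MARKER):])
        else:
            content.append(line)
    return content, signature


example = """Hi John,

Can we have a meeting tomorrow at 11 AM CST?

#sig#--
#sig#Mike Smith
#sig#555-243-0623"""

content, signature = split_signature(example)
print(signature)  # ['--', 'Mike Smith', '555-243-0623']
```

Each line's label (signature or not) can be derived the same way when per-line training targets are needed.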
