Skip to content

Commit

Permalink
First public version
Browse files Browse the repository at this point in the history
  • Loading branch information
ivbeg committed Feb 7, 2022
1 parent 6062cfa commit 49cfd5d
Show file tree
Hide file tree
Showing 30 changed files with 2,896 additions and 2 deletions.
8 changes: 8 additions & 0 deletions .idea/.gitignore

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

23 changes: 23 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

16 changes: 16 additions & 0 deletions .idea/metacrafter.iml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 4 additions & 0 deletions .idea/misc.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 10 additions & 0 deletions .idea/modules.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

7 changes: 7 additions & 0 deletions .idea/vcs.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 4 additions & 0 deletions .metacrafter
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
rulepath:
- rules
# - ../metacrafter-rules/rules

153 changes: 151 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,2 +1,151 @@
# metacrafter
Metadata and data identification tool and Python library. Identifies PII, common identifiers, language specific identifiers. Full customizable and flexible rules
# Metacrafter

Python command line tool and python engine to label table fields and fields in data files.
It could help to find meaningful data in your tables and data files or to find Personal identifable information (PII).

## Installation

To install Python library use `pip install metacrafter` via pip or `python setup.py install`

## Features

Metacrafter is a rule based tool that helps to label fields of the tables in databases. It scans table and finds person names, surnames, midnames, PII data, basic identifiers like UUID/GUID.
These rules written as .yaml files and could be easily extended.

File formats supported:
* CSV
* JSON lines
* JSON (array of dicts)
* BSON
* Parquet

Databases support:
* Any database supported by [SQLAlchemy](https://www.sqlalchemy.org/)
* MongoDB

Metacrafter key features:
* 25 basic and PII rules.
* all labels metadata collected into [Metacrafter registry](https://github.com/apicrafter/metacrafter-registry ) public repository -
* 312 date detection rules/patterns, date detection using [qddate](https://github.com/ivbeg/qddate), "quick and dirty" date detection library
* extendable set of rules using PyParsing, exact text match and validation functions
* support any database supported by SQLAlchemy
* advanced context and language management. You could apply only rules relevant to certain data of choosen language
* built-in API server
* commercial support and additional rules available


## Command line examples

### File analysis examples

# Scan CSV file
$ metacrafter scan-file --format short somefile.csv

# Scan CSV file with delimiter ';' and windows-1251 encoding
$ metacrafter scan-file --format short --encoding windows-1251 --delimiter ';' somefile.csv

# Scan JSON lines file, output results as stats table to file file
$ metacrafter scan-file --format stats -o somefile_result.json somefile.jsonl


Result example of 'full' type of formatting
```
key ftype tags matches
-------------------- ------- ------ -------------------------------
name str uniq
addressresidence str uniq address 59.80
addressactivities str address 50.98
addressobjects str address 28.00
bin int ogrn 99.02
inn str inn 100.00,inn 99.02
purposeaudit str runpa 8.82
dateregistration str datetime 94.12 (dt:date:date_2)
expirydate str datetime 18.63 (dt:date:date_2)
startdateactivity str datetime 28.43 (dt:date:date_2)
othergrounds str dict
startdateaudit str datetime 65.69 (dt:date:date_2)
workdays int dict
workhours str dict
formaudit str dict
namestatecontrol str
assignment decree str dict
effectivedate str dict
Inspectionenddate str empty
riskcategory str dict
expirationdate str empty
startupnotifications str empty
daylastcheck str empty
otherreasonsrefusal str empty
numbersystem str empty
```


### Database analysis examples

# Scan MongoDB database 'fns', save results as result.json and format output as 'stats'
$ metacrafter scan-mongodb --dbname fns -o result.json -f full

# Scan Postgres database 'dbname', with schema 'public'.
$ metacrafter scan-db --schema public --connstr postgresql+psycopg2://username:password@127.0.0.1:15432/dbname



# Rules

All rules described as YAML files and by default rules loaded from directory 'rules' or from list of directories provided in .metacrafter file with YAML format

All rules could be applied to **fields** or **data** .

Compare engines defined in **match** parameter in rule description:
* text - scan text for exact match to one of text values. Text values delimited by comma (',')
* ppr - scan text for PyParsing. PyParsing rule defined as Python code with PyParsing objects like Word(nums, exact=4)
* func - scan text using Python function provided. Function shoud accept one string parameter and shoud return True or False

## How to write rules

### Function (func)

Example Russian administrative legal act/law matched by custom function
```
runpabyfunc:
key: runpa
name: Russian legal act / law
maxlen: 500
minlen: 3
priority: 1
match: func
type: data
rule: metacrafter.rules.ru.gov.is_ru_law
```

### Exact text match (text)

Example midname matching by exact field name
```
midname:
key: person_midname
name: Person midname by known
rule: midname,secondname,middlename,mid_name,middle_name
type: field
match: text
```
### PyParsing rule (ppr)

Example Russian cadastral number
```
rukadastr:
key: rukadastr
name: Russian land territory cadastral identifier
rule: Word(nums, min=1, max=2) + Literal(':').suppress() + Word(nums, min=1, max=2) + Literal(':').suppress() + Word(nums, min=6, max=7) + Literal(':').suppress() + Word(nums, min=1, max=6)
maxlen: 20
minlen: 12
priority: 1
match: ppr
type: data
```

## Commercial support

Please write apicrafter@apicrafter.ru or ivan@begtin.tech to request beta access to commercial API.
Commercial API support 195 fields and data rules and provided with dedicated support.
Empty file added metacrafter/__init__.py
Empty file.
18 changes: 18 additions & 0 deletions metacrafter/__main__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#!/usr/bin/env python
"""The main entry point. Invoke as `metacrafter' or `python -m metacrafter`.
"""
import sys


def main():
try:
from .core import cli
exit_status = cli()
except KeyboardInterrupt:
print("Ctrl-C pressed. Aborting")
sys.exit(0)


if __name__ == '__main__':
main()
4 changes: 4 additions & 0 deletions metacrafter/classify/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
__author__ = "Ivan Begtin"
__version__ = '0.0.1'
__license__ = "Apache"
__doc__ = "Metacrafter metadata classification tool"
Loading

0 comments on commit 49cfd5d

Please sign in to comment.