nagyantal9312/python_anonymizer

Python anonymizer

The aim of this project

The aim of this project is to develop an application that detects personal data and then anonymises and pseudonymises it.

Technologies

The project is written in Python 3.8. The dependencies can be found in requirements.txt.

The data used

How this program works

The personal data detection functions are in detection.py. When the program detects personal data in a column, it writes the column name to labels.csv together with a label for the type of data found, so that the results can be reused later. The pseudonymisation.py module performs the pseudonymisation and the anonymisation.py module performs the anonymisation. The functions in datamanager.py simplify working with datasets. The analysis.py module reports how unique the data is in each column and in combinations of columns. The auto_anon_and_pseud function in main.py automates the aforementioned tasks.
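The uniqueness report can be pictured with a small sketch. This is not the code in analysis.py: the function name uniqueness_ratio and the sample rows are hypothetical, and it assumes the dataset is a list of dictionaries. It computes the share of rows whose combination of values in the chosen columns occurs exactly once.

```python
from collections import Counter


def uniqueness_ratio(rows, columns):
    """Return the fraction of rows whose values in `columns` occur exactly once.

    Hypothetical helper for illustration; not taken from analysis.py.
    """
    if not rows:
        return 0.0
    # Build one key per row from the selected columns.
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    unique_rows = sum(1 for key in keys if counts[key] == 1)
    return unique_rows / len(rows)


# Made-up sample data.
rows = [
    {"age": 34, "city": "Szeged"},
    {"age": 34, "city": "Debrecen"},
    {"age": 51, "city": "Szeged"},
    {"age": 51, "city": "Szeged"},
]

print(uniqueness_ratio(rows, ["age"]))          # single column
print(uniqueness_ratio(rows, ["age", "city"]))  # combination of columns
```

A column that is not unique on its own can still identify people when combined with others, which is why the combination matters.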

The result (the anonymised dataset) is written to data/output/outputtest.csv.

The program includes a datacrawler package, which can be used for crawling data from websites (currently only koronavirus.gov.hu).

The list of recognised personal data

  • Hungarian licence plates
  • English disease names
  • Hungarian disease names
  • Hungarian tax number
  • Hungarian TAJ number
  • Hungarian personal number
  • Hungarian first names
  • Hungarian phone number
  • MAC address
  • IP address
  • email address
  • country codes and country names in English
  • human age
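As an illustration of how a column could be labelled for three of the types listed above (email address, MAC address, IP address), here is a minimal sketch. It is not the logic in detection.py: the names CHECKS and label_column, the simplified regular expressions, and the 0.9 threshold are all assumptions made for this example.

```python
import ipaddress
import re

# Deliberately simple patterns; real-world validation is stricter.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}")


def is_ip(value):
    """Return True if `value` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


# Hypothetical label names mapped to their checks.
CHECKS = {
    "email": lambda v: EMAIL_RE.fullmatch(v) is not None,
    "mac_address": lambda v: MAC_RE.fullmatch(v) is not None,
    "ip_address": is_ip,
}


def label_column(values, threshold=0.9):
    """Return the first label matched by at least `threshold` of the non-empty values."""
    cleaned = [str(v).strip() for v in values if str(v).strip()]
    if not cleaned:
        return None
    for label, check in CHECKS.items():
        hits = sum(1 for v in cleaned if check(v))
        if hits / len(cleaned) >= threshold:
            return label
    return None


print(label_column(["anna@example.hu", "bela@example.com"]))
print(label_column(["00:1A:2B:3C:4D:5E"]))
print(label_column(["192.168.0.1", "10.0.0.7"]))
```

Using a threshold rather than requiring every value to match tolerates a few dirty cells in an otherwise homogeneous column.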

Pseudonymisation techniques

  • number to interval
  • country name to region
  • splitting an email address into three parts and hashing them
  • text to number

Anonymisation techniques

K-anonymity, L-diversity, T-closeness. The implementation is based on https://github.com/Nuclearstar/K-Anonymity.
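To make the first two criteria concrete, the sketch below measures them on an already generalised dataset. It is neither this project's code nor that of the linked repository: the function names and the sample rows are hypothetical. A dataset is k-anonymous if every combination of quasi-identifier values is shared by at least k rows, and l-diverse if each such group contains at least l distinct sensitive values.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_identifiers)].append(row)
    return classes


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest equivalence class (0 for an empty dataset)."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min((len(group) for group in classes.values()), default=0)


def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any class."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(
        (len({row[sensitive] for row in group}) for group in classes.values()),
        default=0,
    )


# Made-up sample data with generalised age and region columns.
rows = [
    {"age": "[30, 40)", "region": "Europe", "disease": "flu"},
    {"age": "[30, 40)", "region": "Europe", "disease": "asthma"},
    {"age": "[40, 50)", "region": "Europe", "disease": "flu"},
    {"age": "[40, 50)", "region": "Europe", "disease": "flu"},
]

print(k_anonymity(rows, ["age", "region"]))             # 2
print(l_diversity(rows, ["age", "region"], "disease"))  # 1
```

In this sample the data is 2-anonymous but only 1-diverse, because everyone in the second group has the same disease. L-diversity is meant to close exactly this kind of gap.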