title | emoji | colorFrom | colorTo | sdk | sdk_version | app_file | pinned |
---|---|---|---|---|---|---|---|
Privacy Policy Analyzer | 👁️🗨️ | yellow | gray | gradio | 4.8.0 | app.py | false |
Have you ever tried to read a privacy policy? In a world inundated with digital agreements and privacy statements, understanding the fine print is often a daunting task. Privacy Policy Analyzer (PPA) aims to be your ally here, deciphering the intricacies of these policies so that you can make informed decisions.
Imagine a tool that swiftly dissects the language of privacy policies, unraveling their content to categorize them simply as 'good' or 'bad.' PPA does just that by harnessing the capabilities of data extraction, text processing, and a unique Doc2Vec-based classification system.
The project began in earnest when I came across the Princeton-Leuven Longitudinal Corpus of Privacy Policies, which provides a substantial database of around 100K unique privacy policies. These policies were gathered, processed, and embedded into a vector space using Doc2Vec.
To assign labels to the policies, I sourced data from ToS;DR, which provides a community-driven database of combined ratings for privacy policies and terms of service. I extracted domain names from the policy file paths and used the ToS;DR search API to assign letter scores ('A' to 'E') to the policies that had been rated. To simplify the ratings, 'A' and 'B' were labeled 'good', while the rest ('C', 'D', and 'E') were labeled 'bad'.
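The grade simplification described above can be sketched as a small mapping function. This is only an illustration: the name `grade_to_label` and the handling of missing or unrecognized grades (returned as `None`, meaning "unlabeled") are assumptions, not the project's actual code.

```python
from typing import Optional

# Grade groups as described in the text: A/B are 'good', C/D/E are 'bad'.
GOOD_GRADES = {"A", "B"}
BAD_GRADES = {"C", "D", "E"}


def grade_to_label(grade: Optional[str]) -> Optional[str]:
    """Map a ToS;DR letter grade to a binary label.

    Returns None for unrated or unrecognized grades (an assumed convention),
    so that such policies simply stay unlabeled.
    """
    if grade is None:
        return None
    grade = grade.strip().upper()
    if grade in GOOD_GRADES:
        return "good"
    if grade in BAD_GRADES:
        return "bad"
    return None
```

For example, `grade_to_label("B")` gives `'good'` and `grade_to_label("D")` gives `'bad'`.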
The D2VClassifier employs a hybrid unsupervised-learning/supervised-inference approach based on Doc2Vec. First, the Doc2Vec model is trained on the entire corpus. For the training-set policies that do possess labels, the classifier computes a mean vector for each class ('good' and 'bad'). When presented with a new document, the Doc2Vec model infers a vector representation, and the classifier calculates its similarity to the mean vectors. The resulting score reflects the closeness of the document to the learned 'good' and 'bad' representations.
The classification decision is determined by thresholding this similarity score. If the similarity score exceeds a predefined threshold, the document is classified as 'good'; otherwise, it's classified as 'bad'. This methodology enables the model to infer the quality of policies based on their similarity to the limited labeled data available, allowing a binary classification output.
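A minimal sketch of this mean-vector scoring and thresholding follows. It works on plain lists of floats standing in for inferred document vectors. In the real pipeline these would come from the trained Doc2Vec model, for example via gensim's `infer_vector`. The function names, the use of cosine similarity, and the rule for combining the two similarities into one score (similarity to 'good' minus similarity to 'bad') are all assumptions made for illustration. The source does not specify them.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_class_means(vectors: List[Vector], labels: List[str]) -> Dict[str, List[float]]:
    """Compute one mean vector per class from the labeled documents."""
    by_label: Dict[str, List[Vector]] = {}
    for vec, label in zip(vectors, labels):
        by_label.setdefault(label, []).append(vec)
    return {label: mean_vector(vecs) for label, vecs in by_label.items()}


def score(vec: Vector, means: Dict[str, List[float]]) -> float:
    """Assumed scoring rule: similarity to 'good' minus similarity to 'bad'."""
    return cosine(vec, means["good"]) - cosine(vec, means["bad"])


def classify(vec: Vector, means: Dict[str, List[float]], threshold: float = 0.0) -> str:
    """Label a document 'good' if its score exceeds the threshold, else 'bad'."""
    return "good" if score(vec, means) > threshold else "bad"
```

With this scoring rule a threshold of 0.0 means "closer to the 'good' mean than to the 'bad' mean". In practice the threshold would be tuned on held-out labeled data.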
To differentiate between privacy policies and other types of documents, a few heuristics are employed, including the distribution of common words, document length, and the token filtering ratio against the training corpus dictionary.
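To make these heuristics concrete, here is a rough sketch of how such checks could be combined. All function names, the specific way "distribution of common words" is measured (here, simple coverage of an expected word set), and every threshold value are hypothetical. The source does not state how the actual checks are implemented.

```python
from typing import List, Set


def kept_token_ratio(tokens: List[str], vocabulary: Set[str]) -> float:
    """Fraction of tokens that survive filtering against the training vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in vocabulary) / len(tokens)


def common_word_coverage(tokens: List[str], common_words: Set[str]) -> float:
    """Fraction of the expected common words that appear in the document."""
    if not common_words:
        return 0.0
    present = set(tokens)
    return sum(1 for w in common_words if w in present) / len(common_words)


def looks_like_policy(
    tokens: List[str],
    vocabulary: Set[str],
    common_words: Set[str],
    min_tokens: int = 200,        # hypothetical minimum document length
    min_kept_ratio: float = 0.8,  # hypothetical minimum share of in-vocabulary tokens
    min_coverage: float = 0.5,    # hypothetical minimum common-word coverage
) -> bool:
    """Return True only if the document passes all three heuristic checks."""
    return (
        len(tokens) >= min_tokens
        and kept_token_ratio(tokens, vocabulary) >= min_kept_ratio
        and common_word_coverage(tokens, common_words) >= min_coverage
    )
```

A document that fails any one check is treated as "probably not a privacy policy", which lets the application decline to score unrelated pages.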
- CorpusProcessor: Handles text processing, tokenization, and indexing for the Doc2Vec model.
- SampleGenerator: Assists in handling the training and testing data, ensuring balanced representation for model training while keeping only one document in RAM at a time.
- D2VClassifier: Integrates with scikit-learn for hyperparameter tuning and pipeline connectivity, utilizing specialized text corpus retrieval for Doc2Vec.
- IndexedFile: Facilitates working with on-disk data, namely by enabling external shuffling via indexing the file start positions of data samples. Used with the above classes.
- CLI Script: Includes a CLI script named `ppa_cli.py` for running the trained model to classify content fetched from URLs. The script should be placed within the same directory as the `ppa` package.
- Deployed App: The model is deployed using Gradio via the `app.py` file, offering a user-friendly interface for policy analysis.
- See `requirements.txt` for necessary dependencies.
- Refer to the Thinking Process.py Jupyter notebook for a detailed walkthrough of the project's development stages.
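The external shuffling attributed to IndexedFile in the component list above can be illustrated with a short sketch. It assumes one data sample per line in a UTF-8 text file. The function names `index_line_offsets` and `iter_shuffled` are hypothetical and do not reflect the actual IndexedFile interface.

```python
import random
from typing import Iterator, List, Optional


def index_line_offsets(path: str) -> List[int]:
    """Record the byte offset at which each line (sample) starts."""
    offsets: List[int] = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:  # end of file
                break
            offsets.append(pos)
    return offsets


def iter_shuffled(path: str, offsets: List[int], seed: Optional[int] = None) -> Iterator[str]:
    """Yield samples in random order by seeking to shuffled start positions.

    Only the list of offsets is shuffled in memory; each sample is read
    from disk on demand, so one sample at a time is held in RAM.
    """
    order = list(offsets)
    random.Random(seed).shuffle(order)
    with open(path, "rb") as f:
        for pos in order:
            f.seek(pos)
            yield f.readline().decode("utf-8").rstrip("\n")
```

The index costs one integer per sample, so even a corpus far larger than available memory can be reshuffled between training epochs.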
The trained model can classify the content fetched from a provided URL using the CLI script. To use the script:
- Place the `ppa_cli.py` script file within the same directory as the `ppa` package.
- Ensure dependencies from `requirements.txt` are installed.
- Run the CLI script by executing the command `python ppa_cli.py <URL>`, replacing `<URL>` with the URL you want to classify.
The CLI script fetches the document text from the provided URL using trafilatura, processes the document using a trained CorpusProcessor, and then classifies it using a loaded D2VClassifier model. The classification result (label) and score are printed to the terminal.
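The flow described above (fetch, process, classify, print) might look roughly like the following sketch. It is not the contents of `ppa_cli.py`. The `tokenize` and `score` callables are hypothetical stand-ins for the trained CorpusProcessor and D2VClassifier, whose real interfaces are not shown here, and the threshold default is an assumption. The trafilatura calls used (`fetch_url` and `extract`) are that library's standard download and text-extraction functions.

```python
from typing import Callable, List, Optional, Tuple


def fetch_text(url: str) -> Optional[str]:
    """Download a page and extract its main text using trafilatura."""
    import trafilatura  # imported lazily: only needed when actually fetching

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)


def classify_url(
    url: str,
    tokenize: Callable[[str], List[str]],   # stand-in for the CorpusProcessor
    score: Callable[[List[str]], float],    # stand-in for the D2VClassifier
    threshold: float = 0.0,                 # assumed decision threshold
    fetch: Callable[[str], Optional[str]] = fetch_text,
) -> Tuple[str, float]:
    """Fetch a document, process it, and return its (label, score)."""
    text = fetch(url)
    if not text:
        raise ValueError(f"No text could be extracted from {url}")
    tokens = tokenize(text)
    value = score(tokens)
    label = "good" if value > threshold else "bad"
    return label, value
```

Passing the fetcher as a parameter keeps the classification step testable without network access. A command-line wrapper would read the URL from `sys.argv`, call `classify_url`, and print the returned label and score.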
The project is continuously evolving, with several improvements and enhancements in progress.
Efforts are underway to transition towards downstream supervised classification, contingent on acquiring a more extensive labeled dataset.
I'm actively acquiring significantly more data by scraping privacy policies using the current Tranco list, a regularly updated aggregation of the top 1M visited domains from four popular lists. Simultaneously, I'm collecting potential labels for additional policies from ToS;DR.
Another avenue being explored involves pseudo-labeling unlabeled training data with high/low thresholds for prediction. This strategy aims to enhance scoring robustness by averaging model summary vectors, especially the 'good' ones, across a broader sample set. This could refine tuning based on scores and lay the groundwork for supervised classification.
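As a rough illustration of the pseudo-labeling idea, the sketch below assigns labels only to documents whose scores fall outside a low/high band and leaves the rest unlabeled. The function names and the example threshold values are hypothetical. This is one possible reading of the strategy, not the project's implementation.

```python
from typing import List, Optional, Sequence, Tuple


def pseudo_label(score: float, low: float, high: float) -> Optional[str]:
    """Label confidently scored documents; return None for uncertain ones."""
    if low >= high:
        raise ValueError("low threshold must be below high threshold")
    if score >= high:
        return "good"
    if score <= low:
        return "bad"
    return None


def select_pseudo_labels(
    scores: Sequence[float], low: float, high: float
) -> List[Tuple[int, str]]:
    """Return (document index, pseudo-label) pairs for confident predictions."""
    selected: List[Tuple[int, str]] = []
    for index, score in enumerate(scores):
        label = pseudo_label(score, low, high)
        if label is not None:
            selected.append((index, label))
    return selected
```

The vectors of the selected documents could then be averaged together with the originally labeled ones to recompute the class mean vectors over a broader sample.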
These enhancements and strategies are designed to refine the model's accuracy, robustness, and applicability. They particularly aim to address the challenge of representing 'good' and 'bad' policies beyond crude mean single-vector representations offered by Doc2Vec. The goal is to enable a more nuanced understanding of privacy policies by leveraging more complex classification models capable of capturing the intricate features embedded within the data.
While I'm not accepting pull requests at this early stage, any feedback is welcome. You're encouraged to sign up to ToS;DR to contribute to the labeling effort.
This project is licensed under the MIT License. See LICENSE for more details.