This project uses deep learning models for plagiarism detection and document summarization.
- Clone the project
git clone https://github.com/abdoulfataoh/doc-summary-and-plagiarism-detection.git
cd doc-summary-and-plagiarism-detection
- Install Poetry to manage the virtual environment
sudo apt-get update
sudo apt-get install curl
curl -sSL https://install.python-poetry.org | python3 -
- Install dependencies with Poetry and activate the virtual environment
poetry install
poetry shell
- Install spaCy and NLTK language models
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm
python -c "import nltk;nltk.download('punkt')"
python -c "import nltk;nltk.download('stopwords')"
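As a quick sanity check after the installation steps, the following sketch reports which of the expected Python packages cannot be found in the active environment. It relies on the fact that spaCy models downloaded with `spacy download` are installed as regular packages. The helper name `missing_modules` and the list `REQUIRED_MODULES` are illustrative and not part of the project. The NLTK resources (`punkt`, `stopwords`) are data files rather than packages, so they are not covered by this check.

```python
import importlib.util

# Importable names of the packages and models installed in the steps above.
REQUIRED_MODULES = ["spacy", "nltk", "en_core_web_sm", "fr_core_news_sm"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages or models:", ", ".join(missing))
    else:
        print("All required packages and models were found.")
```

Run it inside the Poetry shell so that the lookup happens in the project's virtual environment.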
- (Optional) Enable the test configuration, then run the linter and the tests
echo -n 'TEST=True' > .env
make flake8
make test
- Configuration variables
The system is configured through environment variables, which can be set with the `export` command.
The complete list of settings can be found in `app/settings.py`.
For example:
export OPENAI_API_KEY='API KEY' # Your OpenAI API key, used to interact with the ChatGPT model
export TEST=True # Enables test mode
export WORKDIR='path/to/workdir' # Default value is 'static'
export PLAGIARISM_TRAIN_DATASET_FOLDER='path/to/dataset' # Path to the folder containing the PDF dataset
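To illustrate how variables like these are typically consumed, here is a minimal sketch of reading them from the environment with fallbacks. It is not the actual content of `app/settings.py`: the function `load_settings`, the set `TRUTHY`, and the rule for parsing `TEST` are assumptions made for the example. The `'static'` default comes from the comment above, while using `assets/dataset/plagiarism/train/` as the dataset default is also an assumption.

```python
import os
from typing import Mapping, Optional

# Strings treated as "true" when parsing TEST (an assumption for this sketch).
TRUTHY = {"1", "true", "yes", "on"}


def load_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Read the configuration variables, falling back to defaults when unset."""
    env = os.environ if environ is None else environ
    return {
        "OPENAI_API_KEY": env.get("OPENAI_API_KEY"),  # None when not set
        "TEST": env.get("TEST", "False").strip().lower() in TRUTHY,
        "WORKDIR": env.get("WORKDIR", "static"),
        # Assumed default, taken from the training step of this README.
        "PLAGIARISM_TRAIN_DATASET_FOLDER": env.get(
            "PLAGIARISM_TRAIN_DATASET_FOLDER", "assets/dataset/plagiarism/train/"
        ),
    }


if __name__ == "__main__":
    settings = load_settings()
    # Avoid printing the API key itself.
    print({k: v for k, v in settings.items() if k != "OPENAI_API_KEY"})
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to try out without changing the shell environment.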
- (If test mode was enabled) Disable it by removing the .env file
rm .env
- Train models
The training dataset must consist of PDF files stored in `assets/dataset/plagiarism/train/`.
make train
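Before launching training, it can help to confirm that the dataset folder actually contains PDF files. The sketch below is a standalone check, not part of the project; the helper name `list_pdfs` is hypothetical, and it only looks at file extensions in the top level of the folder, without validating the file contents.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    pdfs = list_pdfs("assets/dataset/plagiarism/train/")
    print(f"Found {len(pdfs)} PDF file(s) in the training folder.")
```

If the count is zero, check the folder path (or the `PLAGIARISM_TRAIN_DATASET_FOLDER` variable) before running the training step.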
- Create embeddings
make embeddings
- Run the Streamlit server to use the app
make streamlit-server