Plagiarism-Checker

A utility to check if a document's contents are plagiarised.

How it works

It searches online using Google Search API's for some queries. Queries are n-grams extracted from the source txt file.
Resulting URL, matched contents are checked for similarity with given text query.
Result of average similarity of all URL's is stored in output text file.

Required Libraries

The project uses python-docx module to decode docx files. The python-docx module has its own set of dependent libraries. The required libraries are:

PIL
lxml
python-dateutil
python-docx

GETTING LIBRARIES ON LINUX

Get easy_install

sudo apt-get install python-setuptools

Install PIP

sudo easy_install pip

Install dependent libraries

sudo pip install PIL

sudo pip install lxml

sudo pip install python-dateutil

Install python-docx

sudo pip install docx

Install pdftotext for pdf support (sketchy at the moment)

sudo apt-get install poppler-utils

Get ppt and doc support

sudo apt-get install catdoc

GETTING LIBRARIES ON WINDOWS

These steps assume you already have python installed and that python is in your windows environment variables.

Download setup-tools according to your python version. (That is python 2.7 in most cases)

Run the .exe file. The installer will automatically find your python installation location from the registry and install easy_install to the Scripts directory where your python installation is located.

Once the installer has run, add easy_install to the windows environment variables path.

Open a command window
Run the following command:

easy_install pip

Then install the required libraries for docx support

pip install PIL

pip install lxml

pip install python-dateutil

pip install docx

EXEs for pdf, ppt and doc support are included in the package. Nothing need be installed.

Folder Structure

assets/

Holds Twitter Bootstrap CSS and Javascript files and images/glyphicons

config/

Stores configuration data (Path to Python on Windows)

scripts/

Contains python scripts to perform plagiarism checks

temp/

Contains uploaded files

Python Scripts

Backend is supported using python. There are 3 scripts in total.

scripts/main.py

Main script which gets the results of plagiarism

scripts/htmlstrip.py

Used to strip text from HTML tags

scripts/cosineSim.py

Helper modules to find cosine similarity between strings

Usage of Python Script (Standalone)

python main.py sampleText.txt sampleOut.txt

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
assets		assets
config		config
scripts		scripts
temp		temp
LICENSE.md		LICENSE.md
README.md		README.md
executeScripts.php		executeScripts.php
index.php		index.php
logs.php		logs.php
processFile.php		processFile.php
readLogs.php		readLogs.php

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Plagiarism-Checker

How it works

Required Libraries

GETTING LIBRARIES ON LINUX

GETTING LIBRARIES ON WINDOWS

Folder Structure

Python Scripts

Usage of Python Script (Standalone)

About

Releases

Packages

Contributors 3

Languages

License

architshukla/Plagiarism-Checker

Folders and files

Latest commit

History

Repository files navigation

Plagiarism-Checker

How it works

Required Libraries

GETTING LIBRARIES ON LINUX

GETTING LIBRARIES ON WINDOWS

Folder Structure

Python Scripts

Usage of Python Script (Standalone)

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages