Chen Zhang (cz1389) & Guang Yang (gy552)
This project was originally developed as a response to the term project of Spring 2017 CSCI-GA.2580-001 Web Search Engines at New York University, taught by Professor Ernest Davis. If you want to use this software for academic purposes, i.e. assignments, please refer to ACADEMIC INTEGRITY.
- python 2.7
- django
- numpy
- sklearn
- bs4
- lxml
We suggest running it on
ssh linserv1.cims.nyu.edu
You need to make sure that python 2.7 is invoked, so you will have to execute
module load python-2.7
Otherwise python 2.6 will be invoked by default.
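Optionally, once the module is loaded you can sanity-check from a python prompt that the right interpreter is active and that the required packages are importable:
import sys
print(sys.version)  # should report 2.7.x after the module load
import django, numpy, sklearn, bs4, lxml  # the packages listed above should all import cleanly
print(django.get_version())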
Pick a port number between 10000 and 25000, then cd to the django directory and execute
python manage.py runserver 0.0.0.0:your-port-number
and the webpage should be available at
linserv1.cims.nyu.edu:your-port-number/wikiNet/
We did most of the computation offline, and the website and its components are stored entirely in memory; no database system is used. Loading all of the data can take a REALLY LONG TIME, so we provide a smaller yet fully functional dataset with the submission. The smaller dataset was crawled starting from /wiki/Apple_Inc., contains 456 documents, and has a maximum depth of 2.
An instance built on a much larger dataset (also starting from /wiki/Apple_Inc., but with a maximum depth of 4 and containing 6050 documents) is running at
linserv1.cims.nyu.edu:13890/wikiNet/
python hcrawler.py <relative address> <maximum depth>
For example, if your seed is https://en.wikipedia.org/wiki/Apple, then you should pass /wiki/Apple as the relative address.
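For instance, to reproduce the smaller dataset shipped with the submission (seed /wiki/Apple_Inc., maximum depth 2), the invocation would presumably be
python hcrawler.py /wiki/Apple_Inc. 2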
The crawled pages will be stored in a sibling directory named after the seed (here, Apple), and the crawl index will be exported as a sibling file named Apple.stats.
Each line of Apple.stats is formatted as
"%d %s %s %d" % (index, relative url, parent's relative url, depth)
- Duplication prevention is not implemented.
- Some seemingly important pages may be missed.
python texify.py <path to the directory where downloaded pages are stored> <name of that directory>
The first argument should be a path containing both the directory of downloaded pages and the .stats file.
The content of the downloaded pages will be extracted and stored in a sibling directory named content, and a python pickle file urlgi.pkl will be generated, storing the graph.
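If you want to inspect this intermediate output, the pickle can be loaded back in the usual way (the exact structure of the stored graph is an implementation detail of texify.py, so this is only a sketch):
import cPickle as pkl
# load the pickled link graph produced by texify.py
with open('urlgi.pkl') as f:
    graph = pkl.load(f)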
python knowledgeGraph.py <path to the content directory and urlgi.pkl>
A class.pkl file will be generated under the path.
This class encapsulates all the data and methods that make up WikiNet.
You should move class.pkl into the django WikiNet directory.
There may be some module name inconsistencies. In that case you will need to manually load and re-dump the class.pkl file so that the django view can load the data.
Just invoke python 2.7 under the WikiNet directory
module load python-2.7
python
to enter the interactive interface. Then
import cPickle as pkl
from knowledgeGraph import knowledgeGraph

with open('class.pkl') as f:
    G = pkl.load(f)  # this will take a while
with open('class.pkl', 'w') as f:
    pkl.dump(G, f)  # this will also take a while
Then the module name problem should be solved.