An efficient and scalable search engine for Wikipedia pages.
The engine consists of two main stages:
- Inverted index creation
- Query search mechanism

The software is optimized for search time, search relevancy, indexing time, and index size.
- indexer.py: Uses the Wikipedia dump to generate the index files. While parsing the dump XML file with a SAX parser, it processes the content and writes inverted indexes. The files are created in blocks to satisfy memory constraints. Run command:
$ python3 indexer.py dump.xml <index_dir> <title_dir>
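The exact handler in indexer.py isn't reproduced here, but streaming the dump with Python's built-in SAX parser might look like the sketch below. The `DumpHandler` name and the `handle_page` callback are illustrative; the `<page>`/`<title>`/`<text>` layout is the standard MediaWiki dump schema.

```python
import sys
import xml.sax

class DumpHandler(xml.sax.ContentHandler):
    """Streams <page> elements out of the dump one at a time, so the
    whole XML file never has to fit in memory."""

    def __init__(self, handle_page):
        super().__init__()
        self.handle_page = handle_page      # called as handle_page(title, text)
        self.tag = ""
        self.buf = {"title": [], "text": []}

    def startElement(self, name, attrs):
        self.tag = name
        if name == "page":                  # reset buffers for a new article
            self.buf = {"title": [], "text": []}

    def characters(self, content):
        if self.tag in self.buf:            # SAX may deliver text in chunks
            self.buf[self.tag].append(content)

    def endElement(self, name):
        if name == "page":
            self.handle_page("".join(self.buf["title"]),
                             "".join(self.buf["text"]))
        self.tag = ""

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    parser.setContentHandler(DumpHandler(lambda title, text: print(title)))
    parser.parse(sys.argv[1])               # e.g. dump.xml
```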
These index files are then merged into 19382 sorted files to form the final index (a merge sketch follows the file list).
- config.py: Contains the configuration and tokens used by the other files.
- search.py: Takes in queries and runs a search for them in the index. It tokenizes each query before performing a binary search on the sorted index to get the results (see the lookup sketch below). Run command:
$ python3 search.py queries_op.txt
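Combining the per-block files into the final sorted index is essentially a k-way merge. A minimal sketch using `heapq.merge`, assuming a `term:postings` line format sorted by term (`merge_blocks` and the `;` postings separator are illustrative, and for brevity this writes a single output file rather than the 19382 files the real index is split into):

```python
import heapq
from contextlib import ExitStack

def merge_blocks(block_paths, out_path):
    """K-way merge of sorted 'term:postings' block files into one
    sorted index file, concatenating postings for repeated terms."""
    term_of = lambda line: line.partition(":")[0]
    with ExitStack() as stack:
        blocks = [stack.enter_context(open(p)) for p in block_paths]
        with open(out_path, "w") as out:
            prev, merged = None, []
            # heapq.merge streams lines in sorted order, holding only
            # one line per block in memory at a time
            for line in heapq.merge(*blocks, key=term_of):
                term, _, postings = line.rstrip("\n").partition(":")
                if term == prev:
                    merged.append(postings)   # same term in another block
                else:
                    if prev is not None:
                        out.write(f"{prev}:{';'.join(merged)}\n")
                    prev, merged = term, [postings]
            if prev is not None:
                out.write(f"{prev}:{';'.join(merged)}\n")
```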
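For the lookup itself, a sorted on-disk index file can be binary searched by seeking rather than loading it into memory. A sketch under the same assumed `term:postings` format (`get_line_at` and `lookup` are illustrative names, not search.py's actual functions):

```python
import os

def get_line_at(f, pos):
    """Return the first complete line at or after byte offset pos."""
    f.seek(pos)
    if pos:
        f.readline()                          # discard the partial current line
    return f.readline()

def lookup(path, term):
    """Binary-search a file of 'term:postings' lines sorted by term;
    returns the postings string, or None if the term is absent."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            line = get_line_at(f, mid)
            # an empty read means we seeked past the last line: go left
            if not line or line.split(b":", 1)[0].decode() >= term:
                hi = mid
            else:
                lo = mid + 1
        line = get_line_at(f, lo)
        key, _, postings = line.decode().rstrip("\n").partition(":")
        return postings if key == term else None
```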
The search engine is built by going through the following stages (a minimal preprocessing sketch follows the list):
- XML Parsing
- Tokenization
- Case folding
- Stop-word removal
- Stemming
- Inverted index creation
- Optimization and querying
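A minimal sketch of the tokenization, case folding, stop-word removal, stemming, and inverted index creation stages. The regex tokenizer, NLTK stop list, and Porter stemmer are stand-ins; the actual code may draw different resources from config.py.

```python
import re
from collections import defaultdict

from nltk.corpus import stopwords        # one-time: nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()
TOKEN = re.compile(r"[a-z0-9]+")

def preprocess(text):
    """Tokenization, case folding, stop-word removal, and stemming."""
    tokens = TOKEN.findall(text.lower())             # tokenize + case-fold
    return [STEMMER.stem(t) for t in tokens if t not in STOP]

def build_index(docs):
    """Inverted index: each stemmed term -> sorted doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index(["The quick brown fox", "Foxes are quick"])
# index["fox"] == [0, 1]: "Foxes" was case-folded and stemmed to "fox"
```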