Search Engine

This project is a small search engine.

The data used in this search engine are from Common Crawl.

Design

There are four main components of this search engine:

Posting Generator

This component processes the crawled web data (in WARC format), parses the HTML content, extracts useful information, and generates a posting for every word occurrence. Since this part is I/O intensive, it is implemented using multithreading.

The Posting Generator is also responsible for building auxiliary data such as the term table and the URL table, which are later used by the query processor.
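
As a rough illustration of what a posting and the URL table could look like, here is a minimal Python sketch. It does not read WARC files or parse HTML: it assumes each page has already been reduced to a (url, text) pair. The names tokenize, postings_for_page, and generate_postings are hypothetical and are not taken from the PostingGenerator code. A thread pool stands in for the multithreading described above.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumption: a token is a run of lowercase letters or digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def postings_for_page(doc_id, text):
    """Return one (term, doc_id, position) posting per word occurrence."""
    return [(term, doc_id, pos) for pos, term in enumerate(tokenize(text))]


def generate_postings(pages, workers=4):
    """Build a URL table and a sorted posting list.

    pages is a list of (url, text) pairs; the list index serves as the doc ID.
    """
    url_table = {doc_id: url for doc_id, (url, _) in enumerate(pages)}
    doc_ids = range(len(pages))
    texts = [text for _, text in pages]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, one posting list per page.
        per_page = list(pool.map(postings_for_page, doc_ids, texts))
    # Sort by (term, doc_id, position) so each intermediate run is ordered.
    postings = sorted(p for page in per_page for p in page)
    return url_table, postings
```

Sorting each batch before it is written out is what lets a later merge step combine the intermediate files without loading them fully into memory.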

Merger

Since the Posting Generator is multithreaded, all intermediate results are stored as sorted files on disk. The Merger's job is to combine and sort all of these intermediate results in an I/O-efficient way.
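
One common way to combine sorted runs with little memory is a k-way merge. The sketch below shows the idea in Python using heapq.merge. It assumes each intermediate file holds one posting per line in the form "term doc_id position" and is already sorted by that triple. This file layout and the names parse_posting, merge_runs, and merge_files are assumptions made for illustration, not the MergeSort implementation.

```python
import heapq


def parse_posting(line):
    """Turn a 'term doc_id position' line into a sortable tuple."""
    term, doc_id, position = line.split()
    return term, int(doc_id), int(position)


def merge_runs(*runs):
    """Lazily merge already-sorted iterables of posting lines."""
    # heapq.merge keeps only one pending line per run in memory.
    return heapq.merge(*runs, key=parse_posting)


def merge_files(run_paths, out_path):
    """Merge sorted run files on disk into a single sorted file."""
    files = [open(path, encoding="utf-8") for path in run_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            out.writelines(merge_runs(*files))
    finally:
        for f in files:
            f.close()
```

Comparing parsed tuples rather than raw lines matters here: as plain strings, "apple 0 10" would sort before "apple 0 3".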

Index Builder

The Index Builder takes the output of the Merger and builds inverted lists for all word occurrences. Because the inverted lists are large, we use various compression techniques, which reduce the index size by about 90%.
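
The text above does not name the compression techniques used. As one plausible example, the Python sketch below combines gap (delta) encoding of sorted document IDs with variable-byte encoding, a pairing often used for inverted lists. It is an assumption chosen for illustration, and all function names are hypothetical.

```python
def varbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low bits first.

    The high bit of a byte is set when more bytes of the same number follow.
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("varbyte encoding needs non-negative integers")
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


def varbyte_decode(data):
    """Inverse of varbyte_encode."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current, shift = 0, 0
    return numbers


def compress_doc_ids(doc_ids):
    """Store a sorted doc ID list as variable-byte encoded gaps."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return varbyte_encode(gaps)


def decompress_doc_ids(data):
    """Rebuild the doc ID list by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in varbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids
```

Gaps between neighbouring document IDs are usually much smaller than the IDs themselves, so most of them fit in a single byte.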

Query Processor

The final component handles user queries and returns the top 20 results according to the ranking functions. The query processor also generates a snippet for each result. This component is built with the goal of minimizing latency.
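
To make the top-20 selection concrete, here is a small Python sketch. The ranking functions are not specified above, so the sketch assumes BM25 with the commonly used parameters k1 = 1.2 and b = 0.75. It scores a toy in-memory collection instead of reading compressed inverted lists, and it does not generate snippets. The names bm25_score and top_k are hypothetical.

```python
import heapq
import math


def bm25_score(query_terms, doc_tf, doc_len, avg_len, doc_freq, num_docs,
               k1=1.2, b=0.75):
    """Score one document against the query with BM25 (assumed ranking)."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        df = doc_freq[term]
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf + k1 * (1 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


def top_k(query_terms, docs, k=20):
    """Return up to k (score, doc_id) pairs, best first.

    docs maps doc_id -> {term: term frequency}.
    """
    if not docs:
        return []
    num_docs = len(docs)
    lengths = {d: sum(tf.values()) for d, tf in docs.items()}
    avg_len = sum(lengths.values()) / num_docs
    doc_freq = {t: sum(1 for tf in docs.values() if t in tf)
                for t in query_terms}
    scored = []
    for doc_id, tf in docs.items():
        s = bm25_score(query_terms, tf, lengths[doc_id], avg_len,
                       doc_freq, num_docs)
        if s > 0:  # skip documents that match no query term
            scored.append((s, doc_id))
    # nlargest avoids sorting every candidate when only k are needed.
    return heapq.nlargest(k, scored)
```

Selecting with a heap rather than a full sort keeps the per-query cost low when many documents match, which fits the latency goal.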

Check out the README files in each subdirectory for further details.

There are three main steps:
