Search Engine

This is the final project for SMU's CS 2341 (Data Structures) class. It was done as a partner programming assignment with another student, Maria Harrison.

It is a rudimentary search engine that combs through a corpus of COVID-19 research articles and builds two main data structures: an inverted file index (stored as an AVL tree) and a hash table that stores information about each article's authors. The user can then enter queries using simple boolean operators to find documents that contain specific words. The engine also writes persistence files; on startup, if persistence files from a previous execution exist in the directory, the engine can load them to construct the two data structures faster. The engine also records several statistics, such as the number of unique articles indexed and the average number of words indexed per article.
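
As a rough illustration of how boolean queries can be answered from an inverted index, the sketch below maps each word to the set of documents containing it and combines those sets with the standard set algorithms. It is not the project's code: std::map stands in for the AVL tree, and TinyIndex, docsAnd, docsOr, and docsNot are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <map>
#include <set>
#include <string>

// Hypothetical names throughout; std::map is a stand-in for the AVL tree.
using DocId = std::string;
using DocSet = std::set<DocId>;

class TinyIndex {
public:
    // Record that `word` (assumed already lowercased and stemmed) occurs in `doc`.
    void add(const std::string& word, const DocId& doc) {
        assert(!word.empty());
        postings_[word].insert(doc);
    }

    // Return the documents containing `word`, or an empty set if it was never indexed.
    DocSet lookup(const std::string& word) const {
        auto it = postings_.find(word);
        if (it == postings_.end()) return {};
        return it->second;
    }

private:
    std::map<std::string, DocSet> postings_;
};

// AND: documents present in both sets.
DocSet docsAnd(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// OR: documents present in either set.
DocSet docsOr(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.begin()));
    return out;
}

// NOT: documents in `a` that are absent from `b`.
DocSet docsNot(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.begin()));
    return out;
}
```

Under these assumptions, finding the documents that contain both "virus" and "vaccin" reduces to docsAnd(index.lookup("virus"), index.lookup("vaccin")). Because std::set keeps its elements sorted, each combination runs in time linear in the sizes of the two sets.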

I wrote the following classes:

UserInterface: A basic interface for the user to input commands. Allows the user to:
- Clear the inverted file index and author hash table,
- Populate the index and hash table from scratch, also creating new persistence files,
- Populate the index and hash table with existing persistence files,
- Enter a search query, or
- Output engine statistics.
DocumentProcessor: Parses .json documents and a metadata .csv file.
- Utilizes the std::filesystem library to iterate through all documents in the corpus.
- Utilizes RapidJSON, a C++ .json parsing library, to record the words and authors that appear in each document within the corpus.
- Utilizes the Oleander Stemming library. This keeps the inverted file index from being bloated with grammatical variants of the same word. For example, "running" is not added; only its stem, "run", is.
- Communicates with the IndexHandler to add each processed document's information to the data structures.
IndexHandler: Instantiates a "singleton" instance of the inverted file index and authors hash table.
- Interfaces with the index and hash table to add words or authors.
- Runs a user-inputted query and returns a set of relevant documents.
- Loads saved persistence files.
- Retrieves the engine's statistics.
HashTable: A templated hash table.
- Stores an author's name and the documents that they have written.
- Uses separate chaining.
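
To make the separate chaining idea concrete, here is a minimal sketch of a templated hash table in which each bucket holds a linked list of key/value pairs. It is an illustration under stated assumptions, not the project's HashTable: the class name ChainedTable, its interface, the default bucket count, and the use of std::hash are all hypothetical, and the sketch never resizes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a chained hash table; not the project's HashTable.
template <typename K, typename V>
class ChainedTable {
public:
    explicit ChainedTable(std::size_t bucketCount = 101) : buckets_(bucketCount) {
        assert(bucketCount > 0);
    }

    // Return the value stored for `key`, inserting a default-constructed one if absent.
    V& operator[](const K& key) {
        auto& chain = buckets_[indexOf(key)];
        for (auto& entry : chain)
            if (entry.first == key) return entry.second;
        chain.emplace_back(key, V{});  // a collision simply lengthens the chain
        ++size_;
        return chain.back().second;
    }

    // Return a pointer to the value stored for `key`, or nullptr if it is absent.
    const V* find(const K& key) const {
        for (const auto& entry : buckets_[indexOf(key)])
            if (entry.first == key) return &entry.second;
        return nullptr;
    }

    // Number of distinct keys stored.
    std::size_t size() const { return size_; }

private:
    std::size_t indexOf(const K& key) const {
        return std::hash<K>{}(key) % buckets_.size();
    }

    std::vector<std::list<std::pair<K, V>>> buckets_;
    std::size_t size_ = 0;
};
```

An author table could then be declared as ChainedTable<std::string, std::set<std::string>>, so that table["Some Author"].insert("doc.json") records one document for that author. Lookups stay correct with any number of buckets; fewer buckets only make the chains longer.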

The C++ standard library is used extensively throughout the project; std::set, std::filesystem, and std::pair were especially useful to me. We used Doxygen to document all of our classes and created a UML class diagram for our professor.
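
As one illustration of the std::filesystem usage mentioned above, walking a corpus directory for .json documents might look like the sketch below. It requires C++17. The helper names findJsonFiles and readWholeFile are hypothetical, and the recursive walk is an assumption about how the corpus is laid out.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: collect every regular file ending in ".json" under `root`.
std::vector<fs::path> findJsonFiles(const fs::path& root) {
    std::vector<fs::path> found;
    if (!fs::is_directory(root)) return found;  // missing corpus: nothing to index
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_regular_file() && entry.path().extension() == ".json")
            found.push_back(entry.path());
    }
    std::sort(found.begin(), found.end());  // iteration order is unspecified, so sort
    return found;
}

// Hypothetical helper: read one document into memory so a JSON parser can consume it.
std::string readWholeFile(const fs::path& file) {
    assert(fs::is_regular_file(file));
    std::ifstream in(file, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Sorting the result makes repeated runs process documents in the same order, which can help when comparing statistics between executions.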

The Doxygen-generated documentation site is saved under the Documentation folder as index.html.
The UML class diagram is saved in the source directory as Search Engine UML Class Diagram.jpg.

