The Search Engine Project is a web-crawling application written in C++ that scans websites and stores the collected data in an in-memory database. Users can enter search queries, and the application returns links containing the specified keywords, sorted by the Pagerank algorithm so that the most relevant results come first. Queries can also be received from remote users over TCP. The project uses multithreading for efficiency and for concurrent searching and crawling, so users can search for information even while crawling is still in progress. Its behavior can be customized through a configuration file.
The `Crawler` class is responsible for traversing web pages by retrieving URLs and parsing HTML content. It uses an `Html_parser` object to extract links and text from the retrieved pages. The extracted data is then passed to the `Updater` class, which handles insertion into the database. The `Updater` holds instances of classes implementing the `Graph` interface, which represents the relationships between links, and the `WordLinks` interface, which maps words to the links containing them. The `Crawler` also interacts with instances of the `Scanner` interface, which can be configured for a BFS or DFS strategy. A `Thread_safe_set` ensures the uniqueness of visited URLs. Finally, the crawling process is bounded by a `Limit_counter` object, which stops the crawler once a specified number of links has been reached. The `Limit_counter` is thread-safe, so multiple threads can read and increment it concurrently.
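A thread-safe counter like this can be built on `std::atomic`. The sketch below is illustrative only: the class name matches the text, but the actual interface in the project may differ.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch of a thread-safe limit counter.
class Limit_counter {
public:
    explicit Limit_counter(std::size_t limit) : m_limit(limit), m_count(0) {}

    // Returns true while the increment stays within the limit,
    // false once the crawler should stop.
    bool try_increment() {
        return m_count.fetch_add(1, std::memory_order_relaxed) < m_limit;
    }

    bool limit_reached() const {
        return m_count.load(std::memory_order_relaxed) >= m_limit;
    }

private:
    const std::size_t m_limit;
    std::atomic<std::size_t> m_count;
};
```

Because `fetch_add` is a single atomic read-modify-write, two crawler threads can never both observe the same pre-limit count and overshoot the quota by more than their in-flight increments.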
The `Search_engine` class is responsible for handling user queries, retrieving results, and sorting them. It accomplishes this by holding three objects:

- **Client interface**: enables the `Search_engine` to receive queries from users, whether entered locally via the console or remotely through sockets.
- **Searcher instance**: an implementation of the `Searcher` interface used to retrieve relevant search results from the database.
- **Sorter instance**: an implementation of the `Sorter` interface used to organize the results. Results can be sorted by word occurrences, by the Pagerank algorithm, or by any other custom sorting method.
The Singleton pattern is applied to the `Configuration` class because many classes interact with it and it holds important settings such as the number of result links, the number of threads, the scanning type (BFS or DFS), the sorting method, and whether the application runs locally or remotely. Making `Configuration` a Singleton guarantees a single instance throughout the application, so all classes see consistent settings.
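A common way to implement this in C++ is a Meyers singleton, sketched below. The field names here are illustrative stand-ins for the settings mentioned above, not the project's actual members, and a real implementation would load them from the configuration file.

```cpp
#include <cstddef>
#include <string>

// Minimal Meyers-singleton sketch; fields are illustrative.
class Configuration {
public:
    static Configuration& instance() {
        // Constructed on first use; initialization is thread-safe since C++11.
        static Configuration config;
        return config;
    }

    std::size_t thread_count = 4;
    std::string scan_type   = "BFS";       // or "DFS"
    std::string sort_method = "pagerank";
    bool remote_mode = false;

    Configuration(Configuration const&) = delete;   // no copies:
    Configuration& operator=(Configuration const&) = delete;

private:
    Configuration() = default;  // only instance() can construct
};
```

Every call to `Configuration::instance()` returns a reference to the same object, so a setting changed in one place is visible everywhere.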
During crawling, whenever the `Updater` class finishes inserting a buffer of links into the database, it notifies the `Pagerank` class, which then calculates Pagerank scores for the newly inserted links. This keeps Pagerank scores up to date as new data is added to the database.
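One way this buffer-then-notify flow can be wired is with a callback, as in the hedged sketch below. The stub class names, the callback mechanism, and the flush threshold are all assumptions for illustration; the project's actual notification mechanism may differ.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Stand-in for the Pagerank class: counts notified links instead of
// actually recomputing scores.
class Pagerank_stub {
public:
    void on_batch_inserted(std::size_t batch_size) {
        m_processed += batch_size;  // real code would recompute scores here
    }
    std::size_t processed() const { return m_processed; }
private:
    std::size_t m_processed = 0;
};

// Stand-in for the Updater: buffers links and notifies on each flush.
class Updater_stub {
public:
    explicit Updater_stub(std::function<void(std::size_t)> on_flush)
        : m_on_flush(std::move(on_flush)) {}

    void insert(std::string link) {
        m_buffer.push_back(std::move(link));
        if (m_buffer.size() >= 3) flush();  // threshold chosen for the demo
    }

    void flush() {
        if (m_buffer.empty()) return;
        // ...database insertion of m_buffer would happen here...
        m_on_flush(m_buffer.size());  // notify Pagerank of the new batch
        m_buffer.clear();
    }

private:
    std::vector<std::string> m_buffer;
    std::function<void(std::size_t)> m_on_flush;
};
```

Notifying per buffer rather than per link amortizes the cost of each Pagerank recomputation over many insertions.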
The `Crawler` and `Search_engine` classes interact with various components and depend on parameters such as the sorting method, scanning type, and number of threads supplied by the `Configuration` class. These dependencies are injected during object creation, which decouples the classes from specific implementations, simplifies testing and modification, and lets behavior be customized through configuration settings.
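Configuration-driven injection often takes the shape of a small factory that picks the concrete strategy from a setting. The sketch below assumes a simplified `Scanner` interface and factory; both are illustrative, not the project's actual code.

```cpp
#include <memory>
#include <string>

// Assumed, simplified Scanner interface for illustration.
struct Scanner {
    virtual ~Scanner() = default;
    virtual std::string name() const = 0;
};

struct Bfs_scanner : Scanner {
    std::string name() const override { return "BFS"; }
};
struct Dfs_scanner : Scanner {
    std::string name() const override { return "DFS"; }
};

// Chooses the strategy from a configuration value; the caller (e.g. the
// code constructing the Crawler) injects the result.
std::unique_ptr<Scanner> make_scanner(std::string const& scan_type) {
    if (scan_type == "DFS") return std::make_unique<Dfs_scanner>();
    return std::make_unique<Bfs_scanner>();  // default to BFS
}
```

Because the `Crawler` would receive the `Scanner` through its constructor, a unit test can pass in a mock scanner without touching any configuration file.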
- **Gumbo Parser**: used for parsing HTML content and extracting links from web pages. Install it with:

  ```
  sudo apt install -y libgumbo-dev
  ```

- **Curlpp**: used for downloading pages from URLs. Install it with:

  ```
  sudo apt install -y libcurlpp-dev
  ```

- **GCC Compiler**: the project requires a `g++` compiler (version 7.3 or later) with C++17 support.
- Clone the repository:

  ```
  git clone https://github.com/dudi-w/koogle_search_engine
  cd koogle_search_engine
  ```

- Compile the files:

  ```
  make -j4
  ```

- Run the program:

  ```
  ./mainServer
  ```
To use the client interface, type `e` and press Enter; type `c` instead to keep using the same local terminal. If you chose the client interface, run the following command in a second terminal:

```
./mainClient
```