- Simple search engine with basic functionality.
- NOTE that this project is done for educational and scientific purposes, thus it may not meet industrial and production standards.
The problem is offered and described by Vahram Martirosyan, Ph.D.
The ecosystem of this search engine consists of a few components:
- Crawler
- Link Queue
- Pub / Sub System
- Indexing Service
- Ranking Service
- Storage
- Search Programmatic API
- Search API Host
- Search API UI
- User Services
- Crawler traverses the web, parsing the content of websites and collecting hyperlinks into the link queue for further crawling. It starts from an initial list of the most informative links stored in the link queue.
The crawler performs the following steps:
- Pop a link from link queue.
- Crawl the page.
- Parse the HTML content of the page.
- Push the hyperlinks to the link queue.
- Publish the content of parsed pages with the defined topic.
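
The steps above can be sketched as a single loop. This is only a minimal illustration, not the project's implementation: `fetch` and `publish` are hypothetical callables supplied by the caller, the topic name `parsed-pages` is made up, and the `seen` set used to skip already-queued links is an added assumption that the steps above do not mention.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(link_queue, fetch, publish, max_pages=10):
    """Run the crawl loop until the queue is empty or max_pages is reached.

    fetch(url) -> str and publish(topic, message) are hypothetical callables;
    they stand in for the real HTTP client and the Pub / Sub System.
    Returns the number of pages crawled.
    """
    seen = set(link_queue)  # assumption: never queue the same link twice
    crawled = 0
    while link_queue and crawled < max_pages:
        url = link_queue.popleft()            # 1. pop a link from the queue
        html = fetch(url)                     # 2. crawl the page
        parser = LinkExtractor()
        parser.feed(html)                     # 3. parse the HTML content
        for href in parser.links:             # 4. push hyperlinks to the queue
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                link_queue.append(absolute)
        publish("parsed-pages", {"url": url, "html": html})  # 5. publish
        crawled += 1
    return crawled
```

In a test, `fetch` can simply be a dictionary lookup over canned pages and `publish` a function that appends to a list, which keeps the loop independent of the network and of the message broker.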
- Link Queue will probably be a custom-designed queue framework supporting multiple push / pop, suited to this purpose.
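
One possible shape for such a queue, assuming that "multiple push / pop" means several crawlers pushing and popping concurrently, is a thin wrapper around Python's thread-safe `queue.Queue`. The class name `LinkQueue`, its methods, and the duplicate filtering are all hypothetical choices for illustration.

```python
import queue
import threading


class LinkQueue:
    """Thread-safe FIFO of links that many crawlers can push to and pop from."""

    def __init__(self, seeds=()):
        self._queue = queue.Queue()
        self._seen = set()
        self._lock = threading.Lock()
        self.push(seeds)

    def push(self, links):
        """Enqueue every link not seen before; return how many were added."""
        added = 0
        for link in links:
            with self._lock:
                if link in self._seen:
                    continue
                self._seen.add(link)
            self._queue.put(link)
            added += 1
        return added

    def pop(self, timeout=None):
        """Return the next link, or None if none arrives within timeout seconds.

        With timeout=None the call blocks until a link is available.
        """
        try:
            return self._queue.get(timeout=timeout)
        except queue.Empty:
            return None
```

A crawler worker would call `pop(timeout=...)` in a loop and treat `None` as a signal that the queue has run dry for now.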
- Pub / Sub System is a library providing convenient abstractions over RabbitMQ. This module is used by the Crawler and the Indexing Service. Crawlers publish the parsed content under a defined topic, and the Indexing Service subscribes to this topic to receive updates. In the future, the Ranking Service can also subscribe.
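
To illustrate the publish / subscribe relationship described here, the following sketch uses an in-memory stand-in. It does not talk to RabbitMQ, and the class `InMemoryPubSub`, its method names, and the topic `parsed-pages` are assumptions made for the example; the real library would forward these calls to the broker.

```python
from collections import defaultdict


class InMemoryPubSub:
    """Minimal topic-based pub/sub; a stand-in for a RabbitMQ-backed library."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register callback(message) to be called for each message on topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver message to every subscriber of topic; return the delivery count."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


# Usage: the Indexing Service subscribes, a Crawler publishes.
bus = InMemoryPubSub()
indexed = []
bus.subscribe("parsed-pages", indexed.append)
delivered = bus.publish("parsed-pages", {"url": "http://a.test/", "html": "<p>hi</p>"})
```

Adding the Ranking Service later would only require one more `subscribe` call on the same topic; the Crawler side stays unchanged.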
- The primary goal of Indexing Service is to index the parsed content to improve search speed.
Operations this service should perform:
- Subscribe to the defined topic.
- Add the results to storage.
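
The text does not say which index structure the service builds. As one common option, the sketch below uses an inverted index (a map from each term to the pages containing it); the names `tokenize` and `InvertedIndex` are hypothetical, and the index lives in memory rather than in Storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of URLs whose content contains it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, url, text):
        """Index one parsed page, e.g. from a message received on the topic."""
        for term in tokenize(text):
            self._postings[term].add(url)

    def lookup(self, query):
        """Return the sorted URLs that contain every term of the query."""
        terms = tokenize(query)
        if not terms:
            return []
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return sorted(result)
```

Looking up a term is then a dictionary access instead of a scan over every stored page, which is where the search speed-up comes from.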
- Ranking Service ranks the results, domains and links to improve the quality and relevancy of the search results. This module is a relatively active component.
- Storage is a fairly passive component. It is a separate process that provides an API for our MySQL database.
API that the Storage component exposes:
- AddCache - add page content
- Search
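
The two operations can be sketched as follows. The signatures are not given in this document, so the parameters shown are assumptions, and the example uses Python's built-in `sqlite3` as a stand-in for the MySQL database so that it runs without a server. The table layout is likewise hypothetical.

```python
import sqlite3


class Storage:
    """Sketch of the Storage API; SQLite stands in for MySQL here."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, content TEXT NOT NULL)"
        )

    def add_cache(self, url, content):
        """AddCache: store the page content, replacing any earlier copy."""
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
                (url, content),
            )

    def search(self, term):
        """Search: return URLs whose content contains term, ordered by URL.

        Uses a plain LIKE match, so % and _ inside term act as wildcards.
        """
        rows = self._db.execute(
            "SELECT url FROM pages WHERE content LIKE ? ORDER BY url",
            ("%" + term + "%",),
        )
        return [url for (url,) in rows]
```

A substring match like this scans every row; in the full system the Indexing Service would presumably supply a faster lookup path.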
- Search Programmatic API is a library for executing search queries.
- Search API Host is a service that exposes a public API on top of the Search Programmatic API.