Email Index 'n' Search

Table of Contents

  • Description
  • Pre-requisites
  • Quick Start
  • Considerations
  • Documentation

Description

This repository contains the source code for the project "email-index-n-search". A set of apps that allow indexing and searching emails on ZincSearch, "a search engine that does full-text indexing".

Pre-requisites

Before running the applications, you need to have the following dependencies installed on your system (as used in the Quick Start steps below):

  • Docker and Docker Compose (to run ZincSearch)
  • Go (for the indexer and server apps)
  • Node.js and npm (for the emails-search-app)
  • Make (to use the Makefile shortcuts)

Quick Start

Initial setup

Follow the steps below to set up the apps:

  1. Clone the repository:
git clone https://github.com/juliandresbv/email-index-n-search
cd email-index-n-search/
  2. On the root directory of the repository, run ZincSearch on Docker:
docker-compose up -d

or

docker compose up -d
  3. Install dependencies:

    Note: Choose either option 1 (Make command) or option 2 (individual tabs/windows) below:

    1. Make command:

      1. On the root directory of the repository, run the following command:

        make install
    2. Individual tabs/windows:

      Open three (3) terminal windows/tabs, each at the root of the project:

      1. On tab/window #1, run the following commands to install the dependencies for the indexer app:

        cd indexer/
        go mod download -x
      2. On tab/window #2, run the following commands to install the dependencies for the server app:

        cd server/
        go mod download -x
      3. On tab/window #3, run the following commands to install the dependencies for the emails-search-app app:

        cd emails-search-app/
        npm install
  4. Create and set the environment variables in a .env file for every app, following the form of the .env.example file in each app's directory.

Running the apps

Follow the steps below to run the apps:

  1. Indexer:

    Note:

    About the download and decompression of the dataset:

    The indexer app downloads and decompresses the dataset automatically. However, if you prefer to do it manually, download the dataset from https://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz, then copy/move it to the indexer/data directory and decompress it.

    1. Open/re-use a terminal window/tab at the indexer/ directory and follow the steps below:

      • Dev mode:

        make run-dev [prof.mode=<mode>]
      • Prod mode:

        make run-prd [prof.mode=<mode>]

      Note: To run the app with profiling enabled, add the prof.mode=<mode> flag to the command above. The available modes are: cpu, mem, goroutine, thread. The profiling results are saved in the indexer/profiling-results directory.

  2. Server:

    1. Open/re-use a terminal window/tab at the server/ directory and follow the steps below:

      • Dev mode:

        make run-dev
      • Prod mode:

        make run-prd
  3. Emails Search App:

    1. Open/re-use a terminal window/tab at the emails-search-app/ directory and follow the steps below:

      1. Run app:

        npm run dev [-- --port=<port>]

      Note: To run the app on a specific port, add the -- --port=<port> flag to the command above.

Considerations

Profiling

To explore the profiling results interactively, run the following command on a terminal based at the indexer/ directory:

go tool pprof -http=:<port> ./profiling-results/<pprof-file>

Below are graphs of some of the memory profiling results:

  • Pre-optimization memory profiling:

    • In-use space Callgraph:

      Memory profiling pre-optimizations

    • In-use space Flame graph:

      Memory profiling pre-optimizations

  • Post-optimization memory profiling:

    • In-use space Callgraph:

      Memory profiling post-optimizations

    • In-use space Flame graph:

      Memory profiling post-optimizations

Optimizations

Sequential vs Concurrent

At the initial stage of the development of the indexer app, the indexing process was implemented with a sequential approach. However, after reviewing the resources consumed during execution, it was found that tasks such as reading the dataset and parsing the data were consuming considerable resources and time.

Looking at the OS Activity Monitor, the memory consumed by the indexer under the sequential approach was around 2.5 GB and kept increasing over time.

A different approach was therefore needed to improve the performance of the app. After some research, concurrency looked like a good fit, so the indexing process was re-implemented using a concurrent approach.

The tasks that were refactored to run concurrently were:

  • Downloading the dataset:

    The download is split across concurrent goroutines, each fetching a 5 MB chunk of the dataset.

  • Reading and parsing the dataset, and producing JSON files to index:

    The reading and parsing of the dataset is split into chunks of 1000 files (further explained in the JSON file size section), and the number of goroutines spawned to process them is 5% of the number of chunks.

    For the dataset used in this project, there are around 517000 files to read, resulting in 517 chunks of 1000 files and around 25 goroutines (5% of 517 chunks) spawned to process the chunks.

    The concurrent implementation uses a semaphore to cap the number of goroutines running at the same time, avoiding overuse of resources.

    The final step of this set of tasks is producing the JSON files to index, by marshaling the data read and parsed from the dataset into JSON files.

  • Indexing the documents via ZincSearch API using bulk load endpoint:

    Once the JSON files are produced, indexing is done by sending HTTP requests to the ZincSearch bulk load endpoint, with the previously produced JSON files as the request body.

    This process also runs concurrently, via goroutines controlled by a semaphore.

    The number of goroutines spawned to bulk load the JSON files concurrently is 5% of the number of JSON files produced; this percentage was tuned to avoid overloading the ZincSearch API.

    Additionally, throttling logic was implemented to send HTTP requests every 750 milliseconds.

Looking at the OS Activity Monitor, the memory consumed by the indexer with the concurrent approach was around 300 MB and remained stable over time.

JSON file size

For indexing the emails into ZincSearch, bulk loading was considered the best option, since the ZincSearch API provides three endpoints for bulk loading documents (bulk load V1, bulk load V2, multidocument upload). The bulk load V2 endpoint was chosen, since the data to be indexed is already in the form of JSON files.

At the first stage of development, each JSON file held 5000 records. However, testing showed that files of this size made the indexing process slow and caused high latencies when consuming the API: its internal response time was around 20000 milliseconds.

Therefore, the size of the JSON files was reduced to 1000 records per file (following the recommendations provided for the bulk load V1 API endpoint). This change improved both the indexing performance and the API latencies: the internal response time dropped to around 200 milliseconds.

As a remark, the trade-off between JSON file size and the API's response time and throughput was a key consideration. Fewer records per file means more files overall, but better performance when consuming the API; spending more storage space is preferable to overloading the ZincSearch API.

DTOs (Marshaling and Unmarshaling) vs Bytes

Before the optimizations were applied to the bulk load process, the profiling results showed very frequent use of the Marshal and Unmarshal functions, specifically when reading each JSON file and converting it to a struct to be indexed via the ZincSearch API. The Pre-optimizations memory profiling/In-use space Callgraph graph above helps to illustrate this. The implication of using Marshal and Unmarshal was a large number of memory allocations and deallocations at runtime, which over time puts pressure on the garbage collector and deteriorates the performance of the app.

Therefore, the Marshal and Unmarshal round-trip was replaced by passing raw bytes. This change improved the performance of the app and reduced its memory usage. The Post-optimizations memory profiling/In-use space Callgraph graph above helps to illustrate this.

The trade-off of this change is that the code is less readable: it loses the context that DTOs provide (a clear contract for the shape of the data expected at each point of the app). In this case, however, performance was prioritized over readability and maintainability.

Documentation

API documentation

For a better understanding of the Server API, OpenAPI documentation via Swagger is available at http://localhost:<port>/docs/index.html while the server is running.
