Skip to content

A domain-specific Python search engine leveraging Flask, Pinecone, and Sentence Transformers for semantic search. Deployed on Azure with Gunicorn, Nginx, and SSL for secure and scalable performance.

Notifications You must be signed in to change notification settings

mallickboy/Python_Search_Engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python Search Engine 2.0 Server setup

Pull the code

mkdir pysearch

cd pysearch

git clone https://github.com/mallickboy/Python_Search_Engine.git

cd Python_Search_Engine

git checkout version2.0

Create virtual environment

sudo apt install python3.9 python3.9-venv python3.9-distutils

python3.9 -m venv pysearch

Activate virtual environment (from parent folder)

Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass &

.\search_engine\Scripts\activate or

source pysearch/bin/activate

Install PyTorch ( lightweight CPU version only )

pip install torch --index-url https://download.pytorch.org/whl/cpu

Install required libraries

pip install -r requirements.txt

Open firewall & update inbound port 8000 in Azure

sudo ufw enable

sudo ufw allow 8000

sudo ufw status

Run and Test

python app.py

Visit http://{your_server_ip}:8000

Adding SSL, NGINX and deployment using GUNICORNN

Deploy the site using gunicorn

gunicorn -w 4 -b 0.0.0.0:8000 app:app

sudo nano /etc/systemd/system/pysearch.service paste code of server_setup/pysearch.service

sudo systemctl daemon-reload

sudo systemctl restart pysearch

sudo systemctl enable pysearch

sudo systemctl status pysearch

Setup Domain/ Sub-Domain

Add New record in Advanced DNS in domain provider

Type = "A Record" Host= "pysearch" Value = "server public ip" TTL = "Automatic" ( Sub-Domain )

Type = "A Record" Host= "@" Value = "server public ip" TTL = "Automatic" ( Domain )

Installing SSL certificate

sudo apt install certbot python3-certbot-nginx

sudo certbot --nginx -d pysearch.mallickboy.com

sudo certbot certificates

sudo systemctl status certbot.timer check auto renewal

Set-Up NGINX

sudo nano /etc/nginx/sites-available/pysearch.mallickboy.com (copy contents of server_setup/pysearch.mallickboy.com)

sudo nano /etc/nginx/sites-available/default (copy contents of server_setup/default)

sudo ln -s /etc/nginx/sites-available/pysearch.mallickboy.com /etc/nginx/sites-enabled/

sudo nginx -t

sudo systemctl reload nginx

sudo systemctl restart nginx

Visit

https://pysearch.mallickboy.com

Output View

**Pinecone Side Vectors** ![pinecone](https://github.com/user-attachments/assets/0bd8a37f-510c-471e-9206-76135b905bd1)

Client Side Search Results :

Screenshot 2024-04-02 211102

Screenshot 2024-04-02 210855

Screenshot 2024-04-02 210658

Server Side Messages :

Screenshot 2024-04-02 211726

image

Design and Discussion

Group Members: Tamal Mallick , Sushanta Das , Suvam Manna

Problem Description:

Building a search engine for a specific domain (Python) with the help of web crawling, Socket

programming, sentence embedding and Vector database, to get relevant result for specific domain. User

will get result of a search query based on cosine similarity search in vector database. Also implemented and

integrated Knapsack Cryptosystem for securely transmitting user query and search results over network.

Algorithm and Design:

1) Collecting web pages to make our search engine database

i) Implemented web crawling using multithreading with some starting links containing keyword

“Python”, to collect link of webpage and then metadata (such as title, heading tag, some

paragraph) to gather valuable information about each link. This information will be used to

search for the webpage URL.

ii) We are collecting only the valid link / webpages by ignoring the links with status code in

between (400 ,499)

iii) These data are stored in Vector database (Pinecone) as Word Embeddings (High dimensional

Vectors ,768 dimensions). Now we can search some query, database will return most similar

results (each result having title, link and description).

2) Building and running the webserver

i) Webserver is implemented using socket programming with multithreading which will handle

multiple HTTP request coming from clients.

ii) Once the server is running, if anyone search the server URL using web client (Browser) first

the server is sends Required html, CSS and JavaScript file to run the main client program and

access frontend to see outputs.

iii) Once the dedicated client’s program (client.js) is running, in case of Secure mode public key

and private key is generated using knapsack cryptographic algorithm. First server and client

exchange their public keys. Then it will continue to listen for search requests from client.

iv) Now onwards whenever the server receives a search request from client it performs a cosine

similarity search on the vector database/local file and send the top results to the client. In

secure mode encryption and decryption are performed before each send operation and after

each receive operation through socket.

3) Searching some query

i) User will visit the link at which server is running (like http://192.168.29.37:8080/). A

webpage will open which has input box and search button. In Secure mode client will receive

public key of server.

ii) When user types some search query and hits submit button, client will send the query to the

server. Server has an endpoint for accepting Post request (POST /submit HTTP/1.1). Server

will accept the request.

iii) Now server will call a function for searching on vector database and finally send the relevant

search results to the client.

iv) Results will be displayed in the client’s webpage.

Future Work

Future enhancements and developments for the search engine may include increasing the searching speed,

understand and collect more valuable metadata, providing summery of the top results, refining search

algorithms for enhanced accuracy and relevance, expanding the search database to encompass a broader

range of Python-related content, integrating advanced security features to mitigate emerging threats, and

incorporating user feedback to continuously enhance the user experience and functionality.

Conclusion

The creation of a domain-specific search engine tailored for Python represents a significant leap forward in

providing users with a robust platform for accessing relevant information within the Python programming

ecosystem. By seamlessly integrating advanced technologies such as web crawling, vector databases, socket

programming, and cryptographic protocols, this search engine delivers not only swift and accurate search

results but also ensures the security of user interactions. Also, implementation of multithreading and query-

based client holding allow us to save the resources and serve large number of clients at a time.

About

A domain-specific Python search engine leveraging Flask, Pinecone, and Sentence Transformers for semantic search. Deployed on Azure with Gunicorn, Nginx, and SSL for secure and scalable performance.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published