Search Engine Project

Introduction

This is a search engine implemented in Java and served with Tomcat. The project works in four steps:

Step one: crawl websites with a spider
- a spider using a breadth-first search (BFS) approach keeps visiting new pages until all pages have been fetched
- the spider extracts each page's information and content
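
The BFS crawl in step one can be sketched roughly as below. This is a minimal illustration, not the project's actual spider: the class name `BfsCrawlSketch`, the `fetchLinks` function (a stand-in for "download the page and extract its links") and the `maxPages` safety limit are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

class BfsCrawlSketch {
    /**
     * Visits pages breadth-first, starting from seedUrl.
     * fetchLinks is a hypothetical stand-in for fetching a page and extracting its links.
     * maxPages is an assumed safety limit so the crawl always terminates.
     */
    static List<String> crawl(String seedUrl, Function<String, List<String>> fetchLinks, int maxPages) {
        Set<String> seen = new HashSet<>();        // URLs already queued or visited
        Queue<String> frontier = new ArrayDeque<>(); // FIFO queue gives BFS order
        List<String> visitOrder = new ArrayList<>();
        seen.add(seedUrl);
        frontier.add(seedUrl);
        while (!frontier.isEmpty() && visitOrder.size() < maxPages) {
            String url = frontier.remove();
            visitOrder.add(url);
            for (String link : fetchLinks.apply(url)) {
                if (seen.add(link)) {              // add() returns false for known URLs
                    frontier.add(link);
                }
            }
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" used instead of real HTTP requests.
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d", "a"),
                "c", List.of("d"),
                "d", List.of());
        System.out.println(crawl("a", url -> web.getOrDefault(url, List.of()), 100));
        // prints [a, b, c, d]
    }
}
```

The `seen` set is what keeps the spider from fetching the same page twice when several pages link to it.
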
Step two: store and index
- process the content fetched from each page by performing stop word removal and stemming
- store the data and keywords in a database (we use RocksDB, a key-value store, in our project)
- build an inverted index and keep updating its content
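
As a rough illustration of step two, the sketch below removes stop words, applies a stemmer and builds a positional inverted index. It makes several assumptions: the index lives in in-memory maps rather than RocksDB, the class name `InvertedIndexSketch` is hypothetical, and `stem` is only a placeholder that strips a trailing "s" (a real stemmer such as Porter's does far more).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class InvertedIndexSketch {
    // term -> (page id -> positions of the term within that page)
    final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();
    final Set<String> stopWords;

    InvertedIndexSketch(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    /** Placeholder stemmer (assumption): only strips a trailing "s" from longer words. */
    static String stem(String word) {
        return (word.length() > 3 && word.endsWith("s"))
                ? word.substring(0, word.length() - 1)
                : word;
    }

    /** Lower-cases the text, splits on non-alphanumerics, drops stop words, stems the rest. */
    List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || stopWords.contains(raw)) {
                continue;
            }
            terms.add(stem(raw));
        }
        return terms;
    }

    /** Adds one page; positions are counted after stop word removal. */
    void addPage(int pageId, String text) {
        List<String> terms = tokenize(text);
        for (int pos = 0; pos < terms.size(); pos++) {
            postings.computeIfAbsent(terms.get(pos), t -> new HashMap<>())
                    .computeIfAbsent(pageId, p -> new ArrayList<>())
                    .add(pos);
        }
    }
}
```

Calling `addPage` again for a newly crawled page extends the existing posting lists, which corresponds to keeping the index updated as the crawl proceeds.
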
Step three: calculate Google PageRank
- process all fetched pages with the PageRank algorithm
- iterate until the PageRank of each page is close to convergence
- store the PageRank values in the database
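
The iteration in step three might look like the sketch below. It assumes the common formulation PR(p) = (1 - d) + d * sum(PR(q) / outDegree(q)) over the pages q that link to p; the damping factor, the convergence threshold `epsilon` and the class name `PageRankSketch` are choices made for this example, and the project's own formula may differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PageRankSketch {
    /**
     * outLinks maps every page to the pages it links to.
     * Assumption: link targets that are not keys of outLinks are ignored.
     * Iterates until the largest per-page change falls below epsilon,
     * or until maxIterations is reached.
     */
    static Map<String, Double> pageRank(Map<String, List<String>> outLinks,
                                        double damping, double epsilon, int maxIterations) {
        Map<String, Double> rank = new HashMap<>();
        for (String page : outLinks.keySet()) {
            rank.put(page, 1.0); // initial rank of every page
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : outLinks.keySet()) {
                next.put(page, 1.0 - damping);
            }
            for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue; // a page without out-links passes nothing on in this sketch
                }
                double share = damping * rank.get(e.getKey()) / targets.size();
                for (String target : targets) {
                    next.computeIfPresent(target, (k, v) -> v + share);
                }
            }
            double maxChange = 0.0;
            for (String page : outLinks.keySet()) {
                maxChange = Math.max(maxChange, Math.abs(next.get(page) - rank.get(page)));
            }
            rank = next;
            if (maxChange < epsilon) {
                break; // close enough to convergence
            }
        }
        return rank;
    }
}
```

A call such as `PageRankSketch.pageRank(graph, 0.85, 1e-9, 1000)` returns one value per page, which can then be written to the database once rather than recomputed for each query.
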
Step four: process queries
- after the program receives the keywords, perform stop word removal and stemming on them
- calculate cosine similarity using the inverted index
- get the PageRank from the database
- calculate the score of each related page
- return the pages with the highest scores and render the HTML page with JSP
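
The scoring in step four can be illustrated as follows. The cosine similarity is the standard definition over sparse term-weight vectors. How the project combines cosine similarity with PageRank is not stated here, so the weighted sum in `score` and its `pageRankWeight` parameter are purely assumptions for this sketch, as is the class name `ScoringSketch`.

```java
import java.util.Map;

class ScoringSketch {
    /** Cosine similarity between two sparse term-weight vectors. */
    static double cosine(Map<String, Double> query, Map<String, Double> doc) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            dot += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
        }
        double queryLength = length(query);
        double docLength = length(doc);
        if (queryLength == 0.0 || docLength == 0.0) {
            return 0.0; // an empty vector matches nothing
        }
        return dot / (queryLength * docLength);
    }

    /** Euclidean length of a sparse vector. */
    static double length(Map<String, Double> vector) {
        double sumSquares = 0.0;
        for (double weight : vector.values()) {
            sumSquares += weight * weight;
        }
        return Math.sqrt(sumSquares);
    }

    /** Hypothetical combination (assumption): a weighted sum of similarity and PageRank. */
    static double score(double cosine, double pageRank, double pageRankWeight) {
        return (1.0 - pageRankWeight) * cosine + pageRankWeight * pageRank;
    }
}
```

Pages would then be sorted by this score in descending order before the top results are handed to the JSP page for rendering.
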

Update log

Update (26/4): term weighting added and code refactored; phrase search is still not supported.

Update (27/4): the posting list of the inverted index now records the position of each word in the page (similar to slide 14 in the lecture notes: implementation issues). This can be useful for identifying phrases.
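
To show why word positions help with phrases, here is a minimal sketch of a phrase check for a single page. It assumes the positions of each phrase term in that page have already been read from the posting lists; the class and method names are hypothetical.

```java
import java.util.List;

class PhraseMatchSketch {
    /**
     * positionsPerTerm.get(i) holds the positions of the i-th phrase term in one page.
     * The phrase occurs if some position p of the first term is followed by
     * the second term at p + 1, the third at p + 2, and so on.
     */
    static boolean containsPhrase(List<List<Integer>> positionsPerTerm) {
        if (positionsPerTerm.isEmpty()) {
            return false;
        }
        for (int start : positionsPerTerm.get(0)) {
            boolean match = true;
            for (int i = 1; i < positionsPerTerm.size() && match; i++) {
                match = positionsPerTerm.get(i).contains(start + i);
            }
            if (match) {
                return true;
            }
        }
        return false;
    }
}
```

For example, positions `[0, 5]` for the first term and `[1, 9]` for the second indicate a match, because the second term appears at position 1 directly after the first term at position 0.
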

Update (29/4): the retriever has basic functionality (calculating cosine similarity) but still does not support phrase search. Fixed some bugs in the inverted index. Made the code cleaner.

Update (7/5): the retriever is nearly complete (calculates cosine similarity) and now supports phrase search. Fixed some bugs. Document lengths are pre-computed to make search faster. PageRank is still not supported.
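
Pre-computing document lengths means the denominator of the cosine similarity is calculated once at indexing time instead of on every query. The sketch below assumes a tf-idf weight of tf * log2(N / df); the exact term weighting used in the project is not stated here, and the class name `DocumentLengthSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class DocumentLengthSketch {
    /**
     * termFrequencies: term -> (page id -> term frequency in that page).
     * Returns page id -> Euclidean length of the page's tf-idf vector,
     * assuming weight = tf * log2(totalPages / documentFrequency).
     */
    static Map<Integer, Double> documentLengths(Map<String, Map<Integer, Integer>> termFrequencies,
                                                int totalPages) {
        Map<Integer, Double> sumSquares = new HashMap<>();
        for (Map<Integer, Integer> posting : termFrequencies.values()) {
            // Document frequency is the number of pages in the posting list.
            double idf = Math.log((double) totalPages / posting.size()) / Math.log(2);
            for (Map.Entry<Integer, Integer> e : posting.entrySet()) {
                double weight = e.getValue() * idf;
                sumSquares.merge(e.getKey(), weight * weight, Double::sum);
            }
        }
        Map<Integer, Double> lengths = new HashMap<>();
        sumSquares.forEach((pageId, sum) -> lengths.put(pageId, Math.sqrt(sum)));
        return lengths;
    }
}
```

At query time the retriever only needs to look up the stored length for each candidate page, so the cost per query depends on the query terms rather than on the size of every document.
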

Update (10/5): Tomcat server added.

Update (11/5): project nearly complete.

Update (11/5): added PageRank.

Update (12/5): cleaned up code and upgraded CSS.

Remember to download all the db files from the drive.

Remember to place all db files and stopword.txt at: PATH TO TOMCAT\pache-tomcat-8.5.54-windows-x64\apache-tomcat-8.5.54\bin

jordanleeeee/4321-Search-Engine-Project
