A simple search system for information retrieval, consisting of an inverted index builder and a boolean query processor.
This program uses the Reuters-21578 dataset. Place the dataset in the project directory before building the inverted index.
Also remember to list your stopwords in the `stopwords.txt` file.
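As a rough illustration of what an inverted index with stopword filtering looks like, here is a minimal Python sketch. It is not the code from this repository: the function name `build_inverted_index`, the regex tokenizer, and the toy documents are all assumptions made for the example.

```python
import re
from collections import defaultdict


def build_inverted_index(docs, stopwords):
    """Map each non-stopword token to the sorted list of document IDs containing it.

    `docs` is a hypothetical {doc_id: text} mapping; the real program reads
    the Reuters .sgm files instead.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase, keep runs of letters and digits.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token not in stopwords:
                index[token].add(doc_id)
    # Sorted posting lists make results deterministic and easy to merge.
    return {term: sorted(ids) for term, ids in index.items()}


# Toy data, not taken from the Reuters-21578 dataset.
docs = {
    1: "Oil prices rise as agriculture exports fall",
    2: "Vegetable oil and agriculture",
    3: "The price of love",
}
stopwords = {"the", "of", "and", "as"}

index = build_inverted_index(docs, stopwords)
print(index["oil"])  # [1, 2]
```

Stopwords never enter the index, so queries on them match no documents in this sketch.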
The file tree must look like this:
├── dictionary.pkl (not necessary)
├── main.py
├── README.md
├── reuters21578
│   ├── lewis.dtd
│   ├── README.txt
│   ├── reut2-000.sgm
│   ├── reut2-001.sgm
│   ├── ...
│   └── reut2-021.sgm
├── src
│   ├── base.py
│   ├── inverted_intex.py
│   ├── query_processor.py
│   └── sgm_preprocessor.py
└── stopwords.txt
Run the program with the `python main.py` command. It reads a query, prints the result, and repeats until `q` is entered.
4 different query types are implemented:
- Conjunction: w1 AND w2 AND w3...AND wn
example: oil AND agriculture AND vegetable
result: [3950, 5655, 7625, 8003, 9550, 9756, 10720, 14509, 15341, 18403, 20232]
- Disjunction: w1 OR w2 OR w3...OR wn
example: hate OR love OR cry
result: [1895, 3148, 6338, 7366, 8827, 10890, 17099, 17903, 19559]
- Conjunction and Negation: w1 AND w2...AND wn NOT wn+1 NOT wn+2 ...NOT wn+m
example: oil AND agriculture AND vegetable NOT price
result: [3950, 5655, 7625, 8003, 9550, 9756, 10720, 14509, 15341, 20232]
- Disjunction and Negation: w1 OR w2...OR wn NOT wn+1 NOT wn+2 ...NOT wn+m
example: hate OR love OR cry NOT money NOT price
result: [1895, 3148, 6338, 7366, 8827, 10890, 17099]
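The four query forms above can be sketched with plain Python sets. This is only an illustration under stated assumptions, not the logic in `src/query_processor.py`: the function name `evaluate`, the whitespace-based parsing, and the toy index are hypothetical, and set operations stand in for whatever posting-list merging the program actually performs.

```python
def evaluate(query, index):
    """Evaluate 'w1 AND w2 ... NOT x' or 'w1 OR w2 ... NOT x' against an index.

    `index` is a hypothetical {term: [doc_id, ...]} mapping.
    """
    positive, negative = [], []
    operator = None    # "AND" or "OR"; stays None for a single-word query
    target = positive  # the first word is always a positive term
    for token in query.split():
        if token in ("AND", "OR"):
            if operator not in (None, token):
                raise ValueError("mixing AND and OR is not supported")
            operator = token
            target = positive
        elif token == "NOT":
            target = negative
        else:
            target.append(token.lower())
    if not positive:
        raise ValueError("query needs at least one positive term")

    # Unknown terms contribute an empty posting set.
    postings = [set(index.get(term, ())) for term in positive]
    if operator == "OR":
        result = set().union(*postings)
    else:
        result = set.intersection(*postings)
    # Negation removes every document containing an excluded term.
    for term in negative:
        result -= set(index.get(term, ()))
    return sorted(result)


# Toy index with made-up document IDs (not Reuters-21578 data).
toy_index = {
    "oil": [1, 2, 4],
    "agriculture": [1, 2],
    "vegetable": [2, 4],
    "price": [1, 3],
    "love": [3, 5],
    "hate": [5],
}

print(evaluate("oil AND agriculture", toy_index))            # [1, 2]
print(evaluate("oil AND agriculture NOT price", toy_index))  # [2]
print(evaluate("hate OR love", toy_index))                   # [3, 5]
print(evaluate("hate OR love NOT price", toy_index))         # [5]
```

Negated terms are applied last, so they filter the combined AND or OR result rather than any single term.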