parshva45/Inverted-Indexer
The three tasks of IR HW3 are covered by the following files and directories:

1) Three Python files, task1.py, task2.py and task3.py, performing Tasks 1, 2 and 3 respectively
2) A directory named "Raw_HTML_Downloads" containing 1000 files, each holding the raw HTML of one of the 1000 URLs crawled in IR HW1 Task 1 (input for HW3 Task 1)
3) A directory named "Tokenization_Outputs" containing the 1000 output files of Task 1
4) A directory named "Indexing_Outputs" containing 3 files:
   - Unigrams.txt - the unigrams of all tokens from the 1000 files in "Tokenization_Outputs", each with the Document IDs of the documents in which it was found and its term frequency in each of those documents
   - Bigrams.txt - the bigrams of all tokens from the 1000 files in "Tokenization_Outputs", each with the Document IDs of the documents in which it was found and its term frequency in each of those documents
   - Trigrams.txt - the trigrams of all tokens from the 1000 files in "Tokenization_Outputs", each with the Document IDs of the documents in which it was found and its term frequency in each of those documents
5) A directory named "Frequency_Tables" containing 6 files:
   - Unigrams_Term_Frequency_Table.txt - unigrams with the total term frequency of each unigram, computed from the Task 2 output
   - Unigrams_Document_Frequency_Table.txt - unigrams, each with the list of Document IDs of the documents in which it was found and its document frequency
   - Bigrams_Term_Frequency_Table.txt - bigrams with the total term frequency of each bigram, computed from the Task 2 output
   - Bigrams_Document_Frequency_Table.txt - bigrams, each with the list of Document IDs of the documents in which it was found and its document frequency
   - Trigrams_Term_Frequency_Table.txt - trigrams with the total term frequency of each trigram, computed from the Task 2 output
   - Trigrams_Document_Frequency_Table.txt - trigrams, each with the list of Document IDs of the documents in which it was found and its document frequency
6) A file "Global_statistics.txt" containing the total number of unigrams, bigrams and trigrams
7) A file "Stop_Words.txt" containing a list of stop words with appropriate explanation

Setup:
- You should have a Python programming environment set up on your machine.

Run the code:
- Run from your Terminal or Command Prompt.
- To perform Task 1, go to the directory where task1.py resides and run
  > python task1.py
  Type 1, 2, 3 or 4, depending on whether you wish to perform depunctuation and/or case-folding, and press Enter.
- To perform Task 2, go to the directory where task2.py resides and run
  > python task2.py
- To perform Task 3, go to the directory where task3.py resides and run
  > python task3.py
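
The Task 1 prompt can be read as two independent switches, one for case-folding and one for punctuation handling. The following is a minimal sketch of that mapping, not the code in task1.py: the option meanings come from the Design Choices section of this README, while the names OPTIONS and parse_choice are hypothetical and exist only for illustration.

```python
# Hypothetical sketch of the Task 1 menu. Each option maps to a pair
# (case_fold, depunctuate). The option meanings follow this README;
# the names used here are assumptions, not taken from task1.py.
OPTIONS = {
    "1": (True, True),    # case-folding and punctuation handling
    "2": (True, False),   # case-folding only
    "3": (False, True),   # punctuation handling only
    "4": (False, False),  # neither
}


def parse_choice(choice):
    """Return (case_fold, depunctuate), or None if the choice is invalid."""
    return OPTIONS.get(choice.strip())


# Small demonstration with two valid choices and one invalid choice.
for choice in ["1", "3", "7"]:
    flags = parse_choice(choice)
    print(choice, "->", "Invalid input" if flags is None else flags)
```

Keeping the two switches separate means the tokenizer itself only has to check two booleans, whichever option number the user typed.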
Results:
- The results for Task 1 are generated in the "Tokenization_Outputs" directory
- The results for Task 2 are generated in the "Indexing_Outputs" directory
- The results for Task 3 are generated in the "Frequency_Tables" directory
- The file "Global_statistics.txt" is generated in Task 2

DESIGN CHOICES:

General:
- Files are named after their URLs. Some URLs contain '/', which is invalid in a file name, so each '/' in the URL has been replaced by '_'
  Eg. For saving the raw HTML of "https://en.wikipedia.org/wiki/C/2011_W3_(Lovejoy)" the name given to the file is "C_2011_W3_(Lovejoy)"

Task 1:
- When running Task 1, the user chooses the depunctuation and case-folding behaviour by entering 1, 2, 3 or 4, where:
  Enter 1 if you want to perform both case-folding and punctuation handling
  Enter 2 if you want to perform just case-folding
  Enter 3 if you want to perform just punctuation handling
  Enter 4 if you don't want to perform case-folding or punctuation handling
  Any other input shows the message "Invalid input"
- Special symbols are represented differently in UTF-8 (for example as '\xe2'), so they are removed
- If a token starts with '$' '+' '.' or '-', that punctuation is not stripped
- If a token ends with '+' or '%', that punctuation is not stripped
- If a token contains only numbers and punctuation, the following rules apply even when depunctuation is requested:
  1) '.' and '-' within the number are always preserved
     Eg. 245.56 27-12-1994
  2) If the number starts with '$' '+' '.' or '-' and ends with '+' or '%', both the start and end punctuation are preserved; any other leading/trailing punctuation is replaced by a space
     Eg. +50% $250+
  3) If the number starts with '$' '+' '.' or '-' but does not end with '+' or '%', only the start punctuation is preserved; any other leading/trailing punctuation is replaced by a space
     Eg. $500 -20.5
  4) If the number does not start with '$' '+' '.' or '-' but ends with '+' or '%', only the end punctuation is preserved; any other leading/trailing punctuation is replaced by a space
     Eg. 25% 100+

Task 2:
- Each entry in an output file has the form:
  term -> (docID1,tf1) (docID2,tf2) .... (docIDn,tfn)
  where tf1 is the term frequency of the term in docID1
  Eg. enumerating -> (Hurricane_Janet,1) (New_York_City,2)
- Besides the output files "Unigrams.txt", "Bigrams.txt" and "Trigrams.txt", Task 2 also generates "Trigrams_Document_Frequency_Table.txt", the document frequency table of trigrams (a Task 3 deliverable), for the following reason:
  - The trigram document frequency table cannot be held in a dictionary because of the large number of trigrams and their lists of Document IDs (attempting it raised a MemoryError). So this table is generated at runtime by writing each entry to the file while traversing the list of trigrams, without storing it
  - Format: Term -> List of Document ID(s) -> Document Frequency
    Eg. officials opened shelters -> ['Hurricane_Ingrid', 'Hurricane_Matthew'] -> 2
- "Global_statistics.txt" gives the total number of unigrams, bigrams and trigrams

Task 3:
- Format for the Term Frequency Table: Term -> Sum of all Term Frequencies
  Eg. august -> 7260
      hurricane center -> 2780
      of use and -> 1001
- Format for the Document Frequency Table: Term -> List of Document ID(s) -> Document Frequency
  Eg. disarmed -> ['Mexico', 'World_War_II'] -> 2
      $10,000 in -> ['Hurricane_Georges', 'Lewes,_Delaware'] -> 2
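
To make the Task 2 and Task 3 formats concrete, here is a minimal sketch of how an n-gram inverted index and the two frequency tables could be derived from tokenized documents. It is an illustration under assumptions, not the code in task2.py or task3.py: all function names are hypothetical, the two toy documents "Doc_A" and "Doc_B" are invented, and everything is held in memory (which, as noted under Task 2, did not work for the trigram document frequency table on the full corpus).

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(docs, n):
    """docs: {doc_id: [token, ...]}  ->  {term: {doc_id: term_frequency}}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(ngrams(tokens, n)).items():
            index[term][doc_id] = tf
    return index


def term_frequency_table(index):
    """Sum each term's frequencies over all documents (Task 3 TF table)."""
    return {term: sum(postings.values()) for term, postings in index.items()}


def document_frequency_table(index):
    """Map each term to (sorted doc IDs, document frequency) (Task 3 DF table)."""
    return {term: (sorted(postings), len(postings))
            for term, postings in index.items()}


def format_posting_line(term, postings):
    """Render one index entry as: term -> (docID1,tf1) (docID2,tf2) ..."""
    pairs = " ".join("(%s,%d)" % (doc_id, tf)
                     for doc_id, tf in sorted(postings.items()))
    return "%s -> %s" % (term, pairs)


def format_df_line(term, doc_ids, df):
    """Render one DF entry as: term -> [doc IDs] -> document frequency."""
    return "%s -> %s -> %d" % (term, doc_ids, df)


# Invented toy corpus, used only to exercise the functions above.
docs = {
    "Doc_A": ["storm", "warning", "storm"],
    "Doc_B": ["storm", "warning"],
}
unigram_index = build_index(docs, 1)
print(format_posting_line("storm", unigram_index["storm"]))
doc_ids, df = document_frequency_table(unigram_index)["storm"]
print(format_df_line("storm", doc_ids, df))
```

For a table too large to hold in memory, the same format_df_line call could be made once per term while iterating and written straight to the output file, which matches the write-as-you-go approach described under Task 2.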
Implementing my own inverted indexer, text processing and generating corpus statistics