GitHub - parshva45/Inverted-Indexer: Implementing my own inverted indexer, text processing and generating corpus statistics




- The deliverables for the three tasks of IR HW3 are:

1) Three python files task1.py, task2.py, task3.py
   performing each of the tasks 1, 2, 3 respectively

2) A directory named "Raw_HTML_Downloads" which contains 1000 files, each
   containing the raw html code of one of the 1000 URLs crawled in IR HW1 Task1
      (Input for HW3 Task 1)

3) A directory named "Tokenization_Outputs" which contains 1000 files, the
   output files of Task 1

4) A directory named "Indexing_Outputs" which contains 3 files:
   - Unigrams.txt - contains every unigram from the 1000 output
                    files in "Tokenization_Outputs" folder along with the Document IDs
                    of the documents in which it was found and its
                    term frequency in each of those documents.

   - Bigrams.txt  - contains every bigram from the 1000 output
                    files in "Tokenization_Outputs" folder along with the Document IDs
                    of the documents in which it was found and its
                    term frequency in each of those documents.

   - Trigrams.txt - contains every trigram from the 1000 output
                    files in "Tokenization_Outputs" folder along with the Document IDs
                    of the documents in which it was found and its
                    term frequency in each of those documents.

5) A directory named "Frequency_Tables" which contains 6 files:
   - Unigrams_Term_Frequency_Table.txt
     - contains unigrams and corresponding total of term frequencies of each unigram
       using Task 2 output

   - Unigrams_Document_Frequency_Table.txt
     - contains unigrams, list of Document IDs of the documents in which it was found
       and its corresponding document frequency

   - Bigrams_Term_Frequency_Table.txt
     - contains bigrams and corresponding total of term frequencies of each bigram
       using Task 2 output

   - Bigrams_Document_Frequency_Table.txt
     - contains bigrams, list of Document IDs of the documents in which it was found
       and its corresponding document frequency

   - Trigrams_Term_Frequency_Table.txt
     - contains trigrams and corresponding total of term frequencies of each trigram
       using Task 2 output

   - Trigrams_Document_Frequency_Table.txt
     - contains trigrams, list of Document IDs of the documents in which it was found
       and its corresponding document frequency

6) A file "Global_statistics.txt" containing total number of unigrams, bigrams and trigrams

7) A file "Stop_Words.txt" containing a list of stop words with appropriate explanation

Setup :

- You should have a Python programming environment set up on your machine.


Run the code:

- Run from your Terminal or Command Prompt.

- For performing Task 1
  Go to the directory where task1.py resides and use the command
  > python task1.py
  to run.

  Type 1, 2, 3 or 4 depending on whether you wish to perform depunctuation and/or case-folding
  and press Enter

- For performing Task 2
  Go to the directory where task2.py resides and use the command
  > python task2.py
  to run.

- For performing Task 3
  Go to the directory where task3.py resides and use the command
  > python task3.py
  to run.

Results:

The results for Task 1 are generated in "Tokenization_Outputs" directory
The results for Task 2 are generated in "Indexing_Outputs" directory
The results for Task 3 are generated in "Frequency_Tables" directory
The file "Global_statistics.txt" is generated in Task 2

DESIGN CHOICES:

General:

- While naming files after their URLs, some URLs contained '/', which is invalid
  in a file name. So, each '/' in the URL has been replaced by '_'
  Eg. For saving the raw html of "https://en.wikipedia.org/wiki/C/2011_W3_(Lovejoy)"
      the name given to the file is "C_2011_W3_(Lovejoy)"
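
  A minimal sketch of this renaming, for illustration only. The helper name
  url_to_filename and the handling of the "https://en.wikipedia.org/wiki/"
  prefix are assumptions inferred from the example above, not code taken from
  the task files.

```python
# Hypothetical helper: the name and the prefix handling are assumptions.
WIKI_PREFIX = "https://en.wikipedia.org/wiki/"


def url_to_filename(url):
    """Return a file name for a crawled URL, with every '/' replaced by '_'."""
    # Keep only the article title that follows the assumed prefix.
    title = url[len(WIKI_PREFIX):] if url.startswith(WIKI_PREFIX) else url
    # '/' is a path separator, so it cannot appear inside a file name.
    return title.replace("/", "_")


print(url_to_filename("https://en.wikipedia.org/wiki/C/2011_W3_(Lovejoy)"))
# prints: C_2011_W3_(Lovejoy)
```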

Task 1:

- When Task 1 is run, the user chooses whether to apply depunctuation and case-folding
  by entering 1, 2, 3 or 4, where:

  Enter 1 if you want to perform both case-folding and punctuation handling
  Enter 2 if you want to perform just case-folding
  Enter 3 if you want to perform just punctuation handling
  Enter 4 if you don't want to perform case-folding or punctuation handling

  Any other input produces the message "Invalid input"
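
  The menu above could be mapped to two flags roughly as follows. This is a
  sketch only: OPTIONS and parse_option are hypothetical names, and task1.py
  may handle the choice differently.

```python
# Hypothetical mapping from the menu option to (case_fold, depunctuate).
OPTIONS = {
    "1": (True, True),    # both case-folding and punctuation handling
    "2": (True, False),   # just case-folding
    "3": (False, True),   # just punctuation handling
    "4": (False, False),  # neither
}


def parse_option(choice):
    """Return (case_fold, depunctuate), or None if the choice is invalid."""
    return OPTIONS.get(choice.strip())


for choice in ("1", "3", "7"):
    flags = parse_option(choice)
    print(choice, flags if flags is not None else "Invalid input")
```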

- As special symbols are encoded differently in UTF-8, for eg. as '\xe2',
  they are removed

- If the token starts with '$' '+' '.' or '-' the respective punctuation is not stripped

- If the token ends with '+' or '%' the respective punctuation is not stripped

- If a token contains only numbers and punctuation, the following rules
  apply even when depunctuation is to be done:

  1) '.' and '-' within the numbers are always preserved
     Eg. 245.56 27-12-1994
  2) If the number starts with '$' '+' '.' or '-' and ends with '+' or '%'
     both the start and end punctuation are preserved; any other start/end punctuation is replaced by a space
     Eg. +50% $250+
  3) If the number just starts with '$' '+' '.' or '-' and does not end with '+' or '%'
     only the start punctuation is preserved; any other start/end punctuation is replaced by a space
     Eg. $500 -20.5
  4) If the number does not start with '$' '+' '.' or '-' and just ends with '+' or '%'
     only the end punctuation is preserved; any other start/end punctuation is replaced by a space
     Eg. 25% 100+
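
  The four rules above can be sketched as below. This is an illustrative
  approximation, not the code in task1.py: clean_numeric_token is a
  hypothetical name, the sketch drops the unwanted leading/trailing
  punctuation instead of replacing it by a space (the same result once the
  text is split on whitespace), and it leaves all punctuation between the
  first and last digit untouched, which is an assumption for characters other
  than '.' and '-'.

```python
import string

KEEP_START = "$+.-"  # punctuation preserved directly before the number
KEEP_END = "+%"      # punctuation preserved directly after the number


def clean_numeric_token(token):
    """Trim outer punctuation from a token made of digits and punctuation."""
    is_numeric = all(ch.isdigit() or ch in string.punctuation for ch in token)
    digit_positions = [i for i, ch in enumerate(token) if ch.isdigit()]
    if not is_numeric or not digit_positions:
        return token  # not a number-with-punctuation token; leave as is
    first, last = digit_positions[0], digit_positions[-1]
    core = token[first:last + 1]  # interior '.' and '-' stay in place
    prefix = token[first - 1] if first > 0 and token[first - 1] in KEEP_START else ""
    suffix = token[last + 1] if last + 1 < len(token) and token[last + 1] in KEEP_END else ""
    return prefix + core + suffix


for tok in ["245.56", "27-12-1994", "+50%", "$250+", "$500", "-20.5", "25%", "(100+)"]:
    print(tok, "->", clean_numeric_token(tok))
```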

Task 2:

- Each entry in an output file has the form:
  term -> (docID1,tf1) (docID2,tf2) .... (docIDn,tfn)
  where tf1 is term frequency of term in docID1
  Eg. enumerating -> (Hurricane_Janet,1) (New_York_City,2)
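
  As an illustration of this entry format, the sketch below builds a small
  in-memory inverted index for n-grams and prints one entry. The function
  names and the two toy documents are made up for the example; task2.py
  reads the real tokens from "Tokenization_Outputs".

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(docs, n):
    """Map each n-gram to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(ngrams(tokens, n)).items():
            index[term][doc_id] = tf
    return index


def format_entry(term, postings):
    """Render one entry as: term -> (docID1,tf1) (docID2,tf2) ..."""
    pairs = " ".join("({},{})".format(d, tf) for d, tf in sorted(postings.items()))
    return "{} -> {}".format(term, pairs)


# Toy documents, invented for this example.
docs = {
    "Hurricane_Janet": ["enumerating", "storms"],
    "New_York_City": ["enumerating", "enumerating", "boroughs"],
}
index = build_index(docs, 1)
print(format_entry("enumerating", index["enumerating"]))
# prints: enumerating -> (Hurricane_Janet,1) (New_York_City,2)
```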

- Along with the Task 2 output files "Unigrams.txt", "Bigrams.txt" and "Trigrams.txt",
  "Trigrams_Document_Frequency_Table.txt", which contains the document frequency table of trigrams (a Task 3 deliverable),
   is also generated in Task 2 for the following reason:
  - The trigrams document frequency table cannot be stored in a dictionary because of the high number
    of trigrams and their lists of document IDs (attempting it raised a MemoryError). So, this table is
    generated at runtime by writing each entry to the file while traversing the list of trigrams, without storing it
  - Format : Term -> List of Document ID(s) -> Document Frequency
    Eg. officials opened shelters -> ['Hurricane_Ingrid', 'Hurricane_Matthew'] -> 2
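
  One way to write such a table without holding it in memory is sketched
  below: each row is written as soon as its term is visited, so no second
  dictionary is built. write_df_table is a hypothetical name, and the sketch
  assumes postings are available per term as {doc_id: tf}; the actual
  traversal in task2.py may differ.

```python
import io


def write_df_table(index_items, out):
    """Write one 'term -> doc IDs -> document frequency' row per term."""
    for term, postings in index_items:
        doc_ids = sorted(postings)
        # The row goes straight to the file; nothing is accumulated.
        out.write("{} -> {} -> {}\n".format(term, doc_ids, len(doc_ids)))


# Toy usage with an in-memory buffer standing in for the output file.
buf = io.StringIO()
write_df_table(
    [("officials opened shelters", {"Hurricane_Matthew": 1, "Hurricane_Ingrid": 1})],
    buf,
)
print(buf.getvalue(), end="")
```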

- "Global_statistics.txt" mentions total number of unigrams, bigrams, trigrams

Task 3:

- Format for Term Frequency Table:
  Term -> Sum of all Term frequencies
  Eg. august -> 7260
      hurricane center -> 2780
      of use and -> 1001

- Format for Document Frequency Table:
  Term -> List of Document ID(s) -> Document Frequency
  Eg. disarmed -> ['Mexico', 'World_War_II'] -> 2
      $10,000 in -> ['Hurricane_Georges', 'Lewes,_Delaware'] -> 2
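
  Both table formats can be derived from a single Task 2 entry, as the sketch
  below illustrates. The function names are hypothetical, and the parser
  assumes document IDs contain no spaces (they may contain ',' or
  parentheses). The sample line reuses the Task 2 example entry.

```python
def parse_index_line(line):
    """Parse 'term -> (docID1,tf1) (docID2,tf2)' into (term, postings)."""
    term, _, rest = line.rstrip("\n").rpartition(" -> ")
    postings = []
    for item in rest.split():
        # Split on the last ',' so document IDs containing ',' stay intact.
        doc_id, tf = item[1:-1].rsplit(",", 1)
        postings.append((doc_id, int(tf)))
    return term, postings


def tf_row(term, postings):
    """Term -> Sum of all Term frequencies"""
    return "{} -> {}".format(term, sum(tf for _, tf in postings))


def df_row(term, postings):
    """Term -> List of Document ID(s) -> Document Frequency"""
    doc_ids = [doc_id for doc_id, _ in postings]
    return "{} -> {} -> {}".format(term, doc_ids, len(doc_ids))


term, postings = parse_index_line("enumerating -> (Hurricane_Janet,1) (New_York_City,2)")
print(tf_row(term, postings))  # prints: enumerating -> 3
print(df_row(term, postings))  # prints: enumerating -> ['Hurricane_Janet', 'New_York_City'] -> 2
```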