Script to collect, scrape, and clean sentences in Chhattisgarhi for the project Speech Recognition in Agriculture and Finance for the Poor in India
This script uses the Google Sheets API to fetch data from a Google Sheet that lists links to sites with Chhattisgarhi text in the domains of Agriculture and Finance, and then does the following:
- Identifies duplicate links
- Generates all the links from the sitemaps of popular Chhattisgarhi news portals and gets a list of all links that are not in the sheet (Note: Not all links are useful for us)
- Scrapes and extracts useful text from the links
- Optionally, queries all the links to find those containing a particular substring
- Cleans the extracted text and tokenizes it into words to form a vocabulary of Chhattisgarhi words
- Stores the clean sentences to form our Chhattisgarhi corpus
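
The link-handling and cleaning steps above can be sketched roughly as follows. This is a minimal illustration and not the code in the notebook: every function name here is hypothetical, the sample links and text are made up, and the cleaning rule (keep only Devanagari characters, split sentences on the danda and common end punctuation) is an assumption about what "clean" means for this corpus.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Standard sitemap namespace used by <urlset>/<loc> elements.
SITEMAP_LOC = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"
# Assumption: sentences end at a danda, double danda, '?', '!' or '.'.
SENTENCE_END = re.compile(r"[\u0964\u0965?!.]")
# Assumption: anything outside the Devanagari block (U+0900-U+097F) is noise.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\s]")


def find_duplicates(links):
    """Return links that appear more than once, ignoring a trailing slash."""
    counts = Counter(link.strip().rstrip("/") for link in links)
    return sorted(link for link, n in counts.items() if n > 1)


def links_from_sitemap(xml_text):
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_LOC) if loc.text]


def links_not_in_sheet(sitemap_links, sheet_links):
    """Return sitemap links that the sheet does not list yet."""
    return sorted(set(sitemap_links) - set(sheet_links))


def filter_links(links, substring):
    """Keep only the links that contain the given substring."""
    return [link for link in links if substring in link]


def clean_sentences(text):
    """Split text into sentences and drop non-Devanagari characters."""
    sentences = []
    for raw in SENTENCE_END.split(text):
        cleaned = " ".join(NON_DEVANAGARI.sub(" ", raw).split())
        if cleaned:
            sentences.append(cleaned)
    return sentences


def build_vocabulary(sentences):
    """Tokenize sentences on whitespace and return the unique words."""
    return sorted({word for sentence in sentences for word in sentence.split()})
```

For example, `clean_sentences("किसान खेत म हे। Bank 123 पइसा!")` drops the Latin letters and ASCII digits and yields two sentences, which `build_vocabulary` then turns into a sorted list of unique words.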
Create a `.env` file and add a `SPREADSHEET_ID` field with the unique ID of your Google Sheet. Then run `scrape.ipynb`.
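
The sheet ID is the part of the Google Sheet URL between `/d/` and `/edit`. As a rough sketch of how the setting could be read, assuming the file holds plain `KEY=VALUE` lines such as `SPREADSHEET_ID=your-sheet-id`: the helper names below are hypothetical, and the notebook may instead use a library such as `python-dotenv`.

```python
def read_env(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Strip optional surrounding quotes from the value.
            values[key.strip()] = value.strip().strip("'\"")
    return values


def get_spreadsheet_id(path=".env"):
    """Return SPREADSHEET_ID from the env file, or fail with a clear error."""
    spreadsheet_id = read_env(path).get("SPREADSHEET_ID")
    if not spreadsheet_id:
        raise KeyError("SPREADSHEET_ID is missing from " + path)
    return spreadsheet_id
```

Failing early with a clear error when the field is absent avoids a less obvious failure later, when the Sheets request is made.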