This repository contains a crawler for the Open Science Framework website.
This crawler:
- automatically downloads information about registered research projects or preprints from the Open Science Framework website, either by crawling the website or by using the official API, and stores that information in a MongoDB database.
- cleans the downloaded text with the natural language processing library spaCy (for example, by removing stop words and lemmatizing the remaining words), then applies the LDA algorithm of the topic modelling framework gensim to determine which topics the downloaded research covers.
- outputs the most frequent tags and subjects, as well as the most frequent words used in the titles and descriptions, in the form of an Excel file, along with the topics found by gensim and the corresponding coherence score of the LDA algorithm.
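
The frequency counting described in the last step can be illustrated with a minimal sketch. It is not the crawler's actual code: it uses only the Python standard library, a small hard-coded stop-word set stands in for spaCy (and no lemmatization is performed), and the record layout, `STOP_WORDS`, `tokenize` and `most_frequent` are hypothetical names chosen for illustration.

```python
from collections import Counter
import re

# Hypothetical stand-in for spaCy's stop-word list (assumption, not the crawler's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def most_frequent(values, n=3):
    """Return the n most common items as (item, count) pairs."""
    return Counter(values).most_common(n)


# Hypothetical records; the documents stored in MongoDB may be shaped differently.
projects = [
    {"title": "Replication of a memory study", "tags": ["memory", "replication"]},
    {"title": "Memory and attention in the classroom", "tags": ["memory", "education"]},
]

# Count tags across all projects, and cleaned title words across all projects.
top_tags = most_frequent(tag for p in projects for tag in p["tags"])
top_words = most_frequent(w for p in projects for w in tokenize(p["title"]))

print(top_tags)
print(top_words)
```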

| Purpose | Name |
|---|---|
| Programming language | Python 3.10 |
| Version control system | Git |
| HTML parser | BeautifulSoup |
| Browser automation library | Pyppeteer |
| NLP library | spaCy |
| Output generator | OpenPyXL |
| Asynchronous framework | asyncio |
| Topic modelling framework | gensim |
| NoSQL database | MongoDB |

This "OSF Crawler" is published under the MIT licence, which can be found in the LICENSE file.
The "Open Science Framework" logo was taken from the University of Oklahoma Libraries website.