Introduction

A data challenge: scrape Wikipedia articles from a category such as medicine or physics, and recommend related articles.

A Google Slides deck describing the project is here

Please see how much of the project you can complete in roughly 3 hours of work. Please plan to send your results to us within 72 hours, and don’t hesitate to reach out to us if you have any questions. If anything is unclear, email us to ask rather than guessing.

Task

Scrape 500 pages from a specific Wikipedia category, for example https://en.wikipedia.org/wiki/Category:Medicine
Build a graph where the nodes represent pages and the edges represent the semantic distance between those pages. Please keep just the 10 nearest neighbors for each node.
Create a visualization of your results.
Please share your work with us (e.g. Jupyter notebook, etc.) with a brief write-up of your results.
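
The graph-building step can be sketched as follows. This is a minimal, dependency-free illustration and not necessarily the method used in the notebooks: it assumes TF-IDF vectors with cosine distance as the "semantic distance", and the function names (tfidf_vectors, cosine_distance, knn_graph) are hypothetical. A library such as scikit-learn offers equivalent vectorizers and nearest-neighbor search.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so a term found in every document keeps a nonzero weight.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two L2-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_graph(titles, docs, k=10):
    """Map each title to its k nearest neighbors as (title, distance) pairs."""
    vectors = tfidf_vectors(docs)
    graph = {}
    for i, title in enumerate(titles):
        dists = [
            (cosine_distance(vectors[i], vectors[j]), titles[j])
            for j in range(len(titles))
            if j != i
        ]
        dists.sort()  # smallest distance first; ties broken by title
        graph[title] = [(other, d) for d, other in dists[:k]]
    return graph
```

Calling knn_graph(titles, docs) with the default k=10 keeps the 10 nearest neighbors per page, as the task requires. The pairwise loop is quadratic in the number of pages, which is acceptable for 500 pages.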

Results

The data folder contains the raw data scraped from Wikipedia's Physics category.
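
The scraper itself is not shown in this README. As a hedged sketch, page titles in a category could be listed through the MediaWiki API (action=query, list=categorymembers) using only the standard library. The helper names and the User-Agent string below are hypothetical, and the notebooks may scrape the HTML pages instead.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def category_query_url(category, limit=500, cmcontinue=None):
    """Build a categorymembers query URL for one batch of pages."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page",  # skip subcategories and files
        "cmlimit": str(limit),
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue  # continuation token from the previous batch
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_members(payload):
    """Extract page titles and the continuation token (or None) from a response."""
    members = payload.get("query", {}).get("categorymembers", [])
    titles = [m["title"] for m in members]
    next_token = payload.get("continue", {}).get("cmcontinue")
    return titles, next_token


def fetch_category_titles(category, wanted=500):
    """Collect up to `wanted` page titles that sit directly in the category."""
    titles, token = [], None
    while len(titles) < wanted:
        url = category_query_url(category, cmcontinue=token)
        # Hypothetical User-Agent; replace with one that identifies your client.
        req = urllib.request.Request(url, headers={"User-Agent": "wiki-recommend-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            batch, token = parse_members(json.load(resp))
        titles.extend(batch)
        if token is None:
            break  # no more pages in this category
    return titles[:wanted]
```

A call such as fetch_category_titles("Physics") returns only pages that sit directly in the category. If that yields fewer than 500 titles, subcategories would have to be traversed as well, which this sketch does not do.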

The nlp_clean folder contains the data after cleaning, and the notebook for visualizing the text data.

The models/topic_modeling folder contains the notebook that compares the performance of different models and vectorizers.

The graph_vis folder contains the notebook to generate the network graph for the articles.
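
As an illustration only, a nearest-neighbor graph can be exported to Graphviz DOT text for rendering. The sketch below assumes the graph is a dict mapping each title to a list of (neighbor, distance) pairs. The to_dot helper is hypothetical, and the notebook in graph_vis may use a different plotting library.

```python
def to_dot(graph, name="wiki_knn"):
    """Serialize a neighbor graph as undirected Graphviz DOT text."""

    def quote(label):
        # Escape backslashes and double quotes so titles are valid DOT strings.
        return '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = [f"graph {name} {{"]
    seen = set()
    for src, neighbors in graph.items():
        lines.append(f"  {quote(src)};")
        for dst, dist in neighbors:
            key = tuple(sorted((src, dst)))
            if key in seen:
                continue  # emit each undirected edge only once
            seen.add(key)
            # `len` is the preferred edge length used by the neato and fdp layouts.
            lines.append(f"  {quote(src)} -- {quote(dst)} [len={dist:.3f}];")
    lines.append("}")
    return "\n".join(lines)
```

Writing the returned text to a file such as graph.dot and running `neato -Tpng graph.dot -o graph.png` would draw pages with a smaller distance closer together.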


edwardcooper/wiki_recommend
