vocabulary-scraper

Purpose

I wrote this app to create a large database of vocabulary words for my (upcoming) German language learning website, dasVocab.com.

Problem

I started with a text file of the 40,000 most frequent German words from the Institute for the German Language. The file is sorted in descending order of usage frequency and contains only the base form of each word. I needed the words grouped by part of speech and saved into separate tables accordingly.

How it Works

The app first prompts the user for the number of words to scrape. Then it parses the vocab file and starts to scrape the German Wiktionary page for each word. The part of speech is read from the page, and if the word is a noun or a verb, then additional information is gleaned from the HTML as detailed below:

Verbs (all principal parts)
- Infinitive
- 3rd person present conjugation
- Simple past conjugation
- Past participle
- Helping verb (for past tense)
Nouns
- Gender
- Singular form
- Plural form
Adjectives
- Root form only
Adverbs
- Root form only
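
The per-part-of-speech fields listed above could be represented roughly as follows. This is only a sketch of the idea, not the code in scraper.js: the field names (`thirdPersonPresent`, `helpingVerb`, and so on) and the `buildRecord` helper are hypothetical, since the actual column names are not given here.

```javascript
// Hypothetical field names for each part of speech; the real table
// columns are not documented in this README.
const FIELDS_BY_PART_OF_SPEECH = new Map([
  ["verb", ["infinitive", "thirdPersonPresent", "simplePast", "pastParticiple", "helpingVerb"]],
  ["noun", ["gender", "singular", "plural"]],
  ["adjective", ["root"]],
  ["adverb", ["root"]],
]);

// Build a record that contains only the fields listed for the given
// part of speech. Missing values become null so that a later validation
// step can detect them. Returns null for an unlisted part of speech.
function buildRecord(partOfSpeech, scraped) {
  const fields = FIELDS_BY_PART_OF_SPEECH.get(partOfSpeech);
  if (!fields) return null;

  const record = { partOfSpeech };
  for (const field of fields) {
    record[field] = scraped[field] ?? null;
  }
  return record;
}

// Example with hand-written values (not scraped output).
const exampleVerb = buildRecord("verb", {
  infinitive: "gehen",
  thirdPersonPresent: "geht",
  simplePast: "ging",
  pastParticiple: "gegangen",
  helpingVerb: "sein",
});
console.log(exampleVerb);
```

Keeping the field lists in one place means the scraping, validation, and saving steps can all refer to the same definition of what a complete verb or noun looks like.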

A vocabulary set number is also assigned to each word based on its location in the list (and therefore its importance). The app then validates all the scraped data and saves it in the appropriate MySQL table.
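
As an illustration of how a set number can follow from a word's position in the list, here is a minimal sketch. It assumes fixed-size sets of 100 words; that size, the constant `WORDS_PER_SET`, and the function name `vocabularySetNumber` are assumptions made for the example, not details taken from the app.

```javascript
// Hypothetical set size; the README does not say how many words make up one set.
const WORDS_PER_SET = 100;

// rank is the 1-based position of the word in the frequency-sorted list,
// so lower ranks (more frequent words) land in lower-numbered sets.
function vocabularySetNumber(rank, wordsPerSet = WORDS_PER_SET) {
  if (!Number.isInteger(rank) || rank < 1) {
    throw new RangeError("rank must be a positive integer");
  }
  return Math.ceil(rank / wordsPerSet);
}

console.log(vocabularySetNumber(1));   // 1
console.log(vocabularySetNumber(100)); // 1
console.log(vocabularySetNumber(101)); // 2
```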

A live console feed shows the user what is being scraped and where it is being saved. Any entry that does not pass validation is automatically saved to a "skipped" table for later review.
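
The validate-then-route step might look something like the sketch below. The validation rule used here (every expected field is a non-empty string), the table names other than "skipped", and the helper `chooseTable` are all assumptions for illustration; the app's real checks and schema may differ, and the database write itself is left out.

```javascript
// Hypothetical required fields and table names per part of speech.
const REQUIRED_FIELDS = new Map([
  ["verb", ["infinitive", "thirdPersonPresent", "simplePast", "pastParticiple", "helpingVerb"]],
  ["noun", ["gender", "singular", "plural"]],
  ["adjective", ["root"]],
  ["adverb", ["root"]],
]);
const TABLES = new Map([
  ["verb", "verbs"],
  ["noun", "nouns"],
  ["adjective", "adjectives"],
  ["adverb", "adverbs"],
]);

// Return the name of the table a record should be saved to.
// Records with an unknown part of speech, or with any required field
// missing or blank, are routed to the "skipped" table.
function chooseTable(record) {
  const required = REQUIRED_FIELDS.get(record.partOfSpeech);
  if (!required) return "skipped";

  const isComplete = required.every(
    (field) => typeof record[field] === "string" && record[field].trim() !== ""
  );
  return isComplete ? TABLES.get(record.partOfSpeech) : "skipped";
}

// Simple console feed: one line per word showing its destination table.
const records = [
  { word: "Haus", partOfSpeech: "noun", gender: "n", singular: "Haus", plural: "Häuser" },
  { word: "Milch", partOfSpeech: "noun", gender: "f", singular: "Milch", plural: "" },
];
for (const record of records) {
  console.log(`${record.word} -> ${chooseTable(record)}`);
}
```

Routing failures to a separate table rather than discarding them keeps the run going while preserving every problem entry for manual review.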

Voila! Thousands of words with appropriate data.
