Wiki Alternate Names

Usage example


python main.py --lang eng --wiki_dump ./data/enwiki-test.xml.bz2 --index_file ./data/enwiki-test.json --num_workers 4

Alternate names for Wiki pages

Problem Description

Articles on Wikipedia can, by design, have only one name. Sometimes, though, people, cities, or even countries are, or have been, known by alternate names. Often these names come from a historical context (e.g. the city now called "Gdańsk" can be referred to as "Danzig", Source). People often know these things by their outdated names, and searching for those names on the internet (or on Wikipedia) may not yield sufficiently relevant results.

Other cases that may result in alternate names for the same article (as presented by Wikipedia) are:

Non-English titles (for the English wiki) - used in articles where they are relevant, although not encouraged
Alternate names contained within the first sentence of an article
Biographical - multiple names for the same person. Depending on the context, the person is addressed by a different name

Motivation

By unifying these names, we make it easier for anyone to reach the same information regardless of the query term.

Furthermore, by indexing main article titles together with all their alternate titles, we make the search performed by Wikipedia (and perhaps the whole of Wikipedia) faster, because fewer requests are needed to reach the intended article (the redirect requests are skipped). This may give users a better experience thanks to shorter article load times.

Final Solution

The final solution is only a slight modification of the proposed solution. The main difference is a post-processing step at the end, which ensures the final index maps each title to its alternate titles in a {title: [multiple subtitles]} way and writes the resulting index as a JSON file. The final architecture is shown in Picture 3, and the parts of the system are described below.

Stream Processor

The Stream Processor reads bzip2-compressed Wikipedia multistream dump files, opened with BZ2File from Python's bz2 module. This lets them be processed like any other stream, so they can serve as input to another library, lxml.etree. lxml.etree is used to process the stream and build "mini-trees", each made of just one page and deleted once it has been processed. The Stream Processor extracts the title and text of each wiki page and enqueues them for later processing by the workers.
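The streaming idea above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it swaps in the standard-library xml.etree.ElementTree.iterparse for lxml.etree so that it runs without extra dependencies, and the names iter_pages, local_name, and SAMPLE are hypothetical.

```python
import bz2
import io
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)


def local_name(tag):
    """Strip an XML namespace such as '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for every <page> in a bzip2-compressed XML stream."""
    with bz2.BZ2File(stream) as xml_stream:
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = None, ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # drop the finished "mini-tree" so memory stays bounded


# Tiny hand-written sample standing in for a real dump (hypothetical data).
SAMPLE = """<mediawiki>
  <page><title>Danzig</title><revision><text>#REDIRECT [[Gdańsk]]</text></revision></page>
  <page><title>Gdańsk</title><revision><text>Gdańsk is a city.</text></revision></page>
</mediawiki>""".encode("utf-8")

pages = list(iter_pages(io.BytesIO(bz2.compress(SAMPLE))))
print(pages)
```

In a real run the (title, text) pairs would be put on a queue for the workers instead of being collected into a list.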

Worker

The general flow of a worker is shown in an image below:

The workers get items (title and text) from the queue and use regex patterns to extract the alternate titles. The pattern used to find a "#redirect" clause in the text is as follows:

(#REDIRECT|#redirect){1}\ ?[{2}[A-Za-zá-žÁ-Ž[A-Za-zá-žÁ-Ž$&+,:;=?@#|'"<>.^()%!]-]]{2}

It matches the word "#redirect" followed by brackets containing the alt. title.
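As an illustration of this kind of match, here is a deliberately simplified sketch. The pattern below is an assumption written for this example, not the project's pattern shown above, and redirect_target is a hypothetical helper name.

```python
import re

# Simplified, case-insensitive stand-in for the redirect pattern (assumption).
# It captures the text inside [[...]] up to a closing bracket, "|" or "#".
REDIRECT_RE = re.compile(r"#redirect\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(text):
    """Return the title inside the brackets of a redirect clause, or None."""
    match = REDIRECT_RE.match(text.lstrip())
    return match.group(1).strip() if match else None


print(redirect_target("#REDIRECT [[Gdańsk]]"))
print(redirect_target("Gdańsk is a city."))
```

Using re.IGNORECASE covers mixed-case spellings such as "#Redirect", which an explicit (#REDIRECT|#redirect) alternation would not.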

The infobox is found using the following pattern: {{2} ?(Infobox){1}[\s\S]*}{2}

and data is later extracted using

or

depending on the language used.
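Since the extraction patterns themselves are not reproduced here, the following is only a rough sketch of how alternate names might be pulled out of infobox fields. The field names in ALT_NAME_FIELDS, the helper infobox_alt_names, and the sample infobox are all assumptions made for illustration; a per-language setup would presumably swap in a different field list.

```python
import re

# Hypothetical field names; the fields the project actually reads are not
# listed in this README.
ALT_NAME_FIELDS = ("other_name", "native_name", "birth_name")

# One "| field = value" entry per line; [ \t] keeps the match on a single line.
FIELD_RE = re.compile(
    r"^[ \t]*\|[ \t]*(" + "|".join(ALT_NAME_FIELDS) + r")[ \t]*=[ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def infobox_alt_names(infobox_text):
    """Return the values of the alternate-name fields found in an infobox."""
    return [value for _field, value in FIELD_RE.findall(infobox_text)]


SAMPLE_INFOBOX = """{{Infobox settlement
| name = Gdańsk
| other_name = Danzig
| country = Poland
}}"""

names = infobox_alt_names(SAMPLE_INFOBOX)
print(names)
```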

After parsing the text, the extracted alternate titles are put into a queue together with the original titles.
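The worker flow described in this section can be sketched as below. This is a minimal sketch under stated assumptions: it uses threads and queue.Queue for brevity (the project may use a different form of parallelism), a simplified redirect pattern, a hypothetical STOP sentinel to shut workers down, and it only forwards pages for which something was extracted.

```python
import queue
import re
import threading

REDIRECT_RE = re.compile(r"#redirect\s*\[\[([^\]|#]+)", re.IGNORECASE)  # simplified (assumption)
STOP = None  # hypothetical sentinel telling a worker to exit


def extract_alt_titles(text):
    """Return the names extracted from a page's text (redirects only here)."""
    match = REDIRECT_RE.match(text.lstrip())
    return [match.group(1).strip()] if match else []


def worker(pages, results):
    """Take (title, text) items, extract names, and enqueue them with the title."""
    while True:
        item = pages.get()
        if item is STOP:
            break
        title, text = item
        alt_titles = extract_alt_titles(text)
        if alt_titles:
            results.put((title, alt_titles))


pages, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(pages, results)) for _ in range(2)]
for thread in threads:
    thread.start()
for item in [("Danzig", "#REDIRECT [[Gdańsk]]"), ("Gdańsk", "Gdańsk is a city.")]:
    pages.put(item)
for _ in threads:
    pages.put(STOP)  # one sentinel per worker
for thread in threads:
    thread.join()

extracted = []
while not results.empty():
    extracted.append(results.get())
print(extracted)
```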

Index Writer

The index writer gets data from the queue. Each title and alt_title pair is then written to a file in an "alt_title:title\n" pattern.
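A minimal sketch of that line format, assuming the writer receives (alt_title, title) pairs; write_entries and the sample pairs are hypothetical, and an in-memory buffer stands in for the temporary index file.

```python
import io


def write_entries(entries, out):
    """Write each (alt_title, title) pair as one 'alt_title:title' line."""
    for alt_title, title in entries:
        out.write(f"{alt_title}:{title}\n")


buffer = io.StringIO()  # stands in for the temporary index file
write_entries([("Danzig", "Gdańsk"), ("Peking", "Beijing")], buffer)
index_text = buffer.getvalue()
print(index_text, end="")
```

One thing to keep in mind with this format is that titles containing a colon make a line ambiguous to split, so a reader of the file has to decide which colon is the separator.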

Index Post Processing

After all the pages are processed, the index is post-processed. The entries are read from the temporary index file, loaded into a dict, and then dumped as JSON into a JSON file. Parallelism is used here as well to speed up this process.
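The grouping step might look roughly like this. It is a sequential sketch (the parallel part is left out), build_index is a hypothetical name, and splitting on the first colon is an assumption that breaks for alternate titles that themselves contain a colon.

```python
import json
from collections import defaultdict


def build_index(lines):
    """Group 'alt_title:title' lines into {title: [alt_title, ...]}."""
    index = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        alt_title, title = line.split(":", 1)  # assumes no colon in alt_title
        if alt_title not in index[title]:
            index[title].append(alt_title)
    return dict(index)


lines = ["Danzig:Gdańsk\n", "Peking:Beijing\n", "Peiping:Beijing\n"]
index = build_index(lines)
index_json = json.dumps(index, ensure_ascii=False, indent=2)
print(index_json)
```

Passing ensure_ascii=False keeps titles such as "Gdańsk" readable in the output instead of escaping them.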
