Tutorial: Zika Virus

This tutorial shows some basic and advanced text and data mining methods with the ContentMine toolchain. It explores the scientific publications on the Zika virus to get a better understanding of the research around it.

The main part focuses on gathering information about the Zika virus, the research done on it and the entities most relevant to it. A typical way to gather information about a pandemic is to look at:

a) the virus itself (Zika virus)
b) the virus-transmitting species (also known as disease vectors; in our case Aedes aegypti and Aedes albopictus)
c) drugs, clinical trials and locations mentioned in combination with the virus and the virus-transmitting species
d) similar pandemics/viruses (such as yellow fever or the Usutu virus, a flavivirus that is among the candidates for causing the next pandemic)

As data source, we use the Open Access publications available through the Europe PMC (EUPMC) API.

The approach should also serve as a blueprint for gathering information about pandemics in general and support future research when a new virus starts to spread.

ABOUT ZIKA

The Zika virus (ZIKV) is a member of the virus family Flaviviridae. It is spread by daytime-active Aedes mosquitos, such as Aedes aegypti and Aedes albopictus.

Additional information

The Spondweni virus is mentioned as phylogenetically very close to the Zika virus.

SETUP

All necessary software is listed in installation.md.

Additional requirements

Disk space: The downloaded data needs around 700 MB on your hard drive.

As preparation, we recommend having a look at the resources list in installation.md.

TUTORIAL

Download from Europe PMC

The first step is always to get the needed data from the APIs. For this, we use getpapers, the ContentMine tool for getting papers via different publisher APIs. In this tutorial, we will only use Open Access literature from Europe PMC. We can search within their database of 3.5 million full-text papers from the life sciences. About one million of these are Open Access. Please refer to Europe PMC-Data for details. This download will take less than 200 MB of disk space.

Get Zika publications

First, we want to set the variables for the query and for the folder we want the data to be stored in.

QUERY='zika'
FOLDER='zika'

Then, we have a look at how many results we find for the query term. For further information on how to create more complex queries for the EUPMC API, read here or here.

getpapers -q "$QUERY" -o "$FOLDER"

In this case, 1126 papers were found (as of 14.03.2017).

Then, we download the fulltext.xml file for each publication. For this, add the -x flag to the command. The results can then be viewed with the tree command.

getpapers -q "$QUERY" -o "$FOLDER" -x
tree zika
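
Besides inspecting the tree output, it can help to check how many paper folders actually received a fulltext.xml. The following is a minimal Python sketch, not part of the ContentMine toolchain. It assumes that getpapers creates one sub-folder per paper inside the output folder and stores the file there as fulltext.xml; the function name count_fulltexts is made up for this example.

```python
from pathlib import Path


def count_fulltexts(project_dir):
    """Return (paper_folders, folders_with_fulltext) for one output folder.

    Assumption: every sub-folder of project_dir holds one paper, and a
    successful -x download leaves a file named fulltext.xml in it.
    """
    root = Path(project_dir)
    if not root.is_dir():
        return 0, 0
    paper_dirs = [p for p in root.iterdir() if p.is_dir()]
    with_fulltext = [p for p in paper_dirs if (p / "fulltext.xml").is_file()]
    return len(paper_dirs), len(with_fulltext)


total, ok = count_fulltexts("zika")
print(f"zika: {ok} of {total} paper folders contain fulltext.xml")
```

The two numbers usually differ, because not every search result has an Open Access full text available.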

Get Aedes aegypti (virus-transmitting mosquito) publications

Next, we want to have a look at all publications on the virus-transmitting mosquito Aedes aegypti (Stegomyia aegypti is a synonym for Aedes aegypti).

QUERY='stegomyia aedes aegypti'
FOLDER='aedesaegypti'
getpapers -q "$QUERY" -o "$FOLDER" -x
tree aedesaegypti

231 publications are found.

Get Usutu virus publications

Finally, we want to have a look at the Usutu virus and download the related publications. This virus is one of several dozen known viruses that have been found to have a high epidemic potential. So maybe we can apply some lessons learned from the Zika virus to this corpus.

QUERY='usutu'
FOLDER='usutu'
getpapers -q "$QUERY" -o "$FOLDER" -x
tree usutu

193 publications are found.

We now have all the available data, which can be used for further analysis.

Optional: Get supplementary materials, if needed

If you want to use the available supplementary materials, here is the command to download them. Simply add -s to the command, and view the results as usual.

getpapers -q 'zika' -o zika -s
tree zika

Optional: Get the PDFs

To download all PDFs of the scientific papers, we add -p to request PDFs. Be aware that this will take around 3 GB of space on your hard drive.

QUERY='zika'
FOLDER='zika'
getpapers -q "$QUERY" -o "$FOLDER" -p
tree zika

Save raw data

At this point, we recommend saving the raw data, so you can return to this state later on. You can simply copy your three folders or compress them.
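
One possible way to compress the folders is sketched below in Python, using only the standard library. The backup folder name raw_backup and the function name backup_projects are assumptions made for this example; any copy or archive tool works just as well.

```python
import shutil
from pathlib import Path


def backup_projects(projects, backup_dir="raw_backup"):
    """Create <backup_dir>/<project>.tar.gz for every project folder that exists.

    Returns the list of archive paths that were written.
    """
    out = Path(backup_dir).resolve()
    archives = []
    for project in projects:
        src = Path(project).resolve()
        if not src.is_dir():
            print(f"skipping {project}: folder not found")
            continue
        out.mkdir(parents=True, exist_ok=True)
        # root_dir/base_dir keep the project folder name inside the archive.
        archive = shutil.make_archive(
            str(out / src.name), "gztar",
            root_dir=str(src.parent), base_dir=src.name,
        )
        archives.append(archive)
    return archives


# The three folder names are the ones used in the download steps above.
print(backup_projects(["zika", "aedesaegypti", "usutu"]))
```

Unpacking one of the archives restores the folder as it was right after the download, before norma or ami have added any files.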

Normalize the data

Before we can start with the extraction and analysis, we have to normalize the raw data and convert it to Scholarly HTML, so it is easier to process further on. For this, we use norma to convert the fulltext.xml files to scholarly.html files, and then view the results.

norma --project zika -i fulltext.xml -o scholarly.html --transform nlm2html
tree zika

norma --project aedesaegypti -i fulltext.xml -o scholarly.html --transform nlm2html
tree aedesaegypti

norma --project usutu -i fulltext.xml -o scholarly.html --transform nlm2html
tree usutu

Extract the needed facts

The prepared data can now be used to extract facts via ami's different plugins. Besides the metadata of the publications, these facts will be the main data source for the later analysis.

Extract Species

First, we use the species plugin to extract species names by genus, by binomial nomenclature and in the genussp form. The species are especially important because knowing them could give us a lead on Zika-transmitting animals, in our case Aedes mosquitos.

ami2-species --project zika -i scholarly.html --sp.species --sp.type genus
ami2-species --project zika -i scholarly.html --sp.species --sp.type binomial
ami2-species --project zika -i scholarly.html --sp.species --sp.type genussp
tree zika

ami2-species --project aedesaegypti -i scholarly.html --sp.species --sp.type genus
ami2-species --project aedesaegypti -i scholarly.html --sp.species --sp.type binomial
ami2-species --project aedesaegypti -i scholarly.html --sp.species --sp.type genussp
tree aedesaegypti

ami2-species --project usutu -i scholarly.html --sp.species --sp.type genus
ami2-species --project usutu -i scholarly.html --sp.species --sp.type binomial
ami2-species --project usutu -i scholarly.html --sp.species --sp.type genussp
tree usutu
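
The nine ami2-species calls above differ only in project and type, so they can also be generated in a loop. The following Python sketch builds exactly the argument lists shown above and runs them only if ami2-species is found on the PATH. The helper names species_commands and run_all are made up for this example.

```python
import shutil
import subprocess

PROJECTS = ["zika", "aedesaegypti", "usutu"]
SPECIES_TYPES = ["genus", "binomial", "genussp"]


def species_commands(projects=PROJECTS, types=SPECIES_TYPES):
    """Build one ami2-species argument list per (project, type) pair."""
    return [
        ["ami2-species", "--project", project, "-i", "scholarly.html",
         "--sp.species", "--sp.type", species_type]
        for project in projects
        for species_type in types
    ]


def run_all(commands):
    """Run the commands in order; an error in one command stops the run."""
    tool = commands[0][0]
    if shutil.which(tool) is None:
        print(f"{tool} not found on PATH; nothing was run")
        return False
    for argv in commands:
        print("running:", " ".join(argv))
        subprocess.run(argv, check=True)
    return True


run_all(species_commands())
```

Passing each command as a list avoids shell quoting problems with project names.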

Extract word frequencies

Second, we use the word plugin to get the frequencies of words in each publication. This can help us to get a better understanding of the most important concepts, for example a recurring location, a frequently used method or an important animal in the transmission process. With this exploratory approach, we aim to find new relations or to get a general understanding of important knowledge.

ami2-word --project zika -i scholarly.html --w.words wordFrequencies
tree zika

ami2-word --project aedesaegypti -i scholarly.html --w.words wordFrequencies
tree aedesaegypti

ami2-word --project usutu -i scholarly.html --w.words wordFrequencies
tree usutu
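
To get from per-publication frequencies to the most frequent words of a whole corpus, the individual result files have to be summed up. The Python sketch below shows one way to do this. It rests on two assumptions that this tutorial does not confirm: that ami writes the frequencies to <project>/<paper>/results/word/frequencies/results.xml, and that each entry is a result element with word and count attributes. Check the tree output and adapt RESULTS_GLOB if your layout differs.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# Assumed location of the word plugin output, relative to the project folder.
RESULTS_GLOB = "*/results/word/frequencies/results.xml"


def corpus_word_counts(project_dir, pattern=RESULTS_GLOB):
    """Sum the per-paper word counts of one project into a single Counter."""
    totals = Counter()
    root_dir = Path(project_dir)
    if not root_dir.is_dir():
        return totals
    for xml_file in root_dir.glob(pattern):
        try:
            root = ET.parse(xml_file).getroot()
        except ET.ParseError:
            continue  # skip result files that are not well-formed XML
        # Assumed element shape: <result word="virus" count="12"/>
        for result in root.iter("result"):
            word = result.get("word")
            count = result.get("count")
            if word and count and count.isdigit():
                totals[word.lower()] += int(count)
    return totals


for word, count in corpus_word_counts("zika").most_common(20):
    print(f"{count:6d}  {word}")
```

Words are lower-cased before counting, so "Virus" and "virus" are merged; drop the lower() call if case matters for your analysis.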

Extract clinical trial IDs

soon to come...

Analyse the data with Jupyter Notebook

The extracted data is analysed with Python in a Jupyter Notebook. Several methods are applied. Some of them are descriptive and directly show the desired outcome; others are exploratory, and conclusions must be drawn by a domain expert who explores the data and its presentation. The following analyses are performed:

plot a timeline of the publication years
get the most mentioned words and species over the full corpus
find relations between terms (species, words, authors, journals, publications) through network analysis methods, like community detection, co-occurrences and network projection
find all publications in which a term was mentioned
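
To give an idea of what these analyses compute, here is a small standard-library Python sketch. It is not the code from the notebook: the paper IDs, terms and years are made-up toy data, and the function names are chosen for this example. It counts publications per year, builds an index from each term to the publications mentioning it, and counts how often two terms occur in the same publication, which is the raw material for a co-occurrence network.

```python
from collections import Counter, defaultdict
from itertools import combinations


def publication_index(terms_by_paper):
    """Map each term to the sorted list of papers that mention it."""
    index = defaultdict(set)
    for paper, terms in terms_by_paper.items():
        for term in terms:
            index[term].add(paper)
    return {term: sorted(papers) for term, papers in index.items()}


def cooccurrence_counts(terms_by_paper):
    """Count in how many papers each unordered pair of terms appears together."""
    pairs = Counter()
    for terms in terms_by_paper.values():
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical toy data, for illustration only.
terms_by_paper = {
    "paper-1": {"Aedes aegypti", "Zika virus"},
    "paper-2": {"Aedes aegypti", "Aedes albopictus", "Zika virus"},
    "paper-3": {"Usutu virus"},
}
years = {"paper-1": 2015, "paper-2": 2016, "paper-3": 2016}

timeline = Counter(years.values())
print(sorted(timeline.items()))                      # publications per year
print(publication_index(terms_by_paper)["Aedes aegypti"])
print(cooccurrence_counts(terms_by_paper).most_common(1))
```

Each counted pair can be read as a weighted edge between two terms; graph libraries can then run community detection on such an edge list.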

Get the Jupyter Notebook.

Use the Jupyter Notebook:

Go to the tutorials/zika/ folder and start jupyter via:

jupyter notebook

This should open a new tab in your browser showing the current directory. Click on the tutorial-zika.ipynb file to open the Jupyter notebook. You can then execute it cell by cell and adapt it to your needs. The notebook itself contains a more detailed description of the analysis.

FOLLOW UPS

Do another tutorial from the FutureTDM project: Statistics and Libraries
Learn more about the tools used with our software tutorials
Contribute to this repository.
Send us your results at Discourse or via Email (contact@contentmine.org).
Share the tutorial with others in your department or on the web.
Ask us your questions at Discourse, via Email (contact@contentmine.org) or on Twitter (@TheContentMine).

RESOURCES

All materials in this repository were produced within Future TDM - The Future of Text and Data Mining, an EU Horizon2020 research project with participation of Open Knowledge International and ContentMine.

All content and data are licensed under the Creative Commons Attribution 4.0 International License. All code is under the MIT license.
