Tutorial: Zika Virus

This tutorial shows you basic and advanced text and data mining methods with the ContentMine toolchain, using them to explore scientific publications and gain a better understanding of the research around the Zika virus.

The main part will focus on getting information about the Zika virus, the research done on it, and the entities most relevant to it. This is also a typical workflow for gathering information about a pandemic.

As our data source, we use publications from the Europe PMC (EUPMC) API, because they are Open Access.

The approach used here should also serve as a blueprint for gathering information about pandemics in general, and help future research when a new virus starts to spread.

ABOUT ZIKA

The Zika virus (ZIKV) is a member of the virus family Flaviviridae. It is spread by daytime-active Aedes mosquitoes, such as Aedes aegypti and Aedes albopictus.

Additional information

  • The Spondweni virus is mentioned as phylogenetically very close to the Zika virus.

SETUP

All software necessary can be found in installation.md.

Additional requirements

  • Disk space: The downloaded data takes around 700 MB on your hard drive.

As preparation, we recommend having a look at the resources list in installation.md.

TUTORIAL

Download from Europe PMC

The first step is always to get the needed data from the APIs. For this, we use getpapers, the ContentMine tool for fetching papers via different publisher APIs. In this tutorial, we will only use open access literature from Europe PMC. We can search within their database of 3.5 million full-text papers from the life sciences, about one million of which are Open Access. Please refer to Europe PMC-Data for details. This will take less than 200 MB of disk space.

Get Zika publications

First, we set two variables: one for the query and one for the folder the data should be stored in.

QUERY='zika'
FOLDER='zika'

Then, we check how many results the query term returns. For further information on how to create more complex queries for the EUPMC API, see the Europe PMC search documentation.

getpapers -q $QUERY -o $FOLDER

1126 papers were found in this case (as of 14 March 2017).

Then, we download the fulltext.xml files for each publication. For this, add the -x flag to the query. The results can then be viewed with the tree command.

getpapers -q $QUERY -o $FOLDER -x
tree zika

Get Aedes aegypti (virus-transmitting Mosquito) publications

We also want to have a look at all publications about the virus-transmitting mosquito Aedes aegypti (Stegomyia aegypti is a synonym for Aedes aegypti).

QUERY='stegomyia aedes aegypti'
FOLDER='aedesaegypti'
getpapers -q $QUERY -o $FOLDER -x
tree aedesaegypti

231 publications are found.

Get Usutu virus publications

Lastly, we want to have a look at the Usutu virus and download the related publications. This virus is among several dozen known viruses found to have high epidemic potential, so perhaps we can apply lessons learned from the Zika virus to this corpus.

QUERY='usutu'
FOLDER='usutu'
getpapers -q $QUERY -o $FOLDER -x
tree usutu

193 publications are found.

Finally, we have all the available data, which can be used for further analysis.

Optional: Get supplementary materials, if needed

If you want to use the available supplementary materials, here is the command to download them: simply add the -s flag to the query, and view the results as usual.

getpapers -q 'zika' -o zika -s
tree zika

Optional: Get the PDFs

To download all PDFs of the scientific papers, we add the -p flag. Be aware that this will take around 3 GB of space on your hard drive.

QUERY='zika'
FOLDER='zika'
getpapers -q $QUERY -o $FOLDER -p
tree zika

Save raw data

At this point, we recommend saving the raw data, so you can jump back to this step later on. Simply copy your three folders or compress them, whichever you prefer.

Normalize the data

Before we can start with the extraction and analysis, we have to normalize the raw data and convert it to Scholarly HTML, which is easier to process further on. For this, we use norma to convert the fulltext.xml files to scholarly.html files, and then view the results.

norma --project zika -i fulltext.xml -o scholarly.html --transform nlm2html
tree zika
norma --project aedesaegypti -i fulltext.xml -o scholarly.html --transform nlm2html
tree aedesaegypti
norma --project usutu -i fulltext.xml -o scholarly.html --transform nlm2html
tree usutu

Extract the needed facts

The prepared data can now be used to extract facts via ami's different plugins. Besides the publication metadata, this will be the main data source for the later analysis.

Extract Species

First, we use the species plugin to extract genus names and binomial nomenclature. The species are especially important because knowing them could give us a lead on Zika-transmitting animals, in our case Aedes mosquitoes.

ami2-species --project zika -i scholarly.html --sp.species --sp.type genus
ami2-species --project zika -i scholarly.html --sp.species --sp.type binomial
ami2-species --project zika -i scholarly.html --sp.species --sp.type genussp
tree zika
ami2-species --project aedesaegypti -i scholarly.html --sp.species --sp.type genus
ami2-species --project aedesaegypti -i scholarly.html --sp.species --sp.type binomial
ami2-species --project aedesaegypti -i scholarly.html --sp.species --sp.type genussp
tree aedesaegypti
ami2-species --project usutu -i scholarly.html --sp.species --sp.type genus
ami2-species --project usutu -i scholarly.html --sp.species --sp.type binomial
ami2-species --project usutu -i scholarly.html --sp.species --sp.type genussp
tree usutu

Extract word frequencies

Second, we use the word plugin to get the frequencies of words in each publication. This can help us better understand the most important concepts, such as a recurring location, a frequently used method, or an animal important in the transmission process. This explorative approach aims to surface new relations and give a general understanding of the key topics.

ami2-word --project zika -i scholarly.html --w.words wordFrequencies
tree zika
ami2-word --project aedesaegypti -i scholarly.html --w.words wordFrequencies
tree aedesaegypti
ami2-word --project usutu -i scholarly.html --w.words wordFrequencies
tree usutu
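Once ami2-word has produced per-paper frequencies, you could aggregate them across the whole corpus in Python. The sketch below operates on plain dicts; loading the actual ami output files is left out, and the sample counts are purely illustrative.

```python
from collections import Counter

def aggregate_frequencies(per_paper_counts):
    """Merge per-paper word frequency dicts into corpus-level totals."""
    totals = Counter()
    for counts in per_paper_counts:
        totals.update(counts)
    return totals

# Illustrative input; real counts would come from ami's word plugin output.
papers = [
    {"zika": 12, "aedes": 5, "transmission": 3},
    {"zika": 7, "vaccine": 2, "aedes": 1},
]
print(aggregate_frequencies(papers).most_common(2))
# → [('zika', 19), ('aedes', 6)]
```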

Extract clinical trial IDs

soon to come...

Analyse the data with Jupyter Notebook

The analysis of the extracted data is done with Python in a Jupyter Notebook. Several methods are applied: some are descriptive and show the desired outcome directly, while others are explorative, so conclusions must be drawn by a domain expert exploring the data and its presentation. The following analyses are performed:

  • plot a timeline of the publication years
  • get the most mentioned words and species over the full corpus
  • find relations between terms (species, words, authors, journals, publications) through network analysis methods such as community detection, co-occurrence analysis, and network projection
  • find all publications in which a term was mentioned

Get the Jupyter Notebook.

Use the Jupyter Notebook:

Go to the tutorials/zika/ folder and start jupyter via:

jupyter notebook

This should open a new browser tab showing the current directory. Click on the tutorial-zika.ipynb file to open the Jupyter notebook. You can then execute it cell by cell and adapt it to your needs. A more detailed description of the analysis can be found in the Jupyter notebook itself.

FOLLOW UPS

RESOURCES

All materials in this repository were produced within FutureTDM - The Future of Text and Data Mining, an EU Horizon 2020 research project with participation of Open Knowledge International and ContentMine.


All content and data is licensed under the Creative Commons Attribution 4.0 International License. All code is under the MIT license.
