This tutorial demonstrates basic and advanced text and data mining methods with the ContentMine toolchain, using them to explore scientific publications and get a better understanding of the research around the Zika virus.
The main part focuses on gathering information about the Zika virus, the research done on it, and the entities most relevant to it. A typical way to gather information about a pandemic is to look at:
- a) the virus itself (Zika virus)
- b) the virus-transmitting species (also known as disease vectors; in our case Aedes aegypti and Aedes albopictus)
- c) drugs, clinical trials and locations mentioned in combination with the virus and the virus-transmitting species
- d) similar pandemics/viruses (like the Yellow fever or the Usutu virus, which is a flavivirus amongst a range of suspects for becoming the next pandemic)
As the data source, we use the Open Access publications available through the Europe PMC (EUPMC) API.
The approach should also serve as a blueprint for gathering information about pandemics in general, and help future research when a new virus starts to spread.
The Zika virus (ZIKV) is a member of the virus family Flaviviridae. It is spread by daytime-active Aedes mosquitos, such as Aedes aegypti and Aedes albopictus.
Additional information
- The Spondweni virus is mentioned as phylogenetically very close to the Zika virus.
All software necessary can be found in installation.md.
Additional requirements
- Disk space: The downloaded data needs around 700 MB on your hard drive.
As preparation, we recommend having a look at the resources list in installation.md.
The first step is always to get the needed data from the APIs. For this, we use getpapers, the ContentMine tool for getting papers via different publisher APIs. In this tutorial, we only use Open Access literature from Europe PMC. Its database holds 3.5 million full-text papers from the life sciences, about one million of which are Open Access. Please refer to Europe PMC-Data for details. This will take less than 200 MB of disk space.
Get Zika publications
First, we want to set the variables for the query and for the folder we want the data to be stored in.
QUERY='zika'
FOLDER='zika'
Then, we have a look at how many results we find for the query term. For further information on how to create more complex queries for the EUPMC API, read here or here.
getpapers -q "$QUERY" -o "$FOLDER"
1126 papers were found in this case (as of 14 March 2017).
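The hit count can also be read back from the metadata that getpapers stores in the output folder, rather than from the console output. The following is a minimal sketch, assuming that the metadata is written to a file named eupmc_results.json inside the folder and that each record carries a pubYear field (possibly wrapped in a one-element list). Both are assumptions about the output layout, and the function names are illustrative only.

```python
import json
from collections import Counter
from pathlib import Path


def load_results(folder):
    """Load the metadata list assumed to live in <folder>/eupmc_results.json."""
    path = Path(folder) / "eupmc_results.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def first_value(value):
    """Return a scalar from a field that may be stored as a one-element list."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def papers_per_year(results):
    """Count papers per publication year, skipping records without a pubYear."""
    years = (first_value(record.get("pubYear")) for record in results)
    return Counter(year for year in years if year is not None)
```

After the query above has run, len(load_results("zika")) would give the number of hits, and papers_per_year(load_results("zika")) gives the raw numbers for a publication timeline.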
Then, we download the fulltext.xml files for each publication. For this, add the -x flag to the query. The results can then be viewed with the tree command.
getpapers -q "$QUERY" -o "$FOLDER" -x
tree zika
Get Aedes aegypti (virus-transmitting Mosquito) publications
We also want to have a look at all publications on the virus-transmitting mosquito Aedes aegypti (Stegomyia aegypti is a synonym for Aedes aegypti).
QUERY='stegomyia aedes aegypti'
FOLDER='aedesaegypti'
getpapers -q "$QUERY" -o "$FOLDER" -x
tree aedesaegypti
231 publications are found.
Get Usutu virus publications
Lastly, we want to have a look at the Usutu virus and download the related publications. This virus is among several dozen known viruses that have been found to have a high epidemic potential, so we may be able to apply some lessons learned from the Zika virus to this corpus.
QUERY='usutu'
FOLDER='usutu'
getpapers -q "$QUERY" -o "$FOLDER" -x
tree usutu
193 publications are found.
Finally, we have all the available data, which can be used for further analysis.
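Since the same getpapers call is repeated for each corpus, the three downloads can be driven from one small script. This is a sketch under stated assumptions: it uses only the -q, -o and -x flags shown above, and the helper names (build_command, download_all) and the dry_run switch are made up for illustration. Passing the query as a single list item keeps multi-word queries such as 'stegomyia aedes aegypti' together as one argument.

```python
import shutil
import subprocess

# Corpus folder -> query string, copied from the steps above.
CORPORA = {
    "zika": "zika",
    "aedesaegypti": "stegomyia aedes aegypti",
    "usutu": "usutu",
}


def build_command(query, folder, fulltext=True):
    """Build the getpapers argument list; each item is passed as one argument."""
    command = ["getpapers", "-q", query, "-o", folder]
    if fulltext:
        command.append("-x")  # also download fulltext.xml
    return command


def download_all(corpora=CORPORA, dry_run=True):
    """Return the commands for all corpora and run them unless dry_run is set."""
    commands = [build_command(query, folder) for folder, query in corpora.items()]
    if not dry_run:
        executable = shutil.which("getpapers")
        if executable is None:
            raise RuntimeError("getpapers was not found on PATH")
        for command in commands:
            # check=True stops at the first failing download.
            subprocess.run([executable] + command[1:], check=True)
    return commands
```

With dry_run=True nothing is executed and the command lists are only returned for inspection. Adapting the blueprint to another virus then means editing the CORPORA mapping.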
Optional: Get supplementary materials, if needed
If you want to use the available supplementary materials, you can download them by adding -s to the query. View the results as usual.
getpapers -q 'zika' -o zika -s
tree zika
Optional: Get the PDFs
To download the PDFs of all scientific papers, add the -p flag to the query. Be aware that this will take around 3 GB of space on your hard drive.
QUERY='zika'
FOLDER='zika'
getpapers -q "$QUERY" -o "$FOLDER" -p
tree zika
Save raw data
At this point, we recommend saving the raw data, so you can jump back to this point later on. You can simply copy your three folders or compress them, whichever you prefer.
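One way to do the compression is sketched below with Python's standard library. The backup folder name raw-backup and the function name backup_corpora are arbitrary choices for this example, not part of the ContentMine tools.

```python
import shutil
from pathlib import Path


def backup_corpora(folders, backup_dir="raw-backup"):
    """Write one .tar.gz per existing corpus folder and return the archive paths."""
    backup_dir = Path(backup_dir).resolve()
    backup_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for folder in folders:
        source = Path(folder).resolve()
        if not source.is_dir():
            continue  # skip corpora that were not downloaded
        archive = shutil.make_archive(
            str(backup_dir / source.name),  # archive path without extension
            "gztar",
            root_dir=str(source.parent),
            base_dir=source.name,  # keep the folder name inside the archive
        )
        archives.append(Path(archive))
    return archives
```

Calling backup_corpora(["zika", "aedesaegypti", "usutu"]) from the directory that contains the three folders would produce zika.tar.gz, aedesaegypti.tar.gz and usutu.tar.gz in raw-backup.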
Before we can start with the extraction and analysis, we have to normalize the raw data and convert it to Scholarly HTML, so it is easier to process later on. For this, we use norma to convert the fulltext.xml files to scholarly.html files, and then view the results.
norma --project zika -i fulltext.xml -o scholarly.html --transform nlm2html
tree zika
norma --project aedesaegypti -i fulltext.xml -o scholarly.html --transform nlm2html
tree aedesaegypti
norma --project usutu -i fulltext.xml -o scholarly.html --transform nlm2html
tree usutu
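Instead of scanning the tree output by eye, the conversion can be checked with a few lines of Python. The sketch below assumes that every paper sits in its own subfolder of the project folder, with fulltext.xml and scholarly.html side by side. The function name unconverted_papers is illustrative.

```python
from pathlib import Path


def unconverted_papers(project, source="fulltext.xml", target="scholarly.html"):
    """Return the names of paper folders that have `source` but no `target`."""
    missing = []
    for paper in sorted(Path(project).iterdir()):
        if not paper.is_dir():
            continue  # skip metadata files in the project root
        if (paper / source).is_file() and not (paper / target).is_file():
            missing.append(paper.name)
    return missing
```

An empty list from unconverted_papers("zika") would mean that every downloaded fulltext.xml has a scholarly.html next to it.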
The prepared data can now be used to extract facts via ami's different plugins. Besides the metadata of the publications, these facts will be the main data source for the analysis later on.
First, we use the species plugin to extract genus and binomial names. The species are especially important because knowing them could give us a lead on Zika-transmitting animals, in our case Aedes mosquitoes.
ami2-species --project zika -i scholarly.html --sp.species --sp.type genus
ami2-species --project zika -i scholarly.html --sp.species --sp.type binomial
ami2-species --project zika -i scholarly.html --sp.species --sp.type genussp
tree zika
ami2-species --project aedesaegypti -i scholarly.html --sp.species --sp.type genus
ami2-species --project aedesaegypti -i scholarly.html --sp.species --sp.type binomial
ami2-species --project aedesaegypti -i scholarly.html --sp.species --sp.type genussp
tree aedesaegypti
ami2-species --project usutu -i scholarly.html --sp.species --sp.type genus
ami2-species --project usutu -i scholarly.html --sp.species --sp.type binomial
ami2-species --project usutu -i scholarly.html --sp.species --sp.type genussp
tree usutu
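To get from the per-paper results to a corpus-wide ranking, the extracted names have to be aggregated. The sketch below rests on assumptions about the ami output layout: that each paper folder contains results/species/<type>/results.xml, and that every result element carries the species name in a match or exact attribute. Please check a results.xml file from your own run before relying on it.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def species_counts(project, sp_type="binomial"):
    """Count species mentions over all papers of a project (layout is assumed)."""
    counts = Counter()
    pattern = f"*/results/species/{sp_type}/results.xml"
    for results_file in Path(project).glob(pattern):
        try:
            root = ET.parse(results_file).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        for result in root.iter("result"):
            # Assumed attribute names; fall back from `match` to `exact`.
            name = result.get("match") or result.get("exact")
            if name:
                counts[name.strip()] += 1
    return counts
```

species_counts("zika").most_common(10) would then list the ten most frequently mentioned binomial names in the Zika corpus.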
Second, we use the word plugin to get the frequencies of words in a publication. This can help us to get a better understanding of the most important concepts, such as a recurring location, a constantly used method, or an important animal in the transmission process. With this exploratory approach, we aim to find new relations or gain a general understanding of important knowledge.
ami2-word --project zika -i scholarly.html --w.words wordFrequencies
tree zika
ami2-word --project aedesaegypti -i scholarly.html --w.words wordFrequencies
tree aedesaegypti
ami2-word --project usutu -i scholarly.html --w.words wordFrequencies
tree usutu
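A word that is frequent in one long paper is less informative than a word that appears across many papers, so it helps to track both numbers. The following sketch assumes that the word plugin writes results/word/frequencies/results.xml into each paper folder and that every result element has word and count attributes. These are assumptions about the output format, to be verified against your own files.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def word_statistics(project):
    """Return (total count per word, number of papers per word) for a project."""
    totals, papers = Counter(), Counter()
    pattern = "*/results/word/frequencies/results.xml"
    for results_file in Path(project).glob(pattern):
        try:
            root = ET.parse(results_file).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        seen = set()
        for result in root.iter("result"):
            word, count = result.get("word"), result.get("count")
            if not word or count is None or not count.isdigit():
                continue  # ignore entries without a usable word/count pair
            totals[word] += int(count)
            seen.add(word)
        papers.update(seen)  # each paper counts once per word
    return totals, papers
```

Sorting by the second counter instead of the first favours terms that recur across the corpus over terms that dominate a single publication.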
soon to come...
The analysis of the extracted data is done with Python in a Jupyter Notebook. Several methods are applied. Some of them are descriptive and directly show the desired outcome, but others are exploratory, and conclusions must be drawn by a domain expert who explores the data and its presentation themselves. The following analyses are done:
- plot a timeline of the publication years
- get the most mentioned words and species over the full corpus
- find relations between terms (species, words, authors, journals, publications) through network analysis methods, like community detection, co-occurrences and network projection.
- find all publications in which a term was mentioned
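Two of the steps listed above, looking up the publications that mention a term and counting co-occurrences, can be illustrated in plain Python. This is a simplified sketch and not necessarily how the notebook implements them. The input is a hypothetical mapping from paper ID to the terms extracted from that paper, and the paper IDs in the usage example are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_index(terms_by_paper):
    """Map each term to the set of paper IDs that mention it."""
    index = defaultdict(set)
    for paper_id, terms in terms_by_paper.items():
        for term in terms:
            index[term].add(paper_id)
    return dict(index)


def cooccurrences(terms_by_paper):
    """Count, for each unordered term pair, the papers that mention both terms."""
    pairs = Counter()
    for terms in terms_by_paper.values():
        # Sorting makes (a, b) and (b, a) land on the same key.
        for pair in combinations(sorted(set(terms)), 2):
            pairs[pair] += 1
    return pairs


# Hypothetical example data: paper ID -> extracted terms.
example = {
    "PMC1": ["Aedes aegypti", "Zika virus"],
    "PMC2": ["Aedes aegypti", "Aedes albopictus", "Zika virus"],
    "PMC3": ["Usutu virus"],
}
index = build_index(example)
pairs = cooccurrences(example)
```

Here index["Zika virus"] contains PMC1 and PMC2, and the pair ("Aedes aegypti", "Zika virus") has a count of 2. The pair counts can serve as edge weights when the terms are loaded into a network library for community detection.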
Get the Jupyter Notebook.
Use the Jupyter Notebook:
Go to the tutorials/zika/ folder and start Jupyter via:
jupyter notebook
This should make your browser open a new tab showing the current directory. Click on the tutorial-zika.ipynb file to open the Jupyter notebook. Then you can execute it cell by cell and adapt the notebook to your needs. A more detailed description of the analysis can be found in the Jupyter notebook.
- Do another tutorial from the FutureTDM project: Statistics and Libraries
- Learn more about the tools used with our software tutorials
- Contribute to this repository.
- Send us your results at Discourse or via Email (contact@contentmine.org).
- Share the tutorial with others in your department or on the web.
- Ask us your questions at Discourse, via Email (contact@contentmine.org) or on Twitter (@TheContentMine).
- ContentMine/Hypothes.is Proposal
- Wikidata:WikiProject Source MetaData/Wikidata lists/Items about Zika virus or fever
- Article: Zika may be spread by up to 35 species of mosquitoes, researchers say
All materials in this repository were produced within Future TDM - The Future of Text and Data Mining, an EU Horizon2020 research project with participation of Open Knowledge International and ContentMine.
All content and data are licensed under the Creative Commons Attribution 4.0 International License. All code is under the MIT license.