The COVID-19 Open Research Dataset (CORD-19), offered by the Allen Institute for AI and other leading research groups, is a collection of thousands of articles on COVID-19 and related coronaviruses. Here, we use text analytics techniques in MATLAB to explore the articles, applying topic modeling and document summarization to help answer some of the relevant questions.
Goal: Explore relevant articles to understand “what do we know about transmission?”
Data used: comm_use_subset of the COVID-19 Open Research Dataset (CORD-19)
Techniques used: Topic modeling and Document Summarization
MATLAB Live Script: TopicModel_Transmission_comm_use.mlx
Step 1: First, we fit latent Dirichlet allocation (LDA) topic models to discover the underlying topics in the articles. We test four different solvers:
- ‘cgs’: collapsed Gibbs sampling
- ‘avb’: approximate variational Bayes
- ‘cvb0’: collapsed variational Bayes, zeroth order
- ‘savb’: stochastic approximate variational Bayes
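To make the first solver in the list above more concrete, here is a minimal sketch of collapsed Gibbs sampling for LDA. It is written in Python with only the standard library and is not the code from the MATLAB live script. The function name `lda_collapsed_gibbs`, the hyperparameter values, and the toy documents are all assumptions chosen for illustration.

```python
import random

def lda_collapsed_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (vocab, topic_word, doc_topic),
    where every row of topic_word and doc_topic is a probability distribution.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    n_dk = [[0] * K for _ in docs]      # tokens in document d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]  # times word v is assigned to topic k
    n_k = [0] * K                       # total tokens assigned to topic k
    z = []                              # current topic of every token

    # Start from a random topic assignment.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(K)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove this token from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic given all other
                # assignments (up to a normalizing constant).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    topic_word = [
        [(n_kw[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
        for k in range(K)
    ]
    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    return vocab, topic_word, doc_topic

# Made-up toy corpus, for illustration only.
toy_docs = [
    ["transmission", "droplets", "contact", "transmission"],
    ["protein", "structure", "binding", "protein"],
    ["droplets", "contact", "transmission"],
    ["binding", "structure", "protein"],
]
vocab, topic_word, doc_topic = lda_collapsed_gibbs(toy_docs, num_topics=2)
```

The variational solvers in the list optimize an approximation to the posterior instead of sampling from it, so they trade off speed and fit differently; the sketch does not cover them.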
Step 2: After choosing a solver, we select the number of topics by comparing validation perplexity across different numbers of topics. Lower perplexity on held-out documents indicates a better fit.
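As a rough illustration of this selection step, perplexity can be computed as the exponential of the negative mean log-probability per held-out token, and the candidate with the lowest value is kept. The Python sketch below is not taken from the live script; the helper names `perplexity` and `pick_num_topics` are hypothetical, and the perplexity values in the example are placeholders, not results from CORD-19.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean natural-log probability per held-out token)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def pick_num_topics(validation_perplexity):
    """Return the candidate topic count with the lowest validation perplexity."""
    return min(validation_perplexity, key=validation_perplexity.get)

# A model that assigns probability 1/4 to each of 8 held-out tokens is as
# uncertain as a uniform choice among 4 words.
uniform_ppl = perplexity([math.log(0.25)] * 8)

# Placeholder values keyed by number of topics (illustrative only).
candidates = {5: 980.0, 10: 870.0, 20: 845.0, 40: 905.0}
best_num_topics = pick_num_topics(candidates)
```

In practice each value in `candidates` would come from fitting one model per topic count and evaluating it on a validation set that was held out from training.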
Step 3: Build the final model using the chosen solver and optimum number of topics.
Step 4: To answer the question “what do we know about transmission?”, we choose the most relevant article in two stages: first we identify the topic in which the word “transmission” has the highest probability, and then we identify the document that has the highest probability for that topic.
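The two-stage lookup in Step 4 can be sketched as follows, reading it as "the topic that gives the word its highest probability, then the document that gives that topic its highest probability". This is a Python illustration, not the live script's code. The function name `most_relevant_document` and the small probability tables are made up; a fitted LDA model would supply the real topic-word and document-topic probabilities.

```python
def most_relevant_document(word, vocab, topic_word, doc_topic):
    """Return (topic index, document index) for a query word.

    topic_word[k][v] is the probability of word v in topic k.
    doc_topic[d][k] is the probability of topic k in document d.
    """
    v = vocab.index(word)
    # Stage 1: topic in which the word has the highest probability.
    topic = max(range(len(topic_word)), key=lambda k: topic_word[k][v])
    # Stage 2: document with the highest probability for that topic.
    doc = max(range(len(doc_topic)), key=lambda d: doc_topic[d][topic])
    return topic, doc

# Toy probabilities, for illustration only.
vocab = ["protein", "transmission", "vaccine"]
topic_word = [
    [0.7, 0.1, 0.2],  # topic 0
    [0.1, 0.8, 0.1],  # topic 1
]
doc_topic = [
    [0.9, 0.1],  # document 0
    [0.3, 0.7],  # document 1
    [0.5, 0.5],  # document 2
]
topic, doc = most_relevant_document("transmission", vocab, topic_word, doc_topic)
```

With these toy tables the lookup selects topic 1 and then document 1.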
Step 5: An alternative approach is to summarize the top abstracts.
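One common way to build an extractive summary is TextRank, which ranks sentences by running PageRank on a sentence-similarity graph. The text above does not say which scoring method the live script uses, so the Python sketch below should be read as an example of the general idea only. The function name `textrank_summary`, the Jaccard word-overlap similarity, and the toy sentences are assumptions.

```python
import re

def textrank_summary(sentences, size=1, damping=0.85, iters=50):
    """Return the `size` highest-ranked sentences, kept in original order."""
    words = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)

    # Edge weight = Jaccard overlap between the word sets of two sentences.
    def overlap(a, b):
        union = words[a] | words[b]
        return len(words[a] & words[b]) / len(union) if union else 0.0

    weight = [[overlap(i, j) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weight]

    # Power iteration of the weighted PageRank update.
    score = [1.0 / n] * n
    for _ in range(iters):
        score = [
            (1 - damping) / n
            + damping * sum(
                weight[j][i] / out_sum[j] * score[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]

    top = sorted(range(n), key=lambda i: score[i], reverse=True)[:size]
    return [sentences[i] for i in sorted(top)]

# Made-up sentences, for illustration only.
sentences = [
    "Respiratory droplets drive coronavirus transmission.",
    "Masks block respiratory droplets.",
    "Household contact increases coronavirus transmission.",
    "Funding details appear elsewhere.",
]
summary = textrank_summary(sentences, size=1)
```

The first sentence shares words with two others, so it ends up as the most central node and is returned as the one-sentence summary; the unrelated last sentence receives the lowest score.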
Next Steps
There are many ways to dig deeper into this single question. Some of the possible approaches are:
- Use n-grams (phrases of two or more words) in topic modeling (Analyze Text Data with Multiword Phrases),
- Use TF-IDF weighting for topic modeling (TFIDF),
- Extract a summary guided by a query (such as “transmission”) using the maximal marginal relevance (MMR) algorithm (mmrScores), and
- Perform correlation analysis to find the words most commonly associated with “transmission” (Co-occurrence Analysis and Visualization).
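To give a flavor of two of these ideas together, the Python sketch below weights words with a smoothed TF-IDF and then ranks documents against a query using maximal marginal relevance, which balances relevance to the query against redundancy with documents already selected. It is an illustration under stated assumptions and does not reproduce the behavior of any MATLAB function: the names `tfidf_vectors`, `cosine`, and `mmr_rank`, the smoothing formula, and the toy documents are all hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, query):
    """Return sparse TF-IDF vectors for the documents and for the query."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed inverse document frequency (one of several common variants).
    idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}

    def vec(tokens):
        return {w: c * idf[w] for w, c in Counter(tokens).items() if w in idf}

    return [vec(d) for d in docs], vec(query)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_rank(docs, query, k, lam=0.5):
    """Greedily pick k document indices by maximal marginal relevance."""
    doc_vecs, q = tfidf_vectors(docs, query)
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy = highest similarity to any already selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lam * cosine(doc_vecs[i], q) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Made-up tokenized documents, for illustration only.
docs = [
    ["transmission", "droplets", "contact"],
    ["transmission", "droplets", "contact", "surfaces"],
    ["transmission", "animals"],
    ["protein", "structure"],
]
ranking = mmr_rank(docs, ["transmission"], k=3, lam=0.5)
```

The parameter `lam` controls the trade-off: values near 1 favor relevance to the query, while smaller values penalize near-duplicate documents more heavily, so the second toy document (almost a copy of the first) drops in the ranking as `lam` decreases.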
The aim of this example is to
- show how to use text analytics techniques to explore text data and build predictive models, and
- provide a starting point for researchers to build on and dive deeper into unanswered questions about this pandemic.
Copyright 2020 - 2020 The MathWorks, Inc.