Explore the eLife article data #8

Open
lwinfree opened this issue Aug 27, 2020 · 2 comments

Comments

@lwinfree
Collaborator

eLife articles: a dump of all eLife articles in XML and JSON (up to August 2020; ~60 GB), available at https://github.com/elifesciences/elife-article-xml and https://github.com/elifesciences/elife-article-json

@chris48s
Collaborator

  • How can we identify and extract tabular data from these articles?
  • What tools exist to help us parse these? Anything higher-level/smarter than plain old lxml?
  • Is there a particular subset of tables/tabular data that we are interested in?
  • Once we've identified tabular data in articles, how do we work with them?

@proccaserra
Member

proccaserra commented Aug 28, 2020

@chris48s @lwinfree I just asked Emmy about retrieving the tables.

Example of a table of interest: Table 1 in
https://elifesciences.org/articles/57525/figures#tables
From Emmy, I understand that we could get the JATS XML for the article, parse it, pull out the tables, read their headers, and look for entities or patterns (e.g. repeating headers, as seen in Table 1).
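
A minimal sketch of that parse-and-pull step, assuming the tables sit in JATS `<table-wrap>` elements that contain an XHTML-style `<table>` with `<thead>` and `<tbody>`. It uses only the standard library; `SAMPLE_JATS`, `cell_text` and `extract_tables` are hypothetical names, and the sample XML is made up for illustration rather than taken from an eLife article.

```python
import xml.etree.ElementTree as ET

# Made-up JATS fragment for illustration only (not real eLife content).
SAMPLE_JATS = """<article>
  <body>
    <table-wrap id="table1">
      <label>Table 1.</label>
      <caption><title>Hypothetical sample table.</title></caption>
      <table>
        <thead>
          <tr><th>Gene</th><th>p-value</th></tr>
        </thead>
        <tbody>
          <tr><td>geneA</td><td>0.01</td></tr>
          <tr><td>geneB</td><td>0.20</td></tr>
        </tbody>
      </table>
    </table-wrap>
  </body>
</article>"""


def cell_text(element):
    """Return the element's text, including nested markup, whitespace-normalised."""
    return " ".join("".join(element.itertext()).split())


def extract_tables(xml_string):
    """Collect id, label, header row and body rows for every <table-wrap>."""
    root = ET.fromstring(xml_string)
    tables = []
    for wrap in root.iter("table-wrap"):
        table = wrap.find("table")
        if table is None:
            continue  # assumption: skip wraps that hold no inline table markup
        label = wrap.find("label")
        # Only the first header row is read; multi-row headers need more care.
        headers = [cell_text(th) for th in table.findall("./thead/tr[1]/th")]
        rows = [
            [cell_text(td) for td in tr.findall("td")]
            for tr in table.findall("./tbody/tr")
        ]
        tables.append({
            "id": wrap.get("id"),
            "label": cell_text(label) if label is not None else None,
            "headers": headers,
            "rows": rows,
        })
    return tables


for t in extract_tables(SAMPLE_JATS):
    print(t["id"], t["headers"], len(t["rows"]))
```

Real articles will likely need extra handling that this sketch ignores: header cells spanning rows or columns, `<th>` cells inside `<tbody>`, and tables provided only as images.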

  • The so-called key resource table can be filtered out. I have produced a "Frictionless Template" for it.
  • Attempt to detect tables containing statistical results (e.g. a field header matching ~[pval|AdjPval|p-value|q-value]).
  • Create bins for other motifs.
  • Batch convert to Frictionless data packages.
  • Attempt to assign an rdfType to headers.
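
The steps in the list above could be sketched roughly as follows: bin a table by its headers, then describe it as a Frictionless resource whose Table Schema fields carry an `rdfType`. Everything named here is an assumption for illustration: the regular expression is one possible reading of the header pattern, the "reagent type" check is a guess at how a key resources table might be recognised, `PLACEHOLDER_RDF_TYPE` is a made-up IRI, and `classify_table` and `to_resource` are hypothetical helpers, not part of any Frictionless library.

```python
import json
import re

# One possible reading of ~[pval|AdjPval|p-value|q-value]; extend as needed.
STAT_HEADER = re.compile(r"^(adj\.?\s*)?(p|q)[\s_.-]?val(ue)?s?$", re.IGNORECASE)

# Placeholder IRI, NOT a real ontology term; swap in a proper one later.
PLACEHOLDER_RDF_TYPE = "http://example.org/terms/p-value"


def classify_table(headers):
    """Put a table into a coarse bin based only on its header strings."""
    normalized = [h.strip() for h in headers]
    # Assumption: key resources tables have a header starting "Reagent type".
    if any(h.lower().startswith("reagent type") for h in normalized):
        return "key_resources"
    if any(STAT_HEADER.match(h) for h in normalized):
        return "statistical"
    return "other"  # further motif bins would be added here


def to_resource(name, headers, path):
    """Build a minimal Data Package resource descriptor for one table."""
    fields = []
    for header in headers:
        field = {"name": header, "type": "string"}
        if STAT_HEADER.match(header.strip()):
            field["type"] = "number"
            field["rdfType"] = PLACEHOLDER_RDF_TYPE
        fields.append(field)
    return {"name": name, "path": path, "schema": {"fields": fields}}


headers = ["Gene", "p-value"]
if classify_table(headers) == "statistical":
    package = {
        "name": "elife-tables-sketch",
        "resources": [to_resource("table1", headers, "table1.csv")],
    }
    print(json.dumps(package, indent=2))
```

Binning on header strings alone is deliberately crude; it is meant as a first pass that narrows down which tables deserve a closer look before batch conversion.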
