A pipeline for inferring gender for acknowledged individuals in scientific literature on a massive scale

NCBI-Hackathons/Hidden-Figures

Hidden Figures

Project Overview

An investigation into the acknowledgments sections of research articles in PubMed Central. Prior literature suggests a gender discrepancy in authorship and acknowledgment. Specifically, in a small sample of theoretical population genetics publications, women were more likely to appear in the acknowledgments than in the author list. We tested this observation at large scale across biomedical research articles and investigated the contributions of acknowledged individuals.

Hypotheses

  1. Women are more likely to appear in the acknowledgments than their share of the author list would suggest.
  2. The types of tasks acknowledged differ between men and women.
  3. The type of praise given to men and women differs (e.g., "fruitful discussion", "outstanding analysis").
  4. These trends change over time, moving toward greater equality.

Literature Review

Few large-scale studies have been conducted on acknowledgments in research articles; our study is novel in size and scope. Notable previous studies:

Khabsa et al., 2012

  • extracted acknowledgments sections from articles in CiteSeerX
  • identified individuals and organizations
  • built a network graph of acknowledged entities and authors

Paul-Hus et al., 2017

  • extracted acknowledgments sections from articles in Web of Science
  • identified acknowledged contributions
  • analyzed trends in contributions by field of study

Analysis of Features

Data Sources and Extraction

Source: PMC FTP

The PMC XML files have an <ack> tag for the Acknowledgments section. For example, consider PMC 4959138:

<ack>
    <p>
	We thank Alexia Prskawetz for the fruitful discussions and remarks. 
	Further on, we would like to thank the referees and editors for their 
	valuable comments. This research was partly supported by the Austrian 
	Science Fund (FWF) under Grant No. P25979-N25 and is an extract out of 
	the Ph.D. thesis (Moser <xref ref-type="bibr" rid="CR30">2014</xref>).
    </p> 
</ack>
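
As a rough illustration of this extraction step, the sketch below pulls the text out of every `<ack>` element with Python's standard-library XML parser. It is not the project's own extraction code: the function name `extract_acknowledgments` and the trimmed `SAMPLE` document are assumptions made for the example, and real PMC files may need extra handling (for instance, for their DTD declarations).

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for a PMC article, based on the excerpt above.
SAMPLE = """<article>
  <back>
    <ack>
      <p>We thank Alexia Prskawetz for the fruitful discussions and remarks.
      This research is an extract out of the Ph.D. thesis
      (Moser <xref ref-type="bibr" rid="CR30">2014</xref>).</p>
    </ack>
  </back>
</article>"""


def extract_acknowledgments(xml_text):
    """Return the plain text of every <ack> element, one string per element."""
    root = ET.fromstring(xml_text)
    sections = []
    for ack in root.iter("ack"):
        # itertext() walks nested tags such as <xref>, flattening inline markup.
        text = "".join(ack.itertext())
        # Collapse line breaks and indentation left over from the XML layout.
        sections.append(" ".join(text.split()))
    return sections


if __name__ == "__main__":
    for section in extract_acknowledgments(SAMPLE):
        print(section)
```

Flattening with `itertext()` keeps the citation year inside the sentence, which may or may not be wanted before name extraction.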

Sentence parsing using spaCy

[Figure: spaCy parsing example]

PMC acknowledgments over time

[Figure: acknowledgment counts over time]

Natural Language Processing

Extract names and infer gender using genderize
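
A minimal sketch of the gender-inference step, assuming first names have already been extracted. It queries the public genderize.io endpoint with repeated `name[]` parameters and reads the `gender` and `probability` fields of each returned record. The helper names and the `MIN_PROBABILITY` cutoff are hypothetical and are not taken from the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.genderize.io"  # public endpoint; rate limits apply
MIN_PROBABILITY = 0.8  # hypothetical cutoff, not taken from the project


def build_url(first_names):
    """Build a batch query URL using repeated name[] parameters."""
    return API_URL + "?" + urlencode([("name[]", n) for n in first_names])


def label_gender(record, min_probability=MIN_PROBABILITY):
    """Map one genderize.io record to 'female', 'male', or 'unknown'."""
    gender = record.get("gender")
    if gender is None or record.get("probability", 0.0) < min_probability:
        return "unknown"
    return gender


def infer_genders(first_names):
    """Query the API (network access required) and label each name."""
    with urlopen(build_url(first_names)) as response:
        records = json.load(response)
    return {r["name"]: label_gender(r) for r in records}


if __name__ == "__main__":
    # Made-up record in the response shape, so the sketch runs offline.
    example = {"name": "alexia", "gender": "female", "probability": 0.98, "count": 1}
    print(label_gender(example))
```

Names that fall below the cutoff are labeled "unknown" rather than guessed, which is one way to limit misclassified names in downstream counts.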

Acknowledgments and, to a lesser extent, authorship are skewed toward men.

[Figure: fraction of female authors]

Summary stats

For the PubMed Central subset with acknowledgments (PMCA):

  • Number of publications in PMCA whose authors have identifiable genders: 312,237
  • Fraction of women among authors in PMCA: 0.424
  • Fraction of women among acknowledged individuals in PMCA: 0.233
  • Median number of people in an acknowledgments section: 5
  • Most acknowledgments are uni-gender: 80%
  • Most of these uni-gender acknowledgments are all-male: 202,150 all-male vs. 47,105 all-female
  • Publications with acknowledgments have a much higher RCR (Relative Citation Ratio) than those without: 0.8 vs. 0.4
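
The uni-gender figures above can be cross-checked with a few lines of arithmetic. The sketch assumes that 47,105 is the all-female count and that the 80% figure uses the 312,237 publications as its denominator. Neither assumption is stated explicitly in the summary, but under them the share comes out close to 0.80.

```python
# Counts copied from the summary statistics above.
PUBS_WITH_GENDERED_NAMES = 312_237
ALL_MALE = 202_150
ALL_FEMALE = 47_105  # assumed to be the all-female count

# Uni-gender acknowledgments are either all-male or all-female.
uni_gender = ALL_MALE + ALL_FEMALE
# Assumed denominator: publications with identifiable genders.
uni_gender_share = uni_gender / PUBS_WITH_GENDERED_NAMES
all_male_share = ALL_MALE / uni_gender

print(f"uni-gender acknowledgments: {uni_gender:,}")
print(f"share of publications: {uni_gender_share:.3f}")
print(f"all-male share of uni-gender: {all_male_share:.3f}")
```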

Quality control

| Acknowledgment Name Parsing Error | Occurrence | PMCID | Example |
| --- | --- | --- | --- |
| Author's Name Listed | 4.5% | PMC3339585 | Smriti Shrivastava is thankful to CSIR for Senior Research Fellowship |
| Fellowship Name | 2.0% | PMC5864053 | J.S. was funded by a Biotechnology and Biological Sciences Research Council (BBSRC) David Phillips Fellowship (BB/L024551/1) |
| Organization Name | 2.0% | PMC4160263 | National Institute of Biomedical Imaging and Bioengineering Grant R01 EB006745 Stanford Bio-X, the American Heart Association (Western States Affiliates) |
| Award Name | 1.5% | PMC4189622 | Seed Grant provided by Michigan Technological University (MTU) |
| Disclosure | 1.5% | PMC4147052 | In addition, Jin Jin also holds stock in Eli Lilly |
| Dedication | 0.5% | PMC4831668 | This paper is dedicated to José Luis García Ruano on occasion of his retirement |

Extract MeSH terms and analyze based on presence of acknowledgment

MeSH terms from PMC articles without acknowledgments tend to be clinically focused.

MeSH terms from PMC articles with acknowledgments tend to focus on fundamental research.

Extract nouns and verbs associated with acknowledged individuals

Words associated with acknowledged individuals, colored by gender: purple words are predominantly associated with men and green words are predominantly associated with women; grey is used for words that are equally associated with both genders. Larger words appear more frequently. Gender-specific words were preferentially selected.

Nouns

Verbs

We manually curated a list of keywords to group acknowledgments into six categories based on the type of contribution being acknowledged: Manuscript, Coordination, Procedure, Analysis, Materials, and Advice. For each category, we calculated the representation of female names.

| Category | Percent of female names |
| --- | --- |
| manuscript | 52.2% |
| coordination | 50.2% |
| procedure | 43.6% |
| analysis | 42.2% |
| materials | 37.4% |
| advice | 32.7% |
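
The keyword grouping described above could be sketched as follows. The keyword sets are hypothetical placeholders, since the project's curated list is not reproduced here, and `categorize` and `percent_female` are illustrative names rather than the project's own functions.

```python
from collections import defaultdict

# Hypothetical keywords for illustration; the curated list is not shown here.
CATEGORY_KEYWORDS = {
    "manuscript": {"manuscript", "proofreading", "editing"},
    "coordination": {"coordination", "organizing"},
    "procedure": {"sequencing", "experiments"},
    "analysis": {"analysis", "statistics"},
    "materials": {"reagents", "samples"},
    "advice": {"discussions", "advice", "comments"},
}


def categorize(sentence):
    """Return every category whose keywords appear in the sentence."""
    words = {w.strip(".,;:()").lower() for w in sentence.split()}
    return [c for c, kws in CATEGORY_KEYWORDS.items() if words & kws]


def percent_female(records):
    """records: iterable of (sentence, gender) pairs, gender 'female' or 'male'."""
    counts = defaultdict(lambda: {"female": 0, "total": 0})
    for sentence, gender in records:
        for category in categorize(sentence):
            counts[category]["total"] += 1
            counts[category]["female"] += gender == "female"
    return {c: 100.0 * v["female"] / v["total"] for c, v in counts.items()}


if __name__ == "__main__":
    sentence = "We thank Alexia Prskawetz for the fruitful discussions and remarks."
    print(categorize(sentence))
```

A sentence can match several categories in this sketch, so one acknowledged name may be counted more than once. Whether the project counted names this way is not stated.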

Contributors

Project pipeline

  • Literature review
    • Historical acknowledgments research
    • Gender in authorship/acknowledgments
  • Analysis of features
    • Extract acknowledgments from PMC
    • Analyze acknowledgment features
      • % PMC coverage, years, journals, MeSH terms, etc.
      • False negatives?
  • Natural Language Processing (NLP)
    • Names → infer gender with genderize.io
    • Organizations and objects
    • Acknowledged tasks
    • Task modifiers (stretch goal)
