10.visualize-gender.Rmd

---
title: "Representation analysis of gender"
---

## Setups

```{r message = F, warning=F}
library(tidyverse)
library(lubridate)
library(rnaturalearth)
library(wru)
source("utils/r-utils.R")
theme_set(theme_bw() + theme(legend.title = element_blank()))
```

## Load data

Only keep articles from 2002 because few authors had gender predictions before 2002.
See [093.summary-stats](093.summary-stats.html) for more details.

```{r}
load("Rdata/raws.Rdata")

alpha_threshold <- qnorm(0.975)
gender_df <- read_tsv("data/gender/genderize.tsv")

pubmed_gender_df <- corr_authors %>%
  filter(year(year) >= 2002) %>%
  left_join(gender_df, by = "fore_name_simple")

iscb_gender_df <- keynotes %>%
  left_join(gender_df, by = "fore_name_simple")

start_year <- 1993
end_year <- 2019
n_years <- end_year - start_year
my_jours <- unique(pubmed_gender_df$journal)
my_confs <- unique(iscb_gender_df$conference)
n_jours <- length(my_jours)
n_confs <- length(my_confs)
```

## Prepare data frames for later analyses

- rbind results of race predictions in iscb and Pubmed
- pivot long
- compute mean, sd, marginal error

```{r}
iscb_pubmed <- iscb_gender_df %>%
  rename("journal" = conference) %>%
  select(year, journal, probability_male, publication_date) %>%
  mutate(
    type = "Keynote speakers/Fellows",
    adjusted_citations = 1
  ) %>%
  bind_rows(
    pubmed_gender_df %>%
      select(year, journal, probability_male, publication_date, adjusted_citations) %>%
      mutate(type = "Pubmed authors")
  ) %>%
  mutate(probability_female = 1 - probability_male) %>%
  pivot_longer(contains("probability"),
    names_to = "gender",
    values_to = "probabilities"
  ) %>%
  filter(!is.na(probabilities)) %>%
  group_by(type, year, gender) 

iscb_pubmed_sum <- iscb_pubmed %>%
  summarise(
    # n = n(),
    mean_prob = mean(probabilities, na.rm = T),
    se_prob = sd(probabilities, na.rm = T),
    # n = mean(n),
    me_prob = alpha_threshold * se_prob,
    .groups = "drop"
  )
# https://stats.stackexchange.com/questions/25895/computing-standard-error-in-weighted-mean-estimation
```


```{r}
# save(iscb_pubmed, file = 'Rdata/iscb-pubmed_gender.Rdata')
```


## Figures for paper

### Figure 2: ISCB Fellows and keynote speakers appear more evenly split between men and women than PubMed authors, but the proportion has not reached parity.

```{r fig.height=3}
fig_1 <- iscb_pubmed_sum %>%
  # group_by(year, type, gender) %>%
  gender_breakdown("main", fct_rev(type))
fig_1
ggsave("figs/gender_breakdown.png", fig_1, width = 5, height = 2.5, dpi = 600)
ggsave("figs/gender_breakdown.svg", fig_1, width = 5, height = 2.5)
```

```{r echo=FALSE}
iscb_pubmed_sum %>%
  # group_by(year, type, gender) %>%
  # summarise(mean_prob = mean(probabilities, na.rm = T), .groups = 'drop') %>%
  filter(year(year) > 2016, grepl("female", gender)) %>%
  group_by(type) %>%
  summarise(prob_female_avg = mean(mean_prob))
```

### Supplementary Figure S2 {#sup_fig_s1}

Additional fig. 1 with separated keynote speakers and fellows

```{r}
fig_1d <- iscb_pubmed %>%
  ungroup() %>%
  mutate(
    type2 = case_when(
      type == "Pubmed authors" ~ "Pubmed authors",
      journal == "ISCB Fellow" ~ "ISCB Fellows",
      type == "Keynote speakers/Fellows" ~ "Keynote speakers"
    )
  ) %>%
  group_by(type2, year, gender) %>%
  summarise(
    mean_prob = mean(probabilities),
    se_prob = sd(probabilities)/sqrt(n()),
    me_prob = alpha_threshold * se_prob,
    .groups = "drop"
  ) %>%
  gender_breakdown("main", fct_rev(type2)) +
  scale_x_date(
    labels = scales::date_format("'%y"),
    expand = c(0, 0)
  )
```

<!-- Increasing trend of honorees who were women in each honor category, especially in the group of ISCB Fellows, which markedly increased after 2015.  -->

```{r eval=FALSE, include=FALSE}
# By conference:
# fig_1d <- bind_rows(iscb_gender) %>%
#   gender_breakdown(category = 'sub', journal) +
#   theme(legend.position = 'bottom')

# fig_1d
ggsave("figs/fig_s1.png", fig_1d, width = 7, height = 3)
ggsave("figs/fig_s1.svg", fig_1d, width = 7, height = 3)
```

## Mean and standard deviation of predicted probabilities

```{r}
iscb_pubmed_sum %>%
  filter(gender == "probability_male") %>%
  gam_and_ci(
    df2 = iscb_pubmed %>% filter(gender == "probability_male"),
    start_y = start_year, end_y = end_year
  ) +
  theme(legend.position = c(0.88, 0.2))
```

## Hypothesis testing

```{r echo = F}
get_p <- function(inte, colu) {
  broom::tidy(inte) %>%
    filter(term == "probabilities") %>%
    pull(colu) %>%
    sprintf("%0.5g", .)
}
```

```{r}
iscb_lm <- iscb_pubmed %>%
  filter(gender == "probability_female", !is.na(probabilities)) %>%
  mutate(type = as.factor(type)) %>% 
  mutate(type = type %>% relevel(ref = "Pubmed authors"))
```

```{r}
scaled_iscb <- iscb_lm %>%
  filter(year(year) >= 2002)
# scaled_iscb$s_prob <- scale(scaled_iscb$probabilities, scale = F)
# scaled_iscb$s_year <- scale(scaled_iscb$year, scale = F)

main_lm <- glm(type ~ year + probabilities,
  data = scaled_iscb, # %>% mutate(year = as.factor(year))
  family = "binomial"
)

broom::tidy(main_lm)
inte_lm <- glm(
  # type ~ scale(year, scale = F) * scale(probabilities, scale = F),
  # type ~ s_year * s_prob,
  type ~ year * probabilities,
  data = scaled_iscb, # %>% mutate(year = as.factor(year))
  family = "binomial"
)
broom::tidy(inte_lm)
anova(main_lm, inte_lm, test = "Chisq")
# mean(scaled_iscb$year)
# mean(scaled_iscb$probabilities)
```

The two groups of scientists did not have a significant association with the gender predicted from fore names (_P_ = `r get_p(main_lm, 'p.value')`).
Interaction terms do not predict `type` over and above the main effect of gender probability and year.

```{r include=FALSE, eval=FALSE}
# inte_lm <- glm(type ~ (year * probabilities),
#    data = iscb_lm,
#    family = 'binomial')
```

```{r}
sessionInfo()
```