Skip to content

Commit

Permalink
genderize.io API chapter (#12)
Browse files Browse the repository at this point in the history
* WIP

* WIP

* first complete draft
  • Loading branch information
internaut authored Jan 11, 2022
1 parent 3e5e69f commit 7e7fe21
Show file tree
Hide file tree
Showing 2 changed files with 263 additions and 1 deletion.
198 changes: 198 additions & 0 deletions 99-genderize_API.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,198 @@
# Genderize.io API

<chauthors>Markus Konrad (Wissenschaftszentrum Berlin für Sozialforschung – WZB)</chauthors>
<br><br>

```{r cache-ch2, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache=TRUE)
```

## Provided services/data

* *What data/service is provided by the API?*

The [Genderize.io API](https://genderize.io/) provides a service for predicting the likely gender of a person given their name. The API is provided by Danish company *Demografix ApS* that also provides APIs for predicting age ([Agify.io](https://agify.io/)) and nationality ([Nationalize.io](https://nationalize.io/)) for a given name (more on that later).^[The author of this chapter is in no way affiliated with Demografix ApS.] The service is useful to augment a dataset of individuals with their likely gender, when at least the individuals' given name is known.

The results provided by the API should be taken with care. As with many commercial APIs, the exact data sources and methods for the genderize.io API are not disclosed. The [dedicated "Our Data" page](https://genderize.io/our-data) only states that *"[o]ur data is collected from all over the web"* and provides a list with the amount of data that was collected for many countries (a total of over 114M entries at time of writing). @wais_gender_2016 claims that scraped public social media profiles are used as data source.

Another problem is that there's only a binary categorization of gender. The categorization comes with a prediction probability estimate which in turn depends on the popularity of the name and of course the name itself (since there are many unisex names). Furthermore, gender prediction for a name may depend on country and year of birth (the Genderize.io API allows for country-specific results). One also has to keep in mind different name orders in different cultures. E.g., in North and South America as well as most of Europe it is common that the given name, from which a gender may be predicted, comes first before the family name, whereas in East Asia the given name often comes last.

So in general, predicting an individual's gender from their name is not an easy task. See @wais_gender_2016 for an overview on modern gender prediction methods. You should only consider using the methods in this chapter when there's no other way to obtain the gender information. In case you use the API, you should

- transparently report the limits of the name-based approach,
- use country-specific results whenever possible (see below),
- report the accuracy of the predictions,
- and use a threshold for the minimum accuracy and/or incorporate the prediction accuracy into your models.

For gender prediction, there are also alternatives to using this API:

- the [gender package](https://cran.rstudio.com/web/packages/gender/) provides gender prediction using historical data from the US and a few European countries (see @blevins2015jane)
- [NameCensus.com](https://namecensus.com/) provides lists of most common female and male first names in the US
- WikiPedia provides [categories for given names](https://en.wikipedia.org/wiki/Category:Given_names) by gender


## Prerequesites

At the time of writing, the API can be queried with up to 1000 names per day for free. There's not even an API key required for the free tier. However, if you require more than 1000 API requests per day, you need to obtain an API key from [store.genderize.io](https://store.genderize.io/) – see [this page](https://store.genderize.io/pricing) for pricing.


## Simple API call

* *What does a simple API call look like?*

The API is very simple and basically accepts two parameters for an HTTP GET request:

1. `name` as the given name for which gender prediction is performed; an array of up to 10 names per request can be send
2. `country_id` as optional localization parameter (given as [ISO 3166-1 alpha-2](http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) country code)

When no `country_id` is given, the gender prediction is performed using a database of given names for *all* countries with a notable bias towards Western countries (see the numbers on the ["Our Data" page](https://genderize.io/our-data)). If country information is known, you should provide it in the API request, as this gives more accurate, context-aware results. Especially if you're working with names outside the Western cultural sphere, you should be aware of the Western bias in the datasets used for the predictions and make use of the localization feature.

We can perform a sample request using the `curl` command in a terminal or by simply visiting the URL in a browser:

```{bash, eval=FALSE}
curl 'https://api.genderize.io?name=sasha'
```


The result is an HTTP response with JSON formatted data which contains the predicted gender, the prediction probability estimate and the count of entries which informed the prediction. For the example requests above, the API responds with:

```{text eval=FALSE}
{
"name": "sasha",
"gender": "male",
"probability": 0.51,
"count": 13219
}
```

This tells us that for the requested name "sasha"^[Experiments showed that the API is not case-sensitive, i.e. it doesn't matter if you query the name "sasha", "Sasha" or "SASHA".], the gender was predicted as male, but only with probability 0.51. This makes sense since this name is considered a unisex name in many countries. The prediction is based on 13219 samples in the database, which seems very solid.

Now to show the influence of localization, we try the German variant of this name, "Sascha", and append the `country_id` parameter for Germany:

```{bash, eval=FALSE}
curl 'https://api.genderize.io?name=sascha&country_id=DE'
```

```{text eval=FALSE}
{
"name": "sascha",
"gender": "male",
"probability": 0.99,
"count": 22408,
"country_id": "DE"
}
```

We can see that the localized request for Germany predicts "sascha" as male with 99% probability, based on 22408 database entries.

Interestingly, only the Latinized forms of Sasha seem to be available in the database. Neither the Cyrillic form Саша, nor other non-Latin forms like Saša return results. However, experiments with other names in non-Latin alphabets show a pattern: The German name "Jürgen" has about 700 entries in the genderize database, while "Jurgen" has almost 4000 entries. The Turkish name "Gül" exists only 36 times but "Gul" gives almost 5000 entries. A similar pattern is seen when using accents: "André" exists three times, but "Andre" more than 64,000 times. So in general, it seems you should convert all non-Latin (or non-ASCII) characters in a name to Latin counterparts in order to get better results.

You can send up to ten names per request, by concatenating several `name[]=...` parameters:

```{bash, eval=FALSE}
curl 'https://api.genderize.io?name[]=sasha&name[]=alex&name[]=alexandra'
```

The predictions are then listed for each supplied name:

```{text, eval=FALSE}
[
{"name": "sasha", "gender": "male", "probability": 0.51, "count": 13219},
{"name": "alex", "gender": "male", "probability": 0.9, "count": 411319},
{"name": "alexandra", "gender": "female", "probability": 0.98, "count": 122985}
]
```


## API access

* *How can we access the API from R?*

There are several packages for R that provide convenient functions for communicating with the genderize.io API:

- [DemografixeR (CRAN)](https://cloud.r-project.org/web/packages/DemografixeR/index.html)[Package website](https://matbmeijer.github.io/DemografixeR/)
- [GenderGuesser](https://github.com/eamoncaddigan/GenderGuesser)
- [genderizeR](https://github.com/kalimu/genderizeR)^[At time of writing, the package was no longer maintained and not available on CRAN anymore.]

Since DemografixeR is the only package available on CRAN at time of writing, I will use this package for further demonstration. You can install the package via `install.packages('DemografixeR')`.

### Load package

Once installed, the package can be loaded with the following command:

```{r}
library(DemografixeR)
```

### The `genderize` function and its arguments

The main function to use is the `genderize` function. The first argument is the one or more names (as character string vector) for which you want to predict the gender. So to replicate the first API call from the previous section in R, we could write:

```{r}
genderize('sasha')
```

Note that the output only consists of the gender prediction as character string vector. This is a dangerous default behavior, as it omits important information about the prediction probability and the size of the data pool used for the prediction. We need to set the `simplify` argument to `FALSE` in order to get that information in the form of a dataframe:

```{r}
genderize('sasha', simplify = FALSE)
```

Again, we can localize the request by using the `country_id` parameter:

```{r}
genderize('sascha', country_id = 'DE', simplify = FALSE)
```

Supplying a character string vector will predict the gender of all these names. Note that with the `genderize` function, you're not limited to ten names as when using the API directly. Here, we predict the gender of six names in their original and Latinized variant each. This also shows the higher counts when using only Latin characters in the query:

```{r}
genderize(c('gül', 'gul', 'jürgen', 'jurgen', 'andré', 'andre',
'gökçe', 'gokce', 'jörg', 'jorg', 'rené', 'rene'),
simplify = FALSE)
```

You can also provide a different `country_id` for each name in the request:

```{r}
genderize(c('sasha', 'sascha'), country_id = c('RU', 'DE'), simplify = FALSE)
```

This is especially helpful together with `expand.grid`, which generates all combinations of values in the two vectors:

```{r}
names <- c('sasha', 'sascha')
countries <- c('RU', 'DE')
(names_cntrs <- expand.grid(names = names, countries = countries,
stringsAsFactors = FALSE))
```

```{r}
genderize(names_cntrs$names, country_id = names_cntrs$countries, simplify = FALSE)
```

Lastly, you can set the parameter `meta` to `TRUE`. This will add additional columns to the result with your rate limit (maximum daily number of requests), the remaining number of requests, the seconds until rate limit reset and the time of the request:

```{r}
genderize('judy', simplify = FALSE, meta = TRUE)
```

### Provide API key

If you bought an API key, you can provide it using the `apikey` parameter. It's however recommended to use the `save_api_key` function to safely store such an API key. It will then automatically be used for each request.

Please be careful when dealing with API keys and never publish them.

### Functions for access to other APIs

The package also provides access to the APIs for predicting age ([Agify.io](https://agify.io/)) and nationality ([Nationalize.io](https://nationalize.io/)) from a name. Examples on how to do that are given on the [package's website](https://matbmeijer.github.io/DemografixeR/). However, such predictions are even more problematic than the gender predictions and should never be trusted. Even the examples given on the API's respective websites showcase foolish predictions: A "Michael" is predicted as being 70 years old, living either in the US (9% probability), in Australia (6%) or New Zealand (5%). Pinpointing an age using only a given name is nonsense and the country predictions simply won't help you much for many names, given how many names are internationally used.

## Social science examples

* *Are there social science research examples using the API?*

There seem to be several bibliometric studies that focus on the gender publication gap, which use the genderize.io API to estimate the gender of journal paper authors. Two notable examples are @holman_gender_2018 and @shen_persistent_2018.

@hipp_has_2021 used the genderize.io API to predict the gender from names of GitHub users (for those that provided a valid given name). This was then used to analyze the different impact of the COVID-19 pandemic on the productivity of female and male software developers.

Other examples (also outside academia) are listed on the ["Use cases" page](https://genderize.io/use-cases).
66 changes: 65 additions & 1 deletion references_overall.bib
Original file line number Diff line number Diff line change
Expand Up @@ -529,6 +529,71 @@ @article{fowler2021
doi = {10.1017/S0003055420000696}
}

@article{wais_gender_2016,
title = {Gender {Prediction} {Methods} {Based} on {First} {Names} with {genderizeR}},
volume = {8},
issn = {2073-4859},
url = {https://journal.r-project.org/archive/2016/RJ-2016-002/index.html},
doi = {10.32614/RJ-2016-002},
language = {en},
number = {1},
urldate = {2022-01-03},
journal = {The R Journal},
author = {Wais, Kamil},
year = {2016},
pages = {17},
}

@article{blevins2015jane,
title={Jane, John... Leslie? A Historical Method for Algorithmic Gender Prediction.},
author={Blevins, Cameron and Mullen, Lincoln},
journal={DHQ: Digital Humanities Quarterly},
volume={9},
number={3},
year={2015}
}


@article{hipp_has_2021,
title = {Has {Covid}-19 increased gender inequalities in professional advancement? {Cross}-country evidence on productivity differences between male and female software developers},
url = {https://ubp.uni-bamberg.de/jfr/index.php/jfr/article/view/697},
doi = {10.20377/jfr-697},
journal = {Journal of Family Research},
author = {Hipp, Lena and Konrad, Markus},
month = sep,
year = {2021},
}

@techreport{shen_persistent_2018,
title = {Persistent {Underrepresentation} of {Women}’s {Science} in {High} {Profile} {Journals}},
url = {https://www.biorxiv.org/content/10.1101/275362v2},
language = {en},
urldate = {2022-01-05},
author = {Shen, Yiqin Alicia and Webster, Jason M. and Shoda, Yuichi and Fine, Ione},
month = mar,
year = {2018},
doi = {10.1101/275362}
}

@article{holman_gender_2018,
title = {The gender gap in science: {How} long until women are equally represented?},
volume = {16},
issn = {1545-7885},
shorttitle = {The gender gap in science},
url = {https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2004956},
doi = {10.1371/journal.pbio.2004956},
language = {en},
number = {4},
urldate = {2022-01-05},
journal = {PLOS Biology},
author = {Holman, Luke and Stuart-Fox, Devi and Hauser, Cindy E.},
month = apr,
year = {2018},
note = {Publisher: Public Library of Science},
keywords = {Bibliometrics, Careers, Computer and information sciences, Medical journals, Open access medical journals, Open access publishing, Scientific publishing, Sexual and gender issues},
pages = {e2004956},
}

@article{roberts2021media,
title={{Media Cloud: Massive Open Source Collection of Global News on the Open Web}},
author={Roberts, Hal and Bhargava, Rahul and Valiukas, Linas and Jen, Dennis and Malik, Momin M and Bishop, Cindy and Ndulue, Emily and Dave, Aashka and Clark, Justin and Etling, Bruce and others},
Expand Down Expand Up @@ -607,4 +672,3 @@ @Article{puschmann:2021
publisher = {Open Science Framework},
copyright = {Creative Commons Zero v1.0 Universal}
}

0 comments on commit 7e7fe21

Please sign in to comment.