Feminist books data project: The process

This is a repository with the notebooks and tables used in my 'feminist books project', which analyses the Penguin Random House catalogue of books in Spanish. You can check it out here.

In this first project, my aim was to understand how books about feminism have evolved in recent years. There was a huge wave of books on feminism some years ago, and I feel this is no longer the case. So my initial question was: are there more or fewer new books about feminism now than five years ago?

I focused on one country, Spain, where I come from, and the leading publishing house, Penguin Random House. I limited the analysis to the number of books, years and related genres in order to show a glimpse of the current situation and trends.

The main findings are:

Penguin's online collection of books in Spanish contains 118 titles labelled "feminist". For scale, the company publishes 1,800 new titles each year.
There are only 3 titles in the current collection published before 2014.
Related genres are sometimes surprising: for example, far more feminist books are also labelled "influencers' books" than "LGBTIQ literature".

Data collection had two steps. First, I scraped Penguin Random House's feminist collection pages to obtain the titles and links. Then I scraped each book's page and collected its publication year and related genres. Each book page looked like this. There were 118 of them, so running the code took some time.
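The first scraping step can be sketched roughly as follows. This is a minimal illustration and not the code from the scraping notebook: it uses only Python's standard library, and the `book-link` class name, the URLs and the sample HTML are hypothetical placeholders, since the real markup of the collection pages is not described here.

```python
from html.parser import HTMLParser


class BookLinkParser(HTMLParser):
    """Collect (title, href) pairs from anchors carrying a given CSS class."""

    def __init__(self, link_class="book-link"):  # hypothetical class name
        super().__init__()
        self.link_class = link_class
        self.books = []     # finished (title, href) pairs
        self._href = None   # href of the book anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.link_class in classes:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.books.append((title, self._href))
            self._href = None


def extract_book_links(html):
    """Return the (title, href) pairs found in one collection page."""
    parser = BookLinkParser()
    parser.feed(html)
    return parser.books


# Invented sample markup, for illustration only.
sample_html = """
<ul>
  <li><a class="book-link" href="/libro/ejemplo-uno">Ejemplo uno</a></li>
  <li><a class="nav" href="/ayuda">Ayuda</a></li>
  <li><a class="book-link" href="/libro/ejemplo-dos">Ejemplo dos</a></li>
</ul>
"""

print(extract_book_links(sample_html))
```

Each collected link would then be fetched in turn to read the publication year and related genres from the book's own page.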

After collecting the data, I placed it in two different dataframes: one with the books (title, author, year, link) and another with the genres. I chose to have two because most of the books fall under several genres, so keeping everything in a single table carried a real risk of assigning genres to the wrong places. I was also less interested in which book had which related genres than in a more global reading, so it made even more sense to keep them separate.
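The two-table layout can be sketched like this. It is an assumption-laden illustration in plain Python rather than the pandas code from the notebook: the `split_records` helper, the record shape and the sample rows are all invented for the example.

```python
from collections import Counter


def split_records(records):
    """Split scraped records into a books table and a genres table."""
    books, genres = [], []
    for record in records:
        # One row per book, without its genres.
        books.append({key: record[key] for key in ("title", "author", "year", "link")})
        # One row per genre label; the link back to the book is dropped on purpose.
        genres.extend({"genre": genre} for genre in record["genres"])
    return books, genres


# Placeholder records, invented for illustration only (not real catalogue data).
records = [
    {"title": "Book A", "author": "Author A", "year": 2019,
     "link": "/libro/a", "genres": ["Genre X", "Genre Y"]},
    {"title": "Book B", "author": "Author B", "year": 2021,
     "link": "/libro/b", "genres": ["Genre X"]},
]

books, genres = split_records(records)
genre_counts = Counter(row["genre"] for row in genres)
print(genre_counts.most_common())
```

Because the genres table holds one row per label, counting labels gives the global reading directly, with no need to track which book each label came from.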

I then exported both as tables in CSV format, as I wanted to keep the scraping notebook for scraping only. Pandas value_counts showed me some duplicates in the books dataframe, and I cleaned the CSV manually following the information shown in the scraping notebook. I then opened a second notebook just to analyse the tables as dataframes. Finally, I turned the results of the analysis into charts using Datawrapper.
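The duplicate check and the export can be approximated as below. This sketch swaps in `collections.Counter` and the `csv` module from the standard library in place of pandas; counting titles this way gives the same tallies that `value_counts` would report on a title column. The helper names and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def duplicated_titles(rows):
    """Return {title: count} for every title that appears more than once."""
    counts = Counter(row["title"] for row in rows)
    return {title: n for title, n in counts.items() if n > 1}


def to_csv_text(rows, fieldnames):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows, invented for illustration only.
rows = [
    {"title": "Book A", "year": 2019},
    {"title": "Book B", "year": 2021},
    {"title": "Book A", "year": 2019},
]

print(duplicated_titles(rows))
print(to_csv_text(rows, ["title", "year"]))
```

The check only flags repeated titles; deciding which copy to keep stays a manual step, as it was in the project.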

Here is the scraping notebook.
Here is the analysis notebook.
Here is the books table.
Here is the related genres table.

A section about new skills, etc

My scraping skills have grown a lot, along with a little passion/obsession for it. I've also learnt to use scatterplots creatively and to manage the whole workflow, from scraping to visualisation.

A section about things I tried

My initial plan was to analyse the words used in the titles and build a classification. Many titles refer to body parts, the natural world, and motherhood or its absence. However, this required deeper qualitative analysis and several editorial decisions, so I chose not to go down that road for the time being.

I also tried to scrape each book's description for the same purpose, but when Penguin blocked me I accepted that it was time to move on to something else. Before this, my very first plan was to also scrape the second-largest publishing house in Spain and compare the results. Limited time was also the reason I did not pursue this plan.

In the visualisation part, I tried to include more 'related genres' comparisons so the squares in the graph would look like little books - which was the idea I had in mind for that visual. But things got confusing pretty quickly.

If I had developed this project in a newsroom, I would have definitely contacted Penguin Random House for comments and questions about their data.

