Pundits regularly comment on football/soccer games without a solid grounding in statistics. They talk about players going through "a rough patch" or "a dry streak", paying little attention to how long such spells last or how regularly they occur. It is also rarely discussed which parameters actually correlate best with performance and how this changes over time. As a lifelong United fan, I'm interested in explaining what exactly went wrong after SAF retired in 2013. The work shared on this platform is an attempt to tackle these two issues.
To begin with, I decided to look solely at defenders of the "top six" sides. The code shared here is flexible, in that different teams and playing positions can be incorporated into the pipeline. I plan to do more of this myself, time permitting. I'm sharing an infographic summarizing the results obtained so far (you can check out the .pdf file if that's all you are interested in), as well as the two scripts used to scrape the data and generate the figures:
(1) scrape_PL_data.py
- the data was scraped from two sources: [1] https://footballapi.pulselive.com (performance parameters in the PL) and [2] https://www.transfermarkt.com (data on injuries of individual players). The first two dictionaries in the script control which seasons and players are included in the data (each season and player has a special code that needs to be added to the dictionary), and the third one does the same for injuries. Once the data is scraped, it is stored in a dataframe (where each row is a player and each column a parameter); the dataframe is then put in a dictionary with the season ID as the key, and the dictionary is saved to a .pkl file. If you are not interested in scraping data, you can just download the current premierleague_data.pkl file, where the scraped data is already stored.
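To illustrate the storage layout described above, here is a minimal sketch of a dictionary of per-season dataframes being pickled, loaded back, and stacked into one table. The season keys, player names, column names, values, file name and helper functions below are all made up for illustration; the actual keys and columns in premierleague_data.pkl may differ.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data: one dataframe per season, one row per player,
# one column per parameter. Names and numbers are placeholders.
season_a = pd.DataFrame(
    {"appearances": [30, 25], "tackles": [60, 48]},
    index=["Player A", "Player B"],
)
season_b = pd.DataFrame(
    {"appearances": [28, 33], "tackles": [51, 70]},
    index=["Player A", "Player C"],
)
data_by_season = {"season_1": season_a, "season_2": season_b}


def save_seasons(data, path):
    """Write the {season ID: dataframe} dictionary to a pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(data, fh)


def load_seasons(path):
    """Read the {season ID: dataframe} dictionary back from a pickle file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def stack_seasons(data):
    """Combine the per-season frames into one long frame with a 'season' column."""
    frames = []
    for season_id, df in data.items():
        frame = df.rename_axis("player").reset_index()
        frame.insert(0, "season", season_id)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Round trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "example_data.pkl"
    save_seasons(data_by_season, pkl_path)
    loaded = load_seasons(pkl_path)

stacked = stack_seasons(loaded)
print(stacked)
```

The stacked form is handy for following one player across seasons, e.g. `stacked[stacked["player"] == "Player A"]`. As with any pickle, only load .pkl files from a source you trust.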
(2) PL_data_visualizations
- the figures in the infographic were generated using this script. I've tried to structure it well, but feel free to let me know if something could be improved. You can use the code to look at other teams, such as CFC, AFC and TH, which for the most part did not make the infographic. When I add other analyses and upgrade the script, I may even write a GUI.
I'm also on the lookout for better data. I'd be curious to look at performance parameters across individual games. If such data is available to you & you wouldn't mind sharing it, feel free to contact me.