I have been a huge fan of football for a large part of my life, so this task seemed too good to be true. I will be manipulating and visualising the data through exploratory data analysis of a football dataset that dates back to approximately the 2017/18 season.
First we import the required libraries in our Jupyter Notebook Football_Players.ipynb and load the dataset.
Built with: Jupyter Notebook, Python 3.10, Pandas, NumPy, Seaborn, Matplotlib
Let's take a look at all the different clubs from across the world in this dataset with the help of the unique function.
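As a rough illustration of that step, here is a minimal sketch on a toy table. The column names "Name" and "Club" are assumptions for the example and may differ from the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; "Name" and "Club" are assumed column names.
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Club": ["Real Madrid", "Manchester City", "Real Madrid", "Olympique Lyonnais"],
})

# unique() returns the distinct values in order of first appearance.
clubs = players["Club"].unique()
# nunique() gives just the number of distinct clubs.
n_clubs = players["Club"].nunique()

print(clubs)
print(n_clubs)
```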
We usually don't use a club's full name; for example, Olympique Lyonnais is better known as OL Lyon.
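If the goal is to swap full names for their familiar short forms, one possible approach is a lookup dictionary passed to replace. The mapping and the "Club" column name below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy data; "Club" is an assumed column name.
players = pd.DataFrame({
    "Club": ["Olympique Lyonnais", "Real Madrid", "Olympique Lyonnais"],
})

# Hypothetical mapping from full club names to the short names we actually use.
short_names = {"Olympique Lyonnais": "OL Lyon"}

# replace() swaps exact matches and leaves every other value untouched.
players["Club"] = players["Club"].replace(short_names)

print(players["Club"].tolist())
```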
When we play FIFA, we tend to think about the top players, so let's view the 20 best players in the world at the time.
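A minimal sketch of how such a ranking could be pulled out, assuming the rating lives in a numeric column called "Overall" (an assumed name). The helper top_players is hypothetical and defaults to the 20 players mentioned above.

```python
import pandas as pd

def top_players(df: pd.DataFrame, n: int = 20, rating_col: str = "Overall") -> pd.DataFrame:
    """Return the n highest-rated rows, best first (hypothetical helper)."""
    return df.nlargest(n, rating_col)

# Toy data; "Name" and "Overall" are assumed column names.
players = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Overall": [88, 94, 79, 91, 85],
})

# With only five toy rows we ask for the top 3 instead of the top 20.
top = top_players(players, n=3)
print(top)
```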
A bit of data cleaning here and there makes the dataset easier to work with.
As casual viewers of football, or of sports in general, we tend to know only the players at teams that compete at the highest level, for example Real Madrid or Manchester City. So let's form a filtered dataset of players who belong to this elite tier.
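One way to build such a filtered dataset is with isin. The club list below contains only the two examples named above, and the column names are assumptions.

```python
import pandas as pd

# Toy data; "Name" and "Club" are assumed column names.
players = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Club": ["Real Madrid", "Manchester City", "Some Other Club", "Real Madrid"],
})

# Illustrative list of top-tier clubs; extend it as needed.
elite_clubs = ["Real Madrid", "Manchester City"]

# isin() builds a boolean mask; copy() avoids modifying a view of the original.
elite = players[players["Club"].isin(elite_clubs)].copy()
print(elite)
```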
Let's observe the nationalities of the players who play for each of these teams.
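A possible way to tabulate this is a cross-tabulation of club against nationality. This is a sketch on toy data, with "Club" and "Nationality" as assumed column names.

```python
import pandas as pd

# Toy data; "Club" and "Nationality" are assumed column names.
players = pd.DataFrame({
    "Club": ["Real Madrid", "Real Madrid", "Real Madrid",
             "Manchester City", "Manchester City"],
    "Nationality": ["Spain", "Spain", "Brazil", "England", "Belgium"],
})

# One row per club, one column per nationality, cells hold player counts.
nationality_table = pd.crosstab(players["Club"], players["Nationality"])
print(nationality_table)
```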
With seaborn's pairplot we can plot every pairwise combination of the numeric columns and compare them at a glance.
It's time we started correlating various attributes of our players, for example their value in $ and their age.
We observe that as players get older their value goes down, since they become less efficient and their potential usually drops.
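The relationship can also be checked numerically with a correlation coefficient. The sketch below uses made-up numbers chosen to mimic the trend, and assumes "Age" and "Value" are numeric columns (the value here is in millions of $).

```python
import pandas as pd

# Made-up toy values, not taken from the real dataset.
players = pd.DataFrame({
    "Age": [21, 24, 27, 30, 33, 36],
    "Value": [40.0, 35.0, 30.0, 20.0, 10.0, 5.0],  # millions of $
})

# Pearson correlation: close to -1 means value falls as age rises.
r = players["Age"].corr(players["Value"])
print(round(r, 2))
```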
We use univariate and bivariate analysis with the help of seaborn's distplots, pairplots and jointplots to identify various trends, for example that players with an overall rating of around 80-82 are valued at approximately 10-20 million $.
Let us try to predict the potential of players using techniques like logistic regression. We import the scikit-learn library to implement this.
Using a train-test split, we create training and test datasets for our model.
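A minimal sketch of the split with scikit-learn's train_test_split. The feature columns, the 80/20 ratio and the random_state are assumptions for illustration, not necessarily what the notebook uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; "Age", "Overall" and "Potential" are assumed column names.
players = pd.DataFrame({
    "Age":       [19, 21, 23, 25, 27, 29, 31, 33, 35, 37],
    "Overall":   [70, 74, 78, 82, 85, 86, 84, 81, 78, 74],
    "Potential": [85, 86, 86, 87, 87, 86, 84, 81, 78, 74],
})

X = players[["Age", "Overall"]]   # features
y = players["Potential"]          # target

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```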
We see that the score is too low, so logistic regression is a poor fit for this task.
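For reference, here is a sketch of how such a score can be obtained. The data is synthetic and the feature choice is an assumption, so the number printed says nothing about the real dataset. Note that LogisticRegression.score reports mean accuracy, and that every distinct integer value of the target is treated as its own class.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; column names and relationships are assumptions.
rng = np.random.default_rng(0)
n = 200
players = pd.DataFrame({
    "Age": rng.integers(17, 38, size=n),
    "Overall": rng.integers(60, 90, size=n),
})
players["Potential"] = players["Overall"] + rng.integers(0, 6, size=n)

X = players[["Age", "Overall"]]
y = players["Potential"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each distinct Potential value becomes a separate class for the classifier.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() is the fraction of test rows whose class is predicted exactly.
accuracy = model.score(X_test, y_test)
print(f"accuracy: {accuracy:.2f}")
```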
We also fit a k-nearest neighbours (KNN) model and use a confusion matrix to assess its performance. The model severely underperforms here, which suggests our data needs to be cleaned further.
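A sketch of the KNN and confusion matrix step, again on synthetic data. The features, the choice of n_neighbors=5 and the split ratio are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; column names and relationships are assumptions.
rng = np.random.default_rng(1)
n = 200
players = pd.DataFrame({
    "Age": rng.integers(17, 38, size=n),
    "Overall": rng.integers(60, 90, size=n),
})
players["Potential"] = players["Overall"] + rng.integers(0, 6, size=n)

X = players[["Age", "Overall"]]
y = players["Potential"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are true classes, columns are predicted classes;
# correct predictions sit on the diagonal.
cm = confusion_matrix(y_test, y_pred)
print(cm.shape, cm.trace(), cm.sum())
```

The share of predictions on the diagonal (cm.trace() / cm.sum()) equals the accuracy, so a matrix with most of its mass off the diagonal is what an underperforming model looks like.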