Embarking on the 30DaysDuckDBChallenge, I dived into the FIFA and Titanic datasets, leveraging DuckDB, SQL, and Power BI to extract insights and enhance data analysis skills. This journey involved tasks ranging from basic SQL operations to advanced techniques, GitHub integration, and data visualization.
- Tasks Overview
- Technologies Used
- Code Samples
- Results and Findings
- Lessons Learned
- Acknowledgements
- How to Run the Code
- Conclusion
- Contact Information
The challenge encompassed diverse tasks, including data loading, cleaning, advanced SQL techniques, GitHub usage, Power BI visualization, and in-depth analyses of FIFA and Titanic datasets. Tasks explored survival rates, player statistics, family structures, and more.
- DuckDB
- SQL
- MotherDuck
- Power BI
- GitHub
SQL queries used for each day's challenge are provided in separate files within the repository. Refer to the specific challenge day for corresponding code samples.
In the initial phase of the challenge, I delved into the FIFA datasets, leveraging DuckDB for data processing. Noteworthy findings include:
-
Player Position Insights: CB (28.48%), GK (24.2%), ST (21.75%), and CM (9.27%) were dominant positions.
-
Player Value Analysis: The 'Value' column had diverse representations like '€100M' and '€10K.' Using smart parsing, I standardized these values for comprehensive analysis.
-
OVA and POT Rankings: Lionel Messi claimed the highest OVA of 93. The top 10 players with the highest OVA and POT were identified, aiding team scouting strategies.
Transitioning to Power BI, I created an interactive dashboard revealing insights such as:
-
Nationality Distribution: England led with 1,705 occurrences, followed by Germany and Spain.
-
Age vs. Potential Analysis: Younger players exhibited higher potential, peaking at 28 and declining after 42.
-
Top Players by Overall Rating: Lionel Messi (OVA 93) secured the top spot.
As I explored "DuckDB in Action," the book highlighted DuckDB's prowess in data analytics. I successfully uploaded Titanic datasets, discovering subsets in 'train.csv' (891 passengers) and 'test.csv' (418 passengers).
Learning advanced SQL techniques enhanced my querying capabilities. Highlights include:
-
Subquery Mastery: Explored various subquery types, integrating them seamlessly into SQL queries.
-
NTILE Window Functions: Utilized NTILE functions for impactful data bucketing.
Analyzing the Titanic dataset uncovered intriguing patterns:
-
Fare and Survival Rates: Higher fares correlated with increased survival rates.
-
Gender Disparities: Females had a significantly higher survival rate (82.62%) compared to males (12.93%).
-
Age and Survival Correlation: Younger individuals, especially children, exhibited higher survival rates.
A detailed examination of passenger details yielded valuable insights:
-
Survival based on Cabin Level: Cabin levels influenced survival rates, considering passengers sharing the same ticket.
-
Traveler Categorization: Solo travelers, family travelers, and mixed groups were identified based on ticket numbers and surnames.
-
Family Size Calculation: Family size was determined, considering SibSp, ParCh, and passengers sharing the same ticket.
In the recent phase of exploration, the focus was on extracting profound insights from the Titanic dataset:
-
Survival Rates in Third Class: Evident gender and vulnerability impact, aligning with historical evacuation priorities.
-
Survival Across Demographics: Intersectionality of factors like class and embarkation point significantly influences survival probabilities.
-
Advising Chances of Survival: Historical trends, embarkation disadvantages, and age group dynamics contribute to individual outcomes.
The following insights were uncovered from this analysis:
-
Solo travelers are predominant (848), "Sage family" has the largest family size, and adult men show a low survival rate (11%).
-
Men in third class have a survival rate of 37.74%, while females outperform males with 86.47% survival. Cherbourg stands out with the highest embarkation survival rate.
-
Passenger class influences survival rates, with first-class passengers having the highest rates. Southampton, Queenstown, and Cherbourg show varying embarkation survival rates.
-
Overall, the analysis unveils nuanced patterns in survival based on demographics, family dynamics, and socio-economic factors among Titanic passengers.
-
Enhanced SQL skills, learned advanced techniques, and gained proficiency in data visualization with Power BI.
-
Explored the efficiency and versatility of DuckDB for data analytics.
-
I would like to thank the organizers of the #30DaysDuckDBChallenge for providing this valuable learning experience.
-
I also appreciate the creators of DuckDB and Power BI for developing these amazing tools.
Follow these steps to execute the code and reproduce the analysis conducted during the 30DaysDuckDBChallenge:
- Clone this repository to your local machine using the following command:
git clone [repository_url]
-
Install DuckDB on your local machine by following the installation instructions provided on the DuckDB website.
-
Download and install Power BI from the official Microsoft website.
-
Create an account with Motherduck here.
-
Integrate DuckDB with Motherduck to enable collaborative data analysis. Follow the instructions provided in the DuckDB documentation on connecting to MotherDuck.
-
Navigate to the repository's directory on your local machine.
-
Open DuckDB and execute the SQL queries provided in the relevant files for each challenge day. Use the command-line interface or preferred SQL editor.
-
Import the CSV files generated during the analysis into Power BI.
-
Follow the instructions provided in the corresponding Power BI files to recreate the visualizations.
-
Adjust any necessary connections or configurations to match your local environment.
-
Once queries are executed and visualizations are recreated, explore the results to gain insights into the FIFA and Titanic datasets.
-
Dive into the interactive Power BI dashboard to visualize key patterns and trends.
By following these steps, you'll be able to reproduce the 30DaysDuckDBChallenge analysis on your local machine.
The #30DaysDuckDBChallenge was a rewarding journey that significantly expanded my knowledge and skills in data analysis. I encourage anyone interested in data exploration to participate in similar challenges and utilize DuckDB and Power BI for their powerful capabilities.
For any inquiries or collaborations, feel free to reach out:
Email: [bosedeasaji@gmail.com]
LinkedIn: [www.linkedin.com/in/abosede-igharo-a-a64271158]