{ "cells": [ { "metadata": { "_uuid": "9f550639a0754f42a243e5785d895d24ba655515" }, "cell_type": "markdown", "source": " ## <div align=\"center\"> 10 Steps to Become a Data Scientist</div>\n <div align=\"center\">**quite practical and far from any theoretical concepts**</div>\n<div style=\"text-align:center\">last update: <b>23/12/2018</b></div>\n<img src=\"http://s9.picofile.com/file/8338833934/DS.png\"/>" }, { "metadata": { "_uuid": "e02d495da0fb0ad24e0341e91848f4c4cfc35bdb" }, "cell_type": "markdown", "source": "\n\n---------------------------------------------------------------------\nFork and run this course on GitHub:\n> #### [ GitHub](https://github.com/mjbahmani/10-steps-to-become-a-data-scientist)\n\n\n-------------------------------------------------------------------------------------------------------------\n <b>I hope you find this kernel helpful; some <font color=\"red\">UPVOTES</font> would be very much appreciated.</b>\n \n -----------\n" }, { "metadata": { "_uuid": "85b27cf82d3023fd69c338df2be7afb2d7afaf32" }, "cell_type": "markdown", "source": " <a id=\"top\"></a> <br>\n**Notebook Content**\n\n [Introduction](#Introduction)\n1. [Python](#1)\n1. [Python Packages](#PythonPackages)\n1. [Mathematics and Linear Algebra](#Algebra)\n1. [Programming & Analysis Tools](#Programming)\n1. [Big Data](#BigData)\n1. [Data Visualization](#Datavisualization)\n1. [Data Cleaning](#DataCleaning)\n1. [How to Solve a Problem?](#Howto)\n1. [Machine Learning](#MachineLearning)\n1. [Deep Learning](#DeepLearning)" }, { "metadata": { "_uuid": "2a77b410b99632c4d99b652c226178cb1ff10b51" }, "cell_type": "markdown", "source": " <a id=\"Introduction\"></a> <br>\n# Introduction\nIf you read **job ads** for machine learning experts or data scientists, you will find that certain skills are required to get the job. In this kernel, I want to review **10 skills** that are essential to get the job. 
In fact, this kernel is a reference to **10 other kernels**, with which you can learn all of the skills that you need. \n\n**Ready to learn**! You will learn 10 skills as a data scientist: \n\n1. [Learn Python](https://www.kaggle.com/mjbahmani/the-data-scientist-s-toolbox-tutorial-1)\n1. [Learn Python packages](https://www.kaggle.com/mjbahmani/the-data-scientist-s-toolbox-tutorial-2) \n1. [Linear Algebra for Data Scientists](https://www.kaggle.com/mjbahmani/linear-algebra-for-data-scientists)\n1. [Programming & Analysis Tools](https://www.kaggle.com/mjbahmani/machine-learning-workflow-for-house-prices)\n1. [Big Data](https://www.kaggle.com/mjbahmani/a-data-science-framework-for-quora)\n1. [Top 5 Data Visualization Libraries Tutorial](https://www.kaggle.com/mjbahmani/top-5-data-visualization-libraries-tutorial)\n1. [Data Cleaning](https://www.kaggle.com/mjbahmani/some-eda-for-elo)\n1. [How to Solve a Problem?](https://www.kaggle.com/mjbahmani/a-data-science-framework-for-quora)\n1. [Machine Learning](https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python)\n1. [Deep Learning](https://www.kaggle.com/mjbahmani/top-5-deep-learning-frameworks-tutorial) \n\n###### [go to top](#top)" }, { "metadata": { "_uuid": "5efeff35ad9951e40551d0763eaf26f08bb4119e" }, "cell_type": "markdown", "source": " <a id=\"1\"></a> <br>\n## 1- Python\nThe first step in this course for beginners is to learn Python quickly, in three days.\nIt takes just **10 hours** to learn Python.\n\nTo read this section, **please** fork and run the following kernel:\n\n[Learn Python](https://www.kaggle.com/mjbahmani/the-data-scientist-s-toolbox-tutorial-1)\n \n ###### [go to top](#top)" }, { "metadata": { "_uuid": "1a8697f93952e076f6f949997676d40518d7b5a6" }, "cell_type": "markdown", "source": "<a id=\"PythonPackages\"></a> <br>\n## 2- Python Packages\nIn the second step, we will learn the libraries that are essential for any specialist.\n1. NumPy\n1. Pandas\n1. Matplotlib\n1. 
Seaborn\n1. TensorFlow\n1. NLTK\n1. scikit-learn\n\nand so on\n\n<img src=\"http://s8.picofile.com/file/8338227868/packages.png\">\n\nTo read this section, **please** fork and run these kernels:\n\n1. [The Data Scientist's Toolbox Tutorial 1](https://www.kaggle.com/mjbahmani/the-data-scientist-s-toolbox-tutorial-1)\n\n1. [The Data Scientist's Toolbox Tutorial 2](https://www.kaggle.com/mjbahmani/the-data-scientist-s-toolbox-tutorial-2)\n\n###### [go to top](#top)" }, { "metadata": { "_uuid": "ad8fa54ba57aa4a336080eb044109702c743d7a0" }, "cell_type": "markdown", "source": "<a id=\"Algebra\"></a> <br>\n## 3- Mathematics and Linear Algebra\nLinear algebra is the branch of mathematics that deals with vector spaces. A good understanding of linear algebra is essential for analyzing machine learning algorithms, especially in deep learning, where so much happens behind the curtain. You have my word that I will try to keep mathematical formulas and derivations out of this thoroughly mathematical topic, and I will try to cover every subject that you need as a data scientist.\n\n<img src=\"https://s3.amazonaws.com/www.mathnasium.com/upload/824/images/algebra.jpg\" height=\"300\" width=\"300\">\n\nTo read this section, **please** fork and run this kernel:\n\n[Linear Algebra for Data Scientists](https://www.kaggle.com/mjbahmani/linear-algebra-for-data-scientists)\n###### [go to top](#top)" }, { "metadata": { "_uuid": "697ba206ad7adf4d99814cb1d89375b745eaba19" }, "cell_type": "markdown", "source": "<a id=\"Programming\"></a> <br>\n## 4- Programming & Analysis Tools\n\nThis section is not complete yet. For an alternative, **please** fork and run this kernel:\n\n[Programming & Analysis Tools](https://www.kaggle.com/mjbahmani/machine-learning-workflow-for-house-prices)\n\n###### [go to top](#top)" }, { "metadata": { "_uuid": "00f5c5ce80c7e302e83f0ea9b451dfaae7aa52cf" }, "cell_type": "markdown", "source": "<a id=\"BigData\"></a> <br>\n## 5- Big Data\n\nThis section is not complete yet. For 
an alternative, **please** fork and run this kernel:\n\n[Big Data](https://www.kaggle.com/mjbahmani/a-data-science-framework-for-quora)\n" }, { "metadata": { "_uuid": "33bb9c265bef5e4474dcac0638cc632b5532f1ce" }, "cell_type": "markdown", "source": "<a id=\"Datavisualization\"></a> <br>\n## 6- Data Visualization\nTo read this section, **please** fork and upvote this kernel:\n\n[Top 5 Data Visualization Libraries Tutorial](https://www.kaggle.com/mjbahmani/top-5-data-visualization-libraries-tutorial)" }, { "metadata": { "_uuid": "9bf1d9444651e2756c4fa4d71914ec20d621305e" }, "cell_type": "markdown", "source": "<a id=\"DataCleaning\"></a> <br>\n## 7- Data Cleaning\nAnother important step on the way to specialization is learning how to clean data.\nIn this section, we will do this on the Elo data set.\nTo read this section, **please** fork and upvote this kernel:\n\n[Data Cleaning](https://www.kaggle.com/mjbahmani/some-eda-for-elo)" }, { "metadata": { "_uuid": "8720a4ddaab64e4bff226bed9e4e200dc9b94913" }, "cell_type": "markdown", "source": "<a id=\"Howto\"></a> <br>\n## 8- How to Solve a Problem?\nYou may have already read some [machine learning books](https://github.com/mjbahmani/10-steps-to-become-a-data-scientist/tree/master/Ebooks). 
You have noticed that there are different ways to stream data into machine learning.\n\nMost of these books share the following steps (checklist):\n* Define the Problem (Look at the big picture)\n* Specify Inputs & Outputs\n* Data Collection\n* Exploratory Data Analysis\n* Data Preprocessing\n* Model Design, Training, and Offline Evaluation\n* Model Deployment, Online Evaluation, and Monitoring\n* Model Maintenance, Diagnosis, and Retraining\n\n**You can see my workflow in the image below**:\n <img src=\"http://s9.picofile.com/file/8338227634/workflow.png\" />\n## 8-1 Real-World Applications vs. Competitions\nJust a simple comparison between real-world applications and competitions:\n<img src=\"http://s9.picofile.com/file/8339956300/reallife.png\" height=\"600\" width=\"500\" />\n**You should feel free to adapt this checklist to your needs.**\n \n## 8-2 Problem Definition\nI think one of the most important things when you start a new machine learning project is defining your problem. That means you should understand the business problem (**Problem Formalization**).\n\nProblem definition has four steps, which are illustrated in the picture below:\n<img src=\"http://s8.picofile.com/file/8338227734/ProblemDefination.png\">\n \n### 8-2-1 Problem Feature\nThe sinking of the Titanic is one of the most infamous shipwrecks in history. **On April 15, 1912**, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing **1502 out of 2224** passengers and crew. Hence the name DieTanic. It is a disaster that no one in the world can forget.\n\nIt took about $7.5 million to build the Titanic, and it sank after the collision. The Titanic dataset is a very good dataset for beginners to start a journey in data science and participate in Kaggle competitions.\n\nWe will use the classic Titanic data set. 
This dataset contains information about **11 different variables**:\n<img src=\"http://s9.picofile.com/file/8340453092/Titanic_feature.png\" height=\"500\" width=\"500\">\n\n* Survival\n* Pclass\n* Name\n* Sex\n* Age\n* SibSp\n* Parch\n* Ticket\n* Fare\n* Cabin\n* Embarked\n\n### 8-2-2 Aim\n\nIt is your job to predict whether a passenger survived the sinking of the Titanic or not. For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.\n\n \n### 8-2-3 Variables\n\n1. **Age** ==>> Age is fractional if less than 1. If the age is estimated, it is in the form xx.5\n\n2. **SibSp** ==>> The dataset defines family relations in this way...\n\n a. Sibling = brother, sister, stepbrother, stepsister\n\n b. Spouse = husband, wife (mistresses and fiancés were ignored)\n\n3. **Parch** ==>> The dataset defines family relations in this way...\n\n a. Parent = mother, father\n\n b. Child = daughter, son, stepdaughter, stepson\n\n c. Some children travelled only with a nanny, therefore parch=0 for them.\n\n4. **Pclass** ==>> A proxy for socio-economic status (SES)\n\n * 1st = Upper\n * 2nd = Middle\n * 3rd = Lower\n \n5. **Embarked** ==>> nominal datatype\n6. **Name** ==>> nominal datatype. It could be used in feature engineering to derive the gender from the title\n7. **Sex** ==>> nominal datatype\n8. **Ticket** ==>> has no impact on the outcome variable. Thus, it will be excluded from the analysis\n9. **Cabin** ==>> a nominal datatype that can be used in feature engineering\n10. **Fare** ==>> the passenger fare\n11. **PassengerId** ==>> has no impact on the outcome variable. Thus, it will be excluded from the analysis\n12. 
**Survival** ==>> the **[dependent variable](http://www.dailysmarty.com/posts/difference-between-independent-and-dependent-variables-in-machine-learning)**, 0 or 1\n\n\n**<< Note >>**\n\n> You must answer the following question:\nHow does your company expect to use and benefit from your model?\n\nTo read this section, **please** fork and upvote this kernel:\n\n[How to Solve a Problem?](https://www.kaggle.com/mjbahmani/a-data-science-framework-for-quora)\n###### [Go to top](#top)" }, { "metadata": { "_uuid": "d4f8718cc7e1a8fc60a3815b55a2ab9a5eeef4f9" }, "cell_type": "markdown", "source": "<a id=\"MachineLearning\"></a> <br>\n## 9- Machine Learning\nTo read this section, **please** fork and upvote this kernel:\n\n[A Comprehensive ML Workflow with Python](https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python)\n\n" }, { "metadata": { "_uuid": "3544d2fd1490f646f2f1c0fd4271f9a8745d2e36" }, "cell_type": "markdown", "source": "<a id=\"DeepLearning\"></a> <br>\n## 10- Deep Learning\n\nTo read this section, **please** fork and upvote this kernel:\n\n[A Comprehensive Deep Learning Workflow with Python](https://www.kaggle.com/mjbahmani/a-comprehensive-deep-learning-workflow-with-python)\n\n---------------------------\n" }, { "metadata": { "_uuid": "edb768e0b3390ec29acab20593948c3f3bbf5bba", "collapsed": true }, "cell_type": "markdown", "source": "---------------------------------------------------------------------\nFork and run this kernel on GitHub:\n> ###### [ GitHub](https://github.com/mjbahmani/10-steps-to-become-a-data-scientist)\n\n \n\n-------------------------------------------------------------------------------------------------------------\n <b>I hope you find this kernel helpful; some <font color=\"red\">UPVOTES</font> would be very much appreciated.</b>\n \n -----------" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.6.6", 
"mimetype": "text/x-python", "codemirror_mode": { "name": "ipython", "version": 3 }, "pygments_lexer": "ipython3", "nbconvert_exporter": "python", "file_extension": ".py" } }, "nbformat": 4, "nbformat_minor": 1 }