GitHub - yaowser/data-challenges

kickstarter-dataset

Given an unfamiliar dataset, I did well because I spent most of my time on dummy-variable and feature creation during EDA. I addressed the remaining sources of uncertainty and followed common multiclass prediction workflow practices. With tuned random forest parameters, the model reached an F1 score of .95, a precision of .96, and a recall of .96.

Message to subsequent data engineers

Run the whole notebook and use the final model's "rfc.predict" to score an unlabeled dataset. Import the unlabeled dataset at the beginning and apply the same feature extraction steps used for the original df. Once the model is trained, apply that same model to the unlabeled dataset to get a new set of predictions.
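The hand-off described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`goal`, `duration_days`, `category`, `state`), the `build_features` helper, and the tiny data frames are all made up for the example. The point it shows is that the unlabeled data goes through the same feature function and is then aligned to the training columns before calling `rfc.predict`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature step: add a derived column and one-hot encode a categorical."""
    out = df.copy()
    out["goal_per_day"] = out["goal"] / out["duration_days"]
    return pd.get_dummies(out, columns=["category"], dtype=int)


# Made-up frames standing in for the labeled and unlabeled datasets.
labeled = pd.DataFrame({
    "goal": [1000, 5000, 200, 8000, 1500, 300],
    "duration_days": [30, 60, 10, 45, 30, 15],
    "category": ["art", "tech", "art", "tech", "games", "games"],
    "state": ["successful", "failed", "successful", "failed", "failed", "successful"],
})
unlabeled = pd.DataFrame({
    "goal": [400, 7000],
    "duration_days": [20, 50],
    "category": ["art", "film"],  # "film" never appears in the training data
})

X_train = build_features(labeled.drop(columns=["state"]))
y_train = labeled["state"]

rfc = RandomForestClassifier(n_estimators=50, random_state=0)
rfc.fit(X_train, y_train)

# Align the new data with the training columns: dummy columns absent from the
# unlabeled data are filled with 0, and columns unseen in training are dropped.
X_new = build_features(unlabeled).reindex(columns=X_train.columns, fill_value=0)
predictions = rfc.predict(X_new)
```

Without the `reindex` step, a category that is missing from (or new in) the unlabeled data would give the model a different set of columns than it was trained on.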

What is your experience with machine learning development in production?

The machine learning applications I have delivered include price prediction for Sears and some NLP work on Reddit for spam detection bots.

What is your familiarity with Python 3?

8/10, very familiar

Can you briefly describe your experience with NumPy, Pandas and other scientific Python libraries?

I have been using NumPy and pandas for 4 years.

Do you have any experience with Scikit Learn or Keras?

scikit-learn for 3 years; Keras and TensorFlow for 2.

What is your experience with NLP? Do you know libraries such as NLTK and Spacy?

I used NLTK for the challenge notebook. I am familiar with naive Bayes with bag of words, and with Stanford NLP.

Can you tell the difference between Precision, Recall and F1 score?

Precision is true positives divided by the sum of true positives and false positives. Recall is true positives divided by the sum of true positives and false negatives. The F1 score is the harmonic mean of precision and recall.
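As a small worked example of these three definitions, the sketch below computes them from raw counts. The function name and the counts are made up for illustration.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 8 true positives, 2 false positives, 8 false negatives (illustrative counts).
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.500 f1=0.615
```

Because the harmonic mean is pulled toward the smaller value, the F1 here (0.615) sits below the arithmetic mean of precision and recall (0.65).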

How do you evaluate your classification models?

ROC curves (true positive rate against false positive rate across thresholds), accuracy, F1 score, precision, and recall, depending on consumer needs or the problem statement.
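A minimal sketch of computing these metrics with scikit-learn is shown below. The labels and scores are made up, and the 0.5 threshold is an assumption for the example; which metric to prioritize still depends on the problem statement.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up binary labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.6, 0.7]

# Threshold-based metrics need hard predictions; 0.5 is an arbitrary cut-off here.
y_pred = [int(s >= 0.5) for s in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC AUC is computed from the scores, not the thresholded predictions.
    "roc_auc": roc_auc_score(y_true, y_score),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For a multiclass target, `precision_score`, `recall_score`, and `f1_score` need an `average` argument such as `"macro"` or `"weighted"`.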

Which classification algorithms are you most familiar with?

Decision trees, random forests, and logistic regression.

What's your experience with prediction (regression)?

I've done prediction throughout my GitHub projects: https://github.com/yaowser/qtw

What are the advantages and disadvantages of neural networks?

The advantage is that nearly everything can be tuned (neurons, dropout rate, and layers), so the network can capture nuances in the data. The disadvantage is that it is a black box: the features it treats as important are abstract and hard to interpret.

Which clustering algorithms are you most familiar with?

kNN, multiclass classification, geospatial clustering, and sound separation.

Can you describe your workflow when developing a new model with a new dataset?

Background on the features, EDA, imputation, dummy variables, nuance and auxiliary variables, log transformations, data scaling, model selection and comparison, feature optimization, metrics, and the final model and results.
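Several of these steps (imputation, dummy variables, log transformation, scaling, and model comparison) can be chained in a scikit-learn pipeline. The sketch below is one possible arrangement on made-up data, not the workflow used in the notebooks; the column names `amount` and `kind`, the candidate models, and the `f1_macro` scoring choice are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Made-up dataset: one skewed numeric column with gaps, one categorical column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.lognormal(mean=5, sigma=1, size=60),
    "kind": np.tile(["a", "b", "c"], 20),
})
df.loc[::10, "amount"] = np.nan          # introduce missing values
y = (df["kind"] == "a").astype(int)      # toy label tied to a feature

# Numeric branch: impute, log-transform, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
# Categorical branch: impute, then create dummy variables.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("dummies", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["amount"]),
    ("cat", categorical, ["kind"]),
])

# Model selection and comparison with the same preprocessing and metric.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in candidates.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores[name] = cross_val_score(pipe, df, y, cv=3, scoring="f1_macro").mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Keeping the preprocessing inside the pipeline means the imputer, encoder, and scaler are refit on each training fold, which avoids leaking information from the validation fold into the features.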

Do you have experience with Spark or big data tools for machine learning?

I taught a 45-minute Apache Spark on Databricks lesson to my class, which is available on YouTube, and I am very familiar with it. I have also used Bigtable, BigQuery, some Hadoop, and AWS, and I have a certification in Azure. Video lecture (teach Spark final project): https://youtu.be/IVMbSDS4q3A

Can you give examples of scenarios where machine learning is not the best option and where do you think we should apply machine learning?

Machine learning is not the best option when the data is poor, because a model is only as good as its dataset. Data collection and an unbiased sampling procedure therefore come before machine learning. After that, we need to read into the dataset to make sense of it before proceeding to model selection and the rest of the workflow. Machine learning can be applied to prediction and classification problems that are backed by proper data.

Repository files (10 commits):

90 minute timed coding challenge.ipynb
Adops & Data Scientist Sample Data.xlsx
Coinbase.ipynb
Data Science Take Home Interview .docx
Data_Challenge_08_2018_to_05_2019.csv
Hackerrank 2 hour challenge.ipynb
Interview_Input.csv
README(1).md
README.md
Yao - Moloco Data Scientist Interview Question.ipynb
Yao - View - Unsupervised Time Series Clustering Classification.html
Yao - View - Unsupervised Time Series Clustering Classification.ipynb
Yao Multiclass Prediction Challenge.html
Yao Multiclass Prediction Challenge.ipynb
Yao Royal Caribbean Data Challenge.html
Yao Royal Caribbean Data Challenge.ipynb
Yao Uptake Challenge.html
Yao Uptake Challenge.ipynb
Yao Yao - Data Challenge - Cummins.html
Yao Yao - Data Challenge - Cummins.ipynb
Yao Yao 2 hour hackerrank Questions 2.ipynb
Yao Yao Data Mogul Pre-Screening Test.html
Yao Yao Data Mogul Pre-Screening Test.ipynb
Yao Yao Patientfi Data Challenge.html
Yao Yao Patientfi Data Challenge.ipynb
Yao Yao Patientfi Data Challenge.pdf
Yao Yao VSCO Data Science Case Study.html
Yao Yao VSCO Data Science Case Study.ipynb
Yao Yao VSCO Data Science Case Study.pdf
Yao Yao VSCO Data Science Case Study.zip
accounts.csv
challenge.ipynb
coinbase_takehome.db
col_desc.pdf
data.csv
ledger.csv
prediction.csv
sun.png
users.csv
