yaowser/data-challenges

kickstarter-dataset

Given an unfamiliar dataset, I got strong results because I spent most of my time on dummy-variable and feature creation during the EDA process. I filled in the gaps around uncertainty in the data and followed common multiclass prediction workflow practices. With optimized random forest parameters, I got an F1 score of 0.95, a precision of 0.96, and a recall of 0.96.

Message to subsequent data engineers

Run the whole notebook, then use the final model ("rfc.predict") to score an unlabeled dataset. Import the unlabeled dataset at the beginning and apply to it the same feature extractions that the code applies to the original df. Once the model is trained, apply that same model to the unlabeled dataset to get a new set of predictions.
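
The following is a minimal sketch of that hand-off, not the notebook's actual code. The toy columns ("category", "goal", "state") and the helper `extract_features` are hypothetical stand-ins for the notebook's feature extraction; only the name `rfc` comes from the text above. It assumes pandas and scikit-learn are installed.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data; the real notebook's columns are not shown in this README.
labeled = pd.DataFrame({
    "category": ["art", "tech", "art", "games", "tech", "games"],
    "goal": [1000, 50000, 2000, 8000, 30000, 12000],
    "state": ["failed", "successful", "failed",
              "successful", "successful", "failed"],
})
unlabeled = pd.DataFrame({
    "category": ["tech", "music"],  # "music" never appears in the labeled data
    "goal": [20000, 1500],
})


def extract_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the notebook's feature extraction: dummy-encode the category."""
    return pd.get_dummies(frame[["category", "goal"]], columns=["category"], dtype=int)


X_train = extract_features(labeled)
y_train = labeled["state"]

# Align the unlabeled features to the training columns: dummy columns the model
# never saw are dropped, and dummy columns missing here are filled with 0.
X_new = extract_features(unlabeled).reindex(columns=X_train.columns, fill_value=0)

rfc = RandomForestClassifier(n_estimators=50, random_state=0)
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_new)
print(predictions)
```

The `reindex` step matters because dummy encoding produces one column per category actually present, so the unlabeled data can otherwise end up with a different set of columns than the model was trained on.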

What is your experience with machine learning development in production?

My machine learning deliverables have been price prediction for Sears and some NLP work I did on Reddit for spam-detection bots.

What is your familiarity with Python 3?

8/10, very familiar

Can you briefly describe your experience with NumPy, Pandas and other scientific Python libraries?

I have been using NumPy and pandas for 4 years.

Do you have any experience with Scikit Learn or Keras?

scikit-learn for 3 years, Keras and TensorFlow for 2.

What is your experience with NLP? Do you know libraries such as NLTK and Spacy?

I used NLTK for the challenge notebook. I am familiar with naive Bayes, bag of words, and Stanford NLP.

Can you tell the difference between Precision, Recall and F1 score?

Precision is the number of true positives divided by the sum of true positives and false positives. Recall is the number of true positives divided by the sum of true positives and false negatives. F1 score is the harmonic mean of precision and recall.
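
As a small worked illustration of those three definitions, here is a sketch in plain Python for the binary case. The function name `precision_recall_f1` and the toy labels are made up for this example. For a multiclass problem, the same quantities are typically computed per class and then averaged.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy labels: 3 true positives, 2 false positives, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.75 0.667
```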

How do you evaluate your classification models?

ROC curve, accuracy, F1 score, precision, and recall, depending on consumer needs or the problem statement.

Which classification algorithms are you most familiar with?

Decision trees, random forests, and logistic regression.

What's your experience with prediction (regression)?

I've done prediction work throughout my GitHub: https://github.com/yaowser/qtw

What are the advantages and disadvantages of neural networks?

The advantage is that everything can be optimized and that the network can find the nuances in the data through its neurons, dropout rate, and layers. The disadvantage is that it is a black box in which the important features are abstract.

Which clustering algorithms are you most familiar with?

kNN, multiclass classification, geospatial clustering, sound separation

Can you describe your workflow when developing a new model with a new dataset?

Background on the features, EDA, imputation, dummy variables, nuance and auxiliary variables, log transformations, data scaling, model selection and comparison, feature optimization, metrics, then the final model and results.
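
A few of these steps (imputation, dummy variables, log transformation, scaling, and model comparison) can be sketched with a scikit-learn pipeline. This is an illustrative sketch under assumptions, not the notebook's code: the toy columns "goal", "category", and "state" are hypothetical, and the two candidate models are just examples.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing numeric value.
df = pd.DataFrame({
    "goal": [1000, 50000, np.nan, 8000, 30000, 12000, 500, 75000],
    "category": ["art", "tech", "art", "games", "tech", "games", "art", "tech"],
    "state": ["failed", "successful", "failed", "successful",
              "successful", "failed", "failed", "successful"],
})
X, y = df[["goal", "category"]], df["state"]

# Numeric column: impute, log-transform, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

# Categorical column: dummy (one-hot) encoding; unseen categories are ignored.
preprocess = ColumnTransformer([
    ("num", numeric, ["goal"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])

# Model selection and comparison with a shared metric.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in candidates.items():
    pipeline = Pipeline([("preprocess", preprocess), ("model", model)])
    scores[name] = cross_val_score(pipeline, X, y, cv=2, scoring="f1_macro").mean()

print(scores)
```

Keeping the preprocessing inside the pipeline means each cross-validation fold fits the imputer, encoder, and scaler on its own training split only, which avoids leaking information from the held-out rows.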

Do you have experience with Spark or big data tools for machine learning?

I taught Apache Spark for Databricks to my class in a 45-minute lecture on YouTube. I am very familiar with it. I have also used Bigtable, BigQuery, some Hadoop, and AWS, and I have a certification in Azure. Video lecture (teach Spark final project): https://youtu.be/IVMbSDS4q3A

Can you give examples of scenarios where machine learning is not the best option and where do you think we should apply machine learning?

Machine learning is not the best option when the dataset is poor, because you are only as good as your dataset. The collection of the data and an unbiased sampling procedure therefore come before machine learning. After that, we need to read into the dataset to make sense of it before proceeding to model selection and the rest of the workflow. We can apply machine learning to prediction and classification problems that are backed by proper data.
