This repository has been archived by the owner on Oct 19, 2019. It is now read-only.

Dennis Batiste- 3rd Assignment #420

Open
wants to merge 10 commits into
base: master
from

Conversation

dennibat

@dennibat dennibat commented Sep 7, 2019

No description provided.

@llpk79

llpk79 commented Sep 8, 2019

It's good to see you here, Mr. Bamboo!

@dennibat
Author

dennibat commented Sep 8, 2019

LS-DS_112

@dennibat dennibat closed this Sep 8, 2019
@dennibat dennibat reopened this Sep 8, 2019
@dennibat dennibat changed the title from "Dennis Batiste- First Assignment (First Look at Data)" to "Dennis Batiste- 2nd Assignment" Sep 8, 2019
@dennibat dennibat changed the title from "Dennis Batiste- 2nd Assignment" to "Dennis Batiste- 3rd Assignment" Sep 11, 2019
@llpk79

llpk79 commented Sep 13, 2019

Okay, so Making Data Backed Assertions.

A little rough here, but it sounds like the idea of binning these values before plotting them makes a lot of sense. I think you'll do fine with the rest after getting that sorted out.

Try making cross tabs with the binned features and explore heatmaps and other plots to help understand better.
We are trying to tell a story with the data. These are tools to help us understand the story, and to tell it when we figure it out.

Keep it up, Dennis!! Please reach out to me, or any of the team, with anything at all. I'm usually up late!!

@llpk79

llpk79 commented Sep 17, 2019

Sprint challenge code review:

Part 1 - Load and validate the data

  • Load the data as a pandas data frame.

    • Complete.
  • Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).

    • Incomplete.
    • Build your headers list before running pd.read_csv(file_or_url, names=headers)
    • Or, do like you did, but make header=None
  • Validate that you have no missing values.

    • Incomplete.
    • Run df.isna().sum() on the last line of a cell, or use print()
    • How do you know how many rows there should be?
  • Add informative names to the features.

    • Complete.
  • The survival variable is encoded as 1 for surviving >5 years and 2 for not - change this to be 0 for not surviving and 1 for surviving >5 years (0/1 is a more traditional encoding of binary variables)

    • Complete.
    • Try to think of a more explicit way to do this. What is the value of headers[3]?
  • At the end, print the first five rows of the dataset to demonstrate the above.

    • Incomplete.
    • df.head() will do it.
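
A minimal sketch of the Part 1 steps, assuming a four-column file with no header row. The inline sample rows and the column names in `headers` are hypothetical stand-ins, not the names used in this PR; in practice the expected row count comes from the dataset description, as noted above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw file: no header row, four numeric columns.
raw = io.StringIO("30,64,1,1\n30,62,3,1\n34,59,0,2\n38,69,21,2\n42,60,1,1\n")

# Assumed column names; build this list before reading the file.
headers = ["age", "year_of_op", "pos_nodes", "survived"]

# header=None keeps the first data row from being consumed as column names.
df = pd.read_csv(raw, header=None, names=headers)

# Validate the number of observations against the documented count.
print(df.shape)

# Validate that there are no missing values (all counts should be 0).
print(df.isna().sum())

# Explicit recode of the survival column: 1 stays 1, 2 becomes 0.
df[headers[3]] = df[headers[3]].map({1: 1, 2: 0})

# Show the first five rows to demonstrate the result.
print(df.head())
```

Naming the column through `headers[3]` and spelling out the mapping makes the recode readable without knowing how the arithmetic works.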

Part 2 - Examine the distribution and relationships of the features

  • Explore the data - create at least 2 tables (can be summary statistics or crosstabulations) and 2 plots illustrating the nature of the data.
    • Complete.
    • Take careful note of your variable names and the arguments to your functions. Did you intend to do pd.crosstab(df['survived'], year_of_op_bin, normalize='columns')? It's a good idea, so make sure you execute it!
    • What do these plots tell us?
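
One way the binned crosstab could look, as a sketch only. The sample rows, the column names, and the bin edges are assumptions chosen for illustration; `year_of_op_bin` follows the variable name mentioned in the review.

```python
import pandas as pd

# Hypothetical sample with assumed column names.
df = pd.DataFrame({
    "age":        [30, 30, 34, 38, 42, 50, 55, 61],
    "year_of_op": [64, 62, 59, 69, 60, 65, 58, 67],
    "pos_nodes":  [1, 3, 0, 21, 1, 4, 0, 2],
    "survived":   [1, 1, 0, 0, 1, 0, 1, 1],
})

# Bin the operation year into three ranges (the edges are an arbitrary choice).
year_of_op_bin = pd.cut(df["year_of_op"], bins=[57, 61, 65, 69])

# Share of each survival outcome within every year bin;
# normalize='columns' makes each column sum to 1.
ct = pd.crosstab(df["survived"], year_of_op_bin, normalize="columns")
print(ct)
```

Because each column sums to 1, the table can be read bin by bin as a survival rate, which is easier to compare than raw counts when the bins differ in size.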

Part 3 - DataFrame Filtering

  • Use DataFrame filtering to subset the data into two smaller dataframes. You should make one dataframe for individuals who survived >5 years and a second dataframe for individuals who did not.
    • Incomplete.
    • Dataframe filtering is like: new_df = df[df[column] == value]
    • You correctly encoded the survival column above. Here, we are making use of that.
    • Check your syntax when defining functions. By convention (PEP 8), there should be no space before the ()
  • Create a graph with each of the dataframes (can be the same graph type) to show the differences in Age and Number of Positive Axillary Nodes Detected between the two groups.
    • Incomplete.
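
A small sketch of the filtering step, assuming the survival column has already been recoded to 0/1. The sample rows and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample with assumed column names; survived is already 0/1.
df = pd.DataFrame({
    "age":       [30, 30, 34, 38, 42, 50, 55, 61],
    "pos_nodes": [1, 3, 0, 21, 1, 4, 0, 2],
    "survived":  [1, 1, 0, 0, 1, 0, 1, 1],
})

# A boolean mask selects the rows where the comparison is True.
survived_df = df[df["survived"] == 1]
not_survived_df = df[df["survived"] == 0]

# Quick numeric comparison of the two groups before plotting.
print(survived_df[["age", "pos_nodes"]].describe())
print(not_survived_df[["age", "pos_nodes"]].describe())
```

Each frame can then be passed to the same plotting call, for example `survived_df.plot.scatter(x="age", y="pos_nodes")`, so the two graphs are directly comparable.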

Part 4 - Analysis and Interpretation

Answer these as if you were at a job interview speaking with a hiring manager.

@llpk79

llpk79 commented Sep 17, 2019

Retry - Sprint challenge code review:

Part 1 - Load and validate the data

  • Load the data as a pandas data frame.
    • Complete.
  • Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
    • Complete.
  • Validate that you have no missing values.
    • Complete.
  • Add informative names to the features.
    • Complete.
  • The survival variable is encoded as 1 for surviving >5 years and 2 for not - change this to be 0 for not surviving and 1 for surviving >5 years (0/1 is a more traditional encoding of binary variables)
    • Incomplete.
    • There are no NaN values or strings to replace in this data.
  • At the end, print the first five rows of the dataset to demonstrate the above.
    • Complete.

Part 2 - Examine the distribution and relationships of the features

  • Explore the data - create at least 2 tables (can be summary statistics or crosstabulations) and 2 plots illustrating the nature of the data.
    • Complete.
    • What do these plots tell us?

Part 3 - DataFrame Filtering

  • Use DataFrame filtering to subset the data into two smaller dataframes. You should make one dataframe for individuals who survived >5 years and a second dataframe for individuals who did not.
    • Incomplete.
  • Create a graph with each of the dataframes (can be the same graph type) to show the differences in Age and Number of Positive Axillary Nodes Detected between the two groups.
    • Incomplete.

Part 4 - Analysis and Interpretation

  • While perhaps true, these conclusions are not supported by the data examined in this sprint.


2 participants