datasci_9_data_prep

HHA 507 Assignment #9

Data Cleaning and Transformation Plan

Description of the Datasets

The NYPD Complaint Data dataset contains all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) for all complete quarters of the current year to date (2023). It has 36 columns and ~415,000 rows. Example columns include borough name (where the incident occurred), complaint day (the day the complaint was made), complaint time (the time the complaint was made), level of offense, and demographic information about the victim and suspect. The dataset contains strings, integers, floats, and date and time values.

Independent / Predictor variables: borough, jurisdiction, law category, location of crime, offense description, victim age
Dependent / Target variables: victim race

The Student Weight Status Category Reporting dataset contains weight status category data (underweight, healthy weight, overweight, or obese, based on BMI-for-age percentile). It includes separate estimates of the percent of students who are overweight, obese, and overweight or obese, for all reportable grades within the county and/or region and by grade group (elementary and middle/high). It has 15 columns and ~32,000 rows. Example columns include number of students overweight, number of students obese, grade level, and sex. The dataset contains strings, integers, and floats.

Independent / Predictor variables: county, grade category, sex
Dependent / Target variables: number of obese students

Intended Learning Task

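Both datasets were intended to be used for regression. For the NYPD dataset the target (victim race) is categorical, so the logistic regression fitted in the documentation steps is, strictly speaking, a classification model.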

Steps to Clean and Transform Data

Remove any white space or special characters from the column names
Drop columns that are not going to be used
Identify any rows with missing data and drop those rows
Check that the data type of each column is correct (make sure that categorical columns are stored as strings)
Detect any outliers in the dataset, and decide how to handle them based on how they affect the rest of the data
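
The plan above can be sketched in pandas roughly as follows. This is a minimal illustration on a tiny made-up DataFrame, not the code used in this repository: the column names, the helper functions (clean_column_names, drop_missing_rows, flag_iqr_outliers), the placeholder strings, and the choice of the 1.5 x IQR rule for outliers are all assumptions made for the example.

```python
import re

import pandas as pd


def clean_column_names(df):
    """Strip white space, replace special characters with "_", and lowercase."""
    out = df.copy()
    out.columns = [
        re.sub(r"[^0-9a-zA-Z]+", "_", col.strip()).strip("_").lower()
        for col in out.columns
    ]
    return out


def drop_missing_rows(df, placeholders=("(null)", "")):
    """Treat placeholder strings as missing, then drop rows with any missing value."""
    masked = df.mask(df.isin(list(placeholders)))
    return masked.dropna().reset_index(drop=True)


def flag_iqr_outliers(series, k=1.5):
    """Return a boolean Series marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical raw data; the real datasets have different columns.
raw = pd.DataFrame({
    " Boro Name ": ["BRONX", "QUEENS", "(null)", "BROOKLYN", "QUEENS"],
    "Law-Cat (Code)": ["FELONY", "MISDEMEANOR", "VIOLATION", "FELONY", None],
    "Unused Col": [1, 2, 3, 4, 5],
})

cleaned = clean_column_names(raw)                 # tidy the column names
cleaned = cleaned.drop(columns=["unused_col"])    # drop columns not being used
cleaned = drop_missing_rows(cleaned)              # drop rows with missing data
cleaned = cleaned.astype({"boro_name": "string", "law_cat_code": "string"})

# Outlier check on a hypothetical numeric column.
counts = pd.Series([10, 12, 11, 13, 12, 95])
outlier_flags = flag_iqr_outliers(counts)
```

Flagging outliers instead of deleting them right away keeps the decision about what to do with them separate from the detection step, which matches the last item in the plan.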

Documentation of Steps to Clean and Transform Data

Uploaded the dataset into the repository
Converted the .csv file into a .pkl file
Removed white space and special characters from the column names
Dropped columns with many missing values, since they would affect my interpretation of the data. I also dropped columns whose information duplicated another column; there is no reason to keep more than one column with the same information.
Identified rows with missing values such as "(null)" or "NaN" and dropped them from the dataset
Checked that the data type of each column was accurate, in particular that categorical columns were converted into objects
Applied an OrdinalEncoder to each of the columns and created .csv files mapping each unique value to its assigned number
Scaled the data using scaler.transform(X)
Split the data into train, validation, and test sets
Created a baseline model using DummyClassifier
Created logistic regression models using the train and validation sets from the data splitting step
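
The encoding, scaling, splitting, and modeling steps above could look roughly like the scikit-learn sketch below. It runs on synthetic data and is not the repository's code: the column names, the 60/20/20 split, the random seeds, and the use of StandardScaler are assumptions (the steps above only mention scaler.transform(X)). The sketch also fits the scaler on the training split only, which differs from the order listed above but avoids leaking validation and test statistics into training.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in for a cleaned dataset (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "borough": np.tile(["BRONX", "BROOKLYN", "QUEENS"], n // 3),
    "law_category": rng.choice(["FELONY", "MISDEMEANOR", "VIOLATION"], size=n),
    "target": rng.choice(["A", "B"], size=n),
})

# Ordinal-encode the categorical predictors.
feature_cols = ["borough", "law_category"]
encoder = OrdinalEncoder()
X = encoder.fit_transform(df[feature_cols])
y = df["target"].to_numpy()

# Lookup of each unique value and its assigned number, per column.
# mapping_table.to_csv("some_name.csv") would save it (file name is hypothetical).
mapping = {
    col: {str(cat): code for code, cat in enumerate(cats)}
    for col, cats in zip(feature_cols, encoder.categories_)
}
mapping_table = pd.DataFrame(
    [(col, value, code) for col, m in mapping.items() for value, code in m.items()],
    columns=["column", "value", "code"],
)

# 60% train, 20% validation, 20% test.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Fit the scaler on the training data only, then apply it to every split.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_s, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)

baseline_val_acc = baseline.score(X_val_s, y_val)
model_val_acc = model.score(X_val_s, y_val)
```

Comparing model_val_acc against baseline_val_acc on the validation set shows whether the logistic regression learns anything beyond the majority class; the test set is left untouched for a final check.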

jesschannn/datasci_9_data_prep

Folders and files

Latest commit

History

Repository files navigation

datasci_9_data_pre

Data Cleaning and Transformation Plan

Description of the Datasets

Intended Learning Task

Steps to Clean and Transform Data

Documentation of Steps to Clean and Transform Data

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages