
R package for summarizing data frame attributes

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

UBC-MDS/exploreR


exploreR

A Collaborative Software Development Project

March 2019

Overview

exploreR is an R package loaded with methods to help explore and explain the contents of a dataframe.

Installation

To install exploreR, follow these instructions:

  1. Enter the following in the R console (this requires the devtools package):

devtools::install_github("UBC-MDS/exploreR", build_opts = c("--no-resave-data", "--no-manual"))

  2. The package is now installed and ready for use.

Functions and Example Usage

Load the package.

library(exploreR)

Function 1 | Variable summary

The function variable_summary takes a data frame as input and reports how many variables of each type it contains. The output is a data frame of size 5 x 2, with one row per variable type and its corresponding count. The function identifies five types of variables: numeric, character, logical (boolean), date, and a catch-all "other" category.

Example usage of variable_summary:

toy_data <- data.frame("letters" = c("a", "b", NA, "d"),
                       "numbers" = c(1, 4, 6, NA),
                       "logical" = c(NA, FALSE, NA, TRUE),
                       "dates" = as.Date(c("2003-01-02", "2002-02-02", "2004-03-03", "2005-04-04")),
                       "integers" = c(2L, 3L, 4L, 5L),
                       stringsAsFactors = FALSE)

variable_summary(toy_data)

Example output of variable_summary:

variable_type count
numeric 1
character 1
logical 1
date 1
other 1
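
For intuition, a tally like the one above can be approximated in base R. The following is a minimal sketch, not the package's implementation: the helper names classify_column and count_types are hypothetical, and treating integer columns as "other" is an assumption made so that the counts agree with the example output above.

```r
# Hypothetical helper: map one column to one of five type labels.
# Dates are checked first because they are stored as doubles.
# Assumption: integer columns fall through to "other".
classify_column <- function(x) {
  if (inherits(x, "Date")) {
    "date"
  } else if (is.logical(x)) {
    "logical"
  } else if (is.character(x)) {
    "character"
  } else if (is.double(x)) {
    "numeric"
  } else {
    "other"
  }
}

# Hypothetical helper: count how many columns carry each label.
count_types <- function(df) {
  types <- c("numeric", "character", "logical", "date", "other")
  labels <- vapply(df, classify_column, character(1))
  # factor() with fixed levels keeps types with zero columns in the table
  counts <- table(factor(labels, levels = types))
  data.frame(variable_type = types,
             count = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_data <- data.frame("letters" = c("a", "b", NA, "d"),
                       "numbers" = c(1, 4, 6, NA),
                       "logical" = c(NA, FALSE, NA, TRUE),
                       "dates" = as.Date(c("2003-01-02", "2002-02-02", "2004-03-03", "2005-04-04")),
                       "integers" = c(2L, 3L, 4L, 5L),
                       stringsAsFactors = FALSE)

count_types(toy_data)
```

Fixing the factor levels up front means the result always has five rows, even when a type is absent from the input.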

Function 2 | Missing values per variable

The function missing_values takes a data frame as input and counts the missing values in each column (variable). It returns a data frame with one row per column of the input, giving the variable name, the number of missing values, and the proportion of values that are missing (reported as percent_missing). If the input is of size n x d, the output is of size d x 3.

Example usage of missing_values:

toy_data <- data.frame("letters" = c("a", "b", NA, "d"),
                       "numbers" = c(1, 4, 6, NA),
                       "logical" = c(NA, FALSE, NA, TRUE),
                       "dates" = as.Date(c("2003-01-02", "2002-02-02", "2004-03-03", "2005-04-04")),
                       "integers" = c(2L, 3L, 4L, 5L),
                       stringsAsFactors = FALSE)

missing_values(toy_data)

Example output of missing_values:

variable missing_values percent_missing
letters 1 0.25
numbers 1 0.25
logical 2 0.50
dates 0 0.00
integers 0 0.00
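
The same per-column counts can be reproduced with base R alone, which may help when checking the output by hand. This is a sketch under assumptions rather than the package's code: the helper name count_missing is hypothetical, and percent_missing is computed as a proportion between 0 and 1 to match the example output above.

```r
# Hypothetical helper: count NA values per column using base R only.
count_missing <- function(df) {
  # is.na() on a data frame returns a logical matrix; colSums() counts TRUEs
  n_missing <- colSums(is.na(df))
  data.frame(variable = names(df),
             missing_values = as.integer(n_missing),
             # Assumption: a proportion in [0, 1], as in the example output
             percent_missing = as.numeric(n_missing) / nrow(df),
             stringsAsFactors = FALSE)
}

toy_data <- data.frame("letters" = c("a", "b", NA, "d"),
                       "numbers" = c(1, 4, 6, NA),
                       "logical" = c(NA, FALSE, NA, TRUE),
                       "dates" = as.Date(c("2003-01-02", "2002-02-02", "2004-03-03", "2005-04-04")),
                       "integers" = c(2L, 3L, 4L, 5L),
                       stringsAsFactors = FALSE)

count_missing(toy_data)
```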

Function 3 | Dataset Size/Info

The function size takes a data frame as input and reports its shape and size: the number of rows, the number of columns, and the amount of memory the data frame consumes, in bytes. The output is a data frame of size 1 x 3.

Example usage of size:

toy_data <- data.frame("letters" = c("a", "b", NA, "d"),
                       "numbers" = c(1, 4, 6, NA),
                       "logical" = c(NA, FALSE, NA, TRUE),
                       "dates" = as.Date(c("2003-01-02", "2002-02-02", "2004-03-03", "2005-04-04")),
                       "integers" = c(2L, 3L, 4L, 5L),
                       stringsAsFactors = FALSE)

size(toy_data)

Example output of size:

rows columns size_in_memory
4 5 1760
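
A comparable one-row summary can be assembled from base R functions. This is a minimal sketch, assuming the memory figure corresponds to what utils::object.size() reports; the helper name describe_size is hypothetical, and the byte count can differ across R versions and platforms, so it may not match the 1760 shown above.

```r
# Hypothetical helper: shape and memory footprint of a data frame.
describe_size <- function(df) {
  data.frame(rows = nrow(df),
             columns = ncol(df),
             # Assumption: memory is measured with object.size(), in bytes
             size_in_memory = as.numeric(utils::object.size(df)))
}

toy_data <- data.frame("letters" = c("a", "b", NA, "d"),
                       "numbers" = c(1, 4, 6, NA),
                       "logical" = c(NA, FALSE, NA, TRUE),
                       "dates" = as.Date(c("2003-01-02", "2002-02-02", "2004-03-03", "2005-04-04")),
                       "integers" = c(2L, 3L, 4L, 5L),
                       stringsAsFactors = FALSE)

describe_size(toy_data)
```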

Check out the package vignette for more information by entering the following in the console:

vignette("explorer") for viewing inside the RStudio viewer

or

browseVignettes(package="exploreR") for viewing in a browser

Comparable Functions Available in the R Ecosystem

The following are existing functions in R that are similar to those developed within our project.

dim(): used to obtain the shape of a dataframe.
ncol() and nrow(): used to get the number of columns and rows in a dataframe, respectively.
str(): provides summary information about the dataframe, including some of the same information as above (i.e. dim, ncol and nrow). str() provides descriptive information about variable and data types in the dataframe.
is.na(): flags which values in the data frame are missing; combined with colSums(), it gives the number of missing values in each column.

Collaborators:

name github handle
Rachel K. Riggs @rachelkriggs
Milos Milic @milicmil
Arzan Irani @nazra-inari
James Pushor @jpush1773

Test Results
