From e2c35d73550aa0040a373fcee30d9d2065b83fb9 Mon Sep 17 00:00:00 2001
From: hsonne
Date: Tue, 19 Mar 2019 11:45:32 +0100
Subject: [PATCH 1/3] Improve tutorial based on Christoph's review

---
 vignettes/tutorial.Rmd | 58 +++++++++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/vignettes/tutorial.Rmd b/vignettes/tutorial.Rmd
index fdb11b7..104839f 100644
--- a/vignettes/tutorial.Rmd
+++ b/vignettes/tutorial.Rmd
@@ -16,36 +16,41 @@ knitr::opts_chunk$set(
 )
 ```
 
-This package provides functions to read data from Microsoft Excel files. It uses
-the package [readxl](https://readxl.tidyverse.org/) under the hood. In contrast
-to the default behaviour of readxl the functions in this package read the raw
-text information that are contained in the sheets. No type conversions are
-performed and all values keep their original position. For example, a value in
-cell A3 is returned in `sheet[3, 1]` with `sheet` being the result matrix
-returned when reading one Excel sheet. By default, `readxl::read_excel()`
-returns this value in the first row of the returned data frame, given that the
-first two rows are empty. This may be useful in some cases but for validation
-purposes I prefer to keep the original positions.
+This package provides functions to read spreadsheet data from Microsoft Excel
+files. To achieve this, it uses functions from the package
+[readxl](https://readxl.tidyverse.org/).
+
+In contrast to the default behaviour of the readxl functions, this package reads
+the raw text information from spreadsheets. No type conversions are performed.
+Being all of the same type (character) the text values are returned in a matrix
+and not in a data frame as it is the case for the readxl package.
+
+Also, when using this package, all values keep their original position. By
+default, the readxl-functions remove empty rows at the beginning of a sheet.
+This leads to row numbers in the returned data frame that do not correspond to
+the original row numbers in the Excel sheet. For validation purposes, I prefer
+that all values appear at the same positions in the returned matrix as they have
+in the sheet. This is what this package does.
 
 ## Reading Table Data from a Spreadsheet Program
 
-MS Excel is a spreadsheet program. It does not really know where a table starts
+MS Excel is a spreadsheet program. It does not "know" where a table starts
 and where it ends and it does not clearly assign a type to a column. Instead,
 each cell can have its own type so that there can be cells of different types in
-one and the same column. However, functions such as `readxl::read_excel()` try
+the same column. However, functions such as `readxl::read_excel()` try
 to communicate with Excel as if it was a database management system. For each
-column the values in the first few rows of the column are inspected and the type
-of the column is guessed. Then, all values in the column are tried to be
-converted to the guessed type. This leads to conversion errors, e.g. if the
-first few values look numeric but the column contains text values, such as
-`">1000"` further down. In the returned data frame the text values are removed,
-i.e. set to `NA`. This package avoids these data losses as it keeps the original
-(text) information and lets the user decide what to do.
+column, the values in the first few rows of the column are inspected and based
+on the found type the whole column is then converted accordingly. This may lead
+to conversion errors, e.g. if the first few values look numeric but further
+contains text values, such as `">1000"`. In these cases the readxl functions
+return a data frame in whith the text values are removed, i.e. set to `NA`. This
+package avoids these data losses as it keeps the original (text) information and
+lets the user decide what to do.
 
-## Read sheet with readxl
+## Read MS spreadsheet with readxl
 
 We demonstrate this behaviour by reading a sheet from an example Excel file. The
-top and bottom parts of the "table" contained in the sheet look as follows:
+top and bottom parts of the "table" look as follows:
 
 [![table top](images/example_2_top.png)]()
 [![table top](images/example_2_bottom.png)]()
@@ -64,13 +69,14 @@ data <- readxl::read_excel(file)
 head(data)
 ```
 
-As described above, the two empty rows on top were skipped and the second
-column was assumed to be of numeric type even though it contains a text value
-in row 1004 (see image above).
+As described above, the two empty rows on top were skipped. Also, column B was
+assumed to be numeric (double \) even though it contains a text value in row
+1004 (see image above).
 
 ## Read sheet with kwb.readxl
 
-We now read the same sheet with `get_raw_text_from_xlsx()` from this package:
+Now we read the same spreadsheet with `get_raw_text_from_xlsx()` from this
+package:
 
 ```{r}
 # Read the sheet into a list of character matrices
@@ -85,7 +91,7 @@ tail(sheets$sheet_01)
 
 The first two empty rows are kept so that the row number in the returned matrix
 corresponds to the row number in the Excel file. This is helpful if we want to
-warn the user about possible problems, such as the non-numeric value in row
+warn the user about possible problems, such as the non-numeric value in row
 1004:
 
 ```{r}

From e7a5e85de6fcbb5c56f89cfa913b40cf0f81ae57 Mon Sep 17 00:00:00 2001
From: hsonne
Date: Tue, 19 Mar 2019 11:51:49 +0100
Subject: [PATCH 2/3] Correct one sentence, reflow

---
 vignettes/tutorial.Rmd | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/vignettes/tutorial.Rmd b/vignettes/tutorial.Rmd
index 104839f..1a52775 100644
--- a/vignettes/tutorial.Rmd
+++ b/vignettes/tutorial.Rmd
@@ -42,10 +42,10 @@ to communicate with Excel as if it was a database management system. For each
 column, the values in the first few rows of the column are inspected and based
 on the found type the whole column is then converted accordingly. This may lead
 to conversion errors, e.g. if the first few values look numeric but further
-contains text values, such as `">1000"`. In these cases the readxl functions
-return a data frame in whith the text values are removed, i.e. set to `NA`. This
-package avoids these data losses as it keeps the original (text) information and
-lets the user decide what to do.
+values cannot be interpreted as numeric, such as `">1000"`. In these cases the
+readxl functions return a data frame in whith the text values are removed, i.e.
+set to `NA`. This package avoids these data losses as it keeps the original
+(text) information and lets the user decide what to do.
 
 ## Read MS spreadsheet with readxl
 
@@ -70,8 +70,8 @@ head(data)
 ```
 
 As described above, the two empty rows on top were skipped. Also, column B was
-assumed to be numeric (double \) even though it contains a text value in row
-1004 (see image above).
+assumed to be numeric (double \) even though it contains a text value in
+row 1004 (see image above).
 
 ## Read sheet with kwb.readxl
 

From 9c59588becbde529e56e5785817a51346e8f7f1b Mon Sep 17 00:00:00 2001
From: hsonne
Date: Tue, 19 Mar 2019 11:52:59 +0100
Subject: [PATCH 3/3] Fix typo

---
 vignettes/tutorial.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/tutorial.Rmd b/vignettes/tutorial.Rmd
index 1a52775..8bbe350 100644
--- a/vignettes/tutorial.Rmd
+++ b/vignettes/tutorial.Rmd
@@ -43,7 +43,7 @@ column, the values in the first few rows of the column are inspected and based
 on the found type the whole column is then converted accordingly. This may lead
 to conversion errors, e.g. if the first few values look numeric but further
 values cannot be interpreted as numeric, such as `">1000"`. In these cases the
-readxl functions return a data frame in whith the text values are removed, i.e.
+readxl functions return a data frame in which the text values are removed, i.e.
 set to `NA`. This package avoids these data losses as it keeps the original
 (text) information and lets the user decide what to do.
 