Import in-cell formatting (fixes #5)
The new column `character_formatted` is a list-column of data frames, one per
cell. Each row describes one substring and its formatting. Where a format is
`NA`, the overall cell format applies.
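As a rough illustration of the structure described in the commit message, the standalone C++ sketch below models the substrings of one cell with optional formats, and falls back to the cell-level format wherever a substring's own format is missing. The names `Run`, `CellFormat`, `effective_bold`, `effective_size` and `run_example` are hypothetical and not part of tidyxl; `std::optional` (C++17) stands in for R's `NA`.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for one row of a `character_formatted` data frame.
struct Run {
  std::string character;       // the substring itself
  std::optional<bool> bold;    // std::nullopt plays the role of NA
  std::optional<double> size;  // std::nullopt plays the role of NA
};

// Hypothetical stand-in for the overall format of the cell.
struct CellFormat {
  bool bold;
  double size;
};

// Use the substring's own value when present, else the cell's value.
inline bool effective_bold(const Run& run, const CellFormat& cell) {
  return run.bold.value_or(cell.bold);
}

inline double effective_size(const Run& run, const CellFormat& cell) {
  return run.size.value_or(cell.size);
}

// Small self-check: a cell whose second substring overrides only `bold`.
inline void run_example() {
  const CellFormat cell{false, 11.0};
  const std::vector<Run> runs = {
      {"in-cell ", std::nullopt, std::nullopt},
      {"bold", true, std::nullopt},
  };
  assert(!effective_bold(runs[0], cell));
  assert(effective_bold(runs[1], cell));
  assert(effective_size(runs[1], cell) == 11.0);
}
```

Calling `run_example()` from a `main` function exercises the fallback rule. The same rule would apply to any other format column, such as colour or font.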
nacnudus committed Oct 20, 2017
1 parent 662e385 commit 2ed15e3
Showing 11 changed files with 201 additions and 26 deletions.
1 change: 1 addition & 0 deletions NEWS.md
@@ -17,6 +17,7 @@
* Provides a utility function `is_range()` to check whether formulas are simply
ranges of cells.
* Returns formatting of alignment and cell protection (#20).
* Returns in-cell formatting of strings (#5).

# tidyxl 0.2.3

13 changes: 13 additions & 0 deletions R/tidy_xlsx.R
@@ -83,6 +83,8 @@
#' * `numeric` The numeric value of a cell.
#' * `date` The date value of a cell.
#' * `character` The string value of a cell.
#' * `character_formatted` A data frame of substrings and their individual
#' formatting.
#' * `formula` The formula in a cell (see 'Details').
#' * `is_array` Whether or not the formula is an array formula.
#' * `formula_ref` The address of a range of cells group to which an array
@@ -140,6 +142,12 @@
#' the RGB values in a spreadsheet program (e.g. Excel, LibreOffice,
#' Gnumeric), and use the [grDevices::rgb()] function to convert these to a
#' hexadecimal string.
#'
#' Strings can be formatted within a cell, so that a single cell can contain
#' substrings with different formatting. This in-cell formatting is available
#' in the column `character_formatted`, which is a list-column of data frames.
#' Each row of each data frame describes a substring and its formatting.
#' When a particular format is `NA`, the overall cell format applies.
#' }
#'
#' @export
@@ -171,6 +179,11 @@
#' # the relevant indices, and then filter the cells by those indices.
#' bold_indices <- which(x$formats$local$font$bold)
#' Sheet1[Sheet1$local_format_id %in% bold_indices, ]
#'
#' # In-cell formatting is available in the `character_formatted` column as a
#' # data frame, one row per substring. When a particular format is `NA`, the
#' # overall cell format applies.
#' tidy_xlsx(examples)$data$Sheet1$character_formatted[77]
#' }
tidy_xlsx <- function(path, sheets = NA) {
.Deprecated(msg = paste("'tidy_xlsx()' is deprecated.",
11 changes: 11 additions & 0 deletions R/xlsx_cells.R
@@ -99,6 +99,12 @@
#' it as an index into the format structure. E.g. to look up the font size,
#' `my_formats$local$font$size[local_format_id]`. To see all available
#' formats, type `str(my_formats$local)`.
#'
#' Strings can be formatted within a cell, so that a single cell can contain
#' substrings with different formatting. This in-cell formatting is available
#' in the column `character_formatted`, which is a list-column of data frames.
#' Each row of each data frame describes a substring and its formatting.
#' When a particular format is `NA`, the overall cell format applies.
#' }
#'
#' @export
@@ -124,6 +130,11 @@
#' # the relevant indices, and then filter the cells by those indices.
#' bold_indices <- which(formats$local$font$bold)
#' Sheet1[Sheet1$local_format_id %in% bold_indices, ]
#'
#' # In-cell formatting is available in the `character_formatted` column as a
#' # data frame, one row per substring. When a particular format is `NA`, the
#' # overall cell format applies.
#' xlsx_cells(examples)$character_formatted[77]
xlsx_cells <- function(path, sheets = NA) {
path <- check_file(path)
all_sheets <- utils_xlsx_sheet_files(path)
Binary file modified inst/extdata/examples.xlsx
Binary file not shown.
16 changes: 14 additions & 2 deletions man/tidy_xlsx.Rd

Some generated files are not rendered by default.

14 changes: 12 additions & 2 deletions man/xlsx_cells.Rd

Some generated files are not rendered by default.

119 changes: 119 additions & 0 deletions src/utils.h
@@ -3,6 +3,7 @@

#include <Rcpp.h>
#include "rapidxml.h"
#include "xlsxstyles.h"
#include <R_ext/GraphicsDevice.h> // Rf_ucstoutf8 is exported in R_ext/GraphicsDevice.h

// Follow tidyverse/readxl
@@ -145,4 +146,122 @@ inline bool parseString(const rapidxml::xml_node<>* string, std::string& out) {
return found;
}

// Return a dataframe, one row per substring, columns for formatting
inline Rcpp::List parseFormattedString(
const rapidxml::xml_node<>* string,
xlsxstyles& styles) {
int n(0); // number of substrings
for (rapidxml::xml_node<>* node = string->first_node();
node != NULL;
node = node->next_sibling()) {
++n;
}

std::vector<std::string> character;
std::vector<int> bold;
std::vector<int> italic;
std::vector<int> underline;
std::vector<double> size;
Rcpp::CharacterVector color_rgb(n, NA_STRING);
std::vector<int> color_theme;
std::vector<int> color_indexed;
Rcpp::CharacterVector font(n, NA_STRING);
std::vector<int> family;
Rcpp::CharacterVector scheme(n, NA_STRING);

int i(0); // ith substring
for (rapidxml::xml_node<>* node = string->first_node();
node != NULL;
node = node->next_sibling()) {
std::string node_name = node->name();
if (node_name == "t") {
character.push_back(node->value());
bold.push_back(NA_LOGICAL);
italic.push_back(NA_LOGICAL);
underline.push_back(NA_LOGICAL);
size.push_back(NA_REAL);
color_theme.push_back(NA_INTEGER);
color_indexed.push_back(NA_INTEGER);
family.push_back(NA_INTEGER);
} else {
character.push_back(node->first_node("t")->value());
rapidxml::xml_node<>* rPr = node->first_node("rPr");
if (rPr != NULL) {
bold.push_back(rPr->first_node("b") != NULL);
italic.push_back(rPr->first_node("i") != NULL);
underline.push_back(rPr->first_node("u") != NULL);
rapidxml::xml_node<>* sz = rPr->first_node("sz");
if (sz != NULL) {
size.push_back(strtod(sz->value(), NULL));
} else {
size.push_back(NA_REAL);
}
rapidxml::xml_node<>* color = rPr->first_node("color");
if (color != NULL) {
rapidxml::xml_attribute<>* color_attr = color->first_attribute();
// Guard against a <color> element that has no attributes
std::string attr_name = (color_attr != NULL) ? color_attr->name() : "";
if (attr_name == "rgb") {
color_rgb[i] = color_attr->value();
color_theme.push_back(NA_INTEGER);
color_indexed.push_back(NA_INTEGER);
} else if (attr_name == "theme") {
int theme_int = strtol(color_attr->value(), NULL, 10) + 1;
color_rgb[i] = styles.theme_[theme_int - 1];
color_theme.push_back(theme_int);
color_indexed.push_back(NA_INTEGER);
} else if (attr_name == "indexed") {
// I haven't been able to force this in real life
int indexed_int = strtol(color_attr->value(), NULL, 10) + 1;
color_rgb[i] = styles.indexed_[indexed_int - 1];
color_theme.push_back(NA_INTEGER);
color_indexed.push_back(indexed_int);
} else {
// Any other attribute (e.g. "auto"), or none: push NA so that every
// column still gets exactly one entry for this substring
color_theme.push_back(NA_INTEGER);
color_indexed.push_back(NA_INTEGER);
}
} else {
color_theme.push_back(NA_INTEGER);
color_indexed.push_back(NA_INTEGER);
}
rapidxml::xml_node<>* rFont = rPr->first_node("rFont");
if (rFont != NULL) {
font[i] = rFont->first_attribute("val")->value();
}
rapidxml::xml_node<>* family_node = rPr->first_node("family");
if (family_node != NULL) {
family.push_back(strtol(family_node->first_attribute("val")->value(), NULL, 10));
} else {
family.push_back(NA_INTEGER);
}
rapidxml::xml_node<>* scheme_node = rPr->first_node("scheme");
if (scheme_node != NULL) {
scheme[i] = scheme_node->first_attribute("val")->value();
}
} else {
bold.push_back(NA_LOGICAL);
italic.push_back(NA_LOGICAL);
underline.push_back(NA_LOGICAL);
size.push_back(NA_REAL);
color_theme.push_back(NA_INTEGER);
color_indexed.push_back(NA_INTEGER);
family.push_back(NA_INTEGER);
}
}
++i;
}
Rcpp::List out = Rcpp::List::create(
Rcpp::_["character"] = character,
Rcpp::_["bold"] = bold,
Rcpp::_["italic"] = italic,
Rcpp::_["underline"] = underline,
Rcpp::_["size"] = size,
Rcpp::_["color_rgb"] = color_rgb,
Rcpp::_["color_theme"] = color_theme,
Rcpp::_["color_indexed"] = color_indexed,
Rcpp::_["font"] = font,
Rcpp::_["family"] = family,
Rcpp::_["scheme"] = scheme);
// Turn list of vectors into a data frame without checking anything
out.attr("class") = Rcpp::CharacterVector::create("tbl_df", "tbl", "data.frame");
out.attr("row.names") = Rcpp::IntegerVector::create(NA_INTEGER, -n); // Compact row names: c(NA_integer_, -n) tells R there are n automatic row names
return out;
}

#endif
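`parseFormattedString()` builds its data frame column by column, which only yields a valid data frame if every branch appends exactly one value to every column. One way to make that invariant harder to break is to append whole rows through a single function. The sketch below is standalone C++ without Rcpp or rapidxml; `RunColumns`, `push_row`, `is_rectangular`, `to_one_based` and the `NA_INT` sentinel are hypothetical names for illustration, not tidyxl's code, and only three of the columns are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical sentinel for a missing integer (R itself stores NA_integer_
// as the smallest int value).
const int NA_INT = std::numeric_limits<int>::min();

// Column-oriented storage for the substrings of one cell.
struct RunColumns {
  std::vector<std::string> character;
  std::vector<int> bold;         // 0, 1 or NA_INT
  std::vector<int> color_theme;  // 1-based theme index or NA_INT

  // Append one substring; every column grows by exactly one element.
  void push_row(const std::string& text, int is_bold = NA_INT,
                int theme = NA_INT) {
    character.push_back(text);
    bold.push_back(is_bold);
    color_theme.push_back(theme);
  }

  std::size_t nrow() const { return character.size(); }

  // True when all columns have the same length, as a data frame requires.
  bool is_rectangular() const {
    return bold.size() == nrow() && color_theme.size() == nrow();
  }
};

// Convert a 0-based index read from the file into the 1-based index that is
// reported to R, mirroring the `+ 1` applied to theme and indexed colours.
inline int to_one_based(int zero_based) { return zero_based + 1; }
```

With this layout a forgotten `push_back` in one branch cannot desynchronise the columns, because no branch touches a column directly. The trade-off is that each row must be fully known before it is appended.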
48 changes: 27 additions & 21 deletions src/xlsxbook.cpp
@@ -68,6 +68,9 @@ void xlsxbook::cacheStrings() {
std::string out;
parseString(string, out); // missing strings are treated as empty ""
strings_.push_back(out);

Rcpp::List out_df = parseFormattedString(string, styles_);
strings_formatted_.push_back(out_df);
}
}

@@ -147,6 +150,7 @@ void xlsxbook::initializeColumns() {
formula_ref_ = CharacterVector(cellcount_, NA_STRING);
formula_group_ = IntegerVector(cellcount_, NA_INTEGER);
comment_ = CharacterVector(cellcount_, NA_STRING);
character_formatted_ = List(cellcount_);
height_ = NumericVector(cellcount_, NA_REAL);
width_ = NumericVector(cellcount_, NA_REAL);
style_format_ = CharacterVector(cellcount_, NA_STRING);
@@ -200,7 +204,7 @@ void xlsxbook::cacheInformation() {
/* _["style_format"] = style_format_, */
/* _["local_format_id"] = local_format_id_); */

-information_ = List(20);
+information_ = List(21);
information_[0] = sheet_;
information_[1] = address_;
information_[2] = row_;
@@ -212,17 +216,18 @@ void xlsxbook::cacheInformation() {
information_[8] = numeric_;
information_[9] = date_;
information_[10] = character_;
-information_[11] = formula_;
-information_[12] = is_array_;
-information_[13] = formula_ref_;
-information_[14] = formula_group_;
-information_[15] = comment_;
-information_[16] = height_;
-information_[17] = width_;
-information_[18] = style_format_;
-information_[19] = local_format_id_;
-
-std::vector<std::string> names(20);
+information_[11] = character_formatted_;
+information_[12] = formula_;
+information_[13] = is_array_;
+information_[14] = formula_ref_;
+information_[15] = formula_group_;
+information_[16] = comment_;
+information_[17] = height_;
+information_[18] = width_;
+information_[19] = style_format_;
+information_[20] = local_format_id_;
+
+std::vector<std::string> names(21);
names[0] = "sheet";
names[1] = "address";
names[2] = "row";
@@ -234,15 +239,16 @@ void xlsxbook::cacheInformation() {
names[8] = "numeric";
names[9] = "date";
names[10] = "character";
-names[11] = "formula";
-names[12] = "is_array";
-names[13] = "formula_ref";
-names[14] = "formula_group";
-names[15] = "comment";
-names[16] = "height";
-names[17] = "width";
-names[18] = "style_format";
-names[19] = "local_format_id";
+names[11] = "character_formatted";
+names[12] = "formula";
+names[13] = "is_array";
+names[14] = "formula_ref";
+names[15] = "formula_group";
+names[16] = "comment";
+names[17] = "height";
+names[18] = "width";
+names[19] = "style_format";
+names[20] = "local_format_id";

information_.attr("names") = names;

4 changes: 3 additions & 1 deletion src/xlsxbook.h
@@ -15,7 +15,8 @@ class xlsxbook {
Rcpp::CharacterVector sheet_names_; // worksheet names
Rcpp::CharacterVector comments_paths_; // comments files
std::vector<std::string> strings_; // strings table

std::vector<Rcpp::List> strings_formatted_; // strings with inline formatting
// list of data frames
xlsxstyles styles_;

int dateSystem_; // 1900 or 1904
@@ -44,6 +45,7 @@ class xlsxbook {
Rcpp::CharacterVector formula_ref_; // If present
Rcpp::IntegerVector formula_group_; // If present
Rcpp::CharacterVector comment_;
Rcpp::List character_formatted_; // list of data frames, one per cell
Rcpp::NumericVector height_; // Provided to cell constructor
Rcpp::NumericVector width_; // Provided to cell constructor
Rcpp::CharacterVector style_format_; // cellXfs xfId links to cellStyleXfs entry
1 change: 1 addition & 0 deletions src/xlsxcell.cpp
@@ -138,6 +138,7 @@ void xlsxcell::cacheValue(
// into the string table.
book.data_type_[i] = "character";
SET_STRING_ELT(book.character_, i, Rf_mkCharCE(book.strings_[strtol(vvalue.c_str(), NULL, 10)].c_str(), CE_UTF8));
book.character_formatted_[i] = book.strings_formatted_[strtol(vvalue.c_str(), NULL, 10)];
return;
} else if (tvalue == "str") {
// Formula, which could have evaluated to anything, so only a string is safe
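The lookup in this hunk relies on the shared-strings design of xlsx files: a string cell stores only an index into a workbook-wide table, so the plain and formatted versions can each be cached once and fetched by that index. A minimal standalone sketch follows, assuming hypothetical names (`StringTable`, `lookup`) and adding validation of the index, which the code in this commit does not perform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the workbook's cached string table.
struct StringTable {
  std::vector<std::string> plain;
};

// Parse the cell's value text as a base-10 index and fetch the string.
// Returns std::nullopt if the text is not a whole number or is out of range.
inline std::optional<std::string> lookup(const StringTable& table,
                                         const std::string& vvalue) {
  char* end = nullptr;
  const long index = std::strtol(vvalue.c_str(), &end, 10);
  if (end == vvalue.c_str() || *end != '\0') {
    return std::nullopt;  // empty text or trailing non-digits
  }
  if (index < 0 || static_cast<std::size_t>(index) >= table.plain.size()) {
    return std::nullopt;  // index does not point into the table
  }
  return table.plain[static_cast<std::size_t>(index)];
}
```

A second vector of formatted entries could be indexed in the same way, so that parsing the index once serves both lookups.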
Binary file modified tests/testthat/examples.xlsx
Binary file not shown.
