Return complex in-cell string formatting #5

nacnudus · 2016-11-06T13:06:04Z

When a cell contains a string, parts of which are formatted differently from other parts, the formatting is expressed within the tag of sharedStrings.xml (and might validly be expressed in within the tag in the worksheet).

For example, the first below is formatted, but in only one way, so the formatting is expressed elswhere, but the second has many different formats for different sections of the string.

<si>
  <t>mergedsubscript</t>
</si>
<si>
  <r>
    <t>in-cell</t>
  </r>
  <r>
    <rPr>
      <b/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>bold</t>
  </r>
  <r>
    <rPr>
      <i/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>italic</t>
  </r>
  <r>
    <rPr>
      <u/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>underline</t>
  </r>
  <r>
    <rPr>
      <b/>
      <i/>
      <u/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>all-three</t>
  </r>
  <r>
    <rPr>
      <sz val="11"/>
      <color rgb="FFFF0000"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>red</t>
  </r>
  <r>
    <rPr>
      <sz val="8"/>
      <color rgb="FFFF0000"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>size</t>
  </r>
  <r>
    <rPr>
      <sz val="8"/>
      <color rgb="FFFF0000"/>
      <rFont val="Arial"/>
      <family val="2"/>
    </rPr>
    <t>arial multilinecustomheight</t>
  </r>
</si>

In the second case, only the string is returned, not the formatting, unless there is also formatting applied to the whole cell.

I'd like to return the formatting somehow, but I'd need a use-case to make it a priority, and some suggestions of how to structure the information. The lazy way to return it would be in its raw form, which can be done like so:

    std::string inlineString;
    rapidxml::print(std::back_inserter(inlineString), *is, 0);

stla · 2017-02-13T19:53:44Z

Hello,

This looks related to #9

Are you looking for ideas to structure the information in R ?

You could use a type richtext for multiple lines with different styles. And store the lines as lists:

xlsxdata <- data.frame(address=c("A1","B1"), data_type=c("character", "richtext"), richtext=I(list(NA, list("first line", "second line"))))

And use a list of lists instead of a list of vectors to store the fonts:

xlsxfont <- list(bold=list(FALSE, c(TRUE, FALSE)), italic=list(FALSE, c(FALSE, FALSE)))

nacnudus · 2017-02-13T20:49:46Z

Thanks for the suggestion.

I'd prefer nested data frames, since they are more easily handled by tidyverse tools. In-cell and in-comment (inline?) formatting is quite limited, so there could be a column per style, which would be NA unless applicable.

Feel free to send a PR, otherwise I will wait for more interest or a compelling use-case.

petersmp · 2017-09-29T18:18:20Z

I just found this as I was looking for exactly this functionality. I made a file showing the stripped down version of my use case:
example_formatting_complaint.xlsx

The file contents are:

ID	Count	Value	Note	ideal_product	ideal_run	ideal_category	ideal_result	ideal_altCat
A-1-TEST	1	2	Test was great	A	1	TEST	OK	TEST
A-2-PRODUCTION	2	5	Released product	A	2	PRODUCTION	OK	PRODUCTION
A-3-~~PRODUCTION~~	3	8	Flaw found after packaging; retained for testing	A	3	PRODUCTION	FAIL	failed_PRODUCTION
B-1-TEST	4	11	Test failed	B	1	TEST	OK	PRODUCTION

Where "~~PRODUCTION~~" (also red in the file) indicates a product run intended for production that failed (i.e., was not actually used as production). In an ideal world, the data would be normalized by having the ID separated into its component parts (product code, run index, usage category) and the change from usage would be coded (either by noting the failure or by adding an additional category). In practice, I am working from old files and would rather not have to manually code all of those runs as OK/FAIL.

For my use case, even just a text field listing all formatting used in the cell would be sufficient to identify the failed case. Better still would be the full coding (to protect against future uses) or a breakdown of which string components have each coding, e.g.:

address	part	character	formatting
A4	1	A-3-
A4	2	PRODUCTION	strikethrough

nacnudus · 2017-09-29T18:50:24Z

Thanks @petersmp, I'd call that a compelling use-case. Old files, weird formatting, and an operation to perform that shouldn't be as hard as it is.

Thinking aloud, the formatting column could be a list column, with each cell being a vector of all the formats applied to a given span of characters. So, since "~~PRODUCTION~~" is also red, it would be something like

library(tibble)

x <- tibble(order = 1:2,
            text = c("A-3-", "PRODUCTION"),
            formatting = list(character(), c("strikethrough", "red")))

x
#> # A tibble: 2 x 3
#>   order       text formatting
#>   <int>      <chr>     <list>
#> 1     1       A-3-  <chr [0]>
#> 2     2 PRODUCTION  <chr [2]>

x$formatting
#> [[1]]
#> character(0)
#> 
#> [[2]]
#> [1] "strikethrough" "red"

What do you think?

I'm afraid I'm not likely to implement anything very soon -- there's several weeks' worth of work to do under the hood to enable further development, for which I'm having to learn more C++ -- but if you need something quickly then I could create a branch with a hack to flags any cells that have any kind of complex formatting.

Also, what software created your files? Not necessarily relevant, just interested.

For reference, a part of your file:

<si>
<t xml:space="preserve">A-2-PRODUCTION</t>
</si>
<si>
<t xml:space="preserve">Released product</t>
</si>
<si>
<t xml:space="preserve">PRODUCTION</t>
</si>
<si>
<r>
<rPr>
<sz val="10"/>
<rFont val="Arial"/>
<family val="2"/>
</rPr>
<t xml:space="preserve">A-3-</t>
</r>
<r>
<rPr>
<strike val="true"/>
<sz val="10"/>
<color rgb="FFFF0000"/>
<rFont val="Arial"/>
<family val="2"/>
</rPr>
<t xml:space="preserve">PRODUCTION</t>
</r>
</si>

petersmp · 2017-09-29T19:17:15Z

Thank you. That sounds like a great approach, though there is no need to create the separate branch for me at this point. It will be faster for me to manually code the few cases I need for now, and I will want a stable solution before I implement wider-scale automation.

The files I will actually be using are hand-entered data in Excel. However, because I run Ubuntu, the example file that I posted was generated using LibreOffice.

nacnudus · 2017-10-20T13:30:27Z

@petersmp This is now done. I would appreciate your thoughts on the implementation. Feel free to reopen if it doesn't work for you.

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
library(tidyxl)

xlsx_cells("~/Downloads/example_formatting_complaint.xlsx") %>%
  filter(col == 1) %>%
  select(address, character_formatted) %>%
  unnest()
#> # A tibble: 6 x 12
#>   address      character  bold italic underline  size color_rgb
#>     <chr>          <chr> <int>  <int>     <int> <dbl>     <chr>
#> 1      A1             ID    NA     NA        NA    NA      <NA>
#> 2      A2       A-1-TEST    NA     NA        NA    NA      <NA>
#> 3      A3 A-2-PRODUCTION    NA     NA        NA    NA      <NA>
#> 4      A4           A-3-     0      0         0     0      <NA>
#> 5      A4     PRODUCTION     0      0         0     0  FFFF0000
#> 6      A5       B-1-TEST    NA     NA        NA    NA      <NA>
#> # ... with 5 more variables: color_theme <int>, color_indexed <int>,
#> #   font <chr>, family <int>, scheme <chr>

petersmp · 2017-10-20T15:41:58Z

This looks great. In quick testing, I am seeing a few potential issues.

This does not appear to be returning the state of "strike" to indicate the strikethrough text (which, it turns out, is the formatting most relevant for my use case).

The other thing that popped up immediately when I ran this on my own dataset: because the column character_formatted is only returned for "character" datatypes, the unnesting fails completely if any column is non-character (e.g., date or blank). This can be handled on the user end (by filtering to character columns), but may be worth mentioning in the documentation (if not just generating an empty tibble for that column for non-character cells).

The other potentially misleading issue is that it does not capture the styling of the full cell. For example, if I highlight all of the text in cell A5 and highlight/strikethrough, nothing in the "character_formatted" (or other) column(s) indicates that the cell formatting has changed. That formatting is, instead, available with xlsx_formats. I think this separation is acceptable (even if less than ideal) as the user can work around those cases as needed. However, it may be worth noting it more clear in the documentation.

A final note: I am not sure what change caused this, but for me, some of the numeric columns are now returning as dates instead of numeric (columns B, C, and F). I am not sure if that is related to this commit, but I do not remember that issue occurring before (it may have). Thought I should at least point it out.

Thanks again for all the work on this. It is really, really fantastic.

nacnudus · 2017-10-20T16:16:43Z

Thanks for the prompt reply!

So I missed the strike format -- d'oh -- that's what code review is for. I'll create a proper test with all possible formats and complete the set.

Good point about the NULLs upsetting unnest(), I hadn't thought of that. It will be easy enough to remedy with an empty tibble, and will also make more sense if cell-level formatting is propagated to in-cell formatting. I confess to taking the easy option, and then back-justifying it on the basis that you might want to tell the difference between explicit in-cell formatting and the cell-level default, but that's inconsistent with the cell-level formatting where $style formats are propagated to $local formats unless overridden.

I can't reproduce the date/numeric problem. Please could you confirm that with a new issue and a reproducible example?

petersmp · 2017-10-20T16:35:53Z

Happy to help, and thank you for getting this turned around.

If I were building this (and not motivated by my particular use-case/work-flow), I likely would have made the same basic decisions. I think they are acceptable (and understandable) decisions, I was just hoping to see them documented more thoroughly. However, if they are easy to fix, it would improve usability.

I just submitted a new issue on the date/numeric thing ( #22 ).

nacnudus · 2017-10-20T22:23:48Z

Strike and all the remaining formats are now supported, and the logical vectors are actually logical instead of integer. Propagating the cell-level formats as defaults will take a bit more work, but the styles code needs to be overhauled anyway for speed, so I'll get on with it.

The more urgent issue is the date/numeric thing.

nacnudus · 2017-10-22T16:14:34Z

It turns out that providing empty dataframes instead of NULL in the character_formatted column slows down loading of large spreadsheets by about a minute, and uses a lot of memory. Instead I will mention the NULLs in the help file.

nacnudus · 2017-10-27T18:32:07Z

It seems the cell-level defaults are already applied at the in-cell level -- I made a mistake before. The file examples.xlsx Sheet1!A132 illustrates this. I'll close this issue.

At any rate it motivated me to refactor the styles parsing, which dramatically improved its performance, so it was worthwhile!

nacnudus added enhancement help wanted labels Nov 6, 2016

nacnudus mentioned this issue Feb 13, 2017

Multiple lines comment #9

Closed

nacnudus closed this as completed in 2ed15e3 Oct 20, 2017

nacnudus reopened this Oct 20, 2017

petersmp mentioned this issue Oct 20, 2017

Numeric columns rendering as dates #22

Closed

nacnudus added a commit that referenced this issue Oct 20, 2017

Support all in-cell formats (#5)

fbd752b

nacnudus added a commit that referenced this issue Oct 22, 2017

Document NULLs in the character_formatted column (#5)

77806b0

nacnudus closed this as completed Oct 27, 2017

daniel-fahey mentioned this issue May 22, 2018

Incompatible Excel theme #32

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return complex in-cell string formatting #5

Return complex in-cell string formatting #5

nacnudus commented Nov 6, 2016

stla commented Feb 13, 2017

nacnudus commented Feb 13, 2017

petersmp commented Sep 29, 2017

nacnudus commented Sep 29, 2017

petersmp commented Sep 29, 2017 •

edited

Loading

nacnudus commented Oct 20, 2017 •

edited

Loading

petersmp commented Oct 20, 2017

nacnudus commented Oct 20, 2017

petersmp commented Oct 20, 2017

nacnudus commented Oct 20, 2017

nacnudus commented Oct 22, 2017

nacnudus commented Oct 27, 2017

Return complex in-cell string formatting #5

Return complex in-cell string formatting #5

Comments

nacnudus commented Nov 6, 2016

stla commented Feb 13, 2017

nacnudus commented Feb 13, 2017

petersmp commented Sep 29, 2017

nacnudus commented Sep 29, 2017

petersmp commented Sep 29, 2017 • edited Loading

nacnudus commented Oct 20, 2017 • edited Loading

petersmp commented Oct 20, 2017

nacnudus commented Oct 20, 2017

petersmp commented Oct 20, 2017

nacnudus commented Oct 20, 2017

nacnudus commented Oct 22, 2017

nacnudus commented Oct 27, 2017

petersmp commented Sep 29, 2017 •

edited

Loading

nacnudus commented Oct 20, 2017 •

edited

Loading