Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return complex in-cell string formatting #5

Closed
nacnudus opened this issue Nov 6, 2016 · 12 comments
Closed

Return complex in-cell string formatting #5

nacnudus opened this issue Nov 6, 2016 · 12 comments

Comments

@nacnudus
Copy link
Owner

nacnudus commented Nov 6, 2016

When a cell contains a string, parts of which are formatted differently from other parts, the formatting is expressed within the tag of sharedStrings.xml (and might validly be expressed in within the tag in the worksheet).

For example, the first below is formatted, but in only one way, so the formatting is expressed elswhere, but the second has many different formats for different sections of the string.

<si>
  <t>mergedsubscript</t>
</si>
<si>
  <r>
    <t>in-cell</t>
  </r>
  <r>
    <rPr>
      <b/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>bold</t>
  </r>
  <r>
    <rPr>
      <i/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>italic</t>
  </r>
  <r>
    <rPr>
      <u/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>underline</t>
  </r>
  <r>
    <rPr>
      <b/>
      <i/>
      <u/>
      <sz val="11"/>
      <color theme="1"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>all-three</t>
  </r>
  <r>
    <rPr>
      <sz val="11"/>
      <color rgb="FFFF0000"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>red</t>
  </r>
  <r>
    <rPr>
      <sz val="8"/>
      <color rgb="FFFF0000"/>
      <rFont val="Calibri"/>
      <family val="2"/>
      <scheme val="minor"/>
    </rPr>
    <t>size</t>
  </r>
  <r>
    <rPr>
      <sz val="8"/>
      <color rgb="FFFF0000"/>
      <rFont val="Arial"/>
      <family val="2"/>
    </rPr>
    <t>arial multilinecustomheight</t>
  </r>
</si>

In the second case, only the string is returned, not the formatting, unless there is also formatting applied to the whole cell.

I'd like to return the formatting somehow, but I'd need a use-case to make it a priority, and some suggestions of how to structure the information. The lazy way to return it would be in its raw form, which can be done like so:

    std::string inlineString;
    rapidxml::print(std::back_inserter(inlineString), *is, 0);
@stla
Copy link

stla commented Feb 13, 2017

Hello,

This looks related to #9

Are you looking for ideas to structure the information in R ?

You could use a type richtext for multiple lines with different styles. And store the lines as lists:

xlsxdata <- data.frame(address=c("A1","B1"), data_type=c("character", "richtext"), richtext=I(list(NA, list("first line", "second line"))))

And use a list of lists instead of a list of vectors to store the fonts:

xlsxfont <- list(bold=list(FALSE, c(TRUE, FALSE)), italic=list(FALSE, c(FALSE, FALSE)))

@nacnudus
Copy link
Owner Author

Thanks for the suggestion.

I'd prefer nested data frames, since they are more easily handled by tidyverse tools. In-cell and in-comment (inline?) formatting is quite limited, so there could be a column per style, which would be NA unless applicable.

Feel free to send a PR, otherwise I will wait for more interest or a compelling use-case.

@petersmp
Copy link

I just found this as I was looking for exactly this functionality. I made a file showing the stripped down version of my use case:
example_formatting_complaint.xlsx

The file contents are:

ID Count Value Note ideal_product ideal_run ideal_category ideal_result ideal_altCat
A-1-TEST 1 2 Test was great A 1 TEST OK TEST
A-2-PRODUCTION 2 5 Released product A 2 PRODUCTION OK PRODUCTION
A-3-PRODUCTION 3 8 Flaw found after packaging; retained for testing A 3 PRODUCTION FAIL failed_PRODUCTION
B-1-TEST 4 11 Test failed B 1 TEST OK PRODUCTION

Where "PRODUCTION" (also red in the file) indicates a product run intended for production that failed (i.e., was not actually used as production). In an ideal world, the data would be normalized by having the ID separated into its component parts (product code, run index, usage category) and the change from usage would be coded (either by noting the failure or by adding an additional category). In practice, I am working from old files and would rather not have to manually code all of those runs as OK/FAIL.

For my use case, even just a text field listing all formatting used in the cell would be sufficient to identify the failed case. Better still would be the full coding (to protect against future uses) or a breakdown of which string components have each coding, e.g.:

address part character formatting
A4 1 A-3-
A4 2 PRODUCTION strikethrough

@nacnudus
Copy link
Owner Author

Thanks @petersmp, I'd call that a compelling use-case. Old files, weird formatting, and an operation to perform that shouldn't be as hard as it is.

Thinking aloud, the formatting column could be a list column, with each cell being a vector of all the formats applied to a given span of characters. So, since "PRODUCTION" is also red, it would be something like

library(tibble)

x <- tibble(order = 1:2,
            text = c("A-3-", "PRODUCTION"),
            formatting = list(character(), c("strikethrough", "red")))

x
#> # A tibble: 2 x 3
#>   order       text formatting
#>   <int>      <chr>     <list>
#> 1     1       A-3-  <chr [0]>
#> 2     2 PRODUCTION  <chr [2]>

x$formatting
#> [[1]]
#> character(0)
#> 
#> [[2]]
#> [1] "strikethrough" "red"

What do you think?

I'm afraid I'm not likely to implement anything very soon -- there's several weeks' worth of work to do under the hood to enable further development, for which I'm having to learn more C++ -- but if you need something quickly then I could create a branch with a hack to flags any cells that have any kind of complex formatting.

Also, what software created your files? Not necessarily relevant, just interested.

For reference, a part of your file:

<si>
<t xml:space="preserve">A-2-PRODUCTION</t>
</si>
<si>
<t xml:space="preserve">Released product</t>
</si>
<si>
<t xml:space="preserve">PRODUCTION</t>
</si>
<si>
<r>
<rPr>
<sz val="10"/>
<rFont val="Arial"/>
<family val="2"/>
</rPr>
<t xml:space="preserve">A-3-</t>
</r>
<r>
<rPr>
<strike val="true"/>
<sz val="10"/>
<color rgb="FFFF0000"/>
<rFont val="Arial"/>
<family val="2"/>
</rPr>
<t xml:space="preserve">PRODUCTION</t>
</r>
</si>

@petersmp
Copy link

petersmp commented Sep 29, 2017

Thank you. That sounds like a great approach, though there is no need to create the separate branch for me at this point. It will be faster for me to manually code the few cases I need for now, and I will want a stable solution before I implement wider-scale automation.

The files I will actually be using are hand-entered data in Excel. However, because I run Ubuntu, the example file that I posted was generated using LibreOffice.

@nacnudus
Copy link
Owner Author

nacnudus commented Oct 20, 2017

@petersmp This is now done. I would appreciate your thoughts on the implementation. Feel free to reopen if it doesn't work for you.

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
library(tidyxl)

xlsx_cells("~/Downloads/example_formatting_complaint.xlsx") %>%
  filter(col == 1) %>%
  select(address, character_formatted) %>%
  unnest()
#> # A tibble: 6 x 12
#>   address      character  bold italic underline  size color_rgb
#>     <chr>          <chr> <int>  <int>     <int> <dbl>     <chr>
#> 1      A1             ID    NA     NA        NA    NA      <NA>
#> 2      A2       A-1-TEST    NA     NA        NA    NA      <NA>
#> 3      A3 A-2-PRODUCTION    NA     NA        NA    NA      <NA>
#> 4      A4           A-3-     0      0         0     0      <NA>
#> 5      A4     PRODUCTION     0      0         0     0  FFFF0000
#> 6      A5       B-1-TEST    NA     NA        NA    NA      <NA>
#> # ... with 5 more variables: color_theme <int>, color_indexed <int>,
#> #   font <chr>, family <int>, scheme <chr>

@petersmp
Copy link

This looks great. In quick testing, I am seeing a few potential issues.

This does not appear to be returning the state of "strike" to indicate the strikethrough text (which, it turns out, is the formatting most relevant for my use case).

The other thing that popped up immediately when I ran this on my own dataset: because the column character_formatted is only returned for "character" datatypes, the unnesting fails completely if any column is non-character (e.g., date or blank). This can be handled on the user end (by filtering to character columns), but may be worth mentioning in the documentation (if not just generating an empty tibble for that column for non-character cells).

The other potentially misleading issue is that it does not capture the styling of the full cell. For example, if I highlight all of the text in cell A5 and highlight/strikethrough, nothing in the "character_formatted" (or other) column(s) indicates that the cell formatting has changed. That formatting is, instead, available with xlsx_formats. I think this separation is acceptable (even if less than ideal) as the user can work around those cases as needed. However, it may be worth noting it more clear in the documentation.

A final note: I am not sure what change caused this, but for me, some of the numeric columns are now returning as dates instead of numeric (columns B, C, and F). I am not sure if that is related to this commit, but I do not remember that issue occurring before (it may have). Thought I should at least point it out.

Thanks again for all the work on this. It is really, really fantastic.

@nacnudus
Copy link
Owner Author

Thanks for the prompt reply!

So I missed the strike format -- d'oh -- that's what code review is for. I'll create a proper test with all possible formats and complete the set.

Good point about the NULLs upsetting unnest(), I hadn't thought of that. It will be easy enough to remedy with an empty tibble, and will also make more sense if cell-level formatting is propagated to in-cell formatting. I confess to taking the easy option, and then back-justifying it on the basis that you might want to tell the difference between explicit in-cell formatting and the cell-level default, but that's inconsistent with the cell-level formatting where $style formats are propagated to $local formats unless overridden.

I can't reproduce the date/numeric problem. Please could you confirm that with a new issue and a reproducible example?

@petersmp
Copy link

Happy to help, and thank you for getting this turned around.

If I were building this (and not motivated by my particular use-case/work-flow), I likely would have made the same basic decisions. I think they are acceptable (and understandable) decisions, I was just hoping to see them documented more thoroughly. However, if they are easy to fix, it would improve usability.

I just submitted a new issue on the date/numeric thing ( #22 ).

nacnudus added a commit that referenced this issue Oct 20, 2017
@nacnudus
Copy link
Owner Author

Strike and all the remaining formats are now supported, and the logical vectors are actually logical instead of integer. Propagating the cell-level formats as defaults will take a bit more work, but the styles code needs to be overhauled anyway for speed, so I'll get on with it.

The more urgent issue is the date/numeric thing.

@nacnudus
Copy link
Owner Author

It turns out that providing empty dataframes instead of NULL in the character_formatted column slows down loading of large spreadsheets by about a minute, and uses a lot of memory. Instead I will mention the NULLs in the help file.

@nacnudus
Copy link
Owner Author

It seems the cell-level defaults are already applied at the in-cell level -- I made a mistake before. The file examples.xlsx Sheet1!A132 illustrates this. I'll close this issue.

At any rate it motivated me to refactor the styles parsing, which dramatically improved its performance, so it was worthwhile!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants