-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return complex in-cell string formatting #5
Comments
Hello, This looks related to #9 Are you looking for ideas to structure the information in R ? You could use a type
And use a list of lists instead of a list of vectors to store the fonts:
|
Thanks for the suggestion. I'd prefer nested data frames, since they are more easily handled by Feel free to send a PR, otherwise I will wait for more interest or a compelling use-case. |
I just found this as I was looking for exactly this functionality. I made a file showing the stripped down version of my use case: The file contents are:
Where " For my use case, even just a text field listing all formatting used in the cell would be sufficient to identify the failed case. Better still would be the full coding (to protect against future uses) or a breakdown of which string components have each coding, e.g.:
|
Thanks @petersmp, I'd call that a compelling use-case. Old files, weird formatting, and an operation to perform that shouldn't be as hard as it is. Thinking aloud, the library(tibble)
x <- tibble(order = 1:2,
text = c("A-3-", "PRODUCTION"),
formatting = list(character(), c("strikethrough", "red")))
x
#> # A tibble: 2 x 3
#> order text formatting
#> <int> <chr> <list>
#> 1 1 A-3- <chr [0]>
#> 2 2 PRODUCTION <chr [2]>
x$formatting
#> [[1]]
#> character(0)
#>
#> [[2]]
#> [1] "strikethrough" "red" What do you think? I'm afraid I'm not likely to implement anything very soon -- there's several weeks' worth of work to do under the hood to enable further development, for which I'm having to learn more C++ -- but if you need something quickly then I could create a branch with a hack to flags any cells that have any kind of complex formatting. Also, what software created your files? Not necessarily relevant, just interested. For reference, a part of your file: <si>
<t xml:space="preserve">A-2-PRODUCTION</t>
</si>
<si>
<t xml:space="preserve">Released product</t>
</si>
<si>
<t xml:space="preserve">PRODUCTION</t>
</si>
<si>
<r>
<rPr>
<sz val="10"/>
<rFont val="Arial"/>
<family val="2"/>
</rPr>
<t xml:space="preserve">A-3-</t>
</r>
<r>
<rPr>
<strike val="true"/>
<sz val="10"/>
<color rgb="FFFF0000"/>
<rFont val="Arial"/>
<family val="2"/>
</rPr>
<t xml:space="preserve">PRODUCTION</t>
</r>
</si> |
Thank you. That sounds like a great approach, though there is no need to create the separate branch for me at this point. It will be faster for me to manually code the few cases I need for now, and I will want a stable solution before I implement wider-scale automation. The files I will actually be using are hand-entered data in Excel. However, because I run Ubuntu, the example file that I posted was generated using LibreOffice. |
@petersmp This is now done. I would appreciate your thoughts on the implementation. Feel free to reopen if it doesn't work for you. library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag(): dplyr, stats
library(tidyxl)
xlsx_cells("~/Downloads/example_formatting_complaint.xlsx") %>%
filter(col == 1) %>%
select(address, character_formatted) %>%
unnest()
#> # A tibble: 6 x 12
#> address character bold italic underline size color_rgb
#> <chr> <chr> <int> <int> <int> <dbl> <chr>
#> 1 A1 ID NA NA NA NA <NA>
#> 2 A2 A-1-TEST NA NA NA NA <NA>
#> 3 A3 A-2-PRODUCTION NA NA NA NA <NA>
#> 4 A4 A-3- 0 0 0 0 <NA>
#> 5 A4 PRODUCTION 0 0 0 0 FFFF0000
#> 6 A5 B-1-TEST NA NA NA NA <NA>
#> # ... with 5 more variables: color_theme <int>, color_indexed <int>,
#> # font <chr>, family <int>, scheme <chr> |
This looks great. In quick testing, I am seeing a few potential issues. This does not appear to be returning the state of "strike" to indicate the strikethrough text (which, it turns out, is the formatting most relevant for my use case). The other thing that popped up immediately when I ran this on my own dataset: because the column The other potentially misleading issue is that it does not capture the styling of the full cell. For example, if I highlight all of the text in cell A5 and highlight/strikethrough, nothing in the "character_formatted" (or other) column(s) indicates that the cell formatting has changed. That formatting is, instead, available with A final note: I am not sure what change caused this, but for me, some of the numeric columns are now returning as dates instead of numeric (columns B, C, and F). I am not sure if that is related to this commit, but I do not remember that issue occurring before (it may have). Thought I should at least point it out. Thanks again for all the work on this. It is really, really fantastic. |
Thanks for the prompt reply! So I missed the strike format -- d'oh -- that's what code review is for. I'll create a proper test with all possible formats and complete the set. Good point about the NULLs upsetting I can't reproduce the date/numeric problem. Please could you confirm that with a new issue and a reproducible example? |
Happy to help, and thank you for getting this turned around. If I were building this (and not motivated by my particular use-case/work-flow), I likely would have made the same basic decisions. I think they are acceptable (and understandable) decisions, I was just hoping to see them documented more thoroughly. However, if they are easy to fix, it would improve usability. I just submitted a new issue on the date/numeric thing ( #22 ). |
Strike and all the remaining formats are now supported, and the logical vectors are actually logical instead of integer. Propagating the cell-level formats as defaults will take a bit more work, but the styles code needs to be overhauled anyway for speed, so I'll get on with it. The more urgent issue is the date/numeric thing. |
It turns out that providing empty dataframes instead of |
It seems the cell-level defaults are already applied at the in-cell level -- I made a mistake before. The file examples.xlsx At any rate it motivated me to refactor the styles parsing, which dramatically improved its performance, so it was worthwhile! |
When a cell contains a string, parts of which are formatted differently from other parts, the formatting is expressed within the tag of sharedStrings.xml (and might validly be expressed in within the tag in the worksheet).
For example, the first below is formatted, but in only one way, so the formatting is expressed elswhere, but the second has many different formats for different sections of the string.
In the second case, only the string is returned, not the formatting, unless there is also formatting applied to the whole cell.
I'd like to return the formatting somehow, but I'd need a use-case to make it a priority, and some suggestions of how to structure the information. The lazy way to return it would be in its raw form, which can be done like so:
std::string inlineString; rapidxml::print(std::back_inserter(inlineString), *is, 0);
The text was updated successfully, but these errors were encountered: