Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read.xlsx(detectDates = TRUE) failing #288

Closed
jmbarbone opened this issue Nov 7, 2021 · 10 comments · Fixed by #291
Closed

read.xlsx(detectDates = TRUE) failing #288

jmbarbone opened this issue Nov 7, 2021 · 10 comments · Fixed by #291

Comments

@jmbarbone
Copy link
Contributor

Reporting again here as we still have this error

openxlsx::read.xlsx("C:/Users/jmbar/Downloads/bad.xlsx", detectDates = TRUE)
#> Error in read_workbook(cols_in = cell_cols, rows_in = cell_rows, v = v, : basic_string::substr: __pos (which is 8) > this->size() (which is 7)

Created on 2021-11-06 by the reprex package (v2.0.1)

Original

I can confirm the issue is still there. See the file attached, opening it with detectDates=TRUE raises an error:

Error reading date:
44489.4
row: 1
col: 1
Error in read_workbook(cols_in = cell_cols, rows_in = cell_rows, v = v,  : 
  basic_string::substr: __pos (which is 8) > this->size() (which is 7)

bad.xlsx

Originally posted by @aushev in awalker89/openxlsx#249 (comment)

@jmbarbone jmbarbone changed the title read_workbook(detectDates = TRUE) failing read.xlsx(detectDates = TRUE) failing Nov 7, 2021
JanMarvin added a commit to JanMarvin/openxlsx that referenced this issue Nov 8, 2021
JanMarvin added a commit to JanMarvin/openxlsx that referenced this issue Nov 8, 2021
JanMarvin added a commit to JanMarvin/openxlsx that referenced this issue Dec 15, 2021
JanMarvin added a commit that referenced this issue Jan 3, 2022
@ProfFancyPants
Copy link

I have an additional file doing the exact same thing. Would you like me to upload a minimum example for experimental purposes?

@JanMarvin
Copy link
Collaborator

Hi @ProfFancyPants , I assume that we understand fairly well what is going on, the question is more or less, why does it happen. I have pushed a fix to the development branch. Please see if this fixes your issue. Though I assume it is only partially right, it should fix the issue, but it is a hack - solving and hiding a problem that should be fixed somewhere else.

@ProfFancyPants
Copy link

I checked it with the development branch and the issue still remains exactly as before. I was able to get the reader to do some additional interesting things when I deleted choice cells in the date column where it was loading but actually removing values in other columns. If the bulk of this issue isn't in Apache I could help you take a look. What is so bafilling is that doing a complete copy value paste stops it completely, even with restoring all the previous formatting. My assumption was that there is a hidden or exotic character that looks exactly like the normal character but gets coerced back when value pasted.

@JanMarvin
Copy link
Collaborator

If the bulk of this issue isn't in Apache I could help you take a look.

I don't understand what Apache has to do with this issue. When I looked into the issue I've attempted to fix, we expected a string like "2022-03-01", but somehow still had a 7 character wide numeric like "11111.1". Therefore when looking for the part "-01" we fail and the error is thrown. Substring beginning at 8 requested, but only 7 characters provided.
My fix checked for the numeric and initiated a conversion from numeric to date. After this the string is long enough. Therefore, I assume this has already been fixed. Unknown to me is why we ended in this situation in the first place. The symptoms can be treated, but they are not the root of the evil.

If you want to look into this, you're ofc welcome :)

@ProfFancyPants
Copy link

ProfFancyPants commented Mar 1, 2022

I think the issue might be deeper and is related to how excel is saving the .XML files internally. In my test file, two end-user identical sheets down to the simplest reproducible example. By "end-user identical" I mean identical as far as it concerns someone using excel and using everything physically to make the two sheets identical. Divergence in the .XML files is between column and row style.
image

Exactly end-user identical sheets are being saved as different styles in XML. Once "_openxlsx_loadworksheets" reads in styleObjects(styleObjectsSEXP) and xmlFiles(xmlFilesSEXP) any manner of things could be happening, and it doesn't mean that loadworksheets is making the wrong choices based on what the row's "s=X" and "<v>X</v>" is telling it to do. Also, I haven't quite decerned where all the styleObjects are coming from because some aren't in the workbook anymore.

The worst part is that formatting order seems to matter. If I format the date field to one thing and change it back the .XML files aren't identical.

@JanMarvin
Copy link
Collaborator

Sorry, but I don't really get your point. The styles are from styles.xml, some in the <xf.../> and some custom formats are in <numfmt ...>. openxlsx creates styleObjects when loading and saving. Therefore our style IDs must not match those of Excel. The question to solve in this issue is: why is openxlsx currently identifying some cells as dates and why aren't they prepared correctly. Somewhere in the loading process we skip a date creation step. In Excel they are numerics with styles, we convert them from numerics to date/POSIX strings and try to use some substrings for another round of date creation. The entire process is a bit dubious and cannot really say why we're doing it in this specific way.

However, behind the scenes we're working hard on a successor to openxlsx and this problem does not exist in the new code. Therefore my time and interest to solve this one here is currently a bit limited. After all it's not affecting many people.

@deschen1
Copy link
Contributor

deschen1 commented Jul 6, 2022

Just tried to read in the bad.xlsx file posted in the initial post with the potential fix in the development branch. Unfortunately, trying to open the file crashs/terminates R.

sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8    LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                    LC_TIME=German_Germany.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] openxlsx_4.2.5.1 testthat_3.1.4  

loaded via a namespace (and not attached):
 [1] zip_2.2.0         Rcpp_1.0.8.3      compiler_4.2.1    pillar_1.7.0      prettyunits_1.1.1 remotes_2.4.2     tools_4.2.1       digest_0.6.29     pkgbuild_1.3.1    pkgload_1.3.0     memoise_2.0.1     lifecycle_1.0.1   tibble_3.1.7      pkgconfig_2.0.3   rlang_1.0.3       cli_3.3.0        
[17] rstudioapi_0.13   commonmark_1.8.0  xfun_0.31         fastmap_1.1.0     xml2_1.3.3        knitr_1.39        stringr_1.4.0     roxygen2_7.2.0    withr_2.5.0       desc_1.4.1        fs_1.5.2          vctrs_0.4.1       devtools_2.4.3    rprojroot_2.0.3   glue_1.6.2        R6_2.5.1         
[33] processx_3.6.1    fansi_1.0.3       sessioninfo_1.2.2 callr_3.7.0       purrr_0.3.4       magrittr_2.0.3    ps_1.7.1          codetools_0.2-18  ellipsis_0.3.2    usethis_2.1.6     utf8_1.2.2        stringi_1.7.6     cachem_1.0.6      crayon_1.5.1      brio_1.1.3    

@cha-petersumm
Copy link

I have the same issue, with a file that was simply written by openxlsx and then read back again (so Excel has never been near it).

write.xlsx(CVAD_list, "CVAD list.xlsx")
CVAD_list2 <- read.xlsx("CVAD list.xlsx")
CVAD_list3 <- read.xlsx("CVAD list.xlsx", detectDates = T)

Error message from the third line above is as follows:

Error reading date:
44696.5
row: 59
col: 6
Error: basic_string::substr: __pos (which is 8) > this->size() (which is 7)

There's nothing obviously odd about cell F59.

I'm happy to create an example that I can post if it would be useful.

packageVersion("openxlsx")
[1] ‘4.2.5.2’

Copy link

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 30, 2024
@CarolusKwok
Copy link

This is not stale, I got the same error just now. Is there a fix? thanks :D

@github-actions github-actions bot removed the Stale label Apr 6, 2024
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jan 4, 2025
# openxlsx 4.2.7.1

* It's now possible to insert a hyperlinked image by passing a URL,
  relative or absolute file path, or mailto string to the new
  `address` parameter of `insertImage()`.

# openxlsx 4.2.7

* Fixed warning on `dataValidation(..., type = "list")`
  ([#342](ycphs/openxlsx#342))

* Added optional argument to `loadWorkbook` to decide if empty/blank
  cells should be converted to NA_character_ (the default) or left
  blank as is

* `saveWorkbook()` now succeeds when called after the user has set
  column widths for a range of columns (e.g. 1:2), saved the workbook,
  then set column widths for a new range that is inclusive of the
  previous one (e.g. 1:5)
  ([#493](ycphs/openxlsx#493)).

## Improvements

* Improve detectDates
  ([#288](ycphs/openxlsx#288))

* Preserve window size and position, also `getWindowSize()` and
  `setWindowSize()`
  ([466](ycphs/openxlsx#466))

# openxlsx 4.2.6

* Fix external links
  ([#410](ycphs/openxlsx#410))

* Do not add unneccessary sheetPr node
  ([#409](ycphs/openxlsx#409))

* Add support for `namedRegion`s having dots and other special
  characters ([#338](ycphs/openxlsx#338)).

* Add type blanks and not blanks to conditional formatting
  ([#311](ycphs/openxlsx#311))

# openxlsx 4.2.5

## Fixes

* `openxlsx_setOp()` now works with named list
  ([#215](ycphs/openxlsx#215))

* `loadWorkbook()` imports `inlineStr`. Values remain `inlineStr` when
  writing the workbook with `saveWorkbook()`. Similar `read.xlsx` and
  `readWorkbook` import `inlineStr`.

* `read.xlsx()` no longer changes random seed
  ([#183](ycphs/openxlsx#183))

* fixed a regression that caused fonts to be read in incorrectly
  ([#207](ycphs/openxlsx#207))

* add option to save as read only recommended
  ([#201](ycphs/openxlsx#201))

* fixed writing hyperlink formulas
  ([#200](ycphs/openxlsx#200))

* `write.xlsx()` now throws an error if it doesn't have write
  permissions ([#190](ycphs/openxlsx#190))

* `write.xlsx()` now again uses the default of `overwrite = TRUE` for
  saving files ([#249](ycphs/openxlsx#249))

* `as.character.formula()` exported to warn about potential conflicts
  with other packages
  ([#312](ycphs/openxlsx#312),
  [#315](ycphs/openxlsx#315))

## Improvements

* `options()` are more consistently set in functions (see:
  [#289](ycphs/openxlsx#262))

* `Workbook$show()` no longer fails when called in a 0 sheet
  workbook([#240](ycphs/openxlsx#240))

* `read.xlsx()` again accepts `.xlsm` files
([#205](ycphs/openxlsx#205),
[#209](ycphs/openxlsx#209))

* `makeHyperlinkString()` does no longer require a sheet argument
  ([#57](ycphs/openxlsx#57),
  [#58](ycphs/openxlsx#58))

* improvements in how `openxlsx` creates temporary directories (see
  [#262](ycphs/openxlsx#262))

* `writeData()` calls `force(x)` to evaluate the object before options
  are set ([#264](ycphs/openxlsx#264))

* `createComment()` now correctly handles `integers` in `width` and
  `height` ([#275](ycphs/openxlsx#275))

* `setStyles()` accepts `halign="justify"`
  ([#305](ycphs/openxlsx#305))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants