-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse _rels/.rels to get paths to package parts #437
Conversation
src/XlsxWorkBook.h
Outdated
class SheetRelations { | ||
// holds objects related to package parts, such file paths, but also | ||
// worksheet position, name, and id | ||
class PackageRelations { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New name reflects that we now store paths for non-(work)sheet package parts.
int n_; | ||
Rcpp::CharacterVector names_; | ||
Rcpp::CharacterVector id_; | ||
std::map<std::string, std::string> target_; | ||
|
||
// populate workbook element of part_ | ||
void parse_package_rels(const std::string& path) { | ||
std::string rels_xml_file = zip_buffer(path, "_rels/.rels"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readxl previously ignored this file and basically hard-wired the most prevalent structure for a SpreadsheetML package. The LAPD sheets demonstrate this is unsafe and it is more correct to parse this file.
Rcpp::stop("Spreadsheet has no package-level relationships"); | ||
} | ||
|
||
std::map<std::string, std::string> scratch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm parsing the whole thing because it's easy and one day we might care about the other files covered here. But I'm parsing into a scratch
map, because for now I only care about worksheet part. That is identified and stored properly in part_
right after this loop.
src/XlsxWorkBook.h
Outdated
const std::string workbook_dir = dirName(workbook_path); | ||
std::string rels_xml_path = | ||
workbook_dir + "/_rels/" + baseName(workbook_path) + ".rels"; | ||
rels_xml_path = deSlash(rels_xml_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LAPD sheets teach us that an xl/
subdirectory is not necessarily used, i.e. workbook_dir
can be ""
. So we must construct this path.
if (it == part_.end()) { | ||
// e.g., sharedStrings may not be present | ||
// downstream functions should handle non-existent path | ||
return ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better sentinel value? I don't want to stop()
if the key is not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the empty string is fine, you could theoretically use something like \0
, but I think it is more trouble than it is worth.
src/utils.h
Outdated
size_t start = s.find_first_not_of("/"); | ||
return s.substr(start); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh to have access to path handling functions. Do these look ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think calling deSlash
something like removeLeadingSlash
would make it more clear what it does, I was not sure if deSlash
was supposed to remove the leading slashes, trailing slashes or all slashes.
Also it does not handle the case where the input consists of only slashes, e.g. what happens if find_first_not_of
returns std::string::npos
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK yes. I've now addressed the case of input that consists entirely of slashes. Also renamed to removeLeadingSlashes()
for complete truth in advertising.
src/XlsxWorkBook.h
Outdated
// worksheet position, name, and id | ||
class PackageRelations { | ||
// non-worksheet package parts | ||
std::map<std::string, std::string> part_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A new private map to hold paths of non-worksheet package parts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I made a few small comments
src/utils.h
Outdated
size_t start = s.find_first_not_of("/"); | ||
return s.substr(start); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think calling deSlash
something like removeLeadingSlash
would make it more clear what it does, I was not sure if deSlash
was supposed to remove the leading slashes, trailing slashes or all slashes.
Also it does not handle the case where the input consists of only slashes, e.g. what happens if find_first_not_of
returns std::string::npos
.
if (found == std::string::npos) { | ||
return path; | ||
} | ||
return path.substr(found + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about it path
== /
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be OK because found + 1
is equal to the length of path
and we get the empty string back.
From std::string::substr:
pos If this is equal to the string length, the function returns an empty string.
I also doublechecked in a little example.
if (it == part_.end()) { | ||
// e.g., sharedStrings may not be present | ||
// downstream functions should handle non-existent path | ||
return ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the empty string is fine, you could theoretically use something like \0
, but I think it is more trouble than it is worth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me. You might want to have @jimhester take a look too since my C++ is rusty right now.
Fixes #435 Be able to read sheets produced by the Los Angeles Police Department
The xlsx produced by LAPD is a very literal implementation of the example on pp65-66 of 5th edition of ECMA:
I used this as motivation to make a major upgrade to how readxl parses the package structure. Basically, readxl previously did not parse it but, rather, assumed a very common default. And the LAPD sheets show that other structures are possible (Excel can open the LAPD files, btw). Now we can read LAPD sheets, while still being able to read all the other weird 3rd party sheets in our test collection 🤞
The main feature of the LAPD sheets is that they do not use an
xl/
subdirectory, whereas every other xlsx we've seen before does. Browsable examples:xl/
subdirectory: excelgesis/los-angeles-arrests-xlsxxl/
subdirectory.xl/
but insists on giving paths in/xl/foo/bar
form in all the relationship files, which is peculiar (very different from embedded-chartsheet example). It's also completely in Cyrillic, which is extra fun.