Magic for xlsx #36
Merged
4 commits
81b7791 add xlsx documents with irregular structure as fixture (fursich)
1a82451 seek wider areas for [Content_Types].xml to magicmatch ms xlsx (fursich)
1409843 add office documents that have irregular structure as fixture (fursich)
4379fe3 seek wider areas for [Content_Types].xml to magicmatch office documents (fursich)
@@ -23,7 +23,7 @@
 <magic priority="50">
   <match value="PK\003\004" type="string" offset="0">
-    <match value="[Content_Types].xml" type="string" offset="30">
+    <match value="[Content_Types].xml" type="string" offset="30:65536">
       <match value="word/" type="string" offset="0:4096" />
     </match>
@@ -39,7 +39,7 @@
 <magic priority="50">
   <match value="PK\003\004" type="string" offset="0">
-    <match value="[Content_Types].xml" type="string" offset="30">
+    <match value="[Content_Types].xml" type="string" offset="30:65536">
       <match value="ppt/" type="string" offset="0:4096" />
     </match>
@@ -55,7 +55,7 @@
 <magic priority="50">
   <match value="PK\003\004" type="string" offset="0">
-    <match value="[Content_Types].xml" type="string" offset="30">
+    <match value="[Content_Types].xml" type="string" offset="30:65536">
       <match value="xl/" type="string" offset="0:4096" />
     </match>

Review comment on the `offset="30:65536"` line: At least for a blank xlsx file, 4096 would be sufficient. I used a common range for all three formats, though, to make these numbers less cryptic.
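As a rough illustration of what the widened range means, the sketch below reads `offset="start:end"` as "the value may begin at any byte offset from start to end, inclusive", which is how the shared-mime-info format describes ranges. The matcher used by this project may differ in detail. The helper names `match_at_range` and `looks_like_ooxml` are hypothetical, and Python is used only for illustration.

```python
def match_at_range(data: bytes, needle: bytes, start: int, end: int) -> bool:
    """Return True if needle begins at some offset in [start, end] (inclusive)."""
    # bytes.find(sub, lo, hi) only reports matches that lie entirely inside
    # data[lo:hi], so the upper bound is extended by the needle length to
    # allow a match that *starts* at `end`.
    return data.find(needle, start, end + len(needle)) != -1


def looks_like_ooxml(data: bytes, part_prefix: bytes) -> bool:
    """Mimic the nested <match> rules from the diff: every level must match."""
    return (
        match_at_range(data, b"PK\x03\x04", 0, 0)                     # offset="0"
        and match_at_range(data, b"[Content_Types].xml", 30, 65536)   # offset="30:65536"
        and match_at_range(data, part_prefix, 0, 4096)                # offset="0:4096"
    )
```

With the old rule (`offset="30"`, i.e. `match_at_range(data, b"[Content_Types].xml", 30, 30)`), a file whose `[Content_Types].xml` entry is not the first one in the archive would not match; the ranged rule tolerates that as long as the name starts within the first 65536 bytes.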
Binary file added (+31.3 KB, not shown): ...penxmlformats-officedocument.presentationml.presentation/converted_from_google_slide.pptx
Binary file added (+4.5 KB, not shown): ...on/vnd.openxmlformats-officedocument.spreadsheetml.sheet/converted_from_google_sheet.xlsx
Binary file added (+5.94 KB, not shown): ...nd.openxmlformats-officedocument.wordprocessingml.document/converted_from_google_doc.docx
The component parts of Google-generated office documents seem to be stored in reverse order. Consequently, the larger the file gets, the later the magic strings appear in the file. That means we have to search further into the file, and some trade-off will be needed at some point.
pptx naturally has larger components (even though the test fixture is nothing more than a blank slide), so the range has to be at least 30:28168 to find the string. That will not be enough for a pptx with even a little content, so I tentatively set 65536 bytes as the maximum search range.
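To make the effect of entry order concrete, here is a small sketch that builds two in-memory ZIP archives with Python's standard `zipfile` module and reports where the `[Content_Types].xml` name first appears. The part name `ppt/slides/slide1.xml` and the 28000-byte payload are made-up stand-ins, not taken from the fixtures in this PR.

```python
import io
import zipfile


def build_zip(entries):
    """Write (name, payload) pairs to an uncompressed in-memory ZIP, in order."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name, payload in entries:
            zf.writestr(name, payload)
    return buf.getvalue()


NEEDLE = b"[Content_Types].xml"
content_types = ("[Content_Types].xml", b"<Types/>")
big_part = ("ppt/slides/slide1.xml", b"x" * 28000)  # hypothetical large part

# [Content_Types].xml written first: its name directly follows the 30-byte
# local file header at the start of the archive.
first_offset = build_zip([content_types, big_part]).find(NEEDLE)

# [Content_Types].xml written last: its name only appears after every
# earlier entry's header and data.
last_offset = build_zip([big_part, content_types]).find(NEEDLE)

print(first_offset)  # 30
print(last_offset)   # somewhere past 28000
```

The second offset grows with the size of whatever is stored ahead of `[Content_Types].xml`, which is why a fixed search range can only ever cover files up to a certain size.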
It looks like we have two choices here:
1. Align with mimemagic, which searches for the string only up to ~5000 bytes. In this case we cannot add the Google-generated pptx fixture, because detection would fail for it.
2. Use a bigger range than mimemagic does, so that we can identify at least some of these office files.
I'm assuming the latter makes more sense, but please feel free to correct me if I'm wrong.
In any case, we may have to rely on other fallback strategies for larger Google-generated office files. The question probably comes down to "how hard do we want to try to find the embedded magic string?" Let me know what you think.
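One possible fallback, sketched here only as an assumption and not as something this PR implements, is to read the ZIP central directory instead of scanning leading bytes. The entry names are then available regardless of where each entry sits in the file, at the cost of needing the whole file (or at least its tail) rather than a short prefix. The function name `sniff_ooxml` and the prefix table are hypothetical.

```python
import io
import zipfile

# Top-level directory prefix -> MIME type (hypothetical lookup table).
OOXML_PREFIXES = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def sniff_ooxml(data: bytes):
    """Return an OOXML MIME type based on the archive's entry names, or None."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()  # read from the central directory
    except zipfile.BadZipFile:
        return None
    if "[Content_Types].xml" not in names:
        return None
    for prefix, mime in OOXML_PREFIXES.items():
        if any(name.startswith(prefix) for name in names):
            return mime
    return None
```

This trades the cheap prefix scan for a full parse, so it would probably only make sense after the magic rules have failed to identify a file that starts with `PK\003\004`.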