Magic for xlsx #36

fursich · 2021-03-29T19:16:06Z

This PR demonstrates the reproduction case I described in #35.

The attached xlsx is a simple blank sheet generated by Google Sheets (by exporting the original spreadsheet to xlsx).

fursich · 2021-03-29T19:24:50Z

data/custom.xml

@@ -55,7 +55,7 @@

    <magic priority="50">
      <match value="PK\003\004" type="string" offset="0">
-        <match value="[Content_Types].xml" type="string" offset="30">
+        <match value="[Content_Types].xml" type="string" offset="30:4096">


This would allow marcel to search a wider range for [Content_Types].xml (I have to admit the range is a heuristic, though).

I'm guessing the fixed offset was chosen intentionally to keep the operation cheap, but let me know what you think.
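To illustrate what the change from `offset="30"` to `offset="30:4096"` means, here is a minimal Python sketch. It is not marcel's code; it assumes the usual shared-mime-info reading of a range offset, namely that the value may begin at any byte position from the start to the end of the range. The helper names `match_at` and `looks_like_ooxml` are hypothetical.

```python
from typing import Optional

MARKER = b"[Content_Types].xml"


def match_at(data: bytes, value: bytes, start: int, end: Optional[int] = None) -> bool:
    """Return True if `value` begins at `start`, or anywhere in start..end when `end` is given."""
    if end is None:
        # Fixed offset: the value must sit exactly at `start`.
        return data[start:start + len(value)] == value
    # Range offset: read enough bytes that a value beginning at `end` still fits.
    window = data[start:end + len(value)]
    return value in window


def looks_like_ooxml(data: bytes, max_offset: Optional[int] = None) -> bool:
    """Nested match: zip signature at offset 0, then the marker at 30 (or within 30..max_offset)."""
    if not match_at(data, b"PK\x03\x04", 0):
        return False
    return match_at(data, MARKER, 30, max_offset)
```

The cost side of the trade-off is visible in `window`: a wider range means more bytes read and scanned for every zip file that reaches this rule.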

georgeclaghorn · 2021-03-29T20:21:34Z

This is fine, thanks. Could you do the same for the .docx and .pptx matches above?

fursich · 2021-03-30T01:22:30Z

data/custom.xml

@@ -39,7 +39,7 @@

    <magic priority="50">
      <match value="PK\003\004" type="string" offset="0">
-        <match value="[Content_Types].xml" type="string" offset="30">
+        <match value="[Content_Types].xml" type="string" offset="30:65536">


The component parts of Google-generated office documents seem to be stored in reverse order. Consequently, the larger the file gets, the later the magic string appears in it. That means we have to search more bytes, and some sort of trade-off is needed at a certain point.

A pptx naturally has larger components (even though the test fixture is nothing more than a blank slide), so the range has to be at least 30:28168 to find the string. That will not be enough for a pptx with even a little content, so I tentatively put 65536 bytes as the maximum search range.

Looks like we have two choices here:

Align with mimemagic, which searches for the string up to ~5000 bytes. In this case we cannot add the Google-generated pptx fixture, as it would naturally fail.

Take a bigger range than mimemagic does, so that marcel can identify (at least) some office files.

I'm assuming the latter might make sense, but please feel free to correct me if I'm wrong.

In any case we might have to rely on other fallback strategies for larger Google-generated office files. The question probably comes down to "how hard do we want to try to find the embedded magic string?" Let me know what you think.
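The ordering effect described in this comment can be checked on any sample file. The sketch below is only an illustration in Python, not part of marcel: it uses the standard `zipfile` module to locate the entry named `[Content_Types].xml` and reports the byte offset at which that name appears. It relies on the zip local file header having a fixed 30-byte part before the file name, which is why the name sits at offset 30 when the entry comes first. The function name `content_types_name_offset` is hypothetical.

```python
import io
import zipfile
from typing import Optional

LOCAL_HEADER_FIXED_SIZE = 30  # bytes in a zip local file header before the file name


def content_types_name_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the "[Content_Types].xml" name in its local header, or None."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.filename == "[Content_Types].xml":
                # header_offset is where this entry's local file header starts.
                return info.header_offset + LOCAL_HEADER_FIXED_SIZE
    return None
```

Running this over a fixture (for example `content_types_name_offset(open(path, "rb").read())`, with `path` pointing at a sample document) gives the smallest end of range that a rule of the form `offset="30:N"` would need for that file.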

fursich · 2021-03-30T04:25:29Z

data/custom.xml

@@ -55,7 +55,7 @@

    <magic priority="50">
      <match value="PK\003\004" type="string" offset="0">
-        <match value="[Content_Types].xml" type="string" offset="30">
+        <match value="[Content_Types].xml" type="string" offset="30:65536">


For a blank xlsx file, at least, 4096 would be sufficient. I used a common range for the three formats, though, to make these numbers less cryptic.

fursich added 2 commits March 30, 2021 04:09

add xlsx documents with irregular structure as fixture

81b7791

seek wider areas for [Content_Types].xml to magicmatch ms xlsx

1a82451

fursich added 2 commits March 30, 2021 09:28

add office documents that have irregular structure as fixture

1409843

seek wider areas for [Content_Types].xml to magicmatch office documents

4379fe3

ghiculescu mentioned this pull request Mar 30, 2021

Replace mimemagic with marcel kreeti/kt-paperclip#54

Merged

georgeclaghorn merged commit 0e494f6 into rails:main Mar 30, 2021
