Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow users to override content type for aux files #8241 #8282

Merged
merged 4 commits into from
Dec 7, 2021
Merged

Conversation

pdurbin
Copy link
Member

@pdurbin pdurbin commented Dec 6, 2021

What this PR does / why we need it:

Some auxiliary were being saved with the wrong content type (MIME type) but now the user can supply the content type on upload, overriding the type that would otherwise be assigned.

Which issue(s) this PR closes:

Closes #8241

Special notes for your reviewer:

Markdown files can now be given a better content type of text/markdown but on download, no file extension is added. This is because Tika's getDefaultMimeTypes doesn't include "text/markdown". I left a TODO about how perhaps we could look up more esoteric content types in the MimeTypeDetectionByFileExtension.properties file we maintain (.geojson was added recently, for example).

Suggestions on how to test this:

Run the curl commands. You should be able to set the content type to whatever you want.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No.

Is there a release notes update needed for this change?:

Included but feel free to leave it out.

Additional documentation:

Included.

@@ -577,6 +577,7 @@ private boolean isAuxiliaryObjectCached(StorageIO storageIO, String auxiliaryTag
}
}

// TODO: Return ".md" for "text/markdown" as well as other extensions in MimeTypeDetectionByFileExtension.properties
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: We have FileUtil.generateOriginalExtension for tabular files but it uses a hardcoded list rather than the properties file.

Copy link
Member

@qqmyers qqmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

export FILE_ID='12345'
export FORMAT_TAG='dpJson'
export FORMAT_VERSION='v1'
export TYPE='DP'
export SERVER_URL=https://demo.dataverse.org

curl -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@$FILENAME" -F 'origin=myApp' -F 'isPublic=true' -F "type=$TYPE" "$SERVER_URL/api/access/datafile/$FILE_ID/auxiliary/$FORMAT_TAG/$FORMAT_VERSION"
curl -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@$FILENAME;type=$FILETYPE" -F 'origin=myApp' -F 'isPublic=true' -F "type=$TYPE" "$SERVER_URL/api/access/datafile/$FILE_ID/auxiliary/$FORMAT_TAG/$FORMAT_VERSION"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very mildly rubbed the wrong way by the fact that there are now 2 instances of type= in the command line above... Would have been better if we had named that type parameter something more specific (auxtype? dvtype? - idk, something indicating that this is our, proprietary type...) to differentiate it from the standard mime type... But I'm not going to propose changing it now.

This API is for software clients only; nobody should ever be using it on the command line except an occasional developer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thought: two cases of "type" is weird and unfortunate. Perhaps we'll do a final cleanup of this API before we move its docs from the dev guide to the API guide and right some of these wrongs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether this specific parameter is worth changing or not, I really like the fact that we have a note in the doc saying that this API is experimental, so we reserve the right to keep changing it without guaranteeing backward compatibility, etc.

@@ -97,8 +99,12 @@ public AuxiliaryFile processAuxiliaryFile(InputStream fileInputStream, DataFile
storageIO.saveInputStreamAsAux(di, auxExtension);
auxFile.setChecksum(FileUtil.checksumDigestToString(di.getMessageDigest().digest()));

Tika tika = new Tika();
auxFile.setContentType(tika.detect(storageIO.getAuxFileAsInputStream(auxExtension)));
if (mediaType != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question - is mediaType (formDataBodyPart.getMediaType();) definitely going to be null here, if not supplied by the client? - As opposed to defaulting to something like application/octet-stream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. You're right. When I leave off ;type=WHATEVER (the old way), mediaType is non-null and its toString method prints application/octet-stream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it should just be if (mediaType != null && (!mediaType.equals("application/octet-stream"))) here. We are doing something similar throughout the "normal" datafile upload process.

Otherwise everything looks good in the PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't make any code changes but I tweaked the tests so I could make assertions on what happens if you leave off ;type=WHATEVER: 787dbed

I'm happy to make changes, of course. Right now we're saying the user can change the content type to whatever we want. We could change is such that if they supply application/octet-stream as a content type we'll run it though Tika and save whatever Tika says it is. That is, we could do something like this:

        if (mediaType != null && !mediaType.toString().equals("application/octet-stream")) {
            auxFile.setContentType(mediaType.toString());
        } else {
            Tika tika = new Tika();
            auxFile.setContentType(tika.detect(storageIO.getAuxFileAsInputStream(auxExtension)));
        }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I was saying above, how that "if" line should look like. Nobody would ever want to supply "application/octet-stream" as the type. That type just means "type unknown" or "type not specified".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(otherwise Tika will never be called... and we want to leave this option, of not supplying the type and letting the application identify it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry, I didn't see that comment until I had written mine. I just pushed the change in 985b552. Please take a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it looks good now - but I'd like @scolapasta and @qqmyers to chime in, on whether we really want to be doing this instead, and not in addition to saving the file name, as uploaded.

"application/octet-stream" is the default when the user doesn't supply a
content type. So if it's this, send it through Tika. Yes, a user
can supply "application/octet-stream" and this will also be sent through
Tika.
Tika tika = new Tika();
auxFile.setContentType(tika.detect(storageIO.getAuxFileAsInputStream(auxExtension)));
// The null check prevents an NPE but we expect mediaType to be non-null
// and to default to "application/octet-stream".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I believe the null check is necessary, since the mediaType can still be null, if an input stream is used for the file body.

Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, the code looks good, and everybody appears to be in agreement, that this is how we want to address both of the underlying issues, the mime type and the file name.
If somehow some picky user requests being able to have non-standard extensions, and/or custom download filenames for these, we'll just addres it separately.

@kcondon kcondon self-assigned this Dec 7, 2021
@kcondon
Copy link
Contributor

kcondon commented Dec 7, 2021

@pdurbin Would you update this pr from develop? I'm getting a flyway mismatch error:
Caused by: org.flywaydb.core.api.FlywayException: Validate failed: Migration checksum mismatch for
migration version 5.8.0.2
-> Applied to database : 1609268027
-> Resolved locally : 343737208

@djbrooke
Copy link
Contributor

djbrooke commented Dec 7, 2021

Hi @kcondon - I just updated from develop.

@pdurbin apologies if I messed something up, but I also trust your git skills to revert in the morning if needed :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

auxiliary file download: JSON file extension/type is lost
5 participants