Some images are treated as unsupported file types in thumbnails, despite being supported file types #4852
Labels
🗄️ aspect: data
Concerns the data in our catalog and/or databases
✨ goal: improvement
Improvement to an existing user-facing feature
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: api
Related to the Django API
Description
When an image extension is not known (e.g., it is not in `filetype` and cannot be pulled from the URL extension or the `HEAD` `content-type`), we assume it is unsupported and do not try to make a thumbnail request.

This ends up primarily affecting only a subset of providers whose services function in such a way that we cannot know the file type ahead of the upstream thumbnail request. The example I've found is Smithsonian:
https://api.openverse.org/v1/images/ebdbe147-bceb-4c84-9736-d9d06a37a6a9/
The `url` has no extension, the record has no `filetype`, and the HEAD response has no content-type header.

To fix this, rather than assume the unknown file type is unsupported, it would be great if we could still try sending the request upstream to Site Accel.
In the case of a successful response from Site Accel, we can cache that fact in Redis and bypass the extension check for that media in the future.
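A minimal sketch of the proposed behaviour, not the actual API code. All names here (`should_request_thumbnail`, `record_success`, `SUPPORTED_EXTENSIONS`, the `known_good` mapping standing in for Redis) are hypothetical, and the list of supported extensions is an assumption made for illustration:

```python
import os
from typing import MutableMapping, Optional
from urllib.parse import urlparse

# Assumed set of supported types, for illustration only.
SUPPORTED_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "webp"}


def extension_from_url(url: str) -> Optional[str]:
    """Return the lowercased extension from the URL path, or None."""
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return ext or None


def should_request_thumbnail(
    media_id: str,
    url: str,
    filetype: Optional[str],
    head_content_type: Optional[str],
    known_good: MutableMapping[str, str],
) -> bool:
    """Decide whether to send the thumbnail request upstream.

    `known_good` is a stand-in for a Redis-backed cache of media IDs
    that have already produced a successful upstream response.
    """
    # A previous success bypasses the extension check entirely.
    if media_id in known_good:
        return True

    ext = filetype or extension_from_url(url)
    if ext is None and head_content_type:
        # e.g. "image/jpeg; charset=binary" -> "jpeg"
        ext = head_content_type.split(";")[0].strip().split("/")[-1]

    if ext is None:
        # Unknown type: try upstream anyway instead of assuming unsupported.
        return True
    return ext.lower() in SUPPORTED_EXTENSIONS


def record_success(media_id: str, known_good: MutableMapping[str, str]) -> None:
    """Remember that the upstream thumbnail request succeeded."""
    known_good[media_id] = "1"
```

With a real Redis client the `known_good` lookups would become `get`/`set` calls, probably with an expiry so that stale successes eventually get re-checked.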
I'm not 100% sure what content-type header Site Accel returns. If it accurately reflects the media type of the upstream image, we should cache it in such a way that we can ETL it back into the catalogue data for that work, as with #3585.

Site Accelerator claims that it will return webp for clients that support it, but I sent `Accept: */*` in my httpie request and got a jpeg. When I try it in the browser, even with compression and resizing enabled, I still get a jpeg back. The upstream image is a jpeg, but I don't know if that's the reason Site Accel returns a jpeg or if it would convert a PNG to a jpeg. I've looked at the Site Accelerator image processing code (formerly known as Photon), and I don't really see what would cause it to return a different file type than the upstream image.

It would be worth reaching out to the Jetpack folks to see if they can clarify this for us. If we can reliably retrieve the file type after the request for works for which we don't have that information, it would be great to store it and eventually ETL it back into the catalogue!
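If the response content-type does turn out to be trustworthy, deriving a storable file type from it could look roughly like the sketch below. The function name and the mapping are hypothetical, and whether Site Accel's header actually reflects the upstream image is exactly the open question above:

```python
from typing import Optional

# Assumed mapping from media type to the stored file type value.
CONTENT_TYPE_TO_FILETYPE = {
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/gif": "gif",
    "image/webp": "webp",
}


def filetype_from_content_type(header: Optional[str]) -> Optional[str]:
    """Map a Content-Type header value to a file type, or None if unknown."""
    if not header:
        return None
    # Drop parameters such as "; charset=binary" and normalise case.
    media_type = header.split(";", 1)[0].strip().lower()
    return CONTENT_TYPE_TO_FILETYPE.get(media_type)
```

Returning `None` for anything unrecognised keeps unverified values out of whatever store later feeds the ETL step.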
We might also check and see whether Smithsonian in general has this issue, and implement special handling for them instead of needing to check at all.
Additional context
Provider-specific special handling for thumbnail requests has precedent in #4736.