[tumblr] Investigate possible image URL alternative / Extraction fallback for _original_image #64

Closed
Hrxn opened this issue Jan 7, 2018 · 8 comments

@Hrxn
Contributor

Hrxn commented Jan 7, 2018

def _original_image(url):
    if url.endswith(".gif") and "_inline_" in url:
        return url
    return re.sub(
        (r"https?://\d+\.media\.tumblr\.com"
         r"/([0-9a-f]+)/tumblr_([^/?&#.]+)_\d+\.([0-9a-z]+)"),
        r"http://data.tumblr.com/\1/tumblr_\2_raw.\3", url
    )

Apparently, there is another way to do this. Example:

Here, taken directly from Tumblr (Browser > Right Click > View/Open):
https://68.media.tumblr.com/52828dee073b4ea2e123121f27efb35f/tumblr_p21rciUR931twyphzo1_1280.jpg

What _original_image is doing currently:
http://data.tumblr.com/52828dee073b4ea2e123121f27efb35f/tumblr_p21rciUR931twyphzo1_raw.jpg

What seems to be working as well:
https://s3.amazonaws.com/data.tumblr.com/52828dee073b4ea2e123121f27efb35f/tumblr_p21rciUR931twyphzo1_raw.jpg

The same image. And this seems to work for all URLs I've tested so far.
Has the potential benefit of using HTTPS. Not sure if that is really necessary, but that's another topic. I don't know about potential downsides, I did not notice anything so far. Maybe it's a bit slower. Hence investigate 😄
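
To make the comparison concrete, here is a minimal sketch of the same rewrite with the target root made configurable. The regular expression is the one from the snippet above; the names `MEDIA_URL`, `raw_url` and the `root` parameter are hypothetical and only serve this illustration.

```python
import re

# Pattern copied from the _original_image snippet quoted above.
MEDIA_URL = re.compile(
    r"https?://\d+\.media\.tumblr\.com"
    r"/([0-9a-f]+)/tumblr_([^/?&#.]+)_\d+\.([0-9a-z]+)"
)


def raw_url(url, root="https://s3.amazonaws.com/data.tumblr.com"):
    """Rewrite a media.tumblr.com URL to its _raw counterpart under `root`.

    `root` is a hypothetical parameter; pass "http://data.tumblr.com"
    to get the URL that _original_image currently produces.
    """
    # Inline GIFs are returned unchanged, as in the original function.
    if url.endswith(".gif") and "_inline_" in url:
        return url
    return MEDIA_URL.sub(root + r"/\1/tumblr_\2_raw.\3", url)


api_url = ("https://68.media.tumblr.com/52828dee073b4ea2e123121f27efb35f"
           "/tumblr_p21rciUR931twyphzo1_1280.jpg")
print(raw_url(api_url))
print(raw_url(api_url, root="http://data.tumblr.com"))
```

The two printed URLs correspond to the HTTPS and HTTP variants listed above.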

I've found this by searching around on Greasyfork, and so far each of these
https://greasyfork.org/en/scripts/31873-use-tumblr-raw-image
https://greasyfork.org/en/scripts/9014-tumblr-image-size
https://greasyfork.org/en/scripts/31593-tumblr-images-to-hd-redirector
use s3.amazonaws.com/data.tumblr.com, so maybe it's fine to use.

The first script changes the src of img elements on the page; the other two only run if an image on Tumblr is already open in the current tab. Those two also check for availability by testing the response code, which might be a good idea for gallery-dl as well. I can't recall a raw link ever failing, but for older images it might be a possibility. (All scripts are rather short and can be viewed directly in the "Code" tab, in case you're not familiar with Greasyfork.)

@mikf
Owner

mikf commented Jan 7, 2018

data.tumblr.com and s3.amazonaws.com are aliases of the same underlying domain:

$ host data.tumblr.com
data.tumblr.com is an alias for s3-1.amazonaws.com.
s3-1.amazonaws.com has address 52.216.21.13

$ host s3.amazonaws.com
s3.amazonaws.com is an alias for s3-1.amazonaws.com.
s3-1.amazonaws.com has address 54.231.40.130

Using https://s3.amazonaws.com/data.tumblr.com/… should therefore work in exactly the same way and even allow for HTTPS to be usable … nice.

Sending a HEAD request to check for availability for every image seems a bit much and would make things a bit slower, but this might be an option for inline GIFs. I've checked with some older images from 2011-2014 and they all seem to work just fine.

@Hrxn
Contributor Author

Hrxn commented Jan 9, 2018

The only difference in the browser is that s3.amazonaws.com is lacking any favicons. 😄

But HTTPS is definitely preferable. At least that is my opinion here.

I expressed myself badly regarding the availability check in the userscripts; I apologize for the misunderstanding. Sending additional requests would obviously slow things down, but that's not what I meant. A separate check would be pointless anyway, because we're already making the request for the download, aren't we?
What I was actually trying to say is that checking the response status would be good, so that in case of an error gallery-dl could use a fallback, like the URL originally returned by the API.
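
The fallback idea described here could look roughly like the following sketch. It is not gallery-dl's code: `first_working` and the injected `fetch` callable are hypothetical names, and the only assumption is that `fetch` raises `OSError` (or a subclass) when a download fails.

```python
def first_working(urls, fetch):
    """Return (url, data) for the first URL that `fetch` can retrieve.

    `fetch` is any callable that returns the response body and raises
    OSError on failure; the download itself doubles as the check, so no
    extra request is sent.
    """
    last_error = None
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc
    raise OSError("all candidate URLs failed") from last_error


def fake_fetch(url):
    """Stand-in for a real download: pretend every _raw URL is rejected."""
    if "_raw" in url:
        raise OSError("HTTP 403")
    return b"image bytes"


print(first_working(["img_raw.jpg", "img_1280.jpg"], fake_fetch))
```

With the standard library, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`, since `urllib.error.HTTPError` is a subclass of `OSError`.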

@Hrxn
Contributor Author

Hrxn commented Jan 13, 2018

Oh my, almost forgot this one...
I think I've found some URLs where the _raw version does not work.

Consider this:

https://s3.amazonaws.com/data.tumblr.com/ee54205c6c3f9fdf0cf9ef6537deb3b6/tumblr_mesng84nLy1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/acfb3bf747b1d5b06baadb4beee43231/tumblr_mesk01ea0v1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/266663bf199f30c5eabaa78239eb6a46/tumblr_mes4k1Ch931rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/a96d0cac5a3a7589dd3e8c0ee715ab61/tumblr_mes35is1wL1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/018d4128730bf2205e6dbd0ac642b4a9/tumblr_mes34oLUw01rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/715d296f7963094575324e03ae5e8a5b/tumblr_mer2coa6Zk1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/27703cc82200cb482fe51cf5a12b1e27/tumblr_mer0vlNxNy1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/a66fe310e823f63a6f610728a923ceaf/tumblr_mer0v4WYza1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/660f51ff1f2ee2b02534beb60d36bf4e/tumblr_mer0ueeukX1rnfejco1_raw.jpg

versus this:

https://68.media.tumblr.com/ee54205c6c3f9fdf0cf9ef6537deb3b6/tumblr_mesng84nLy1rnfejco1_1280.jpg
https://68.media.tumblr.com/acfb3bf747b1d5b06baadb4beee43231/tumblr_mesk01ea0v1rnfejco1_1280.jpg
https://68.media.tumblr.com/266663bf199f30c5eabaa78239eb6a46/tumblr_mes4k1Ch931rnfejco1_1280.jpg
https://68.media.tumblr.com/a96d0cac5a3a7589dd3e8c0ee715ab61/tumblr_mes35is1wL1rnfejco1_1280.jpg
https://68.media.tumblr.com/018d4128730bf2205e6dbd0ac642b4a9/tumblr_mes34oLUw01rnfejco1_1280.jpg
https://68.media.tumblr.com/715d296f7963094575324e03ae5e8a5b/tumblr_mer2coa6Zk1rnfejco1_1280.jpg
https://68.media.tumblr.com/27703cc82200cb482fe51cf5a12b1e27/tumblr_mer0vlNxNy1rnfejco1_1280.jpg
https://68.media.tumblr.com/a66fe310e823f63a6f610728a923ceaf/tumblr_mer0v4WYza1rnfejco1_1280.jpg
https://68.media.tumblr.com/660f51ff1f2ee2b02534beb60d36bf4e/tumblr_mer0ueeukX1rnfejco1_1280.jpg

Using s3.amazonaws.com does not make any difference, as expected. That's the good news.
I picked the 68. part myself at random, but other numbers should also work.
Apart from that, these are the links as returned by the API.

@Hrxn Hrxn changed the title [tumblr] Investigate possible image URL alternative [tumblr] Investigate possible image URL alternative / Extraction fallback for _original_image Jan 15, 2018
mikf added a commit that referenced this issue Jan 19, 2018
Each image now produces 3 URLs:
- amazonaws.com _raw (or _1280 for older images)
- amazonaws.com _500
- media.tumblr.com (URL returned by API)
@Hrxn
Contributor Author

Hrxn commented Jan 22, 2018

@mikf
So that is the reason for the Urllist type, good to know.

Just to make sure I understand correctly: the 9fccd7b commit message says

Each image now produces 3 URLs:

  • amazonaws.com _raw (or _1280 for older images)
  • amazonaws.com _500
  • media.tumblr.com (URL returned by API)

Which is basically this, right?

return (
    "".join((root, path, "_raw." if key else "_1280.", ext)),
    "".join((root, path, "_500.", ext)),
    url,
)

So return gives us three URLs, which are then handled by the specific gallery-dl functions, in job.py, right?

BTW, that inline x if y else z seems somehow unusual to me. Some Python specific syntactic trick? Reminds me of the classic ternary operator, but it's still called if.. Yeah, Python seems a bit strange to me, to be honest.

But okay, back to topic, gallery-dl tries the first of the listed URLs, and if the server does not respond with some error, we're basically done here. Otherwise, try URL 2, and if necessary, URL 3, right?
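
For illustration, the three-URL list from the snippet above can be reproduced as a self-contained sketch. The `return` expression follows the quoted code; everything around it (`ROOT`, `PATTERN`, `image_candidates`, and the way `key` and `path` are derived) is an assumption made for this example, not the actual implementation in tumblr.py.

```python
import re

ROOT = "https://s3.amazonaws.com/data.tumblr.com"

# Hypothetical pattern: `key` is assumed to be the hex directory
# component, which is optional so that URLs without it still match.
PATTERN = re.compile(
    r"https?://\d+\.media\.tumblr\.com"
    r"(?:/([0-9a-f]+))?/(tumblr_[^/?&#.]+)_\d+\.([0-9a-z]+)$"
)


def image_candidates(url):
    """Return the URLs to try for one image, most preferred first."""
    match = PATTERN.match(url)
    if not match:
        return (url,)
    key, name, ext = match.groups()
    path = "/{}/{}".format(key, name) if key else "/" + name
    return (
        "".join((ROOT, path, "_raw." if key else "_1280.", ext)),
        "".join((ROOT, path, "_500.", ext)),
        url,
    )


for candidate in image_candidates(
        "https://68.media.tumblr.com/ee54205c6c3f9fdf0cf9ef6537deb3b6"
        "/tumblr_mesng84nLy1rnfejco1_1280.jpg"):
    print(candidate)
```

The caller would then try the returned URLs in order and stop at the first one the server accepts.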

I think I understand the first difference in suffixes (_raw vs. _1280): older URLs lack the part that is captured as key here.
But what about the second URL in this list? I don't know if this works differently with s3.amazonaws.com-based URLs, but for "normal" Tumblr URLs (seen in the browser, returned from the API), the suffix part (in this case: _500) is actually the limit on either the width or the height of the image.
You can easily see the difference from _1280, provided the originally uploaded image is large enough. For smaller images, _1280 results in the same image as the other (smaller) suffix variants, but the URL with _1280 always works.
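
To compare the size variants of one image by hand, a small helper that swaps the numeric suffix is enough. This is only a sketch; `SIZE_SUFFIX` and `with_size` are hypothetical names, and the pattern assumes the size is the trailing `_<digits>` right before the file extension.

```python
import re

# Assumption: the size limit is the last "_<digits>" before the extension.
SIZE_SUFFIX = re.compile(r"_\d+(\.[0-9a-z]+)$")


def with_size(url, size):
    """Return `url` with its numeric size suffix replaced by `size`."""
    return SIZE_SUFFIX.sub(
        lambda m: "_{}{}".format(size, m.group(1)), url)


url = ("https://68.media.tumblr.com/52828dee073b4ea2e123121f27efb35f"
       "/tumblr_p21rciUR931twyphzo1_1280.jpg")
for size in (1280, 540, 500):
    print(with_size(url, size))
```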

In comparison with the URL returned by the API, i.e. here, I think:

photo.update(photo["original_size"])

which, not necessarily but in most cases, gives us a URL with _1280, which would clearly be preferable to the _500 "raw" URL from the second step of the URL list. That is, assuming _500 really works the same way with s3.amazonaws.com URLs as with the classic xy.media.tumblr... URLs, which is something I've never tested myself, ironically, but it would definitely be weird if they behaved differently here.
Not sure, but I hope you get what I mean 😄

@mikf
Owner

mikf commented Jan 22, 2018

The code snippet you posted is exactly what now produces 3 URLs instead of just 1 for each image, where numbers 2 and 3 are just fallbacks if the first one fails.

To explain my choices for these URLs:
Tumblr has 2-3 classes of images: regular, inline, and old-style

regular:
https://78.media.tumblr.com/0f5f0dda0ba7f4d563b8e7e9addb1b76/tumblr_ov3jatbw4s1u199yso1_1280.jpg

inline:
https://78.media.tumblr.com/94d56c599223c59f3feb71ea603484d1/tumblr_inline_ozgwizXgA71vq6t1o_540.png

old-style (things from before 2014?):
https://78.media.tumblr.com/tumblr_kzjlfiTnfe1qz4rgho1_1280.jpg
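
The three URL shapes listed above can be told apart mechanically. The following is a rough sketch based only on these three examples; `classify` is a hypothetical helper, and the rules (an `_inline_` file name, a hex directory before the file name) are assumptions drawn from the examples rather than a documented scheme.

```python
import re


def classify(url):
    """Guess the image class of a media.tumblr.com URL."""
    name = url.rsplit("/", 1)[-1]
    if name.startswith("tumblr_inline_"):
        return "inline"
    # Regular URLs carry a hex directory between host and file name.
    if re.search(r"/[0-9a-f]+/tumblr_[^/]+$", url):
        return "regular"
    return "old-style"


examples = (
    "https://78.media.tumblr.com/0f5f0dda0ba7f4d563b8e7e9addb1b76"
    "/tumblr_ov3jatbw4s1u199yso1_1280.jpg",
    "https://78.media.tumblr.com/94d56c599223c59f3feb71ea603484d1"
    "/tumblr_inline_ozgwizXgA71vq6t1o_540.png",
    "https://78.media.tumblr.com/tumblr_kzjlfiTnfe1qz4rgho1_1280.jpg",
)
for url in examples:
    print(classify(url), url)
```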

Regular and inline URLs almost always allow for the amazonaws.com…_raw transformation, which is why this type is the first in the list.

The exceptions are those image URLs you posted above and some inline GIFs (see #48), but all of these have one thing in common:
They all have a maximum width of 500 and there is no difference between the _1280, _540 and _500 version when using xy.media.tumblr.com URLs (at least when comparing pixel values; embedded metadata is another thing altogether).

For all of these exception-images, the amazonaws.com…_500 version, and only this one, seems to always work and holds a lot more metadata (EXIF, etc.) than the version from xy.media.tumblr.com.

Pixel-values in those 2 images are exactly the same, but the latter one has 2.5 times the filesize of the former and a lot more metadata, but that alone can't be the only reason for its bloated filesize.

Old-style URLs don't support _raw, but amazonaws.com…_1280 seems to work for them. Here the same problem as before arises: images from amazonaws.com have a much larger filesize, but no other difference than metadata.

I'm not entirely sure if it is worth having unnecessarily large files only to have the "best"/highest quality version available, so the _500 and _1280 for old-style URLs should maybe be scrapped. What do you think about this? Should it maybe be another option? (But what would this one be called?)

And by the way: x if y else z is Python's equivalent to C's ternary operator y ? x : z. The reasoning behind it, as far as I know, is human readability, i.e. you can read this statement from left to right and it "just makes sense". I dislike it as well, but what can you do.
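
A short example of the conditional expression next to the equivalent if/else statement, using the suffix choice from the snippet discussed earlier. The function names are made up for this illustration.

```python
def suffix_expression(key):
    # Conditional expression: <value if true> if <condition> else <value if false>
    return "_raw." if key else "_1280."


def suffix_statement(key):
    # The same logic written as a regular if/else statement.
    if key:
        return "_raw."
    else:
        return "_1280."


for key in ("0f5f0dda0ba7f4d563b8e7e9addb1b76", None):
    print(repr(key), suffix_expression(key), suffix_statement(key))
```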

@Hrxn
Contributor Author

Hrxn commented Jan 22, 2018

> For all of these exception-images, the amazonaws.com…_500 version, and only this one, seems to always work and holds a lot more metadata (EXIF, etc.) than the version from xy.media.tumblr.com.

If the _500 version is the one that always works, then this should obviously be the preferred choice here, for exception-images. And if they are all limited to 500 px in width or height, including "Old-style" images, even more so. That there seem to be no amazonaws.com…_1280 examples with larger image dimensions seems a bit illogical to me, but on the other hand, this wouldn't be the first inconsistency we've encountered here.

Otherwise, I can't think of any potential downside right now to checking the _1280 URL first in these cases and only falling back to _500 if that first attempt fails. Unless I am misunderstanding something here; not sure, I am really tired right now.

About the example images you linked:

PS E:\Transfer> ls *.jpg


    Directory: E:\Transfer


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       22.01.2018     22:24          55556 mediatumblr.jpg
-a----       22.01.2018     22:24         148956 s3aws.jpg


PS E:\Transfer>

That isn't too bad in terms of size difference, in my opinion [1]. Normally, I think that quality should always be the priority, but I understand that in a case like this, for a very marginal difference in visual quality (though I can see it if I look really closely), others might think that more than double the file size is a bit too much. Another point to consider is generation loss of JPEG files, which is a real issue. Not in this case, not yet, by a long shot, but as image files get saved locally, uploaded, transcoded and shared online again and again, this will become a problem at some point. So, in the spirit of long-term thinking, keeping the bigger JPEG, which seems to be a normal baseline JPEG with an estimated quality of 100 (vs. a progressive JPEG estimated at 92), is also a very reasonable choice.
I also believe that worrying too much about file sizes is a bit out of scope for gallery-dl, and that this should be dealt with by the user instead: by adding more storage capacity, or by re-compressing the files themselves if necessary, because there are definitely better options than using the transcoded images provided by Tumblr. I don't know any details about their image processing backend, but I'd definitely bet that it is optimized for processing speed and not for the best possible compression.
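
For reference, the size difference can be worked out directly from the two Length values in the directory listing above. The variable names are arbitrary.

```python
media_size = 55556    # mediatumblr.jpg, in bytes
s3_size = 148956      # s3aws.jpg, in bytes

difference = s3_size - media_size
ratio = s3_size / media_size

print("difference: {} bytes ({:.1f} KiB)".format(
    difference, difference / 1024))
print("ratio: {:.2f}x".format(ratio))
# difference: 93400 bytes (91.2 KiB)
# ratio: 2.68x
```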

> And by the way: x if y else z is Python's equivalent to C's ternary operator y ? x : z. The reasoning behind it, as far as I know, is human readability, i.e. you can read this statement from left to right and it "just makes sense". I dislike it as well, but what can you do.

Eh, good one. So it actually is a standard ternary operator, but it does not use the usual ternary operator syntax. Chapeau. An if statement where the then part comes before the check...
But you're right, what can you do. It's not worth thinking about stuff like this too much, if at all.

Edit:

[1] Okay, the difference is significant, technically, but 140 KiB is not something worth losing sleep over, in my opinion 😄

@mikf
Owner

mikf commented Jan 23, 2018

Ok, I made a huge mistake when writing that last comment. I trusted the output of the ImageMagick compare I had installed and assumed both images linked above were identical when compared pixel by pixel. It turns out my compare didn't show any difference for any two images, but I only realized this after reading your comment and testing some other image-comparison software. Updating fixed it ...
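
One way to catch a comparison tool that never reports differences is to run it first on inputs that are known to differ. The sketch below uses plain lists of RGB tuples instead of real image files to stay self-contained; `count_differing_pixels` and `comparison_is_sane` are hypothetical helpers, not part of ImageMagick or gallery-dl.

```python
def count_differing_pixels(a, b):
    """Count positions where two equally sized pixel lists differ."""
    if len(a) != len(b):
        raise ValueError("images differ in size")
    return sum(1 for pa, pb in zip(a, b) if pa != pb)


def comparison_is_sane(compare):
    """Check `compare` against inputs with a known answer."""
    black = [(0, 0, 0)] * 4
    white = [(255, 255, 255)] * 4
    # Identical input must yield 0, fully different input must yield 4.
    return compare(black, black) == 0 and compare(black, white) == 4


def broken_compare(a, b):
    """Mimics a tool that reports 'no difference' for everything."""
    return 0


print(comparison_is_sane(count_differing_pixels))
print(comparison_is_sane(broken_compare))
```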

The _500 image from the amazonaws servers is clearly the higher quality version, now that I've looked at it properly, and should definitely be preferred over the version from media.tumblr.com, which it currently is.

@Hrxn
Contributor Author

Hrxn commented Jan 23, 2018

No worries.. 😄

Yeah, the _500 from s3.amazonaws is the better choice all in all, and since in this case there is no difference between the _1280, _540 and _500 version when using xy.media.tumblr.com, picking _1280 here would be pointless.
So the current order in tumblr.py is good..
Excellent, this means that everything is basically done here, so I'm closing this issue now.
Thanks again!

@Hrxn Hrxn closed this as completed Jan 23, 2018