[tumblr] Investigate possible image URL alternative / Extraction fallback for _original_image #64

Hrxn · 2018-01-07T11:27:45Z

gallery-dl/gallery_dl/extractor/tumblr.py

Lines 17 to 24 in 974e73b

    
    def _original_image(url):
        if url.endswith(".gif") and "_inline_" in url:
            return url
        return re.sub(
            (r"https?://\d+\.media\.tumblr\.com"
             r"/([0-9a-f]+)/tumblr_([^/?&#.]+)_\d+\.([0-9a-z]+)"),
            r"http://data.tumblr.com/\1/tumblr_\2_raw.\3", url
        )

Apparently, there is another way to do this. Example:

Here, taken directly from Tumblr (Browser > Right Click > View/Open):
https://68.media.tumblr.com/52828dee073b4ea2e123121f27efb35f/tumblr_p21rciUR931twyphzo1_1280.jpg

What _original_image is doing currently:
http://data.tumblr.com/52828dee073b4ea2e123121f27efb35f/tumblr_p21rciUR931twyphzo1_raw.jpg

What seems to be working as well:
https://s3.amazonaws.com/data.tumblr.com/52828dee073b4ea2e123121f27efb35f/tumblr_p21rciUR931twyphzo1_raw.jpg

The same image. And this seems to work for all URLs I've tested so far.
Has the potential benefit of using HTTPS. Not sure if that is really necessary, but that's another topic. I don't know about potential downsides, I did not notice anything so far. Maybe it's a bit slower. Hence investigate 😄
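To make the proposed change concrete, here is a minimal sketch of the same substitution pointed at the HTTPS s3.amazonaws.com host. The pattern is copied from the snippet above; the name `s3_raw_url` is hypothetical and this is not necessarily how gallery-dl ends up implementing it.

```python
import re

# Pattern taken from _original_image in the snippet above.
_MEDIA_URL = re.compile(
    r"https?://\d+\.media\.tumblr\.com"
    r"/([0-9a-f]+)/tumblr_([^/?&#.]+)_\d+\.([0-9a-z]+)"
)


def s3_raw_url(url):
    """Rewrite a media.tumblr.com URL to the s3.amazonaws.com _raw form.

    Hypothetical helper for illustration only.
    """
    # Inline GIFs are passed through unchanged, as in the original function.
    if url.endswith(".gif") and "_inline_" in url:
        return url
    return _MEDIA_URL.sub(
        r"https://s3.amazonaws.com/data.tumblr.com/\1/tumblr_\2_raw.\3", url
    )
```

Applied to the example URL from the browser, this should yield the s3.amazonaws.com link shown above.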

I've found this by searching around on Greasyfork, and so far each of these
https://greasyfork.org/en/scripts/31873-use-tumblr-raw-image
https://greasyfork.org/en/scripts/9014-tumblr-image-size
https://greasyfork.org/en/scripts/31593-tumblr-images-to-hd-redirector
use s3.amazonaws.com/data.tumblr.com, so maybe it's fine to use.

The first script changes the src of img elements on the page; the other two only run if an image on Tumblr is already open in the current tab. Both of those also check for availability via the response code, which might be a good idea for gallery-dl as well. I can't remember a raw link ever failing, but for older images it might be a possibility. (All scripts are rather short and can be viewed directly in the "Code" tab, in case you're not familiar with Greasyfork.)


mikf · 2018-01-07T15:00:02Z

data.tumblr.com and s3.amazonaws.com are aliases of the same underlying domain:

$ host data.tumblr.com
data.tumblr.com is an alias for s3-1.amazonaws.com.
s3-1.amazonaws.com has address 52.216.21.13

$ host s3.amazonaws.com
s3.amazonaws.com is an alias for s3-1.amazonaws.com.
s3-1.amazonaws.com has address 54.231.40.130

Using https://s3.amazonaws.com/data.tumblr.com/… should therefore work in exactly the same way and even allows HTTPS to be used … nice.

Sending a HEAD request to check for availability for every image seems a bit much and would make things a bit slower, but this might be an option for inline GIFs. I've checked with some older images from 2011-2014 and they all seem to work just fine.

Hrxn · 2018-01-09T02:13:22Z

The only difference in the browser is that s3.amazonaws.com is lacking any favicons. 😄

But HTTPS is definitely preferable. At least that is my opinion here.

I expressed myself badly regarding the availability check in the userscripts; I apologize for the misunderstanding. Sending additional requests would obviously slow the process down. But that's not really what I meant, and a separate check would be rather pointless anyway, because we're already doing a request for the download, aren't we?
What I was actually trying to say is that checking the result of that response would be good, so that in case of any error gallery-dl could use a fallback, like the URL originally returned from the API.
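The idea could look roughly like the sketch below: no extra availability request, just the download itself, moving to the next candidate when it fails. The names `fetch` and `download_with_fallback` are hypothetical and only use the standard library; gallery-dl's actual downloader is not shown in this thread.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download `url` and return its body.

    urlopen raises HTTPError (an OSError subclass) for 4xx/5xx responses.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_fallback(urls, fetcher=fetch):
    """Try each URL in order; return (url, data) for the first success."""
    for url in urls:
        try:
            return url, fetcher(url)
        except OSError:
            # Covers HTTP errors, connection failures, and timeouts:
            # this candidate failed, so move on to the next one.
            continue
    raise LookupError("no candidate URL could be downloaded")
```

Passing the preferred URL first and the API-provided URL last gives the fallback behaviour described above without any additional request in the common case.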

Hrxn · 2018-01-13T18:14:48Z

Oh my, almost forgot this one...
I think I've found some URLs.

Consider this:

https://s3.amazonaws.com/data.tumblr.com/ee54205c6c3f9fdf0cf9ef6537deb3b6/tumblr_mesng84nLy1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/acfb3bf747b1d5b06baadb4beee43231/tumblr_mesk01ea0v1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/266663bf199f30c5eabaa78239eb6a46/tumblr_mes4k1Ch931rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/a96d0cac5a3a7589dd3e8c0ee715ab61/tumblr_mes35is1wL1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/018d4128730bf2205e6dbd0ac642b4a9/tumblr_mes34oLUw01rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/715d296f7963094575324e03ae5e8a5b/tumblr_mer2coa6Zk1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/27703cc82200cb482fe51cf5a12b1e27/tumblr_mer0vlNxNy1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/a66fe310e823f63a6f610728a923ceaf/tumblr_mer0v4WYza1rnfejco1_raw.jpg
https://s3.amazonaws.com/data.tumblr.com/660f51ff1f2ee2b02534beb60d36bf4e/tumblr_mer0ueeukX1rnfejco1_raw.jpg

versus this:

https://68.media.tumblr.com/ee54205c6c3f9fdf0cf9ef6537deb3b6/tumblr_mesng84nLy1rnfejco1_1280.jpg
https://68.media.tumblr.com/acfb3bf747b1d5b06baadb4beee43231/tumblr_mesk01ea0v1rnfejco1_1280.jpg
https://68.media.tumblr.com/266663bf199f30c5eabaa78239eb6a46/tumblr_mes4k1Ch931rnfejco1_1280.jpg
https://68.media.tumblr.com/a96d0cac5a3a7589dd3e8c0ee715ab61/tumblr_mes35is1wL1rnfejco1_1280.jpg
https://68.media.tumblr.com/018d4128730bf2205e6dbd0ac642b4a9/tumblr_mes34oLUw01rnfejco1_1280.jpg
https://68.media.tumblr.com/715d296f7963094575324e03ae5e8a5b/tumblr_mer2coa6Zk1rnfejco1_1280.jpg
https://68.media.tumblr.com/27703cc82200cb482fe51cf5a12b1e27/tumblr_mer0vlNxNy1rnfejco1_1280.jpg
https://68.media.tumblr.com/a66fe310e823f63a6f610728a923ceaf/tumblr_mer0v4WYza1rnfejco1_1280.jpg
https://68.media.tumblr.com/660f51ff1f2ee2b02534beb60d36bf4e/tumblr_mer0ueeukX1rnfejco1_1280.jpg

Using s3.amazonaws.com does not make any difference, as it should be. That's the good news.
I picked the 68. part myself at random, but other numbers should also work.
Apart from that, these are the links as returned by the API.


Hrxn · 2018-01-22T02:19:22Z

@mikf
So that is the reason for the Urllist type, good to know.

Just for correct understanding: The 9fccd7b commit msg says

Each image now produces 3 URLs:
- amazonaws.com _raw (or _1280 for older images)
- amazonaws.com _500
- media.tumblr.com (URL returned by API)

Which is basically this, right?

gallery-dl/gallery_dl/extractor/tumblr.py

Lines 28 to 32 in 9fccd7b

    
    return (
        "".join((root, path, "_raw." if key else "_1280.", ext)),
        "".join((root, path, "_500.", ext)),
        url,
    )

So return gives us three URLs, which are then handled by the specific gallery-dl functions, in job.py, right?

BTW, that inline x if y else z seems somehow unusual to me. Some Python specific syntactic trick? Reminds me of the classic ternary operator, but it's still called if.. Yeah, Python seems a bit strange to me, to be honest.

But okay, back to topic, gallery-dl tries the first of the listed URLs, and if the server does not respond with some error, we're basically done here. Otherwise, try URL 2, and if necessary, URL 3, right?

I think I know the first difference with regard to suffixes (_raw vs. _1280), because older URLs miss this part which is called key as the capture group here.
But what about the second URL in this list? I don't know if this works differently with s3.amazonaws.com based URLs, but for "normal" tumblr URLs (seen in the browser, returned from API), the suffix part (in this case: _500) is actually the limit of either width or height of the image.
You can see the difference to _1280 easily, given that the originally uploaded image is large enough in terms of dimension. For smaller images, _1280 results in the same image as the other (smaller) suffix variants, but the URL with _1280 always works in any case.

In comparison with the URL returned by the API, i.e here, I think:

gallery-dl/gallery_dl/extractor/tumblr.py

Line 98 in 9fccd7b

photo.update(photo["original_size"])

which - not necessarily, but in most cases - gives us a URL with _1280, which would clearly be preferable to the _500 "raw" URL from the second step of the URL list. That is assuming _500 really works the same way with s3.amazonaws.com URLs as with the classic xy.media.tumblr... URLs, which is something I've never tested myself, ironically, but it would definitely be weird if they behaved differently here.
Not sure, but I hope you get what I mean 😄

mikf · 2018-01-22T20:07:46Z

The code snippet you posted is exactly what now produces 3 instead of just 1 URL for each image, where number 2 and 3 are just fallbacks if the first one fails.

To explain my choices for these URLs:
Tumblr has 2-3 classes of images: regular, inline, and old-style

regular:
https://78.media.tumblr.com/0f5f0dda0ba7f4d563b8e7e9addb1b76/tumblr_ov3jatbw4s1u199yso1_1280.jpg

inline:
https://78.media.tumblr.com/94d56c599223c59f3feb71ea603484d1/tumblr_inline_ozgwizXgA71vq6t1o_540.png

old-style (things from before 2014?):
https://78.media.tumblr.com/tumblr_kzjlfiTnfe1qz4rgho1_1280.jpg

Regular and inline URLs almost always allow for the amazonaws.com…_raw transformation, which is why this type is the first in the list.

The exceptions are those image URLs you posted above and some inline GIFs (see #48), and all of these have one thing in common:
They all have a maximum width of 500 and there is no difference between the _1280, _540 and _500 version when using xy.media.tumblr.com URLs (at least when comparing pixel values; embedded metadata is another thing altogether).

For all of these exception-images, the amazonaws.com…_500 version, and only this one, seems to always work and holds a lot more metadata (EXIF, etc.) than the version from xy.media.tumblr.com.

Pixel-values in those 2 images are exactly the same, but the latter one has 2.5 times the filesize of the former and a lot more metadata, but that alone can't be the only reason for its bloated filesize.

Old-style URLs don't support _raw, but amazonaws.com…_1280 seems to work for them. Here the same problem as before arises: images from amazonaws.com have a much larger filesize, but no other difference than metadata.
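As a rough sketch of how the three URL classes map to a candidate list: the regular expression and the names `candidate_urls`, `_PATTERN` and `_ROOT` below are assumptions made for illustration, not the code from the commit. The optional hash directory plays the role of `key` in the snippet quoted earlier.

```python
import re

# Hypothetical pattern: the hash directory ("key") is optional so that
# old-style URLs without it also match.
_PATTERN = re.compile(
    r"https?://\d+\.media\.tumblr\.com/"
    r"(?:([0-9a-f]+)/)?(tumblr_[^/?&#.]+)_\d+\.([0-9a-z]+)$"
)
_ROOT = "https://s3.amazonaws.com/data.tumblr.com/"


def candidate_urls(url):
    """Return download candidates for `url`, preferred one first."""
    match = _PATTERN.match(url)
    if not match:
        # Unknown URL shape: nothing to transform.
        return (url,)
    key, name, ext = match.groups()
    path = (key + "/" if key else "") + name
    return (
        # Regular and inline URLs have a key and get _raw;
        # old-style URLs have no key and get _1280 instead.
        _ROOT + path + ("_raw." if key else "_1280.") + ext,
        _ROOT + path + "_500." + ext,
        url,
    )
```

With the three example URLs above, the regular and inline ones would get a `_raw` first candidate and the old-style one a `_1280` first candidate, each followed by the `_500` variant and the unmodified URL.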

I'm not entirely sure if it is worth having unnecessarily large files only to have the "best"/highest quality version available, so the _500 and _1280 for old-style URLs should maybe be scrapped. What do you think about this? Should it maybe be another option? (But what would this one be called?)

And by the way: x if y else z is Python's equivalent to C's ternary operator y ? x : z. The reasoning behind it, as far as I know, is human readability, i.e. you can read this statement from left to right and it "just makes sense". I dislike it as well, but what can you do.
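For reference, a tiny illustration of the conditional expression as it is used in the snippet above. The function name `size_suffix` is made up for this example.

```python
def size_suffix(key):
    """Pick the URL suffix the same way the snippet above does."""
    # Conditional expression: <value if true> if <condition> else <value if false>
    # Equivalent to C's: key ? "_raw." : "_1280."
    return "_raw." if key else "_1280."
```

Any truthy `key` selects `"_raw."`; `None` or an empty string selects `"_1280."`.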

Hrxn · 2018-01-22T22:03:47Z

> For all of these exception-images, the amazonaws.com…_500 version, and only this one, seems to always work and holds a lot more metadata (EXIF, etc.) than the version from xy.media.tumblr.com.

If the _500 version is the one that always works, then this should obviously be the preferred choice here, for exception-images. And if they are all limited to 500 px in width or height, including "Old-style" images, even more so. That there seem to be no amazonaws.com…_1280 examples with larger image dimensions seems a bit illogical to me, but on the other hand, this wouldn't be the first inconsistency we've encountered here.

Otherwise, I can't think of any potential downside right now to checking the _1280 URL first in these cases and only falling back to _500 if that first attempt fails. Unless I am misunderstanding something here; not sure, I am really tired right now...

About the example images you linked:

PS E:\Transfer> ls *.jpg


    Directory: E:\Transfer


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       22.01.2018     22:24          55556 mediatumblr.jpg
-a----       22.01.2018     22:24         148956 s3aws.jpg


PS E:\Transfer>

That isn't too bad in terms of size difference, in my opinion [1]. Normally, I think that quality should always be the priority, but I understand that in a case like this, for a very marginal difference in visual quality (though I can see it if I look really closely), others might think that more than double the file size is a bit too much. Another point to take into consideration is generational loss in JPEG files, which is a real issue. Not in this case, not yet, by a long shot, but as image files get saved locally, uploaded, transcoded and shared online again and again, at some point this will be a problem. So, in the spirit of long-term thinking, keeping the bigger JPEG, which seems to be a normal baseline JPEG with an estimated quality of 100 (vs. a progressive JPEG estimated at 92), is also a very reasonable choice.
I also believe that worrying too much about file sizes is a bit out of scope for gallery-dl, and that this should be dealt with by the user instead: by adding more storage capacity, or by re-compressing files themselves if necessary, because there are definitely better options than using the transcoded images provided by Tumblr. I don't know any details about their image processing backend, but I'd definitely bet that it is optimized for processing speed and not for the best possible compression.

> And by the way: x if y else z is Python's equivalent to C's ternary operator y ? x : z. The reasoning behind it, as far as I know, is human readability, i.e. you can read this statement from left to right and it "just makes sense". I dislike it as well, but what can you do.

Eh, good one. So it actually is a standard ternary operator, but it does not use the usual ternary operator syntax. Chapeau. An if statement where the then part comes before the check..
But you're right, what can you do.. It's not worth thinking about stuff like this too much, if at all.

Edit:

[1] Okay, the difference is significant, technically, but 140 KiB is not something worth losing sleep over, in my opinion 😄

mikf · 2018-01-23T00:36:11Z

Ok, I made a huge mistake when writing that last comment. I trusted the output of the ImageMagick compare I had installed and assumed both images linked above were identical when compared pixel-by-pixel. It turns out my compare didn't show a difference for any two images, but I only realized this after reading your comment and testing some other image-comparison software. Updating fixed it ...

The _500 image from the amazonaws servers is clearly the higher quality version, now that I've looked at it properly, and should definitely be preferred over the version from media.tumblr.com, which it currently is.

Hrxn · 2018-01-23T01:24:38Z

No worries.. 😄

Yeah, the _500 from s3.amazonaws is the better choice all in all, and since in this case there is no difference between the _1280, _540 and _500 version when using xy.media.tumblr.com, picking _1280 here would be pointless.
So the current order in tumblr.py is good..
Excellent, this means that everything is basically done here, so I'm closing this issue now.
Thanks again!

mikf added the enhancement label Jan 7, 2018

mikf added a commit that referenced this issue Jan 9, 2018

[tumblr] use s3.amazonaws.com for image URLs (#64)

75b2e84

Hrxn changed the title ~~[tumblr] Investigate possible image URL alternative~~ [tumblr] Investigate possible image URL alternative / Extraction fallback for _original_image Jan 15, 2018

mikf added a commit that referenced this issue Jan 19, 2018

[tumblr] provide fallback URLs (#64)

9fccd7b

Each image now produces 3 URLs:
- amazonaws.com _raw (or _1280 for older images)
- amazonaws.com _500
- media.tumblr.com (URL returned by API)

Hrxn closed this as completed Jan 23, 2018
