[tumblr] Investigate possible image URL alternative / Extraction fallback for _original_image #64
Comments
data.tumblr.com and s3.amazonaws.com are aliases of the same underlying domain:
Using https://s3.amazonaws.com/data.tumblr.com/… should therefore work in exactly the same way and even allow for HTTPS to be usable … nice. Sending a HEAD request to check for availability for every image seems a bit much and would make things a bit slower, but this might be an option for inline GIFs. I've checked with some older images from 2011-2014 and they all seem to work just fine.
The only difference in the browser is that But HTTPS is definitely preferable. At least that is my opinion here. I expressed myself badly about the availability check in the userscripts; I apologize for the misunderstanding. Sending additional requests would slow the process down, obviously. But that's not really what I meant. Besides, that suggestion is rather pointless, because we're making the request anyway for the download, aren't we?
Oh my, almost forgot this one... Consider this:
versus this:
The |
Each image now produces 3 URLs:
- amazonaws.com _raw (or _1280 for older images)
- amazonaws.com _500
- media.tumblr.com (URL returned by API)
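As a rough illustration of the list in this commit message, the three candidates could be derived from the API URL along these lines. This is only a sketch, not the code from the commit: the regular expression, the `candidate_urls` name and the assumed URL shape (a hex hash directory followed by `tumblr_<id>_<size>.<ext>`) are assumptions, and the `_1280` variant for older images is left out.

```python
import re

# Assumed shape of a hash-style URL as returned by the API:
#   https://<N>.media.tumblr.com/<hash>/tumblr_<id>_<size>.<ext>
MEDIA_URL = re.compile(
    r"https?://(?:\d+\.)?media\.tumblr\.com"
    r"/([0-9a-f]+)/(tumblr_[^/?#]+)_\d+\.(\w+)$"
)


def candidate_urls(api_url):
    """Yield the URLs to try, in order of preference (hypothetical helper)."""
    match = MEDIA_URL.match(api_url)
    if match:
        digest, name, ext = match.groups()
        base = "https://s3.amazonaws.com/data.tumblr.com/{}/{}".format(
            digest, name)
        yield "{}_raw.{}".format(base, ext)   # 1. amazonaws.com _raw
        yield "{}_500.{}".format(base, ext)   # 2. amazonaws.com _500
    yield api_url                             # 3. URL returned by the API
```

URLs that do not match the assumed pattern simply yield the API URL unchanged.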
@mikf Just for correct understanding: The 9fccd7b commit msg says
Which is basically this, right? gallery-dl/gallery_dl/extractor/tumblr.py Lines 28 to 32 in 9fccd7b
So BTW, that inline But okay, back to topic, gallery-dl tries the first of the listed URLs, and if the server does not respond with some error, we're basically done here. Otherwise, try URL 2, and if necessary, URL 3, right? I think I know the first difference with regard to suffixes ( In comparison with the URL returned by the API, i.e here, I think: gallery-dl/gallery_dl/extractor/tumblr.py Line 98 in 9fccd7b
which - not necessarily, but in most cases - gives us a URL with _1280, which would clearly be preferable to the _500 "raw" URL from the second step of the URL list. That is, given that _500 really works the same way with s3.amazonaws.com URLs as with the classic xy.media.tumblr... URLs, which is something I've never tested myself, ironically, but it would definitely be weird if they worked differently here. Not sure, but I hope you get what I mean 😄
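The "try URL 1, otherwise URL 2, otherwise URL 3" behaviour described in this comment can be sketched without any extra availability requests, since the download request itself reveals whether a URL works. The `download_first` name and the `fetch` callable are hypothetical and not part of gallery-dl; `fetch` is assumed to return the response body or raise `OSError` (which `urllib.error.URLError` derives from) on failure.

```python
def download_first(urls, fetch):
    """Try each URL in order and return (url, data) for the first success.

    Returns None if every URL fails.
    """
    for url in urls:
        try:
            return url, fetch(url)
        except OSError:
            # This candidate failed; fall through to the next one.
            continue
    return None
```

With this shape, the fallbacks only cost additional requests when an earlier URL actually fails.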
The code snippet you posted is exactly what now produces 3 instead of just 1 URL for each image, where number 2 and 3 are just fallbacks if the first one fails. To explain my choices for these URLs:
Regular and inline URLs almost always allow for the The exceptions are those image URLs you posted above and some inline GIFs (see #48), but all of these have one thing in common: For all of these exception-images, the
Pixel-values in those 2 images are exactly the same, but the latter one has 2.5 times the filesize of the former and a lot more metadata, though that alone can't be the only reason for its bloated filesize. Old-style URLs don't support I'm not entirely sure if it is worth it to have unnecessarily large files only to have the "best"/highest quality version available, so the And by the way:
If the Otherwise, I can't think of any potential downside right now when checking the About the example images you linked:
    PS E:\Transfer> ls *.jpg
        Directory: E:\Transfer
    Mode     LastWriteTime      Length  Name
    ----     -------------      ------  ----
    -a----   22.01.2018 22:24    55556  mediatumblr.jpg
    -a----   22.01.2018 22:24   148956  s3aws.jpg
    PS E:\Transfer>
That isn't too bad in terms of size difference, in my opinion [1]. Normally, I think that quality should always be the priority, but I understand that in a case like this one, for a very marginal difference in visual quality (though I can see it if I look really closely), others might think that more than double the file size is a bit too much. But another point that can be taken into consideration is generational loss of JPEG files, which is a real issue. Not in this case, not yet, by a long shot, but as image files get saved locally, uploaded, transcoded and shared online again and again, and this cycle continues, at some point this will be a problem. So, in the spirit of long-term thinking, keeping the bigger JPEG, which seems to be a normal baseline JPEG with an estimated quality of 100 (vs. a progressive JPEG estimated at 92), is also a very reasonable choice.
Eh, good one. So it actually is a standard ternary operator, but it does not use the usual ternary operator syntax. Chapeau. An Edit: [1] Okay, the difference is significant, technically, but 140 KiB is not something worth losing sleep over, in my opinion 😄
Ok, I made a huge mistake when writing that last comment. I trusted the output of ImageMagick's The
No worries.. 😄 Yeah, the
gallery-dl/gallery_dl/extractor/tumblr.py, lines 17 to 24 in 974e73b
Apparently, there is another way to do this. Example:
Here, taken directly from Tumblr (Browser > Right Click > View/Open):
https://68.media.tumblr.com/52828dee073b4ea2e123121f27efb35f/tumblr_p21rciUR931twyphzo1_1280.jpg
What _original_image is doing currently:
http://data.tumblr.com/52828dee073b4ea2e123121f27efb35f/tumblr_p21rciUR931twyphzo1_raw.jpg
What seems to be working as well:
https://s3.amazonaws.com/data.tumblr.com/52828dee073b4ea2e123121f27efb35f/tumblr_p21rciUR931twyphzo1_raw.jpg
The same image. And this seems to work for all URLs I've tested so far.
This has the potential benefit of using HTTPS. Not sure if that is really necessary, but that's another topic. I don't know about potential downsides; I have not noticed anything so far. Maybe it's a bit slower. Hence: investigate 😄
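To make the rewrite concrete, here is a minimal sketch of mapping the browser URL above to the s3.amazonaws.com variant. It is not the existing _original_image code; the `to_s3_raw` name and the regular expression are assumptions based only on the example URLs in this issue.

```python
import re

# Assumed URL shape, taken from the example above:
#   https://<N>.media.tumblr.com/<hash>/tumblr_<id>_<size>.<ext>
MEDIA_URL = re.compile(
    r"https?://(?:\d+\.)?media\.tumblr\.com"
    r"/([0-9a-f]+)/(tumblr_[^/?#]+)_\d+\.(\w+)$"
)


def to_s3_raw(url):
    """Return the HTTPS s3.amazonaws.com `_raw` URL, or None if `url` does not match."""
    match = MEDIA_URL.match(url)
    if match is None:
        return None
    digest, name, ext = match.groups()
    return "https://s3.amazonaws.com/data.tumblr.com/{}/{}_raw.{}".format(
        digest, name, ext)
```

Applied to the `_1280` URL from the example, this returns the `https://s3.amazonaws.com/data.tumblr.com/…_raw.jpg` URL shown above.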
I've found this by searching around on Greasyfork, and so far each of these
https://greasyfork.org/en/scripts/31873-use-tumblr-raw-image
https://greasyfork.org/en/scripts/9014-tumblr-image-size
https://greasyfork.org/en/scripts/31593-tumblr-images-to-hd-redirector
use s3.amazonaws.com/data.tumblr.com, so maybe it's fine to use.
The first script changes the src of img elements on the page; the other two only run if an image on Tumblr is already open in the current tab. But they both check for availability with a response code check, which might be a good idea for gallery-dl as well. I can't remember a raw link ever not working, but for older images it might be a possibility.
(All scripts are rather short and can be directly viewed with the "Code" tab, in case you're not familiar with Greasyfork.)
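A response code check of the kind these userscripts do could look roughly like this in Python, using only the standard library. This is a sketch rather than anything gallery-dl does; the `is_available` name, the injectable `opener` parameter and the choice to treat only HTTP 200 as "available" are assumptions.

```python
import urllib.request


def is_available(url, opener=urllib.request.urlopen, timeout=10):
    """Send a HEAD request and report whether the server answers with HTTP 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib raises HTTPError for 4xx/5xx responses and URLError for
        # connection problems; both are OSError subclasses, as are timeouts.
        return False
```

A HEAD request avoids transferring the image body, but it is still one extra round trip per image.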