Tumblr, not all images are downloaded #48

Closed
michaelx opened this issue Nov 3, 2017 · 9 comments

michaelx commented Nov 3, 2017

First of all, great project! The Tumblr extractor seems to download only a limited number of images.

E.g.

gallery-dl "http://wrapmagazine.tumblr.com/tagged/illustration"

gallery-dl gives me 78 images, while DiSiqueira/TumblrDownloader gives me all 123 images.

Contributor

Hrxn commented Nov 3, 2017

Just as I was tinkering around with Tumblr and thinking about opening a new issue.. 😄

Testing with
https://api.tumblr.com/console/calls/blog/posts
Gives me "total_posts": 150, for that tag (illustration).
Not sure if 123 is really correct either.

The returned array for posts contains only 20 entries there, it seems. Not sure if that is just the Web API console or the response in general.

When I'm home I'll try it with this:
https://github.com/tumblr/pytumblr / https://pypi.python.org/pypi/PyTumblr

Might depend on the post type..

PS:
Another great project here on GitHub dealing with Tumblr:
https://github.com/bbolli/tumblr-utils/blob/master/tumblr_backup.py

@mikf mikf closed this as completed in d6bed9f Nov 3, 2017
Owner

mikf commented Nov 3, 2017

DiSiqueira/TumblrDownloader and gallery-dl are both using tumblr's old API, which only reports 123 posts (not images) in total. https://github.com/bbolli/tumblr-utils/blob/master/tumblr_backup.py actually downloaded more than 300 images for wrapmagazine/illustration.

It might be time to switch to the new API version ...

@mikf mikf reopened this Nov 3, 2017
mikf added a commit that referenced this issue Nov 3, 2017
Owner

mikf commented Nov 3, 2017

Tumblr API v2 is up and running and produces the same results as the old API, so the initial count of 123 images appears to be correct.

It seems that using tags gives some pretty counter-intuitive results for "total_posts". Here are some numbers for http://wrapmagazine.tumblr.com - edit: for Type set to photo:

Tags          total_posts  Actual Posts  Images
none          123          123           169
illustration  150          85            123
print         5            2             3

Contributor

Hrxn commented Nov 4, 2017

How do you get to 123 posts in total without any tags set?

From https://api.tumblr.com/console/calls/blog/posts [1], I see
.response.total_posts= 287

Additionally:
.response.blog.posts = 287
Which seems to be the basic blog information, in general part of the API response, I assume.

Can also be obtained via:
https://api.tumblr.com/console/calls/blog/info [2]

Besides, 150 posts for illustration, while 123 in total, does not really make much sense 😄

[1] https://www.tumblr.com/docs/en/api/v2#posts
[2] https://www.tumblr.com/docs/en/api/v2#blog-info

As I understand it..

Owner

mikf commented Nov 4, 2017

The numbers above are for posts with "Type" set to photo. Sorry for not explicitly mentioning that.

There are indeed 287 posts in total (.response.blog.posts), of which 123 are of type photo (.response.total_posts) and only those contain a photos object with information about actual images.
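To make the three counters concrete, here is a minimal sketch that reads them from an already-decoded /posts response. The field paths `.response.blog.posts`, `.response.total_posts` and the per-post `photos` list are the ones named in this thread; the `summarize` helper and the sample payload are made up for illustration and are not real API output.

```python
def summarize(payload):
    """Collect the counters discussed in this thread from a /posts response."""
    response = payload["response"]
    posts = response["posts"]
    return {
        "blog_posts": response["blog"]["posts"],   # every post on the blog
        "total_posts": response["total_posts"],    # posts matching the query
        "returned": len(posts),                    # posts in this one page
        # Only photo posts carry a "photos" list; a photo set has several entries.
        "images": sum(len(post.get("photos", [])) for post in posts),
    }


# Hand-written sample: one single-photo post and one photo set of three.
sample = {
    "response": {
        "blog": {"posts": 287},
        "total_posts": 123,
        "posts": [
            {"type": "photo", "photos": [{}]},
            {"type": "photo", "photos": [{}, {}, {}]},
        ],
    }
}

print(summarize(sample))
```

Since a single photo post can hold a whole photo set, the image count can exceed the post count, which would be consistent with 123 photo posts yielding 169 images.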

Applying the illustration tag changes .response.total_posts to 150, which doesn't make any sense, I agree on that. This number stays the same regardless of Type selected, which indicates that Tumblr only applies the Tag filter and disregards Type to get to this number.

There is also other "weird" or unexpected behavior when using tags: requesting, for example, posts 51 to 100 sometimes only gets you, let's say, 32 posts instead of the expected 50, even though there are more posts after that.
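One way to cope with short pages is to walk the offset in fixed steps and never use the length of a page as a stop signal. The sketch below assumes that the API cuts a window out of the tag-filtered posts first and drops non-matching types afterwards; that is only a guess that fits the numbers above, not documented behavior. `fetch_page` is a hypothetical callable standing in for the real request, and `fake_fetch` only simulates the observed effect.

```python
def iter_posts(fetch_page, page_size=50):
    """Yield posts from every window, tolerating short or empty pages.

    fetch_page(offset, limit) must return (total_posts, posts_in_window).
    """
    offset = 0
    total = None
    while total is None or offset < total:
        total, posts = fetch_page(offset, page_size)
        yield from posts
        # Advance by the window size, not by len(posts): a short page
        # does not mean that the end of the blog has been reached.
        offset += page_size


# Simulated blog: ten tagged posts, of which only some are photo posts.
ALL_POSTS = [{"id": i, "type": "photo" if i % 3 else "text"} for i in range(10)]


def fake_fetch(offset, limit):
    window = ALL_POSTS[offset:offset + limit]
    photos = [post for post in window if post["type"] == "photo"]
    return len(ALL_POSTS), photos   # total ignores the type filter


photo_ids = [post["id"] for post in iter_posts(fake_fetch, page_size=4)]
print(photo_ids)
```

Here `total_posts` is only used as an upper bound for the offset, which still works when it overcounts.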

Contributor

Hrxn commented Nov 4, 2017

> Applying the illustration tag changes .response.total_posts to 150, which doesn't make any sense [..]

Well, it does make sense, not accounting for type, 287 posts in total, of which 150 have the tag 'illustration'.
If you use /posts only with a tag defined, I think total_posts and the actual number of posts returned from the API should match.

> This number stays the same regardless of Type selected, which indicates that Tumblr only applies the Tag filter and disregards Type to get to this number.

Indeed. I see what you mean. I think this basically means that you can't do something like this:
SELECT * FROM posts WHERE type = 'x' AND tag = 'y'

Because the API simply does not support it (because of additional load?).
It seems that only one property gets used, and apparently tag takes precedence.

Or to be more specific, maybe you actually can, but you have to ignore total_posts because it is no longer accurate then.

I just tried tag = print, type = text, and it actually gives me 3 posts. From your table above, 2 posts for type = photo, and this would land us at 5 of 5.
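A quick way to double-check this kind of arithmetic is to request by tag only and count the types client-side. This is just a sketch: the `type` key is the per-post field discussed here, while `count_by_type` and the sample list (built to mirror the 2 + 3 split for the print tag) are made up for illustration.

```python
from collections import Counter


def count_by_type(posts):
    """Tally posts per post type without relying on total_posts."""
    return Counter(post["type"] for post in posts)


# Stand-in for the posts returned for tag = print, with no type filter.
tagged_print = [{"type": "photo"}] * 2 + [{"type": "text"}] * 3

counts = count_by_type(tagged_print)
print(dict(counts), "sum:", sum(counts.values()))
```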

But okay, this is all not really the issue, I'd say, because relying on type = photo is kind of a red herring anyway. This was one of the primary reasons I've been thinking about opening a new issue for Tumblr enhancements lately, before @michaelx kinda beat me to it 😉

The crux is how Tumblr works, which is a bit needlessly complicated (others might argue it's flexible), I'd say. So I'm really not surprised that this discussion thread here exists 😄

The 'Make a post' functionality on your Tumblr Dashboard lets you pick between the seven types, but not all blogs on Tumblr make the sensible choice to use only Photo (which can be a single photo post or a photo set) for images. Some users have the habit of using the Link feature, which automatically creates embedded images if used in conjunction with certain sites (I can definitely say Instagram, and I think Flickr as well, probably more). And there's of course the Text post, which lets you insert photos and even videos with the click of a single button (and the obligatory GIF search, obviously) for more joyful inlined content. On top of that, you can do the same for Quote. So, basically, full HTML as the post body.
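Since the post body is plain HTML, pulling inline images out of it can be sketched with nothing but the standard library. This is a rough illustration of the idea, not how gallery-dl does it; `ImageCollector` and `inline_images` are hypothetical names, and the sample body is invented.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> tags.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def inline_images(body_html):
    collector = ImageCollector()
    collector.feed(body_html)
    collector.close()
    return collector.urls


body = '<p>Some text</p><figure><img src="https://example.org/a.gif"/></figure><img alt="no source">'
print(inline_images(body))
```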

mikf added a commit that referenced this issue Nov 18, 2017
This adds support for audio and video posts (most videos are shared
from youtube/instagram which isn't supported -> youtube-dl),
as well as link posts and image-search inside of text posts.

Most of this is just WIP and will need some sort of improvement
and options to enable/disable different media types etc.
mikf added a commit that referenced this issue Nov 23, 2017
- posts   : list of post-types to inspect
- inline  : scan post bodies for inline images
- external: follow external links
Owner

mikf commented Nov 23, 2017

I think the last commit pretty much implements everything @Hrxn's last paragraph hints at (even if it took me far longer than it should have):

  • You can select which post types should be scanned for photo/audio/video files to download (internally it requests information about all posts and filters the unwanted ones out).
  • It can search post bodies for inline images.
  • It can follow links to external sites (mainly useful for "Link" posts).
  • Image and video URLs are transformed to their "raw" form (*)
    • https://78.media.tumblr.com/ee589c6345f29d2d5935cecb49b0a705/tumblr_oztu02dIHp1wgha4yo1_1280.png
      -->
    • http://data.tumblr.com/ee589c6345f29d2d5935cecb49b0a705/tumblr_oztu02dIHp1wgha4yo1_raw.png
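The URL rewrite can be sketched as a single regular expression. The pattern below is inferred from the one example pair above, so it is an assumption that it covers every media URL; `to_raw` is a made-up helper name, and anything that does not match is returned unchanged.

```python
import re

# <number>.media.tumblr.com/<hex digest>/tumblr_<id>_<size>.<ext>
MEDIA_URL = re.compile(
    r"^https?://\d+\.media\.tumblr\.com/([0-9a-f]+)/(tumblr_\w+?)_\d+\.(\w+)$"
)


def to_raw(url):
    """Rewrite a sized media URL to its data.tumblr.com "raw" form."""
    match = MEDIA_URL.match(url)
    if match is None:
        return url  # unknown layout: leave it alone
    digest, name, ext = match.groups()
    # Plain HTTP on purpose, the certificate of data.tumblr.com is not valid.
    return f"http://data.tumblr.com/{digest}/{name}_raw.{ext}"


sized = "https://78.media.tumblr.com/ee589c6345f29d2d5935cecb49b0a705/tumblr_oztu02dIHp1wgha4yo1_1280.png"
print(to_raw(sized))
```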

By default everything should behave like it did before and only get images from "Photo" posts, but it is now possible to configure gallery-dl to get everything ... hopefully.

(*)

  1. The SSL certificate of data.tumblr.com is only valid for amazonaws.com and is therefore considered invalid, which means raw URLs can't use HTTPS.
  2. Roughly one third of all inline GIFs (and only those) yield a "403 Forbidden" when accessing them via their raw URL. Some work, some don't and I don't know why. Try gallery-dl -o posts=text,chat,link,audio,video -o inline=true --filter "extension == 'gif'" http://setheverman.tumblr.com/ if you want to test this yourself.
  3. Even "raw" videos have been (post)processed by Tumblr and are not the original files that were uploaded.
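For scripted tests, the command line from point 2 can be assembled programmatically. The `-o posts=...`, `-o inline=true` and `--filter` arguments are taken from that command; passing `external` the same way via `-o` is an assumption based on the option name in the commit message, and `build_command` is a hypothetical helper.

```python
import shlex


def build_command(url, posts=None, inline=False, external=False, filter_expr=None):
    """Assemble a gallery-dl argument list for a Tumblr blog."""
    argv = ["gallery-dl"]
    if posts:
        argv += ["-o", "posts=" + ",".join(posts)]
    if inline:
        argv += ["-o", "inline=true"]
    if external:
        argv += ["-o", "external=true"]  # assumed to mirror the inline option
    if filter_expr:
        argv += ["--filter", filter_expr]
    argv.append(url)
    return argv


argv = build_command(
    "http://setheverman.tumblr.com/",
    posts=["text", "chat", "link", "audio", "video"],
    inline=True,
    filter_expr="extension == 'gif'",
)
print(shlex.join(argv))  # shell-quoted form, e.g. for subprocess or a log
```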

> Some users have the habit of using the Link feature, which automatically creates embedded images if used in conjunction with certain sites (I can definitely say Instagram, and I think Flickr as well, probably more)

Tumblr even supports Danbooru and Pixiv, which is really not what I would have expected.

Contributor

Hrxn commented Nov 26, 2017

Hey, great news!
Thanks a lot for this. And don't worry about the time it took, just do it however you feel about doing it, it's perfectly fine. 😄

> It can follow links to external sites (mainly useful for "Link" posts).

Probably best used together with --write-unsupported?

> Image and video URLs are transformed to their "raw" form (*)

Great idea!

> The SSL certificate of data.tumblr.com is only valid for amazonaws.com and is therefore considered invalid, which means raw URLs can't use HTTPS.

Expected, doesn't work in the browser either.

> Roughly one third of all inline GIFs (and only those) yield a "403 Forbidden" when accessing them via their raw URL.

Expected, I think. Probably caused by a set of "standard" GIFs on Tumblr, displayed in the editor interface etc. for quick access as "reaction GIFs", I presume. As far as I know, they still use an older URL address scheme. And I saw this is already fixed with b14de6f, basically.

> Even "raw" videos have been (post)processed by Tumblr and are not the original files that were uploaded.

That is true. But this is usual behaviour, I'd say, not just for Tumblr. And it is still better than what youtube-dl does, for example, which doesn't use these 'raw' URLs and thus returns 720p at best.

Owner

mikf commented Nov 30, 2017

> Probably best used together with --write-unsupported?

For the most part yes, that is what it is useful for, as most external links seem to point to youtube, instagram, vine, etc., but I have also found a user linking to his flickr images, which would then be downloaded using the flickr extractors.

> Probably caused by a set of "standard" GIFs on Tumblr, displayed in the editor interface etc. for quick access as "reaction GIFs", I presume. As far as I know, they still use an older URL address scheme. And I saw this is already fixed with b14de6f, basically.

I don't think this has necessarily something to do with "standard" GIFs, especially when looking at what kind of GIFs are affected by this, but then again I don't use Tumblr myself.

What I've figured out so far is that all GIF URLs end in either _raw.gif or _500.gif when using the data.tumblr.com variant (raw, 500) and I had hoped to find some way of determining which one it is other than sending a HEAD request and looking at the status code, but maybe that would be good enough.
b14de6f circumvents the problem, but it causes GIFs that exceed Tumblr's filesize limit to only consist of 1 frame: normal, raw
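The HEAD-request idea could look roughly like the sketch below. It is not what b14de6f does; `head_status` and `pick_gif_url` are hypothetical helpers, and the `probe` parameter exists only so the selection logic can be tried without touching the network.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 403 for the affected GIFs


def pick_gif_url(raw_url, probe=head_status):
    """Prefer the _raw.gif URL and fall back to _500.gif if it is refused."""
    if probe(raw_url) == 200:
        return raw_url
    if raw_url.endswith("_raw.gif"):
        return raw_url[: -len("_raw.gif")] + "_500.gif"
    return raw_url


# Offline demonstration with a fake probe that refuses every raw URL.
refused = pick_gif_url("http://data.tumblr.com/abc/tumblr_x_raw.gif", probe=lambda url: 403)
print(refused)
```

This costs one extra request per GIF, and the _500 fallback still has the single-frame problem for GIFs above the filesize limit.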

There are even some audio files which have similar problems (403 Forbidden, infinite redirect) and there is nothing that can be done about that.
