Twitter giving frequent rate limits #3557

Open
Twi-Hard opened this issue Jan 22, 2023 · 11 comments

Comments

@Twi-Hard

Ever since the twitter extractor was fixed after it broke, I've been getting frequent rate limits. This doesn't happen with snscrape; I haven't tested other scrapers. I used to run 10+ instances of gallery-dl at a time, very fast and without rate limiting. snscrape still scrapes as fast as usual (probably because it isn't creating a ton of files like gallery-dl does) and doesn't get rate limited. Is there something that can be done to fix this? I've tried two different accounts with username and password, but that didn't fix the issue.
Thanks :)

@ClosedPort22
Contributor

Please do a test run using --verbose --ignore-config and post the log file.

@mikf
Owner

mikf commented Jan 23, 2023

I think this is because of cached guest tokens and Twitter reducing the rate limit for searches to 350 per 15m.

Twitter rate limits are bound to a guest token or account, and gallery-dl reuses the same guest token for up to one hour, even across multiple gallery-dl instances. snscrape on the other hand requests a new token each time it is run.

You can prevent guest token reuse by disabling gallery-dl's cache: -o cache.file=
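If you want that setting to persist, the same thing should be expressible in a config file. A sketch, assuming the top-level `cache.file` option maps onto the usual JSON config layout:

```json
{
    "cache": {
        "file": null
    }
}
```

Setting it to `null` (or an empty path, as the `-o cache.file=` form above does) should disable the cache entirely, so each run requests a fresh guest token.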

@Twi-Hard
Author

Disabling the cache fixes it if I also disable my username and password, but I still get rate limited when I'm logged in. I need to be logged in because a huge amount of the content I'm trying to get is NSFW (I'm not focused on NSFW, but it's still really common for many accounts). Is there anything I can do about this?
I hope the many logins aren't a concern (I tried it with a concurrency of 10 to test it)
[screenshot]

@mikf
Owner

mikf commented Jan 23, 2023

I need to be logged in

Then there is nothing that can be done, I'm afraid, or at least nothing that I'm aware of.

When you are logged in, you have a rate limit separate from any guest tokens, also 350 requests every 15 minutes, and it applies to all requests your account sends.
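For scale, a quick back-of-envelope on that figure (just arithmetic on the 350-requests-per-15-minutes limit quoted above):

```shell
# Minimum average delay per request to stay under
# 350 requests per 15-minute (900-second) window:
awk 'BEGIN { printf "%.2f\n", 900 / 350 }'
# prints 2.57 (seconds per request)
```

So roughly 2.6 seconds of sleep per request is the theoretical floor for a single logged-in account, no matter how many instances share it.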

Sending a guest token together with your login cookies (something gallery-dl currently does not do; it sends either a token when logged out or cookies when logged in) does not help either: Twitter still uses your account's rate limit and ignores the token.

You might be able to use the syndication API while not logged in, if that's an option for you.


snscrape doesn't support login/cookies for Twitter, does it?

@Twi-Hard
Author

The snscrape dev has made it very clear he'll never add support for authentication (source).
The reason I switched to gallery-dl for Twitter was that I was missing too many tweets due to the lack of authentication (and that made me find many other good reasons to use gallery-dl as well).

Perhaps there's a way to search only the age-restricted tweets while logged in, after the rest of the download?

How well would the syndication API work for me? Would I still get every tweet I would get if I were logged in, and is the metadata much different? Metadata is really important to me.

@rautamiekka
Contributor

I've used

        "twitter": {
            "sleep": 0.5,
            "sleep-request": 0.5
        },

together with a dummy account for a few weeks now: nowhere near as much rate limiting since they lowered the request count, whereas

##SFW.
gallery-dl -v 'https://twitter.com/MidPrem' 'https://twitter.com/MidPrem/media'

alone always got rate-limited a couple of times (I think), even with the archive file.

I chose 0.5 completely arbitrarily and it's most likely overkill, but we'll see when I get around to testing and crunching the numbers.

@Twi-Hard
Author

I have way too many accounts to download for a single instance of the downloader to ever get through them; I usually have 10 running at once. Adding a 0.5-second delay wouldn't fix it for me.

@rautamiekka
Contributor

Yeah, your use case is too extreme for simple delays. Only now realized you were the OP, to boot.

@ClosedPort22
Contributor

Would I still get every tweet I would have if I was logged in

Probably. As long as Twitter returns the IDs of age-restricted tweets there would be no difference.

is the metadata much different? Metadata is really important to me.

The only difference I'd noticed was the metadata for users. I implemented the syndication=extended option (#3483) specifically to solve this problem.

The caveat with the syndication API is that it needs to be called once for each age-restricted tweet, so you're probably going to run into rate limits as well.
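For anyone finding this later, a sketch of what enabling that option might look like in the config file (option name taken from #3483; the placement under `extractor.twitter` is assumed from the usual config layout):

```json
{
    "extractor": {
        "twitter": {
            "syndication": "extended"
        }
    }
}
```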

I have way too many accounts to download for only 1 instance of the downloader to ever get through them.

There's always the option of investing in a Raspberry Pi and letting your download jobs run 24/7. I don't have a lot of accounts to download, so I don't care if I have to set a 10 sec delay and let it run for several days.

@KonoVitoDa

I think this is because of cached guest tokens and Twitter reducing the rate limit for searches to 350 per 15m.

Was the rate limit reduced even more? I'm able to download only 50 posts every 15 minutes. I'm using an input file with a bunch of links.

@Kavolc

Kavolc commented Aug 30, 2023

Was the rate limit reduced even more? I'm being able to download only 50 posts each 15 minutes. I'm using an input-file with a bunch of links.

Same here. I tried with both an old and a new account; I can only download/see 50 posts every 15 minutes.
