[Deviantart] Missing tag warnings & errors for literature #6686
As a workaround, you can pass logged-in cookies so that gallery-dl can access the HTML directly.
- fix "KeyError: 'attrs'" for links without 'href'
- support 'strike' text markers
- support 'heading' content blocks
Partially fixed in 6059ffc
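To illustrate the three items in that commit message, here is a minimal sketch of turning tiptap-style JSON into HTML. It is not gallery-dl's actual code: the names `MARK_TAGS`, `render_text`, and `render_block` are hypothetical, and the node layout (`type`, `content`, `marks`, `attrs`) is assumed to follow tiptap's default JSON output.

```python
from html import escape

# Hypothetical mapping of tiptap mark types to HTML tags.
MARK_TAGS = {"bold": "strong", "italic": "em", "underline": "u", "strike": "s"}


def render_text(node):
    """Render a text node, wrapping it in one tag per supported mark."""
    html = escape(node.get("text", ""))
    for mark in node.get("marks") or ():
        mtype = mark.get("type")
        if mtype == "link":
            # A link mark may lack 'attrs' or 'href'; avoid a KeyError.
            href = (mark.get("attrs") or {}).get("href")
            if href:
                html = '<a href="{}">{}</a>'.format(escape(href), html)
        elif mtype in MARK_TAGS:
            html = "<{0}>{1}</{0}>".format(MARK_TAGS[mtype], html)
        # Unknown marks are skipped so the text itself is never dropped.
    return html


def render_block(node):
    """Render a block node and its children recursively."""
    ntype = node.get("type")
    if ntype == "text":
        return render_text(node)
    inner = "".join(render_block(c) for c in node.get("content") or ())
    if ntype == "paragraph":
        return "<p>{}</p>".format(inner)
    if ntype == "heading":
        # Clamp the level to the range HTML supports (h1 to h6).
        level = int((node.get("attrs") or {}).get("level", 1))
        level = min(max(level, 1), 6)
        return "<h{0}>{1}</h{0}>".format(level, inner)
    # Unknown block types keep their children's text instead of losing it.
    return inner
```

Falling back to the children's rendered text for unknown block types is a design choice in this sketch: an unsupported tag then costs some formatting rather than the text inside it.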
Thanks, that seems to do the trick; the output files look great so far. For posterity, I'll leave another message when I find out whether Deviantart logs you out after a few hundred file downloads, since that was a big problem with JDownloader.

Edit: I'll have to keep using cookies until

Edit2: IMO, the best way I know of to scrape DA now is to run concurrent GDL instances. With many parallel terminals and very long sleep times, even with the same API key in my config, my scraping gets choked by persistent 429 responses far less often when running 24/7. I can't tell why this is, but I'm happy enough with it. I can't find any significant warnings or errors in my log files of verbose GDL output, but around half of them were deleted, so I can't be certain.
It seems like gallery-dl is missing some of the tags used by tiptap on Deviantart. Examples in log are NSFW.
Using https://www.deviantart.com/dreameater-at-da/gallery/87166350/mobile-vore-stories produces a warning/error for every post in the folder. I've seen it happen on literature posts, so I assume it's the same for journals. The tags I've found it happen with are `da-gallery`, `heading`, and `strike` (presumably strikethrough) on a previous post I can't find.

Also, the body of the htm file actually downloaded by the first URL https://www.deviantart.com/dreameater-at-da/art/Gaslight-Gatekeep-Girlboss-Mobile-988206435 only contains the first line of text from the source. Of the 52 posts in the folder, only that one and https://www.deviantart.com/dreameater-at-da/art/Sassy-Moody-and-Nasty-Mobile-988410674 produced this result. In the other htm files only the text in the affected tags is missing, instead of the entire document.