[Deviantart] Missing tag warnings & errors for literature #6686

Open
sarma-tyrant opened this issue Dec 18, 2024 · 3 comments

Comments

sarma-tyrant commented Dec 18, 2024

gallery-dl seems to be missing support for some of the tiptap tags used on Deviantart. The example URLs in the logs below are NSFW.

D:\DF\Desktop\tool\gallery-dl>gallery-dl -v https://www.deviantart.com/dreameater-at-da/art/Gaslight-Gatekeep-Girlboss-Mobile-988206435
[gallery-dl][debug] Version 1.28.2-dev
[gallery-dl][debug] Python 3.12.2 - Windows-11-10.0.22621-SP0
[gallery-dl][debug] requests 2.32.3 - urllib3 1.26.19
[gallery-dl][debug] Configuration Files ['%APPDATA%\\gallery-dl\\config.json']
[gallery-dl][debug] Starting DownloadJob for 'https://www.deviantart.com/dreameater-at-da/art/Gaslight-Gatekeep-Girlboss-Mobile-988206435'
[deviantart][debug] Using DeviantartDeviationExtractor for 'https://www.deviantart.com/dreameater-at-da/art/Gaslight-Gatekeep-Girlboss-Mobile-988206435'
[deviantart][debug] Using custom API credentials (client-id ***)
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): www.deviantart.com:443
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/user/profile/dreameater-at-da HTTP/1.1" 200 1388
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /DreamEater-at-DA/art/988206435 HTTP/1.1" 200 None
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/1F58E335-EDB3-406F-3A5F-F6E486E4B33E HTTP/1.1" 200 2174
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/1F58E335-EDB3-406F-3A5F-F6E486E4B33E HTTP/1.1" 200 2185
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/metadata?deviationids%5B0%5D=1F58E335-EDB3-406F-3A5F-F6E486E4B33E&mature_content=true&ext_description=1&ext_tags=1&ext_gallery=1 HTTP/1.1" 200 1659
[deviantart][debug] Using download archive 'C:/Users/***/AppData/Roaming/gallery-dl/archive.sqlite3'
[deviantart][debug] Active postprocessor modules: [MetadataPP, MetadataPP]
[deviantart][debug] 988206435: Failed to extract journal HTML from webpage. Falling back to __INITIAL_STATE__ markup.
[deviantart][warning] Unsupported content type 'heading'
[deviantart][debug]
Traceback (most recent call last):
  File "C:\Users\***\AppData\Roaming\Python\Python312\site-packages\gallery_dl\extractor\deviantart.py", line 409, in _textcontent_to_html
    return self._tiptap_to_html(markup)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\***\AppData\Roaming\Python\Python312\site-packages\gallery_dl\extractor\deviantart.py", line 425, in _tiptap_to_html
    self._tiptap_process_content(html, block)
  File "C:\Users\***\AppData\Roaming\Python\Python312\site-packages\gallery_dl\extractor\deviantart.py", line 446, in _tiptap_process_content
    self._tiptap_process_content(html, block)
  File "C:\Users\***\AppData\Roaming\Python\Python312\site-packages\gallery_dl\extractor\deviantart.py", line 452, in _tiptap_process_content
    self._tiptap_process_text(html, content)
  File "C:\Users\***\AppData\Roaming\Python\Python312\site-packages\gallery_dl\extractor\deviantart.py", line 482, in _tiptap_process_text
    html.append(text.escape(mark["attrs"]["href"]))
                            ~~~~^^^^^^^^^
KeyError: 'attrs'
[deviantart][error] 988206435: 'KeyError: 'attrs''
[deviantart][warning] 988206435: Unsupported 'tiptap' markup.
D:\DF\Desktop\tool\gallery-dl\deviantart☢\2023101505›DreamEater-at-DA›Gaslight, Gatekeep, Girlboss (Mobile)※0988206435+087‹dev☢¿.htm

D:\DF\Desktop\tool\gallery-dl>gallery-dl -v https://www.deviantart.com/dreameater-at-da/art/Feeding-a-Bratty-Daughter-Mobile-1103202192
[gallery-dl][debug] Version 1.28.2-dev
[gallery-dl][debug] Python 3.12.2 - Windows-11-10.0.22621-SP0
[gallery-dl][debug] requests 2.32.3 - urllib3 1.26.19
[gallery-dl][debug] Configuration Files ['%APPDATA%\\gallery-dl\\config.json']
[gallery-dl][debug] Starting DownloadJob for 'https://www.deviantart.com/dreameater-at-da/art/Feeding-a-Bratty-Daughter-Mobile-1103202192'
[deviantart][debug] Using DeviantartDeviationExtractor for 'https://www.deviantart.com/dreameater-at-da/art/Feeding-a-Bratty-Daughter-Mobile-1103202192'
[deviantart][debug] Using custom API credentials (client-id ***)
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): www.deviantart.com:443
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/user/profile/dreameater-at-da HTTP/1.1" 200 1388
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /DreamEater-at-DA/art/1103202192 HTTP/1.1" 200 None
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/B4CD6635-434F-D49F-8D08-04337BA5AA54 HTTP/1.1" 200 985
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/B4CD6635-434F-D49F-8D08-04337BA5AA54 HTTP/1.1" 200 993
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/metadata?deviationids%5B0%5D=B4CD6635-434F-D49F-8D08-04337BA5AA54&mature_content=true&ext_description=1&ext_tags=1&ext_gallery=1 HTTP/1.1" 200 1355
[deviantart][debug] Using download archive 'C:/Users/***/AppData/Roaming/gallery-dl/archive.sqlite3'
[deviantart][debug] Active postprocessor modules: [MetadataPP, MetadataPP]
[deviantart][debug] 1103202192: Failed to extract journal HTML from webpage. Falling back to __INITIAL_STATE__ markup.
[deviantart][warning] Unsupported content type 'heading'
[deviantart][warning] Unsupported content type 'da-gallery'
[deviantart][warning] Unsupported content type 'heading'
[deviantart][warning] Unsupported content type 'heading'
[deviantart][warning] Unsupported content type 'heading'
[deviantart][warning] Unsupported content type 'heading'
[deviantart][warning] Unsupported content type 'heading'
[deviantart][warning] Unsupported content type 'heading'
[deviantart][warning] Unsupported content type 'heading'
D:\DF\Desktop\tool\gallery-dl\deviantart☢\2024092700›DreamEater-at-DA›Feeding a Bratty Daughter (Mobile)※1103202192+091‹dev☢¿.htm

Running gallery-dl on https://www.deviantart.com/dreameater-at-da/gallery/87166350/mobile-vore-stories produces a warning or error for every post in the folder. I've seen this happen on literature posts, and I assume journals are affected the same way. The tags I've seen trigger it are da-gallery, heading, and strike (presumably strikethrough); the last one was on an earlier post I can no longer find.

Also, the body of the htm file downloaded from the first URL, https://www.deviantart.com/dreameater-at-da/art/Gaslight-Gatekeep-Girlboss-Mobile-988206435, contains only the first line of the source text. Of the 52 posts in the folder, only that one and https://www.deviantart.com/dreameater-at-da/art/Sassy-Moody-and-Nasty-Mobile-988410674 produced this result. In the other htm files, only the text inside the affected tags is missing rather than the entire document.
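
For readers following the traceback, here is a minimal Python sketch of how a `KeyError: 'attrs'` can arise and how defensive lookups avoid it. It assumes the failing mark is a tiptap-style `link` mark that carries no `attrs` dictionary; the helper name `link_open_tag` and the sample marks are hypothetical and are not taken from gallery-dl's source.

```python
from html import escape


def link_open_tag(mark):
    """Return an opening <a> tag for a tiptap-style 'link' mark.

    Hypothetical helper: uses .get() so that a mark without "attrs"
    or without "href" yields an empty href instead of raising.
    """
    attrs = mark.get("attrs") or {}
    href = attrs.get("href") or ""
    return '<a href="{}">'.format(escape(href, quote=True))


# Assumed shape of a mark that would trigger the error in the log.
mark_without_attrs = {"type": "link"}

try:
    mark_without_attrs["attrs"]["href"]  # direct indexing, as in the traceback
except KeyError as exc:
    print("direct indexing fails with:", repr(exc))

print(link_open_tag(mark_without_attrs))
print(link_open_tag({"type": "link", "attrs": {"href": "https://example.org/?a=1&b=2"}}))
```

Whether an empty `href` or skipping the link entirely is the better fallback is a design choice; the sketch only shows that the conversion no longer has to abort on the first malformed mark.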

Owner

mikf commented Dec 19, 2024

As a workaround, you can pass logged-in cookies to be able to access the HTML directly:

$ gallery-dl --cookies cookies-deviantart-com.txt https://www.deviantart.com/dreameater-at-da/art/Gaslight-Gatekeep-Girlboss-Mobile-988206435
./deviantart/DreamEater-at-DA/devi…aslight, Gatekeep, Girlboss (Mobile).htm

mikf added a commit that referenced this issue Dec 20, 2024
- fix "KeyError: 'attrs'" for links without 'href'
- support 'strike' text markers
- support 'heading' content blocks
Owner

mikf commented Dec 20, 2024

Partially fixed in 6059ffc

da-gallery blocks aren't supported yet, but the rest of what you mentioned should work.
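
To illustrate what handling `heading` blocks and `strike` marks might involve, the following is a rough, self-contained Python sketch of a tiptap-to-HTML walk. It is not the code from the commit: the function names, the `MARK_TAGS` table, and the document shape (standard tiptap/ProseMirror JSON with `type`, `content`, `marks`, and `attrs` keys) are assumptions made for illustration. Unknown block types such as `da-gallery` are recorded as warnings, and any nested content they contain is still rendered.

```python
from html import escape

# Assumed mapping from tiptap mark names to HTML tags.
MARK_TAGS = {"bold": "strong", "italic": "em", "strike": "s"}


def render_text(node):
    """Render a text node, wrapping it in a tag for each known mark."""
    out = escape(node.get("text", ""))
    for mark in node.get("marks", ()):
        tag = MARK_TAGS.get(mark.get("type"))
        if tag:
            out = "<{0}>{1}</{0}>".format(tag, out)
    return out


def render_block(block, warnings):
    """Recursively render a block; collect unsupported types in `warnings`."""
    kind = block.get("type")
    if kind == "text":
        return render_text(block)
    inner = "".join(render_block(c, warnings) for c in block.get("content", ()))
    if kind == "doc":
        return inner
    if kind == "paragraph":
        return "<p>" + inner + "</p>"
    if kind == "heading":
        level = int((block.get("attrs") or {}).get("level") or 1)
        level = min(max(level, 1), 6)  # clamp to h1..h6
        return "<h{0}>{1}</h{0}>".format(level, inner)
    # Unsupported block (for example 'da-gallery'): warn, keep nested text.
    warnings.append(kind)
    return inner


doc = {"type": "doc", "content": [
    {"type": "heading", "attrs": {"level": 2},
     "content": [{"type": "text", "text": "Title"}]},
    {"type": "paragraph",
     "content": [{"type": "text", "text": "old", "marks": [{"type": "strike"}]}]},
    {"type": "da-gallery", "attrs": {}},
]}
warnings = []
print(render_block(doc, warnings))
print("unsupported:", warnings)
```

Rendering the children of an unknown block instead of dropping them is one way to avoid losing text when a new block type appears.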

Author

sarma-tyrant commented Dec 20, 2024

> As a workaround, you can pass logged-in cookies to be able to access the HTML directly:
>
> $ gallery-dl --cookies cookies-deviantart-com.txt https://www.deviantart.com/dreameater-at-da/art/Gaslight-Gatekeep-Girlboss-Mobile-988206435
> ./deviantart/DreamEater-at-DA/devi…aslight, Gatekeep, Girlboss (Mobile).htm

Thanks, that seems to do the trick; the output files look great so far. For posterity, I'll leave another message once I find out whether Deviantart logs you out after a few hundred file downloads, since that was a big problem with JDownloader.

Edit: I'll have to keep using cookies until da-gallery is supported, but I appreciate the new update.

Edit2:
I've downloaded ~70k files, and DA will definitely log you out of your browser session for scraping, which requires some manual intervention. However, it happens considerably less often than with JDownloader: a few times a day instead of maybe once an hour. Admittedly, GDL is also slower than JD, but I think it all works out favorably. I'm not yet aware of any actual damage to literature files caused by being logged out in the latest version of GDL, but I'm still wary.

IMO, the best way I know of to scrape DA right now is to run several GDL instances concurrently. With many parallel terminals, even with the same API key in my config, and with very long sleep times, my scraping gets choked by persistent 429 responses far less often when running 24/7. I can't tell why this is, but I'm happy enough with it.
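
As a rough sketch of that setup, the Python snippet below starts one gallery-dl process per URL with a bounded number of workers. The helper names, the worker count, and the sleep value are placeholders chosen for illustration, and picking `--sleep-request` as the sleep option is an assumption, since the exact settings used above are not stated.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(url, sleep_request=8.0, cookies=None):
    """Build a gallery-dl command line (values here are placeholders)."""
    cmd = ["gallery-dl", "--sleep-request", str(sleep_request)]
    if cookies:
        cmd += ["--cookies", cookies]
    cmd.append(url)
    return cmd


def run_parallel(urls, max_workers=4, runner=subprocess.run, **options):
    """Run one process per URL, at most `max_workers` at a time.

    `runner` is injectable so the function can be exercised without
    actually launching gallery-dl. Returns exit codes in input order.
    """
    def run_one(url):
        return runner(build_command(url, **options)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, urls))


# Example call (not executed here); the cookies file name is the one
# used earlier in this thread:
# run_parallel(
#     ["https://www.deviantart.com/dreameater-at-da/gallery/87166350/mobile-vore-stories"],
#     max_workers=4, sleep_request=30, cookies="cookies-deviantart-com.txt",
# )
```

Each worker thread only waits on its child process, so threads are sufficient here and no multiprocessing is needed.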

I can't find any significant warnings or errors in my log files of verbose GDL output, but around half of those files were deleted, so I can't be certain.
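
One way to check saved verbose output for such messages is a small Python script that counts warning and error lines. This is only a sketch: it assumes the `[name][level] message` line format seen in the logs earlier in this thread, and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as: [deviantart][warning] Unsupported content type 'heading'
LOG_LINE = re.compile(
    r"^\[(?P<name>[^\]]+)\]\[(?P<level>warning|error)\]\s*(?P<msg>.*)$"
)


def summarize(lines):
    """Count warning/error messages, keyed by (level, message)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[(match["level"], match["msg"])] += 1
    return counts


sample = [
    "[deviantart][warning] Unsupported content type 'heading'",
    "[deviantart][warning] Unsupported content type 'heading'",
    "[deviantart][debug] Using download archive",
    "[deviantart][error] 988206435: 'KeyError: 'attrs''",
]
for (level, msg), count in summarize(sample).most_common():
    print(count, level, msg)
```

To scan a real file, pass an open file object instead of the sample list, for example `summarize(open("gdl.log", encoding="utf-8"))`, where `gdl.log` is a placeholder name.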
