Connectionpool establishes connection to file after information is grabbed and skip is true? #2603

Closed
biggestsonicfan opened this issue May 18, 2022 · 14 comments


biggestsonicfan commented May 18, 2022

Greetings,

So I happened to download a lot from kemonoparty with the following filename setting: {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}. It wasn't until yesterday that I realized certain patreon creators are absolutely insane, putting files with the same filename (e.g. "2.png") into a single post, which resulted in gallery-dl skipping the identical filenames (skip set to true in the extractor config).

In order to avoid redownloading the entire kemono folder of data I already have, I have set up a (pre)postprocessor rule that runs before downloading to check whether a file with the old filename exists and matches the hash returned by gallery-dl:

extractor config:

        "kemonoparty":
        {
            "archive": "/run/media/XXXXX/bfd18/dl/gallery-dl/sql/kemonoparty.sqlite3",
            "sleep": 20.0,
            "sleep-request": 20.0,
            "cookies":
            {
                "__ddgid": XXXXX,
                "__ddgmark":XXXXX,
                "__ddg2": XXXXX,
                "__ddg1": XXXXX
            },
            "metadata": true,
            "new-directory": ["kemonoparty-v2", "{service}", "{user}—{username}"], 
            "directory": ["kemonoparty", "{service}", "{user}—{username}"],
            "old-filename": "{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}",
            "filename": "{id}_{title}_{num:>02}_{filename[:180]}.{extension}",
            "archive-format": "{service}_{user}_{id}_{num}",
            "postprocessors": ["embeds", "deupe"]
        },
    "postprocessor": 
    {
        "embeds": 
        {
            "name": "metadata",
            "event": "post",
            "filename": "{date:%Y-%m-%d}_{id}_{title}.json",
            "mode": "custom",
            "format": "{content}\n{embed[url]:?/\n/}",
            "filter": "embed or 'http://' in content or 'https://' in content",
            "directory": "metadata"
        },
        "deupe": {
            "name": "exec",
            "async": "false",
            "command": [
                "/run/media/XXXXX/bfd18/dl/verify_file.sh",
                "{hash}",
                "{_directory}{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}",
                "{_path}"
                ],
            "event": "prepare"
        }

    },

where verify_file.sh is as follows:

#!/bin/bash

# $1 = {hash} from gallery-dl, $2 = path under the old filename format, $3 = {_path} (the new path)
if ! echo "$1  $2" | sha256sum --check --status --ignore-missing -; then
	echo "Checksum failed. File will download or be skipped if it exists." >&2
	exit 1
else
	echo "Checksum matched! Renaming $2 to $3" >&1
	mv "$2" "$3"
	exit 0
fi

I began executing this new configuration on the artist where I first noticed the post had two different "2.png" files in it (nsfw link). However, while things seemed to be going fine, I encountered this in my output:

/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/19911192—_Asura/56680904_Chel - Road to Eldorado (Patreon Poll winner)_04_ChelPlug.png
Checksum matched! Renaming /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/19911192—_Asura/19911192—56390374—a—2021.09.20—Nude.png to /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/19911192—_Asura/56390374_OC Notorious Ivy commission_01_Nude.png
downloader.http: '403 Forbidden' for 'https://kemono.party/data/68/06/68060d55bdcada9cf2d7678381914ed6d43ac3dd1ded2632ccfb8c62a2d89180.png'
download: Failed to download 56390374_OC Notorious Ivy commission_01_Nude.png

Apparently the (pre)postprocessor runs correctly, moving the file because the checksum matched; however skip:true must not be checked after the (pre)postprocessor and before the download starts, because the (now) identical file exists in its place, yet the data is redownloaded anyway.

What I am doing now to get around this is to use the --no-download flag to just rename the files. I will still have to rerun gallery-dl to download the missed items that I did not already have downloaded.
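
For example (using the problem gallery's user ID from above):

gallery-dl --no-download "https://kemono.party/patreon/user/19911192"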

I am still wondering if I just have a config issue or if this might be an oversight in the order of operations gallery-dl uses.

EDIT: Apparently --no-download is still writing to the archive, so I can't create/use an archive file until I process all the renaming issues...

mikf (Owner) commented May 19, 2022

however skip:true must not be checked after the (pre)postprocessor and before the download starts

Whether to skip a download or not gets checked after event: prepare post processors run:

  1. build path
  2. run "prepare" post processors
  3. check archive
  4. check filesystem

# prepare download
pathfmt.set_filename(kwdict)

if "prepare" in hooks:
    for callback in hooks["prepare"]:
        callback(pathfmt)

if archive and archive.check(kwdict):
    pathfmt.fix_extension()
    self.handle_skip()
    return

if pathfmt.exists():
    if archive:
        archive.add(kwdict)
    self.handle_skip()
    return

Afterwards it proceeds to download, which for whatever reason failed with a 403 Forbidden error in your case. This error is completely unrelated to your post processor, by the way.

biggestsonicfan (Author) commented May 19, 2022

Afterwards it proceeds to download, which for whatever reason failed with a 403 Forbidden error in your case. This error is completely unrelated to your post processor, by the way.

While I understood the error was unrelated to the postprocessor, my initial confusion is why an attempt to access the file's data was made at all when the filesystem check returned true and all information regarding the file had already been provided. After reading your response, I generated (something I should have done earlier) verbose output from gallery-dl:

gallery-dl: Version 1.22.0-dev
gallery-dl: Python 3.8.13 - Linux-5.17.7-1-default-x86_64-with-glibc2.34
gallery-dl: requests 2.27.1 - urllib3 1.25.11
gallery-dl: Starting DownloadJob for 'https://kemono.party/patreon/user/15617066'
kemonoparty: Using KemonopartyUserExtractor for 'https://kemono.party/patreon/user/15617066'
urllib3.connectionpool: Starting new HTTPS connection (1): kemono.party:443
urllib3.connectionpool: https://kemono.party:443 "GET /patreon/user/15617066 HTTP/1.1" 200 6042
kemonoparty: Sleeping for 19.99 seconds
urllib3.connectionpool: https://kemono.party:443 "GET /api/patreon/user/15617066?o=0 HTTP/1.1" 200 4952
kemonoparty: Using download archive '/run/media/XXXXX/bfd18/dl/gallery-dl/sql/kemonoparty.sqlite3'
kemonoparty: Active postprocessor modules: [MetadataPP, ExecPP]
postprocessor.exec: Running '['/run/media/XXXXX/bfd18/dl/verify_file.sh', '1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32', '/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/15617066—54118725—a—2021.07.26—2021_7.zip', '/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip']'
Checksum matched! Renaming /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/15617066—54118725—a—2021.07.26—2021_7.zip to /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip
urllib3.connectionpool: https://kemono.party:443 "GET /data/1a/8f/1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32.zip HTTP/1.1" 302 138
urllib3.connectionpool: Starting new HTTPS connection (1): data8.kemono.party:443
urllib3.connectionpool: https://data8.kemono.party:443 "GET /data/1a/8f/1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32.zip HTTP/1.1" 200 30496560
/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip

Why are the GETs and new connections being established if the file was verified to be skipped given the conditions?

EDIT: I should also add that the color behavior is odd as well. Usually, when files are skipped, the color of the font is not changed. In this case, the colors are turning that seafoam green color as if a download happened.

AlttiRi commented May 19, 2022

Is https://kemono.party/patreon/user/15617066 one of the problem URLs?

It has:

Total: 285 (358)
Attachments: 285 (285)
Files: 73 (73)
Embeds: 0 (0)
Inlines: 0 (0)

With

           "filename": "[{category}] {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
           "archive-format": "{service}_{user}_{type[0]}_{id}_{filename}.{extension}"

gallery-dl downloads 285 files as expected.

285 unique files of 358 total (with duplicates), because gallery-dl skips duplicates based on the hash by default.

In fact, this filename pattern is suited for downloading all 358 files.
Set extractor.kemonoparty.duplicates to true and it will download all 358 files.
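
For example, in the extractor config (a minimal sketch, showing only the relevant key):

           "kemonoparty":
           {
               "duplicates": true
           }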


The same filename within one post can only be shared by the post's file and one of its attachments.
So it's enough to use just the first letter of {type} instead of {num}.

Show me an example, if I'm wrong.

biggestsonicfan (Author) commented May 19, 2022

Is https://kemono.party/patreon/user/15617066 one of problems URLs?

No, it's unrelated; it's just an example of the verbose output from beginning the rename process for that folder. The problem URL was given in my first post. The verbose output was merely to show connections being established to the server, directly to the file, even though gallery-dl had all the information needed to skip those connections, unless those connections are somehow needed even when skip is set to true.

I had verified that with my old configuration, files with identical names were being skipped. Reprocessing the user with the new configuration renamed the files as well as grabbing the one it previously skipped. That is no longer the issue. The issue now is the creation of connections directly to the file even after all checks have passed for gallery-dl to skip it: even when it won't download the file, it still establishes a connection to it.

AlttiRi commented May 20, 2022

Fuck. There are cases where the attachments of one post can have the same name. Thankfully, it happens very rarely.
A post example (NSFW): https://kemono.party/patreon/user/19911192/post/41033600

Downloading this artist's entire gallery will miss 3 unique (for the artist) images.
(The artist has 252 unique images (265 with per-post de-duplication), but currently only 262 are downloaded with this name pattern; and 372 total without de-duplication, of which only 344 are currently downloaded.)


Okay, the most trivial fix is to introduce a new key, unique_filename, or better, filename_num, to be used at the end of {filename}, for example after trimming too-long names: {filename[0:140]}{filename_num:?_//}.

So this is the proper usage:

"filename": "[{category}] {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{title}—{filename[0:140]}{filename_num:?_//}.{extension}",
"archive-format": "{service}_{user}_{type[0]}_{id}_{filename}{filename_num:?_//}.{extension}"

On the next run the missed images will be downloaded.

Well, currently there is no filename_num key.
@mikf could you add it?
I will also edit my previous comments in that case.

Here is a JavaScript "pseudo-code":

// parseUrl() is assumed to extract {filename, extension, hash} from an attachment URL
function* enumerateAttachments(attachments) { // attachments: URLs
    const nameCounts = new Map(); // here
    let num = 0;
    for (const attachment of attachments) {
        const {filename, extension, hash} = parseUrl(attachment);
        num++;

        // and here
        let filename_num = null; // None
        const count = nameCounts.get(filename) || 0;
        if (count > 0) {
            filename_num = count;
        }
        nameCounts.set(filename, count + 1);

        yield {filename, extension, hash, num, filename_num};
    }
}

So, if some filename appears multiple times, the associated files will have filename_num values of 1, 2, and so on. By default it should be None: no suffix for the first file, and _1, _2 for the following files in case of a filename collision.
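
For example (illustrative): a post whose attachments are named 2.png, 2.png, 3.png would get:

2.png -> filename_num = None
2.png -> filename_num = 1
3.png -> filename_num = None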

A more correct key name would be attachment_filename_num: it should count filenames per attachment list, not across the entire post.

UPD:
Also, in my case it's better to use attachment_filename_num after {type[0]} or {title} (instead of after {filename}) to prevent a theoretical filename collision (if a post contains, for example, the attachments 2.png, 2_1.png, 2.png):

"filename": "[{category}] {user}—{id}—{type[0]}{filename_num:?-//}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
"archive-format": "{service}_{user}_{type[0]}{filename_num:?-//}_{id}_{filename}.{extension}"

Not ideal, but it will work fine (if attachment_filename_num gets introduced).

That's all.

biggestsonicfan (Author) commented May 20, 2022

A post example (NSFW): https://kemono.party/patreon/user/19911192/post/41033600

Yup, that's the one in my first post as well, and what led to my discovery. I discovered it completely by accident too lol.

AlttiRi commented May 21, 2022

I have fixed it the same way as it is in my JS example above:

                post["type"] = file["type"]
                post["num"] += 1
                post["_http_headers"] = headers
+               post["_a_dup_name_num"] = file.get("_attachment_duplicate_filename_num", None)

                if url[0] == "/":
                    url = self.root + "/data" + url
and here (and, possibly, the same in the Discord extractor; I never used the Discord scraper):
    def _attachments(self, post):
+       attachments_unique_name_num = dict()
        for attachment in post["attachments"]:
            attachment["type"] = "attachment"
+           filename = attachment["name"]
+           _name_num = attachments_unique_name_num.get(filename, None)
+           attachment["_attachment_duplicate_filename_num"] = _name_num
+           if _name_num is None:
+               attachments_unique_name_num[filename] = 1
+           else:
+               attachments_unique_name_num[filename] = attachments_unique_name_num[filename] + 1
        return post["attachments"]

In config add {_a_dup_name_num:?-//} after {type[0]} for "filename" and "archive-format".

           "filename": "[{category}] {user}—{id}—{type[0]}{_a_dup_name_num:?-//}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
           "archive-format": "{service}_{user}_{type[0]}{_a_dup_name_num:?-//}_{id}_{filename}.{extension}",

With this patch it works 100% correctly.
On the next run you will download the missed files (attachments with duplicate names within one post).

+3 images with "duplicates": false, or +28 with "duplicates": true for https://kemono.party/patreon/user/19911192.

@mikf I'm not the only one who uses {type[0]}, so I hope you will add this fix in some form.

biggestsonicfan changed the title from "[Postprocessor] extractor skip:true check happens before postprocessor event:prepare?" to "Connectionpool establishes connection to file after information is grabbed and skip is true?" on May 21, 2022
biggestsonicfan (Author) commented:

I just attempted to redownload a few images from a kemono gallery where some images were skipped due to time-out, but this time with the archive database skip enabled, and it absolutely zoomed through the already existing images.

Something is clearly wrong when using the filename skip.

biggestsonicfan (Author) commented:

Where are both the filename skip and sqlite skip handled? I'm having trouble locating them.

biggestsonicfan (Author) commented:

Okay, a way to absolutely confirm that something is wrong with the filename skip is to do the following:

  1. Begin downloading an exhentai gallery without using the sqlite database
  2. Cancel the download at some point, and view your "Image Limits" in your e-hentai "My Home" tab
  3. Wait until the Image Limits resets to 0
  4. Start the gallery download again with filename skip enabled
  5. You will see your "Image Limits" usage increase even though you are not downloading any images
  6. Cancel the download
  7. Wait until the Image Limits resets to 0
  8. Create an entry for the sqlite archive in your config
  9. Begin downloading the gallery for a 3rd time
  10. Cancel and see your Image Limits get used again
  11. Wait until the Image Limits resets to 0
  12. Download the gallery for the 4th time
  13. Images that are skipped no longer contribute to Image Limit

This right there shows this is obviously a problem.

biggestsonicfan (Author) commented Sep 1, 2022

I have had to modify both my "deupe" post(pre)processor and the verify_file.sh bash script as follows:

        "deupe": {
            "name": "exec",
            "async": "false",
            "command": [
                "/run/media/xxx/bfd18/dl/verify_file.sh",
                "{hash}",
                "{_directory}{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}",
                "{_path}",
                "{_directory}",
                "{user}—{id}—{type[0]}—{date:%Y.%m.%d}—"
                ],
            "event": "prepare"
#!/bin/bash

# $4 = {_directory}, $5 = old-style filename prefix to search for
readarray -d '' findarray < <(find "$4" -name "$5*" -print0)


if ! echo "$1  $2" | sha256sum --check --status --ignore-missing -; then
	# If it isn't, let's see if an old .part file exists
	if [ -f "$2.part" ]; then
		# If the .part file exists and the new file doesn't already exist, rename the .part file
		if ! [ -f "$3" ]; then
			echo "Partial data '$2.part' exists, moving to '$3.part'!"
			mv "$2.part" "$3.part"
		# But if the hash matches the new data, we don't need the .part anymore
		elif echo "$1  $3" | sha256sum --check --status --ignore-missing -; then
			echo "$3 hash matches new data, discarding old $2.part data"
			rm "$2.part"
		fi
	else
		for i in "${findarray[@]}"
		do
			if echo "$1  $i" | sha256sum --check --status --ignore-missing -; then
				echo "Similarly named file found at $i with matching checksum!"
				if ! [ -f "$3" ]; then
					echo "Renaming old data $i to $3"
					mv "$i" "$3"
				elif echo "$1  $3" | sha256sum --check --status --ignore-missing -; then
					echo "Checksum matched, but $3 also matches checksum. Deleting old $i"
					rm "$i"
				fi
			fi
		done
	fi
	sleep 1
	exit 1
else
	# If the new data doesn't exist and the checksum matches, rename it
	if ! [ -f "$3" ]; then
		echo "Checksum matched! Renaming $2 to $3" >&1
		mv "$2" "$3"
	# If the new data exists and the new data matches the checksum, delete the old data
	elif echo "$1  $3" | sha256sum --check --status --ignore-missing -; then
		echo "Checksum matched, but $3 also matches checksum. Deleting old $2."
		rm "$2"
	fi
	sleep 1
	exit 0
fi

The reason is that kemonoparty has, for reasons unknown to me, changed filenames, so I can no longer trust the filename parameter passed to the bash script to search for a pre-existing file, and I need to search recursively based on other parameters.

I am now running my script through 504 already-downloaded, incorrectly named galleries, and it is matching checksums and correctly renaming many files... however, for larger files, such as .zip files, it's downloading them anyway.

If the prepare post processor really does execute in this order as stated above:

    build path
    run "prepare" post processors
    check archive
    check filesystem

The prepare post processor should have created satisfactory conditions for the "check filesystem" skip, but it just isn't happening...

biggestsonicfan (Author) commented Sep 6, 2022

Okay, this postprocessor isn't working, as it's yeeting files into the abyss. The files "exist", and can be opened directly if you use the path, but ls and file explorers won't report them as being a file in their respective folder. Rebooting into Windows and running chkdsk seems to be the only solution.

Can you run a python script instead of a command line, and can the postprocessor actually finish before gallery-dl resumes? I thought that's what "async": false was for, but it seems random whether it actually waits or not.
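
For what it's worth, the exec postprocessor can launch a Python interpreter just as well as a shell script; a minimal sketch, where verify_file.py is a hypothetical Python port of the shell script above:

        "deupe": {
            "name": "exec",
            "async": false,
            "command": ["python3", "/run/media/xxx/bfd18/dl/verify_file.py", "{hash}", "{_path}"],
            "event": "prepare"
        }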

EDIT: Today I learned that find does not play well with exec.
EDIT2: I've re-written my script from the ground up, but it's still yeeting files into the void, and only chkdsk can make the index see them once again.

biggestsonicfan (Author) commented:

Alright, I rewrote the pre-postprocessor in python and it was still corrupting the index of my NTFS drive. Because I am on Linux, I was using the newer Paragon ntfs3 kernel driver, and that apparently was not happy with the access and index rewriting. I switched back to the ntfs-3g driver and was able to remove around 260 GiB of duplicate files.
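
For reference, a minimal sketch of the driver switch (device and mount point are placeholders):

# unmount the in-kernel ntfs3 mount, then remount with the ntfs-3g FUSE driver
sudo umount /run/media/XXXXX/bfd18
sudo mount -t ntfs-3g /dev/sdX1 /run/media/XXXXX/bfd18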

biggestsonicfan (Author) commented:

Most likely resolved with 43d0c49 per #2842 (comment)
