Connectionpool establishes connection to file after information is grabbed and skip is true? #2603

Closed
biggestsonicfan opened this issue May 18, 2022 · 14 comments


biggestsonicfan commented May 18, 2022

Greetings,

So I happened to download a lot from kemonoparty with the following filename setting: {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}. It wasn't until yesterday that I realized certain patreon creators are absolutely insane, putting files with the same filename (e.g. "2.png") into a single post, which resulted in gallery-dl skipping the identical filenames (skip set to true in the extractor config).

In order to avoid redownloading the entire kemono folder of data I already have, I have set up a (pre)postprocessor rule that runs before downloading to check whether a file with the old filename exists and matches the hash returned by gallery-dl:

extractor config:

        "kemonoparty":
        {
            "archive": "/run/media/XXXXX/bfd18/dl/gallery-dl/sql/kemonoparty.sqlite3",
            "sleep": 20.0,
            "sleep-request": 20.0,
            "cookies":
            {
                "__ddgid": XXXXX,
                "__ddgmark":XXXXX,
                "__ddg2": XXXXX,
                "__ddg1": XXXXX
            },
            "metadata": true,
            "new-directory": ["kemonoparty-v2", "{service}", "{user}—{username}"], 
            "directory": ["kemonoparty", "{service}", "{user}—{username}"],
            "old-filename": "{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}",
            "filename": "{id}_{title}_{num:>02}_{filename[:180]}.{extension}",
            "archive-format": "{service}_{user}_{id}_{num}",
            "postprocessors": ["embeds", "deupe"]
        },
    "postprocessor": 
    {
        "embeds": 
        {
            "name": "metadata",
            "event": "post",
            "filename": "{date:%Y-%m-%d}_{id}_{title}.json",
            "mode": "custom",
            "format": "{content}\n{embed[url]:?/\n/}",
            "filter": "embed or 'http://' in content or 'https://' in content",
            "directory": "metadata"
        },
        "deupe": {
            "name": "exec",
            "async": "false",
            "command": [
                "/run/media/XXXXX/bfd18/dl/verify_file.sh",
                "{hash}",
                "{_directory}{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}",
                "{_path}"
                ],
            "event": "prepare"
        }

    },

where verify_file.sh is as follows:

#!/bin/bash

# $1 = {hash} from gallery-dl, $2 = path under the old filename format, $3 = {_path} (the new path)
if ! echo "$1  $2" | sha256sum --check --status --ignore-missing -; then
	echo "Checksum failed. File will download or be skipped if it exists." >&2
	exit 1
else
	echo "Checksum matched! Renaming $2 to $3" >&1
	mv "$2" "$3"
	exit 0
fi

I began executing this new configuration on the artist where I first noticed the post had two different "2.png" files in it (nsfw link). However, while things seemed to be going fine, I encountered this in my output:

/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/19911192—_Asura/56680904_Chel - Road to Eldorado (Patreon Poll winner)_04_ChelPlug.png
Checksum matched! Renaming /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/19911192—_Asura/19911192—56390374—a—2021.09.20—Nude.png to /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/19911192—_Asura/56390374_OC Notorious Ivy commission_01_Nude.png
downloader.http: '403 Forbidden' for 'https://kemono.party/data/68/06/68060d55bdcada9cf2d7678381914ed6d43ac3dd1ded2632ccfb8c62a2d89180.png'
download: Failed to download 56390374_OC Notorious Ivy commission_01_Nude.png

Apparently the (pre)postprocessor runs correctly, moving the file because the checksum matched; however skip:true must not be checked after the (pre)postprocessor and before the download starts, because the (now) identical file exists in its place, yet the data is redownloaded anyway.

What I am doing now to get around this is to use the --no-download flag to just rename the files. I will still have to rerun gallery-dl to download the missed items that I did not already have downloaded.
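
For example (using the problem gallery's user ID from above):

gallery-dl --no-download "https://kemono.party/patreon/user/19911192"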

I am still wondering if I just have a config issue or if this might be an oversight in the order of operations gallery-dl uses.

EDIT: Apparently --no-download is still writing to the archive, so I can't create/use an archive file until I process all the renaming issues...

mikf (Owner) commented May 19, 2022

however skip:true must not be checked after the (pre)postprocessor and before the download starts

Whether to skip a download or not gets checked after event: prepare post processors run:

  1. build path
  2. run "prepare" post processors
  3. check archive
  4. check filesystem

# prepare download
pathfmt.set_filename(kwdict)

if "prepare" in hooks:
    for callback in hooks["prepare"]:
        callback(pathfmt)

if archive and archive.check(kwdict):
    pathfmt.fix_extension()
    self.handle_skip()
    return

if pathfmt.exists():
    if archive:
        archive.add(kwdict)
    self.handle_skip()
    return

Afterwards it proceeds to download, which for whatever reason failed with a 403 Forbidden error in your case. This error is completely unrelated to your post processor, by the way.

biggestsonicfan (Author) commented May 19, 2022

Afterwards it proceeds to download, which for whatever reason failed with a 403 Forbidden error in your case. This error is completely unrelated to your post processor, by the way.

While I understood the error was unrelated to the postprocessor, my initial confusion is why an attempt to access the file's data was made at all when the filesystem check returned true and all information regarding the file had already been provided. After reading your response, I generated (something I should have done earlier) verbose output from gallery-dl:

gallery-dl: Version 1.22.0-dev
gallery-dl: Python 3.8.13 - Linux-5.17.7-1-default-x86_64-with-glibc2.34
gallery-dl: requests 2.27.1 - urllib3 1.25.11
gallery-dl: Starting DownloadJob for 'https://kemono.party/patreon/user/15617066'
kemonoparty: Using KemonopartyUserExtractor for 'https://kemono.party/patreon/user/15617066'
urllib3.connectionpool: Starting new HTTPS connection (1): kemono.party:443
urllib3.connectionpool: https://kemono.party:443 "GET /patreon/user/15617066 HTTP/1.1" 200 6042
kemonoparty: Sleeping for 19.99 seconds
urllib3.connectionpool: https://kemono.party:443 "GET /api/patreon/user/15617066?o=0 HTTP/1.1" 200 4952
kemonoparty: Using download archive '/run/media/XXXXX/bfd18/dl/gallery-dl/sql/kemonoparty.sqlite3'
kemonoparty: Active postprocessor modules: [MetadataPP, ExecPP]
postprocessor.exec: Running '['/run/media/XXXXX/bfd18/dl/verify_file.sh', '1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32', '/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/15617066—54118725—a—2021.07.26—2021_7.zip', '/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip']'
Checksum matched! Renaming /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/15617066—54118725—a—2021.07.26—2021_7.zip to /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip
urllib3.connectionpool: https://kemono.party:443 "GET /data/1a/8f/1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32.zip HTTP/1.1" 302 138
urllib3.connectionpool: Starting new HTTPS connection (1): data8.kemono.party:443
urllib3.connectionpool: https://data8.kemono.party:443 "GET /data/1a/8f/1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32.zip HTTP/1.1" 200 30496560
/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip

Why are the GETs and new connections being established if the file was verified to be skipped given the conditions?

EDIT: I should also add that the color behavior is odd as well. Usually, when files are skipped, the color of the font is not changed. In this case, the colors are turning that seafoam green color as if a download happened.

AlttiRi commented May 19, 2022

Is https://kemono.party/patreon/user/15617066 one of the problem URLs?

It has:

Total: 285 (358)
Attachments: 285 (285)
Files: 73 (73)
Embeds: 0 (0)
Inlines: 0 (0)

With

           "filename": "[{category}] {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
           "archive-format": "{service}_{user}_{type[0]}_{id}_{filename}.{extension}"

gallery-dl downloads 285 files as expected.

285 unique files of 358 total (with duplicates), because gallery-dl skips duplicates based on the hash by default.

In fact, this filename pattern is suited for downloading all 358 files.
Set extractor.kemonoparty.duplicates to true and it will download all 358 files.
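
For example, in the extractor config (a minimal sketch, showing only the relevant key):

           "kemonoparty":
           {
               "duplicates": true
           }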


The same filename within one post can only be shared by the post's file and one of its attachments.
So it's enough to use just the first letter of {type} instead of {num}.

Show me an example, if I'm wrong.

biggestsonicfan (Author) commented May 19, 2022

Is https://kemono.party/patreon/user/15617066 one of problems URLs?

No, it's unrelated; it's just an example of the verbose output from beginning the rename process for that folder. The problem URL was given in my first post. The verbose output was merely to show connections being established to the server, directly to the file, even though gallery-dl had all the information needed to skip those connections, unless those connections are somehow needed even when skip is set to true.

I had verified that with my old configuration, files with identical names were being skipped. Reprocessing the user with the new configuration renamed the files as well as grabbing the one it previously skipped. That is no longer the issue. The issue now is the creation of connections directly to the file even after all checks have passed for gallery-dl to skip it: even when it won't download the file, it still establishes a connection to it.

AlttiRi commented May 20, 2022

Fuck. There are cases where the attachments of one post can have the same name. Thankfully, it happens very rarely.
A post example (NSFW): https://kemono.party/patreon/user/19911192/post/41033600

Downloading this artist's entire gallery will miss 3 unique (for the artist) images.
(The artist has 252 unique images (265 with per-post de-duplication), but currently only 262 are downloaded with this name pattern; and 372 total without de-duplication, of which only 344 are currently downloaded.)


Okay, the most trivial fix is to introduce a new key, unique_filename, or better, filename_num, to be used at the end of {filename}, for example after trimming too-long names: {filename[0:140]}{filename_num:?_//}.

So this is the proper usage:

"filename": "[{category}] {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{title}—{filename[0:140]}{filename_num:?_//}.{extension}",
"archive-format": "{service}_{user}_{type[0]}_{id}_{filename}{filename_num:?_//}.{extension}"

On the next run the missed images will be downloaded.

Well, currently there is no filename_num key.
@mikf could you add it?
I will also edit my previous comments in that case.

Here is a JavaScript "pseudo-code":

// parseUrl() is assumed to extract {filename, extension, hash} from an attachment URL
function* enumerateAttachments(attachments) { // attachments: URLs
    const nameCounts = new Map(); // here
    let num = 0;
    for (const attachment of attachments) {
        const {filename, extension, hash} = parseUrl(attachment);
        num++;

        // and here
        let filename_num = null; // None
        const count = nameCounts.get(filename) || 0;
        if (count > 0) {
            filename_num = count;
        }
        nameCounts.set(filename, count + 1);

        yield {filename, extension, hash, num, filename_num};
    }
}

So, if some filename appears multiple times, the associated files will have filename_num values of 1, 2, and so on. By default it should be None: no suffix for the first file, and _1, _2 for the following files in case of a filename collision.
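
For example (illustrative): a post whose attachments are named 2.png, 2.png, 3.png would get:

2.png -> filename_num = None
2.png -> filename_num = 1
3.png -> filename_num = None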

A more correct key name would be attachment_filename_num: it should count filenames per attachment list, not across the entire post.

UPD:
Also, in my case it's better to use attachment_filename_num after {type[0]} or {title} (instead of after {filename}) to prevent a theoretical filename collision (if a post contains, for example, the attachments 2.png, 2_1.png, 2.png):

"filename": "[{category}] {user}—{id}—{type[0]}{filename_num:?-//}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
"archive-format": "{service}_{user}_{type[0]}{filename_num:?-//}_{id}_{filename}.{extension}"

Not ideal, but it will work fine (if attachment_filename_num gets introduced).

That's all.

biggestsonicfan (Author) commented May 20, 2022

A post example (NSFW): https://kemono.party/patreon/user/19911192/post/41033600

Yup, that's the one in my first post as well, and what led to my discovery. I discovered it completely by accident too lol.

AlttiRi commented May 21, 2022

I have fixed it the same way as it is in my JS example above:

                post["type"] = file["type"]
                post["num"] += 1
                post["_http_headers"] = headers
+               post["_a_dup_name_num"] = file.get("_attachment_duplicate_filename_num", None)

                if url[0] == "/":
                    url = self.root + "/data" + url
and here (and, possibly, the same in the Discord extractor; I never used the Discord scraper):
    def _attachments(self, post):
+       attachments_unique_name_num = dict()
        for attachment in post["attachments"]:
            attachment["type"] = "attachment"
+           filename = attachment["name"]
+           _name_num = attachments_unique_name_num.get(filename, None)
+           attachment["_attachment_duplicate_filename_num"] = _name_num
+           if _name_num is None:
+               attachments_unique_name_num[filename] = 1
+           else:
+               attachments_unique_name_num[filename] = attachments_unique_name_num[filename] + 1
        return post["attachments"]

In config add {_a_dup_name_num:?-//} after {type[0]} for "filename" and "archive-format".

           "filename": "[{category}] {user}—{id}—{type[0]}{_a_dup_name_num:?-//}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
           "archive-format": "{service}_{user}_{type[0]}{_a_dup_name_num:?-//}_{id}_{filename}.{extension}",

With this patch it works 100% correctly.
On the next run you will download the missed files (attachments with duplicate names within one post).

+3 images with "duplicates": false, or +28 with "duplicates": true for https://kemono.party/patreon/user/19911192.

@mikf I'm not the only one who uses {type[0]}, so I hope you will add this fix in some form.

biggestsonicfan changed the title from "[Postprocessor] extractor skip:true check happens before postprocessor event:prepare?" to "Connectionpool establishes connection to file after information is grabbed and skip is true?" on May 21, 2022
biggestsonicfan (Author) commented:

I just attempted to redownload a few images from a kemono gallery where some images were skipped due to time-out, but this time with the archive database skip enabled, and it absolutely zoomed through the already existing images.

Something is clearly wrong when using the filename skip.

biggestsonicfan (Author) commented:

Where are both the filename skip and sqlite skip handled? I'm having trouble locating them.

biggestsonicfan (Author) commented:

Okay, a way to absolutely confirm that something is wrong with the filename skip is to do the following:

  1. Begin downloading an exhentai gallery without using the sqlite database
  2. Cancel the download at some point, and view your "Image Limits" in your e-hentai "My Home" tab
  3. Wait until the Image Limits resets to 0
  4. Start the gallery download again with filename skip enabled
  5. You will see your "Image Limits" usage increase even though you are not downloading any images
  6. Cancel the download
  7. Wait until the Image Limits resets to 0
  8. Create an entry for the sqlite archive in your config
  9. Begin downloading the gallery for a 3rd time
  10. Cancel and see your Image Limits get used again
  11. Wait until the Image Limits resets to 0
  12. Download the gallery for the 4th time
  13. Images that are skipped no longer contribute to Image Limit

This right there shows this is obviously a problem.

biggestsonicfan (Author) commented Sep 1, 2022

I have had to modify both my "deupe" post(pre)processor and the verify_file.sh bash script as follows:

        "deupe": {
            "name": "exec",
            "async": "false",
            "command": [
                "/run/media/xxx/bfd18/dl/verify_file.sh",
                "{hash}",
                "{_directory}{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}",
                "{_path}",
                "{_directory}",
                "{user}—{id}—{type[0]}—{date:%Y.%m.%d}—"
                ],
            "event": "prepare"
#!/bin/bash

# $4 = {_directory}, $5 = old-style filename prefix to search for
readarray -d '' findarray < <(find "$4" -name "$5*" -print0)


if ! echo "$1  $2" | sha256sum --check --status --ignore-missing -; then
	# If it isn't, let's see if an old .part file exists
	if [ -f "$2.part" ]; then
		# If the .part file exists and the new file doesn't already exist, rename the .part file
		if ! [ -f "$3" ]; then
			echo "Partial data '$2.part' exists, moving to '$3.part'!"
			mv "$2.part" "$3.part"
		# But if the hash matches the new data, we don't need the .part anymore
		elif echo "$1  $3" | sha256sum --check --status --ignore-missing -; then
			echo "$3 hash matches new data, discarding old $2.part data"
			rm "$2.part"
		fi
	else
		for i in "${findarray[@]}"
		do
			if echo "$1  $i" | sha256sum --check --status --ignore-missing -; then
				echo "Similarly named file found at $i with matching checksum!"
				if ! [ -f "$3" ]; then
					echo "Renaming old data $i to $3"
					mv "$i" "$3"
				elif echo "$1  $3" | sha256sum --check --status --ignore-missing -; then
					echo "Checksum matched, but $3 also matches checksum. Deleting old $i"
					rm "$i"
				fi
			fi
		done
	fi
	sleep 1
	exit 1
else
	# If the new data doesn't exist and the checksum matches, rename it
	if ! [ -f "$3" ]; then
		echo "Checksum matched! Renaming $2 to $3" >&1
		mv "$2" "$3"
	# If the new data exists and the new data matches the checksum, delete the old data
	elif echo "$1  $3" | sha256sum --check --status --ignore-missing -; then
		echo "Checksum matched, but $3 also matches checksum. Deleting old $2."
		rm "$2"
	fi
	sleep 1
	exit 0
fi

The reason is that kemonoparty has, for reasons unknown to me, changed filenames, so I can no longer trust the filename parameter passed to the bash script to search for a pre-existing file, and I need to search recursively based on other parameters.

I am now running my script through 504 already-downloaded, incorrectly named galleries, and it is matching checksums and correctly renaming many files... however, for larger files, such as .zip files, it's downloading them anyway.

If the prepare post processor really does execute in this order as stated above:

    build path
    run "prepare" post processors
    check archive
    check filesystem

The prepare post processor should have created satisfactory conditions for the "check filesystem" skip, but it just isn't happening...

biggestsonicfan (Author) commented Sep 6, 2022

Okay, this postprocessor isn't working, as it's yeeting files into the abyss. The files "exist", and can be opened directly if you use the path, but ls and file explorers won't report them as being a file in their respective folder. Rebooting into Windows and running chkdsk seems to be the only solution.

Can you run a python script instead of a command line, and can the postprocessor actually finish before gallery-dl resumes? I thought that's what "async": false was for, but it seems random whether it actually waits or not.
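
For what it's worth, the exec postprocessor can launch a Python interpreter just as well as a shell script; a minimal sketch, where verify_file.py is a hypothetical Python port of the shell script above:

        "deupe": {
            "name": "exec",
            "async": false,
            "command": ["python3", "/run/media/xxx/bfd18/dl/verify_file.py", "{hash}", "{_path}"],
            "event": "prepare"
        }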

EDIT: Today I learned that find does not play well with exec.
EDIT2: I've re-written my script from the ground up, but it's still yeeting files into the void, and only chkdsk can make the index see them once again.

biggestsonicfan (Author) commented:

Alright, I rewrote the pre-postprocessor in python and it was still corrupting the index of my NTFS drive. Because I am on Linux, I was using the newer Paragon ntfs3 kernel driver, and that apparently was not happy with the access and index rewriting. I switched back to the ntfs-3g driver and was able to remove around 260 GiB of duplicate files.
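
For reference, a minimal sketch of the driver switch (device and mount point are placeholders):

# unmount the in-kernel ntfs3 mount, then remount with the ntfs-3g FUSE driver
sudo umount /run/media/XXXXX/bfd18
sudo mount -t ntfs-3g /dev/sdX1 /run/media/XXXXX/bfd18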

biggestsonicfan (Author) commented:

Most likely resolved with 43d0c49 per #2842 (comment)
