Connectionpool establishes connection to file after information is grabbed and skip is true? #2603
Whether to skip a download or not gets checked after
Lines 217 to 233 in 688d655
Afterwards it proceeds to download, which for whatever reason failed with a 403 Forbidden error in your case. This error is completely unrelated to your post processor, by the way. |
While I understood the error was unrelated to the postprocessor, my initial confusion is why an attempt at the file was made when the filesystem check returned
Why are the GETs and new connections being established if the file was verified to be skipped given the conditions? EDIT: I should also add that the color behavior is odd as well. Usually, when files are skipped, the color of the font is not changed. In this case, the colors are turning that seafoam green color as if a download happened. |
Is https://kemono.party/patreon/user/15617066 one of the problem URLs? It has:
With
"filename": "[{category}] {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
"archive-format": "{service}_{user}_{type[0]}_{id}_{filename}.{extension}"
gallery-dl downloads 285 files as expected: 285 unique files out of 358 total (including duplicates), because gallery-dl skips duplicates based on the hash by default. In fact, this filename pattern is suited to downloading all 358 files. Within one post, the same filename can only be shared by the file and one of the attachments. Show me an example if I'm wrong. |
No, it's unrelated and just an example of the verbose output at the start of the rename process for that folder. The problem URL was given in my first post. The verbose output was merely to show connections being established to the server directly to the file even though gallery-dl had all the information needed to skip those connections, unless those connections are somehow needed even when skip is set to true. I had verified that, with my old configuration, files with identical names were being skipped. Reprocessing the user with the new configuration renamed the files and also grabbed the one it previously skipped. That is no longer the issue. The issue now is that gallery-dl still establishes a connection directly to a file even after all checks for skipping that file have passed and it will not download it. |
Fuck. There are cases where the attachments of one post have the same name. Thankfully, it happens very rarely. Downloading this entire artist would miss 3 images that are unique for the artist. Okay, the most trivial fix is to introduce a new key. The proper usage would then be:
"filename": "[{category}] {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{title}—{filename[0:140]}{filename_num:?_//}.{extension}",
"archive-format": "{service}_{user}_{type[0]}_{id}_{filename}{filename_num:?_//}.{extension}"
On the next run the missed images will be downloaded. Well, but currently there are no
Here is a JavaScript "pseudo-code":
const attachments = [...]; // URLs
const nameCounts = new Map(); // here
let num = 0;
for (const attachment of attachments) {
    const {filename, extension, hash} = parseUrl(attachment);
    num++;
    // and here
    let filename_num = null; // None
    const count = nameCounts.get(filename) || 0;
    if (count > 0) {
        filename_num = count;
    }
    nameCounts.set(filename, count + 1);
    yield {filename, extension, hash, num, filename_num};
}
So, if some filename appears multiple times, the associated file will have
The more correct key name is
UPD:
"filename": "[{category}] {user}—{id}—{type[0]}{filename_num:?-//}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
"archive-format": "{service}_{user}_{type[0]}{filename_num:?-//}_{id}_{filename}.{extension}"
Not ideal, but it will work fine. (if
That's all. |
Yup, that's the one in my first post as well and what led to my discovery. I discovered it completely by accident too lol. |
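The numbering scheme from the JavaScript pseudo-code above can be sketched as a small, self-contained Python generator. This is only an illustration, not gallery-dl code: `number_duplicates` and `conditional` are hypothetical names, the input list is made up, and `conditional` is merely a rough stand-in for how the `{filename_num:?_//}` field in the config strings above is expected to behave (prefix added only when the value is set).

```python
import os
from collections import Counter
from typing import Iterable, Iterator, Optional


def number_duplicates(names: Iterable[str]) -> Iterator[dict]:
    """Yield one record per attachment name, numbering repeated filenames.

    The first occurrence of a filename gets filename_num=None, the second
    gets 1, the third gets 2, and so on (same logic as the pseudo-code).
    """
    seen: Counter = Counter()
    for num, name in enumerate(names, start=1):
        filename, ext = os.path.splitext(name)
        count = seen[filename]  # 0 if this filename has not been seen yet
        seen[filename] += 1
        yield {
            "filename": filename,
            "extension": ext.lstrip("."),
            "num": num,
            "filename_num": count if count > 0 else None,
        }


def conditional(value: Optional[int], start: str = "", end: str = "") -> str:
    """Hypothetical helper: emit start + value + end only if value is truthy."""
    return f"{start}{value}{end}" if value else ""


# Made-up attachment names for one post; "2.png" appears three times.
records = list(number_duplicates(["2.png", "2.png", "cover.jpg", "2.png"]))
names = [
    f"{r['filename']}{conditional(r['filename_num'], '_')}.{r['extension']}"
    for r in records
]
print(names)  # ['2.png', '2_1.png', 'cover.jpg', '2_2.png']
```

Because the first occurrence keeps its plain name, files downloaded before the change would not need renaming under this scheme; only the previously skipped duplicates receive a suffix.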
I have fixed it the same way as it is in my JS example above:
      post["type"] = file["type"]
      post["num"] += 1
      post["_http_headers"] = headers
+     post["_a_dup_name_num"] = file.get("_attachment_duplicate_filename_num", None)
      if url[0] == "/":
          url = self.root + "/data" + url

  def _attachments(self, post):
+     attachments_unique_name_num = dict()
      for attachment in post["attachments"]:
          attachment["type"] = "attachment"
+         filename = attachment["name"]
+         _name_num = attachments_unique_name_num.get(filename, None)
+         attachment["_attachment_duplicate_filename_num"] = _name_num
+         if _name_num is None:
+             attachments_unique_name_num[filename] = 1
+         else:
+             attachments_unique_name_num[filename] = attachments_unique_name_num[filename] + 1
      return post["attachments"]
In config add
"filename": "[{category}] {user}—{id}—{type[0]}{_a_dup_name_num:?-//}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
"archive-format": "{service}_{user}_{type[0]}{_a_dup_name_num:?-//}_{id}_{filename}.{extension}",
With this patch it works 100 % correctly. +3 images with @mikf I'm not only one who uses |
I just attempted to redownload a few images from a kemono gallery where some images were skipped due to time-out, but this time with the archive database skip enabled, and it absolutely zoomed through the already existing images. Something is clearly wrong when using the filename skip. |
Where are both the filename skip and sqlite skip handled? I'm having trouble locating them. |
Okay, a way to absolutely confirm that something is wrong with filename skip is to do the following:
This right there shows this is obviously a problem. |
I have had to modify both my "dedupe" post(pre)processor and
The reason is that kemonoparty has, for a reason not known to me, changed filenames, so I can no longer trust the filename parameter passed to bash script to search for a pre-existing file and I need to recursively search based on other parameters. I am now running my script through 504 already downloaded, incorrectly named, galleries, and many are detecting checksums and correctly renaming the file... however for larger files, such as .zip files.... it's downloading them. If the prepare post processor really does execute in this order as stated above:
The prepare post processor should have created satisfactory conditions for the "check filesystem" check for skipping, but it's just not... |
Okay, this postprocessor isn't working, as it's yeeting files into the abyss. The files "exist", and can be opened directly if you use the path, but Can you run a python script instead of command line, and also can the postprocessor actually finish before gallery-dl resumes? I thought that's what EDIT: Time I learned |
Alright, I rewrote the pre-postprocessor in Python, and it was still corrupting the index of my NTFS drive. Because I am on Linux, I was using the newer Paragon ntfs3 driver, which apparently was not happy with the access and index rewriting. I changed the driver back to ntfs-3g and was able to remove around 260 GB of duplicate files. |
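The Python rewrite itself is not shown in this thread. As a rough sketch of the checksum-and-rename idea described in these comments, the following assumes the expected hash is a SHA-256 hex digest and uses hypothetical function names (`sha256_of`, `adopt_existing_file`). It is not the script used here, and it makes no claim about how gallery-dl passes paths or hashes to a postprocessor.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def adopt_existing_file(old_path: str, new_path: str, expected_sha256: str) -> bool:
    """Rename old_path to new_path if its checksum matches expected_sha256.

    Returns True only when a rename happened. Nothing is touched when the
    target already exists, the old file is missing, or the hashes differ.
    """
    if os.path.exists(new_path) or not os.path.isfile(old_path):
        return False
    if sha256_of(old_path) != expected_sha256.lower():
        return False
    os.replace(old_path, new_path)  # same-filesystem rename
    return True
```

A caller would invoke `adopt_existing_file` once per candidate file before the download step; a `False` result simply means the file has to be downloaded as usual.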
Most likely resolved with 43d0c49 per #2842 (comment) |
Greetings,
So I happened to download a lot from kemonoparty with the following filename setting:
{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}
It wasn't until yesterday that I realized certain Patreon creators are absolutely insane, inserting files with the same filename (e.g. "2.png") in a single post, which resulted in gallery-dl skipping the identical filenames (skip set to true in the extractor config). In order to avoid redownloading the entire kemono folder of data I already have, I have set up a preprocessor rule that runs before downloading to check whether the old filename exists and matches the hash returned by gallery-dl:
extractor config:
where verify_file.sh is as follows:
I began executing this new configuration on the artist where I first noticed the post had two different "2.png" files in it (nsfw link). However, while things seemed to be going fine, I encountered this in my output:
Apparently, the (pre)postprocessor runs correctly, moving the file because the checksum matched. However, skip:true must not be getting checked after the (pre)postprocessor and before the download starts, because the (now) identical file exists in its place yet gallery-dl is redownloading the data anyway.
What I am doing now to get around this is to use the --no-download flag to just rename the files. I will still have to rerun gallery-dl to download the missed items that I did not already have downloaded.
I am still wondering if I just have a config issue or if this might be an oversight in the order of operations gallery-dl uses.
EDIT: Apparently --no-download is still writing to the archive, so I can't create/use an archive file until I process all the renaming issues...