Request: Patreon skip duplicate files && implement a similar function than "chapter-range" #590

Closed
CaptainJawZ opened this issue Jan 25, 2020 · 6 comments


@CaptainJawZ

Hello! I know there are workarounds for this, but none of them are perfect.
What I want is to be able to download all files from a Patreon but skip duplicates.

Scenario 1)
A lot of artists upload their pictures to the Patreon gallery post/body and ALSO add them as attachments.

Scenario 2)
Sometimes the body of the post (aka content) contains several images all named 1.png (not sure if this is a global thing or specific to the handful of creators I follow).


The default behaviour seems to be that a file is skipped if one with the same filename already exists. The problem is that in scenario 1 it will download those files as duplicates, while in scenario 2, following the unique-filename behaviour, it will only download the first 1.png and skip the rest of the files even if they are different, so it will not download every unique file.
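To make scenario 2 concrete, here is a minimal sketch of a skip-if-the-name-exists rule. It is not gallery-dl's actual code; `download_all` is a hypothetical helper that only models the behaviour described above.

```python
def download_all(filenames, existing=None):
    """Simulate skip-if-exists: a file is saved only when its name is unused."""
    saved = set(existing or ())
    downloaded, skipped = [], []
    for name in filenames:
        if name in saved:
            # A file with this name is already on disk, so it is skipped,
            # even if its content would have been different.
            skipped.append(name)
        else:
            saved.add(name)
            downloaded.append(name)
    return downloaded, skipped


# Scenario 2: three different images in one post body, all named 1.png
downloaded, skipped = download_all(["1.png", "1.png", "1.png"])
print(downloaded)  # ['1.png']
print(skipped)     # ['1.png', '1.png']
```

Only the first file is kept because the rule looks at names, not at content.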

To fix that, I set up my configuration file to use the compare post processor, which requires skip = false and part = true to work. For compare.action I use the "enumerate" setting.

Now this is perfect, as I get to keep the original filename: it skips duplicated posts, ensures that the user downloads 100% of the files, renames files with conflicting filenames, and skips duplicate files.

The problem is that setting skip to false prevents the archive from working, so the next time I run the program it starts enumerating new duplicates over the already existing files.

I think that if the skip/compare functionality could be merged specifically for the Patreon extractor, it would allow downloading the files without sacrificing the archive functionality. It would be very handy.


Also, I don't know how difficult this would be, but it would be great to have something like chapter-range for Patreon, so that it only downloads starting from the newest posts instead of going through the whole thing. range works, but it's unreliable, as some galleries contain more pictures than others.

Thanks so much!

@mikf
Owner

mikf commented Jan 29, 2020

Assuming you are currently using something like "{filename}.{extension}" as filename format, I'd recommend changing it to "{filename}.{num}.{extension}" or at least something containing {num}.

{num} is a "builtin" enumeration index for files inside a Patreon post. Using it will give you unique filenames per post, allowing the usual skip and archive functionality to work, while having a similar structure as with the "compare" post processor and its "enumerate" action.
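As a rough illustration of why adding {num} helps, the sketch below uses plain Python `str.format` as a stand-in for gallery-dl's format strings. The metadata dictionaries and the assumption that `num` starts at 1 are made up for the example.

```python
# Hypothetical metadata for three different images in one post body,
# all sharing the same original filename.
files = [
    {"filename": "1", "extension": "png", "num": 1},
    {"filename": "1", "extension": "png", "num": 2},
    {"filename": "1", "extension": "png", "num": 3},
]


def build_names(fmt, files):
    """Apply a format string to each file's metadata."""
    return [fmt.format(**f) for f in files]


plain = build_names("{filename}.{extension}", files)
numbered = build_names("{filename}.{num}.{extension}", files)

print(plain)     # ['1.png', '1.png', '1.png']
print(numbered)  # ['1.1.png', '1.2.png', '1.3.png']
```

With unique names per post, a name-based skip check no longer discards distinct files.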

Speaking of, this post processor is supposed to compare already downloaded versions of a file with a potentially new one, and replace or enumerate it in case it changed, maybe because the extractor in question got improved and now provides higher quality images. That's why you can't (or at least shouldn't) skip already downloaded files, otherwise you couldn't compare old with new.

@CaptainJawZ
Author

Oh, this sort of works, but it still downloads duplicate files, which I wish there was a way to avoid. I know that the post processor only works on the first session, but usually that's enough, because the duplicated files tend to happen per post. I know this will keep duplicate files posted at different times in duplicate posts, but it is still a much better alternative than downloading the same file twice per session.

@mikf
Owner

mikf commented Feb 9, 2020

Turns out all download URLs have a hash digest in them:

https://c10.patreonusercontent.com/3/eyJwIjoxfQ%3D%3D/patreon-media/p/post/19987002/bc25dbbe8e8c40b3b8d18e0f40fb7d45/1.png?token-time=1582416000&token-hash=3KNiScP3b_LHc_ltnZuc4Os3lU7jRfFFQWL_mKA4nvc%3D
-> bc25dbbe8e8c40b3b8d18e0f40fb7d45
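A minimal sketch of pulling that digest out of a URL, assuming it always appears as a path segment of exactly 32 lowercase hex characters. `url_digest` is a hypothetical helper, not the function used in gallery-dl.

```python
import re
from urllib.parse import urlsplit

# Assumption: the digest is a path segment of exactly 32 lowercase hex chars.
HEX32 = re.compile(r"[0-9a-f]{32}")


def url_digest(url):
    """Return the first 32-hex-char path segment of a URL, or None."""
    for segment in urlsplit(url).path.split("/"):
        if HEX32.fullmatch(segment):
            return segment
    return None


url = (
    "https://c10.patreonusercontent.com/3/eyJwIjoxfQ%3D%3D/patreon-media"
    "/p/post/19987002/bc25dbbe8e8c40b3b8d18e0f40fb7d45/1.png"
    "?token-time=1582416000"
    "&token-hash=3KNiScP3b_LHc_ltnZuc4Os3lU7jRfFFQWL_mKA4nvc%3D"
)
print(url_digest(url))  # bc25dbbe8e8c40b3b8d18e0f40fb7d45
```

Only the path is inspected, so the token-time and token-hash query parameters do not affect the result.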

109f6c8 uses those to (hopefully) filter and ignore duplicates. It also restructures the way files are extracted by quite a bit, so I would appreciate it if you could test if everything works as it should. (I don't have any patreon subscriptions on my own and am relying on creators that have their stuff available for free)
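The general idea of filtering by digest can be sketched as follows. This is not the code from 109f6c8; the `unique_by` helper, the example.invalid URLs, the short placeholder digests, and the (source, url) tuples are all assumptions made for illustration.

```python
def unique_by(items, key):
    """Yield items whose key has not been seen before, preserving order."""
    seen = set()
    for item in items:
        k = key(item)
        if k in seen:
            continue  # same digest as an earlier file: treat as duplicate
        seen.add(k)
        yield item


# Hypothetical files of one post: the same image appears in the gallery
# and as an attachment, plus one distinct image from the post body.
post_files = [
    ("images", "https://example.invalid/post/1/aaaa/1.png"),
    ("attachments", "https://example.invalid/post/1/aaaa/1.png"),
    ("content", "https://example.invalid/post/1/bbbb/1.png"),
]


def digest(entry):
    # Assumption for this example: the digest is the second-to-last
    # path segment of the URL.
    return entry[1].split("/")[-2]


kept = list(unique_by(post_files, digest))
print([source for source, _ in kept])  # ['images', 'content']
```

Because the comparison uses the digest rather than the filename, the attachment copy is dropped while the distinct body image with the same name is kept.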

@CaptainJawZ
Author

Hello! Thank you so very much for your work on this, I really appreciate it!
What I did was upgrade to master:
$ python3 -m pip install --upgrade https://github.com/mikf/gallery-dl/archive/master.tar.gz

I ran this with 1.13.0-dev.
I ran the tests with

  1. skip: true
  2. skip false, compare replace
  3. skip false, compare enumerate

All the tests ran with the same behaviour as the last stable version, so no changes.

Let me know if I can adjust the config file to try something else! And thanks once again for your time.

mikf added a commit that referenced this issue Feb 12, 2020
@mikf
Owner

mikf commented Feb 12, 2020

> no changes

Hmm, 109f6c8 should at least solve "Scenario 1)", i.e. it shouldn't download duplicates anymore. Are you sure this didn't change? I've added some debug logging messages for duplicate files in b9c574b. Could you try downloading from a post with duplicate files in post/body and attachments while using -v and see what pops up?

And for "Scenario 2)", you should change the filename format string to include either num or the new hash to ensure unique filenames (again, 109f6c8 should automatically skip all duplicate files in a post, regardless of your config settings).

You definitely shouldn't be using the compare post processor for this, or at least not at the moment. It doesn't really help with what you want, I think. Just stay on skip: true.

@CaptainJawZ
Author

I updated to master again with the same command and am running 1.13.0-dev.
I only ran the first test configuration, i.e. skip: true with the post processor removed.
I ran it on two Patreons:

  1. Files are duplicated in the gallery post and the attachments. In this scenario it worked perfectly: it didn't download any dupes, and the number of files downloaded matches what I would have downloaded by hand.
  2. Patreon 2 is one where the images are added in the body of the post, so all the files are named 1.ext. In this scenario, unfortunately, the script still doesn't download all the files and only downloads the first one:
* D:\OneDrive\Pictures\Saved Pictures\To Organize\Patreon...e\(20190508) pg 22 background characters!  (26716450)\1.png
# D:\OneDrive\Pictures\Saved Pictures\To Organize\Patreon...e\(20190508) pg 22 background characters!  (26716450)\1.png
# D:\OneDrive\Pictures\Saved Pictures\To Organize\Patreon...e\(20190508) pg 22 background characters!  (26716450)\1.png

There is another, unrelated problem: I can't seem to get the -v flag to work. It doesn't do anything on the latest master. I tried running the same command on my laptop, which didn't have gallery-dl updated, and it showed the expected verbose output, but after upgrading to master it stopped showing it.
