[bunkr] fixed extractor #4529

Yakabuff · 2023-09-14T01:00:42Z

The cdn link can now be either /v/ or /i/ depending on whether it's a video or image
Location of link has been adjusted
Image URL now needs to be unescaped
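
As a rough illustration of the unescaping step, here is a minimal sketch using the standard library's `html.unescape`. The `raw_src` value and its host are made up for the example, and the call the PR actually uses may differ.

```python
from html import unescape

# Hypothetical value of an <img src="..."> attribute as it might appear in
# the raw page HTML: the "&" is written as the HTML entity "&amp;".
raw_src = "https://cdn.example.org/test-%22&amp;%3E-QjgneIQv.png"

# Decode HTML entities so the URL can be requested as-is.
# Percent-escapes such as %22 are left untouched.
url = unescape(raw_src)
print(url)
```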

Yakabuff · 2023-09-14T02:51:16Z

HeavenlyVice · 2023-09-14T18:54:57Z

I was testing this fix and ran into an issue. It's possible I did something incorrectly, but as I read the code in this PR, it pulls the CDN URL from the download page by taking the string between <source src=" and the next " for videos, and between <img src=" and the next " for images. That yields the full source URL for the album/gallery item. The code then truncates that URL to just the CDN root (e.g., 'https://media-files12.bunkr.la/') and appends the end of the item's own page URL, starting from 'v/' or 'i/'. This works for the test album because the image there is found at https://bunkrr.su/i/test-%E3%83%86%E3%82%B9%E3%83%88-%22&%3E-QjgneIQv.png, whose path is the same as the file name.
However, in my experience this isn't usually the case for files on bunkrr. For example, I just grabbed a random bunkrr album (NSFW content, as most of the public albums seem to be):
https://bunkrr.su/a/aZM5f6WS

Opening the first file goes to the link:
https://bunkrr.su/v/kLG2yrlpg7DSk

However, the src URL for the download is:
https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4

Currently, this code takes 'https://media-files12.bunkr.la/', appends the 'v/kLG2yrlpg7DSk' from the item's page link, and tries to download 'https://media-files12.bunkr.la/v/kLG2yrlpg7DSk', which 404s because it's an invalid link. I tried commenting out lines 106 and 107 and adding 'url = cdn' to use the https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4 link pulled by @Yakabuff's if/else statement directly. That worked, but only for the first item in the gallery. It doesn't iterate through all of the files in the gallery, because the headers variable iterates over the 'v/kLG2yrlpg7DSk'-style URLs instead of the actual download links that are needed. It just attempts to download the https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4 file over and over, 844 times I believe (once for each file in the gallery).

I can look into this later, but I don't have time to dive into the code right now. I figured I'd explain all this here in case @Yakabuff can correct it quickly, or in case I'm missing something in how I tested it and this PR does actually resolve the bunkrr.su issue.

HeavenlyVice · 2023-09-14T20:05:23Z

It seems that removing Line 96 ('if not cdn:') works together with the changes I mentioned. So:

Remove Line 96: if not cdn:
Remove Line 106: cdn = cdn[:cdn.index("/", 8)]
Remove Line 107: url = cdn + url[2:]
Add url = cdn as the new Line 105, after the end of the else statement.
Adjust the indentation of Lines 97-105 accordingly.
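
To make the failure concrete, here is a small sketch that reproduces the URL construction described in this thread. The helper names `cdn_root` and `concatenated_url` are hypothetical and only mirror the described behavior (CDN root plus the item page's path); they are not the PR's actual code.

```python
from urllib.parse import urlsplit


def cdn_root(src_url: str) -> str:
    """Return only the scheme and host of the media URL, with a trailing slash."""
    parts = urlsplit(src_url)
    return f"{parts.scheme}://{parts.netloc}/"


def concatenated_url(src_url: str, page_url: str) -> str:
    """Described behavior: CDN root + path of the item page ('v/...' or 'i/...')."""
    return cdn_root(src_url) + urlsplit(page_url).path.lstrip("/")


# URLs taken from the example above.
src = "https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4"
page = "https://bunkrr.su/v/kLG2yrlpg7DSk"

# The concatenated form is the link reported above as returning 404.
print(concatenated_url(src, page))

# Using the src URL unchanged (the 'url = cdn' suggestion) keeps the real file name.
print(src)
```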

Yakabuff · 2023-09-14T22:25:31Z

@HeavenlyVice Thanks, I will try implementing that and do further testing

Yakabuff · 2023-09-15T00:31:30Z

Yeah, there seem to be two different formats: https://bunkrr.su/v/<id> and https://bunkrr.su/v/<filename>
Maybe it has something to do with their ongoing migration?

This means we will have to either:

Make an additional request for every file to fetch the actual CDN URL, as we can no longer reliably concatenate the URL onto the CDN root. This is much slower and doubles the number of requests, but is more future-proof.
Assume that if the first item's URL is contained in its CDN URL, all subsequent files in the album are in the https://bunkrr.su/v/<filename> format and we can safely use the concatenation method. If not, assume the album is in the https://bunkrr.su/v/<id> format and make a request to fetch the CDN URL for every item in the album.
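
The second option could look roughly like the sketch below. Everything here is an assumption for illustration: `resolve_album`, `page_uses_filename`, and the `fetch_cdn_url` callable (which stands in for "request the item page and extract its src URL") are hypothetical names, not part of gallery-dl.

```python
from posixpath import basename
from typing import Callable, List
from urllib.parse import urlsplit


def page_uses_filename(page_url: str, cdn_url: str) -> bool:
    """True if the item page's last path segment equals the CDN file name."""
    return basename(urlsplit(page_url).path) == basename(urlsplit(cdn_url).path)


def resolve_album(page_urls: List[str],
                  fetch_cdn_url: Callable[[str], str]) -> List[str]:
    """Resolve download URLs, probing only the first item when possible."""
    if not page_urls:
        return []

    # One request is always needed to learn the CDN host.
    first_cdn = fetch_cdn_url(page_urls[0])

    if page_uses_filename(page_urls[0], first_cdn):
        # <filename> format: assume the rest of the album follows it too
        # and build the remaining URLs by concatenation.
        root = first_cdn[:first_cdn.rindex("/") + 1]
        rest = [root + basename(urlsplit(u).path) for u in page_urls[1:]]
    else:
        # <id> format: fall back to one request per item.
        rest = [fetch_cdn_url(u) for u in page_urls[1:]]

    return [first_cdn] + rest
```

The trade-off is the one described above: a single probe request for albums in the filename format, at the cost of a wrong guess if an album ever mixes both formats.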

Yakabuff · 2023-09-15T04:40:09Z

@mikf @HeavenlyVice Seems to work now. I am able to download links from both https://bunkrr.su/v/<id> and https://bunkrr.su/v/<filename> formats. From what it looks like, I think they are transitioning to the https://bunkrr.su/v/<id> format as it is used in new albums.

HeavenlyVice · 2023-09-15T14:49:13Z

Yep, looks good to me. It may run slower than how it was previously working, but this should at least address the issue and get it working again. Then maybe we can figure out a faster way of doing it later. The only other note I'd have is to remove the comment on Line 96 or update it since we're no longer just grabbing the cdn root but the entire download link to clarify for anyone working on future development for the Bunkr extractor.

I'm not sure if it's worthwhile, but it might help to display a progress message in the CLI so users know it's actually working while it first grabs all of the URLs for larger albums. When I was initially testing, I wasn't sure it was working because it just hung until I interrupted it and reran it with the --verbose flag. It doesn't take a massive amount of time, but I made the mistake of testing on a larger album first (844 items, I believe), so it took a few minutes to run through them all. It could just print something along the lines of "Fetching download URLs..." or "Processing album information..." to the CLI. Just a thought.

sixinchfootlong · 2023-09-15T22:02:37Z

The reason it's slower now is that it's having to fetch a separate page for each and every file it's going to download.
There is a faster way but it's going to require redoing how the extractor parses out URLs because it needs information from three separate locations:

The CDN hostname from one of the download pages. This can be cached for the whole album.
Within the gallery, the filename portion of the thumbnail URL. The thumbnail will have a different CDN and the wrong file extension but other than that the filename is correct.
The displayed file name so that we can replace the thumbnail's .png extension.

Example:

CDN hostname: media-files12.bunkr.la
Thumbnail filename: Woods-KbuqDmbn-rftcXF0I2H1v.png (bonus: if the filename contains non-ASCII characters, they're already stripped out of the thumbnail name)
Displayed filename: Woods-KbuqDmbn.mp4

Resulting download URL: https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4

Unfortunately, this sort of parsing isn't well suited for text.extr and would probably be easier with an actual HTML parser.
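
Setting the HTML parsing aside, the combination step from the example above can be sketched as follows. `build_download_url` is a hypothetical helper, and the sketch assumes the three pieces have already been extracted from the pages.

```python
from posixpath import splitext


def build_download_url(cdn_host: str, thumb_name: str, displayed_name: str) -> str:
    """Combine the CDN host, the thumbnail's file stem, and the real extension."""
    stem, _ = splitext(thumb_name)     # drop the thumbnail's .png extension
    _, ext = splitext(displayed_name)  # take the extension of the displayed name
    return f"https://{cdn_host}/{stem}{ext}"


# Values from the example above.
print(build_download_url(
    "media-files12.bunkr.la",
    "Woods-KbuqDmbn-rftcXF0I2H1v.png",
    "Woods-KbuqDmbn.mp4",
))
```

Since the CDN hostname can be cached for the whole album, this approach would need only one download-page request per album instead of one per file.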

bhaskoro-muthohar · 2023-09-16T04:21:51Z

I tried to install it from your branch, but I got

PS E:\> python -m gallery_dl "https://bunkrr.su/a/XJKNZPzj"
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/mvngokitty-sexy-secretary-mpC90sg3-lirGsroJ.mp4'
[download][error] Failed to download mvngokitty-sexy-secretary-mpC90sg3-lirGsroJ.mp4
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/mvngokitty-2-FxbYOFhM-VMuh66dW.mp4'
[download][error] Failed to download mvngokitty-2-FxbYOFhM-VMuh66dW.mp4
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/Mvngokitty-Best-friends-mom-oRVWkYpw-U6Wpexp2.mp4'
[download][error] Failed to download Mvngokitty-Best-friends-mom-oRVWkYpw-U6Wpexp2.mp4
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/Mvngokitty-Red-Lingerie-Masturbation-Oz8vNwoD-hI3thlMw.mp4'
[download][error] Failed to download Mvngokitty-Red-Lingerie-Masturbation-Oz8vNwoD-hI3thlMw.mp4
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/mvngokitty-gym-buddy-creampie-b0QifI19-VEOuHMdp.mp4'
[download][error] Failed to download mvngokitty-gym-buddy-creampie-b0QifI19-VEOuHMdp.mp4
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/MvngoKitty-OnlyFans-2019_09_25_5d8bfa7ea895855e90713-Video-l9sWxSjB-wPbhOAfO.mp4'

---edit---

I tried to download it manually but got DDoS-Guard T_T

sixinchfootlong · 2023-09-16T18:39:30Z

@bhaskoro-muthohar that's not a problem with the code. You need to set your config to a browser User-Agent string or you'll get blocked.

bhaskoro-muthohar · 2023-09-17T04:53:31Z

What is the User-Agent value for bunkr you recommend?

fixed bunkr

03fd1a7

Yakabuff marked this pull request as draft September 14, 2023 02:07

Yakabuff added 4 commits September 13, 2023 22:25

fixed extractor for images and urls with html entities

9073609

fixed formatting

8d22f7e

more formatting

9b35673

more formatting

6134722

Yakabuff marked this pull request as ready for review September 14, 2023 02:50

made extractor support more url formats

f2710f0

mikf added a commit that referenced this pull request Oct 1, 2023

[bunkr] fix extraction (#4514, #4532, #4529, #4540)

b92645c

mikf closed this Oct 1, 2023
