[nebula] Add new extractor #24805
Read coding conventions.
Many more direct field lookups have been turned into optional lookups.
A secondary generator had already been implemented for the video URL. Everything else is now optional. I don't see any obvious opportunity for fallbacks on metadata, but I'll keep my eyes open when implementing future improvements.
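For reference, the optional-lookup helper youtube-dl provides for this is `try_get` from `youtube_dl.utils`; a minimal sketch (the `response` dict here is made up):

```python
from youtube_dl.utils import try_get

# Hypothetical API response, for illustration only.
response = {'video': {'title': 'The World\'s Most Useful Airport'}}

# A direct lookup raises if any level is missing:
#     title = response['video']['title']   # KeyError on absent keys
# The optional lookup quietly yields None instead:
title = try_get(response, lambda x: x['video']['title'], str)        # -> the title
duration = try_get(response, lambda x: x['video']['duration'], int)  # -> None, no crash
```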
There aren't really any capture groups used.
The regex that extracts the script tag has been relaxed: it now matches a script tag without a MIME type and allows other attributes on the element to come before the ID.
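A hedged sketch of what such a relaxed pattern can look like (the id value `state` and the surrounding HTML are assumptions for illustration, not the actual attribute names):

```python
import re

html = '<script data-chunk="main" id="state">{"videos": {}}</script>'

# No type="..." required, and other attributes may precede the id.
m = re.search(
    r'<script[^>]*\bid=(["\'])state\1[^>]*>(?P<json>.*?)</script>',
    html, flags=re.DOTALL)
if m:
    raw_state = m.group('json')  # '{"videos": {}}'
```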
Two more long code lines have been split or factored down.
I don't believe any such cases exist in the code.
I haven't seen an opportunity for those yet.
Fixed one of these (even though I personally think it hurts structure readability). I wasn't quite sure whether this is only relevant for list brackets or also for the curly braces of dict literals.
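For illustration, the pycodestyle rule in question (E123/E124, closing bracket placement) treats list brackets and dict braces the same way, as far as I can tell:

```python
# flagged: the closing bracket matches neither the items
# nor the start of the line that opens the construct
formats = ['mp4',
           'webm',
    ]

# accepted: closing bracket back at the opening line's indentation,
# and the same convention applies to dict literals
formats = [
    'mp4',
    'webm',
]
info = {
    'id': '5ddeb29de338f95ddff5e03e',
    'title': 'The World\'s Most Useful Airport',
}
```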
I have made some more use of …
It doesn't look like Nebula even has a concept of a creator/uploader vs. a channel; they're one and the same. Some creators are channels for themselves (e.g. Mia Mulder, Super Bunnyhop), and sometimes a channel is just a series from a creator (e.g. Tom Scott presents: Money) but doesn't seem to be internally linked to the creator. Sometimes a creator has a channel of their own and a series/special-specific channel (e.g. Wendover and Wendover - The World's Most Useful Airport, Lindsay Ellis and Lindsay Ellis - Tom Hooper's Les Miserables); in these cases the site itself seems to know about the connection (on the Les Miserables video, the uploader link points to Lindsay Ellis' main channel rather than the one-video series). Seems like the code you added to find the first named category solves this, though there may still be edge cases.

Digging into the pages a bit myself, it seems like the state has been removed from the video pages themselves somewhere between the last time I looked (~April 9) and now. There's no JSON information on the video pages, and no XHR call to Nebula's own APIs for video info. I assume it's now parsing the URI for the video slug, and then fetching the video metadata (including the sparse category list) from Zype itself.

For "The World's Most Useful Airport" (one of the edge cases above), I see the following relevant requests:

```
// GET https://api.zype.com/videos?friendly_title=wendover-the-worlds-most-useful-airport&per_page=1&api_key=JlSv9XTImxelHi-eAHUVDy_NUM3uAtEogEpEdFoWHEOl9SKf5gl9pCHB1AYbY3QF
{
"response": [
{
"_id": "5ddeb29de338f95ddff5e03e",
"active": true,
"categories": [
{
"_id": "5ddeb31849594b5de68a2105",
"category_id": "5ced8c581c026e5ce6aaf047",
"title": "Animation",
"value": []
},
{
"_id": "5ddeb31849594b5de68a2106",
"category_id": "5c186aaa649f0f1342004248",
"title": "Explainers",
"value": [
"Wendover"
]
},
// a couple category entries snipped
{
"_id": "5ddeb31849594b5de68a210b",
"category_id": "5c624f363195970dec1057bb",
"title": "Originals",
"value": [
"Wendover — The World's Most Useful Airport"
]
},
// there's a bunch more after this too
],
"created_at": "2019-11-27T12:30:05.781-05:00",
"description": "snip", // (this is the long-form description, and seems to be in Markdown format. does ytdl have a parser for that?)
"duration": 2723,
"friendly_title": "wendover-the-worlds-most-useful-airport",
"keywords": [
"st helena",
// bunch more, snipped
],
"marketplace_ids": {},
"ott_description": "snip", // Seems to be equal to the "description" field above, may be differences? (what's "ott" short for?)
"published_at": "2019-11-27T13:20:00.000-05:00",
"site_id": "5c182d06649f0f134a001703",
"status": "created",
"subscription_ads_enabled": true, // (???)
"title": "The World's Most Useful Airport",
"updated_at": "2020-04-10T13:43:45.065-04:00",
"thumbnails": [
// snip
],
"short_description": "After 500 years of isolation from the world, in 2017, St Helena got an airport. This changed everything, but was that enough to save the forgotten island?",
"subscription_required": true
// I've removed a bunch of seemingly-useless properties for post size reasons, a lot were null/false/empty
}
]
}
```

I also see a call to … The API key in those requests is constant and hardcoded in source files, so I'm not leaking any personal tokens here. It's in the JS chunks.

This also means we can't grab the video stream URL from the iframe itself, since that's no longer there. The source format is …

This is fetched via:

```
// GET https://api.watchnebula.com/api/v1/auth/user/
// Authorization: Token [nebula_token] (from the nebula_auth cookie's json data)
{
"account_type": "curiositystream",
"is_subscribed": true,
"has_curiositystream_subscription": true,
"zype_auth_info": {
"access_token": "[snip]",
"expires_at": 1588594215,
"refresh_token": "[snip 2]",
"zype_created_at": 1587989415
}
// more personal info fields (user ID, email) snipped
// although info about subscription data and type may be useful for eg. error messages?
}
```

Hope some of this helps, that way you don't need to do too much digging on your own :) And hey, maybe they just accidentally turned off server-side rendering and the info will return to the main video page soon. But for now, this should be foolproof(-er)?
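To make that flow concrete, here is a minimal stdlib sketch of the two requests described above (the api_key value is the static one from the JS chunks; NEBULA_TOKEN stands in for the token from the nebula_auth cookie):

```python
import json
import urllib.parse
import urllib.request

ZYPE_API_KEY = 'JlSv9XTImxelHi-eAHUVDy_NUM3uAtEogEpEdFoWHEOl9SKf5gl9pCHB1AYbY3QF'
NEBULA_TOKEN = '[token from the nebula_auth cookie]'

# 1. Resolve the slug from the video URL into Zype metadata.
slug = 'wendover-the-worlds-most-useful-airport'
query = urllib.parse.urlencode(
    {'friendly_title': slug, 'per_page': 1, 'api_key': ZYPE_API_KEY})
with urllib.request.urlopen('https://api.zype.com/videos?' + query) as f:
    video = json.load(f)['response'][0]

# 2. Exchange the Nebula token for the Zype auth info.
req = urllib.request.Request(
    'https://api.watchnebula.com/api/v1/auth/user/',
    headers={'Authorization': 'Token ' + NEBULA_TOKEN})
with urllib.request.urlopen(req) as f:
    zype_auth = json.load(f)['zype_auth_info']

print(video['title'], zype_auth['expires_at'])
```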
@xSke Wow, haha, this is so frustrating! I cannot believe they changed their entire frontend just days after I finished the extractor. (I last updated the PR on Apr 18, so at that point they were definitely still using the old one.) Thank you for your analysis! I did quickly walk through the new page and could confirm all your findings. (Including having an identical — apparently not-personalized — Zype API key, and it only being found in the JS chunks.) I'll see when I'll get around to updating (rewriting!?) my PR. Until then, I'll switch it to draft mode.
I have some spare time - let me know if you'd like me to take this over? Don't want to step on anyone's toes :)
@xSke I appreciate your concern and request. And you're right, I feel somewhat committed to this extractor now and would actually like to take a shot at it myself. That said, I'd gladly take contributions (failing test cases, cleanups, especially anything that will make the PR more likely to be accepted) and code reviews, once I have pushed my initial prototype. Deal?
@xSke In the meantime, do you know anything about youtube-dl best practices around sites with no public videos? See the … Can someone point me to another extractor which deals (well) with videos behind authentication, so I can figure out the best practices? How do I actually accept authentication info from the user in 'production runs' (outside test cases)? Should I support …?
@xSke Here's the new implementation. Code review and feedback is very welcome! The only bit that's missing is the extraction of the Zype API key from the JS chunk. Once that is in, I'll un-draft this PR.
Here's the Zype API key extraction. For now, I have chosen not to include the known — and apparently static and global — API key as a fallback (or even a default) in the code, mostly because I'm not sure what this project's policies are on including static keys like this. That said, there is e.g. the …
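For context, the extraction amounts to downloading the right JS chunk and pattern-matching the key out of it; a rough sketch (both regexes here are illustrative guesses, not the ones from the commit):

```python
import re
import urllib.parse
import urllib.request

BASE = 'https://watchnebula.com/'

with urllib.request.urlopen(BASE) as f:
    page = f.read().decode('utf-8')

# Locate a likely JS chunk on the front page (pattern is a guess).
chunk_src = re.search(r'<script[^>]+src="([^"]+\.js)"', page).group(1)
with urllib.request.urlopen(urllib.parse.urljoin(BASE, chunk_src)) as f:
    chunk = f.read().decode('utf-8')

# The Zype key is a long token sitting next to an api-key-ish name
# (again, the exact shape is a guess).
api_key = re.search(
    r'api[_-]?key["\']?\s*[:=]\s*["\']([\w-]{32,})["\']', chunk,
    flags=re.IGNORECASE).group(1)
```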
Let me know if you want me to squash my commits. |
Merged in upstream master and verified as working on 2020-06-02. |
This seems to work as of 2020-07-18. But I wonder if the error text could be tweaked to be a little more explanatory.
I needed to check the source code of the extractor to figure out that a) (and c)) did not want the actual value of the nebula_auth cookie. Although I have no immediate suggestions as to what a better copy would be 😅 Just my 2¢!
No, you're absolutely right. I went back and forth a few times on what precisely should be passed in (passing in the cookie is an easier copy-paste job, but passing in the token is cleaner) and I think the error message lagged behind in one of those changes. I'll look into improving it! That said, is this generally the right approach to support subscription-only platforms on youtube-dl? Should I offer fewer methods of authentication? Or more; is there an essential one missing? I don't think I've seen other extractors support the environment variable, but I couldn't figure out another way to run the tests.
I found your extractor and tried it out and was unpleasantly surprised with that error message and copying tokens from browser cookies, how user-unfriendly! :)
Most of them, I think :)

```
grep _NETRC_MACHINE *.py
```
The way you do this is just what all those extractors do:

```python
username, password = self._get_login_info()
```

This magic function automatically pulls user+password from .netrc or -u/-p on the command line. Super easy, convenient for you and convenient for me as a user (use -u/-p for a one-time download, or just put email+password in .netrc and forget about it forever while I'm happily opening video links with mpv). Of course then you have to implement logging in with email+password in your extractor, but I went through this in my Floatplane extractor - even logging in each time you watch a video (wasting a second) is better than copying cookies every week when the token changes, so in the end, I figured I'm too lazy to NOT add that in ;)
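Sketched out inside an extractor, the pattern looks roughly like this (the login endpoint and response shape are assumptions for illustration, not Nebula's confirmed API):

```python
import json

from .common import InfoExtractor
from ..utils import ExtractorError


class NebulaIE(InfoExtractor):
    _NETRC_MACHINE = 'watchnebula'

    def _real_initialize(self):
        # Pulls the credentials from -u/-p on the command line or from
        # the matching "machine" entry in ~/.netrc.
        username, password = self._get_login_info()
        if username is None or password is None:
            raise ExtractorError(
                'Nebula requires an account with an active subscription.',
                expected=True)
        # Hypothetical login endpoint and payload:
        response = self._download_json(
            'https://api.watchnebula.com/api/v1/auth/login/', None,
            note='Logging in',
            data=json.dumps({'email': username, 'password': password}).encode(),
            headers={'content-type': 'application/json'})
        self._token = response.get('key')
```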
@Lamieur Thanks, that's exactly the kind of best-practice advice I was looking for! It's not mentioned in the Developer Instructions nor in the extractor base class docstring; we should probably reference it in both. I'll implement that, though it will probably take me a couple of days to get around to it.
@Lamieur Thanks again for your tip! I've implemented the user/pass login as a proof of concept. Simple and works well. I just need to do some cleanup and error handling. What's the correct way to supply these credentials during unit tests?
The credentials-based authentication has now been fully implemented! I would appreciate it if you could give the extractor another try and let me know if it now behaves the way you would have expected it to. Cheers!
One small issue :) It worked for one video, but after that, my mpv started saying: … youtube-dl -v said: …

I found out that the problem is my use of "--cookies /home/lam/.config/youtube-dl/cookies.txt" in my .config/youtube-dl/config. In that cookie jar, I now have two related cookies: …

As a bad temporary work-around, I've added … I mean, ideally it could probably remember being logged in, and re-run the login process only when necessary, but a fresh login every run is at least functional :) So many other extractors do this every time and don't bother with reusing old session cookies, so even if this feels super bad, it's not so out of the ordinary in this context, right? :)

Oh, and can I be super nitpicky? :) Using "machine watchnebula" in the .netrc is maybe not so trivial - the service's name is Nebula, the extractor is nebula.py, I would just call it "nebula" :) Otherwise a great improvement over command-line trickery with environment variables! I'll go watch something now!
I reforked the repo and rebased my changes: https://github.com/hheimbuerger/youtube-dl-post-dmca-refork/tree/add-nebula-support |
I could not reproduce this based on your description, and I'm a bit confused. Could you maybe give me reproduction steps to follow? My suspicion is that you dropped the …
I'm new to …
That's kind of what is already happening. First login works, a cookie is stored by the site. Then the cookie is loaded on youtube-dl's startup with all the other cookies, and sent with requests to the site. The only issue is - the extractor tries to log in again, and Nebula has the strangest reaction - it spews a 403 :) But I'm still using that work-around (cleaning up the sessionid cookie) and it still works 100%.
For youtube-dl's use, the "machine" is just a label. It doesn't matter what it is, but to make things simple, for YouTube it uses "machine youtube", for CuriosityStream it uses "machine curiositystream". Note: those are not only not actual machine names, but not even domains :) For me it would be simplest for Nebula to use "machine nebula", which is also what the extractor is called, which I think is more intuitive than the current "watchnebula". It doesn't (technically) matter, I found it in a second. I just thought making it more intuitive may one day save some novice some time googling :) |
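For reference, the corresponding line in ~/.netrc would look like this (with the label the extractor currently uses; per the suggestion above it would read nebula instead, and the credentials here are placeholders):

```
machine watchnebula login you@example.com password hunter2
```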
It's not guaranteed to be faster than a fresh login every time though, so again: who cares :) As long as it works, right? :) |
@Lamieur Ah yes, thanks, I can reproduce this now. For some reason I was only ever testing with one switch or the other, never both; my bad. I definitely agree that it's a fair expectation for this to work, if not by using the cookie jar as a form of 'authentication cache', then at least by successfully re-authenticating on subsequent runs. I'll look into this and fix it.
Side-note on some weird behavior I just noticed. (I don't know what to make of this yet, so for now I'm just documenting findings here.)

More than once, I've noticed that when I picked up work on this implementation after some days or weeks had passed (during which I didn't visit Nebula in my browser), my unit tests would suddenly fail. I then always assumed that something had changed, breaking my implementation. So I logged in to the site with my browser to investigate, after which my unit tests suddenly started passing again.

I think I've now finally witnessed what's going on: You cannot see it on the screenshot, but on the first request to … My interpretation is that this Zype authentication token expires after a while. The frontend knows this and apparently tries to fetch it directly then (…).

This seems to be 'account-global', meaning that even though the Nebula extractor currently doesn't implement any of this behavior, loading a random video in the browser will fix this problem for the Nebula extractor as well.
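If the extractor ever caches the Zype auth info, the expires_at Unix timestamp from the auth/user response above would make a staleness check trivial; a sketch, assuming the token can simply be re-fetched when stale:

```python
import time

def zype_token_is_fresh(zype_auth_info, margin=60):
    # expires_at is a Unix timestamp; keep a safety margin so we don't
    # hand Zype a token that dies mid-request.
    return time.time() < zype_auth_info['expires_at'] - margin

cached = {'access_token': '[snip]', 'expires_at': 1588594215}
if not zype_token_is_fresh(cached):
    pass  # re-fetch auth/user (or hit a refresh endpoint) before calling Zype
```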
@Lamieur Regarding the issue during authentication when combining …: I can corroborate your analysis that, apparently, the login endpoint of Nebula responds with a 403 when you supply a session ID cookie. I don't quite understand why it behaves that way, but at least it's reproducible. I considered implementing some kind of only-login-if-needed approach because it would be faster, but I ultimately decided against it because: …
So all in all, it doesn't really appear to be worth it and instead, I'm now simply not sending cookies along when authenticating. It's a bit ugly, but effective. (I would have preferred to deactivate cookie jar support on a specific HTTP request, but looking through youtube-dl's library code and documentation, I couldn't find a way to do so.)
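One way to approximate "no cookie jar for this request" is to drop the stale session cookie just before logging in; a sketch, assuming the cookie name and domain from the report above:

```python
import http.cookiejar

def drop_stale_session(jar):
    # CookieJar.clear() raises KeyError when no matching cookie exists.
    try:
        jar.clear('.watchnebula.com', '/', 'sessionid')
    except KeyError:
        pass

# inside an extractor, the jar would be self._downloader.cookiejar
drop_stale_session(http.cookiejar.CookieJar())
```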
That was the last major issue I'm aware of.

GitHub Support has since gotten back to me and let me know that there isn't a way for them to transfer this PR to the new fork. (Reminder: as part of the temporary DMCA takedown of the youtube-dl repository, I had to refork and rebase my changes, see above.) Therefore, I will now close this PR and open a new one. Please provide further feedback on that new PR.

This PR has been closed in favor of #27859.
Description of your pull request and other information
This is a new extractor for the video platform Nebula (https://watchnebula.com/), created by the streamer community Standard.
Nebula uses the Zype video infrastructure, and this extractor uses `url_transparent` mode to hand off video extraction to the Zype extractor (which has proven to be super reliable here).

Development discussion has occurred on #21258. I didn't open the issue, but I hijacked it for my implementation.
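For readers unfamiliar with the mechanism: a `url_transparent` result lets this extractor contribute its own metadata while delegating the actual stream extraction to ZypeIE, roughly like this (the embed URL shape and field values are illustrative, not the actual implementation):

```python
from .common import InfoExtractor


class NebulaIE(InfoExtractor):
    _VALID_URL = r'https?://(?:www\.)?watchnebula\.com/videos/(?P<id>[-\w]+)'

    def _real_extract(self, url):
        slug = self._match_id(url)
        # ... resolve the slug to a Zype video id + metadata, as above ...
        zype_id = '5ddeb29de338f95ddff5e03e'  # example id from this thread
        return {
            '_type': 'url_transparent',
            'ie_key': 'Zype',
            # hypothetical embed URL shape, for illustration only:
            'url': 'https://player.zype.com/embed/%s.html' % zype_id,
            'id': zype_id,
            'title': 'overrides whatever ZypeIE extracts',
        }
```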
All videos require a subscription to watch. There are no known freely available videos. The extractor contains three test cases, but they only pass when an auth cookie to a user account with a (paid) subscription is present.
I have aimed to provide comprehensive documentation of approach and behavior in class and method docstrings.
There are a few technical open issues -- all of which have been marked with `FIXME` -- that I'm hoping to receive feedback on during the review process.