-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[extractor/sbs] Overhaul extractor for new APIs #31880
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks heaps for this PR! I will test it tomorrow. Anything specific for me to test?
youtube_dl/extractor/sbs.py
Outdated
'uploader': 'SBSC', | ||
} | ||
|
||
def _old_real_extract(self, url): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _old_real_extract(self, url): | |
def _old_real_extract(self, url): //disabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll be removed before merging.
youtube_dl/extractor/sbs.py
Outdated
'title': traverse_media(('displayTitles', Ellipsis, 'title'), | ||
get_all=False) or media['title'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these improvements to the metadata retrieval. One small issue; this change to title setting loses the episode title for tv episodes and only sets the series name; the old version used 'name' or similar, which could be parsed with title:(?P<series>.+?(?= S[0-9]+)) S(?P<season_number>[0-9]+) Ep(?P<episode_number>[0-9]+)(?P<episode>.*)
. As 'episode' not set directly, it gets the default Episode <episode number>
instead.
The easy fix is to go back to using media['name'] to set the title, since the Series Title is set correctly anyway, but it's more complicated to pick-up the episode title as that appears as 'subtitle' within the 'displayTitles' node. But for movies, SBS uses subtitle field for genre (e.g. "Horror") so that would be incorrect as a general solution, unless the episode setting can be guarded by e.g. externalRelations.sbsondemand.entity_type == 'EPISODE'
, or by testing displayTitles.components.SeasonEpisodeFull == ''
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extensive test suite (2 shows!) that I exercised included 1 show that was an episode, which had no specific episode name (Episode 6
, or some such).
If the title
or episode
should be set to something like {series} S{season_number}E{episode_number}
with any non-trivial episode
value appended, that's fine. Show some examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, that explains it. My suggestions:
For title
, use media['name']
, because:
- this is backwards compatible with the prior behaviour;
- since
title
is used for the default output template, just setting to the series name isn't particularly helpful when dealing with multiple episodes of the same series.
If this is done, then title
will take the form of {series} S{season_number} Ep{episode number} - {episode}
if the episode is named, or {series} S{season_number} Ep{episode number}
, if not. I've tested this change successfully, i.e. 'title': media['name'],
. Since SBS build this string already, there's no assembly required for this. For non-episodic programs it also works fine (e.g. for a film, it's just the title of the film).
For episode
, assuming this doesn't hold up landing this PR too much (since it's pretty urgent fix for breakage), use player_data['name']
, (i.e. the name
element of the VideoObject
), iff partOfSeries
node exists; sorry, not sure how this would be tested for so I haven't got proposed code for this. I think this is a reasonable way to determine whether the video is part of a series, and could therefore have an episode specific title. I wouldn't test for partOfSeason
as some episodic programs have none (e.g. mini-series, or current affairs). I've suggested this element because unlike the videoDescriptions, it's blank when it should be (for episodes), per the examples below.
Examples:
Video id 2170811459503
: title
should be: From Dusk Till Dawn
and episode
empty.
Video id 2158330947797
: title
should be: Blackport S1 Ep8 - Iceland
and episode
should be Iceland
.
Video id 2165730371885
: title
should be: Devils S1 Ep1
and episode
empty.
For this last example, there may be a preference to set episode
to Episode 1
rather than empty, as I think this is again more consistent with previous default behaviour. However, this is redundant given that the episode number
is now being properly set, where provided, so I'd argue the episode
field should be reserved for actual episode titles and not a fabricated generic string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In re-jigging this PR for yt-dlp, discovered that the episode title is in the metadata from the catalogue API as title, and this has been added in the yt-dlp fix to the media
object for later return as epName
. Notwithstanding the traverse_obj
differences, I expect this could be adapted for this PR:
# For named episodes, use the catalogue's title to set episode, rather than generic 'Episode N'.
if traverse_obj(media, ('partOfSeries', {dict})):
media['epName'] = traverse_obj(media, ('title', {str}))`
Just tried out the #31880 for new SBS platform and it worked fine. Bit of a learning experience for me as first time with git, but good docs helped! Thanks! |
youtube_dl/extractor/sbs.py
Outdated
def _get_player_data(self, video_id, headers=None, fatal=False): | ||
return self._download_json(update_url_query( | ||
self._PLAYER_API + 'video_stream', {'id': video_id, 'context': 'tv'}), | ||
video_id, headers=headers, fatal=fatal) or {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe headers
should default to an empty dict here and in _call_api
. Setting headers
to None results in options like --geo-bypass-country
breaking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may have a point, ATM. In extractor/common.py
ll.619ff.
--- git master
+++ WIP code pulling stuff from yt-dlp
if self._x_forwarded_for_ip:
- if 'X-Forwarded-For' not in headers:
- headers['X-Forwarded-For'] = self._x_forwarded_for_ip
+ headers = (headers or {}).copy()
+ headers.setdefault('X-Forwarded-For', self._x_forwarded_for_ip)
The new style is to avoid accidentally sharing mutable default values by passing None
instead of {}
. So we should for now have headers or {}
. As _download_webpage_handle()
is being specialised and is on the call path for all the extractor web requests, that would be a good place to tweak it.
youtube_dl/extractor/sbs.py
Outdated
else: | ||
kwargs['expected_status'] = exp | ||
kwargs = compat_kwargs(kwargs) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Address #31880 (comment) (not required for yt-dlp or future yt-dl)
# for now, ensure that headers passed is not None | |
if len(args) > 5: | |
if args[5] is None: | |
# replace with empty dict | |
args = list(args) | |
args[5] = {} | |
elif kwargs.get('headers', False) is None: | |
# enforce callee default | |
del kwargs['headers'] | |
Works great. Just copied and pasted the latest version from the pull request and it works. Thanks! |
youtube_dl/extractor/sbs.py
Outdated
https?://(?:www\.)?sbs\.com\.au/(?: | ||
ondemand(?: | ||
/video/(?:single/)?| | ||
/movie/[^/]+/| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjustment needed to match single video programs (e.g. documentaries such as https://www.sbs.com.au/ondemand/tv-program/autun-romes-forgotten-sister/2116212803602
),
like so:
/movie/[^/]+/| | |
/(?:tv-program|movie)/[^/]+/| |
youtube_dl/extractor/sbs.py
Outdated
'title': traverse_media(('displayTitles', Ellipsis, 'title'), | ||
get_all=False) or media['title'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on discussion above, update for title:
'title': traverse_media(('displayTitles', Ellipsis, 'title'), | |
get_all=False) or media['title'], | |
'title': media['name'], |
Thanks for this, dirkf; was expecting to land your earlier commits largely unchanged in yt-dlp which was why I went ahead, but got a lot of feedback on that side. However, it did provide the opportunity to resolve the metadata issue for episodes, so calling that a win. Hope the testing on the ytdl side goes well. |
I have used this file to download a movie from SBS that previously I wasn't able to, so it appears to work ok. Thanks for the change. |
Code seems to download episodes OK, but not their subtitles (using --write-subs). Links to some episodes which have subtitles follow. https://www.sbs.com.au/ondemand/tv-series/alex-polizzis-secret-italy/season-1/alex-polizzi-secret-italy-s1-ep3/1378014275543 https://www.sbs.com.au/ondemand/tv-series/beerland/season-1/beerland-s1-ep1/1068730947696 |
Same, no subtitles on several items I checked using yt-dlp, which is essentially this code afaik. |
I wouldn't expect any further action at yt-dlp given that PR has landed and the related issue closed; the issues with UTF-16 encoding, based formatting (as well as delivery of WebVTT subs when dxfp is requested in some cases*) makes further efforts around SBS subs unlikely, unless and until SBS themselves fix some stuff, despite comments over on Whirlpool. I think everyone's hair is quite safe here. Suggest opening a new issue on the subs problems specifically, particularly if you don't want to delay this PR from landing in the short term and getting the basic downloading functions back. The bottom line is, as far as I know, the issues with subtitles are not related to anything in this PR, so no reason to delay it. *for example, episodes 3 to 6 of 'Rogue Heroes'. If |
I've updated the extractor from the latest yt-dlp version and fixed the extraction of Live testing welcome. |
I have inserted this sbs.py file into Youtube_Dl and it appears to work ok. |
It's better than the master version, then. Good to go, I think. |
Boilerplate: own code, bug fix
## Please follow the guide belowx
into all the boxes [ ] relevant to your pull request (like that [x])Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Description of your pull request and other information
Australian provider SBS has changed its hosting arrangements and APIs, breaking the existing extractor.
The PR is intended to deal with the new APIs.
Some specialisations of extractor methods are included:
_download_webpage_handle()
detects geo-restriction_extract_m3u8_formats()
defaults to the native downloader.As most shows of interest are geo-restricted to Australia, it needs thorough testing in-region (please).
The initial version retains the old extractor as dead code. If none of it turns out to be needed, a future commit will remove it.
Resolves #31841.
See also yt-dlp/yt-dlp#6543.