Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[extractor/sbs] Overhaul extractor for new APIs #31880

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

dirkf
Copy link
Contributor

@dirkf dirkf commented Mar 19, 2023

Boilerplate: own code, bug fix ## Please follow the guide below
  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

Australian provider SBS has changed its hosting arrangements and APIs, breaking the existing extractor.

The PR is intended to deal with the new APIs.

Some specialisations of extractor methods are included:

  • _download_webpage_handle() detects geo-restriction
  • _extract_m3u8_formats() defaults to the native downloader.

As most shows of interest are geo-restricted to Australia, it needs thorough testing in-region (please).

The initial version retains the old extractor as dead code. If none of it turns out to be needed, a future commit will remove it.

Resolves #31841.
See also yt-dlp/yt-dlp#6543.

Copy link

@gamer191 gamer191 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks heaps for this PR! I will test it tomorrow. Anything specific for me to test?

youtube_dl/extractor/sbs.py Show resolved Hide resolved
youtube_dl/extractor/sbs.py Show resolved Hide resolved
'uploader': 'SBSC',
}

def _old_real_extract(self, url):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _old_real_extract(self, url):
def _old_real_extract(self, url): //disabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be removed before merging.

youtube_dl/extractor/sbs.py Show resolved Hide resolved
youtube_dl/extractor/sbs.py Show resolved Hide resolved
youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved
Comment on lines 241 to 242
'title': traverse_media(('displayTitles', Ellipsis, 'title'),
get_all=False) or media['title'],
Copy link

@vidiot720 vidiot720 Mar 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these improvements to the metadata retrieval. One small issue; this change to title setting loses the episode title for tv episodes and only sets the series name; the old version used 'name' or similar, which could be parsed with title:(?P<series>.+?(?= S[0-9]+)) S(?P<season_number>[0-9]+) Ep(?P<episode_number>[0-9]+)(?P<episode>.*). As 'episode' not set directly, it gets the default Episode <episode number> instead.
The easy fix is to go back to using media['name'] to set the title, since the Series Title is set correctly anyway, but it's more complicated to pick-up the episode title as that appears as 'subtitle' within the 'displayTitles' node. But for movies, SBS uses subtitle field for genre (e.g. "Horror") so that would be incorrect as a general solution, unless the episode setting can be guarded by e.g. externalRelations.sbsondemand.entity_type == 'EPISODE', or by testing displayTitles.components.SeasonEpisodeFull == ''.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extensive test suite (2 shows!) that I exercised included 1 show that was an episode, which had no specific episode name (Episode 6, or some such).

If the title or episode should be set to something like {series} S{season_number}E{episode_number} with any non-trivial episode value appended, that's fine. Show some examples.

Copy link

@vidiot720 vidiot720 Mar 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, that explains it. My suggestions:
For title, use media['name'], because:

  • this is backwards compatible with the prior behaviour;
  • since title is used for the default output template, just setting to the series name isn't particularly helpful when dealing with multiple episodes of the same series.

If this is done, then title will take the form of {series} S{season_number} Ep{episode number} - {episode} if the episode is named, or {series} S{season_number} Ep{episode number}, if not. I've tested this change successfully, i.e. 'title': media['name'],. Since SBS build this string already, there's no assembly required for this. For non-episodic programs it also works fine (e.g. for a film, it's just the title of the film).

For episode, assuming this doesn't hold up landing this PR too much (since it's pretty urgent fix for breakage), use player_data['name'], (i.e. the name element of the VideoObject), iff partOfSeries node exists; sorry, not sure how this would be tested for so I haven't got proposed code for this. I think this is a reasonable way to determine whether the video is part of a series, and could therefore have an episode specific title. I wouldn't test for partOfSeason as some episodic programs have none (e.g. mini-series, or current affairs). I've suggested this element because unlike the videoDescriptions, it's blank when it should be (for episodes), per the examples below.

Examples:

Video id 2170811459503: title should be: From Dusk Till Dawn and episode empty.
Video id 2158330947797: title should be: Blackport S1 Ep8 - Iceland and episode should be Iceland.
Video id 2165730371885: title should be: Devils S1 Ep1 and episode empty.

For this last example, there may be a preference to set episode to Episode 1 rather than empty, as I think this is again more consistent with previous default behaviour. However, this is redundant given that the episode number is now being properly set, where provided, so I'd argue the episode field should be reserved for actual episode titles and not a fabricated generic string.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In re-jigging this PR for yt-dlp, discovered that the episode title is in the metadata from the catalogue API as title, and this has been added in the yt-dlp fix to the media object for later return as epName. Notwithstanding the traverse_obj differences, I expect this could be adapted for this PR:

        # For named episodes, use the catalogue's title to set episode, rather than generic 'Episode N'.
        if traverse_obj(media, ('partOfSeries', {dict})):
            media['epName'] = traverse_obj(media, ('title', {str}))`

@icenov
Copy link

icenov commented Mar 25, 2023

Just tried out the #31880 for new SBS platform and it worked fine. Bit of a learning experience for me as first time with git, but good docs helped! Thanks!

def _get_player_data(self, video_id, headers=None, fatal=False):
return self._download_json(update_url_query(
self._PLAYER_API + 'video_stream', {'id': video_id, 'context': 'tv'}),
video_id, headers=headers, fatal=fatal) or {}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe headers should default to an empty dict here and in _call_api. Setting headers to None results in options like --geo-bypass-country breaking.

Copy link
Contributor Author

@dirkf dirkf Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may have a point, ATM. In extractor/common.py ll.619ff.

--- git master
+++ WIP code pulling stuff from yt-dlp

         if self._x_forwarded_for_ip:
-            if 'X-Forwarded-For' not in headers:
-                headers['X-Forwarded-For'] = self._x_forwarded_for_ip
+            headers = (headers or {}).copy()
+            headers.setdefault('X-Forwarded-For', self._x_forwarded_for_ip)

The new style is to avoid accidentally sharing mutable default values by passing None instead of {}. So we should for now have headers or {}. As _download_webpage_handle() is being specialised and is on the call path for all the extractor web requests, that would be a good place to tweak it.

else:
kwargs['expected_status'] = exp
kwargs = compat_kwargs(kwargs)

Copy link
Contributor Author

@dirkf dirkf Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address #31880 (comment) (not required for yt-dlp or future yt-dl)

Suggested change
# for now, ensure that headers passed is not None
if len(args) > 5:
if args[5] is None:
# replace with empty dict
args = list(args)
args[5] = {}
elif kwargs.get('headers', False) is None:
# enforce callee default
del kwargs['headers']

@Rygle
Copy link

Rygle commented Apr 13, 2023

Works great. Just copied and pasted the latest version from the pull request and it works. Thanks!

https?://(?:www\.)?sbs\.com\.au/(?:
ondemand(?:
/video/(?:single/)?|
/movie/[^/]+/|
Copy link

@vidiot720 vidiot720 Apr 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjustment needed to match single video programs (e.g. documentaries such as https://www.sbs.com.au/ondemand/tv-program/autun-romes-forgotten-sister/2116212803602),
like so:

Suggested change
/movie/[^/]+/|
/(?:tv-program|movie)/[^/]+/|

Comment on lines 235 to 236
'title': traverse_media(('displayTitles', Ellipsis, 'title'),
get_all=False) or media['title'],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on discussion above, update for title:

Suggested change
'title': traverse_media(('displayTitles', Ellipsis, 'title'),
get_all=False) or media['title'],
'title': media['name'],

youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved
@vidiot720
Copy link

Thanks for this, dirkf; was expecting to land your earlier commits largely unchanged in yt-dlp which was why I went ahead, but got a lot of feedback on that side. However, it did provide the opportunity to resolve the metadata issue for episodes, so calling that a win. Hope the testing on the ytdl side goes well.

@dirkf dirkf mentioned this pull request May 11, 2023
2 tasks
@peternewell
Copy link

I have used this file to download a movie from SBS that previously I wasn't able to, so it appears to work ok.

Thanks for the change.

@crazee1234
Copy link

Code seems to download episodes OK, but not their subtitles (using --write-subs). Links to some episodes which have subtitles follow. https://www.sbs.com.au/ondemand/tv-series/alex-polizzis-secret-italy/season-1/alex-polizzi-secret-italy-s1-ep3/1378014275543 https://www.sbs.com.au/ondemand/tv-series/beerland/season-1/beerland-s1-ep1/1068730947696

@Rygle
Copy link

Rygle commented May 22, 2023

Code seems to download episodes OK, but not their subtitles

Same, no subtitles on several items I checked using yt-dlp, which is essentially this code afaik.

@vidiot720
Copy link

Code seems to download episodes OK, but not their subtitles

Same, no subtitles on several items I checked using yt-dlp, which is essentially this code afaik.

I wouldn't expect any further action at yt-dlp given that PR has landed and the related issue closed; the issues with UTF-16 encoding, based formatting (as well as delivery of WebVTT subs when dxfp is requested in some cases*) makes further efforts around SBS subs unlikely, unless and until SBS themselves fix some stuff, despite comments over on Whirlpool. I think everyone's hair is quite safe here.

Suggest opening a new issue on the subs problems specifically, particularly if you don't want to delay this PR from landing in the short term and getting the basic downloading functions back. The bottom line is, as far as I know, the issues with subtitles are not related to anything in this PR, so no reason to delay it.

*for example, episodes 3 to 6 of 'Rogue Heroes'. If --convert-sub is used, a zero-length file results, because the input format to the conversion is unexpected.

youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved
youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved
youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved
youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved
@dirkf
Copy link
Contributor Author

dirkf commented Oct 9, 2024

I've updated the extractor from the latest yt-dlp version and fixed the extraction of categories and tags. The Dingo Conservation test clip succeeds.

Live testing welcome.

@peternewell
Copy link

I have inserted this sbs.py file into Youtube_Dl and it appears to work ok.

@dirkf
Copy link
Contributor Author

dirkf commented Oct 9, 2024

It's better than the master version, then. Good to go, I think.

@dirkf dirkf linked an issue Oct 15, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SBS on demand. One of six episodes fails to download. Youtube site not having Cache attribute
8 participants