[extractor/sbs] Overhaul extractor for new APIs #31880

dirkf · 2023-03-19T02:47:41Z

Boilerplate: own code, bug fix

## Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

Australian provider SBS has changed its hosting arrangements and APIs, breaking the existing extractor.

The PR is intended to deal with the new APIs.

Some specialisations of extractor methods are included:

_download_webpage_handle() detects geo-restriction
_extract_m3u8_formats() defaults to the native downloader.

As most shows of interest are geo-restricted to Australia, it needs thorough testing in-region (please).

The initial version retains the old extractor as dead code. If none of it turns out to be needed, a future commit will remove it.

Resolves #31841.
See also yt-dlp/yt-dlp#6543.

gamer191

Thanks heaps for this PR! I will test it tomorrow. Anything specific for me to test?

youtube_dl/extractor/sbs.py

gamer191 · 2023-03-19T11:24:19Z

youtube_dl/extractor/sbs.py

+            'uploader': 'SBSC',
+        }
+
+    def _old_real_extract(self, url):


Suggested change

def _old_real_extract(self, url):

def _old_real_extract(self, url): //disabled

It'll be removed before merging.

youtube_dl/extractor/sbs.py

vidiot720 · 2023-03-19T21:43:38Z

youtube_dl/extractor/sbs.py

+            'title': traverse_media(('displayTitles', Ellipsis, 'title'),
+                                    get_all=False) or media['title'],


Thanks for these improvements to the metadata retrieval. One small issue; this change to title setting loses the episode title for tv episodes and only sets the series name; the old version used 'name' or similar, which could be parsed with title:(?P<series>.+?(?= S[0-9]+)) S(?P<season_number>[0-9]+) Ep(?P<episode_number>[0-9]+)(?P<episode>.*). As 'episode' not set directly, it gets the default Episode <episode number> instead.
The easy fix is to go back to using media['name'] to set the title, since the Series Title is set correctly anyway, but it's more complicated to pick-up the episode title as that appears as 'subtitle' within the 'displayTitles' node. But for movies, SBS uses subtitle field for genre (e.g. "Horror") so that would be incorrect as a general solution, unless the episode setting can be guarded by e.g. externalRelations.sbsondemand.entity_type == 'EPISODE', or by testing displayTitles.components.SeasonEpisodeFull == ''.

The extensive test suite (2 shows!) that I exercised included 1 show that was an episode, which had no specific episode name (Episode 6, or some such).

If the title or episode should be set to something like {series} S{season_number}E{episode_number} with any non-trivial episode value appended, that's fine. Show some examples.

No worries, that explains it. My suggestions:
For title, use media['name'], because:

this is backwards compatible with the prior behaviour;

since title is used for the default output template, just setting to the series name isn't particularly helpful when dealing with multiple episodes of the same series.

If this is done, then title will take the form of {series} S{season_number} Ep{episode number} - {episode} if the episode is named, or {series} S{season_number} Ep{episode number}, if not. I've tested this change successfully, i.e. 'title': media['name'],. Since SBS build this string already, there's no assembly required for this. For non-episodic programs it also works fine (e.g. for a film, it's just the title of the film).

For episode, assuming this doesn't hold up landing this PR too much (since it's pretty urgent fix for breakage), use player_data['name'], (i.e. the name element of the VideoObject), iff partOfSeries node exists; sorry, not sure how this would be tested for so I haven't got proposed code for this. I think this is a reasonable way to determine whether the video is part of a series, and could therefore have an episode specific title. I wouldn't test for partOfSeason as some episodic programs have none (e.g. mini-series, or current affairs). I've suggested this element because unlike the videoDescriptions, it's blank when it should be (for episodes), per the examples below.

Examples:

Video id 2170811459503: title should be: From Dusk Till Dawn and episode empty.
Video id 2158330947797: title should be: Blackport S1 Ep8 - Iceland and episode should be Iceland.
Video id 2165730371885: title should be: Devils S1 Ep1 and episode empty.

For this last example, there may be a preference to set episode to Episode 1 rather than empty, as I think this is again more consistent with previous default behaviour. However, this is redundant given that the episode number is now being properly set, where provided, so I'd argue the episode field should be reserved for actual episode titles and not a fabricated generic string.

In re-jigging this PR for yt-dlp, discovered that the episode title is in the metadata from the catalogue API as title, and this has been added in the yt-dlp fix to the media object for later return as epName. Notwithstanding the traverse_obj differences, I expect this could be adapted for this PR:

# For named episodes, use the catalogue's title to set episode, rather than generic 'Episode N'. if traverse_obj(media, ('partOfSeries', {dict})): media['epName'] = traverse_obj(media, ('title', {str}))`

icenov · 2023-03-25T02:59:48Z

Just tried out the #31880 for new SBS platform and it worked fine. Bit of a learning experience for me as first time with git, but good docs helped! Thanks!

Hazza117 · 2023-03-30T13:48:56Z

youtube_dl/extractor/sbs.py

+    def _get_player_data(self, video_id, headers=None, fatal=False):
+        return self._download_json(update_url_query(
+            self._PLAYER_API + 'video_stream', {'id': video_id, 'context': 'tv'}),
+            video_id, headers=headers, fatal=fatal) or {}


I believe headers should default to an empty dict here and in _call_api. Setting headers to None results in options like --geo-bypass-country breaking.

You may have a point, ATM. In extractor/common.py ll.619ff.

--- git master +++ WIP code pulling stuff from yt-dlp if self._x_forwarded_for_ip: - if 'X-Forwarded-For' not in headers: - headers['X-Forwarded-For'] = self._x_forwarded_for_ip + headers = (headers or {}).copy() + headers.setdefault('X-Forwarded-For', self._x_forwarded_for_ip)

The new style is to avoid accidentally sharing mutable default values by passing None instead of {}. So we should for now have headers or {}. As _download_webpage_handle() is being specialised and is on the call path for all the extractor web requests, that would be a good place to tweak it.

dirkf · 2023-03-30T15:33:23Z

youtube_dl/extractor/sbs.py

+            else:
+                kwargs['expected_status'] = exp
+                kwargs = compat_kwargs(kwargs)
+


Address #31880 (comment) (not required for yt-dlp or future yt-dl)

Suggested change

# for now, ensure that headers passed is not None

if len(args) > 5:

if args[5] is None:

# replace with empty dict

args = list(args)

args[5] = {}

elif kwargs.get('headers', False) is None:

# enforce callee default

del kwargs['headers']

Rygle · 2023-04-13T12:26:51Z

Works great. Just copied and pasted the latest version from the pull request and it works. Thanks!

vidiot720 · 2023-04-16T14:26:30Z

youtube_dl/extractor/sbs.py

+        https?://(?:www\.)?sbs\.com\.au/(?:
+            ondemand(?:
+                /video/(?:single/)?|
+                /movie/[^/]+/|


Adjustment needed to match single video programs (e.g. documentaries such as https://www.sbs.com.au/ondemand/tv-program/autun-romes-forgotten-sister/2116212803602),
like so:

Suggested change

/movie/[^/]+/|

/(?:tv-program|movie)/[^/]+/|

vidiot720 · 2023-04-16T14:29:55Z

youtube_dl/extractor/sbs.py

+            'title': traverse_media(('displayTitles', Ellipsis, 'title'),
+                                    get_all=False) or media['title'],


Based on discussion above, update for title:

Suggested change

'title': traverse_media(('displayTitles', Ellipsis, 'title'),

get_all=False) or media['title'],

'title': media['name'],

youtube_dl/extractor/sbs.py

vidiot720 · 2023-05-04T10:50:49Z

Thanks for this, dirkf; was expecting to land your earlier commits largely unchanged in yt-dlp which was why I went ahead, but got a lot of feedback on that side. However, it did provide the opportunity to resolve the metadata issue for episodes, so calling that a win. Hope the testing on the ytdl side goes well.

peternewell · 2023-05-12T21:34:07Z

I have used this file to download a movie from SBS that previously I wasn't able to, so it appears to work ok.

Thanks for the change.

crazee1234 · 2023-05-20T04:06:56Z

Code seems to download episodes OK, but not their subtitles (using --write-subs). Links to some episodes which have subtitles follow. https://www.sbs.com.au/ondemand/tv-series/alex-polizzis-secret-italy/season-1/alex-polizzi-secret-italy-s1-ep3/1378014275543 https://www.sbs.com.au/ondemand/tv-series/beerland/season-1/beerland-s1-ep1/1068730947696

Rygle · 2023-05-22T06:21:13Z

Code seems to download episodes OK, but not their subtitles

Same, no subtitles on several items I checked using yt-dlp, which is essentially this code afaik.

vidiot720 · 2023-05-30T01:51:48Z

Code seems to download episodes OK, but not their subtitles

Same, no subtitles on several items I checked using yt-dlp, which is essentially this code afaik.

I wouldn't expect any further action at yt-dlp given that PR has landed and the related issue closed; the issues with UTF-16 encoding, based formatting (as well as delivery of WebVTT subs when dxfp is requested in some cases*) makes further efforts around SBS subs unlikely, unless and until SBS themselves fix some stuff, despite comments over on Whirlpool. I think everyone's hair is quite safe here.

Suggest opening a new issue on the subs problems specifically, particularly if you don't want to delay this PR from landing in the short term and getting the basic downloading functions back. The bottom line is, as far as I know, the issues with subtitles are not related to anything in this PR, so no reason to delay it.

*for example, episodes 3 to 6 of 'Rogue Heroes'. If --convert-sub is used, a zero-length file results, because the input format to the conversion is unexpected.

youtube_dl/extractor/sbs.py

dirkf · 2024-10-09T00:51:26Z

I've updated the extractor from the latest yt-dlp version and fixed the extraction of categories and tags. The Dingo Conservation test clip succeeds.

Live testing welcome.

peternewell · 2024-10-09T05:29:07Z

I have inserted this sbs.py file into Youtube_Dl and it appears to work ok.

dirkf · 2024-10-09T09:44:35Z

It's better than the master version, then. Good to go, I think.

[extractor/sbs] Overhaul extractor for new APIs

bcb8143

dirkf mentioned this pull request Mar 19, 2023

SBS on demand. One of six episodes fails to download. #31841

Open

6 tasks

gamer191 mentioned this pull request Mar 19, 2023

All downloads from SBS broken yt-dlp/yt-dlp#6543

Closed

11 tasks

gamer191 reviewed Mar 19, 2023

View reviewed changes

youtube_dl/extractor/sbs.py Show resolved Hide resolved

youtube_dl/extractor/sbs.py Show resolved Hide resolved

gamer191 reviewed Mar 19, 2023

View reviewed changes

[SBS] Updates for review and rethink

d1dbd37

gamer191 reviewed Mar 19, 2023

View reviewed changes

youtube_dl/extractor/sbs.py Show resolved Hide resolved

youtube_dl/extractor/sbs.py Show resolved Hide resolved

youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved

vidiot720 reviewed Mar 19, 2023

View reviewed changes

Hazza117 reviewed Mar 30, 2023

View reviewed changes

dirkf commented Mar 30, 2023

View reviewed changes

vidiot720 reviewed Apr 16, 2023

View reviewed changes

youtube_dl/extractor/sbs.py Show resolved Hide resolved

vidiot720 mentioned this pull request Apr 17, 2023

[extractor/sbs] Overhaul extractor for new APIs yt-dlp/yt-dlp#6839

Merged

9 tasks

Align PR with merged yt-dlp code

b028b2f

dirkf commented May 1, 2023

View reviewed changes

youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved

Update youtube_dl/extractor/sbs.py

ab5617b

dirkf mentioned this pull request May 11, 2023

Latest EXE Download version #32179

Closed

2 tasks

dirkf added 2 commits October 8, 2024 23:51

Update 2023 draft to current API version, etc

de91fe7

Merge branch 'ytdl-org:master' into df-sbs-extractor-ovrhaul

3abae00

dirkf commented Oct 9, 2024

View reviewed changes

youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved

dirkf commented Oct 9, 2024

View reviewed changes

youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved

dirkf commented Oct 9, 2024

View reviewed changes

youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved

dirkf commented Oct 9, 2024

View reviewed changes

youtube_dl/extractor/sbs.py Outdated Show resolved Hide resolved

Fix/improve timestamp, categories, tags

80d03d7

dirkf mentioned this pull request Oct 9, 2024

Youtube site not having Cache attribute #32943

Closed

dirkf linked an issue Oct 15, 2024 that may be closed by this pull request

Youtube site not having Cache attribute #32943

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[extractor/sbs] Overhaul extractor for new APIs #31880

[extractor/sbs] Overhaul extractor for new APIs #31880

dirkf commented Mar 19, 2023

gamer191 left a comment

gamer191 Mar 19, 2023

dirkf Mar 19, 2023

vidiot720 Mar 19, 2023 •

edited

Loading

dirkf Mar 20, 2023

vidiot720 Mar 20, 2023 •

edited

Loading

vidiot720 Apr 19, 2023

icenov commented Mar 25, 2023

Hazza117 Mar 30, 2023

dirkf Mar 30, 2023 •

edited

Loading

dirkf Mar 30, 2023 •

edited

Loading

Rygle commented Apr 13, 2023

vidiot720 Apr 16, 2023 •

edited

Loading

vidiot720 Apr 16, 2023

vidiot720 commented May 4, 2023

peternewell commented May 12, 2023

crazee1234 commented May 20, 2023

Rygle commented May 22, 2023

vidiot720 commented May 30, 2023

dirkf commented Oct 9, 2024

peternewell commented Oct 9, 2024

dirkf commented Oct 9, 2024

	def _old_real_extract(self, url):
	def _old_real_extract(self, url): //disabled

		'title': traverse_media(('displayTitles', Ellipsis, 'title'),
		get_all=False) or media['title'],

+        # for now, ensure that headers passed is not None
+        if len(args) > 5:
+            if args[5] is None:
+                # replace with empty dict
+                args = list(args)
+                args[5] = {}
+        elif kwargs.get('headers', False) is None:
+            # enforce callee default
+            del kwargs['headers']

	'title': traverse_media(('displayTitles', Ellipsis, 'title'),
	get_all=False) or media['title'],
	'title': media['name'],

[extractor/sbs] Overhaul extractor for new APIs #31880

Are you sure you want to change the base?

[extractor/sbs] Overhaul extractor for new APIs #31880

Conversation

dirkf commented Mar 19, 2023

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

What is the purpose of your pull request?

Description of your pull request and other information

gamer191 left a comment

Choose a reason for hiding this comment

gamer191 Mar 19, 2023

Choose a reason for hiding this comment

dirkf Mar 19, 2023

Choose a reason for hiding this comment

vidiot720 Mar 19, 2023 • edited Loading

Choose a reason for hiding this comment

dirkf Mar 20, 2023

Choose a reason for hiding this comment

vidiot720 Mar 20, 2023 • edited Loading

Choose a reason for hiding this comment

vidiot720 Apr 19, 2023

Choose a reason for hiding this comment

icenov commented Mar 25, 2023

Hazza117 Mar 30, 2023

Choose a reason for hiding this comment

dirkf Mar 30, 2023 • edited Loading

Choose a reason for hiding this comment

dirkf Mar 30, 2023 • edited Loading

Choose a reason for hiding this comment

Rygle commented Apr 13, 2023

vidiot720 Apr 16, 2023 • edited Loading

Choose a reason for hiding this comment

vidiot720 Apr 16, 2023

Choose a reason for hiding this comment

vidiot720 commented May 4, 2023

peternewell commented May 12, 2023

crazee1234 commented May 20, 2023

Rygle commented May 22, 2023

vidiot720 commented May 30, 2023

dirkf commented Oct 9, 2024

peternewell commented Oct 9, 2024

dirkf commented Oct 9, 2024

vidiot720 Mar 19, 2023 •

edited

Loading

vidiot720 Mar 20, 2023 •

edited

Loading

dirkf Mar 30, 2023 •

edited

Loading

dirkf Mar 30, 2023 •

edited

Loading

vidiot720 Apr 16, 2023 •

edited

Loading