benoit74 edited this page Oct 11, 2024 · 5 revisions

The Youtube scraper creates a ZIM from:

  • one Youtube channel (myChannel)
  • one Youtube user (myUser)
  • one Youtube handle (@myHandle)
  • one or multiple Youtube playlist(s)

The ZIM will contain the selected set of videos.

Standard (Zimfarm) operation

When archiving a channel / user, it is very important to decide whether to create a single ZIM or one ZIM per playlist of the Youtube user or channel.

Since videos consume a lot of disk storage, it is wise to avoid creating a ZIM with so many videos that it becomes too big for ZIM end-users. This typically happens when creating a ZIM from a prolific Youtube user or channel. It is preferable to split the content of such a user or channel into multiple ZIMs based on playlists, which prolific users or channels usually provide. Unfortunately, not all users or channels provide playlists, and sometimes these playlists are incomplete or not up to date.

While the scraper supports a Playlists mode, capable of automatically creating one ZIM per playlist of a given channel / user, this mode has been disabled in the Zimfarm due to poor metadata management in such a situation. If one ZIM per playlist is wanted, one Zimfarm recipe has to be created per playlist. This has the advantage of allowing small playlists to be grouped in a single ZIM when it makes sense, while still creating one ZIM per big playlist.

Advanced playlists mode

This mode is not supported on the Zimfarm because we realized that it produces ZIMs with poor metadata: e.g. the title and description are automatically sourced from the Youtube playlist, and they rarely match our expectations of length and quality.

In Playlists mode, all playlists of a single user / channel / handle are archived into multiple ZIMs:

  • for instance, if you are archiving a user with 10 playlists, you will get 10 ZIMs, one per playlist
  • if the user / channel / handle creates a new playlist, the corresponding ZIM will automatically be created at next scraper execution
  • the advantage of this mode was that you have only one Zimfarm recipe to maintain, all ZIMs are updated at the same time, and you avoid overwhelming Zimfarm workers with too many tasks for a single user / channel; the drawback is that you cannot request an update of a single ZIM, since all of them are updated at the same time

Scraper flags

In addition to the descriptions already shown in the Zimfarm UI, you will find below some explanations about each flag.

The flags are grouped below by usage: flags used in both modes, flags used only in standard (single ZIM) mode, and flags used only in Playlists mode.

Flags used in both modes

| Flag name | Comments |
| --- | --- |
| Optimization Cache URL | This is very important: make sure the proper bucket is used so as not to mess with data storage. |
| API Key | This is the technical identifier used to query the Youtube API. Use the proper API key; a new key has to be created by the dev / tech team. It is very important as well to use the proper value here so as not to mess with other recipes. |
| Youtube ID(s) | If Type is set to channel, this flag holds a single ID of the user, channel or handle. It can be either the displayed name found in the URL (e.g. Madrasa, with or without the leading @) or the technical ID. If Type is set to playlist, this flag must contain the ID of one or more playlists, separated by commas. All videos of all selected playlists will be archived in a single ZIM file. |
| Language | |
| Content Creator | |
| ZIM Tags | |
| Locale | |
| Only after date | In addition to the simple YYYYMMDD format (e.g. 20230612 for 12th June 2023), this flag also supports relative dates in the (now\|today)[+-][0-9](day\|week\|month\|year)(s) format. For instance, to archive all videos published within the last two months, use "now-2months"; the date limit is then recomputed at each recipe execution. |
| Video format | |
| Low Quality | |
| Use any optimized version | Every video processed by the scraper is re-encoded to limit its size and stored in what is called a Zimfarm cache (identified by the Optimization Cache URL). To avoid re-encoding the same video multiple times across executions, the scraper first checks whether each video is present in the cache. If it is, the cached video is archived directly in the ZIM to save bandwidth and computing time. However, a video might change because of a very small edit, which creates a new video version in Youtube. By default, the scraper detects this, does not use the obsolete cached version, downloads the new version and re-encodes it. To avoid this (e.g. when a user / channel has many videos or is known to make many subtle version updates), set this flag to Enabled. |
| All Subtitles | By default, the scraper only includes manually edited Youtube subtitles in the ZIM. If you set this flag to Enabled, automatically generated Youtube subtitles are also included in the ZIM. Beware that automatically generated subtitles are known to be of limited quality. |
| Pagination | How many videos to display per page in the ZIM UI. |
| Auto-play | Whether the video starts automatically when it is opened in the ZIM UI. |
| Profile Image | Customize the profile image used in the ZIM UI. |
| Banner Image | Customize the banner image used in the ZIM UI. |
| Main Color | Customize the main color used in the ZIM UI. |
| Secondary Color | Customize the secondary color used in the ZIM UI. |
| Debug | |
| Metadata JSON | Expert option, used by developers when a custom need arises. |
| Concurrency | The default value is 1 and it is not recommended to modify it without developer approval. |
| Output folder | Technical flag, do not modify; its value must be /output. |
| Temp folder | Technical flag, do not modify; its value must be /output. |
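
To make the two accepted forms of the Only after date flag concrete, here is a minimal Python sketch that resolves a value into a calendar date. It is not the scraper's own code: the function names are hypothetical, and the pattern is an assumption that accepts several digits and an optional trailing "s".

```python
import calendar
import re
from datetime import date, timedelta

# Assumed pattern, adapted from the format given on this page; it accepts
# one or more digits and an optional plural "s" (both are assumptions).
RELATIVE = re.compile(r"^(now|today)([+-])(\d+)(day|week|month|year)s?$")


def shift_months(day: date, months: int) -> date:
    """Move a date by whole months, clamping to the last day of the target month."""
    index = day.year * 12 + (day.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(day.day, last_day))


def resolve_after_date(value: str, today: date) -> date:
    """Turn an 'Only after date' value into a concrete date (hypothetical helper)."""
    if re.fullmatch(r"\d{8}", value):  # simple YYYYMMDD form
        return date(int(value[:4]), int(value[4:6]), int(value[6:]))
    match = RELATIVE.match(value)
    if match is None:
        raise ValueError(f"unsupported date value: {value!r}")
    _, sign, amount, unit = match.groups()
    count = int(amount) if sign == "+" else -int(amount)
    if unit == "day":
        return today + timedelta(days=count)
    if unit == "week":
        return today + timedelta(weeks=count)
    if unit == "month":
        return shift_months(today, count)
    return shift_months(today, 12 * count)  # unit == "year"


print(resolve_after_date("20230612", date(2024, 10, 11)))     # 2023-06-12
print(resolve_after_date("now-2months", date(2024, 10, 11)))  # 2024-08-11
```

Because the relative form is resolved against the execution date, the same recipe yields a later limit at each run, which is the behaviour described for "now-2months".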

Flags used in standard mode

| Flag name | Comments |
| --- | --- |
| Type | Either channel for archiving a channel, a user or a handle in a single ZIM, or playlist for archiving a set of playlists in a single ZIM. |
| ZIM Name | |
| ZIM Filename | Not mandatory; if not provided, it is automatically computed from the ZIM Name. If set, this parameter must include the {period} placeholder, which is replaced by the current period at ZIM creation (e.g. super_zim_eng_all_{period}.zim). |
| Title | |
| Description | |
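
As an illustration of the {period} placeholder in ZIM Filename, the following Python sketch fills it for a given date. The helper name is hypothetical, and rendering the period as YYYY-MM is an assumption, not something this page states.

```python
from datetime import date


def zim_filename(template: str, today: date) -> str:
    """Fill the {period} placeholder of a ZIM Filename value (hypothetical helper)."""
    if "{period}" not in template:
        raise ValueError("ZIM Filename must include the {period} placeholder")
    # Assumption: the period is rendered as YYYY-MM.
    return template.replace("{period}", today.strftime("%Y-%m"))


print(zim_filename("super_zim_eng_all_{period}.zim", date(2024, 10, 11)))
# super_zim_eng_all_2024-10.zim
```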

Flags used in playlist mode

You activate the playlists mode by setting the --indiv-playlists flag.

| Flag name | Comments |
| --- | --- |
| Type | In Playlists mode, the type must be channel. |
| Playlists Name | Format used to build the ZIM Name for each playlist / ZIM. It can be a static value like "Super Name" (meaning all ZIMs will share the same name, which is not recommended). It should rather use placeholders, like "{creator_name} - {title}". For each ZIM, the placeholders are replaced by their respective values for the current playlist. E.g. if the value is "{creator_name} - {title}", the playlist title is "Super playlist 1" and the creator name is "Bob", then the ZIM name will be "Bob - Super playlist 1". You can use any combination of placeholders and raw text. |
| Playlists ZIM Filename | Not mandatory; if not provided, it is automatically computed from the Playlists Name parameter. |
| Playlists title | |
| Playlists description | |

Some remarks:

  • --playlists-title and --playlists-description allow you to customize the title and description dynamically via some playlist-related variables:
    • {title}: the playlist title
    • {description}: the playlist description
    • {slug}: slugified version of the playlist title
    • {playlist_id}: playlist ID on Youtube
    • {creator_id}: playlist's owner channel/user ID.
    • {creator_name}: playlist's owner channel/user name.
  • You can omit them, in which case youtube2zim auto-generates them.
  • You must specify --playlists-name (it supports the variables listed above).
  • --playlists-name sets the Name metadata of the ZIM (it should be unique) and, if the file name is not set separately, the output file name of the ZIM.
  • --metadata-from lets you specify a path or URL to a JSON file containing custom static metadata for individual playlists.
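
The placeholder substitution described above can be sketched in Python as follows. This is only an illustration under stated assumptions: render and slugify are hypothetical helpers (the slug rule in particular is a guess), and the playlist values are made up.

```python
import re


def slugify(text: str) -> str:
    """Hypothetical slug rule: lowercase, runs of other characters become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def render(template: str, playlist: dict) -> str:
    """Replace the playlist-related variables in a name/title/description format."""
    variables = {
        "title": playlist["title"],
        "description": playlist["description"],
        "slug": slugify(playlist["title"]),
        "playlist_id": playlist["playlist_id"],
        "creator_id": playlist["creator_id"],
        "creator_name": playlist["creator_name"],
    }
    return template.format(**variables)


# Made-up playlist values, for illustration only.
playlist = {
    "title": "Super playlist 1",
    "description": "All about super things",
    "playlist_id": "PL_example",
    "creator_id": "UC_example",
    "creator_name": "Bob",
}

print(render("{creator_name} - {title}", playlist))  # Bob - Super playlist 1
print(render("{slug}", playlist))                    # super-playlist-1
```

A format with no placeholder at all renders to the same string for every playlist, which is why a static Playlists Name is not recommended.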

JSON --metadata-from format:

{
    "<playlist-id>": {
        "name": "",
        "zim-file": "",
        "title": "",
        "description": "",
        "tags": "",
        "creator": "",
        "profile": "",
        "banner": ""
    }
}
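
One way to picture how such a file could be combined with auto-generated values is the Python sketch below. It is not youtube2zim's implementation: the helper name, the playlist ID and the rule that empty strings mean "keep the generated value" are all assumptions.

```python
import json

# Hypothetical --metadata-from content for one playlist (made-up ID and values).
CUSTOM_METADATA = json.loads("""
{
    "PL_example": {
        "name": "",
        "title": "Hand-written title",
        "description": "",
        "tags": "science"
    }
}
""")


def playlist_metadata(playlist_id: str, generated: dict, custom: dict) -> dict:
    """Overlay custom values on generated ones.

    Assumption: empty strings in the JSON file are ignored.
    """
    entry = custom.get(playlist_id, {})
    overrides = {key: value for key, value in entry.items() if value}
    return {**generated, **overrides}


generated = {"title": "Super playlist 1", "description": "From Youtube", "tags": ""}
print(playlist_metadata("PL_example", generated, CUSTOM_METADATA))
# {'title': 'Hand-written title', 'description': 'From Youtube', 'tags': 'science'}
```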