Extractor documentation #6489

AyluinReymaer · 2024-11-16T13:45:05Z

The purpose of this discussion is to improve the documentation on how to create extractors.

I've noticed that there is no proper documentation on this topic, and it has been asked about multiple times in the past.

The best discussion I found that goes into more detail on how to create an extractor is this one: #1656

However, it does not fully explain how to use Message.Queue. I found slightly more information on that here: #1345

Also, the Message class has a docstring that explains a little of what it does, which helped me understand what Message.Queue does.

So I tried to compile as much information as I could find on this topic, taking the two discussions above into account.

Hopefully, this can later be added into the Wiki for easy access:

https://github.com/mikf/gallery-dl/wiki

Contributions are greatly appreciated, as there may be things I got wrong or left incomplete.

I am still working on more details on this doc, but for now, this is what I have.

@mikf, if possible, can you confirm whether the below is correct, or point out anything I got wrong?

Extractor

To create an extractor, the following 2 methods are important:

items() should yield the following messages:
- Required: yield Message.Directory, metadata
  - Sets the target directory for all following items
  - 2nd element is a dictionary containing general metadata
- Required: yield Message.Url, url, metadata
  - Sets the item's URL to download and its metadata
  - 2nd element is the URL as a string
  - 3rd element is a dictionary with item-specific metadata
- Optional: yield Message.Queue, url, metadata
  - (External) URL that should be handled by another extractor
  - 2nd element is the (external) URL as a string
  - 3rd element is a dictionary containing URL-specific metadata
  - Example: If you are extracting all items of a Facebook post and one of the items is an embedded YouTube video, you would use yield Message.Queue, youtube_url, metadata so that the YouTube extractor downloads the video.
  - Note: If you want to use a specific extractor class, store that class under the "_extractor" key of the metadata dictionary and pass the dictionary along, like so:
    - metadata = {"_extractor": SpecificExtractor}
    - yield Message.Queue, youtube_url, metadata
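The order and shape of these messages can be sketched without gallery-dl installed. The Message class below is a stand-in with placeholder values, and dispatch() is a hypothetical consumer written only for illustration; it is not how gallery-dl itself processes messages.

```python
# Stand-in for gallery-dl's Message constants. The values are placeholders
# chosen for this sketch, not the ones the library uses.
class Message:
    Directory = "directory"
    Url = "url"
    Queue = "queue"


def items():
    """Yield messages in the order described above."""
    # General metadata for everything that follows.
    yield Message.Directory, {"title": "hello world"}
    # A file to download, plus its item-specific metadata.
    yield (Message.Url, "https://www.example.com/p1/1.jpg",
           {"media_id": "1", "extension": "jpg"})
    # A URL that another extractor should handle.
    yield Message.Queue, "https://www.youtube.com/watch?v=example", {}


def dispatch(messages):
    """Hypothetical consumer: record what would happen for each message."""
    log = []
    for msg in messages:
        kind = msg[0]
        if kind == Message.Directory:
            log.append(("set-directory", msg[1]["title"]))
        elif kind == Message.Url:
            log.append(("download", msg[1]))
        elif kind == Message.Queue:
            log.append(("hand-off", msg[1]))
    return log


actions = dispatch(items())
```

Note that the Directory message is a 2-tuple while Url and Queue are 3-tuples, which is why the consumer indexes into each message instead of unpacking a fixed number of values.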
request() is used for HTTP requests. It works more or less like Session.request() from the requests library: to fetch a JSON resource, for example, you would write something like self.request(url, params=params, headers=headers).json().
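As a rough illustration of that call shape, here is a sketch that runs without network access. SketchExtractor, FakeResponse, the api_url endpoint, and the "posts" key are all made up for this example; in a real extractor, request() comes from the Extractor base class and is not something you define yourself.

```python
class FakeResponse:
    """Minimal stand-in for an HTTP response that carries JSON."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class SketchExtractor:
    # Hypothetical endpoint, used only for this sketch.
    api_url = "https://www.example.com/api/posts"

    def request(self, url, **kwargs):
        # Stub: remember how we were called and return canned data
        # instead of performing a real HTTP request.
        self.last_call = (url, kwargs)
        return FakeResponse({"posts": [{"title": "hello world"}]})

    def get_posts(self):
        params = {"page": 1}
        headers = {"Accept": "application/json"}
        data = self.request(
            self.api_url, params=params, headers=headers).json()
        return data["posts"]


extractor = SketchExtractor()
posts = extractor.get_posts()
```

Keeping the HTTP call inside a small helper such as get_posts() means items() only has to loop over the results.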

The below attributes are also important for your extractor:

category: The category of the extractor. You can think of this as the name of the target site to extract from. For example, on an extractor for Facebook, you would put facebook here. If the same extractor works for multiple sites (like extracting from all WordPress sites using the Madara theme), you would put wordpress-madara or something like that.
subcategory: What you are extracting. For example, in the case of Instagram, you have multiple subcategories like posts, stories, reels, tagged posts, and others.
*_fmt: These are the default format strings to use (overridden by the config), each with its own purpose as described below:
- directory_fmt: The default name of the directory when downloading items.
  - e.g. directory_fmt = ("{category}", "{username}")
- filename_fmt: The default filename of the target item to download.
  - e.g. filename_fmt = "{media_id}.{extension}"
- archive_fmt: The default name of the item id to store in the archive.
  - The archive is what keeps track of which items were already downloaded and which were not. Gallery-DL uses this to prevent downloading the same item over and over again. You need to make sure that each item has a unique name to prevent collisions with other items.
  - e.g. archive_fmt = "{media_id}"
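To make the three format strings more concrete, here is a sketch of how they could expand against an item's metadata. It uses plain str.format() as an approximation, since gallery-dl has its own formatter with more features, and the seen set is only a stand-in for the real archive. The metadata values are invented.

```python
directory_fmt = ("{category}", "{username}")
filename_fmt = "{media_id}.{extension}"
archive_fmt = "{media_id}"

# Invented metadata for one item.
metadata = {
    "category": "example",
    "username": "alice",
    "media_id": "1",
    "extension": "jpg",
}

# Each tuple element becomes one path segment of the target directory.
directory = "/".join(part.format(**metadata) for part in directory_fmt)
filename = filename_fmt.format(**metadata)
archive_id = archive_fmt.format(**metadata)

# Stand-in archive: remember IDs we have already handled.
seen = set()


def should_download(entry):
    """Return True the first time an archive ID is seen, False afterwards."""
    if entry in seen:
        return False
    seen.add(entry)
    return True


first_attempt = should_download(archive_id)
second_attempt = should_download(archive_id)
```

If two different items expanded to the same archive_id, the second one would be skipped, which is why archive_fmt should be built from fields that are unique per item.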
pattern: This is the regular expression that should match all URLs the extractor can handle. The resulting match object is the first real argument of an extractor's __init__().
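As a small sketch of what such a pattern gives you, the snippet below uses Python's re module directly. The /post/<id> URL layout and the capture group are assumptions made for this example; they are not part of the sample extractor further down.

```python
import re

# Hypothetical URL layout: https://www.example.com/post/<numeric id>
pattern = r"(?:https?://)?(?:www\.)?example\.com/post/(\d+)"

match = re.match(pattern, "https://www.example.com/post/123")
# Capture groups are how an extractor can pull values such as IDs
# out of the URL it was given.
post_id = match.group(1)

# A URL on a different domain does not match, so this is None.
no_match = re.match(pattern, "https://www.example.org/post/123")
```

Since the match object is what gets passed to __init__(), anything the extractor needs from the URL should be captured in a group.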

Sample code

from .common import Extractor, Message

sample_post = {
    "title": "hello world",
    "items": [
        {
            "media_id": "1",
            "filename": "1.jpg",
            "extension": "jpg",
            "url": "https://www.example.com/p1/1.jpg"
        },
        {
            "media_id": "2",
            "filename": "2.jpg",
            "extension": "jpg",
            "url": "https://www.example.com/p1/2.jpg"
        },
        {
            "media_id": "3",
            "filename": "3.jpg",
            "extension": "jpg",
            "url": "https://www.example.com/p1/3.jpg"
        },
        {
            "media_id": "4",
            "youtube_url": "https://www.youtube.com/watch?v=example"
        }
    ]
}

class BaseExampleExtractor(Extractor):
    category = "example"
    # below *_fmt variables are the default format strings to use, in case the user config did not specify any format string.
    directory_fmt = ("{category}", "{title}")
    filename_fmt = "{media_id}.{extension}"
    archive_fmt = "{media_id}"

class ExamplePostsExtractor(BaseExampleExtractor):
    subcategory = "posts"
    pattern = r"(?:https?://)?(?:www\.)?example\.com"

    def get_posts(self):
        return [sample_post]

    def items(self):
        posts = self.get_posts()

        for post in posts:
            # Yield the target directory for each post.
            # Provide metadata via the "post" variable in case the user config is defined to save metadata
            yield Message.Directory, post

            for item in post.get("items", []):
                if item.get("youtube_url"):
                    # Yield the item's URL to another extractor and, optionally, metadata.
                    # Gallery-DL will send this URL to any extractor that matches the URL pattern for further processing.
                    # In this case, it will use the Youtube extractor.
                    yield Message.Queue, item["youtube_url"], {}
                else:
                    # Yield the item's URL and item's metadata.
                    # Gallery-DL will proceed to download the item.
                    yield Message.Url, item["url"], item

Here, we create the BaseExampleExtractor class inheriting from Gallery-DL's Extractor class.

The BaseExampleExtractor class contains the base configuration for all your extractors.

Then, we create the ExamplePostsExtractor class, inheriting from BaseExampleExtractor, which implements the actual logic to download the items.

Separating the logic this way keeps things organized and scalable. Sure, you could use a single class and put all the logic there. But that does not scale well, is harder to maintain, and gets hard to read once the same extractor downloads multiple types (or subcategories) of items.

As a rule, try to have 1 extractor class per subcategory and have each extractor class inheriting from a base class with the default config for all your extractors.
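A rough sketch of that layout with a second subcategory is shown here. The Extractor class is an empty stand-in so the snippet runs without gallery-dl, and ExampleStoriesExtractor, the /posts and /stories URL paths, and the overridden directory_fmt are all hypothetical.

```python
class Extractor:
    """Empty stand-in for gallery-dl's Extractor base class."""
    category = ""
    subcategory = ""


class BaseExampleExtractor(Extractor):
    # Shared defaults for every extractor of this site.
    category = "example"
    directory_fmt = ("{category}", "{title}")
    filename_fmt = "{media_id}.{extension}"
    archive_fmt = "{media_id}"


class ExamplePostsExtractor(BaseExampleExtractor):
    subcategory = "posts"
    pattern = r"(?:https?://)?(?:www\.)?example\.com/posts"


class ExampleStoriesExtractor(BaseExampleExtractor):
    subcategory = "stories"
    pattern = r"(?:https?://)?(?:www\.)?example\.com/stories"
    # A subclass can still override a shared default where it needs to.
    directory_fmt = ("{category}", "stories", "{title}")
```

Each subclass only states what differs from the base class, so adding another subcategory does not require touching the existing ones.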

SpiffyChatterbox · 2024-11-30T00:44:09Z

For what it's worth, I've attempted some of this myself. I'm not a strong programmer, but thought doing some documentation might help me understand my gaps.

Here is my attempt: https://github.com/SpiffyChatterbox/gallery-dl/wiki

As well as some other discussions: #5750
