Extractor documentation #6489
AyluinReymaer started this conversation in General
Replies: 1 comment
-
For what it's worth, I've attempted some of this myself. I'm not a strong programmer, but I thought doing some documentation might help me understand my gaps. Here is my attempt: https://github.com/SpiffyChatterbox/gallery-dl/wiki There is also some other discussion in #5750.
-
The purpose of this discussion is to improve the documentation on how to create extractors.
I've noticed that there is no proper documentation on this topic, and it has been asked about multiple times in the past.
The best discussion I found that goes into more detail on how to create an extractor is this one: #1656
But it doesn't properly explain how to use Message.Queue, about which I managed to find slightly more info here: #1345
Also, the Message class has a docstring that explains a little of what it does, which also helped me understand what Message.Queue does.
So, I tried compiling as much information as I could find about this topic, taking into account the above 2 discussions.
Hopefully, this can later be added into the Wiki for easy access:
https://github.com/mikf/gallery-dl/wiki
Contributions are greatly appreciated, as there might be some things I got wrong or left incomplete.
I am still working on more details on this doc, but for now, this is what I have.
@mikf, if possible, can you confirm whether the below is correct, or if there is something I got wrong?
Extractor
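To create an extractor, the following 2 methods are important:
- items() should yield the messages below:
  - yield Message.Directory, metadata
    Sets the target directory for the files that follow; metadata is a dictionary with general metadata.
  - yield Message.Url, url, metadata
    Passes on the URL of one file to download, together with the metadata of that file.
  - yield Message.Queue, url, metadata
    Passes on a URL that should be handled by another extractor. For example, yield Message.Queue, youtube_url, metadata uses the youtube extractor to download the video. To name the extractor that should handle the URL, set the _extractor property and pass it as metadata.
- request() is used for HTTP requests. It works more or less like requests' session.request(), in that you'd do something like self.request(url, params=params, headers=headers).json() to, for example, fetch a JSON resource.
The below attributes are also important for your extractor: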
- category: The category of the extractor. You can think of this as the name of the target site. For example, on an extractor for facebook, you would put facebook here. If the same extractor can work for multiple sites (like all Wordpress sites using the Madara theme), you would put wordpress-madara or something like that.
- subcategory: What you are extracting. For example, instagram has multiple subcategories like posts, stories, reels, tagged posts, and others.
- *_fmt: The default format strings (overridden by the config), each with its own purpose:
  - directory_fmt: The default name of the directory when downloading items. Example: directory_fmt = ("{category}", "{username}")
  - filename_fmt: The default filename of the target item to download. Example: filename_fmt = "{media_id}.{extension}"
  - archive_fmt: The default ID under which an item is stored in the archive. Example: archive_fmt = "{media_id}"
- pattern: The regular expression that should match all URLs the extractor can handle. The resulting match object is the first real argument of an extractor's __init__().
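To make the attributes above more concrete, here is a small sketch of how a pattern and the *_fmt strings might play together. It only uses the standard library: the site example.org, its URL layout and the metadata values are made up for illustration, and str.format_map is just an approximation of gallery-dl's own formatter, which supports more than plain {field} replacement.

```python
import re

# Hypothetical site and URL layout, used only for illustration.
pattern = r"(?:https?://)?(?:www\.)?example\.org/user/([^/?#]+)/posts"

# The default format strings from the list above.
directory_fmt = ("{category}", "{username}")
filename_fmt = "{media_id}.{extension}"
archive_fmt = "{media_id}"

# The match object is what an extractor's __init__() receives.
match = re.match(pattern, "https://example.org/user/alice/posts")
username = match.group(1)

# Metadata as an extractor might build it for one file (made-up values).
metadata = {
    "category": "example",
    "username": username,
    "media_id": 12345,
    "extension": "jpg",
}

# Approximation only: gallery-dl uses its own formatter for these strings.
directory = "/".join(part.format_map(metadata) for part in directory_fmt)
filename = filename_fmt.format_map(metadata)
archive_id = archive_fmt.format_map(metadata)

print(directory)   # example/alice
print(filename)    # 12345.jpg
print(archive_id)  # 12345
```

Every field used in a format string has to exist in the metadata the extractor yields, which is why media_id and extension show up in the dictionary.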
Sample code
Here, we create the BaseExampleExtractor class, inheriting from gallery-dl's Extractor class. The BaseExampleExtractor class contains the base configuration for all your extractors.
Then, we create the ExamplePostsExtractor class, inheriting from BaseExampleExtractor, which implements the actual logic to download the items.
Separating the logic this way keeps things organized and scalable. Sure, you could use only 1 class and put all the logic in it. But that does not scale well, is harder to maintain, and gets hard to read once the same extractor downloads multiple types (or subcategories) of items.
As a rule, try to have 1 extractor class per subcategory, and have each extractor class inherit from a base class with the default config for all your extractors.
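The two-class layout described above could be sketched roughly as follows. This is not the code from gallery-dl itself: Message and Extractor here are tiny stand-ins so that the snippet runs without gallery-dl installed (in a real extractor module they would be imported from gallery-dl), and example.org, posts() and all the returned values are hypothetical.

```python
import re


class Message:
    """Stand-in for gallery-dl's Message identifiers (placeholder values)."""
    Directory = "directory"
    Url = "url"
    Queue = "queue"


class Extractor:
    """Minimal stand-in for gallery-dl's Extractor base class."""
    category = ""
    subcategory = ""

    def __init__(self, match):
        self.match = match
        self.url = match.string


class BaseExampleExtractor(Extractor):
    """Base configuration shared by all extractors for example.org."""
    category = "example"
    root = "https://example.org"
    directory_fmt = ("{category}", "{username}")
    filename_fmt = "{media_id}.{extension}"
    archive_fmt = "{media_id}"


class ExamplePostsExtractor(BaseExampleExtractor):
    """Extractor for the posts of one user (one class per subcategory)."""
    subcategory = "posts"
    pattern = r"(?:https?://)?example\.org/user/([^/?#]+)/posts"

    def __init__(self, match):
        BaseExampleExtractor.__init__(self, match)
        self.username = match.group(1)

    def posts(self):
        # Hypothetical helper with hard-coded data. A real extractor would
        # fetch this, e.g. with self.request(url, params=params).json().
        return [
            {"media_id": 1, "extension": "jpg",
             "file": self.root + "/media/1.jpg"},
            {"media_id": 2, "extension": "mp4",
             "external": "https://video.example.com/watch/2"},
        ]

    def items(self):
        yield Message.Directory, {"category": self.category,
                                  "username": self.username}
        for post in self.posts():
            post["username"] = self.username
            if "file" in post:
                # A file this extractor can download directly.
                yield Message.Url, post["file"], post
            else:
                # Hand the URL to another extractor. Setting
                # post["_extractor"] to an extractor class would pick
                # that extractor explicitly.
                yield Message.Queue, post["external"], post


url = "https://example.org/user/alice/posts"
match = re.match(ExamplePostsExtractor.pattern, url)
messages = list(ExamplePostsExtractor(match).items())
for message in messages:
    print(message[0], message[1])
```

A second subcategory (say, stories) would then be another small class inheriting from BaseExampleExtractor with its own subcategory, pattern and items().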