Specifying selectors for extracting links. #217

Open
ttaomae opened this issue Feb 2, 2023 · 4 comments · May be fixed by #689

Comments

@ttaomae

ttaomae commented Feb 2, 2023

I came across a site which uses an <area> tag with an href attribute to create links with a non-standard shape. I don't know if this is the correct way to approach this, but I was able to capture these links by implementing the following custom driver.

// Custom driver: pass both <a> and <area> link selectors to the crawler.
module.exports = async ({data, page, crawler}) => {
  await crawler.loadPage(page, data, [
    {selector: "a[href]", extract: "href", isAttribute: false},
    {selector: "area[href]", extract: "href", isAttribute: false},
  ]);
};

However, I did not see anything in the documentation hinting at this and it required reading through the source code to even determine that the driver is what I should be looking into.

Furthermore, I've noticed that defaultDriver.js has changed significantly over time, so it is not clear to me whether this approach will remain valid in the long run. And to emphasize that point, it is worth mentioning that this driver works in 0.7.1 but breaks in 0.8.0-beta.1 (though I realize that fixing it just requires changing module.exports = to export default).
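
For reference, a minimal sketch of the same driver written so that only the export line differs between the two versions mentioned above. It assumes crawler.loadPage keeps the three-argument form used in the driver above, which may not hold in later releases; the names LINK_SELECTORS and driver are illustrative only.

```javascript
// Selector list from the driver above. Assumption: isAttribute:false means
// "read the DOM property" rather than the raw HTML attribute.
const LINK_SELECTORS = [
  { selector: "a[href]", extract: "href", isAttribute: false },
  { selector: "area[href]", extract: "href", isAttribute: false },
];

// Driver body, identical for both module systems.
async function driver({ data, page, crawler }) {
  await crawler.loadPage(page, data, LINK_SELECTORS);
}

// Pick the export line that matches the crawler version:
//   0.7.1 (CommonJS):    module.exports = driver;
//   0.8.0-beta.1 (ESM):  export default driver;
```

Keeping the selector list in one constant means a future change to the driver format would only touch the function body and the export line.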

Would you consider implementing an easier way to configure the link extraction selectors? Or, if a custom driver is the recommended approach, is this documented somewhere?

@ikreymer
Member

ikreymer commented Feb 2, 2023

I just recommended using a custom driver in the other issue! Yes, these are all good points!
You're right, there's not an example of driver usage in the current readme, which is a bit of an oversight.

The example you have is the current best option. However, it would be fairly easy to add custom selectors via the command line, perhaps --selector a[href]:href --selector area[href]:href, which would then be passed to the driver in the same way as you have there. I think that'd be a pretty simple thing to add (we just want to be careful with the syntax).
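
To illustrate why the syntax needs care, here is a rough sketch of parsing the proposed selector:property form. The helper name parseSelectorArg is hypothetical and not part of the crawler; the --selector flag itself is only a proposal at this point in the thread. The sketch splits on the last colon, since CSS selectors can contain colons themselves (for example a:not(.skip)).

```javascript
// Hypothetical helper: parse one value of the proposed form
// "<css selector>:<property>" into the object shape used by the driver above.
function parseSelectorArg(value) {
  // Split on the LAST colon so pseudo-classes inside the selector survive.
  const idx = value.lastIndexOf(":");
  if (idx <= 0 || idx === value.length - 1) {
    throw new Error(`Invalid selector spec: ${value}`);
  }
  return {
    selector: value.slice(0, idx),
    extract: value.slice(idx + 1),
    isAttribute: false,
  };
}

// The two values from the example command line above.
const parsedSelectors = ["a[href]:href", "area[href]:href"].map(parseSelectorArg);
```

Even with this rule, a value such as a:hover with no property part would be read as selector "a" and property "hover", which is one reason a less ambiguous separator may be preferable.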

These are all good suggestions - for now you can use the driver script you have, we'll add to this ticket once we have a chance to add this!

The tool is still pre-1.0.0, so a few things are changing, like the switch to ES modules, but we hope to have a stable driver format in place soon!

@benoit74
Contributor

We are impacted by this issue as well at Kiwix, we have a website to ZIM relying on <area> as well.

Should we also develop a custom driver or would you recommend that we make a PR to add selectors via cmdline as suggested?

@tw4l
Member

tw4l commented Jun 18, 2024

> We are impacted by this issue as well at Kiwix, we have a website to ZIM relying on <area> as well.
>
> Should we also develop a custom driver or would you recommend that we make a PR to add selectors via cmdline as suggested?

Hi @benoit74, I'd suggest that a PR adding selectors via a cmdline argument would be the better, more flexible approach here. It shouldn't be too difficult: check whether the argument was provided (perhaps as a JSON string) and, if so, apply it by overriding the default selectors argument to extractLinks. It would also be worth adding some validation to ensure that the string being passed in is valid.
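
A rough sketch of the check-and-validate step described above, assuming the argument arrives as a JSON array of objects shaped like the ones in the driver at the top of this issue. The helper name selectorsFromJson, the fallback of extract to "href", and the exact error messages are assumptions for illustration, not the crawler's actual behavior.

```javascript
// Default selector list, matching the a[href] / href default mentioned in this thread.
const DEFAULT_SELECTORS = [
  { selector: "a[href]", extract: "href", isAttribute: false },
];

// Hypothetical helper: turn an optional JSON string into a validated selector list.
function selectorsFromJson(jsonText) {
  // Argument not provided: keep the default.
  if (jsonText === undefined || jsonText === "") {
    return DEFAULT_SELECTORS;
  }
  let value;
  try {
    value = JSON.parse(jsonText);
  } catch (e) {
    throw new Error(`Selectors argument is not valid JSON: ${e.message}`);
  }
  if (!Array.isArray(value) || value.length === 0) {
    throw new Error("Selectors argument must be a non-empty JSON array");
  }
  return value.map((item, i) => {
    if (!item || typeof item.selector !== "string" || item.selector === "") {
      throw new Error(`Entry ${i}: "selector" must be a non-empty string`);
    }
    return {
      selector: item.selector,
      // Assumption: fall back to "href" when no property is named.
      extract: typeof item.extract === "string" ? item.extract : "href",
      isAttribute: item.isAttribute === true,
    };
  });
}
```

The returned list could then be passed wherever the default selectors argument is currently used.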

@benoit74
Contributor

Thank you @tw4l for the detailed suggestions.

Just for the record, the work on this from Kiwix has been postponed to "later", which might mean "months". Should someone want to contribute to this issue, feel free; we will not collide on this. If we do start working on it, I will post here first.

ikreymer added a commit that referenced this issue Sep 18, 2024
…aram

- selectors are of the form [css selector]->[property to use] or [css selector]->@[attribute to use], default being 'a[href]->href'
- fixes #217
ikreymer linked a pull request Sep 18, 2024 that will close this issue
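
As an illustration of the selector form described in the commit message above, here is a minimal sketch of how such a string could be split into its parts. The function name parseLinkSelector and the returned object shape are assumptions modeled on the driver earlier in this issue, and the linked pull request may parse the form differently.

```javascript
// Hypothetical parser for "[css selector]->[property]" and
// "[css selector]->@[attribute]" strings.
function parseLinkSelector(spec) {
  // Use the last "->" so the selector part is kept whole.
  const idx = spec.lastIndexOf("->");
  if (idx === -1) {
    throw new Error(`Missing "->" in link selector: ${spec}`);
  }
  const selector = spec.slice(0, idx);
  let extract = spec.slice(idx + 2);
  // A leading "@" marks an HTML attribute instead of a DOM property.
  const isAttribute = extract.startsWith("@");
  if (isAttribute) {
    extract = extract.slice(1);
  }
  if (selector === "" || extract === "") {
    throw new Error(`Invalid link selector: ${spec}`);
  }
  return { selector, extract, isAttribute };
}

// The default named in the commit message, plus the <area> case from this issue.
const defaultLinkSelector = parseLinkSelector("a[href]->href");
const areaLinkSelector = parseLinkSelector("area[href]->@href");
```

Using -> as the separator sidesteps the colon ambiguity of the earlier selector:property idea, since -> is not valid inside a CSS selector outside of quoted strings.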