Add support for wrap_in #180

Closed
Kdecherf opened this issue Dec 2, 2018 · 4 comments · Fixed by #262

Comments

@Kdecherf
Collaborator

Kdecherf commented Dec 2, 2018

Some site configs have wrap_in([string]): [xpath] directives.

As this directive is not supported by fivefilters (see fivefilters/ftr-site-config#249 (comment)), there is no documentation describing what it does or how it works. Its name suggests that it instructs the content extractor to wrap each element matching [xpath] in a [string] tag, e.g.: wrap_in(figure): //img.

This directive would let us wrap span quotes in blockquote tags, as on this page: https://washingtonmonthly.com/magazine/january-february-march-2018/how-to-fix-facebook-before-it-fixes-us/
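
To make the guessed behaviour concrete, here is a minimal sketch of what wrap_in(figure): //img might do. It is written in Python with the standard library's ElementTree purely as an illustration: Graby itself is PHP, the function name `wrap_in` and the sample markup are assumptions, and ElementTree only understands a limited XPath subset (hence `.//img` rather than `//img`).

```python
import xml.etree.ElementTree as ET


def wrap_in(root, tag, path):
    """Wrap every element matching `path` in a new <tag> element (illustrative only)."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for node in root.findall(path):
        parent = parents.get(node)
        if parent is None:  # the root element cannot be wrapped in place
            continue
        wrapper = ET.Element(tag)
        index = list(parent).index(node)
        parent.remove(node)
        # Text that follows the node stays outside the wrapper.
        wrapper.tail, node.tail = node.tail, None
        wrapper.append(node)
        parent.insert(index, wrapper)
    return root


# Hypothetical sample markup, not taken from any real site.
doc = ET.fromstring('<div><p>Intro</p><img src="a.png"/> caption</div>')
wrap_in(doc, "figure", ".//img")
print(ET.tostring(doc, encoding="unicode"))
```

The matched img ends up as the only child of a new figure element, and the text after it remains a sibling of the wrapper.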

Should we implement it in Graby?

@j0k3r
Owner

j0k3r commented Dec 2, 2018

Might be an interesting feature.

@techexo
Contributor

techexo commented Jan 10, 2019

I wonder whether we could implement something more drastic, which may be more useful in the long term: the ability to transform an element (selected by an XPath) into another element.

In an ideal case, it could be possible to:

  1. transform elements selected by //span[contains(@class, 'epq-pull-quote')] (to take @Kdecherf's example) into blockquote;
  2. transform all elements selected by //p/span into strong, for websites that use CSS classes instead of semantic markup; that information is currently lost to php-readability's aggressive span stripping;
  3. transform //a[@class='dictionnary_link_wathever'] into span (then stripped by php-readability) to get rid of conjugation links on LeMonde (an alternative to dissolve, as explained here);
  4. transform //ul/li[div] into div (to take this example). This one might actually be trickier to do.

Finally, I think that methods able to do this (maybe they are called functions in PHP? I come from Python, sorry) would also make it easier to manipulate the DOM and avoid the use of regular expressions in php-readability.

@techexo
Contributor

techexo commented Jan 10, 2019

I've seen this on SO: https://stackoverflow.com/a/4675664, which shows how to delete an element selected by an XPath while keeping its content. It would be an easy answer to point 3 of my post above.
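
As a rough illustration of deleting an element while keeping its content, the sketch below does it in Python with ElementTree. It is not the code from the linked answer and not Graby's implementation; the function name `unwrap`, the class name `dict-link`, and the sample markup are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def unwrap(root, path):
    """Remove each element matching `path` but keep its text and children (illustrative only)."""
    for node in root.findall(path):
        # Rebuild the map on every pass: an earlier unwrap may have moved nodes.
        parents = {child: parent for parent in root.iter() for child in parent}
        parent = parents.get(node)
        if parent is None:  # the root element cannot be unwrapped
            continue
        index = list(parent).index(node)
        children = list(node)

        def append_before(text):
            # Attach text to whatever immediately precedes the node.
            if index == 0:
                parent.text = (parent.text or "") + text
            else:
                prev = parent[index - 1]
                prev.tail = (prev.tail or "") + text

        # The node's leading text joins the text that precedes the node.
        append_before(node.text or "")
        # The node's trailing text follows its last child, if it has any.
        if children:
            children[-1].tail = (children[-1].tail or "") + (node.tail or "")
        else:
            append_before(node.tail or "")
        parent.remove(node)
        for offset, child in enumerate(children):
            parent.insert(index + offset, child)
    return root


# Hypothetical sample markup with a made-up class name.
doc = ET.fromstring('<p>See <a class="dict-link" href="#">the <b>word</b></a> here.</p>')
unwrap(doc, ".//a[@class='dict-link']")
print(ET.tostring(doc, encoding="unicode"))
```

Most of the bookkeeping comes from ElementTree storing text in `text` and `tail` attributes; a DOM API with real text nodes would only need to move the child nodes.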

Renaming a node is a bit more troublesome, because it means selecting everything in that node, creating a new element, moving everything into it, then replacing the original node with the new one. However tricky it may be to implement, the method could then be used for all the examples above.
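
The rename steps described above can be sketched as follows. This is again an illustrative Python/ElementTree version, with an assumed function name `rename` and assumed sample markup. ElementTree would also allow simply assigning `node.tag`, but the sketch follows the create, move, replace sequence on purpose, since that is the approach the comment describes. ElementTree has no `contains()`, so the example matches the class attribute exactly.

```python
import xml.etree.ElementTree as ET


def rename(root, path, new_tag):
    """Replace each element matching `path` with a <new_tag> element (illustrative only)."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for node in root.findall(path):
        parent = parents.get(node)
        if parent is None:  # the root element cannot be replaced in place
            continue
        # 1. Create the new element, carrying over attributes and text.
        replacement = ET.Element(new_tag, dict(node.attrib))
        replacement.text, replacement.tail = node.text, node.tail
        # 2. Move every child into the new element.
        for child in list(node):
            node.remove(child)
            replacement.append(child)
            parents[child] = replacement  # keep the map valid for nested matches
        # 3. Put the new element where the old one was.
        index = list(parent).index(node)
        parent.remove(node)
        parent.insert(index, replacement)
    return root


# Hypothetical sample markup reusing the class name from the example above.
doc = ET.fromstring(
    '<div><span class="epq-pull-quote">Quote <em>here</em></span> after</div>'
)
rename(doc, ".//span[@class='epq-pull-quote']", "blockquote")
print(ET.tostring(doc, encoding="unicode"))
```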

Edit: it might be easier than I thought; see this link.
Edit 2: maybe also see this, to get rid of some of php-readability's regex prefilters.

@Kdecherf
Collaborator Author

@techexo thanks for your feedback. A way to transform/rename a node could be interesting; I think a separate issue for it would be very welcome ;-)

Kdecherf added a commit to Kdecherf/graby that referenced this issue May 22, 2021
Fixes j0k3r#180

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
Kdecherf added a commit to Kdecherf/graby that referenced this issue May 23, 2021
Fixes j0k3r#180

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
Kdecherf added a commit to Kdecherf/graby that referenced this issue May 29, 2021
Fixes j0k3r#180

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
@j0k3r j0k3r closed this as completed in #262 Jun 1, 2021