Using pup to modify html #51
Comments
Would love this, but it seems like this would need its own language to describe the modification (like |
I'm not aware of any easy/simple similar tool to do modifications. There is XSLT, which is more complex, but would support this kind of transformation. |
What about a flag that emits the original file, except that for each selected part the tool prints whatever the display function returns? While not complete, this sounds like a really good start. |
This actually relies on having a slightly more powerful set of display functions. Following jq, a good start could be to have: (1) a way to emit arbitrary text using a template string; (2) a way to emit HTML tags. |
@np these seem like inverse selectors rather than transformations. I actually like the idea of having a |
I was thinking of combining these templates with a global flag that would print back the context, namely what surrounds the selected parts. |
@np, could you provide an example of what that would look like? I'm having trouble understanding how this relates to transformations. |
Ok, this would work as follows, let's say we have a new flag First, running pup with a selector but no display function would be pretty useless as this would print the original page. So Now combined with a display function one can transform/edit the page. For instance assuming a
This command would wrap all |
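A minimal sketch of the idea discussed above, written in Python rather than in pup itself: the surrounding context is printed back untouched, and each selected element is replaced by the output of a template "display function". The class name `TemplateRewriter`, the `{text}` placeholder, and selection by bare tag name are all assumptions made for illustration; pup has no such flag, and a real implementation would use its CSS selectors.

```python
from html.parser import HTMLParser


class TemplateRewriter(HTMLParser):
    """Echo the input HTML, replacing each <tag>...</tag> with a template.

    Hypothetical illustration only: selection is by bare tag name, and the
    tag must be a non-void element (one that has a closing tag).
    """

    def __init__(self, tag, template):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.tag = tag
        self.template = template
        self.out = []    # pieces of the rewritten document
        self.text = []   # text collected inside the current selection
        self.depth = 0   # > 0 while inside a selected element

    def _emit(self, raw):
        # Text goes to the selection buffer or straight to the output.
        (self.text if self.depth else self.out).append(raw)

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1  # entering (or nesting inside) a selection
        elif not self.depth:
            self.out.append(self.get_starttag_text())  # context, verbatim

    def handle_startendtag(self, tag, attrs):
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            self.out.append(f"</{tag}>")
        elif tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # Selection closed: print the template instead of the element.
                self.out.append(self.template.format(text="".join(self.text)))
                self.text = []

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        if not self.depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite(html_text, tag, template):
    parser = TemplateRewriter(tag, template)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


page = "<body><h1>Hi &amp; <em>bye</em></h1><p>text</p></body>"
print(rewrite(page, "h1", '<h2 class="title">{text}</h2>'))
```

Markup nested inside a selected element is dropped and only its text is kept, which is one of several behaviours a real flag would have to define. Because the template goes through `str.format`, literal braces in it would need to be doubled.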
I was going to open an issue but ended up here instead. I have that need too, and using Since you guys added the Also thank you for that incredible piece of software. |
For converting any markup formats, or similar, pandoc is probably the best idea. https://github.com/jgm/pandoc Conversion from/to JSON and any other supported format should also be possible. |
Thank you! This looks nice indeed. I was looking for a way to safely replace stuff inside the HTML, but I might have found a way using |
Here's an idea for how to accomplish this with pup, elaborating on @tehmoon's statement: Add a new output display function
{
  "match": [
    {
      "tag": "p",
      "text": "This is my website."
    },
    {
      "tag": "p",
      "text": "I hope you like it."
    },
    {
      "tag": "p",
      "text": "Ok cya later."
    }
  ],
  "tree": {
    "children": [
      {
        "children": [
          {
            "children": [
              {
                "tag": "title",
                "text": "Hello World"
              }
            ],
            "tag": "head"
          },
          {
            "children": [
              {
                "children": [
                  {
                    "alt": "logo",
                    "src": "https://example.com/logo.png",
                    "tag": "img"
                  },
                  {
                    "tag": "h1",
                    "text": "Hello World"
                  }
                ],
                "tag": "header"
              },
              {
                "children": [
                  {
                    "match": 0
                  },
                  {
                    "match": 1
                  },
                  {
                    "match": 2
                  }
                ],
                "tag": "div"
              }
            ],
            "tag": "body"
          }
        ],
        "tag": "html"
      }
    ],
    "tag": ""
  }
}
Then jq could be used to do the actual mutation:
{
  "match": [
    {
      "tag": "p",
      "text": "This is my website.",
      "class": "foobar"
    },
    {
      "tag": "p",
      "text": "I hope you like it.",
      "class": "foobar"
    },
    {
      "tag": "p",
      "text": "Ok cya later.",
      "class": "foobar"
    }
  ],
  "tree": {
    "children": [
      {
        "children": [
          {
            "children": [
              {
                "tag": "title",
                "text": "Hello World"
              }
            ],
            "tag": "head"
          },
          {
            "children": [
              {
                "children": [
                  {
                    "alt": "logo",
                    "src": "https://example.com/logo.png",
                    "tag": "img"
                  },
                  {
                    "tag": "h1",
                    "text": "Hello World"
                  }
                ],
                "tag": "header"
              },
              {
                "children": [
                  {
                    "match": 0
                  },
                  {
                    "match": 1
                  },
                  {
                    "match": 2
                  }
                ],
                "tag": "div"
              }
            ],
            "tag": "body"
          }
        ],
        "tag": "html"
      }
    ],
    "tag": ""
  }
}
And finally pup could convert this back into HTML:
<html>
  <head>
    <title>Hello World</title>
  </head>
  <body>
    <header>
      <img src="https://example.com/logo.png" alt="logo">
      <h1>Hello World</h1>
    </header>
    <div>
      <p class="foobar">This is my website.</p>
      <p class="foobar">I hope you like it.</p>
      <p class="foobar">Ok cya later.</p>
    </div>
  </body>
</html>
This assumes that the tree output by pup contains all of the information necessary to reconstruct the original HTML (semantically, at least). This seems to be mostly true, but notably the doctype seems to be omitted by The exact behavior would need to be worked out:
Implementing this shouldn't be too hard. All that would be needed:
The biggest benefit of this approach is that it obviates the need to implement some sort of mutation/templating DSL inside pup, since many other utilities like jq have already done that and done it well. This leaves pup to do one thing, per the Unix philosophy: parse HTML.
Unfortunately, the project seems largely unmaintained, so I don't feel super comfortable attempting to implement this if the PR would just sit unnoticed for months. It's a shame because I think pup fills an important niche in the world of Unix utilities. I think it could be even more synergistic with jq and other Unix utilities if something like this were implemented.
Editing to say that, upon closer inspection of the issues, several bugs/inconsistencies in the JSON output would need to be fixed before this would work dependably:
Really, although this feature itself would be simple, the prerequisite would be that pup can properly parse all valid HTML, preserving all information necessary to reconstruct it in the JSON output, which is no small feat.
I can think of a few alternative but similar strategies that might mitigate some of these issues. For example, rather than expecting pup to be able to properly parse the whole tree, the input HTML could be passed to both invocations of pup, and for the replacement you would pass pup a mutated version of the JSON output from the first invocation.
Alternatively, pup could support a '--mutate-cmd' flag which would accept a command that pup would run on the matched JSON and use the output to update the HTML. This could behave similarly to how xargs works. An added benefit over the previous suggestions would be that only one invocation of pup would be necessary. |
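To make the proposed round trip concrete, here is a rough Python sketch of the last two steps: mutating the matched nodes (the part jq would do) and turning the tree back into HTML (the part pup would do). Everything here is an assumption based on the example JSON in this comment: the `match`/`tree` layout does not exist in pup today, `render` and `VOID_TAGS` are made-up names, and a node's text is assumed to come before its children, which loses the real ordering of mixed content.

```python
import json
from html import escape

# Keys that describe the node itself; every other key is treated as an attribute.
RESERVED = {"tag", "text", "children"}
# A few void elements that must not get a closing tag (not an exhaustive list).
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


def render(node, matches):
    """Serialize one node of the hypothetical JSON tree back to HTML."""
    if set(node) == {"match"}:
        # A {"match": i} placeholder stands for the i-th matched node.
        node = matches[node["match"]]
    tag = node.get("tag", "")
    attrs = "".join(
        f' {key}="{escape(str(value))}"'
        for key, value in node.items()
        if key not in RESERVED
    )
    # Simplification: text first, then children.
    inner = escape(node.get("text", ""), quote=False) + "".join(
        render(child, matches) for child in node.get("children", [])
    )
    if not tag:
        return inner  # the anonymous root only wraps its children
    if tag in VOID_TAGS:
        return f"<{tag}{attrs}>"
    return f"<{tag}{attrs}>{inner}</{tag}>"


doc = json.loads("""
{
  "match": [
    {"tag": "p", "text": "This is my website."},
    {"tag": "p", "text": "Ok cya later."}
  ],
  "tree": {"tag": "", "children": [
    {"tag": "html", "children": [
      {"tag": "body", "children": [
        {"tag": "img", "alt": "logo", "src": "https://example.com/logo.png"},
        {"tag": "div", "children": [{"match": 0}, {"match": 1}]}
      ]}
    ]}
  ]}
}
""")

# The mutation step: add a class to every matched element.
for matched in doc["match"]:
    matched["class"] = "foobar"

html_out = render(doc["tree"], doc["match"])
print(html_out)
```

Keeping the matches in a separate list is what makes the mutation step trivial: the editing tool only touches `match` and never has to walk the tree.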
Sketch of an imaginary pup/hx/xmlstarlet-like tool that is able to modify HTML (I'm dropping it here as my idea of how a tool like this could work)
Commands
Examples
# move charset meta tag to top
hu 'meta[charset]' move 'head::before'
# remove empty links
hu 'a:empty' del
# extract all embedded CSS, process it with an external tool (a hypothetical cssmin), and paste it back
css=$(hu 'style' sel -c <in.html | cssmin)
hu 'style' del <in.html | hu 'head:before' setf '<style>%s</style>' "$css" >out.html
Other ideas
|
I have a use case where I want to add a class="table" to all table tags that don't already have that class specified. Currently I use a hacky sed to do it, but I was wondering if pup could be a more robust way of doing that.
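For comparison with the sed approach, here is a small sketch of this use case using only Python's standard `html.parser`. It is not something pup can do today, and the names `TableClassAdder` and `add_table_class` are invented for this example. It assumes that a table with other classes should get `table` appended to its existing class list, and that everything else should pass through as written.

```python
from html import escape
from html.parser import HTMLParser


class TableClassAdder(HTMLParser):
    """Re-emit the input HTML, adding class "table" to <table> tags lacking it."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag != "table" or "table" in classes:
            # Nothing to change: copy the start tag verbatim from the source.
            self.out.append(self.get_starttag_text())
            return
        attr_map["class"] = " ".join(classes + ["table"])
        # The parser unescapes attribute values, so escape them again.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attr_map.items()
        )
        self.out.append(f"<table{rendered}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def add_table_class(html_text):
    parser = TableClassAdder()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(add_table_class('<table id="x" class="wide"><tr><td>1</td></tr></table>'))
```

Unlike a regular expression, this only rewrites real `<table>` start tags, so the word "table" in text, comments, or other attributes is left alone. End tags are re-emitted in lowercase, so the output is not guaranteed to be byte-identical to the input.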