Using pup to modify html #51

Open
bramp opened this issue Aug 26, 2015 · 13 comments

Comments

@bramp

bramp commented Aug 26, 2015

I have a use case where I want to add class="table" to all table tags that don't already have that class specified. Currently I use a hacky sed to do it, but I was wondering if pup could be a more robust way of doing that.
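
For comparison, this use case can be handled outside pup with a small script. The following is a minimal sketch in Python using only the standard library's html.parser; it is not a pup feature. It copies the input through and rewrites only table start tags. The names TableClassAdder and add_table_class are made up for illustration, and the sketch assumes that "already has that class" means the class attribute contains the token table. Rewritten tags are rebuilt from the parsed attributes, so their quoting is normalised and duplicate attributes collapse.

```python
from html import escape
from html.parser import HTMLParser


class TableClassAdder(HTMLParser):
    """Re-emit HTML as parsed, adding class "table" to <table> tags lacking it."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # the start tag exactly as in the source
        if tag != "table":
            self.out.append(raw)
            return
        attrs = dict(attrs)  # assumption: no duplicate attributes
        tokens = (attrs.get("class") or "").split()
        if "table" in tokens:
            self.out.append(raw)
            return
        attrs["class"] = " ".join(tokens + ["table"])
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs.items()
        )
        self.out.append(f"<table{rendered}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def add_table_class(html_text):
    parser = TableClassAdder()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(add_table_class("<table><tr><td>x</td></tr></table>"))
# prints: <table class="table"><tr><td>x</td></tr></table>
```

Unlike a sed one-liner, this only touches real start tags, so the word "table" inside text, comments or attribute values is left alone.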

@ericchiang
Owner

Would love this, but it seems like this would need its own language to describe the modification (like s/foo/bar/ for sed). Know of any tools that do something similar?

@bramp
Author

bramp commented Aug 30, 2015

I'm not aware of any easy/simple similar tool to do modifications. There is XSLT, which is more complex, but would support this kind of transformation.

@np

np commented Dec 20, 2015

What about a flag that emits the original file, except that for each selected part the tool prints what the display function returns?

While not complete, this sounds like a really good start.

@np

np commented Dec 20, 2015

This actually relies on having a slightly more powerful set of display functions. Following jq, a good start could be to have: (1) a way to emit arbitrary text using a template string; (2) a way to emit HTML tags.

@ericchiang
Owner

@np these seem like inverse selectors rather than transformations. I actually like the idea of having a --template flag of some sort, but I think these are new issues.

@np

np commented Dec 22, 2015

I was thinking of combining these templates with a global flag that would print back the context, namely what surrounds the selected parts.

@ericchiang
Owner

@np, could you provide an example of what that would look like? I'm having trouble understanding how this relates to transformations.

@np

np commented Dec 24, 2015

OK, this would work as follows. Let's say we have a new flag, --transform for instance.

First, running it with a selector but no display function would be pretty useless, as this would just print the original page. So pup --transform <SOME_SELECTOR> would do the same as just pup.

Now combined with a display function one can transform/edit the page. For instance assuming a --template we could write something such as:

pup --transform --template '<div class=foo>%s</div>' a

This command would wrap all a tags with a div tag.
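
To make the proposed semantics concrete, here is a rough sketch in Python (standard library only) of "print the original page, but replace each selected part with the template output". Neither --transform nor --template exists in pup; SpanFinder and transform are hypothetical names, and the selector is reduced to a bare tag name to keep the sketch short. It locates each outermost matching element by character offset and splices the template result into the otherwise untouched source.

```python
from html.parser import HTMLParser


class SpanFinder(HTMLParser):
    """Record (start, end) character offsets of every outermost <tag> element."""

    def __init__(self, text, tag):
        super().__init__(convert_charrefs=False)
        self.text, self.tag = text, tag
        # Absolute offset of the first character of each line, to turn the
        # (line, column) pairs returned by getpos() into string offsets.
        self.line_starts = [0]
        for i, ch in enumerate(text):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.depth, self.start, self.spans = 0, None, []

    def _offset(self):
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.start = self._offset()
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # getpos() points at the "<" of the end tag; include its ">".
                end = self.text.index(">", self._offset()) + 1
                self.spans.append((self.start, end))


def transform(text, tag, template):
    """Copy text through, replacing each matched element with template % element."""
    finder = SpanFinder(text, tag)
    finder.feed(text)  # the whole document must be fed in one call
    finder.close()
    out, pos = [], 0
    for start, end in finder.spans:
        out.append(text[pos:start])
        out.append(template % text[start:end])
        pos = end
    out.append(text[pos:])
    return "".join(out)


print(transform('<p><a href="/x">x</a></p>', "a", "<div class=foo>%s</div>"))
# prints: <p><div class=foo><a href="/x">x</a></div></p>
```

Because everything outside the matched spans is copied verbatim, whitespace, comments and the doctype survive unchanged, which is the main attraction of this approach over re-serialising a parsed tree.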

@tehmoon

tehmoon commented May 30, 2019

I was going to open an issue but ended up here instead. I have that need too, and using sed is incredibly annoying for html.

Since you added the json{} display function, is there a way to import JSON instead? I feel like everything could be done in jq if it were possible to go from JSON back to HTML. Alas, I did not find any tool to do so. Open to suggestions!

Also thank you for that incredible piece of software.

@Hrxn

Hrxn commented May 30, 2019

For converting any markup formats, or similar, pandoc is probably the best idea.

https://github.com/jgm/pandoc
https://pandoc.org/

Conversion from/to JSON and any other supported format should also be possible.

@tehmoon

tehmoon commented May 30, 2019

Thank you! This looks nice indeed. I was looking for a way to safely replace stuff inside the HTML, but I might have found a way: use pup -i 0 to format everything nicely, then sed to match the line. This way it is a little easier and safer to apply a regex with sed.

@b0o

b0o commented Jul 29, 2020

Here's an idea for how to accomplish this with pup, elaborating on @tehmoon's statement:

Add a new output display function json-full{} which returns an object containing the matched elements along with the full HTML tree:

$ < input.html pup 'p json-full{}' 
{
  "match": [
    {
      "tag": "p",
      "text": "This is my website."
    },
    {
      "tag": "p",
      "text": "I hope you like it."
    },
    {
      "tag": "p",
      "text": "Ok cya later."
    }
  ],
  "tree": {
    "children": [
      {
        "children": [
          {
            "children": [
              {
                "tag": "title",
                "text": "Hello World"
              }
            ],
            "tag": "head"
          },
          {
            "children": [
              {
                "children": [
                  {
                    "alt": "logo",
                    "src": "https://example.com/logo.png",
                    "tag": "img"
                  },
                  {
                    "tag": "h1",
                    "text": "Hello World"
                  }
                ],
                "tag": "header"
              },
              {
                "children": [
                  {
                    "match": 0
                  },
                  {
                    "match": 1
                  },
                  {
                    "match": 2
                  }
                ],
                "tag": "div"
              }
            ],
            "tag": "body"
          }
        ],
        "tag": "html"
      }
    ],
    "tag": ""
  }
}

Then jq could be used to do the actual mutation:

$ < input.html pup 'p json-full{}' | jq '.match = (.match | map(.class = "foobar"))'
{
  "match": [
    {
      "tag": "p",
      "text": "This is my website.",
      "class": "foobar"
    },
    {
      "tag": "p",
      "text": "I hope you like it.",
      "class": "foobar"
    },
    {
      "tag": "p",
      "text": "Ok cya later.",
      "class": "foobar"
    }
  ],
  "tree": {
    "children": [
      {
        "children": [
          {
            "children": [
              {
                "tag": "title",
                "text": "Hello World"
              }
            ],
            "tag": "head"
          },
          {
            "children": [
              {
                "children": [
                  {
                    "alt": "logo",
                    "src": "https://example.com/logo.png",
                    "tag": "img"
                  },
                  {
                    "tag": "h1",
                    "text": "Hello World"
                  }
                ],
                "tag": "header"
              },
              {
                "children": [
                  {
                    "match": 0
                  },
                  {
                    "match": 1
                  },
                  {
                    "match": 2
                  }
                ],
                "tag": "div"
              }
            ],
            "tag": "body"
          }
        ],
        "tag": "html"
      }
    ],
    "tag": ""
  }
}

And finally pup could convert this back into HTML:

$ < input.html pup 'p json-full{}' | jq '.match = (.match | map(.class = "foobar"))' | pup --from-json
<html>
  <head>
    <title>Hello World</title>
  </head>
  <body>
    <header>
      <img src="https://example.com/logo.png" alt="logo">
      <h1>Hello World</h1>
    </header>
    <div>
      <p class="foobar">This is my website.</p>
      <p class="foobar">I hope you like it.</p>
      <p class="foobar">Ok cya later.</p>
    </div>
  </body>
</html>

This assumes that the tree output by pup contains all of the information necessary to reconstruct the original HTML (semantically, at least). This seems to be mostly true, but notably the doctype seems to be omitted by pup '* json{}'.

The exact behavior would need to be worked out:

  • should matched nodes appear only in the matches array, or in both the matches array and their original position in the DOM tree?
  • should the matches array contain the path to the matched node?

Implementing this shouldn't be too hard. All that would be needed:

  1. The json-full{} (pending a better name) display function:
     • the parsed tree should already be available
     • annotate the parsed tree with the match index on matched nodes
     • construct the json output with the matches + the annotated tree
  2. The ability to reverse the JSON output back to HTML. I haven't looked into pup's source but I would imagine some of the existing code for parsing the HTML could be used.
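
The JSON-to-HTML step in the list above could look roughly like the following Python sketch. It is not pup code: json-full{} and --from-json are only proposed in this comment, and the function names, the VOID set and the rule "an object whose only key is match is a placeholder" are assumptions made for illustration. It resolves each placeholder against the match array and serialises the tree on one line.

```python
from html import escape

# Elements that never take a closing tag (subset of the HTML void elements).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
RESERVED = {"tag", "text", "children"}  # every other key is an attribute


def render(node, matches):
    """Serialise one node of the (hypothetical) json-full{} tree to HTML."""
    if set(node) == {"match"}:  # placeholder: splice in the matched node
        node = matches[node["match"]]
    attrs = "".join(
        f' {k}="{escape(str(v), quote=True)}"'
        for k, v in node.items() if k not in RESERVED
    )
    # The JSON does not record how text and child elements interleave,
    # so this sketch simply emits the text first.
    inner = escape(node.get("text", ""), quote=False) + "".join(
        render(child, matches) for child in node.get("children", [])
    )
    tag = node["tag"]
    if tag == "":  # synthetic root node: emit its children only
        return inner
    if tag in VOID:
        return f"<{tag}{attrs}>"
    return f"<{tag}{attrs}>{inner}</{tag}>"


def from_json(doc):
    return render(doc["tree"], doc["match"])


doc = {
    "match": [{"tag": "p", "text": "Ok cya later."}],
    "tree": {"tag": "", "children": [
        {"tag": "div", "children": [{"match": 0}]},
    ]},
}
for node in doc["match"]:  # stands in for the jq step
    node["class"] = "foobar"
print(from_json(doc))
# prints: <div><p class="foobar">Ok cya later.</p></div>
```

The comment about interleaving is where the information-loss concern becomes visible: anything the JSON does not carry (doctype, comments, text order) cannot be restored by this step.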

The biggest benefit of this approach is that it obviates the need to implement some sort of mutation/templating DSL inside pup, since many other utilities like jq have already done that and done it well. This leaves pup to do one thing, per the unix philosophy: parse HTML.

Unfortunately, the project seems largely unmaintained, so I don't feel super comfortable attempting to implement this if the PR would just sit unnoticed for months. It's a shame because I think pup fills an important niche in the world of unix utilities. I think it could be even more synergistic with jq and other unix utilities if something like this were implemented.


Editing to say that, upon closer inspection of the issues, several bugs/inconsistencies in the JSON output would need to be fixed before this would work dependably:

Really, although this feature itself would be simple, the prerequisite would be that pup can properly parse all valid HTML, preserving all information necessary to reconstruct it in the JSON output, which is no small feat.

I can think of a few alternative but similar strategies that might mitigate some of these issues - e.g. rather than expecting pup to be able to properly parse the whole tree, the input HTML could be passed to both invocations of pup, and for the replacement you would pass pup a mutated version of the JSON output from the first invocation.

Alternatively, pup could support a '--mutate-cmd' flag which would accept a command that pup would run on the matched JSON, using the command's output to update the HTML. This could behave similarly to how xargs works. An added benefit over the previous suggestions would be that only one invocation of pup would be necessary.
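
The core of such a flag might be sketched as follows, in Python for brevity. '--mutate-cmd' is only an idea in this comment, and mutate is a hypothetical helper: it pipes the matched nodes as JSON to an external command on stdin and parses what the command prints. The check that the command returns one node per match is an assumption about how the results would be mapped back onto the tree.

```python
import json
import subprocess


def mutate(matches, cmd):
    """Pipe the matched nodes as JSON through cmd and parse what it prints."""
    result = subprocess.run(
        cmd,                        # argument list, e.g. ["jq", "..."]
        input=json.dumps(matches),  # matched nodes go to the command's stdin
        capture_output=True,
        text=True,
        check=True,                 # raise if the command exits non-zero
    )
    mutated = json.loads(result.stdout)
    if not isinstance(mutated, list) or len(mutated) != len(matches):
        raise ValueError("command must return one JSON node per matched node")
    return mutated


# Example call, assuming jq is installed:
#   mutate([{"tag": "p", "text": "hi"}], ["jq", 'map(.class = "foobar")'])
```

Passing the command as an argument list avoids shell quoting problems; a real flag would have to decide whether the command sees all matches at once, as here, or one match per invocation as xargs -n 1 would.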

@gizlu

gizlu commented Sep 20, 2022

A sketch of an imaginary pup/hx/xmlstarlet-like tool that can modify HTML (I'm dropping it here as my idea of how a tool like this could work):

cat file.html | hu 'selectors' [command] command_arg...

Commands

sel [-c]
  extract elements matching selector. If multiple elements are matched, they are concatenated together
    -c print content only. Without -c start/end tags are printed as well
del
  remove matched elements from html
set sth
  replace each match with supplied string. Selectors might use pseudo-classes :before and :after
setf fmt ...
  set, but printf-like. Post-fmt args can use funcs like match() or file()
move dest-sel
  move match into supplied destination. dest-sel must use pseudo-class like :before or :after
aset key value
  set attribute of each matched element

Examples

# move charset meta tag to top
hu 'meta[charset]' move 'head::before'
# remove empty links
hu 'a:empty' del
# extract all embedded css, process it with an external tool (hypothetical cssmin), and paste it back
css=$(hu 'style' sel -c <in.html | cssmin)
hu 'style' del <in.html | hu 'head:before' setf '<style>%s</style>' "$css" >out.html

Other ideas

  • Add "proprietary" pseudoselectors that insert stuff before/after the match. Yes, there are standard :before and :after, but they don't really do what you would expect from their names (they insert stuff at the beginning and end of the match)
  • aget [-c] [keys] command - get attributes of matched elem
  • count command - print count of occurrences of each selector (possible use case: removing dead css)
  • a command that would spawn another program, supply the match to it, and replace the match with its output (it would make the cssmin example simpler)
