Fast HTML5 Parser (Crystal binding for awesome lexborisov's myhtml and Modest). This shard used in production to parse millions of pages per day, very stable and fast.
WARNING: original libraries (myhtml and Modest) not maintained since july 2020, i recommend switch to successor parser: Lexbor.
Add this to your application's shard.yml
:
dependencies:
myhtml:
github: kostya/myhtml
And run shards install
require "myhtml"
html = <<-HTML
<html>
<body>
<div id="t1" class="red">
<a href="/#">O_o</a>
</div>
<div id="t2"></div>
</body>
</html>
HTML
myhtml = Myhtml::Parser.new(html)
myhtml.nodes(:div).each do |node|
id = node.attribute_by("id")
if first_link = node.scope.nodes(:a).first?
href = first_link.attribute_by("href")
link_text = first_link.inner_text
puts "div with id #{id} have link [#{link_text}](#{href})"
else
puts "div with id #{id} have no links"
end
end
# Output:
# div with id t1 have link [O_o](/#)
# div with id t2 have no links
require "myhtml"
html = <<-HTML
<html>
<body>
<table id="t1">
<tr><td>Hello</td></tr>
</table>
<table id="t2">
<tr><td>123</td><td>other</td></tr>
<tr><td>foo</td><td>columns</td></tr>
<tr><td>bar</td><td>are</td></tr>
<tr><td>xyz</td><td>ignored</td></tr>
</table>
</body>
</html>
HTML
myhtml = Myhtml::Parser.new(html)
p myhtml.css("#t2 tr td:first-child").map(&.inner_text).to_a
# => ["123", "foo", "bar", "xyz"]
p myhtml.css("#t2 tr td:first-child").map(&.to_html).to_a
# => ["<td>123</td>", "<td>foo</td>", "<td>bar</td>", "<td>xyz</td>"]
git clone https://github.com/kostya/myhtml.git
cd myhtml
make
crystal spec
Parse 1000 times google page(600Kb), and 1000 times css select. myhtml-program, crystagiri-program, nokogiri-program
Lang | Shard | Lib | Parse time, s | Css time, s | Memory, MiB |
---|---|---|---|---|---|
Crystal | lexbor | lexbor | 2.54 | 0.099 | 7.8 |
Crystal | myhtml | myhtml(+modest) | 3.17 | 0.16 | 8.4 |
Ruby 2.7 | Nokogiri | libxml2 | 9.19 | 10.76 | 139.8 |
Crystal | Crystagiri | libxml2 | 11.27 | - | 25.0 |