HTML::Untemplate - web scraping assistant
version 0.019
Suppose you have a set of HTML documents generated by populating the same template with data from some kind of database. HTML::Untemplate is a set of command-line tools ("xpathify", "untemplate") and modules (HTML::Linear and its dependencies) which assist in retrieving the original data.
This process is also known as wrapper induction.
To achieve this goal, HTML tree nodes are presented as XPath/content pairs. HTML documents linearized this way can be easily inspected manually or with a diff tool. Please refer to "EXAMPLES".
Despite being named similarly to HTML::Template, this distribution is not directly related to it. Instead, it attempts to reverse the templating action, whatever templating engine was used.
Suppose you have a CMS. A typical CMS works roughly like this (data flows from top to bottom):
RDBMS
scripting language
HTML
HTTP server
(...)
HTTP agent
layout engine
screen
user
Consider the first 3 steps: RDBMS => scripting language => HTML
This is "applying a template".
Now, consider this: HTML => scripting language => RDBMS
I would call that "un-applying a template", or "untemplate" :)
The practical application of this set of tools is to assist in the creation of web scrapers.
A similar (though completely unrelated) approach is described in the paper "XPath-Wrapper Induction for Data Extraction".
Consider the following HTML node address representations:
0.1.3.0.0.4.0.0.0.2 (HTML::TreeBuilder internal address representation);
/html/body/div[4]/div/div[1]/table[2]/tr/td/ul/li[3] (HTML::Linear, strict);
//td[1]/ul[1]/li[3] (HTML::Linear, strict, shrink);
/html/body[@class='section_home']/div[@id='content_holder'][1]/div[@id='content']/div[@id='main']/table[@class='content_table'][2]/tr/td/ul/li[@class='rss_content rss_content_col'][2] (HTML::Linear, non-strict);
//li[@class='rss_content rss_content_col'][2] (HTML::Linear, non-strict, shrink).
They all point to the same node; however, their verbosity and readability vary. Strict mode specifies tag names and positions only. Disabling strict mode adds the data commonly used by CSS selectors (the id and class attributes, as in the examples above). Shrink mode attempts to find the shortest XPath that is unique for every node (/html/body is shared among almost all nodes, and is thus likely to be irrelevant).
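The idea behind shrink mode can be sketched in a few lines. The Python snippet below is only an illustration under simplifying assumptions, not the algorithm HTML::Linear actually implements: it treats every XPath as a list of steps, assumes no predicate contains a "/", and keeps the shortest trailing run of steps that no other path in the set ends with. The shrink function and the sample paths are hypothetical.

```python
def shrink(paths):
    """Map each full path to its shortest '//'-prefixed suffix that is
    unique within the given set (falls back to the full path)."""
    split = [p.strip("/").split("/") for p in paths]
    result = {}
    for path, steps in zip(paths, split):
        result[path] = path  # fallback when no shorter unique suffix exists
        for n in range(1, len(steps)):
            suffix = steps[-n:]
            # Count how many paths end with exactly these n steps.
            owners = [s for s in split if s[-n:] == suffix]
            if len(owners) == 1:
                result[path] = "//" + "/".join(suffix)
                break
    return result


# Hypothetical strict-mode paths, for illustration only.
paths = [
    "/html/body[1]/h1[1]",
    "/html/body[1]/p[1]",
    "/html/body[1]/p[2]",
    "/html/body[1]/p[2]/b[1]",
    "/html/body[1]/div[1]/p[1]",
]
shrunk = shrink(paths)
for full in paths:
    print(full, "=>", shrunk[full])
```

Here p[1] occurs under both body[1] and div[1], so those two paths need one extra step to stay distinct, while h1[1], p[2] and b[1] are already unique on their own.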
The xpathify tool flattens the HTML tree into a key/value list:
<!DOCTYPE html>
<html>
<head>
<title>Hello HTML</title>
</head>
<body>
<h1>Hello World!</h1>
<p>This is a sample HTML</p>
Beware!
<p>HTML is <b>not</b> XML!</p>
Have a nice day.
</body>
</html>
Becomes:
/html/head[1]/title[1]/text() | Hello HTML |
/html/body[1]/h1[1]/text() | Hello World! |
/html/body[1]/p[1]/text() | This is a sample HTML |
/html/body[1]/p[1]/text() | Beware! |
/html/body[1]/p[2]/text() | HTML is |
/html/body[1]/p[2]/b[1]/text() | not |
/html/body[1]/p[2]/text() | XML! |
/html/body[1]/text() | Have a nice day. |
The keys are in XPath format, while the values are the respective content from the HTML tree. In theory, the HTML tree could be reassembled from the flat key/value list this tool generates.
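To make the flattening step concrete, here is a minimal sketch in Python using only the standard library. It is not how xpathify or HTML::Linear is implemented; it merely mimics the strict-mode key format shown above, assumes well-formed input (every non-void element is explicitly closed), and uses a made-up sample document. The Flattener class and the flatten function are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Flattener(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, step, child tag counts) per open element
        self.root_counts = {}  # tag counts at document level
        self.pairs = []        # resulting (xpath, value) list

    def _path(self):
        return "/" + "/".join(step for _, step, _ in self.stack)

    def handle_starttag(self, tag, attrs):
        counts = self.stack[-1][2] if self.stack else self.root_counts
        counts[tag] = counts.get(tag, 0) + 1
        # The root element is written without a position, as in the table above.
        step = f"{tag}[{counts[tag]}]" if self.stack else tag
        self.stack.append((tag, step, {}))
        for name, value in attrs:
            self.pairs.append((f"{self._path()}/@{name}", value or ""))
        if tag in VOID:
            self.stack.pop()

    def handle_endtag(self, tag):
        # Only close the innermost element; stray end tags are ignored.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self.stack:
            self.pairs.append((f"{self._path()}/text()", text))


def flatten(html):
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return parser.pairs


SAMPLE = """<html><head><title>Hello</title></head>
<body><h1>Hi!</h1><p>Plain <b>bold</b> tail</p>
<p><a href="#top">up</a></p></body></html>"""

pairs = flatten(SAMPLE)
for xpath, value in pairs:
    print(f"{xpath} | {value}")
```

A real HTML parser such as HTML::TreeBuilder also repairs unclosed or misnested tags, which this sketch deliberately does not attempt.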
The untemplate tool flattens a set of HTML documents using the algorithm from xpathify. Then, it strips the key/value pairs shared by all of them. What remains consists of the original values that were fed into the template engine.
This is what the result looks like for a simple real-world example (quotes 1839 and 2486 from bash.org):
/html/head[1]/title[1]/text() | |
2486.html | QDB: Quote #2486 |
1839.html | QDB: Quote #1839 |
/html/body[1]/form[1]/center[1]/table[1]/tr[1]/td[2]/font[1]/b[1]/text() | |
2486.html | Quote #2486 |
1839.html | Quote #1839 |
//p[@class='quote'][1]/a[1]/@href | |
2486.html | ?2486 |
1839.html | ?1839 |
//p[@class='quote'][1]/a[1]/b[1]/text() | |
2486.html | #2486 |
1839.html | #1839 |
//a[@class='qa'][1]/@href | |
2486.html | ./?le=cc8456a913b26eb7364e4e9a94348d04&rox=2486 |
1839.html | ./?le=cc8456a913b26eb7364e4e9a94348d04&rox=1839 |
//p[@class='quote'][1]/text() | |
2486.html | (228) |
1839.html | (245) |
//a[@class='qa'][2]/@href | |
2486.html | ./?le=cc8456a913b26eb7364e4e9a94348d04&sox=2486 |
1839.html | ./?le=cc8456a913b26eb7364e4e9a94348d04&sox=1839 |
//a[@class='qa'][3]/@href | |
2486.html | ./?le=cc8456a913b26eb7364e4e9a94348d04&sux=2486 |
1839.html | ./?le=cc8456a913b26eb7364e4e9a94348d04&sux=1839 |
//p[@class='qt'][1]/text() | |
2486.html | <R`:#heroin> Is this for recovery or indulgence? |
1839.html | <maff> who needs showers when you've got an assortment of feminine products |
//tr[2]/td[@class='footertext'][1]/text() | |
2486.html | 0.0035 |
1839.html | 0.0033 |
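The stripping step can be sketched as follows. This Python snippet is a simplified illustration, not the untemplate implementation: it assumes each document has already been flattened into (xpath, value) pairs, and it treats a pair as shared only when the identical pair appears in every document, whereas the real tool may align documents differently. The untemplate function name and the "Sample header" pair are hypothetical; the other values are taken from the table above.

```python
from collections import Counter


def untemplate(docs):
    """docs: {name: [(xpath, value), ...]}
    Returns {xpath: {name: [values]}} for pairs not shared by all documents."""
    seen = Counter()
    for pairs in docs.values():
        seen.update(set(pairs))  # count each pair once per document
    shared = {pair for pair, n in seen.items() if n == len(docs)}

    varying = {}
    for name, pairs in docs.items():
        for xpath, value in pairs:
            if (xpath, value) not in shared:
                varying.setdefault(xpath, {}).setdefault(name, []).append(value)
    return varying


docs = {
    "2486.html": [
        ("/html/head[1]/title[1]/text()", "QDB: Quote #2486"),
        ("/html/body[1]/h1[1]/text()", "Sample header"),  # hypothetical shared pair
        ("//p[@class='quote'][1]/a[1]/@href", "?2486"),
    ],
    "1839.html": [
        ("/html/head[1]/title[1]/text()", "QDB: Quote #1839"),
        ("/html/body[1]/h1[1]/text()", "Sample header"),  # hypothetical shared pair
        ("//p[@class='quote'][1]/a[1]/@href", "?1839"),
    ],
}

result = untemplate(docs)
for xpath, per_doc in result.items():
    print(xpath)
    for name, values in per_doc.items():
        print(f"  {name} | {', '.join(values)}")
```

The shared header pair disappears from the output, leaving only the keys whose values differ between the two documents.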
The following modules may be used to serialize/flatten HTML documents on your own:
HTML::Linear - represent HTML::Tree as a flat list
HTML::Linear::Element - represent elements to populate HTML::Linear
HTML::Linear::Path - represent paths inside HTML::Tree
Stanislaw Pusep <stas@sysd.org>
This software is copyright (c) 2014 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.