An HTML cleaner based on SimpleXML, fast and customizable.
Via Composer
Create a composer.json file in your project root:
{
"require": {
"voilab/htmlcleaner": "0.*"
}
}
$ composer require voilab/htmlcleaner
<p>
Some paragraph with <strong>bold</strong> or
<em><u><i>nested tags</i></u></em>.
</p>
<p>
And a second paragraph (so two root elements here) with
<a href="somesite.org">a cool link</a>,
<a href="javascript:alert('BAM!');">a bad link</a>
and some <span class="red">nice attributes to try to keep</span>.
</p>
use \voilab\cleaner\HtmlCleaner;
$cleaner = new HtmlCleaner();
$raw_html = '...'; // take sample dataset above
echo $cleaner->clean($raw_html);
// create cleaner...
$cleaner->addAllowedTags(['p', 'strong']);
// call clean method
// create cleaner...
$cleaner
->addAllowedTags(['p', 'span'])
->addAllowedAttributes(['class']);
// call clean method
// create cleaner...
$cleaner
->addAllowedTags(['p', 'span'])
->addAllowedAttributes([
// keep attribute "class" only for spans
new \voilab\cleaner\attribute\Keep('class', 'span'),
// you can use this shorthand too, as a string
'style:span'
]);
// call clean method
Processors prepare the HTML string before it is loaded into a new SimpleXMLElement (the base of the process). They also format the HTML after it is cleaned. In other words, they act as a pre-process and a post-process step.
The pre-process must remove all disallowed tags. The standard processor uses strip_tags() to do so. After the cleaning process, the processor removes all carriage returns from the string.
You can create your own processor by implementing \voilab\cleaner\processor\Processor. Do not forget that the pre-process is responsible for removing all disallowed tags.
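The two steps described above can be sketched with plain PHP functions. This is only an illustration of the idea, not the library's standard processor and not an implementation of the Processor interface (whose method names are not shown here). The function names preProcessSketch() and postProcessSketch() are hypothetical, and removing line feeds along with carriage returns is an assumption.

```php
<?php
// Hypothetical helpers, not part of voilab/htmlcleaner: they only mirror the
// pre-process and post-process steps using built-in PHP functions.

/**
 * Pre-process: drop every tag that is not in the allowed list.
 *
 * @param string   $html        raw HTML string
 * @param string[] $allowedTags tag names without brackets, e.g. ['p', 'strong']
 */
function preProcessSketch(string $html, array $allowedTags): string
{
    // strip_tags() accepts the allowed tags as a string such as "<p><strong>"
    $allowed = '';
    foreach ($allowedTags as $tag) {
        $allowed .= '<' . $tag . '>';
    }
    return strip_tags($html, $allowed);
}

/**
 * Post-process: remove carriage returns from the cleaned string.
 * Assumption: line feeds are removed as well.
 */
function postProcessSketch(string $html): string
{
    return str_replace(["\r", "\n"], '', $html);
}

$raw = "<p>Some <strong>bold</strong> and \r\n<em>emphasized</em> text</p>";
$pre = preProcessSketch($raw, ['p', 'strong']); // <em> is stripped, its text stays
echo postProcessSketch($pre);
```

Note that strip_tags() removes only the tags themselves and keeps their text content, which is why the pre-process alone does not validate attributes.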
Attribute classes are used to validate attributes and their content. By default, an allowed attribute becomes a \voilab\cleaner\attribute\Keep. Every "not allowed" attribute becomes a \voilab\cleaner\attribute\Remove.
These two attribute types don't need to be instantiated by you. All attributes provided as a string in addAllowedAttributes() are converted to the Keep class.
You may want to keep some attributes but check their content. This is the case for the href attribute, which can contain a valid URL or some JavaScript injection. There is an attribute validator already created for that:
$cleaner
->addAllowedTags(['a'])
->addAllowedAttributes([
new \voilab\cleaner\attribute\Js('href')
]);
Note that allowed attributes can be bound or not to a specific tag. In the example above, the href attribute will be valid for every HTML tag. If you want to bind the attribute to a tag, you need to specify it as a second parameter.
Mixed content outside tags is not allowed in root position.
<!-- not valid: parts "some root " and " special " will disappear -->
some root <strong>mixed</strong> special <em>content</em>
<!-- valid -->
<p>some root <strong>mixed</strong> special <em>content</em></p>
<!-- also valid -->
<p>some root element</p>
<p>and another root element</p>
If the HTML is not well formed, the cleaner will throw an \Exception. The string needs to be perfectly written, because it is processed by simplexml_load_string($html), which is very strict:
- tags must be closed (<p></p> or <br />)
- attributes must be wrapped in (double-)quotes (<hr class="test" />)
- a (double-)quote is not allowed in attribute content; it must be converted to &quot; before HtmlCleaner::clean() is called
- the opening tag character < and the & character are not allowed in content; they must be converted respectively to &lt; and &amp; before HtmlCleaner::clean() is called
These limitations will eventually be addressed in future releases.
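One way to satisfy the escaping rules above is to run any free text through htmlspecialchars() before it is placed into the HTML string. The snippet below is a minimal sketch that does not use the library at all: it only shows that simplexml_load_string() rejects the raw string and accepts the escaped one. The helper name escapeForCleaner() is hypothetical.

```php
<?php
// Hypothetical helper, not part of voilab/htmlcleaner.
function escapeForCleaner(string $text): string
{
    // With ENT_COMPAT | ENT_XML1: & becomes &amp;, < becomes &lt;,
    // > becomes &gt; and " becomes &quot; (single quotes are left alone)
    return htmlspecialchars($text, ENT_COMPAT | ENT_XML1, 'UTF-8');
}

// Collect parse errors internally instead of emitting PHP warnings
libxml_use_internal_errors(true);

$text = 'Tom & Jerry say "1 < 2"';

// Raw text used both as attribute content and as element content
$unescaped = '<p title="' . $text . '">' . $text . '</p>';
$escaped = '<p title="' . escapeForCleaner($text) . '">'
    . escapeForCleaner($text) . '</p>';

// The strict parser returns false for the raw string...
var_dump(simplexml_load_string($unescaped) === false);
// ...and a SimpleXMLElement for the escaped one
var_dump(simplexml_load_string($escaped) !== false);
```

Escape only the text parts, never the markup itself, otherwise the tags you want to keep would be turned into entities as well.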
$ vendor/bin/phpunit --bootstrap vendor/autoload.php tests/
If you discover any security-related issues, please use the issue tracker.
The MIT License (MIT). Please see License File for more information.