RDFa Lite 1.1 and HTML Microdata parser for web documents (HTML, SVG, XML)
rdfa-lite-microdata extracts RDFa Lite 1.1 and HTML Microdata information from web documents (HTML / SVG / XML). The embedded structures may use arbitrary vocabularies (e.g. schema.org) and are returned as a Plain Old PHP Object (POPO) that is compliant with the JSON serialization described for HTML Microdata.
To extract RDFa Lite 1.1 data out of a web document, instantiate an RdfaLite
parser and call the appropriate parse method:
$rdfaParser = new \Jkphl\RdfaLiteMicrodata\Ports\Parser\RdfaLite();
// Parse an HTML file
$rdfaItems = $rdfaParser->parseHtmlFile('/path/to/file.html');
// Parse an HTML string
$rdfaItems = $rdfaParser->parseHtml('<html><head>...</head><body vocab="http://schema.org/">...</body>');
// Parse a DOM document (here: created from an HTML string)
$rdfaDom = new \DOMDocument();
$rdfaDom->loadHTML('<html><head>...</head><body vocab="http://schema.org/">...</body>');
$rdfaItems = $rdfaParser->parseDom($rdfaDom);
// Parse an XML file (e.g. SVG)
$rdfaItems = $rdfaParser->parseXmlFile('/path/to/file.svg');
// Parse an XML string (e.g. SVG)
$rdfaItems = $rdfaParser->parseXml('<svg viewBox="0 0 100 100" vocab="http://schema.org/">...</svg>');
echo json_encode($rdfaItems, JSON_PRETTY_PRINT);
The resulting JSON serialization will look something like this:
{
"items": [
{
"type": [
"http://schema.org/Movie"
],
"id": "http://www.imdb.com/title/tt0499549/",
"properties": {
"http://schema.org/name": [
"Avatar"
],
"http://schema.org/director": [
{
"type": [
"http://schema.org/Person"
],
"id": null,
"properties": {
"http://schema.org/name": [
"James Cameron"
],
"http://schema.org/birthDate": [
"August 16, 1954"
]
}
}
],
"http://schema.org/genre": [
"Science fiction"
],
"http://schema.org/trailer": [
"../movies/avatar-theatrical-trailer.html"
]
}
}
]
}
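Because the result is a plain object, it can be walked with ordinary PHP. The following is a minimal sketch, not part of the library: `collectPropertyValues()` is a hypothetical helper, and the parser result is stood in for by decoding an abridged copy of the JSON above, on the assumption that the parser's objects expose the same `items`, `type`, `id` and `properties` members.

```php
<?php
// Stand-in for a parser result (assumption: same shape as the JSON shown above)
$json = '{"items":[{"type":["http://schema.org/Movie"],'
    . '"id":"http://www.imdb.com/title/tt0499549/",'
    . '"properties":{"http://schema.org/name":["Avatar"],'
    . '"http://schema.org/director":[{"type":["http://schema.org/Person"],"id":null,'
    . '"properties":{"http://schema.org/name":["James Cameron"]}}]}}]}';
$result = json_decode($json);

/**
 * Hypothetical helper: collect the plain values of one property from a list
 * of items, descending into nested items.
 */
function collectPropertyValues(array $items, $property)
{
    $values = array();
    foreach ($items as $item) {
        foreach ($item->properties as $name => $propertyValues) {
            foreach ($propertyValues as $value) {
                if (is_object($value)) {
                    // Nested item (e.g. the director): search it as well
                    $values = array_merge($values, collectPropertyValues(array($value), $property));
                } elseif ($name === $property) {
                    $values[] = $value;
                }
            }
        }
    }
    return $values;
}

$names = collectPropertyValues($result->items, 'http://schema.org/name');
echo implode(', ', $names), PHP_EOL; // Avatar, James Cameron
```

With a real document, `$result` would be the value returned by one of the parse methods instead of the decoded sample.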
Item types and property names can be treated as references consisting of a profile IRI and a separate name. To enable IRI mode, pass true as the constructor argument when instantiating the parser:
$rdfaParser = new \Jkphl\RdfaLiteMicrodata\Ports\Parser\RdfaLite(true);
$rdfaItems = $rdfaParser->parseHtmlFile('/path/to/file.html');
With IRI mode enabled, the result is more verbose (JSON serialization):
{
"items": [
{
"type": [
{
"profile": "http://schema.org/",
"name": "Movie"
}
],
"id": "http://www.imdb.com/title/tt0499549/",
"properties": {
"http://schema.org/name": {
"profile": "http://schema.org/",
"name": "name",
"values": [
"Avatar"
]
},
"http://schema.org/director": {
"profile": "http://schema.org/",
"name": "director",
"values": [
{
"type": [
{
"profile": "http://schema.org/",
"name": "Person"
}
],
"id": null,
"properties": {
"http://schema.org/name": {
"profile": "http://schema.org/",
"name": "name",
"values": [
"James Cameron"
]
},
"http://schema.org/birthDate": {
"profile": "http://schema.org/",
"name": "birthDate",
"values": [
"August 16, 1954"
]
}
}
}
]
},
"http://schema.org/genre": {
"profile": "http://schema.org/",
"name": "genre",
"values": [
"Science fiction"
]
},
"http://schema.org/trailer": {
"profile": "http://schema.org/",
"name": "trailer",
"values": [
"../movies/avatar-theatrical-trailer.html"
]
}
}
}
]
}
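In IRI mode, each type and property carries its profile and name separately, which makes it straightforward to group properties by vocabulary or to rebuild the full IRI. The sketch below is illustrative only: `toIri()` and `summarizeItem()` are hypothetical helpers, and the input is an abridged copy of the JSON above decoded with `json_decode()`, assuming the parser result has the same members.

```php
<?php
// Stand-in for an IRI mode parser result (assumption: same shape as the JSON shown above)
$json = '{"items":[{"type":[{"profile":"http://schema.org/","name":"Movie"}],'
    . '"id":"http://www.imdb.com/title/tt0499549/",'
    . '"properties":{'
    . '"http://schema.org/name":{"profile":"http://schema.org/","name":"name","values":["Avatar"]},'
    . '"http://schema.org/genre":{"profile":"http://schema.org/","name":"genre","values":["Science fiction"]}'
    . '}}]}';
$result = json_decode($json);

/** Hypothetical helper: rebuild the full IRI from a profile / name reference. */
function toIri($reference)
{
    return $reference->profile . $reference->name;
}

/** Hypothetical helper: list an item's type IRIs and group its properties by profile. */
function summarizeItem($item)
{
    $summary = array('types' => array(), 'properties' => array());
    foreach ($item->type as $type) {
        $summary['types'][] = toIri($type);
    }
    foreach ($item->properties as $property) {
        // Group as profile => name => values
        $summary['properties'][$property->profile][$property->name] = $property->values;
    }
    return $summary;
}

$summary = summarizeItem($result->items[0]);
echo implode(', ', $summary['types']), PHP_EOL; // http://schema.org/Movie
```

Grouping by profile is one possible design; it pays off mainly when a document mixes several vocabularies.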
The Microdata format isn't specified for non-HTML host formats, so the Microdata
parser only supports HTML processing:
$microdataParser = new \Jkphl\RdfaLiteMicrodata\Ports\Parser\Microdata();
// Parse an HTML file
$microdataItems = $microdataParser->parseHtmlFile('/path/to/file.html');
// Parse an HTML string
$microdataItems = $microdataParser->parseHtml('<html><head>...</head><body itemscope itemtype="http://schema.org/Movie">...</body>');
// Parse a DOM document created from an HTML string
$microdataDom = new \DOMDocument();
$microdataDom->loadHTML('<html><head>...</head><body itemscope itemtype="http://schema.org/Movie">...</body>');
$microdataItems = $microdataParser->parseDom($microdataDom);
// Parse an HTML file with types / property names treated as IRIs
$microdataParserIri = new \Jkphl\RdfaLiteMicrodata\Ports\Parser\Microdata(true);
$microdataItems = $microdataParserIri->parseHtmlFile('/path/to/file.html');
This library requires PHP 5.5 or later. I recommend using the latest available version of PHP as a matter of principle. The library has no userland dependencies. It's installable and autoloadable via Composer as jkphl/rdfa-lite-microdata.
composer require jkphl/rdfa-lite-microdata
Alternatively, download a release or clone the repository, then require or include its autoload.php
file.
Copyright © 2017 Joschi Kuphal / joschi@tollwerk.de. Licensed under the terms of the MIT license.