URLs turned into strings #58

frmichel · 2020-08-26T15:58:02Z

Hi, this is an issue that we've started to discuss in issue #54.
When scraping page https://inpn.mnhn.fr/espece/cd_nom/60878, some URIs are turned into strings:

<https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/additionalType> "dwc:Taxon" .
<https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/additionalType> "http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonConcept" .

Property additionalType is defined in Schema.org context as:

        "additionalType": { "@id": "schema:additionalType", "@type": "@id"},

The "@type": "@id" suggests that the object should always be interpreted as a URL. Still, both values are turned into strings. Weird, huh?

Another, trickier case concerns properties that can take several object types. For instance, schema:identifier can take a Text, URL or PropertyValue, so you never know whether you'll get a URL or not.
The same will apply to the property taxonRank, which is not yet in schema.org and can take a text or a URL:

<https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/taxonRank> "http://taxref.mnhn.fr/lod/taxrank/Species" .

So I'm wondering whether the scraper should look for a usual URL scheme (typically anything starting with http:// or https://) and turn the value into a URL.
And should this be done only for properties whose object can be a URL, or for any property, so that misuses are tolerated? If one uses a property that normally takes a text value and provides a URL, should we turn it into a URL or keep it as a string?
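
To make the two options concrete, here is a minimal Python sketch of the heuristic, not the scraper's actual code. The names `URL_CAPABLE`, `looks_like_url` and `to_ntriples_object` are hypothetical, and the set of URL-capable properties is just an assumption built from the examples in this issue. The `only_url_properties` flag switches between "only properties whose object can be a URL" and "any property".

```python
from urllib.parse import urlparse

# Assumption: properties whose expected object types include URL.
# In practice this list would be derived from the Schema.org definitions.
URL_CAPABLE = {
    "https://schema.org/additionalType",
    "https://schema.org/identifier",
    "https://schema.org/taxonRank",
}


def looks_like_url(value: str) -> bool:
    """Return True if the value is an http(s) URL with a host and no whitespace."""
    candidate = value.strip()
    if any(ch.isspace() for ch in candidate):
        return False
    parsed = urlparse(candidate)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def to_ntriples_object(predicate: str, value: str, only_url_properties: bool = True) -> str:
    """Serialize an object as an N-Triples IRI or as a plain string literal."""
    allowed = (not only_url_properties) or predicate in URL_CAPABLE
    if allowed and looks_like_url(value):
        return f"<{value.strip()}>"
    # Keep everything else as a string, escaping backslashes and quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


if __name__ == "__main__":
    print(to_ntriples_object("https://schema.org/taxonRank",
                             "http://taxref.mnhn.fr/lod/taxrank/Species"))
    print(to_ntriples_object("https://schema.org/additionalType", "dwc:Taxon"))
```

With this sketch the taxonRank value becomes an IRI, while "dwc:Taxon" stays a string because a CURIE has no http(s) scheme; handling CURIEs would need a prefix map on top of this.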

AlasdairGray · 2020-08-26T16:09:46Z

I think you are right that we should do some additional processing of properties that could be URLs and ensure that strings that look like URLs or CURIEs are treated as URLs.

The question of whether to do this for properties that are not expected to have a URL is an interesting one. For simplicity, at this point I would say no. It could be something that we provide as a configuration parameter if the need arises.

frmichel · 2020-08-26T16:25:03Z

I agree. Besides, since this post-processing may be time-consuming, we could make it configurable with a postprocess = true|false parameter, plus additional optional parameters to fine-tune post-processing.
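
As a rough illustration of what such a configuration could look like, here is a Python sketch under stated assumptions. `PostprocessConfig`, its fields (`tolerate_misuse`, `url_properties`, `prefixes`) and the helper functions are all hypothetical names; only the `postprocess` switch comes from the discussion above. The prefix map holds a single illustrative entry so that CURIEs such as "dwc:Taxon" can be expanded.

```python
from dataclasses import dataclass, field


@dataclass
class PostprocessConfig:
    # Master switch from the discussion: skip all post-processing when False.
    postprocess: bool = True
    # Hypothetical option: also convert values of properties not expected to take a URL.
    tolerate_misuse: bool = False
    # Assumption: properties whose object may be a URL.
    url_properties: frozenset = frozenset({
        "https://schema.org/additionalType",
        "https://schema.org/identifier",
        "https://schema.org/taxonRank",
    })
    # Assumption: prefix map used to expand CURIEs (illustrative entry only).
    prefixes: dict = field(default_factory=lambda: {"dwc": "http://rs.tdwg.org/dwc/terms/"})


def resolve_iri(value, config):
    """Return an absolute IRI for an http(s) URL or a known CURIE, else None."""
    if value.startswith(("http://", "https://")):
        return value
    prefix, sep, local = value.partition(":")
    if sep and prefix in config.prefixes:
        return config.prefixes[prefix] + local
    return None


def postprocess_triples(triples, config):
    """Turn URL-like literals into IRIs; objects are (kind, value) pairs."""
    if not config.postprocess:
        return list(triples)
    result = []
    for subject, predicate, (kind, value) in triples:
        eligible = config.tolerate_misuse or predicate in config.url_properties
        iri = resolve_iri(value, config) if kind == "literal" and eligible else None
        result.append((subject, predicate, ("iri", iri) if iri else (kind, value)))
    return result
```

The default keeps the behaviour suggested above: conversion applies only to properties that can take a URL, and tolerating misuse is an explicit opt-in.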
