URLs turned into strings #58

Open
frmichel opened this issue Aug 26, 2020 · 2 comments

frmichel commented Aug 26, 2020

Hi, this is an issue that we've started to discuss in issue #54.
When scraping page https://inpn.mnhn.fr/espece/cd_nom/60878, some URIs are turned into strings:

<https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/additionalType> "dwc:Taxon" .
<https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/additionalType> "http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonConcept" .

Property additionalType is defined in Schema.org context as:

        "additionalType": { "@id": "schema:additionalType", "@type": "@id"},

The "@type": "@id" suggests that the object should always be interpreted as a URL. Still, both values are turned into strings. Weird, huh?
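To make the point about the context concrete, here is a minimal sketch of how one could check whether a term definition asks for IRI coercion. The function name `expects_iri` is hypothetical and not part of the scraper; the context fragment is the one quoted above.

```python
import json

# Fragment of the Schema.org context quoted above.
CONTEXT = json.loads(
    '{"additionalType": {"@id": "schema:additionalType", "@type": "@id"}}'
)


def expects_iri(context: dict, term: str) -> bool:
    """Return True if the term definition coerces values to IRIs.

    Hypothetical helper: it only inspects an expanded term definition
    (a dict) and looks for "@type": "@id".
    """
    definition = context.get(term)
    return isinstance(definition, dict) and definition.get("@type") == "@id"


if __name__ == "__main__":
    print(expects_iri(CONTEXT, "additionalType"))
    print(expects_iri(CONTEXT, "name"))
```

A check like this only says what the context requests; it does not explain why the scraper still emits strings.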

Another, trickier case concerns properties that can take several object types. For instance, schema:identifier can take a text, URL or PropertyValue, so you never know whether you'll get a URL or not.
That will also be the case for the property taxonRank, which is not yet in schema.org and can take a text or URL:

<https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/taxonRank> "http://taxref.mnhn.fr/lod/taxrank/Species" .

So I'm wondering whether the scraper should look for a usual URL scheme (typically anything starting with http:// or https://) and turn the value into a URL.
And should this be done only for properties whose object can be a URL, or for any property, so that misuse is tolerated? That is, if one uses a property that normally takes a text value and provides a URL, should we still turn it into a URL or keep it as a string?
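The heuristic described here could look roughly like the sketch below. It is only an illustration under assumptions: the names `URL_CAPABLE`, `looks_like_url`, `to_ntriples_object` and the `lenient` flag are hypothetical, and the allow-list simply contains the three properties mentioned in this issue.

```python
from urllib.parse import urlparse

# Hypothetical allow-list: properties whose object may be a URL.
URL_CAPABLE = {
    "https://schema.org/additionalType",
    "https://schema.org/identifier",
    "https://schema.org/taxonRank",
}


def looks_like_url(value: str) -> bool:
    """True for values with an http(s) scheme, a host and no whitespace."""
    value = value.strip()
    if any(ch.isspace() for ch in value):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def to_ntriples_object(predicate: str, value: str, lenient: bool = False) -> str:
    """Serialise a value as <IRI> or as a quoted literal.

    With lenient=True, URL-looking values are converted for any property,
    which corresponds to the "tolerate misuse" option.
    """
    if looks_like_url(value) and (lenient or predicate in URL_CAPABLE):
        return f"<{value.strip()}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


if __name__ == "__main__":
    print(to_ntriples_object(
        "https://schema.org/taxonRank",
        "http://taxref.mnhn.fr/lod/taxrank/Species",
    ))
```

With the default `lenient=False`, a URL given for a text-only property stays a string; note that "dwc:Taxon" is not caught by this check because it has no http(s) scheme.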

@AlasdairGray
Member

I think you are right that we should do some additional processing of properties that could be URLs and ensure that strings that look like URLs or CURIEs are treated as URLs.

The question of whether to do this for properties that are not expected to have a URL value is an interesting one. For simplicity, at this point I would say no. It could be something that we provide as a configuration parameter if the need arises.
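For the CURIE part, a minimal sketch of the expansion step might be the following. The prefix map is an assumption made for illustration (the `dwc` namespace shown is the usual Darwin Core terms namespace, but the scraper would need its own source of prefixes), and `expand_curie` is a hypothetical name.

```python
import re
from typing import Optional

# Assumed prefix map, for illustration only.
PREFIXES = {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "schema": "https://schema.org/",
}

# prefix:local, with no whitespace in either part.
CURIE = re.compile(r"^([A-Za-z][\w.-]*):(\S+)$")


def expand_curie(value: str, prefixes: dict = PREFIXES) -> Optional[str]:
    """Expand a CURIE with a known prefix; return None otherwise."""
    match = CURIE.match(value)
    if not match:
        return None
    prefix, local = match.groups()
    if local.startswith("//"):
        # Looks like an absolute URL (http://...), not a CURIE.
        return None
    base = prefixes.get(prefix)
    return base + local if base is not None else None


if __name__ == "__main__":
    print(expand_curie("dwc:Taxon"))
```

Values with an unknown prefix are left alone (the function returns None), so plain text that happens to contain a colon is not turned into a URL.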

@frmichel
Author

I agree. Besides, since this post-processing may be time-consuming, we could make it configurable with a postprocess = true|false setting plus additional optional parameters to fine-tune post-processing.
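As a rough sketch of what such a configuration could look like: only the `postprocess` switch comes from this discussion; the other field names, the tuple layout of a triple, and `postprocess_triples` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Assumed triple layout: (subject, predicate, value, kind),
# where kind is "literal" or "uri".
Triple = Tuple[str, str, str, str]


@dataclass(frozen=True)
class PostprocessConfig:
    postprocess: bool = False             # master switch, off by default
    convert_all_properties: bool = False  # hypothetical: tolerate misuse
    url_properties: FrozenSet[str] = frozenset({
        "https://schema.org/additionalType",
        "https://schema.org/identifier",
        "https://schema.org/taxonRank",
    })


def postprocess_triples(triples: List[Triple], config: PostprocessConfig) -> List[Triple]:
    """Re-tag URL-looking literals as URIs, according to the config."""
    if not config.postprocess:
        return list(triples)  # skip the (possibly costly) pass entirely
    result = []
    for subject, predicate, value, kind in triples:
        is_url = value.startswith(("http://", "https://"))
        allowed = config.convert_all_properties or predicate in config.url_properties
        if kind == "literal" and is_url and allowed:
            kind = "uri"
        result.append((subject, predicate, value, kind))
    return result
```

Keeping the switch off by default means existing users see no change in output or run time unless they opt in.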
