Test serialization of GND RDF-XML to compact JSON-LD #1

Closed
fsteeg opened this issue Jun 29, 2017 · 14 comments

@fsteeg
Member

fsteeg commented Jun 29, 2017

Both dumps and updates (via OAI) are available as RDF-XML, so that would be a suitable source format:

http://datendienst.dnb.de/cgi-bin/mabit.pl?userID=opendata&pass=opendata&cmd=login
http://www.dnb.de/DE/Service/DigitaleDienste/OAI/oai_node.html (see "Formate")

We should test serializing that RDF-XML as compact JSON-LD using the entityfacts context:

http://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld
http://hub.culturegraph.org/entityfacts/118540238

If the result looks good, this might be the format to index in Elasticsearch. We might have to do some preprocessing to make sure the values always have the same type (see footnote 1 in http://blog.lobid.org/2017/06/08/lobid-api-why-how.html about compact JSON-LD serialization in Elasticsearch).
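A minimal sketch of such a preprocessing step, assuming the variation in question is a property that holds a single value in one record and an array in another. The function name `normalize` and the sample record are hypothetical and not taken from the lobid code base:

```python
import json


def normalize(node):
    """Return a copy of a JSON-LD node in which every non-keyword property holds a list."""
    result = {}
    for key, value in node.items():
        if key.startswith("@"):
            # Keep JSON-LD keywords (@id, @type, @value, ...) as they are.
            result[key] = value
            continue
        values = value if isinstance(value, list) else [value]
        # Nested nodes are normalized recursively; plain values are kept.
        result[key] = [normalize(v) if isinstance(v, dict) else v for v in values]
    return result


# Hypothetical compact record mixing single values and arrays.
record = {
    "@id": "http://d-nb.info/gnd/4074335-4",
    "gndIdentifier": "4074335-4",
    "sameAs": ["http://d-nb.info/gnd/1005809-6", "http://sws.geonames.org/2643743"],
    "hasGeometry": {"@type": "Point", "asWKT": "Point ( -000.125740 +051.508530 )"},
}

print(json.dumps(normalize(record), indent=2))
```

Wrapping everything in a list is only one possible convention; the point is that a field keeps the same shape in every document that gets indexed.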

@fsteeg fsteeg self-assigned this Jun 29, 2017
@fsteeg fsteeg added the working label Jun 29, 2017
@acka47
Contributor

acka47 commented Jun 30, 2017

To test the quality of the JSON-LD output, you should take a look at entities with geo coordinates (which are added via a blank node), for example http://d-nb.info/gnd/4074335-4 (ttl). See the issue at lobid/lodmill#503.

fsteeg added a commit that referenced this issue Jun 30, 2017
@fsteeg
Member Author

fsteeg commented Jun 30, 2017

First results, for http://d-nb.info/gnd/2047974-8/about/lds:

{
  "@graph" : [ {
    "@id" : "http://d-nb.info/gnd/2047974-8",
    "@type" : "organisation",
    "http://d-nb.info/standards/elementset/dnb#deprecatedUri" : [ "http://d-nb.info/gnd/4194078-7" ],
    "http://d-nb.info/standards/elementset/gnd#broaderTermInstantial" : [ {
      "@id" : "http://d-nb.info/gnd/4630294-3"
    } ],
    "http://d-nb.info/standards/elementset/gnd#geographicAreaCode" : [ {
      "@id" : "http://d-nb.info/standards/vocab/gnd/geographic-area-code#XA-DE-NW"
    } ],
    "http://d-nb.info/standards/elementset/gnd#gndIdentifier" : [ "2047974-8" ],
    "http://d-nb.info/standards/elementset/gnd#gndSubjectCategory" : [ {
      "@id" : "http://d-nb.info/standards/vocab/gnd/gnd-sc#6.7"
    }, {
      "@id" : "http://d-nb.info/standards/vocab/gnd/gnd-sc#2.2"
    } ],
    "homepage" : [ {
      "@id" : "https://www.hbz-nrw.de/"
    } ],
    "http://d-nb.info/standards/elementset/gnd#oldAuthorityNumber" : [ "(DE-588)4194078-7", "(DE-588b)2047974-8", "(DE-588c)4194078-7" ],
    "placeOfBusiness" : [ {
      "@id" : "http://d-nb.info/gnd/4031483-2"
    } ],
    "preferredName:ForTheCorporateBody" : [ "Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen" ],
    "http://d-nb.info/standards/elementset/gnd#spatialAreaOfActivity" : [ {
      "@id" : "http://d-nb.info/gnd/4042570-8"
    } ],
    "topic" : [ {
      "@id" : "http://d-nb.info/gnd/4132773-1"
    } ],
    "variantName:ForTheCorporateBody" : [ "Hochschulbibliothekszentrum NRW", "Hochschulbibliothekszentrum des Landes NRW", "Hochschulbibliothekszentrum", "hbz", "hbz Köln" ],
    "http://www.w3.org/2002/07/owl#sameAs" : [ {
      "@id" : "http://d-nb.info/gnd/4194078-7"
    } ],
    "url" : [ {
      "@id" : "http://de.wikipedia.org/wiki/Hochschulbibliothekszentrum_des_Landes_Nordrhein-Westfalen"
    } ]
  } ]
}

For http://d-nb.info/gnd/4074335-4/about/lds:

{
  "@graph" : [ {
    "@id" : "_:t1",
    "@type" : "http://www.opengis.net/ont/sf#Point",
    "http://www.opengis.net/ont/geosparql#asWKT" : [ {
      "@type" : "http://www.opengis.net/ont/geosparql#wktLiteral",
      "@value" : "Point ( -000.125740 +051.508530 )"
    } ]
  }, {
    "@id" : "http://d-nb.info/gnd/4074335-4",
    "@type" : "http://d-nb.info/standards/elementset/gnd#TerritorialCorporateBodyOrAdministrativeUnit",
    "http://d-nb.info/standards/elementset/dnb#deprecatedUri" : [ "http://d-nb.info/gnd/1005809-6" ],
    "http://d-nb.info/standards/elementset/gnd#definition" : [ {
      "@language" : "de",
      "@value" : "Hauptstadt des Vereinigten Königreichs von Großbritannien und Nordirland, in Mittelsteinzeit besiedelt, 43 n.Chr. von Römern gegründet; das County of London war 1889-1965 Verwaltungsgrafschaft u. zeremonielle Grafschaft"
    } ],
    "http://d-nb.info/standards/elementset/gnd#geographicAreaCode" : [ {
      "@id" : "http://d-nb.info/standards/vocab/gnd/geographic-area-code#XA-GB"
    } ],
    "http://d-nb.info/standards/elementset/gnd#gndIdentifier" : [ "4074335-4" ],
    "homepage" : [ {
      "@id" : "http://www.london.gov.uk"
    } ],
    "http://d-nb.info/standards/elementset/gnd#oldAuthorityNumber" : [ "(DE-588)1005809-6", "(DE-588b)1005809-6", "(DE-588c)4074335-4" ],
    "preferredName:ForThePlaceOrGeographicName" : [ "London" ],
    "http://d-nb.info/standards/elementset/gnd#relatedDdcWithDegreeOfDeterminacy4" : [ {
      "@id" : "http://dewey.info/class/2--421/"
    } ],
    "variantName:ForThePlaceOrGeographicName" : [ "Londinum", "Londra", "Lundonia", "Augusta Trinobantum", "Westminster", "Lundun", "Landan", "Londyn", "Londres", "Londen", "London (Great Britain)", "Londinium" ],
    "http://www.opengis.net/ont/geosparql#hasGeometry" : [ {
      "@id" : "_:t1"
    } ],
    "http://www.w3.org/2002/07/owl#sameAs" : [ {
      "@id" : "http://d-nb.info/gnd/1005809-6"
    }, {
      "@id" : "http://sws.geonames.org/2643743"
    } ]
  } ]
}

@acka47
Contributor

acka47 commented Jun 30, 2017

So the geo stuff is in there. However, we will need some pre- and post-processing to get the expected results.

Pre-processing / Reasoning

In 1.0, we added some inferencing to get more general properties. I suggest doing similar things here:

  1. We don't want specific name properties like preferredNameForThePlaceOrGeographicName and variantNameForThePlaceOrGeographicName. For all entities, we should just use preferredName and variantName.
  2. We probably need to add all superclasses to the data. In this case, this would be PlaceOrGeographicName and AuthorityResource.

Having done 1.) and 2.), the result would look like this:

{
  "@graph" : [ {
    "@id" : "_:t1",
    "@type" : "http://www.opengis.net/ont/sf#Point",
    "http://www.opengis.net/ont/geosparql#asWKT" : [ {
      "@type" : "http://www.opengis.net/ont/geosparql#wktLiteral",
      "@value" : "Point ( -000.125740 +051.508530 )"
    } ]
  }, {
    "@id" : "http://d-nb.info/gnd/4074335-4",
    "@type" : [ "http://d-nb.info/standards/elementset/gnd#TerritorialCorporateBodyOrAdministrativeUnit",  "http://d-nb.info/standards/elementset/gnd#PlaceOrGeographicName", "http://d-nb.info/standards/elementset/gnd#AuthorityResource" ],
    "http://d-nb.info/standards/elementset/dnb#deprecatedUri" : [ "http://d-nb.info/gnd/1005809-6" ],
    "http://d-nb.info/standards/elementset/gnd#definition" : [ {
      "@language" : "de",
      "@value" : "Hauptstadt des Vereinigten Königreichs von Großbritannien und Nordirland, in Mittelsteinzeit besiedelt, 43 n.Chr. von Römern gegründet; das County of London war 1889-1965 Verwaltungsgrafschaft u. zeremonielle Grafschaft"
    } ],
    "http://d-nb.info/standards/elementset/gnd#geographicAreaCode" : [ {
      "@id" : "http://d-nb.info/standards/vocab/gnd/geographic-area-code#XA-GB"
    } ],
    "http://d-nb.info/standards/elementset/gnd#gndIdentifier" : [ "4074335-4" ],
    "homepage" : [ {
      "@id" : "http://www.london.gov.uk"
    } ],
    "http://d-nb.info/standards/elementset/gnd#oldAuthorityNumber" : [ "(DE-588)1005809-6", "(DE-588b)1005809-6", "(DE-588c)4074335-4" ],
    "http://d-nb.info/standards/elementset/gnd#preferredName" : [ "London" ],
    "http://d-nb.info/standards/elementset/gnd#relatedDdcWithDegreeOfDeterminacy4" : [ {
      "@id" : "http://dewey.info/class/2--421/"
    } ],
    "http://d-nb.info/standards/elementset/gnd#variantName" : [ "Londinum", "Londra", "Lundonia", "Augusta Trinobantum", "Westminster", "Lundun", "Landan", "Londyn", "Londres", "Londen", "London (Great Britain)", "Londinium" ],
    "http://www.opengis.net/ont/geosparql#hasGeometry" : [ {
      "@id" : "_:t1"
    } ],
    "http://www.w3.org/2002/07/owl#sameAs" : [ {
      "@id" : "http://d-nb.info/gnd/1005809-6"
    }, {
      "@id" : "http://sws.geonames.org/2643743"
    } ]
  } ]
}
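The two pre-processing steps can be sketched roughly as follows, working on expanded property IRIs. This is only an illustration: the `SUPERCLASS` table is a hypothetical excerpt covering just the classes named in this thread, and the function names are made up.

```python
import re

GND = "http://d-nb.info/standards/elementset/gnd#"

# Hypothetical excerpt of the GND class hierarchy (child -> parent).
SUPERCLASS = {
    GND + "TerritorialCorporateBodyOrAdministrativeUnit": GND + "PlaceOrGeographicName",
    GND + "PlaceOrGeographicName": GND + "AuthorityResource",
}

# Matches e.g. ...gnd#preferredNameForThePlaceOrGeographicName
NAME_PROPERTY = re.compile(r"^(?P<ns>.*[#/])(?P<base>preferredName|variantName)For[A-Z]\w*$")


def with_superclasses(types):
    """Return the given types followed by all their known superclasses."""
    result = []
    for current in types:
        while current is not None and current not in result:
            result.append(current)
            current = SUPERCLASS.get(current)
    return result


def generalize(node):
    """Rename specific name properties and add superclasses to @type."""
    out = {}
    for key, value in node.items():
        if key == "@type":
            types = value if isinstance(value, list) else [value]
            expanded = with_superclasses(types)
            # Leave @type untouched when no superclass is known.
            out[key] = expanded if len(expanded) > len(types) else value
            continue
        match = NAME_PROPERTY.match(key)
        if match is None:
            out[key] = value
            continue
        target = match.group("ns") + match.group("base")
        existing = out.get(target, [])
        existing = existing if isinstance(existing, list) else [existing]
        values = value if isinstance(value, list) else [value]
        out[target] = existing + values
    return out
```

Applied to each node in `@graph`, this turns `preferredNameForThePlaceOrGeographicName` into `preferredName` and extends the London record's `@type` to the three classes shown above, while the `sf#Point` node is left unchanged.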

Context & Framing

The result of framing the above output (based on the to-be-added AuthorityResource type) and adding the EntityFacts context can be viewed at http://tinyurl.com/y7n93utq. Obviously, this is not satisfying. For one, the EntityFacts context doesn't suffice and would have to be extended, as it doesn't cover the whole GND ontology. (EntityFacts is a simplification for use of GND by web developers.) However, using our current context from 1.0 already looks much better, see http://tinyurl.com/ychm4t92. Thus, I suggest we just update that one.

Furthermore, the @graph is still in there after framing and has to be removed by us. (It currently isn't possible to just leave it out, but it will be possible with the next JSON-LD version; see this thread on the linked-json mailing list and the issue resulting from the thread.)
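Removing the @graph ourselves could look roughly like this. It is a sketch under the assumption that the framed result contains exactly one top-level node; `unwrap_graph` and the context URL in the example are hypothetical.

```python
def unwrap_graph(framed):
    """Lift the single node out of '@graph', keeping sibling keys such as '@context'."""
    graph = framed.get("@graph")
    if not isinstance(graph, list) or len(graph) != 1:
        # Nothing to unwrap, or more than one top-level node: leave as is.
        return framed
    result = {key: value for key, value in framed.items() if key != "@graph"}
    result.update(graph[0])
    return result


framed = {
    "@context": "http://example.org/gnd-context.jsonld",  # placeholder URL
    "@graph": [{"@id": "http://d-nb.info/gnd/4074335-4", "preferredName": "London"}],
}
print(unwrap_graph(framed))
```

Documents with zero or several nodes in @graph are returned unchanged, so the function is safe to run on every framed record.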

@acka47
Contributor

acka47 commented Jun 30, 2017

I just found out that I already created a context for the 2.0 GND API, see #1. (We should probably delete this repo as soon as we have moved the issue over here.) This context is also missing some things (e.g. the geo properties), see http://tinyurl.com/y8z3f3rl.

@fsteeg
Member Author

fsteeg commented Jul 3, 2017

Another option would be direct transformation from MARC-XML to JSON, like in lobid-organisations.

We could adapt the existing mappings for the RDF conversion:
https://github.com/culturegraph/metafacture-examples/tree/master/Linked-Data-Service-Gnd

@acka47
Contributor

acka47 commented Jul 3, 2017

Re. the framing output from http://tinyurl.com/ychm4t92, I just noticed that blank nodes get an id:

      "hasGeometry": {
        "@id": "_:b0",
        "@type": "http://www.opengis.net/ont/sf#Point",
        "asWKT": "Point ( -000.125740 +051.508530 )"
      }

We should get rid of them. This has already been addressed in the JSON-LD Framing spec 1.1 ("pruneBlankNodeIdentifiers") but is currently only implemented in the Ruby library, see json-ld/json-ld.org#293.
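Until the libraries we use support that option, a post-processing step along these lines might do. This is a rough sketch, not the algorithm from the Framing spec: it simply drops the `@id` of any blank node whose identifier occurs only once in the document, and the function names are made up.

```python
from collections import Counter


def _walk(value):
    """Yield every dict nested anywhere inside a JSON value."""
    if isinstance(value, dict):
        yield value
        for child in value.values():
            yield from _walk(child)
    elif isinstance(value, list):
        for child in value:
            yield from _walk(child)


def prune_blank_node_ids(doc):
    """Remove '@id' from blank nodes whose identifier is used only once (in place)."""
    counts = Counter(
        node["@id"]
        for node in _walk(doc)
        if isinstance(node.get("@id"), str) and node["@id"].startswith("_:")
    )
    for node in _walk(doc):
        node_id = node.get("@id")
        # IRIs are never in 'counts', so only unshared blank node ids are dropped.
        if isinstance(node_id, str) and counts.get(node_id) == 1:
            del node["@id"]
    return doc
```

Identifiers that occur more than once are kept, because they are still needed to tell which nodes refer to the same blank node.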

@fsteeg
Member Author

fsteeg commented Jul 3, 2017

Input: http://d-nb.info/gnd/4074335-4/about/lds

Context: https://gist.githubusercontent.com/acka47/98035a3f215c783bdc00/raw/5699ab4e89b5e7ab896ac69442c84fcf7f50ad66/gnd-context_20160126.jsonld

Frame: https://gist.githubusercontent.com/fsteeg/729e623e7f3c5f0003bc6f28a525d2ea/raw/4e0632608116acd043727ec45588236a98cc6eef/gnd-frame_20160126.jsonld

Output:

{
  "@id" : "http://d-nb.info/gnd/4074335-4",
  "@type" : "TerritorialCorporateBodyOrAdministrativeUnit",
  "http://d-nb.info/standards/elementset/dnb#deprecatedUri" : [ "http://d-nb.info/gnd/1005809-6" ],
  "definition" : [ {
    "@language" : "de",
    "@value" : "Hauptstadt des Vereinigten Königreichs von Großbritannien und Nordirland, in Mittelsteinzeit besiedelt, 43 n.Chr. von Römern gegründet; das County of London war 1889-1965 Verwaltungsgrafschaft u. zeremonielle Grafschaft"
  } ],
  "geographicAreaCode" : [ "http://d-nb.info/standards/vocab/gnd/geographic-area-code#XA-GB" ],
  "gndIdentifier" : [ "4074335-4" ],
  "homepage" : [ "http://www.london.gov.uk" ],
  "oldAuthorityNumber" : [ "(DE-588)1005809-6", "(DE-588b)1005809-6", "(DE-588c)4074335-4" ],
  "preferredNameForThePlaceOrGeographicName" : [ "London" ],
  "relatedDdcWithDegreeOfDeterminacy4" : [ "http://dewey.info/class/2--421/" ],
  "variantNameForThePlaceOrGeographicName" : [ "Londinum", "Londra", "Lundonia", "Augusta Trinobantum", "Westminster", "Lundun", "Landan", "Londyn", "Londres", "Londen", "London (Great Britain)", "Londinium" ],
  "http://www.opengis.net/ont/geosparql#hasGeometry" : [ {
    "@id" : "_:b0",
    "@type" : "http://www.opengis.net/ont/sf#Point",
    "http://www.opengis.net/ont/geosparql#asWKT" : [ {
      "@type" : "http://www.opengis.net/ont/geosparql#wktLiteral",
      "@value" : "Point ( -000.125740 +051.508530 )"
    } ]
  } ],
  "sameAs" : [ "http://d-nb.info/gnd/1005809-6", "http://sws.geonames.org/2643743" ]
}

@acka47 Except for the points you already mentioned (missing keys in context, blank node IDs) this looks OK. Did I understand correctly: the idea is to add the http://d-nb.info/standards/elementset/gnd#AuthorityResource type to all authorities?

@acka47
Contributor

acka47 commented Jul 3, 2017

Yes, this already looks quite good. And yes, as in 1.0 we should add type AuthorityResource to all entities.

Furthermore, we should have a type from the second level of the GND ontology attached to each resource. We will need this for faceting. The GND ontology has three levels in its type hierarchy (except for Person, where a fourth one is added); see the overview of the GND class hierarchy at https://wiki1.hbz-nrw.de/x/CIeW. In the concrete example, PlaceOrGeographicName should be in the data.

Regarding the name properties, we should only use preferredName and variantName for all entities. This will allow us to query the whole data in a uniform way. (The type is made clear by other means so that we don't need the specific properties.)

fsteeg added a commit that referenced this issue Jul 4, 2017
@fsteeg
Member Author

fsteeg commented Jul 4, 2017

@fsteeg
Member Author

fsteeg commented Jul 4, 2017

Before working on the details (2nd level superclasses, rename fields, remove blank node IDs), I suggest we continue with testing the actual indexing of this format in Elasticsearch. I'd suggest we resolve this issue, and open new issues for the things I mentioned above. Assigning @acka47 for functional review.

@fsteeg fsteeg removed their assignment Jul 4, 2017
@fsteeg fsteeg added review and removed working labels Jul 4, 2017
acka47 added a commit that referenced this issue Jul 4, 2017
acka47 added a commit that referenced this issue Jul 4, 2017
@acka47
Contributor

acka47 commented Jul 4, 2017

I just noticed that the language isn't indicated the way we do it in other lobid services:

"definition":[
   {
      "@language":"de",
      "@value":"Hauptstadt des Vereinigten Königreichs von Großbritannien und Nordirland, in Mittelsteinzeit besiedelt, 43 n.Chr. von Römern gegründet; das County of London war 1889-1965 Verwaltungsgrafschaft u. zeremonielle Grafschaft"
   }
]

We would rather have "@container": "@language" in the context and the following in the data:

"definition":[
   {
      "de":"Hauptstadt des Vereinigten Königreichs von Großbritannien und Nordirland, in Mittelsteinzeit besiedelt, 43 n.Chr. von Römern gegründet; das County of London war 1889-1965 Verwaltungsgrafschaft u. zeremonielle Grafschaft"
   }
]

I updated the context accordingly but we will have to also take this into account during transformation.
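Purely for illustration, the reshaping of the values could be sketched like this. The function name `to_language_maps` is hypothetical, and the output keeps the list-of-objects shape shown in the example above:

```python
def to_language_maps(values):
    """Turn {'@language': 'de', '@value': '...'} objects into {'de': '...'} objects."""
    converted = []
    for value in values:
        if isinstance(value, dict) and "@language" in value and "@value" in value:
            converted.append({value["@language"]: value["@value"]})
        else:
            # Values without a language tag are passed through unchanged.
            converted.append(value)
    return converted


definition = [{"@language": "de", "@value": "Hauptstadt des Vereinigten Königreichs"}]
print(to_language_maps(definition))
```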

acka47 added a commit that referenced this issue Jul 4, 2017
acka47 added a commit that referenced this issue Jul 4, 2017
@acka47
Contributor

acka47 commented Jul 4, 2017

I updated the context accordingly but we will have to also take this into account during transformation.

Looks fine already, so there is nothing more to do. (I also adjusted the context for biographicalOrHistoricalInformation.)

We will have to find out which other properties use language tags.
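One way to find out would be to scan a sample of records and collect every property that carries an `@language` value. A minimal sketch, assuming the records are already loaded as Python dicts and only top-level properties matter; the function name and the sample data are hypothetical:

```python
def language_tagged_properties(records):
    """Map each property name to the set of language tags found on its values."""
    found = {}
    for record in records:
        for key, value in record.items():
            values = value if isinstance(value, list) else [value]
            for item in values:
                if isinstance(item, dict) and "@language" in item:
                    found.setdefault(key, set()).add(item["@language"])
    return found


sample = [
    {"definition": [{"@language": "de", "@value": "Hauptstadt ..."}], "gndIdentifier": ["4074335-4"]},
    {"biographicalOrHistoricalInformation": {"@language": "de", "@value": "..."}},
]
print(language_tagged_properties(sample))
```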

@acka47
Contributor

acka47 commented Jul 4, 2017

+1 Did some adjustments to the context and I am satisfied for now. Will open issues for the other things.

@acka47 acka47 added deploy and removed review labels Jul 4, 2017
@acka47 acka47 assigned fsteeg and unassigned acka47 Jul 4, 2017
@fsteeg
Member Author

fsteeg commented Jul 5, 2017

I don't think we need a separate beta/prod system yet. The context is used from GitHub, so there is nothing to deploy. Closing this.

Opened #5 for indexing.

@fsteeg fsteeg closed this as completed Jul 5, 2017
@fsteeg fsteeg removed the deploy label Jul 5, 2017