Dynamic resource count per UUID entry based on file size and configurable bucket size #7

TomWindels · 2022-10-11T13:36:45Z

With these changes, it should be possible to group multiple resources into a single UUID, based on a target file size per resource. The bucket size is now more dynamic as well.

Made bucket size configurable
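
To illustrate the grouping idea, here is a minimal sketch and not the code from this PR. The names `groupBySize`, `targetSize` and `sizeOf` are hypothetical, and the size of a resource is approximated by the length of its serialized text (exact only for ASCII content).

```typescript
// Hypothetical sketch: group serialized resources so that each group stays
// close to a target size. A group would then be written under a single UUID.
type SerializedResource = string;

function groupBySize(
    resources: SerializedResource[],
    targetSize: number,
    // Assumption: string length stands in for the file size in bytes.
    sizeOf: (resource: SerializedResource) => number = (r) => r.length
): SerializedResource[][] {
    const groups: SerializedResource[][] = [];
    let current: SerializedResource[] = [];
    let currentSize = 0;

    for (const resource of resources) {
        const size = sizeOf(resource);
        // Close the current group when adding this resource would exceed the target.
        if (current.length > 0 && currentSize + size > targetSize) {
            groups.push(current);
            current = [];
            currentSize = 0;
        }
        // A resource larger than the target still gets a group of its own.
        current.push(resource);
        currentSize += size;
    }
    if (current.length > 0) {
        groups.push(current);
    }
    return groups;
}

// Usage: with a target of 8 characters, the first two resources share a group.
const exampleGroups = groupBySize(['aaaa', 'bbbb', 'cc', 'dddddddddd'], 8);
console.log(exampleGroups.map((group) => group.length)); // [ 2, 1, 1 ]
```

The greedy approach keeps the original order of the resources, which matters when they are sorted by timestamp before being bucketized.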

woutslabbinck

Overall seems good. I have one issue with the extraction of resources, for which I have added a suggestion so that the original pipeline would still produce the same result.

If you have another suggestion, feel free to add :)
There are many options for us both to get what we want.

.gitignore

woutslabbinck · 2022-10-12T09:45:16Z

EventSource/index.ts

-
-    for (const subject of time_subjects) {
-        // add observation to resource
-        let quads = store.getQuads(subject, null, null, null);
-
-        // add featureOfInterest to resource
-        const feats = store.getQuads(subject, 'http://www.w3.org/ns/sosa/hasFeatureOfInterest', null, null);
-        feats.forEach((interst) => {
-            quads = quads.concat(
-                store.getQuads(interst.object, null, null, null)
-            );
-        });
-
-        // add result to resource
-        const results = store.getQuads(subject, 'http://www.w3.org/ns/sosa/hasResult', null, null);
-        results.forEach((res) => {
-            quads = quads.concat(
-                store.getQuads(res.object, null, null, null)
-            );
-        });
-
-        // add location to resource
-        const location = store.getQuads(subject, 'http://www.w3.org/ns/sosa/observedProperty', null, null);
-        location.forEach((loc) => {
-            quads = quads.concat(
-                store.getQuads(loc.object, null, null, null)
-            );
-        });
-
-        // add sensor to resource
-        const sensor = store.getQuads(subject, 'http://www.w3.org/ns/sosa/madeBySensor', null, null);
-        sensor.forEach((sens) => {
-            // we dont want show all the observations the sensor made in every resource, only the one that matters
-            quads.push(store.getQuads(sens.object, 'http://www.w3.org/ns/sosa/madeObservation', subject, null)[0]);
-            // take all quads and filter out all madeBySensor quads
-            const all_sens = store.getQuads(sens.object, null, null, null);
-            const diff = all_sens.filter(x => x.predicate.value !== 'http://www.w3.org/ns/sosa/madeObservation');
-            quads = quads.concat(diff);
-
-            // add platform to resource
-            const platform = store.getQuads(sens.object, 'http://www.w3.org/ns/sosa/isHostedBy', null, null);
-            platform.forEach((plat) => {
-                quads = quads.concat(
-                    store.getQuads(plat.object, null, null, null)
-                );
-            });
-        });
-
-        resources.push(quads)


I see this part is something you do not need from the index.ts script. However, for the original pipeline it is necessary for the extraction of the full resource of the location model.

To be more concrete: without that bit of code, each resource I receive looks like this:

<http://location.example.com/tracks/observation/2022-08-07T08%3A14%3A04Z> dct:isVersionOf ex:location ;
    rdf:type sosa:Observation ;
    sosa:hasFeatureOfInterest <https://data.knows.idlab.ugent.be/person/woslabbi/#me> ;
    sosa:hasResult <http://location.example.com/tracks/observation/result/2022-08-07T08%3A14%3A04Z> ;
    sosa:hasSimpleResult "POINT(3.621189000 50.962510000)"^^geo:wktLiteral ;
    sosa:madeBySensor <http://sensor.be> ;
    sosa:observedProperty <http://location.example.com/location> ;
    sosa:resultTime "2022-08-07T08:14:04Z"^^xsd:dateTime .

<http://location.example.com/tracks/observation/result/2022-08-07T08%3A14%3A04Z> rdf:type sosa:Result ;
    wgs:elevation "7.1" ;
    wgs:latitude "50.962510000" ;
    wgs:longitude "3.621189000" ;
    <https://w3id.org/transportmode#transportMode> <https://w3id.org/transportmode#Walking> .

While with this piece of code I receive more information:

<http://device.be> rdf:type sosa:Platform ;
    sosa:hosts <http://sensor.be> .

<http://location.example.com/location> rdf:type sosa:observedProperty ;
    rdfs:comment "The Geographic location observed by a sensor."@en ;
    rdfs:label "Location"@en .

<http://location.example.com/tracks/observation/2022-08-07T08%3A14%3A04Z> dct:isVersionOf ex:location ;
    rdf:type sosa:Observation ;
    sosa:hasFeatureOfInterest <https://data.knows.idlab.ugent.be/person/woslabbi/#me> ;
    sosa:hasResult <http://location.example.com/tracks/observation/result/2022-08-07T08%3A14%3A04Z> ;
    sosa:hasSimpleResult "POINT(3.621189000 50.962510000)"^^geo:wktLiteral ;
    sosa:madeBySensor <http://sensor.be> ;
    sosa:observedProperty <http://location.example.com/location> ;
    sosa:resultTime "2022-08-07T08:14:04Z"^^xsd:dateTime .

<http://location.example.com/tracks/observation/result/2022-08-07T08%3A14%3A04Z> rdf:type sosa:Result ;
    wgs:elevation "7.1" ;
    wgs:latitude "50.962510000" ;
    wgs:longitude "3.621189000" ;
    <https://w3id.org/transportmode#transportMode> <https://w3id.org/transportmode#Walking> .

<http://sensor.be> rdf:type sosa:Sensor ;
    sosa:isHostedBy <http://device.be> ;
    sosa:madeObservation <http://location.example.com/tracks/observation/2022-08-07T08%3A14%3A04Z> ;
    sosa:observes <http://location.example.com/location> .

<https://data.knows.idlab.ugent.be/person/woslabbi/#me> rdf:type sosa:FeatureOfInterest .

As a suggestion, the above code could be placed in a utility function extractLocationResource.
The default behaviour would still be to call that function.
In your case, you are only interested in samples per subject, so you could extract the resource on a per-subject basis (which could also be made configurable).
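
The suggestion could be sketched roughly as follows. This is only an illustration under assumptions: `QuadSource`, `ArrayStore`, `quad`, `extractBySubject` and `extractResources` are hypothetical stand-ins so the example runs without any library, and only `extractLocationResource` mirrors the removed loop body. Unlike the removed code, it concatenates the matching `madeObservation` quads instead of pushing the first match.

```typescript
// Minimal stand-ins for RDF terms and a quad store. These are assumptions made
// so the sketch runs on its own; the script itself queries a store via getQuads.
interface Term { value: string }
interface Quad { subject: Term; predicate: Term; object: Term }
interface QuadSource {
    getQuads(subject: string | null, predicate: string | null, object: string | null): Quad[];
}

const SOSA = 'http://www.w3.org/ns/sosa/';

// Tiny in-memory store (hypothetical) where null acts as a wildcard.
class ArrayStore implements QuadSource {
    private readonly quads: Quad[];
    constructor(quads: Quad[]) { this.quads = quads; }
    getQuads(subject: string | null, predicate: string | null, object: string | null): Quad[] {
        return this.quads.filter((q) =>
            (subject === null || q.subject.value === subject) &&
            (predicate === null || q.predicate.value === predicate) &&
            (object === null || q.object.value === object));
    }
}

function quad(s: string, p: string, o: string): Quad {
    return { subject: { value: s }, predicate: { value: p }, object: { value: o } };
}

// The removed loop body, moved into a function that handles one observation.
function extractLocationResource(store: QuadSource, subject: string): Quad[] {
    let quads = store.getQuads(subject, null, null);
    // Feature of interest, result and observed property: follow one hop.
    for (const name of ['hasFeatureOfInterest', 'hasResult', 'observedProperty']) {
        for (const link of store.getQuads(subject, SOSA + name, null)) {
            quads = quads.concat(store.getQuads(link.object.value, null, null));
        }
    }
    for (const sens of store.getQuads(subject, SOSA + 'madeBySensor', null)) {
        const sensor = sens.object.value;
        // Only the madeObservation quad that points back at this observation.
        quads = quads.concat(store.getQuads(sensor, SOSA + 'madeObservation', subject));
        // Every other quad about the sensor.
        quads = quads.concat(store.getQuads(sensor, null, null)
            .filter((q) => q.predicate.value !== SOSA + 'madeObservation'));
        // The platform hosting the sensor.
        for (const plat of store.getQuads(sensor, SOSA + 'isHostedBy', null)) {
            quads = quads.concat(store.getQuads(plat.object.value, null, null));
        }
    }
    return quads;
}

// Per-subject extraction: only the quads that have the subject itself as subject.
type Extractor = (store: QuadSource, subject: string) => Quad[];
const extractBySubject: Extractor = (store, subject) => store.getQuads(subject, null, null);

// The location extractor stays the default; callers can pass another one.
function extractResources(
    store: QuadSource,
    subjects: string[],
    extractor: Extractor = extractLocationResource
): Quad[][] {
    return subjects.map((subject) => extractor(store, subject));
}
```

With this shape, the original pipeline calls `extractResources(store, subjects)` and keeps its output, while other pipelines pass `extractBySubject` or an extractor of their own.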

I find it strange that the recursive implementation didn't add the triples with http://sensor.be (and, recursively, http://device.be and http://location.example.com/location) as a subject. I'll play around with it some more to find out how this happened and see if I can resolve it so that this works properly as well. However, the triple with subject https://data.knows.idlab.ugent.be/person/woslabbi/#me would indeed not be added with this approach, so a separate function (called by default when no additional arguments are present) would be required.

I have created a fix for another (related) issue, but I am unable to replicate your specific results. Using the data you gave as an example above, and creating a TTL file from it, the data gets parsed into a single resource, which contains the following subjects (and all their data):
Set(6) {
'http://location.example.com/tracks/observation/2022-08-07T08%3A14%3A04Z',
'http://location.example.com/location',
'https://data.knows.idlab.ugent.be/person/woslabbi/#me',
'http://location.example.com/tracks/observation/result/2022-08-07T08%3A14%3A04Z',
'http://sensor.be',
'http://device.be'
}

Could you maybe provide an example .ttl (or .nt) file so I can see if that helps in replicating the issue?
PS: I see the subject 'https://data.knows.idlab.ugent.be/person/woslabbi/#me' is referenced in the original measurement as well, so a separate function might not be required after all (if I can figure out how it went wrong on your end).
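
For reference, a recursive extraction of this kind can be sketched as a graph traversal with a visited set. This is a simplified illustration, not the implementation in this PR: `Triple`, `collectResource` and the optional `doNotFollow` parameter are hypothetical names, and terms are plain strings.

```typescript
// Hypothetical, simplified triple type: terms are plain strings.
interface Triple { subject: string; predicate: string; object: string }

// Collect every triple reachable from `root` by following objects that also
// occur as subjects. The visited set guards against circular references such
// as sensor -> isHostedBy -> device -> hosts -> sensor.
function collectResource(
    triples: Triple[],
    root: string,
    // Assumption: triples with these predicates are kept, but their objects are not expanded.
    doNotFollow: Set<string> = new Set<string>()
): { subjects: Set<string>; triples: Triple[] } {
    const bySubject = new Map<string, Triple[]>();
    for (const triple of triples) {
        const list = bySubject.get(triple.subject) || [];
        list.push(triple);
        bySubject.set(triple.subject, list);
    }

    const visited = new Set<string>();
    const collected: Triple[] = [];
    const stack: string[] = [root];

    while (stack.length > 0) {
        const subject = stack.pop() as string;
        if (visited.has(subject)) {
            continue; // already expanded, so cycles terminate here
        }
        const own = bySubject.get(subject);
        if (own === undefined) {
            continue; // a literal, or an IRI that never occurs as a subject
        }
        visited.add(subject);
        for (const triple of own) {
            collected.push(triple);
            if (!doNotFollow.has(triple.predicate)) {
                stack.push(triple.object);
            }
        }
    }
    return { subjects: visited, triples: collected };
}
```

On the example data above, starting from the observation, such a traversal reaches the same six subjects as listed in the `Set(6)` output. Without something like `doNotFollow`, a sensor that made several observations would pull all of them into one resource through `sosa:madeObservation`.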

woutslabbinck

Nice Work

Dynamic resource count per UUID entry based on file size

5f9ec74

Made bucket size configurable

woutslabbinck requested changes Oct 12, 2022

woutslabbinck linked an issue Oct 12, 2022 that may be closed by this pull request

Make bucketSize configurable #5

Closed

TomWindels requested a review from woutslabbinck October 12, 2022 12:09

TomWindels added 2 commits October 12, 2022 14:09

Fixed issue with circular references in resources

d83d578

Fixed container referencing causing one single big resource file

8944152

woutslabbinck approved these changes Oct 12, 2022

woutslabbinck merged commit b849465 into woutslabbinck:main Oct 12, 2022
