Add an index command #199
Could also limit this to `$ preston index` to index all provenance, not allowing any flexibility. But this is significantly less fun.
I like the piping of things for sure! And, I was wondering . . . building an index is just another transformation of some provenance logs, and has a specific result (the index). So I was wondering whether you had in mind to be able to do things like:

`preston history | preston index | preston process`

where `preston process` takes the nquads generated by the indexing and adds them to the provenance log, and the index would generate some dataset containing a bunch of Lucene index files (or insert your favorite indexing method). The neat thing about this would be that a provenance log would be securely linked to a specific version of an index.

With this, you can ask questions like: "Ok Google, can you find me an alias index derived from hash://sha256/abc123?" or "Hey Siri, can you ask Google to find me a taxonomic name index derived from hash://sha256/abc123?" Would be fun to say out loud, right? And, no need to spin those CPUs unnecessarily to regenerate an index that has already been baked somewhere.
Piping is great! I figured
“hash colon slash slash sha 2 5 6 slash alpha beta 1 2 3 …” sounds great. Everyone loves convenient voice commands.
So, to clarify, I imagined building the index in temp/, then zipping everything and tossing it into data/ automatically. Then commands that make use of it (thinking of server commands like
Nice! I want it!
using

```bash
#!/bin/bash
#
# index a patched version of provenance graph associated with an anchor
# into oxigraph
#
preston ls \
  --anchor hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd \
  --remote https://linker.bio \
  | sed -E 's/(<)([a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12})([^ ]*)(>)/<urn:uuid:\2>/g' \
  | pv -l \
  | ./oxigraph_server_v0.3.22_x86_64_linux_gnu load --lenient --format nq --location preston-gib
```

I was able to load:
with

and then, with

yielding
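The `sed` step in the indexing script above rewrites IRIs that *begin* with a bare UUID into absolute `urn:uuid:` IRIs before loading. A minimal illustration on a fabricated nquad (the subject and object IRIs here are made up for demonstration):

```bash
#!/bin/bash
# rewrite IRIs starting with a bare UUID into urn:uuid: form,
# using the same sed expression as the indexing script above
echo '<4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> <https://example.org/archive.zip> .' \
  | sed -E 's/(<)([a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12})([^ ]*)(>)/<urn:uuid:\2>/g'
# -> <urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> <https://example.org/archive.zip> .
```

Note that IRIs that merely *contain* a UUID (e.g., https://gbif.org/dataset/4fa7b334-…) are left alone, because the pattern anchors the UUID right after the opening `<`.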
@mielliott perhaps we have found our indexer in oxigraph . . .
Looking up content associated with a GBIF dataset id https://gbif.org/dataset/4fa7b334-ce0d-4e88-aaae-2e0c138d049e

see also

```sparql
SELECT ?archiveUrl ?seenAt ?contentId
WHERE {
  graph ?g1 {
    <urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> ?archiveUrl .
    ?archiveUrl <http://purl.org/dc/elements/1.1/format> "application/dwca" .
  }
  graph ?activity {
    ?activity <http://www.w3.org/ns/prov#used> ?archiveUrl .
    ?activity <http://www.w3.org/ns/prov#generatedAtTime> ?seenAt .
    ?contentId <http://www.w3.org/ns/prov#qualifiedGeneration> ?activity .
  }
} LIMIT 10
```

yielding
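One way to run lookups like the query above against the loaded store is to save the query to a file and send it to oxigraph's HTTP endpoint. This is a sketch: it assumes the same oxigraph binary and `--location` as the indexing script, and oxigraph's default port 7878 with the standard SPARQL 1.1 protocol `/query` endpoint.

```bash
#!/bin/bash
# store the dataset-archive lookup query in a file so it can be
# reused and versioned alongside the provenance it queries
cat > dataset-archives.rq <<'EOF'
SELECT ?archiveUrl ?seenAt ?contentId
WHERE {
  graph ?g1 {
    <urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> ?archiveUrl .
    ?archiveUrl <http://purl.org/dc/elements/1.1/format> "application/dwca" .
  }
  graph ?activity {
    ?activity <http://www.w3.org/ns/prov#used> ?archiveUrl .
    ?activity <http://www.w3.org/ns/prov#generatedAtTime> ?seenAt .
    ?contentId <http://www.w3.org/ns/prov#qualifiedGeneration> ?activity .
  }
} LIMIT 10
EOF

# with the store served via:
#   ./oxigraph_server_v0.3.22_x86_64_linux_gnu serve --location preston-gib
# the saved query can be sent to the SPARQL endpoint (needs the
# server running locally):
#   curl -s -H 'Content-Type: application/sparql-query' \
#        -H 'Accept: text/csv' \
#        --data-binary @dataset-archives.rq http://localhost:7878/query
```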
here's a query for, and resulting list of, contentIds associated with our eBird friends. Note that this accounts for the introduction of activity namespaces in 2020 #41 .

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?contentId ?seenAt ?archiveUrl WHERE
{
  {
    SELECT ?contentId ?seenAt ?archiveUrl
    WHERE {
      graph ?g1 {
        <urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> ?archiveUrl .
        ?archiveUrl <http://purl.org/dc/elements/1.1/format> "application/dwca" .
      }
      graph ?activity {
        ?activity <http://www.w3.org/ns/prov#used> ?archiveUrl .
        ?activity <http://www.w3.org/ns/prov#generatedAtTime> ?seenAt .
        ?contentId <http://www.w3.org/ns/prov#qualifiedGeneration> ?activity .
      }
    }
  }
  UNION
  {
    SELECT ?contentId ?seenAt ?archiveUrl
    WHERE {
      <urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> ?archiveUrl .
      ?archiveUrl <http://purl.org/dc/elements/1.1/format> "application/dwca" .
      ?activity <http://www.w3.org/ns/prov#used> ?archiveUrl .
      ?activity <http://www.w3.org/ns/prov#generatedAtTime> ?seenAt .
      ?contentId <http://www.w3.org/ns/prov#qualifiedGeneration> ?activity .
    }
  }
} ORDER BY ?seenAt
```

with the first 10 and last 10 results attached
At long last! Hopefully with no fun surprises like Jena's demanding

I should probably mention that I did implement the indexing functionality described in #199 (comment) in the registry branch, using Lucene. It never made its way into main, though. A big limitation with just using Lucene was the lack of a query language like SPARQL, so instead of writing a

Do you plan on packaging oxigraph with preston, or keeping it separate as in your examples?
@mielliott great question! Not sure yet . . . am almost tempted to treat the oxigraph binaries as assets and add them to the content graph, along with functionality to execute workflows defined in that graph. But other than that, I do not see a compelling reason to merge preston with oxigraph and make it available in a single cli tool. But . . . if we did add a

Any ideas? What do you think, @mielliott ?
I've added some configuration to query the indexed provenance graph of GIB (GBIF, iDigBio, BioCase). The syntax is a bit weird, but grlc was quite helpful to get a usable API in front of the sparql endpoint.

Example query by UUID

Using GBIF's uuid for the eBird dataset (most of GBIF's volume), https://www.gbif.org/dataset/4fa7b334-ce0d-4e88-aaae-2e0c138d049e reformatted to

Example query by DOI

Using GBIF's assigned DOI https://doi.org/10.15468/aomfnb the following can be retrieved:

Query by URL

Query activity by known location of a darwin core archive https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip .

Query by ContentId (aka hash)

Querying for a known dwc archive hash hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d

yields
…associations with doi/uuid/url/hashes. Related to #199 (comment) .
…rects. Related to #199 (comment) .
After some tinkering, I ended up implementing a redirection service. The idea is that the service uses a content registry of known provenance, then redirects resolved content ids to a repository. Currently, the resolver resolves identifiers to their associated darwin core archives. You can resolve by:

For identifiers that are not uniquely tied to content (e.g., uuid, doi, url), the resolver picks the most recent darwin core archive associated with the identifier. So, this implements a kind of wayback machine for darwin core archives registered in the GBIF/iDigBio universe. For now, you can find provenance information for the redirect in the 302 http redirect response headers.

Example 1. resolve by eBird dataset uuid

yields

Where,

Example 2. resolve by eBird dataset DOI

resulting in the same redirection, as expected.

Example 3. resolve by eBird dataset original resource location

resulting in the same redirection as in examples 1 and 2, as expected.

The index is built using oxigraph (see https://github.com/bio-guoda/preston-service/blob/9466c7ac601902b28ff64e7ac83ed6a9a74624a5/query/index-provenance-graph.sh ) and results in a ~30GiB index. This index is then run as a read-only service using https://github.com/bio-guoda/preston-service/blob/9466c7ac601902b28ff64e7ac83ed6a9a74624a5/systemd/system/preston-registry.service . The redirect service is configured to query the index, and redirect to a known content repository, via the configuration defined at https://github.com/bio-guoda/preston-service/blob/main/systemd/system/preston-redirect.service .

With this, we have a service that uses a well-defined relation between identifiers and their associated content. No longer do we have to rely on DNS, or dynamic databases, because our redirection is anchored in a specific provenance graph (in this case, the provenance graph with version hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd).

@seltmann @mielliott @cboettig - Can you feel the excitement? Curious to hear your thoughts. You should be able to resolve any url/uuid/doi associated with darwin core archives registered with idigbio and gbif. At least, as recorded monthly since late 2018 / early 2019.
For a UCSB example . . . I am noticing how there are various ids / locations associated with a specific versioned piece of content - the DwC-A containing the digital collection records and their associated metadata.

So, to cite an exact version of a dataset, you can now say something like:

Cheadle Center for Biodiversity and Ecological Restoration (2023). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv as derived from the DwC-A defined in hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd as gathered through activity urn:uuid:603cb45b-c23e-4d3e-a0bf-604d8537296d at 2023-12-03T06:16:07.462Z

Quite the mouthful, and precise.
Now, with added redirect badges for embedding on web pages . . . with the pattern being https://linker.bio/badge/[some known url / uuid / doi]

Example:

which renders to:

which would redirect to the associated content via https://linker.bio/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0
@seltmann you can check whether your UCSB collection is tracked by Preston by embedding DwC-A and EML download buttons on your respective pages, using a GBIF Dataset DOI, DwC-A endpoint url, GBIF Dataset UUID, or iDigBio recordset UUID - e.g., urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 can be used to get the most recent archived/tracked related DwC-A content via https://linker.bio/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 , with badge uri https://linker.bio/badge/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 .
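Putting the badge and resolver URLs together, an embed snippet might look like the following. This is a sketch: the markdown wrapping and alt text are my own choices for illustration, not something linker.bio prescribes; only the uuid, badge uri, and resolver uri come from the comment above.

```bash
#!/bin/bash
# emit a markdown snippet that shows the badge image and links it to
# the resolver, so a click redirects to the associated DwC-A content;
# the id is the recordset uuid from the example above
id='urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0'
snippet="[![tracked by Preston](https://linker.bio/badge/$id)](https://linker.bio/$id)"
echo "$snippet"
```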
@seltmann please note that I've redesigned the badge to be a FAIR assessment badge. So, without further ado: drum roll . . .

Congratulations to @seltmann and colleagues: UCSB-IZC is FAIR!

Accessed from https://linker.bio/#use-case-4-assessing-fairness-of-biodiversity-data on 2024-01-03 -
Amazing stuff @jhpoelen, very fun! I noticed the badges default to calling stuff a DwC-A if the content type is unknown or the content doesn't exist:

preston-serve/src/main/java/bio/guoda/preston/server/RedirectingServlet.java (lines 81 to 89 in 8a912a4)
And this kinda confused me when toying around with the new badge feature, asking for badges of silly things like RSS feeds or fake IDs. I can see this causing some confusion if, for example, something goes wrong for someone's EML/etc. badge, causing linker.bio to instead make a "DwC-A" badge. May I suggest a more unassuming badge when the content type can't be determined? A more general "Content", "Error", or just blank? Or maybe there's an "unknown" MimeType or similar.
@mielliott thanks for sharing your thoughts. I can see how a badge with "DwC-A unknown" can be confusing, especially when plugging in any kind of stuff like https://linker.bio/badge/10.12/345 .

So requesting:

https://linker.bio/badge/10.12/345

is equivalent to asking:

https://linker.bio/badge/10.12/345?type=application/dwca

With this information, would you have any suggestions on how to make the "DwC-A unknown" badge less confusing and more informative?
How about like https://linker.bio/badge/10.12/345?type=cats?
PS - I really like the feature to specify the content type 🙌 |
…lacing requested content type (e.g., DwC-A) with the word "content". As suggested by @mielliott (thank you!) in #199 (comment)
#196 suggests allowing preston to look up URLs associated with a hash. Doing this quickly requires building an index. I can imagine two ways this could work:

1. `preston index` … for indexing
2. `preston index` … and index their content

Option 1 is simpler and makes it easier for the user to pick and choose what goes into the index.

Option 2 has the advantage of being able to record where statements came from, which is a big part of what "indexes" do, and also keeps the provenance chain going, which is great. E.g., in a Lucene index where "documents" represent RDF statements, we could record each statement's origin as a line in a provenance log ( line:hash://sha256/abc!/L52 ).

(There's also an option 3: do both option 1 and option 2.)

Option 1 is tempting, but I think I favor option 2. @jhpoelen thoughts? Or better ideas?