support marking of GenBank flat files in content stream #246
e.g.,
contains a (giant) record associated with accession LR828119 -
|
with associated accession content retrieved from:
|
in comparing the results from the webservice and associated data package,
so, it appears that the files only differ by a newline character. This may be a side effect of implementing the Which is confirmed by the matching sha256 signatures after manually adding a
|
This means that we've created a citable, offline-enabled version of the
functionality. |
@jhpoelen I just ran some tests for catting content ID'd by hash, alias, and lines. There are various quirks:
$ echo "this is a line" > with-newline.txt
$ echo -n "this is a line" > no-newline.txt
$ preston track file://$(pwd)/no-newline.txt file://$(pwd)/with-newline.txt | grep hasVersion
<file:///home/mielliott/test/no-newline.txt> <http://purl.org/pav/hasVersion> <hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28> <urn:uuid:a82ae34f-1441-4726-80af-09099de9ec71> .
<file:///home/mielliott/test/with-newline.txt> <http://purl.org/pav/hasVersion> <hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923> <urn:uuid:9ba2b4d6-834d-42d9-8daa-2501b4e9dec2> .
# Retrieval tests for no-newline.txt / hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28
## Ask by hash = OK
$ preston cat hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 -
## Ask by alias = newline added
$ preston cat file:///home/mielliott/test/no-newline.txt | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 -
## Ask for line 1 = OK
$ preston cat 'line:hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28!/L1' | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 -
## Ask for lines 1-2 = OK
$ preston cat 'line:hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28!/L1-L2' | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 -
# Retrieval tests for with-newline.txt / hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923
## Ask by hash = OK
$ preston cat hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 -
## Ask by alias = OK
$ preston cat file:///home/mielliott/test/with-newline.txt | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 -
## Ask for line 1 = newline removed
$ preston cat 'line:hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923!/L1' | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 -
## Ask for lines 1-2 = OK
$ preston cat 'line:hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923!/L1-L2' | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 -
The quirks:
|
@mielliott thanks for sharing your notes. Any intuitions on desired intuitive behavior? |
Well, the fact that For the
$ echo "haha" | head -n1 | wc -c
5
$ echo -n "haha" | head -n1 | wc -c
4
Not sure if there's an official stance on whether \n is the beginning or end of a line. Maybe Google knows |
My personal preference would be to treat \n as the end of the line (if it's there, print it, otherwise don't add one), so that Which would behave the same way as using head/tail to pluck out lines 1-2 |
So far the chat bots are in favor of \n being the end of a line, not the beginning https://www.perplexity.ai/search/aeb7961b-5698-465c-b1c4-a0e05c3fff48
|
Sorry, I meant that the question is about whether "\n is part of the line" vs. "\n is a separator between lines". I don't think anyone's advocating for treating \n as the beginning of a line. |
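The "\n is part of the line" reading proposed above can be sketched in a few lines of Python. This is only an illustration of the proposed semantics, not preston's implementation; the function name select_lines is hypothetical. Each line keeps its own trailing newline if it has one, and no newline is ever added, which mirrors what head/tail do.

```python
def select_lines(content: bytes, first: int, last: int) -> bytes:
    """Return lines first..last (1-based, inclusive), each with its own terminator.

    A line's trailing newline is treated as part of that line: it is emitted
    if present in the content and never added if absent.
    """
    lines = content.split(b"\n")
    # Every piece except the last was followed by a newline in the content.
    pieces = [line + b"\n" for line in lines[:-1]]
    if lines[-1]:
        # Final line that is not newline-terminated: keep it as-is.
        pieces.append(lines[-1])
    return b"".join(pieces[first - 1:last])


with_newline = b"this is a line\n"
no_newline = b"this is a line"

line1_with = select_lines(with_newline, 1, 1)
line1_without = select_lines(no_newline, 1, 1)
```

Under this reading, asking for L1 of with-newline.txt returns the content unchanged (newline included), and asking for L1 of no-newline.txt also returns the content unchanged (no newline added).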
Thanks for the digging and generating texts using general language models (how do you cite these models again?). Sounds like Wanna take a stab at implementing this? Or are you still busy writing your proposal? |
I'd cite the conversation with ChatGPT as a "personal correspondence". Sure, I can take a look at it, I'll holler if something comes up though.
Just to make sure we're on the same page @jhpoelen - preston's current behavior with Note #128 (comment) might explain any deja vu |
Yes, |
With current additions, the following genbank "flat file" -
is now exposed as -
{
"accession": "KT156259",
"definition": "[Chrysosporium] lobatum strain CBS 624.79 elongation factor 3 gene, partial cds.",
"organism": "[Chrysosporium] lobatum",
"db_xref": "taxon:85844",
"country": "Romania",
"http://www.w3.org/ns/prov#wasDerivedFrom": "line:gz:hash://sha256/8efca32f6aa1837303c1d8ea409eef8f0837ca743bddd02001bd1819d4504ed0!/L11-L63",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "genbank-flatfile"
} |
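As a rough illustration of how the fields in the JSON above could be pulled out of a flat file record, here is a minimal Python sketch. It is not the gb-stream implementation. The function name parse_flatfile_record is hypothetical, and SAMPLE_RECORD is an abbreviated, hand-written record in GenBank layout that reuses the values shown above (the feature location 1..100 is a placeholder).

```python
import re

# Abbreviated, hand-written sample in GenBank flat file layout (an assumption
# for illustration; real records carry many more lines).
SAMPLE_RECORD = """\
LOCUS       KT156259
DEFINITION  [Chrysosporium] lobatum strain CBS 624.79 elongation factor 3 gene,
            partial cds.
ACCESSION   KT156259
SOURCE      [Chrysosporium] lobatum
  ORGANISM  [Chrysosporium] lobatum
FEATURES             Location/Qualifiers
     source          1..100
                     /organism="[Chrysosporium] lobatum"
                     /db_xref="taxon:85844"
                     /country="Romania"
//
"""


def parse_flatfile_record(text: str) -> dict:
    """Extract a handful of fields from one GenBank flat file record."""
    record = {}
    accession = re.search(r"^ACCESSION +(\S+)", text, re.M)
    if accession:
        record["accession"] = accession.group(1)
    # DEFINITION may wrap onto indented continuation lines.
    definition = re.search(r"^DEFINITION +(.*(?:\n +\S.*)*)", text, re.M)
    if definition:
        record["definition"] = " ".join(definition.group(1).split())
    organism = re.search(r"^ +ORGANISM +(.*)$", text, re.M)
    if organism:
        record["organism"] = organism.group(1).strip()
    # Keep the first occurrence of each feature qualifier of interest.
    for name, value in re.findall(r'/(db_xref|country)="([^"]*)"', text):
        record.setdefault(name, value)
    return record


record = parse_flatfile_record(SAMPLE_RECORD)
```

The provenance and type keys in the JSON above are not derived from the record text itself, so they are left out of this sketch.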
First version of gb-stream included in v0.6.4 |
Also see https://github.com/jhpoelen/obi-genbank . |
GenBank flat files (see epam/NGB#441 and https://www.ncbi.nlm.nih.gov/genbank/samplerecord/) are used to represent GenBank records.
A flat file begins with a line starting with LOCUS and ends with a line that only has // on it.
GenBank publishes gzipped data packages with a bunch of these flat files in them (see globalbioticinteractions/globalbioticinteractions#904).
Suggested feature would help do something like:
which would produce some stream of statements like:
where
is a URL to a dynamic NCBI web service query that (may) retrieve a GenBank flat file by accession id, and
line:...!/L345-L456
is the exact location of an associated accession record in some content.
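The record boundaries described in this issue (a line starting with LOCUS, closed by a line containing only //) are enough to compute line ranges for each record. A minimal Python sketch under those assumptions follows. The names find_record_ranges and line_locator are hypothetical, and the content id is a placeholder rather than a real hash; the locator shape follows the line:...!/L345-L456 form quoted above.

```python
from typing import Iterable, Iterator, Tuple


def find_record_ranges(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (first_line, last_line) for each LOCUS ... // record, 1-based."""
    start = None
    for number, line in enumerate(lines, start=1):
        if start is None and line.startswith("LOCUS"):
            start = number
        elif start is not None and line.rstrip("\r\n") == "//":
            yield (start, number)
            start = None


def line_locator(content_id: str, first: int, last: int) -> str:
    """Build a line-range locator for a record inside some content."""
    return f"line:{content_id}!/L{first}-L{last}"


# Toy content: one header line followed by two tiny records.
toy_lines = ["header\n", "LOCUS A\n", "ORIGIN\n", "//\n", "LOCUS B\n", "//\n"]
ranges = list(find_record_ranges(toy_lines))

# Placeholder content id, not a real hash.
content_id = "gz:hash://sha256/<sha256-of-archive>"
locators = [line_locator(content_id, first, last) for first, last in ranges]
```

Because find_record_ranges only looks at one line at a time, it can run over a decompressed stream without holding a whole data package in memory.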