extend BRIT image archival to include SERNEC TCN collections #212

Closed
jhpoelen opened this issue Jan 9, 2023 · 13 comments

@jhpoelen
Member

jhpoelen commented Jan 9, 2023

fyi @jbest

The South East Regional Network of Expertise and Collections (SERNEC) Thematic Collection Network (TCN) is a collaboration that is digitizing and making data accessible for over 3 million plant specimens.

@jhpoelen
Member Author

jhpoelen commented Jan 9, 2023

@jhpoelen
Member Author

jhpoelen commented Jan 9, 2023

@jhpoelen
Member Author

jhpoelen commented Jan 9, 2023

see prototype in development at https://github.com/bio-guoda/preston-sernec .

jhpoelen added a commit to bio-guoda/preston-sernec that referenced this issue Jan 9, 2023
jhpoelen pushed a commit to bio-guoda/preston-sernec that referenced this issue Jan 9, 2023
jhpoelen pushed a commit to bio-guoda/preston-sernec that referenced this issue Jan 9, 2023
@jhpoelen
Member Author

jhpoelen commented Jan 9, 2023

I created https://github.com/bio-guoda/preston-sernec . This repo contains today's snapshot of the SERNEC-associated Darwin Core Archives (dwc-a).

With that, I was able to estimate the total number of records with bisque images using:

https://github.com/bio-guoda/preston-sernec/blob/main/list-image-urls.sh

and

./list-image-urls.sh | tee image-urls.tsv

along with

cat image-urls.tsv | grep bisque | wc -l

to be:

9.99M

with an estimated 3.33M individual records, counted via:

cat image-urls.tsv | grep bisque | grep accessURI | wc -l

Given that the image transfer rate of bisque is known to be 1 image per 5 seconds, it'll take:

10M * 5 / (3600 * 24) ≈ 578 days to migrate all the images.
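
The counting and back-of-the-envelope estimate above can be sketched in Python as follows. This is a minimal sketch, not the contents of list-image-urls.sh: the function names and the sample rows are hypothetical, and the real column layout of image-urls.tsv is not shown in this thread, so lines are matched by substring, the same way the chained grep calls do.

```python
from typing import Iterable

SECONDS_PER_DAY = 3600 * 24


def count_matching(lines: Iterable[str], *needles: str) -> int:
    """Count lines containing every needle, like `grep a | grep b | wc -l`."""
    return sum(1 for line in lines if all(n in line for n in needles))


def estimate_transfer_days(image_count: int, seconds_per_image: float) -> float:
    """Days needed to fetch image_count images one at a time."""
    return image_count * seconds_per_image / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical sample rows; the real TSV layout is an assumption.
    sample = [
        "rec1\taccessURI\thttps://bisque.example.org/image/1",
        "rec1\tthumbnail\thttps://bisque.example.org/image/1?thumb",
        "rec2\taccessURI\thttps://other.example.org/image/2",
    ]
    print(count_matching(sample, "bisque"))               # all bisque URLs
    print(count_matching(sample, "bisque", "accessURI"))  # one per record
    print(round(estimate_transfer_days(9_990_000, 5), 1))
```

With the counts reported above, 9.99M images at 5 seconds each come to roughly 578 days of sequential download. Rounding the count up to 10M gives closer to 579.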

@jhpoelen
Member Author

jhpoelen commented Jan 9, 2023

fyi @themerekat - I am curious to learn about your plans to migrate the images from Bisque Cyverse to alternate locations. I've also looped in @jbest .

@themerekat

themerekat commented Jan 9, 2023

You'll want to loop Ed Gilbert and Greg Post into this conversation

@jhpoelen
Member Author

jhpoelen commented Jan 9, 2023

as @themerekat suggested -

Ed @egbot / Greg @GregPost-ASU - what are your plans to migrate the images from Cyverse before their contract expires? How are you planning to prevent this kind of situation in the (near) future?

I am assuming that image storage services will continue to come and go.

@jbest

jbest commented Jan 9, 2023

@GregPost-ASU, @egbot, @themerekat I'll add that the time for image download that @jhpoelen mentioned (5sec/image) is based on accessing the images using the public URL available in the SERNEC image records. Presumably there will be a much faster alternative for retrieving and copying images to a new platform.

@jhpoelen
Member Author

jhpoelen commented Jan 9, 2023

@jbest yes, the transfer rate estimates are based on measurements from the perspective of an unprivileged user using public access methods [1]. I am curious to learn more about other ways to access the referenced image content.

references

[1] Botanical Research Institute Texas (BRIT): Origins of BRIT collection records and associated images tracked in period 2022-06/2022-07. hash://sha256/76d40abccfc71bc2cdaf4ea4a6003b9ac49123b27abe9f0d81e233299baf5e94 https://github.com/bio-guoda/preston-brit-2022 https://linker.bio/hash://sha256/76d40abccfc71bc2cdaf4ea4a6003b9ac49123b27abe9f0d81e233299baf5e94

@GregoryPost

@jhpoelen, @jbest We are working closely with CyVerse on how to migrate the data. We should be able to transfer directly from CyVerse's backend storage platform (vs. going through Bisque) so we expect the transfer to go pretty quickly.

@jhpoelen
Member Author

@GregPost-ASU great to hear alternate methods exist to access the images.

Can you elaborate on how to access the original images and bypass bisque?

Also, how are you planning to keep the various image size renderings up and running (e.g., thumbnails)?

And, how would you verify that your migration is actually complete?

And, how are you planning to redirect the referenced image urls embedded in previously published dwc-a to their new content location?

Many questions, and I am very interested in this process, as I expect this to happen over and over again as image services go belly up or get retired.
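
One way to approach the completeness question above, sketched here under stated assumptions: record a SHA-256 content hash for each image before the move, then recompute the hash of each migrated copy and compare. The `hash://sha256/...` identifier form follows the one used in reference [1] earlier in this thread. The manifest structure and function names are hypothetical and are not taken from Preston or from any CyVerse tooling.

```python
import hashlib
from typing import Dict, List


def content_id(data: bytes) -> str:
    """Return a hash://sha256/ identifier for the given bytes."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def find_unverified(manifest: Dict[str, str], migrated: Dict[str, bytes]) -> List[str]:
    """List original URLs whose migrated copy is missing or differs.

    manifest: original image URL -> content id recorded before migration
    migrated: original image URL -> bytes served from the new location
    (both mappings are hypothetical stand-ins for real storage lookups)
    """
    failures = []
    for url, expected in manifest.items():
        data = migrated.get(url)
        if data is None or content_id(data) != expected:
            failures.append(url)
    return failures


if __name__ == "__main__":
    manifest = {
        "https://old.example.org/img/1": content_id(b"image-one"),
        "https://old.example.org/img/2": content_id(b"image-two"),
    }
    migrated = {"https://old.example.org/img/1": b"image-one"}
    # The second image was never copied, so it is reported.
    print(find_unverified(manifest, migrated))
```

An empty result would mean every image listed in the manifest arrived byte-for-byte intact. The same URL-to-hash mapping could also serve as a lookup table for redirecting previously published image URLs, though that is an assumption rather than something stated in this thread.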

@jhpoelen
Member Author

jhpoelen commented Jan 11, 2023

I believe I have answers for all of these questions, and have solutions in place. So, to me, securing image access by performing a verifiable migration (or data tracking) would be a fun and useful exercise to see how the https://github.com/bio-guoda/preston-brit-2022 example scales up to the size of SERNEC. Currently, I don't see any technical issues.

Curious to hear your thoughts.

@jhpoelen
Member Author

jhpoelen commented Mar 7, 2024

Closing issue until @GregPost-ASU @jbest et al. are willing/able to continue.

@jhpoelen jhpoelen closed this as completed Mar 7, 2024