extend BRIT image archival to include SERNEC TCN collections #212
Comments
SERNEC RSS feed -
See the prototype in development at https://github.com/bio-guoda/preston-sernec.
I created https://github.com/bio-guoda/preston-sernec. This repo contains today's snapshot of the dwc-a associated with SERNEC. With that, I was able to estimate the total number of records with bisque images using https://github.com/bio-guoda/preston-sernec/blob/main/list-image-urls.sh. Running ./list-image-urls.sh | tee image-urls.tsv followed by cat image-urls.tsv | grep bisque | wc -l gives 9.99M, with an estimated 3.33M individual records, estimated via cat image-urls.tsv | grep bisque | grep accessURI | wc -l. Given that the image transfer rate of bisque is known to be 1 image per 5 seconds, it will take 10M * 5 / (3600 * 24) = 578 days to migrate all the images.
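For anyone who wants to redo the arithmetic with different numbers, here is a minimal Python sketch of the estimate above. The function name and the parallel_workers parameter are hypothetical and not part of preston-sernec. The sketch assumes a constant per-image rate and, if more than one worker is used, that the rate scales linearly, which has not been measured.

```python
SECONDS_PER_DAY = 3600 * 24


def estimate_transfer_days(image_count, seconds_per_image, parallel_workers=1):
    """Days needed to fetch image_count images at a fixed per-image rate.

    parallel_workers is a hypothetical knob: it assumes throughput scales
    linearly with the number of concurrent downloads (an assumption only).
    """
    if parallel_workers < 1:
        raise ValueError("parallel_workers must be at least 1")
    total_seconds = image_count * seconds_per_image / parallel_workers
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # ~10M bisque image URLs at 1 image per 5 seconds, sequentially
    print(f"{estimate_transfer_days(10_000_000, 5):.1f} days")  # 578.7 days
```

With the figures quoted above (10M images, 5 seconds each, one download at a time) this gives roughly 578 days, matching the back-of-the-envelope number.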
FYI @themerekat - I am curious to learn about your plans to migrate the images from Bisque Cyverse to alternate locations. I've also looped in @jbest.
You'll want to loop Ed Gilbert and Greg Post into this conversation.
As @themerekat suggested - Ed @egbot / Greg @GregPost-ASU - what are your plans to migrate the images from Cyverse before their contract expires? How are you planning to prevent this kind of situation in the (near) future? I am assuming that image storage services will continue to come and go.
@GregPost-ASU, @egbot, @themerekat I'll add that the image download time that @jhpoelen mentioned (5 sec/image) is based on accessing the images through the public URLs available in the SERNEC image records. Presumably there will be a much faster alternative for retrieving and copying images to a new platform.
@jbest yes, the transfer rate estimates are based on measurements from the perspective of an unprivileged user using public access methods [1]. I am curious to learn more about other ways to access the referenced image content.
References
[1] Botanical Research Institute Texas (BRIT): Origins of BRIT collection records and associated images tracked in period 2022-06/2022-07. hash://sha256/76d40abccfc71bc2cdaf4ea4a6003b9ac49123b27abe9f0d81e233299baf5e94 https://github.com/bio-guoda/preston-brit-2022 https://linker.bio/hash://sha256/76d40abccfc71bc2cdaf4ea4a6003b9ac49123b27abe9f0d81e233299baf5e94
@GregPost-ASU great to hear alternate methods exist to access the images. Can you elaborate on how to access the original images and bypass bisque? Also, how are you planning to keep the various image size renderings (e.g., thumbnails) up and running? And how would you verify that your migration is actually complete? And how are you planning to redirect the image URLs embedded in previously published dwc-a to their new content location? Many questions, and I am very interested in this process, as I expect this to happen over and over again as image services go belly up or get retired.
I believe I have answers for all of these questions, and have solutions in place. So, to me, securing image access by performing a verifiable migration (or data tracking) would be a fun and useful exercise to see how the https://github.com/bio-guoda/preston-brit-2022 example would scale up to the size of SERNEC. Currently, I don't see any technical issues. Curious to hear your thoughts.
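To make "verifiable migration" a bit more concrete, here is a minimal Python sketch of the general idea: record a content hash for each image at its old URL, then check that the migrated copy hashes to the same value. This is an illustration only and not how Preston is implemented. The hash://sha256/ form follows the identifiers used in this thread, while the function names, the dictionaries, and the example URLs are hypothetical.

```python
import hashlib
from typing import Dict, List


def content_id(data: bytes) -> str:
    """Return a hash://sha256/ identifier for the given bytes."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def find_unverified(expected: Dict[str, str], migrated: Dict[str, bytes]) -> List[str]:
    """List URLs whose migrated content is missing or differs from the record.

    expected: original image URL -> content id recorded before migration
    migrated: original image URL -> bytes retrieved from the new location
    (both mappings are hypothetical stand-ins for real tracking data)
    """
    return [
        url
        for url, recorded_id in expected.items()
        if url not in migrated or content_id(migrated[url]) != recorded_id
    ]


if __name__ == "__main__":
    # Made-up example data, for illustration only
    original = {"https://example.org/a.jpg": b"image-a", "https://example.org/b.jpg": b"image-b"}
    recorded = {url: content_id(data) for url, data in original.items()}
    copied = {"https://example.org/a.jpg": b"image-a"}  # b.jpg was never copied
    print(find_unverified(recorded, copied))
```

A migration would count as complete under this check only when the returned list is empty.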
Closing issue until @GregPost-ASU @jbest et al. are willing/able to continue.
fyi @jbest
South East Regional Network of Expertise and Collections (SERNEC) Thematic Collection Network (TCN), a collaboration that is digitizing and making data accessible for over 3 million plant specimens.