thumbnail and high quality image urls appear not to be included #3
Comments
image count before fix: image count after fix: $ zcat image-urls.tsv.new.gz | wc -l |
For image preservation purposes, I don't think it's critical to retrieve the thumbnailAccessURI and goodQualityAccessURI images because those are lower resolution images derived from the full resolution image. So as long as you retrieve the full resolution image, you can regenerate the lower resolutions. That said, generating those can take some time, and if you're relying on the corpus as a full backup and/or a method for rapid recovery if the main repository is lost, then it would be worth retrieving those. |
@jbest thanks for responding, and yes, I agree that the higher resolution images should have priority over thumbnail and/or lower resolution ones. I notice that the high quality image urls already include some kind of pre-processing. Is there a way to get to the raw original? |
Would:
with
be the unaltered original of a reformatted image at
with
? Oh, btw - do you want your coffee and cookie back now that we realized the work was not quite completed yet? |
Yes, the first URL is the unaltered image. I didn't realize the Bisque URLs we had in the image records were resized, and that was probably a big factor in why the size calculation was off. And no, the coffee and cookie are a small price to pay for working through the details of this dataset! Now that we know this, what is the next step? I think the ideal would be for the URL to be corrected in the portal so that the full, unprocessed image is used, and preston can then re-index and retrieve the updated images. I'd be interested in whether it would be better to re-index on your end and then ship a drive, or whether I could start with the current corpus and run preston on my end to get the new images. |
@jbest if it is doable to update the high quality urls to their raw bisque locations, and push an updated dwca with these uris in it, we can re-index the entire thing at whatever location would work best for you. The neat thing about doing things across different locations is that you get additional peer review just by transferring the files from A to B. But, in my mind, the first step is to figure out what to index and from where. Can you update the brit dwca easily? |
Also, I think it'd be neat to repeat this exercise with another friendly herbarium collection, and perhaps even establish some kind of data review / archive protocol. But . . . I feel I am running ahead of myself here. . . |
Perhaps the Belgians are interested . . . fyi @qgroom @matdillen @PietrH |
Note that:
produced:
whereas
produced
The content disposition tags, generated by the server hosting the images
persuades a browser to try and render the content in the browser. Whereas
prompt a browser to offer a file download instead of attempting to render the image in the browser. The expectation is that, on changing the server configuration to return:
instead (note no "attachment" mentioned), a web browser would also try to render the image in place. |
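For what it's worth, here is a minimal Python sketch of the inline-versus-attachment distinction described above. The function name `would_download` and the header values are hypothetical illustrations, not the headers actually returned by the image servers in this thread, and the logic is a simplification of how browsers treat the Content-Disposition header.

```python
from typing import Optional


def would_download(content_disposition: Optional[str]) -> bool:
    """Return True if the header asks the browser to offer a file download.

    Simplified: only the disposition type (the part before the first ';')
    is inspected. A missing header is treated as "render in place".
    """
    if content_disposition is None:
        return False
    disposition_type = content_disposition.split(";", 1)[0].strip().lower()
    return disposition_type == "attachment"


# Hypothetical header values, for illustration only.
print(would_download('attachment; filename="example.jpg"'))
print(would_download('inline; filename="example.jpg"'))
```

Under these assumptions, dropping the "attachment" disposition type from the server response is what would let the browser render the image in place.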
Note that Bisque source code is available at https://github.com/UCSB-VRL/bisqueUCSB . |
Hey @jbest The cyverse folks have been quite helpful in getting me started with an alternate method to access the bisque image originals. However, for some reason, I was unable to easily find the related files in the associated sftp shared folders. I probably missed something. I started another tracking session with estimated speeds of about 1 image / s, about 5x faster than before. With about 600k bisque hosted images in this dataset, we'd have 600k s ~ 167 hours, about a week. Not bad, right? Perhaps just in time to present something at Digital Data 2023 at ASU in June? However, I did find another method that appears to download bisque image blobs without doing any kind of processing. Is there a way for you to confirm the authenticity of the attached image retrieved from bisque against your originals? Or do we have to trust bisque hosted content? Here, a sha256 hash embedded in the DwC-A media table would be helpful. |
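The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch: the function name `estimate_duration` is made up for illustration, and it assumes a constant rate of 1 image / s with no retries or downtime.

```python
def estimate_duration(image_count, images_per_second):
    """Return (hours, days) needed to track image_count images at a constant rate."""
    seconds = image_count / images_per_second
    hours = seconds / 3600
    days = hours / 24
    return hours, days


hours, days = estimate_duration(600_000, 1.0)
print(f"{hours:.0f} hours, {days:.1f} days")  # prints: 167 hours, 6.9 days
```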
with content retrieved from: https://bisque.cyverse.org/blob_service/00-B3BAEtVZrvdsLEXFQhKpeG with alternate location at: https://linker.bio/hash://sha256/388e45da04899cd26b68dbd90c30c470882eadcd8e96ae455559095c17e75bcc also, see attached retrieved via
|
With associated record citing derived (processed) image locations -
{
"http://www.w3.org/ns/prov#wasDerivedFrom": "line:zip:hash://sha256/734d4cdca40b737e39ecba46b40bb3ca324bb3404170dfdb46c94102f85a9776!/multimedia.csv!/L129",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "http://rs.tdwg.org/ac/terms/Multimedia",
"http://rs.tdwg.org/dwc/text/coreid": "6825982",
"http://ns.adobe.com/xap/1.0/rights/UsageTerms": "CC BY-NC-SA (Attribution-NonCommercial-ShareAlike)",
"http://rs.tdwg.org/ac/terms/providerManagedID": "urn:uuid:b7013043-77e5-4a52-ae99-58f9003cfb11",
"http://rs.tdwg.org/ac/terms/associatedSpecimenReference": "https://sernecportal.org/portal/collections/individual/index.php?occid=6825982",
"http://purl.org/dc/terms/rights": "http://creativecommons.org/licenses/by-nc/3.0/",
"http://rs.tdwg.org/ac/terms/subtype": "Photograph",
"http://purl.org/dc/terms/identifier": "https://bisque.cyverse.org/image_service/image/00-B3BAEtVZrvdsLEXFQhKpeG/resize:3744/format:jpeg",
"http://rs.tdwg.org/ac/terms/metadataLanguage": "en",
"http://ns.adobe.com/xap/1.0/rights/Owner": "Vanderbilt University Herbarium (VDB)",
"http://rs.tdwg.org/ac/terms/comments": null,
"http://ns.adobe.com/xap/1.0/rights/WebStatement": null,
"http://ns.adobe.com/xap/1.0/MetadataDate": "2018-01-16 15:22:49",
"http://rs.tdwg.org/ac/terms/thumbnailAccessURI": "https://bisque.cyverse.org/image_service/image/00-B3BAEtVZrvdsLEXFQhKpeG/thumbnail:200,200",
"http://purl.org/dc/elements/1.1/creator": null,
"http://rs.tdwg.org/ac/terms/caption": null,
"http://purl.org/dc/terms/type": "StillImage",
"http://purl.org/dc/terms/format": "image/jpeg",
"http://rs.tdwg.org/ac/terms/goodQualityAccessURI": "https://bisque.cyverse.org/image_service/image/00-B3BAEtVZrvdsLEXFQhKpeG/resize:1250/format:jpeg",
"http://rs.tdwg.org/ac/terms/accessURI": "https://bisque.cyverse.org/image_service/image/00-B3BAEtVZrvdsLEXFQhKpeG/resize:3744/format:jpeg"
} |
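Based on the single example in this thread, the processed image_service locations and the unprocessed blob_service location appear to share the same resource id (here 00-B3BAEtVZrvdsLEXFQhKpeG). A minimal Python sketch of that mapping follows. The function name `blob_url` is hypothetical, and the URL pattern is inferred from this one record rather than taken from BisQue documentation, so treat it as an assumption to verify.

```python
import re
from typing import Optional

# Pattern inferred from the record above; not a documented BisQue contract.
IMAGE_SERVICE = re.compile(
    r"^https://bisque\.cyverse\.org/image_service/image/([^/]+)(?:/.*)?$"
)


def blob_url(image_service_url: str) -> Optional[str]:
    """Map a processed image_service URL to its assumed blob_service URL.

    Returns None when the URL does not look like a bisque image_service URL.
    """
    match = IMAGE_SERVICE.match(image_service_url)
    if match is None:
        return None
    return f"https://bisque.cyverse.org/blob_service/{match.group(1)}"


print(blob_url(
    "https://bisque.cyverse.org/image_service/image/"
    "00-B3BAEtVZrvdsLEXFQhKpeG/resize:3744/format:jpeg"
))
```

With this mapping, the accessURI, goodQualityAccessURI, and thumbnailAccessURI values of the record above would all resolve to the same blob_service location.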
@jhpoelen I can confirm that the file retrieved at: Note this is not the same file as attached above in this thread (BRIT67503). |
Good to hear that you were independently able to confirm that bisque is serving the unaltered file (aside from a filename difference)!
Yes it is a different file, I just picked the first tracked image that the new brit-bisque tracker picked up, which happened to be BRIT67545.jpg . Apologies for the confusion. |
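For anyone who wants to repeat this kind of check, here is a minimal Python sketch of comparing a local file against a hash://sha256/ content id such as the ones cited in this thread. The function names are made up for illustration, and this is not necessarily how the confirmation above was done.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_content_id(path, content_id):
    """Return True if the file's sha256 matches a hash://sha256/<hex> id."""
    prefix = "hash://sha256/"
    if not content_id.startswith(prefix):
        raise ValueError("expected a content id of the form hash://sha256/<hex>")
    return sha256_of_file(path) == content_id[len(prefix):].lower()
```

For example, calling `matches_content_id` with the path of a locally held original and the content id of the copy retrieved from bisque would return True only if the two files are byte-for-byte identical.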
Note that https://linker.bio/hash://sha256/388e45da04899cd26b68dbd90c30c470882eadcd8e96ae455559095c17e75bcc renders the image straight in the browser, no file download. |
So far,
about 10% of all bisque related images have been tracked, with only about 10 unresponsive endpoints. So, assuming the past is a predictor of the future, the estimate would be more like 2-3 weeks to get the images tracked at least once. |
Current status - about 260k images resolved:
with an estimated 11 images to be missing or temporarily unavailable.
I had to expand my server storage to 10TB at about 20 EUR a month, adding about 10 EUR a month to my overhead. Trying to think of ways to use this example to get other collections access to resilient image storage while being able to switch to a more suitable storage solution when needed / possible. Ideally, an image storage migration would not affect the way the images are referenced in digital collections as published through formats like DwC-A. |
Here's a recently resolved image: https://linker.bio/hash://sha256/c82c9907154408d28d9736fc80e767019805b2a0423c3c2449fb83ffb0577cb0 as retrieved via https://bisque.cyverse.org/blob_service/00-4VVhJoR9oagYt245JTCEG9 as documented in line 909 of hash://sha256/6734845363255328f82a3a13b8371102f7099eef8a187f8564808e587cb3dae8 or
with dynamically generated thumbnail available at
|
@jbest status update . . . Current index reference status (or "head") of BRIT Bisque hosted image indexing obtained
yielded:
for which the number of "hasVersion" statements, relating locations to their observed content ids, is now up to about 440k . . .
yielded:
|
Now, after completing the tracking of the BRIT images via the BisQue endpoint, the brit-bisque corpus has current version:
of
and retrieving the content tracked in this version and their dependencies
yielded:
|
fyi @jbest
As I was working on hashing images from your colleagues in Denver (see bio-guoda/preston#193), I noticed a bug in the image url listing selection used for compiling the BRIT image corpus.
For related fix see bio-guoda/preston-dbg-2022@ea0ff06 .
It appears that only images with a url containing resize:1250 were indexed and hashed, so instead of indexing all of the image properties (e.g., "http://rs.tdwg.org/ac/terms/accessURI" | "http://rs.tdwg.org/ac/terms/thumbnailAccessURI" | "http://rs.tdwg.org/ac/terms/goodQualityAccessURI"), only the accessURI appears to have been selected. Here are some of the diffs in the image urls for the first 10 images extracted.
This may account for lower expected volume of the images.
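To illustrate the intended behavior (not the actual fix in the commit referenced above), here is a minimal Python sketch that collects the urls from all three image properties of a multimedia record instead of filtering on a url substring. The function name `image_urls` and the example record are hypothetical. The record mirrors the shape of the JSON shown in this thread.

```python
URI_TERMS = (
    "http://rs.tdwg.org/ac/terms/accessURI",
    "http://rs.tdwg.org/ac/terms/thumbnailAccessURI",
    "http://rs.tdwg.org/ac/terms/goodQualityAccessURI",
)


def image_urls(record):
    """Return the distinct, non-empty image urls of a record, in term order."""
    urls = []
    for term in URI_TERMS:
        value = record.get(term)
        if value and value not in urls:
            urls.append(value)
    return urls


# Hypothetical record for illustration; example.org urls are placeholders.
record = {
    "http://rs.tdwg.org/ac/terms/accessURI": "https://example.org/image/1/resize:3744/format:jpeg",
    "http://rs.tdwg.org/ac/terms/thumbnailAccessURI": "https://example.org/image/1/thumbnail:200,200",
    "http://rs.tdwg.org/ac/terms/goodQualityAccessURI": "https://example.org/image/1/resize:1250/format:jpeg",
}
for url in image_urls(record):
    print(url)
```

Selecting by property name rather than by url pattern means a change in the resize parameters would not silently drop images from the listing.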
Apologies for catching this late. I've fixed the issue, and I'm happy to re-run the indexing process. And . . . if performance is the same as last time, this would take about 2 months.
Curious to hear your thoughts!