Data Capture Module (rsync support) #3145

Closed
pdurbin opened this issue May 27, 2016 · 6 comments

@pdurbin
Member

pdurbin commented May 27, 2016

@pameyer, @bmckinney, and I met yesterday to discuss what we're calling a "Data Capture Module" or "DCM" for short. http://guides.dataverse.org/en/4.3.1/installation/prep.html#architecture-and-components lists a number of optional components for Dataverse (Shibboleth, rApache, Rserve, Geoconnect, etc.) and "Data Capture Module" will be added to the list. The DCM's main role in the architecture is facilitating large file transfer (#952), especially via non-HTTP mechanisms such as rsync.

The Minimum Viable Product (MVP) for the Data Capture Module includes support for rsync (#2960) but other mechanisms are under consideration such as Globus (#2728, #952), Aspera, and SFTP. https://data.sbgrid.org already supports rsync and we expect to be reusing code from that service, cleaning it up and generalizing it.

The task list for the Data Capture Module is still very much in flux but I'm creating this issue so that I have an issue number to associate a branch with as I start committing some code on the Dataverse side, especially API endpoints and the ability for Dataverse to talk to the DCM.

@pameyer
Contributor

pameyer commented May 27, 2016

For the DCM (and data uploads generally), "rsync" is shorthand for rsync over ssh. Client-side checksums are also part of the DCM: we'll have to decide how we want to handle the difference in hash functions (switching hashes or multi-hash support in Dataverse). Sorting that out might be out of scope for the DCM MVP.

pdurbin added a commit to pdurbin/dataverse that referenced this issue May 31, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 16, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 16, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 17, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 17, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 21, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 21, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 21, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 22, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 23, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 23, 2016
@pdurbin
Member Author

pdurbin commented Jun 24, 2016

@bmckinney and I met yesterday (notes at https://docs.google.com/document/d/1BSVqAqsc_KieqfFfg_CeKdV7HwDO1Y3UuC2VMO-RJDk/edit?usp=sharing ). I demo'ed https://github.com/pdurbin/dataverse/tree/3145-dcm and he's going to try to merge that branch with https://github.com/bmckinney/bio-dataverse/tree/feature/file-system-import so we can deploy the combined code at https://dv.sbgrid.org and hopefully get closer to a prototype of rsync support. I expect we'll need help from @pameyer to switch from my mock version of the Data Capture Module at https://github.com/sbgrid/data-capture-module/blob/master/api/dcm.py to more of the real thing. All code mentioned above is very preliminary at this point. We still need to meet with @landreev to discuss how to make rsync support compatible with file versioning.

@djbrooke
Contributor

(Note to self, mostly) This is a parent issue of the items created and estimated in the 9/8 meeting, notes recorded here:

https://docs.google.com/document/d/1wWSdKUOGA1L7UqFsgF3aOs8_9uyjnVpsPAxk7FObOOI/edit

These will be created as new GitHub issues and linked here.

@pdurbin
Member Author

pdurbin commented Sep 13, 2016

@djbrooke thanks! Here are the related issues we created today:

pdurbin added a commit that referenced this issue Sep 20, 2016
A dependency for rsync support (#3145) is the ability to persist SHA-1
checksums for files rather than MD5 checksums.

A new installation-wide configuration setting called
":FileFixityChecksumAlgorithm" has been added which can be set to
"SHA-1" to have Dataverse calculate and show SHA-1 checksums rather than
MD5 checksums.

In order to run this branch you must run the provided SQL upgrade
script: scripts/database/upgrades/3354-alt-checksum.sql

In addition, the Solr schema should be updated to the version in this
branch.
@pdurbin
Member Author

pdurbin commented Oct 30, 2016

#3249 is highly related in that, ultimately, end users will need to know how to download the data via rsync or another mechanism. The focus of this issue to date has been researchers uploading data, not end users downloading it.

@pdurbin
Member Author

pdurbin commented Jun 28, 2017

These days we're working in small chunks. To follow along, start at the next small chunk that's currently in the backlog: #3942. Closing.
