Add ability to upload files immediately to S3 #1399

Merged
19 commits merged into master from s3uploader-upload-immediately on Feb 27, 2017

Conversation

PtrTeixeira
Contributor

The change has two parts: it adds an optional `uploadImmediately` field to the uploader metadata that, when present and set, causes the uploader to upload its files immediately on change, and it adds code within the on-detection function `handleNewOrModifiedS3Metadata` to perform the upload when that field is detected.

This change in particular involved refactoring the vast majority of the code in `checkUploads` into a separate function. I wanted to avoid duplicating the logic of `checkUploads` in `handleNewOrModifiedS3Metadata`. If I called `checkUploads` directly, however, it would upload all of the files that were queued to upload, rather than just the files marked to upload immediately. So I made a new function that just takes a map of the files to upload and uploads only those files. When it needs to upload all of the queued files, as in `checkUploads`, it can work directly on the `metadataToUploader` field, and when it only needs to upload a single block of files, it can be passed just that single uploader/metadata.
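To illustrate that split (the class, interface, and method names below, such as `uploadFilesForUploaders` and `uploadQueuedFiles`, are placeholders rather than the actual Singularity code), the shared helper could look roughly like this:

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class S3UploaderDriverSketch {

  // Stand-ins for the project's S3UploadMetadata and SingularityS3Uploader classes.
  interface UploadMetadata {}
  interface Uploader {
    int uploadQueuedFiles(); // placeholder for the per-uploader upload work
  }

  // Analogous to the metadataToUploader field: everything currently queued.
  private final Map<UploadMetadata, Uploader> metadataToUploader = new ConcurrentHashMap<>();

  // Scheduled path (checkUploads): upload for every queued uploader.
  public int checkUploads() {
    return uploadFilesForUploaders(metadataToUploader);
  }

  // Immediate path: upload only the single uploader/metadata pair that was just detected.
  public int performImmediateUpload(UploadMetadata metadata, Uploader uploader) {
    return uploadFilesForUploaders(Collections.singletonMap(metadata, uploader));
  }

  // Shared logic extracted from checkUploads: uploads only the files for the given uploaders.
  private int uploadFilesForUploaders(Map<UploadMetadata, Uploader> uploaders) {
    int uploadedFiles = 0;
    for (Uploader uploader : uploaders.values()) {
      uploadedFiles += uploader.uploadQueuedFiles();
    }
    return uploadedFiles;
  }
}
```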

@ssalinas
Member

Commenting here to document a chat we had in person. Decided to go a slightly different route and have the immediate uploads submit their own future to the executorService, rather than trying to hook something immediate into the runs of the scheduledExecutorService. @PtrTeixeira is making updates and we'll review again once those are up.

As I had originally written the code, the main thread, which polls the filesystem for metadata changes, was blocked waiting for the files to finish uploading. This moves the code that depends on the uploads finishing (namely, logging that they finished and closing the uploaders) into callbacks on the futures. This change also no longer places any uploaders into the class-scoped expiredUpload queue.
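A minimal sketch of that shape, assuming a Guava `ListeningExecutorService`; the method and helper names here (`submitImmediateUpload`, `closeQuietly`) are illustrative, not the actual Singularity code:

```java
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;

import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;

public class ImmediateUploadCallbackSketch {
  private final ListeningExecutorService executorService =
      MoreExecutors.listeningDecorator(Executors.newCachedThreadPool());

  // Submit the immediate upload as its own task so the poller thread is never
  // blocked waiting on it; completion handling lives entirely in the callback.
  public void submitImmediateUpload(Callable<Integer> uploadTask, Closeable uploader) {
    ListenableFuture<Integer> future = executorService.submit(uploadTask);

    Futures.addCallback(future, new FutureCallback<Integer>() {
      @Override
      public void onSuccess(Integer uploadedCount) {
        // Log completion and close the uploader here, off the poller thread.
        closeQuietly(uploader);
      }

      @Override
      public void onFailure(Throwable t) {
        // Log the failure; the uploader is still cleaned up.
        closeQuietly(uploader);
      }
    }, executorService);
  }

  private static void closeQuietly(Closeable closeable) {
    try {
      closeable.close();
    } catch (IOException ignored) {
      // nothing useful to do in a sketch
    }
  }
}
```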
I modified the S3UploadMetadata resource to add a new optional field, but failed to modify any of the constructor calls to guarantee that it actually worked. This commit _should_ make it so that the program actually compiles.
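For reference, a minimal sketch of what an optional `uploadImmediately` field on the metadata object could look like, assuming Jackson-style deserialization and a Guava `Optional`; the real S3UploadMetadata has many more fields and this shape is illustrative only:

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.google.common.base.Optional;

public class S3UploadMetadataSketch {
  private final String directory;
  private final Optional<Boolean> uploadImmediately;

  @JsonCreator
  public S3UploadMetadataSketch(@JsonProperty("directory") String directory,
                                @JsonProperty("uploadImmediately") Boolean uploadImmediately) {
    this.directory = directory;
    // Absent (not false) when the field is missing, so existing metadata files keep working unchanged.
    this.uploadImmediately = Optional.fromNullable(uploadImmediately);
  }

  public String getDirectory() {
    return directory;
  }

  public Optional<Boolean> getUploadImmediately() {
    return uploadImmediately;
  }

  // Convenience: treat a missing field the same as false.
  public boolean isUploadImmediately() {
    return uploadImmediately.or(Boolean.FALSE);
  }
}
```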
@PtrTeixeira
Contributor Author

cc @ssalinas Could you take a look?

@ssalinas
Member

@PtrTeixeira looks good overall. For ease of debugging when you try this out, I'd add some info or debug logging near the beginning of performImmediateUpload. Would give us a better idea that everything is being triggered appropriately rather than just waiting to see if things uploaded or waiting for the callback to trigger.

Add some additional logging statements to help track the progress of the
program before and through the callbacks on the immediate uploader.
Hopefully this should make it easier to debug the program.
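For example, the kind of early logging being suggested might look like this (the logger setup, method signature, and message are illustrative):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ImmediateUploadLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(ImmediateUploadLoggingSketch.class);

  // Log as soon as the immediate path is entered, so it is easy to confirm the
  // trigger fired without waiting for the upload callback to come back.
  public void performImmediateUpload(Object metadata) {
    LOG.debug("Immediate upload triggered for metadata {}", metadata);
    // ... submit the upload task and attach the completion callbacks ...
  }
}
```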
@ssalinas ssalinas modified the milestone: 0.14.0 Jan 18, 2017
Depending on the settings, the immediate uploader may still get added to
the expiring list. The change guarantees that it is always removed from
the expiring list, in order to prevent a memory leak in the application.
In general, the upload driver wants to avoid re-adding modified existing metadata, because it doesn't want to create a new uploader for it and, for the most part, it doesn't actually look at the contents of the metadata file, delegating that to the uploader itself. So on the whole it ignores existing metadata files on change. In this particular case, though, it needs to pick up changes in the metadata file, because it should still perform the upload as soon as possible.

This change just forces the upload: if an existing metadata file has been changed so that it requires immediate upload, the driver will upload the indicated files. As a heads-up, if you do this (modify an existing metadata file), the metadata file will still be deleted after the upload is performed, so if you want to keep it around you will need to re-create it.
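A rough sketch of that branch; the helper names (`readMetadata`, `getOrCreateImmediateUploader`) and the inline delete are illustrative, not the actual driver code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public abstract class ModifiedMetadataSketch {

  // Stand-ins for the project's metadata and uploader types.
  interface UploadMetadata {
    boolean isUploadImmediately();
  }
  interface Uploader {
    void uploadQueuedFiles();
  }

  // Rough shape of the decision for a changed metadata file: existing metadata is
  // normally ignored on change, but uploadImmediately forces the upload anyway.
  void handleModifiedMetadata(Path metadataFile, boolean alreadyKnown) throws IOException {
    UploadMetadata metadata = readMetadata(metadataFile);

    if (alreadyKnown && !metadata.isUploadImmediately()) {
      return; // leave the existing uploader alone; it owns the details
    }

    if (metadata.isUploadImmediately()) {
      Uploader uploader = getOrCreateImmediateUploader(metadata);
      uploader.uploadQueuedFiles();
      // The metadata file ends up deleted after the upload (shown inline here for
      // illustration), so it must be re-created to trigger another upload later.
      Files.deleteIfExists(metadataFile);
    }
  }

  // Hypothetical helpers, not the actual Singularity code.
  protected abstract UploadMetadata readMetadata(Path metadataFile) throws IOException;

  protected abstract Uploader getOrCreateImmediateUploader(UploadMetadata metadata);
}
```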
@ssalinas ssalinas added the hs_qa label Jan 23, 2017
@ssalinas
Member

FYI @PtrTeixeira I merged #1391, so you should be able to resolve the merge conflicts on this one now.

@ssalinas ssalinas force-pushed the s3uploader-upload-immediately branch from c92b0d0 to 8650654 Compare February 8, 2017 15:58
PtrTeixeira and others added 9 commits February 8, 2017 11:40
Rename metric for immediate uploaders
There _appears_ to be a threading issue where, if the filesystem poller happens to run at the same time as an immediate uploader is running, a second immediate uploader can be created for the same metadata. Typically this would be prevented by the metadata-to-uploader map, but immediate uploaders are removed from (or never added to) that map, so there was no check for an existing uploader for a given metadata file.

This polls the list of immediate uploaders to check their metadata, and will refuse to create another uploader if there is an existing uploader with the same metadata file.
This duplicates, for the immediate uploaders, the same kind of structure that we are currently using to keep track of the normal uploaders. That is, I added another map from metadata to immediate uploaders. The map is checked before creating a new uploader and is only cleared when the existing uploader is definitely done and has been removed. Hopefully this should prevent the creation of duplicate immediate uploaders in a more robust way than the prior attempt, which might not maintain the block for the lifetime of the first immediate uploader.
The way that we block attempts to create duplicate immediate uploaders is by keeping a list of metadata objects that we have immediate uploaders for, and refusing to create a new uploader if that list already contains the metadata file associated with the new uploader. We need to perform the check-and-add operation atomically, to prevent two attempts to create an immediate uploader for the same metadata from interleaving in such a way that both uploaders are created. To do so, this moves the check-and-add operation into a locked block of code.
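A minimal sketch of that atomic check-and-add guard, assuming a plain lock object and a map keyed by metadata; the names are illustrative rather than the actual Singularity fields:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ImmediateUploaderGuardSketch {

  // Stand-ins for the project's metadata and uploader types.
  interface UploadMetadata {}
  interface Uploader {}

  private final Object immediateUploaderLock = new Object();
  private final Map<UploadMetadata, Uploader> metadataToImmediateUploader = new HashMap<>();

  // Check-and-add happens under one lock, so two concurrent detections of the same
  // metadata file cannot both decide to build an immediate uploader.
  public Optional<Uploader> maybeRegisterImmediateUploader(UploadMetadata metadata, Uploader candidate) {
    synchronized (immediateUploaderLock) {
      if (metadataToImmediateUploader.containsKey(metadata)) {
        return Optional.empty(); // an immediate uploader already exists for this metadata
      }
      metadataToImmediateUploader.put(metadata, candidate);
      return Optional.of(candidate);
    }
  }

  // The entry is removed only once the uploader is definitely finished, keeping the
  // guard in place for the whole lifetime of the first immediate uploader.
  public void onImmediateUploaderFinished(UploadMetadata metadata) {
    synchronized (immediateUploaderLock) {
      metadataToImmediateUploader.remove(metadata);
    }
  }
}
```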
Log the metadata involved when attempting to get a lock on creating a
new immediate SingularityS3Uploader.
…oaders

Prevent building duplicate immediate uploaders
@ssalinas ssalinas merged commit 1a7b7c4 into master Feb 27, 2017
@ssalinas ssalinas deleted the s3uploader-upload-immediately branch February 27, 2017 13:45