-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement parallel operations #2067
feat: implement parallel operations #2067
Conversation
Opening as a draft PR while I add in JSDoc comments so that I can start gathering feedback / initial thoughts. |
Repointing this at a new branch named |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple passthrough comments/questions :)
3e26dd8
to
ad6aadd
Compare
ad6aadd
to
1cc8bae
Compare
* fix: more work on transfer manager perf metrics * fix unit tests for tm
* refactor: refactor constants * bug fixes * bug fixes * bug fixes
81f7850
to
c153ab6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments, but generally looks good. Thanks, Denis!!
* ``` | ||
* @experimental | ||
*/ | ||
async downloadManyFiles( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do other lanauges have this as a list of objects? or is a list of strings also an option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python does it as a list of names. If desired we could move to that as well. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two use cases are
- download "directory"
- download list of files
Not sure which is more common (question for @domZippilli) but it would be great to support both (or show users how to do both)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've modified the signature to now accept an array of files, an array of strings, or a single string. If a single string is provided it will be treated as a GCS prefix to call getFiles
. Utilizing this prefix will allow for an entire folder structure to be downloaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's probably the right idea. It sounds like an idiomatic variation on what I wrote before I saw @ddelgrosso1's reply.
A directory is probably the most common set of files people upload. But of course you'd want to think about this as a list of files to be extensible. To me, the logical progression is something like "I want to upload a directory" -> "I want to form a glob of file names" -> "I want to upload a list of files". So solving the last problem lets you frame the others as variations on it. For example, make a convenience method that accepts a directory name, and just constructs the list of files to pipe into the actual implementation.
assert.strictEqual(count, paths.length); | ||
}); | ||
|
||
it('sets ifGenerationMatch to 0 if skipIfExists is set', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a test for skipIfExists
set to true, the file exists, and it errors? We should verify that it doesnt stop the execution of the code and just moves onto the next file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to best handle this as during a unit test since we aren't calling the actual service.
* ``` | ||
* @experimental | ||
*/ | ||
async downloadManyFiles( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two use cases are
- download "directory"
- download list of files
Not sure which is more common (question for @domZippilli) but it would be great to support both (or show users how to do both)
Here is the summary of changes. You are about to add 3 region tags.
This comment is generated by snippet-bot.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, a few thoughts/comments. Gonna play with it some more tomorrow.
* limitations under the License. | ||
*/ | ||
|
||
// eslint-disable-next-line node/no-unsupported-features/node-builtins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this file requires worker_theads
and not all features were available in the initial release of v12
. If I remove this the CI pipeline starts to complain if I remember correctly.
* @experimental | ||
*/ | ||
async uploadManyFiles( | ||
filePaths: string[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Optional] We have an opportunity to get fancy here to really emphasize throughput; what if we allowed iterators and async iterators?
Use cases:
- walking large directory you want to walk asynchronously (reduces memory usage)
- situations where one may parse a list of files in chunks and want to chain operations together (such as an HTTP server receiving incoming data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably wait until customers ask for this functionality though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a good idea, maybe after the initial release we can circle up. I would definitely be interested in your thoughts on optimizations here.
async downloadFileInChunks( | ||
fileOrName: File | string, | ||
options: DownloadFileInChunksOptions = {} | ||
): Promise<void | DownloadResponse> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are your thoughts on data validation with this method? Since file.download({start, end})
skips validation, we may want to add support or mention it in the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question... I will definitely add a callout in the comments that this will skip validation. Do we currently have a means of validating a file after it has been downloaded? I know you did a bunch of work in this area but I thought it only worked with streamed input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's technically possible to CRC32C chunks and then "sum" them into the final checksum, such that you could do this kind of chunked operation and still checksum without an additional byte scan. If you need MD5 or can't do the previous algorithm, I think validation would require a full read of the completed file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we currently have a means of validating a file after it has been downloaded?
Yep (although we may be able to do this as we're receiving each chunk):
Line 257 in a61e619
static async fromFile(file: PathLike) { |
I know you did a bunch of work in this area but I thought it only worked with streamed input?
It's possible to use CRC32C
with/out streams (HashStreamValidator
is the class that wraps CRC32C
for streams):
Line 134 in a61e619
class CRC32C implements CRC32CValidator { |
nodejs-storage/src/hash-stream-validator.ts
Lines 30 to 34 in a61e619
class HashStreamValidator extends Transform { | |
readonly crc32cEnabled: boolean; | |
readonly md5Enabled: boolean; | |
#crc32cHash?: CRC32CValidator = undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we comfortable with only allowing CRC32C. In my mind something is better than nothing... I can take a look at implementing it if we feel strongly this should be included.
const fileToWrite = await fsp.open(filePath, 'w+'); | ||
while (start < size) { | ||
const chunkStart = start; | ||
let chunkEnd = start + chunkSize - 1; | ||
chunkEnd = chunkEnd > size ? size : chunkEnd; | ||
promises.push( | ||
limit(() => | ||
file.download({start: chunkStart, end: chunkEnd}).then(resp => { | ||
return fileToWrite.write(resp[0], 0, resp[0].length, chunkStart); | ||
}) | ||
) | ||
); | ||
|
||
start += chunkSize; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's stopping the chunks from being written out of order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I did this correctly the position in the file of where to write this chunk is handled by the chunkStart
parameter. Basically this call should write the incoming chunk with no offset in the chunk buffer to the file offset specified by chunkStart
. Was your understanding different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a slightly different understanding here (I could be wrong), but would it be possible for subsequent chunks to download before proceeding chunks and write to the file first?
Say we have a 65MB file and a 64MB chunk size (making 2 chunks to download). Since the 2nd chunk is 1MB and would download first, wouldn't it write to the file before the first chunk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't matter if they are written in order. This implementation is leaning on a filesystem supporting sparse files. So, imagine you wrote these back-to-front; the first chunk will seek to its chunkStart
near the end of the file, and the OS will logically pad that first byte with 0
s up to the seek point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^that's the missing piece for me, thanks Dom!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As Dom outlined this was the behavior I was intending. It shouldn't matter which chunk downloads first they should get written to their correct place in the resultant file. Ultimately if all succeed, the file should be correctly "reassembled".
Since this is getting large and most feedback has been addressed going to go ahead and merge this into the staging branch. We can re-review before the final merge to main. |
* feat: implement parallel operations * add more parallel operations * add header to test file * update import of fs/promises * fix pathing on windows, fix mocking of fs promises * add jsdoc headers to class and uploadMulti * add jsdoc comments to remaining functions * update comment wording * add experimental jsdoc tags * feat: add directory generator to performance test framework * clarify variable names and comments * capitalization * wip: transfer manager performance tests * feat: merged in application performance tests (googleapis#2100) * fix: fixed many bugs (googleapis#2102) * fix: cleaning up bugs * fix: fixed many bugs * fix: more work on transfer manager perf metrics (googleapis#2103) * fix: more work on transfer manager perf metrics * fix unit tests for tm * fix: performance test refactoring, comments (googleapis#2104) * refactor: refactor constants (googleapis#2105) * refactor: refactor constants * bug fixes * bug fixes * bug fixes * linter fixes, download to disk for performance test * rename transfer manager functions * remove callbacks from transfer manager * add more experimental tags, update comments * change signature of downloadManyFiles to accept array of strings or a folder name * linter fix * add transfer manager samples and samples tests Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>
* feat: implement parallel operations * add more parallel operations * add header to test file * update import of fs/promises * fix pathing on windows, fix mocking of fs promises * add jsdoc headers to class and uploadMulti * add jsdoc comments to remaining functions * update comment wording * add experimental jsdoc tags * feat: add directory generator to performance test framework * clarify variable names and comments * capitalization * wip: transfer manager performance tests * feat: merged in application performance tests (googleapis#2100) * fix: fixed many bugs (googleapis#2102) * fix: cleaning up bugs * fix: fixed many bugs * fix: more work on transfer manager perf metrics (googleapis#2103) * fix: more work on transfer manager perf metrics * fix unit tests for tm * fix: performance test refactoring, comments (googleapis#2104) * refactor: refactor constants (googleapis#2105) * refactor: refactor constants * bug fixes * bug fixes * bug fixes * linter fixes, download to disk for performance test * rename transfer manager functions * remove callbacks from transfer manager * add more experimental tags, update comments * change signature of downloadManyFiles to accept array of strings or a folder name * linter fix * add transfer manager samples and samples tests Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>
* feat: implement parallel operations * add more parallel operations * add header to test file * update import of fs/promises * fix pathing on windows, fix mocking of fs promises * add jsdoc headers to class and uploadMulti * add jsdoc comments to remaining functions * update comment wording * add experimental jsdoc tags * feat: add directory generator to performance test framework * clarify variable names and comments * capitalization * wip: transfer manager performance tests * feat: merged in application performance tests (#2100) * fix: fixed many bugs (#2102) * fix: cleaning up bugs * fix: fixed many bugs * fix: more work on transfer manager perf metrics (#2103) * fix: more work on transfer manager perf metrics * fix unit tests for tm * fix: performance test refactoring, comments (#2104) * refactor: refactor constants (#2105) * refactor: refactor constants * bug fixes * bug fixes * bug fixes * linter fixes, download to disk for performance test * rename transfer manager functions * remove callbacks from transfer manager * add more experimental tags, update comments * change signature of downloadManyFiles to accept array of strings or a folder name * linter fix * add transfer manager samples and samples tests Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>
* feat: implement parallel operations (#2067) * feat: implement parallel operations * add more parallel operations * add header to test file * update import of fs/promises * fix pathing on windows, fix mocking of fs promises * add jsdoc headers to class and uploadMulti * add jsdoc comments to remaining functions * update comment wording * add experimental jsdoc tags * feat: add directory generator to performance test framework * clarify variable names and comments * capitalization * wip: transfer manager performance tests * feat: merged in application performance tests (#2100) * fix: fixed many bugs (#2102) * fix: cleaning up bugs * fix: fixed many bugs * fix: more work on transfer manager perf metrics (#2103) * fix: more work on transfer manager perf metrics * fix unit tests for tm * fix: performance test refactoring, comments (#2104) * refactor: refactor constants (#2105) * refactor: refactor constants * bug fixes * bug fixes * bug fixes * linter fixes, download to disk for performance test * rename transfer manager functions * remove callbacks from transfer manager * add more experimental tags, update comments * change signature of downloadManyFiles to accept array of strings or a folder name * linter fix * add transfer manager samples and samples tests Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com> * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * feat: add crc32c validation option to chunked download, cleanup naming (#2110) * feat: add crc32c validation option to chunked download, cleanup naming in perf tests * close file in finally block Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com> Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #1868 🦕
Fixes #2051