Block Uploads vs. Offset Ranges #55
Comments
I think the second solution is better too. The first one is hard to implement, but also brittle as it requires some form of synchronization between the parallel uploaders in order to guarantee they don't write in the same ranges. I wonder if a combination of the solutions makes sense: upon file creation, the client specifies how many parallel blocks it wants to work on. A different thought I had: what if we leave the parallel logic out of the uploading itself, but offer a merge request that lets the server know that several completed uploaded files (in reality they are blocks) should be concatenated?
This is a general "problem" with parallel uploads which occurs for every solution. The file isn't fully uploaded until the last byte is received.
That's a brilliant idea! While keeping the parallel extension spec as small and focused as possible, the parallel uploads can also take advantage of all other features.
@kvz After thinking a bit more about the merge approach, this offers a way to drop the

In addition I had some initial thoughts on how to implement this feature. Here's my first draft using a new
I meant that if you don't work with fixed partitioning, you could repartition so that threads/workers could help out with the remaining bytes of the slowest blocks. But let's not go into this as we both agree the downsides of that outweigh any upside. As for merging 'regular' tus uploads, I'm glad you like it :) To verify I understand, you're saying we can remove

As for the example Gist, it makes sense to me. Should we also define that its named parts are to be removed upon a successful
I agree, this case is too specific.
Blocks with failed checksum validation never get written, and so the offset of the file is never changed. As I said, the only function left to the
In some cases you want to do so and sometimes not (e.g. reuse a partial upload to merge with another). What about keeping this application-specific?
Agreed Marius, unless other contributors have objections I'd say this can be formalized into the protocol.
@Acconut that is awesome that you are tackling the parallel upload problem. While
@vayam Thanks for the feedback, I appreciate it a lot! 👍
This is a concern we have discussed internally, too. For this reason we have allowed removing the protocol and hostname from the URLs (see https://github.com/tus/tus-resumable-upload-protocol/blob/merge/protocol.md#merge-1):
This enables you to use

Assuming a maximum of 4KB of total headers (the default for nginx, see http://stackoverflow.com/a/8623061), the default headers (Host, Accept-, User-Agent, Cache-) take about 300 bytes, leaving enough space to fit about 90 URLs (roughly 3,800 remaining bytes divided by around 40 bytes per relative URL). For my use cases this would be enough since I can hardly imagine a case where you upload 90 chunks in parallel. Maybe you can throw in some experience there? I would like to leave the body untouched by tus (against my older opinion). I had the idea to allow merging final uploads: the uploads A and B are merged into AB, C and D are merged into CD, and in the end AB and CD are merged into ABCD. It may require a bit more coordination on the server side, but what do you think?
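As a rough illustration of the relative-URL idea and the hierarchical merging described above (the Merge header name and its syntax are assumptions made for this sketch, not necessarily the exact wording of the linked draft):

```
# Create AB by merging the partial uploads A and B; relative URLs keep the header small.
POST /files HTTP/1.1
Host: tus.example.org
Merge: /files/a /files/b

HTTP/1.1 201 Created
Location: http://tus.example.org/files/ab

# If merging final uploads were allowed, ABCD could then be built from AB and CD.
POST /files HTTP/1.1
Host: tus.example.org
Merge: /files/ab /files/cd

HTTP/1.1 201 Created
Location: http://tus.example.org/files/abcd
```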
What about including the
You are right. I don't expect more than 10-15 parallel uploads. Say you are uploading 10GB as 1024 10MB chunks. The client uploads chunks 1-10 in parallel. Once it finishes those, it merges them into, say, B, then does 10 more parallel uploads, merges those with the first merged file B, and so on, right?
That works! But how would the server tie the part uploads to the main upload? I don't understand. Can you explain? I came up with a slightly modified version:
Currently this is not allowed by the specification. You can only merge partial uploads and not final uploads which consist of merged partial uploads. But we may allow this in the future if it is necessary. Your example is basically the same as mine:
Let me know if you want to see this in the protocol.
Ok, I think I paid too little attention to this point: basically, merging uploads is a simple concatenation. Assume the following three uploads:
If you then merge the uploads 1, 2 and 3 in this order, the final upload will have a length of 9 bytes (3 * 3) and its content will be the three payloads concatenated in that order (for example, abc, def and ghi become abcdefghi).
I am not able to follow your thought. Could you please explain it?
We 'decided' against allowing the concatenation of finals for focus, and to reduce the surface for bugs to appear, but I'm willing to be persuaded otherwise. Are there compelling use cases we can think of? Also, would
@Acconut I reread the

Second question, about the server listing the partial files: say my browser crashed while uploading and I start over. If I don't store what I have uploaded so far in local storage, there is no way I can ask the server to send me the currently transferred partials corresponding to my file. That is what I meant by tracking.
@kvz @Acconut @kvz sorry, my gist was unclear. I will try to explain better.
The server creates a directory instead of a file. A directory can be a more logical thing if you are in turn persisting into cloud storage, say. The client puts parts of the file using PUT /files/id/part1 .. /files/id/partN
The client lists files (needs more discussion)
The next step is to create the final upload. In the following request no
The goals I am trying to address: allow simultaneous stateless partial uploads.
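A minimal sketch of the flow described above, assuming the part URLs from the example and leaving the final request open since its details still needed discussion:

```
# 1. Create a directory instead of a single file.
POST /files HTTP/1.1
Host: tus.example.org
Entity-Length: 5120

HTTP/1.1 201 Created
Location: http://tus.example.org/files/id

# 2. Upload the parts in parallel, each to its own URL below the directory.
PUT /files/id/part1 HTTP/1.1
Host: tus.example.org
Content-Length: 2048

PUT /files/id/part2 HTTP/1.1
Host: tus.example.org
Content-Length: 2048

# 3. List the uploaded parts and send a final request so the server can
#    concatenate them into the complete file.
```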
A client is able to get every offset for every partial upload using a

Speaking about your proposal I have a question: must the client send the Entity-Length when creating a new directory? This may collide with the idea of streaming uploads where the length is not known at the beginning. The same applies to creating new parts (/files/id/part1). A general problem with your approach is that you require the server to define the URLs, which is against a principle written in the 1.0 branch:
While I don't stick to this rule until death, I want to question breaking it as long as it is not absolutely necessary.
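For illustration, this is all a client needs in order to recover the state of a partial upload after a crash, using only the existing HEAD request and headers (URL and values are made up):

```
HEAD /files/a HTTP/1.1
Host: tus.example.org

HTTP/1.1 200 OK
Entity-Length: 2048
Offset: 512
```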
@vayam Another problem I see with your solution is that the client needs to create the parts in their chronological order. You are not able to change these afterwards and have to be aware of how to partition the file before uploading it. Using the merge/concat approach you do have the possibility to rearrange things. This is especially important if you deal with non-contiguous chunks.
@Acconut I agree with you.
You are right.
Agreed
Good point. What if we did
Final Merge
I was thinking more along the lines of http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuAndPermissions.html I am convinced your approach would be no issue for most use cases. Let us go with that.
I will add a section to the merge extension to return the

@vayam Just to be sure: are you ok with going with the current proposal in #56 (plus the changes in the above sentence)? If so I will merge the PR and release the 1.0 prerelease.
👍 sure go ahead and merge!
Merged in #56.
In this issue I will propose two extensions which both aim to implement the features needed for parallel uploads and non-contiguous chunks (see #3). In the end we have to choose one of them or go with something else.
Before starting I want to clarify that streaming uploads (with unknown length at the beginning) have not been included in these thoughts since you should not use them in conjunction with parallel uploads.
The first solution uses offset ranges. Instead of defining one offset which starts at 0 and is incremented for each PATCH request, the server would store one or multiple ranges of allowed and free offsets. These offsets will be returned in the Offset-Range header in the HEAD request, replacing the Offset response (!) header. The client then uses this information to choose the offset and uploads the same way as it's currently implemented.

Here is an example of a 300-byte file of which the second 100 bytes (100-199) have been uploaded:
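A sketch of what this exchange might look like; the comma-separated syntax of the Offset-Range header is an assumption made here for illustration:

```
# HEAD reports the free ranges 0-99 and 200-299 (bytes 100-199 are already written).
HEAD /files/24e533e02ec3bc40c387f1a0e460e216 HTTP/1.1
Host: tus.example.org

HTTP/1.1 200 OK
Entity-Length: 300
Offset-Range: 0-99,200-299

# The client uploads 100 bytes into the last free range.
PATCH /files/24e533e02ec3bc40c387f1a0e460e216 HTTP/1.1
Host: tus.example.org
Content-Length: 100
Offset: 200

[100 bytes]

HTTP/1.1 200 OK

# A later HEAD only advertises the remaining free range.
HTTP/1.1 200 OK
Entity-Length: 300
Offset-Range: 0-99
```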
The range of the last 100 bytes (200-299) has been removed since this buffer has been filled successfully by the upload.
While this solution allows the maximum of flexibility (compared to my second proposal), since you can upload at any offset (as long as it's available), it may be a tough extension for servers to implement. The server has to ensure that both the start and the end of the offset range against which the chunk is uploaded are available. Using the example from above, you're not allowed to patch a 150-byte chunk at the offset of 0 because the bytes starting from 100 have already been written.
The second solution I came up with involves a bit more: when creating a new upload (using the file creation extension or some other way), a block size is defined by which the file is separated into different blocks. For example, considering a file of 5KB and a block size of 2KB, you would end up with two blocks of 2KB and a single one of 1KB. The important point is that each of the blocks has its own offset which starts at position 0 relative to the starting position of the block.
Considering the last example, the relative offset 100 of the second block would be the absolute offset of 2148: 2048 (2KB starting position of the second block) + 100 relative offset.
Only one upload is allowed at the same time per block. In this example a maximum of three parallel uploads is allowed. Each new PATCH request must resume where the last upload of the block has stopped; jumps are not allowed.

In the following example we consider a file of 5KB with a block size of 2KB. The first block is already fully uploaded (2048 bytes), the second is filled with 100 bytes and the last one has not received a single write yet. We are going to upload 100 bytes at the relative offset of 100 into the second block:
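A rough sketch of this exchange; the Block-Size, Block-Offset and Block headers are invented here purely for illustration, as the proposal does not fix any header names:

```
# HEAD returns one offset per block: block 1 is complete, block 2 holds 100 bytes, block 3 is empty.
HEAD /files/24e533e02ec3bc40c387f1a0e460e216 HTTP/1.1
Host: tus.example.org

HTTP/1.1 200 OK
Entity-Length: 5120
Block-Size: 2048
Block-Offset: 2048,100,0

# Append 100 bytes to the second block at its relative offset of 100.
PATCH /files/24e533e02ec3bc40c387f1a0e460e216 HTTP/1.1
Host: tus.example.org
Content-Length: 100
Block: 2
Offset: 100

[100 bytes]

HTTP/1.1 200 OK
```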
Please post your opinion about these solutions (I prefer my last proposal) or any additional way we could achieve parallel and non-contiguous uploads. Also take the time to consider the implementation effort for servers and clients.