Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include additional helpful information in default DataUpload/DataDownload display. #8095

Open
mrnold opened this issue Aug 8, 2024 · 17 comments · May be fixed by #8128
Open

Include additional helpful information in default DataUpload/DataDownload display. #8095

mrnold opened this issue Aug 8, 2024 · 17 comments · May be fixed by #8128
Assignees
Labels
area/datamover Help wanted Icebox We see the value, but it is not slated for the next couple releases. kind/requirement Needs Design Reviewed Q3 2024

Comments

@mrnold
Copy link
Contributor

mrnold commented Aug 8, 2024

Describe the problem/challenge you have
We had users evaluating backup performance by watching the progress of DataUploads, basically like this:

$ oc get dataupload --all-namespaces -w
NAMESPACE NAME STATUS STARTED BYTES DONE TOTAL BYTES STORAGE LOCATION AGE NODE
openshift-adp test ts-velero-test-1 0s
openshift-adp test ts-velero-test-1 0s
openshift-adp test Accepted ts-velero-test-1 0s
openshift-adp test Prepared ts-velero-test-1 62s ip.lan
openshift-adp test InProgress 0s ts-velero-test-1 62s ip.lan
openshift-adp test InProgress 13s 555745280 ts-velero-test-1 75s ip.lan
openshift-adp test InProgress 23s 1073741824 ts-velero-test-1 85s ip.lan
openshift-adp test InProgress 24s 1073741824 1073741824 ts-velero-test-1 86s ip.lan
openshift-adp test Completed 24s 1073741824 1073741824 ts-velero-test-1 86s ip.lan

It would have been helpful to see a little more information about the transfer, to help identify where to allocate resources for their cluster.

Describe the solution you'd like
I would like to add another field or two to this output so we can see elapsed time and transfer rates at a glance. This would help them estimate what backups can fit in their time window. It would also be good to see a count of the actual bytes moved over the network, taking kopia's incremental snapshots into account.

Anything else you would like to add:
I have not yet researched what kind of progress reporting kopia provides, please chime in if this sounds infeasible.

Environment:

  • Velero version (use velero version):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@reasonerjt
Copy link
Contributor

elapsed time - This is equivalent to the value in STARTED field
transfer rate - This is more straight-forward, but I'm not sure if we can easily map it to a field in the CRD.

Given the information has been provided in the CR of dataupload I wanna tentatively move it to icebox

@reasonerjt reasonerjt added the Icebox We see the value, but it is not slated for the next couple releases. label Aug 12, 2024
@sseago
Copy link
Collaborator

sseago commented Aug 12, 2024

@reasonerjt elapsed time is only equivalent to STARTED before the DU completes. The main point here is to make it easy for a user looking at kubectl get du output to see how long each DU took to process.

@mrnold
Copy link
Contributor Author

mrnold commented Aug 20, 2024

I opened #8128 to hold some work that will eventually show the difference in a clearer way.

@Lyndon-Li
Copy link
Contributor

Lyndon-Li commented Aug 20, 2024

@mrnold
I don't think we need to add any new fields in DUCR/DDCR for this issue:
elapsed time = [current time] - startTimeStamp = STARTED
transfer rate = doneBytes / elapsed time

What we need here -- startTimeStamp and doneBytes have both included in the DUCR/DDCR.

Therefore, we can see minor gap for transfer rate (elapsed time is already there as STARTED column). But the real gap is how to use the kubuilder marker +kubebuilder:printcolumn to show the calculation result.
So please check with this direction first, we don't want to add redundant info into DUCR/DDCR.

@Lyndon-Li
Copy link
Contributor

Lyndon-Li commented Aug 20, 2024

@sseago

elapsed time is only equivalent to STARTED before the DU completes.

I think the output of kubectl get du is only meaningful to show realtime info during the running of DU/DD. Once DU/DD finishes, to check the static info, velero backup describe is a better choice.
E.g., DU/DD is not a necessary resource after it completes, so there is no guarantee that users could see it.

@mrnold
Copy link
Contributor Author

mrnold commented Aug 21, 2024

Just to summarize a few comments from the community meeting: one of the goals is to allow users to see that incremental backups are working. This is more important than showing a transfer rate. I think doneBytes might not be enough to show that successive backups transfer less data than an initial backup. I am looking into what options Kopia has for showing bytes moved.

I will see if I can get printcolumn to show calculated results, but I haven't yet figured out a way to do this without changing the CRD.

@Lyndon-Li
Copy link
Contributor

Lyndon-Li commented Aug 21, 2024

@mrnold

allow users to see that incremental backups are working

The cachedBytes as the code here could fulfill this purpose. It means the bytes that has been skipped due to the incremental behavior, it monotonically increases along the running of the backup.

The current code has retrieved this info but not set to DUCR, so we need to add one more field to DUCR's DataUploadStatus and also add a new column through +kubebuilder:printcolumn

@sseago
Copy link
Collaborator

sseago commented Aug 21, 2024

@sseago

elapsed time is only equivalent to STARTED before the DU completes.

I think the output of kubectl get du is only meaningful to show realtime info during the running of DU/DD. Once DU/DD finishes, to check the static info, velero backup describe is a better choice. E.g., DU/DD is not a necessary resource after it completes, so there is no guarantee that users could see it.

Hmm. OK -- so maybe the answer is to enhance backup describe to include the elapsed time for each du? What we want here is some way for a user to see in one place, how long each of the du operations took. This could be in the output of kubectl get du or it could be in the output of velero backup describe. Same for the other new field we want to add around transferred and/or cached bytes.

@Lyndon-Li Does moving this functionality to backup describe resolve your objections?
@mrnold Does backup describe work for your use case as well as kubectl get du?

@sseago
Copy link
Collaborator

sseago commented Aug 21, 2024

Also, even during current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed completionTimestamp - startTimestamp else current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.

@Lyndon-Li
Copy link
Contributor

Lyndon-Li commented Aug 21, 2024

@sseago Let me conclude the info I think we need to have for the backup/restore performance observability:

elapsed time

  • It means the total elapsed time for the data transfer (it starts to count only after data transfer starts)
  • If we want to check it realtime during the backup running though kubectl get du, we already have it, that is, the STARTED column
  • We also want to have a permanent statistic after backup completes, then we need to have it in velero backup describe because DU doesn't guarantee existing after backup completes and it is not so meaningful to check static info through kubectl get du

cached bytes

  • It means the total bytes for the files that have been skipped by the incremental backup
  • For full backup, it should be equal to total bytes and for incremental backup, it should normally be less than totalBytes
  • It changes from time to time during the backup running
  • We probably need to change the name. "cached bytes" is not straightforward

throughput

  • It gives the most intuitive observability for how the backup performs
  • Some software make it as doneBytes/elapsedTime; some others make it as (doneBytes - cachedBytes) / elapsedTime. Both is acceptable, the latter just makes a trick to make the performance looks better

For all these info:

  • First thing first is to have them in velero backup describe. To make it, I think we can add the info to VolumeBackupInfo/VolumeRestoreInfo, so that we don't need to download every DU/DD during the describe.
  • It is icing on the cake to have them with kubectl get du. Because kubectl get du is for realtime observability only, so it improves user experience since users see the info dynamically changed and earlier.

@Lyndon-Li
Copy link
Contributor

Also, even during current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed completionTimestamp - startTimestamp else current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.

We don't need to use startTimestamp to calculate elapsedTime. The STARTED column is already elapsedTime (before the backup/restore completes).

@sseago
Copy link
Collaborator

sseago commented Aug 21, 2024

Also, even during current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed completionTimestamp - startTimestamp else current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.

We don't need to use startTimestamp to calculate elapsedTime. The STARTED column is already elapsedTime (before the backup/restore completes).

It isn't though. If you have 2 DUs that run sequentially in a backup (on same node, with parallel execution disabled), and each takes 10 minutes, at minute 19, you'll see DU1 showing STARTED at "19m" and DU2 showing STARTED at "9m" -- but elapsed time should tell us how long the DU has been in progress until completion -- so those fields would need to show "10m" and "9m".

@Lyndon-Li
Copy link
Contributor

U1 showing STARTED at "19m" and DU2 showing STARTED at "9m"

Yes, it is. but just keep my point in mind that kubectl get du is only meaningful during that DU/DD running. Once it completes, we should check with velero backup describe. If you agree, we will not need to add one more field in DUCR/DDCR.

@sseago
Copy link
Collaborator

sseago commented Aug 21, 2024

U1 showing STARTED at "19m" and DU2 showing STARTED at "9m"

Yes, it is. but just keep my point in mind that kubectl get du is only meaningful during that DU/DD running. Once it completes, we should check with velero backup describe. If you agree, we will not need to add one more field in DUCR/DDCR.

OK, lets step back a bit. If I'm a user monitoring a running backup and my main concern is DU progress, I want to look in one place to see:

  1. which DUs are currently in progress
  2. for this backup, how long has each DU taken (either to completion or to now, still running), how much is done (for still running ones), and total bytes (and eventually some data indicating skipped/transferred for incremental volume backup)

Currently, the only place to see any of this is kubectl get du. There we see in progress status, bytesDone, totalBytes, but we don't see the correct elapsed time for completed, only for in-progress.

One possibility is to enhance the du display columns. Of course, to get elapsed time would require a new field.

Another would be to enhance velero backup describe -- perhaps a du summary which gives similar information to kubectl get du, with proper calculated field for elapsed time. If done here, we wouldn't need the CRD change. But the important point is that this would need to work for in-progress backups as well as completed ones, since this is needed for real-time observability.

Either way, we need a real-time view (changes while backup and du progress) that shows completed, in-progress, and queued DUs for the current backup with elapsed time, bytes progress, etc. as appropriate.

@Lyndon-Li
Copy link
Contributor

@sseago After a further thinking, below may be better for elapsedTime:

  • Add the elapsedTime field into DUCR/DDCR
  • Make the STARTED column show the time since the DU/DD is Accepted

Use elapsedTime over STARTED because it is updated simultaneously with doneBytes and cachedBytes. Then when we calculate throughput, the result is always correct; while STARTED or currentTime - startTimestamp is updated by kubernetes, if we use it to calculate throughput the number may be weirdly jumping if doneBytes / cachedBytes is not updated timely.

Changing STARTED because it is duplicated with elapsedTime. But rather than simply removing it, we can change it to express other info --- the time to prepare the snasphot/restored volume.

@reasonerjt
Copy link
Contributor

Per discussion, let's keep it ice-box given we don't have an agreement on how to demonstrate the incremental backup.
And we need a design.

@mrnold
Copy link
Contributor Author

mrnold commented Oct 14, 2024

I tried using cachedBytes, but did not have much success. It seems like kopia calculates cachedBytes up front, so using (doneBytes-cachedBytes)/elapsedTime for throughput shows a negative number for most of an incremental backup transfer.

I see there is also an uploadedBytes that is supposed to update when bytes are uploaded to storage, but in my tests I did not see it change away from 0. So I agree I don't really have a good base for a design yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datamover Help wanted Icebox We see the value, but it is not slated for the next couple releases. kind/requirement Needs Design Reviewed Q3 2024
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants