
50 TB test #68

Closed
gordonwatts opened this issue Apr 29, 2024 · 16 comments
Assignees: gordonwatts
Labels: perf test, performance, servicex
Milestone: Week 5

Comments

@gordonwatts

Ready for a week 5 data test

  • Run with dask cluster unallocated (so we don't kill CEPH, or die by it).
  • Try with 100 pre-allocated workers.

See if we can understand #52 (river pods performing worse) - or whether it is still true.

A major new aspect of running against the raw dataset is that we read out a large fraction of the data.

gordonwatts added the performance and servicex labels Apr 29, 2024
gordonwatts added this to the Week 5 milestone Apr 29, 2024
gordonwatts self-assigned this Apr 29, 2024
@gordonwatts

Ran out of space in S3 - there is only 15 TB, and our 50 TB test needs 10 TB of it; the 200 TB test will need 40 TB (the output is roughly 20% of the input)!

And we have a data rate limit of around 40 Gbps.

[screenshot]

And finally, around 800 workers we start getting networking errors.

@gordonwatts

A complete run without errors:

[screenshot]

Several things:

  • Errors were caused by river not being able to send complete messages back to SX.
  • So we just used AF instances.
  • Did get a timeout error when running from the client while it was trying to get in touch with SX.

@gordonwatts

Longer term: pod scaling needs to be re-done.

@gordonwatts

Now that the data is being skimmed hard, we are seeing much better performance.

[screenshot]

That is with 1000 pods (100 Gbps).

@gordonwatts

The servicex app is under stress when this is happening:

[screenshot]

We have 5 cores allocated to do the work, and it is basically using them all.

@gordonwatts

Switched to using two pods for the app.

@gordonwatts

Done:

[screenshot]

[screenshot]

@gordonwatts commented May 1, 2024

Final size in S3: 502.2 GB
Number of files in S3: 64.7K

In short - it dropped very few files this time (if any)!
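
For reference, a minimal sketch of how one could tally the file count and total size in an S3 bucket, assuming boto3 and a hypothetical endpoint/bucket name (not the actual servicex deployment values):

```python
# Sketch: count objects and total bytes in an S3 bucket for a transform.
# Assumes boto3 credentials are configured; endpoint and bucket names are hypothetical.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.org")  # assumed endpoint
paginator = s3.get_paginator("list_objects_v2")

n_files, total_bytes = 0, 0
for page in paginator.paginate(Bucket="servicex-request-id"):  # hypothetical bucket name
    for obj in page.get("Contents", []):
        n_files += 1
        total_bytes += obj["Size"]

print(f"{n_files} files, {total_bytes / 1e9:.1f} GB")
```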

@gordonwatts

Second test, with a similar number of transformers, but now we have 2 servicex pods and things look better:

[screenshot]

@gordonwatts

And the database is running on NVMe:

[screenshot]

@gordonwatts

Using 1000 from AF and 500 from river, until the end when river went up to 1000:

[screenshot]

So, 130 Gbps consistent!!

[screenshot]

@gordonwatts

Going to try an even higher cut. 100 GeV now.
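
(For illustration only: the cut itself is applied in the skim, but below is a minimal sketch of what a 100 GeV jet-pT selection looks like on the skimmed output, assuming awkward-array and hypothetical file/branch names with pT in MeV.)

```python
# Sketch: apply a 100 GeV jet-pT cut to skimmed output.
# File and branch names are hypothetical; assumes the skim was written to parquet.
import awkward as ak

events = ak.from_parquet("skim_output.parquet")        # hypothetical file name
jet_pt = events["jet_pt"]                              # hypothetical branch, in MeV
passing = events[ak.any(jet_pt > 100_000.0, axis=1)]   # keep events with a jet above 100 GeV
print(len(events), "->", len(passing), "events after the 100 GeV cut")
```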

@gordonwatts

[screenshot]

Had 1500 workers in AF and 1000 in river by the end. Didn't see the 130 Gbps rate from before.

@gordonwatts

The next question is - how to proceed with SX testing. The following is my opinion, but others should feel free to jump in!
Things that we should have before running another large-scale test:

  • Scripts to run on multiple datasets at once.
  • Retry implemented in the backend side-car that pushes data to S3 (we are seeing too many failures getting the data files in), and making sure any failures are logged (for now at least to the log collector, but eventually as a transformation failure).
  • Front-end retry when querying the transform after submitting it to get the request id - or fix the 5 second timeout - many transforms are lost because the servicex app seems to be "frozen" while dealing with 64K files (see the retry sketch after this list). This could also be a bug fix in the servicex_app for when the data files are already cached.
  • The DASK processing error on empty files is not properly handled (at least, I think that is the error that I see on highly skimmed files).
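
For the front-end retry item above, a minimal sketch of the idea - polling with exponential backoff instead of giving up after one 5-second timeout. The endpoint path and response field are assumptions, not the actual servicex API:

```python
# Sketch: retry the transform submission/request-id query with exponential backoff.
# The URL path and JSON field names are assumptions for illustration only.
import time
import requests

def get_request_id(base_url: str, submit_payload: dict, max_tries: int = 5) -> str:
    """Submit a transform and retry until we get a request id back."""
    for attempt in range(max_tries):
        try:
            resp = requests.post(
                f"{base_url}/servicex/transformation",  # assumed endpoint path
                json=submit_payload,
                timeout=5,
            )
            resp.raise_for_status()
            return resp.json()["request_id"]  # assumed response field
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_tries - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s, ...
    raise RuntimeError("unreachable")
```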
That list includes things in the IDAP script, the servicex_frontend, and in the core code of servicex itself. 😕
Once we have those things fixed, I'd be ready to run new large scale SX tests.
Other things that would be nice to have fixed:

  • Retry on transformer startup - the transformer gets up and running before the side-car does, and when it tries to contact the side-car it fails, which leads to a restart, which adds 10-20 seconds of cycle time per pod - and that is hours of cycle time when running 2000 pods.
  • A command line addition to the servicex command that returns the size and number of files in a bucket for a request id.
  • A way to "save" a cache key (hash)/request-id so that you can "remember" a query as you move from one place to another. This could be a command line option; another option is to do it in the servicex_app.
  • Understand how the xaod transformer is compressing output files, and if it isn't using ZSTD, convert to using that.
  • Add code that knows the number of events in each sample to produce a "Hz" measurement (see the sketch after this list).
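
For the last item, a minimal sketch of the "Hz" measurement, assuming the per-sample event counts and transform start/finish times are known (all names here are hypothetical):

```python
# Sketch: event-rate ("Hz") measurement for a finished transform.
# Per-sample event counts and timestamps are assumed inputs, not servicex API calls.
from datetime import datetime

events_per_sample = {"data15_13TeV.periodA": 123_456_789}  # hypothetical counts

def event_rate_hz(sample: str, start: datetime, finish: datetime) -> float:
    """Events processed per wall-clock second for one sample."""
    elapsed_s = (finish - start).total_seconds()
    return events_per_sample[sample] / elapsed_s

rate = event_rate_hz(
    "data15_13TeV.periodA",
    start=datetime(2024, 5, 1, 10, 0, 0),
    finish=datetime(2024, 5, 1, 11, 30, 0),
)
print(f"{rate / 1e3:.1f} kHz")
```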

@gordonwatts

Finally, some things Ilija Vukotic and I learned about the system (Ilija, please add more conclusions!):

  1. If you ask SX to do a straight copy of the data you read in, it doesn't really do well. In short - SX was designed to skim and thin and write things out. Do that.
  2. Compressing the output data takes a significant amount of time. It was what was keeping us from getting past a max of 45 Gbps. Removing that and we go to 130 Gbps (see the compression sketch below).
  3. Postgres with the DB on NVMe seemed to be able to handle the load. We needed two pods running the servicex_app to keep up.
  4. There are bugs in dealing with very large datasets that make it impossible to run the full test currently.
  5. If your skim efficiency is low enough, it looks like you don't need to read whole baskets of the detailed data - and that reduces the read rate, which improves things overall.
  6. Improvements and modifications to the DNS, and a reduction in the data we were pushing to S3, mean that river was now able to run, seemingly, without errors.
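
On the compression point (item 2 above, plus the ZSTD item in the wishlist), a minimal sketch of writing output with ZSTD compression via uproot - illustrative only, not the transformer's actual code path:

```python
# Sketch: write a flat output tree with ZSTD compression using uproot.
# File, tree, and branch names are hypothetical stand-ins.
import numpy as np
import uproot

jet_pt = np.random.exponential(50_000.0, size=10_000)  # stand-in data, MeV

with uproot.recreate("skim_zstd.root", compression=uproot.ZSTD(3)) as fout:
    fout["analysis"] = {"jet_pt": jet_pt}

print(uproot.open("skim_zstd.root")["analysis"].num_entries)
```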

@gordonwatts

This test has now "finished"... we need to make some changes before running another large test.
