
50 TB test #68

Closed
gordonwatts opened this issue Apr 29, 2024 · 16 comments
Assignees: gordonwatts
Labels: perf test, performance, servicex
Milestone: Week 5

Comments

@gordonwatts

Ready for a week 5 data test

  • Run with dask cluster unallocated (so we don't kill CEPH, or die by it).
  • Try with 100 pre-allocated workers.

See if we can understand #52 (river pods performing worse) - or whether it is still true.

A major new aspect of running against the raw dataset is that we read out a large fraction of the data.

gordonwatts added the performance and servicex labels Apr 29, 2024
gordonwatts added this to the Week 5 milestone Apr 29, 2024
gordonwatts self-assigned this Apr 29, 2024
@gordonwatts

Ran out of space in S3 - there is only 15 TB, and our 50 TB test needs 10 TB of it; the 200 TB test will need 40 TB (the output is roughly 20% of the input)!

And we have a data rate limit of around 40 Gbps.

[screenshot]

And finally, around 800 workers we start getting networking errors.

@gordonwatts

A complete run without errors:

[screenshot]

Several things:

  • Errors were caused by river not being able to send complete messages back to SX.
  • So we just used AF instances.
  • Did get a timeout error when running from the client while it was trying to get in touch with SX.

@gordonwatts

Longer term: pod scaling needs to be re-done.

@gordonwatts

Now that the data is being skimmed hard, we are seeing much better performance.

[screenshot]

That is with 1000 pods (100 Gbps).

@gordonwatts

The servicex app is under stress when this is happening:

[screenshot]

We have 5 cores allocated to do the work, and it is basically using them all.

@gordonwatts

Switched to using two pods for the app.

@gordonwatts

Done:

[screenshot]

[screenshot]

@gordonwatts commented May 1, 2024

Final size in S3: 502.2 GB
Number of files in S3: 64.7K

In short - it dropped very few files this time (if any)!
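
For reference, a minimal sketch of how one could tally the file count and total size in an S3 bucket, assuming boto3 and a hypothetical endpoint/bucket name (not the actual servicex deployment values):

```python
# Sketch: count objects and total bytes in an S3 bucket for a transform.
# Assumes boto3 credentials are configured; endpoint and bucket names are hypothetical.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.org")  # assumed endpoint
paginator = s3.get_paginator("list_objects_v2")

n_files, total_bytes = 0, 0
for page in paginator.paginate(Bucket="servicex-request-id"):  # hypothetical bucket name
    for obj in page.get("Contents", []):
        n_files += 1
        total_bytes += obj["Size"]

print(f"{n_files} files, {total_bytes / 1e9:.1f} GB")
```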

@gordonwatts

Second test, with a similar number of transformers, but now we have 2 servicex pods and things look better:

[screenshot]

@gordonwatts

And the database is running on NVMe:

[screenshot]

@gordonwatts

Using 1000 from AF and 500 from river, until the end when river went up to 1000:

[screenshot]

So, 130 Gbps consistent!!

[screenshot]

@gordonwatts

Going to try an even higher cut. 100 GeV now.
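
(For illustration only: the cut itself is applied in the skim, but below is a minimal sketch of what a 100 GeV jet-pT selection looks like on the skimmed output, assuming awkward-array and hypothetical file/branch names with pT in MeV.)

```python
# Sketch: apply a 100 GeV jet-pT cut to skimmed output.
# File and branch names are hypothetical; assumes the skim was written to parquet.
import awkward as ak

events = ak.from_parquet("skim_output.parquet")        # hypothetical file name
jet_pt = events["jet_pt"]                              # hypothetical branch, in MeV
passing = events[ak.any(jet_pt > 100_000.0, axis=1)]   # keep events with a jet above 100 GeV
print(len(events), "->", len(passing), "events after the 100 GeV cut")
```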

@gordonwatts

[screenshot]

Had 1500 workers in AF and 1000 in river by the end. Didn't see the 130 Gbps rate from before.

@gordonwatts

The next question is - how to proceed with SX testing. The following is my opinion, but others should feel free to jump in!
Things that we should have before running another large-scale test:

  • Scripts to run on multiple datasets at once.
  • Retry implemented in the backend side-car that pushes data to S3 (we are seeing too many failures getting the data files in), and making sure any failures are logged (for now at least to the log collector, but eventually as a transformation failure).
  • Front-end retry when querying the transform after submitting it to get the request id - or fix the 5 second timeout - many transforms are lost because the servicex app seems to be "frozen" while dealing with 64K files (see the retry sketch after this list). This could also be a bug fix in the servicex_app for when the data files are already cached.
  • The DASK processing error on empty files is not properly handled (at least, I think that is the error that I see on highly skimmed files).
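
For the front-end retry item above, a minimal sketch of the idea - polling with exponential backoff instead of giving up after one 5-second timeout. The endpoint path and response field are assumptions, not the actual servicex API:

```python
# Sketch: retry the transform submission/request-id query with exponential backoff.
# The URL path and JSON field names are assumptions for illustration only.
import time
import requests

def get_request_id(base_url: str, submit_payload: dict, max_tries: int = 5) -> str:
    """Submit a transform and retry until we get a request id back."""
    for attempt in range(max_tries):
        try:
            resp = requests.post(
                f"{base_url}/servicex/transformation",  # assumed endpoint path
                json=submit_payload,
                timeout=5,
            )
            resp.raise_for_status()
            return resp.json()["request_id"]  # assumed response field
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_tries - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s, ...
    raise RuntimeError("unreachable")
```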
That list includes things in the IDAP script, the servicex_frontend, and in the core code of servicex itself. 😕
Once we have those things fixed, I'd be ready to run new large scale SX tests.
Other things that would be nice to have fixed:

  • Retry on transformer startup - the transformer gets up and running before the side-car does, and when it tries to contact the side-car it fails, which leads to a restart, which adds 10-20 seconds of cycle time per pod - and that is hours of cycle time when running 2000 pods.
  • A command line addition to the servicex command that returns the size and number of files in a bucket for a request id.
  • A way to "save" a cache key (hash)/request-id so that you can "remember" a query as you move from one place to another. This could be a command line option; another option is to do it in the servicex_app.
  • Understand how the xaod transformer is compressing output files, and if it isn't using ZSTD, convert to using that.
  • Add code that knows the number of events in each sample to produce a "Hz" measurement (see the sketch after this list).
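
For the last item, a minimal sketch of the "Hz" measurement, assuming the per-sample event counts and transform start/finish times are known (all names here are hypothetical):

```python
# Sketch: event-rate ("Hz") measurement for a finished transform.
# Per-sample event counts and timestamps are assumed inputs, not servicex API calls.
from datetime import datetime

events_per_sample = {"data15_13TeV.periodA": 123_456_789}  # hypothetical counts

def event_rate_hz(sample: str, start: datetime, finish: datetime) -> float:
    """Events processed per wall-clock second for one sample."""
    elapsed_s = (finish - start).total_seconds()
    return events_per_sample[sample] / elapsed_s

rate = event_rate_hz(
    "data15_13TeV.periodA",
    start=datetime(2024, 5, 1, 10, 0, 0),
    finish=datetime(2024, 5, 1, 11, 30, 0),
)
print(f"{rate / 1e3:.1f} kHz")
```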

@gordonwatts

Finally, some things Ilija Vukotic and I learned about the system (Ilija, please add more conclusions!):

  1. If you ask SX to do a straight copy of the data you read in, it doesn't really do well. In short - SX was designed to skim and thin and write things out. Do that.
  2. Compressing the output data takes a significant amount of time. It was what was keeping us from getting past a max of 45 Gbps. Removing that and we go to 130 Gbps (see the compression sketch below).
  3. Postgres with the DB on NVMe seemed to be able to handle the load. We needed two pods running the servicex_app to keep up.
  4. There are bugs in dealing with very large datasets that make it impossible to run the full test currently.
  5. If your skim efficiency is low enough, it looks like you don't need to read whole baskets of the detailed data - and that reduces the read rate, which improves things overall.
  6. Improvements and modifications to the DNS, and a reduction in the data we were pushing to S3, mean that river was now able to run, seemingly, without errors.
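
On the compression point (item 2 above, plus the ZSTD item in the wishlist), a minimal sketch of writing output with ZSTD compression via uproot - illustrative only, not the transformer's actual code path:

```python
# Sketch: write a flat output tree with ZSTD compression using uproot.
# File, tree, and branch names are hypothetical stand-ins.
import numpy as np
import uproot

jet_pt = np.random.exponential(50_000.0, size=10_000)  # stand-in data, MeV

with uproot.recreate("skim_zstd.root", compression=uproot.ZSTD(3)) as fout:
    fout["analysis"] = {"jet_pt": jet_pt}

print(uproot.open("skim_zstd.root")["analysis"].num_entries)
```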

@gordonwatts

This test has now "finished"... we need to make some changes before running another large test.
