Memory and Core estimate function #263

Closed · 5 tasks done
Jan-Willem opened this issue Sep 19, 2024 · 4 comments · Fixed by #308 or #329
Labels: enhancement (New feature or request)

Comments

Jan-Willem (Member) commented Sep 19, 2024:

Add a function to convert_msv2_to_processing_set.py that estimates the amount of memory per core required to convert an MSv2 -> PS(MSv4) and gives the maximum number of cores and a suggested number of cores.

  • Develop a heuristic that uses the partition scheme code and the data shape (number of rows, channels and polarizations) to calculate the maximum amount of memory used (see the sketch after this list).
  • Maximum number of cores = number of partitions (more cores would go unutilized).
  • Suggested number of cores = number of partitions / 4 (not sure about this).
  • Write tests for the function.
  • Add the function to the ps_vis tutorial.
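
A minimal sketch of the kind of heuristic described above. The function name, the bytes-per-visibility factor and the safety margin are illustrative assumptions, not the actual implementation:

```python
def estimate_conversion_resources(partitions, total_system_memory_gib,
                                  bytes_per_vis=8, safety_factor=1.10):
    """Rough per-core memory estimate for an MSv2 -> PS(MSv4) conversion.

    `partitions` is a list of (n_rows, n_channels, n_polarizations) tuples,
    one per partition produced by the partition scheme.  The constants are
    placeholders, not the real heuristic.
    """
    # Memory needed to hold each partition's main data, padded by a safety
    # factor to absorb indexing/serialization overhead.
    per_partition_gib = [
        n_rows * n_chan * n_pol * bytes_per_vis * safety_factor / 1024**3
        for n_rows, n_chan, n_pol in partitions
    ]
    mem_per_core_gib = max(per_partition_gib)

    # More cores than partitions would go unutilized.
    max_cores = len(partitions)
    # Heuristic from the issue description: partitions / 4, at least 1,
    # and never more cores than fit in the available memory.
    fit_in_memory = max(1, int(total_system_memory_gib // mem_per_core_gib))
    suggested_cores = max(1, min(max_cores // 4, fit_in_memory))

    return mem_per_core_gib, max_cores, suggested_cores
```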
FedeMPouzols (Collaborator) commented:

@Jan-Willem:

  • In what sense do we recommend or suggest the "suggested" number of cores? Is it meant as a reasonable (or optimal) number given the available resources? That can be a bit tricky in a distributed setup.
  • About adding this to the notebooks: I'd perhaps add a note referring to a separate notebook about that (and in that one give an example of parallel execution after calculating the estimate, possibly using the larger MS version). This example about estimating resources and parallel execution wouldn't need to be added to every tutorial and guide notebook; that way the notebooks remain as simple and focused as possible and different concerns are kept separate. A rough shape of such an example is sketched below.
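
A possible shape of that notebook example, reusing the hypothetical estimate_conversion_resources() from the sketch above and assuming a Dask-based parallel conversion; all values are illustrative, and this is not the converter's actual API:

```python
from dask.distributed import Client, LocalCluster

# Placeholder partition shapes (n_rows, n_channels, n_polarizations); in a real
# notebook these would be derived from the MSv2 and the partition scheme.
partitions = [(200_000, 1024, 2)] * 8

mem_per_core_gib, max_cores, suggested_cores = estimate_conversion_resources(
    partitions, total_system_memory_gib=64
)

# Size a local Dask cluster from the estimate before running the conversion.
cluster = LocalCluster(
    n_workers=suggested_cores,
    threads_per_worker=1,
    memory_limit=f"{mem_per_core_gib:.1f}GiB",  # per-worker limit
)
client = Client(cluster)
# ... run convert_msv2_to_processing_set(...) here ...
client.close()
cluster.close()
```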

The branch of this issue has a heuristic for memory estimation that seems to work relatively well, but I'd still like to test it with larger partitions (going into the 100s of GBs). There are some unexpected (to me, at least) memory consumers, like .to_zarr() sometimes allocating significant memory, which I'd like to double-check with larger datasets than the ones I've used so far.

Jan-Willem (Member, Author) commented:

FedeMPouzols (Collaborator) commented Nov 14, 2024:

Here is a summary of estimation results for the branch of this ticket up to ~2024.11.12 (not latest, that will come in a follow-up comment).

The memory estimates produced by the function added here already seem relatively accurate and safe (never underestimating), but can overestimate by up to 5-10% (presumably more for even larger partitions than considered here).
That overestimation should be much lower with the latest commits, which I'm re-evaluating:

| MSv2 | Size on disk of MSv2 (GiB) | machine / env | predicted/estimated (GiB) for largest partition | effectively used (GiB, memory_profiler) for largest partition | notes |
|---|---|---|---|---|---|
| uid___A002_Xae4720_X57fe_targets.ms (ALMA proj. 2015.1.00665.S) | 0.1972 | cvpost030 | 0.02614 | 0.038281 | (pointing_xds over 22 MiB) |
| "same as above" | 0.1972 | cvpost030 | 0.078409 | 0.05927 | same MSv2, but without FIELD_ID in partition_scheme |
| twhya_calibrated.ms, ALMA cont imaging (CASA guides) | 0.4063 | laptop0 | 0.4809 | 0.45107 | |
| uid___A002_Xf3491c_Xb008_targets.ms (ALMA 2021.1.01195) | 1.8 | cvpost030 | 0.24635 | 0.2260 | |
| "same as above" | 1.8 | cvpost030 | 0.7391 | 0.6923 | same MSv2, but without FIELD_ID in partition_scheme |
| uid___A002_Xf3491c_Xb008.ms (ALMA 2021.1.01195) | 8.2 | cvpost030 | 0.54954 | 0.52695 | same EB as in the two rows above, now including calib scans |
| "same as above" | 8.2 | cvpost030 | 1.6486 | 1.4887 | same MSv2, but without FIELD_ID in partition_scheme |
| day2_TDEM0003_10s_norx.ms, VLA high frequency Spectral Line tutorial (CASA guides) | 2 | laptop0 | 0.80794 | 0.75732 | |
| NGC660.ms, EVN / VLBI spectral line workshop (CASA guides) | 2 | laptop0 | 2.0892 | 1.88125 | |
| uid___A002_Xf99bb0_X14d02_targets.ms (ALMA 2021.1.00738.S) | 76 | cvpost030 | 28.603 | 25.802 | |
| uid___A002_Xf99bb0_X143b2_targets.ms (ALMA 2021.1.00738.S) | 101 | cvpost030 | 38.137 | 34.285 | |
| uid___A002_Xf859f0_X1b39_targets.ms (ALMA 2021.1.00379.S) | 180 | cvpost030 | 78.005 | 69.994 | without FIELD_ID in partition_scheme |
| uid___A002_Xdf1f69_X4734_targets (ALMA 2018.A.00031.T) | 196 | cvpost030 | 49.405 | 44.429 | |
| uid___A002_Xdf1f69_X4734.ms (ALMA 2018.A.00031.T) | 520 | cvpost030 | 68.224 | 61.167 | same EB, now including calib scans |
| L628614_SAP004_SB349_uv_001.MS (LOFAR) | 92 | cvpost030 | 78.347 | 74.027 | 5.8% overestimation, but overestimating as much as others (ALMA) because not accounting for the calc_ind_() factor (over 1.3/2 GiB) |
| L628614_SAP004_SB349_uv_001.MS (LOFAR) | 92 | casa-rockylinux8-amd64-perf07 | 78.347 | 73.477 | |
| uid___A002_Xf096a9_X9fd_target.ms | 458 | cvpost030 | 174.25 | 156.35 | to_zarr() going wild and allocating, for example, ~16 GB / ~24 GB for some partitions |

(In the table, the default partition scheme is used unless otherwise stated in the notes.)
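
For reference, the "effectively used" column comes from memory_profiler; something along these lines could reproduce that kind of measurement (a sketch only, the conversion call is a placeholder):

```python
from memory_profiler import memory_usage

def run_conversion():
    # Placeholder for the actual call, e.g. convert_msv2_to_processing_set(...)
    pass

# Sample resident memory every 0.1 s while the conversion runs and keep the peak.
peak = memory_usage((run_conversion, (), {}), interval=0.1, max_usage=True)
# Depending on the memory_profiler version, max_usage=True returns a float or a
# one-element list, both in MiB.
peak_mib = peak if isinstance(peak, float) else max(peak)
print(f"effectively used: {peak_mib / 1024:.3f} GiB")
```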

The only exception to the general rule of "never underestimate" is datasets where the pointing_xds is of a similar order of magnitude to, or comparable in size with, the main_xds.

Machines / environments:

  • cvpost030: CV node used for CASA/Pipeline builds and tests, Python 3.11.5, 251 GiB, using /lustre/naasc for MSv2=>MSv4 I/O
  • laptop0: my laptop, Ubuntu 24.04, Python 3.9.17, Core i7-8665U CPU, 15 GiB
  • casa-rockylinux8-amd64-dev07: LXC Rocky Linux 8.9 container, Python 3.11.5, on an AMD EPYC 7413 24-core server, 73 GB, using the "/casadata" array for MSv2=>MSv4 I/O
  • casa-rockylinux8-amd64-perf07: LXC container with the same OS on the same server as above, but with 251 GiB

Things to keep an eye on:

  • calc_indx_for_row_split(): this can sometimes account for about 2% of effective memory use for real datasets with a large number of rows relative to a small number of channels (for example the long LOFAR scans, and to a lesser extent some ALMA datasets when calibration scans are included). This factor becomes more significant for corner cases or test datasets that stretch the number of rows but use a low number of channels.
    => This should be sorted out after the commits from the last couple of days.

  • create_pointing: although not very relevant for sufficiently large datasets, we are missing this term, which can easily introduce an underestimation of a few 10s of MBs (up to 100s of MBs for full ALMA EBs), and that is with the default time interpolation setting.
    In addition to pointing, consider sys_cal: I haven't seen relevant examples of sys_cal taking a significant amount of memory, but that could happen.
    => Should be addressed in a follow-up PR.

  • The to_zarr factor did not seem significant (below a few 10s of MBs) in the CV lustre tests (the impact of the filesystem is not clear), so I was tempted to set an upper limit for it at, say, 200 MB, but it turned out to skyrocket to 10s of GBs in some later tests with sufficiently large partitions. I'd assume for now that this only happens when enough memory is available for buffering, and such large amounts should not need to be included in the estimate.
    => Need to keep watching the behavior of this.
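
Schematically, the missing contributions listed above could eventually be folded into the per-partition estimate along these lines (purely illustrative; the function, parameter names and the 200 MB cap are not from the branch):

```python
def refine_estimate_gib(main_xds_gib, pointing_xds_gib=0.0, syscal_xds_gib=0.0,
                        to_zarr_cap_gib=0.2):
    """Illustrative only: add bounded allowances for sub-xds and to_zarr buffering."""
    # pointing_xds / sys_cal mostly matter for small datasets or full ALMA EBs.
    sub_xds_gib = pointing_xds_gib + syscal_xds_gib
    # Keep only a bounded allowance for to_zarr; the very large allocations seen
    # on big partitions appear to be opportunistic buffering and are excluded.
    return main_xds_gib + sub_xds_gib + to_zarr_cap_gib
```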

FedeMPouzols (Collaborator) commented:

Here is an updated and slightly extended table with the estimates after the improvements in the last few commits (after 2024.11.13):

| MSv2 | Size on disk of MSv2 (GiB) | machine / env | predicted/estimated (GiB) for largest partition | effectively used (GiB, memory_profiler) for largest partition | notes |
|---|---|---|---|---|---|
| uid___A002_Xae4720_X57fe_targets.ms (ALMA proj. 2015.1.00665.S) | 0.1972 | cvpost030 | 0.02509 | 0.038281 | (pointing_xds over 22 MiB) |
| "same as above" | 0.1972 | cvpost030 | 0.075258 | 0.05927 | same MSv2, but without FIELD_ID in partition_scheme |
| twhya_calibrated.ms, ALMA cont imaging (CASA guides) | 0.4063 | laptop0 | 0.46037 | 0.45107 | |
| uid___A002_Xf3491c_Xb008_targets.ms (ALMA 2021.1.01195) | 1.8 | cvpost030 | 0.23560 | 0.2260 | |
| "same as above" | 1.8 | cvpost030 | 0.70679 | 0.6923 | same MSv2, but without FIELD_ID in partition_scheme |
| uid___A002_Xf3491c_Xb008.ms (ALMA 2021.1.01195) | 8.2 | cvpost030 | 0.52518 | 0.52695 | same EB as in the two rows above, now including calib scans |
| "same as above" | 8.2 | cvpost030 | 1.5756 | 1.4887 | same MSv2, but without FIELD_ID in partition_scheme |
| day2_TDEM0003_10s_norx.ms, VLA high frequency Spectral Line tutorial (CASA guides) | 2 | laptop0 | 0.77406 | 0.75732 | |
| NGC660.ms, EVN / VLBI spectral line workshop (CASA guides) | 2 | laptop0 | 1.9964 | 1.88125 | |
| uid___A002_Xf99bb0_X14d02_targets.ms (ALMA 2021.1.00738.S) | 76 | cvpost030 | 27.342 | 25.802 | |
| uid___A002_Xf99bb0_X143b2_targets.ms (ALMA 2021.1.00738.S) | 101 | cvpost030 | 36.456 | 34.285 | |
| uid___A002_Xf859f0_X1b39_targets.ms (ALMA 2021.1.00379.S) | 180 | cvpost030 | 37.268 | 35.015 | |
| "same as above" | 180 | cvpost030 | 74.536 | 69.994 | same MSv2, but without FIELD_ID in partition_scheme |
| uid___A002_Xdf1f69_X4734_targets (ALMA 2018.A.00031.T) | 196 | cvpost030 | 47.227 | 44.429 | |
| uid___A002_Xdf1f69_X4734.ms (ALMA 2018.A.00031.T) | 520 | cvpost030 | 65.203 | 61.167 | same EB, now including calib scans |
| L628614_SAP004_SB349_uv_001.MS (LOFAR) | 92 | cvpost030 | 78.682 | 74.027 | 5.8% overestimation, but overestimating as much as others (ALMA) because not accounting for the calc_ind_() factor (over 1.3/2 GiB) |
| L628614_SAP004_SB349_uv_001.MS (LOFAR) | 92 | casa-rockylinux8-amd64-perf07 | 78.682 | 73.477 | |
| uid___A002_Xf096a9_X9fd_target.ms | 458 | cvpost030 | 166.50 | 156.35 | to_zarr() going wild and allocating, for example, ~16 GB / ~24 GB for some partitions |

This now accounts for the memory used in calc_indx_for_row_split() and reduces the safe overestimation percentage factor from 10% to 5% of the main_xds size.
For example, in the last row (largest partition), the overestimation is now (166.50 - 156.35) / 156.35 ≈ 6.5%, where it used to be (174.25 - 156.35) / 156.35 ≈ 11.4%.

This should produce a "safe" level of overestimation, which I think can be reduced further once we account for the other sub-xdss that can be relevant at the level of ~1% or so for medium-large datasets (and much more for tiny/demo/test datasets), like the pointing_xds and the syscal_xds. I'd suggest opening an issue to add estimates for those sub-xdss, as a follow-up to this issue/branch.
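
For the tests mentioned in the issue checklist, the "never underestimate, bounded overestimation" property could be asserted directly. In a real test the pairs would come from calling the estimator and measuring an actual conversion (e.g. with memory_profiler as sketched earlier); here they are hard-coded from the table above purely for illustration, and the 10% tolerance is an assumption:

```python
import pytest

# (estimated GiB, effectively used GiB) for the largest partition, taken from
# the table above (uid___A002_Xf096a9_X9fd_target.ms and the LOFAR MS).
CASES = [
    (166.50, 156.35),
    (78.682, 74.027),
]

@pytest.mark.parametrize("estimated_gib, used_gib", CASES)
def test_estimate_is_safe_but_not_wasteful(estimated_gib, used_gib):
    # Never underestimate ...
    assert estimated_gib >= used_gib
    # ... and keep the overestimation within a bounded margin.
    assert estimated_gib <= 1.10 * used_gib
```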
