Memory and Core estimate function #263

Closed · 5 tasks done
Jan-Willem opened this issue Sep 19, 2024 · 4 comments · Fixed by #308 or #329
Labels: enhancement (New feature or request)

Comments

Jan-Willem (Member) commented Sep 19, 2024:

Add a function to convert_msv2_to_processing_set.py that estimates the amount of memory per core required to convert an MSv2 -> PS(MSv4) and gives the maximum number of cores and a suggested number of cores.

  • Develop a heuristic that uses the partition scheme code and the data shape (number of rows, channels and polarizations) to calculate the maximum amount of memory used (see the sketch after this list).
  • Maximum number of cores = number of partitions (more cores would go unutilized).
  • Suggested number of cores = number of partitions / 4 (not sure about this).
  • Write tests for the function.
  • Add the function to the ps_vis tutorial.
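
A minimal sketch of the kind of heuristic described above. The function name, the bytes-per-visibility factor and the safety margin are illustrative assumptions, not the actual implementation:

```python
def estimate_conversion_resources(partitions, total_system_memory_gib,
                                  bytes_per_vis=8, safety_factor=1.10):
    """Rough per-core memory estimate for an MSv2 -> PS(MSv4) conversion.

    `partitions` is a list of (n_rows, n_channels, n_polarizations) tuples,
    one per partition produced by the partition scheme.  The constants are
    placeholders, not the real heuristic.
    """
    # Memory needed to hold each partition's main data, padded by a safety
    # factor to absorb indexing/serialization overhead.
    per_partition_gib = [
        n_rows * n_chan * n_pol * bytes_per_vis * safety_factor / 1024**3
        for n_rows, n_chan, n_pol in partitions
    ]
    mem_per_core_gib = max(per_partition_gib)

    # More cores than partitions would go unutilized.
    max_cores = len(partitions)
    # Heuristic from the issue description: partitions / 4, at least 1,
    # and never more cores than fit in the available memory.
    fit_in_memory = max(1, int(total_system_memory_gib // mem_per_core_gib))
    suggested_cores = max(1, min(max_cores // 4, fit_in_memory))

    return mem_per_core_gib, max_cores, suggested_cores
```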
FedeMPouzols (Collaborator) commented:

@Jan-Willem:

  • In what sense do we recommend or suggest the "suggested" number of cores? Is it meant as a reasonable (or optimal) number given the available resources? That can be a bit tricky in a distributed setup.
  • About adding this to the notebooks: I'd perhaps add a note referring to a separate notebook about that (and in that one give an example of parallel execution after calculating the estimate, possibly using the larger MS version). This example about estimating resources and parallel execution wouldn't need to be added to every tutorial and guide notebook; that way the notebooks remain as simple and focused as possible and different concerns are kept separate. A rough shape of such an example is sketched below.
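
A possible shape of that notebook example, reusing the hypothetical estimate_conversion_resources() from the sketch above and assuming a Dask-based parallel conversion; all values are illustrative, and this is not the converter's actual API:

```python
from dask.distributed import Client, LocalCluster

# Placeholder partition shapes (n_rows, n_channels, n_polarizations); in a real
# notebook these would be derived from the MSv2 and the partition scheme.
partitions = [(200_000, 1024, 2)] * 8

mem_per_core_gib, max_cores, suggested_cores = estimate_conversion_resources(
    partitions, total_system_memory_gib=64
)

# Size a local Dask cluster from the estimate before running the conversion.
cluster = LocalCluster(
    n_workers=suggested_cores,
    threads_per_worker=1,
    memory_limit=f"{mem_per_core_gib:.1f}GiB",  # per-worker limit
)
client = Client(cluster)
# ... run convert_msv2_to_processing_set(...) here ...
client.close()
cluster.close()
```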

The branch of this issue has a heuristic for memory estimation that seems to work relatively well, but I'd still like to test it with larger partitions (going into the 100s of GBs). There are some unexpected (to me, at least) memory consumers, like .to_zarr() sometimes allocating significant memory, which I'd like to double-check with larger datasets than the ones I've used so far.

Jan-Willem (Member, Author) commented:

FedeMPouzols (Collaborator) commented Nov 14, 2024:

Here is a summary of estimation results for the branch of this ticket up to ~2024.11.12 (not latest, that will come in a follow-up comment).

The memory estimates produced by the function added here already seem relatively accurate and safe (never underestimating), but can overestimate by up to 5-10% (presumably more for even larger partitions than considered here).
That overestimation should be much lower with the latest commits, which I'm re-evaluating:

| MSv2 | Size on disk of MSv2 (GiB) | machine / env | predicted/estimated (GiB) for largest partition | effectively used (GiB, memory_profiler) for largest partition | notes |
|---|---|---|---|---|---|
| uid___A002_Xae4720_X57fe_targets.ms (ALMA proj. 2015.1.00665.S) | 0.1972 | cvpost030 | 0.02614 | 0.038281 | (pointing_xds over 22 MiB) |
| "same as above" | 0.1972 | cvpost030 | 0.078409 | 0.05927 | same MSv2, but without FIELD_ID in partition_scheme |
| twhya_calibrated.ms, ALMA cont imaging (CASA guides) | 0.4063 | laptop0 | 0.4809 | 0.45107 | |
| uid___A002_Xf3491c_Xb008_targets.ms (ALMA 2021.1.01195) | 1.8 | cvpost030 | 0.24635 | 0.2260 | |
| "same as above" | 1.8 | cvpost030 | 0.7391 | 0.6923 | same MSv2, but without FIELD_ID in partition_scheme |
| uid___A002_Xf3491c_Xb008.ms (ALMA 2021.1.01195) | 8.2 | cvpost030 | 0.54954 | 0.52695 | same EB as in the two rows above, now including calib scans |
| "same as above" | 8.2 | cvpost030 | 1.6486 | 1.4887 | same MSv2, but without FIELD_ID in partition_scheme |
| day2_TDEM0003_10s_norx.ms, VLA high frequency Spectral Line tutorial (CASA guides) | 2 | laptop0 | 0.80794 | 0.75732 | |
| NGC660.ms, EVN / VLBI spectral line workshop (CASA guides) | 2 | laptop0 | 2.0892 | 1.88125 | |
| uid___A002_Xf99bb0_X14d02_targets.ms (ALMA 2021.1.00738.S) | 76 | cvpost030 | 28.603 | 25.802 | |
| uid___A002_Xf99bb0_X143b2_targets.ms (ALMA 2021.1.00738.S) | 101 | cvpost030 | 38.137 | 34.285 | |
| uid___A002_Xf859f0_X1b39_targets.ms (ALMA 2021.1.00379.S) | 180 | cvpost030 | 78.005 | 69.994 | without FIELD_ID in partition_scheme |
| uid___A002_Xdf1f69_X4734_targets (ALMA 2018.A.00031.T) | 196 | cvpost030 | 49.405 | 44.429 | |
| uid___A002_Xdf1f69_X4734.ms (ALMA 2018.A.00031.T) | 520 | cvpost030 | 68.224 | 61.167 | same EB, now including calib scans |
| L628614_SAP004_SB349_uv_001.MS (LOFAR) | 92 | cvpost030 | 78.347 | 74.027 | 5.8% overestimation, but overestimating as much as others (ALMA) because not accounting for the calc_ind_() factor (over 1.3/2 GiB) |
| L628614_SAP004_SB349_uv_001.MS (LOFAR) | 92 | casa-rockylinux8-amd64-perf07 | 78.347 | 73.477 | |
| uid___A002_Xf096a9_X9fd_target.ms | 458 | cvpost030 | 174.25 | 156.35 | to_zarr() going wild and allocating, for example, ~16 GB / ~24 GB for some partitions |

(In the table, the default partition scheme is used unless otherwise stated in the notes.)
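
For reference, the "effectively used" column comes from memory_profiler; something along these lines could reproduce that kind of measurement (a sketch only, the conversion call is a placeholder):

```python
from memory_profiler import memory_usage

def run_conversion():
    # Placeholder for the actual call, e.g. convert_msv2_to_processing_set(...)
    pass

# Sample resident memory every 0.1 s while the conversion runs and keep the peak.
peak = memory_usage((run_conversion, (), {}), interval=0.1, max_usage=True)
# Depending on the memory_profiler version, max_usage=True returns a float or a
# one-element list, both in MiB.
peak_mib = peak if isinstance(peak, float) else max(peak)
print(f"effectively used: {peak_mib / 1024:.3f} GiB")
```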

The only exception to the general rule of "never underestimate" is datasets where the pointing_xds is of a similar order of magnitude to, or comparable in size with, the main_xds.

Machines / environments:

  • cvpost030: CV node used for CASA/Pipeline builds and tests, Python 3.11.5, 251 GiB, using /lustre/naasc for MSv2=>MSv4 I/O
  • laptop0: my laptop, Ubuntu 24.04, Python 3.9.17, Core i7-8665U CPU, 15 GiB
  • casa-rockylinux8-amd64-dev07: LXC Rocky Linux 8.9 container, Python 3.11.5, on an AMD EPYC 7413 24-core server, 73 GB, using the "/casadata" array for MSv2=>MSv4 I/O
  • casa-rockylinux8-amd64-perf07: LXC container with the same OS on the same server as above, but with 251 GiB

Things to keep an eye on:

  • calc_indx_for_row_split(): this can sometimes account for about 2% of effective memory use for real datasets with a large number of rows relative to a small number of channels (for example the long LOFAR scans, and to a lesser extent some ALMA datasets when calibration scans are included). This factor becomes more significant for corner cases or test datasets that stretch the number of rows but use a low number of channels.
    => This should be sorted out after the commits from the last couple of days.

  • create_pointing: although not very relevant for sufficiently large datasets, we are missing this term, which can easily introduce an underestimation of a few 10s of MBs (up to 100s of MBs for full ALMA EBs), and that is with the default time interpolation setting.
    In addition to pointing, consider sys_cal: I haven't seen relevant examples of sys_cal taking a significant amount of memory, but that could happen.
    => Should be addressed in a follow-up PR.

  • The to_zarr factor did not seem significant (below a few 10s of MBs) in the CV lustre tests (the impact of the filesystem is not clear), so I was tempted to set an upper limit for it at, say, 200 MB, but it turned out to skyrocket to 10s of GBs in some later tests with sufficiently large partitions. I'd assume for now that this only happens when enough memory is available for buffering, and such large amounts should not need to be included in the estimate.
    => Need to keep watching the behavior of this.
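
Schematically, the missing contributions listed above could eventually be folded into the per-partition estimate along these lines (purely illustrative; the function, parameter names and the 200 MB cap are not from the branch):

```python
def refine_estimate_gib(main_xds_gib, pointing_xds_gib=0.0, syscal_xds_gib=0.0,
                        to_zarr_cap_gib=0.2):
    """Illustrative only: add bounded allowances for sub-xds and to_zarr buffering."""
    # pointing_xds / sys_cal mostly matter for small datasets or full ALMA EBs.
    sub_xds_gib = pointing_xds_gib + syscal_xds_gib
    # Keep only a bounded allowance for to_zarr; the very large allocations seen
    # on big partitions appear to be opportunistic buffering and are excluded.
    return main_xds_gib + sub_xds_gib + to_zarr_cap_gib
```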

FedeMPouzols (Collaborator) commented:

Here is an updated and slightly extended table with the estimates after the improvements in the last few commits (after 2024.11.13):

| MSv2 | Size on disk of MSv2 (GiB) | machine / env | predicted/estimated (GiB) for largest partition | effectively used (GiB, memory_profiler) for largest partition | notes |
|---|---|---|---|---|---|
| uid___A002_Xae4720_X57fe_targets.ms (ALMA proj. 2015.1.00665.S) | 0.1972 | cvpost030 | 0.02509 | 0.038281 | (pointing_xds over 22 MiB) |
| "same as above" | 0.1972 | cvpost030 | 0.075258 | 0.05927 | same MSv2, but without FIELD_ID in partition_scheme |
| twhya_calibrated.ms, ALMA cont imaging (CASA guides) | 0.4063 | laptop0 | 0.46037 | 0.45107 | |
| uid___A002_Xf3491c_Xb008_targets.ms (ALMA 2021.1.01195) | 1.8 | cvpost030 | 0.23560 | 0.2260 | |
| "same as above" | 1.8 | cvpost030 | 0.70679 | 0.6923 | same MSv2, but without FIELD_ID in partition_scheme |
| uid___A002_Xf3491c_Xb008.ms (ALMA 2021.1.01195) | 8.2 | cvpost030 | 0.52518 | 0.52695 | same EB as in the two rows above, now including calib scans |
| "same as above" | 8.2 | cvpost030 | 1.5756 | 1.4887 | same MSv2, but without FIELD_ID in partition_scheme |
| day2_TDEM0003_10s_norx.ms, VLA high frequency Spectral Line tutorial (CASA guides) | 2 | laptop0 | 0.77406 | 0.75732 | |
| NGC660.ms, EVN / VLBI spectral line workshop (CASA guides) | 2 | laptop0 | 1.9964 | 1.88125 | |
| uid___A002_Xf99bb0_X14d02_targets.ms (ALMA 2021.1.00738.S) | 76 | cvpost030 | 27.342 | 25.802 | |
| uid___A002_Xf99bb0_X143b2_targets.ms (ALMA 2021.1.00738.S) | 101 | cvpost030 | 36.456 | 34.285 | |
| uid___A002_Xf859f0_X1b39_targets.ms (ALMA 2021.1.00379.S) | 180 | cvpost030 | 37.268 | 35.015 | |
| "same as above" | 180 | cvpost030 | 74.536 | 69.994 | same MSv2, but without FIELD_ID in partition_scheme |
| uid___A002_Xdf1f69_X4734_targets (ALMA 2018.A.00031.T) | 196 | cvpost030 | 47.227 | 44.429 | |
| uid___A002_Xdf1f69_X4734.ms (ALMA 2018.A.00031.T) | 520 | cvpost030 | 65.203 | 61.167 | same EB, now including calib scans |
| L628614_SAP004_SB349_uv_001.MS (LOFAR) | 92 | cvpost030 | 78.682 | 74.027 | 5.8% overestimation, but overestimating as much as others (ALMA) because not accounting for the calc_ind_() factor (over 1.3/2 GiB) |
| L628614_SAP004_SB349_uv_001.MS (LOFAR) | 92 | casa-rockylinux8-amd64-perf07 | 78.682 | 73.477 | |
| uid___A002_Xf096a9_X9fd_target.ms | 458 | cvpost030 | 166.50 | 156.35 | to_zarr() going wild and allocating, for example, ~16 GB / ~24 GB for some partitions |

This now accounts for the memory used in calc_indx_for_row_split() and reduces the safe overestimation percentage factor from 10% to 5% of the main_xds size.
For example, in the last row (largest partition), the overestimation is now (166.50 - 156.35) / 156.35 ≈ 6.5%, where it used to be (174.25 - 156.35) / 156.35 ≈ 11.4%.

This should produce a "safe" level of overestimation, which I think can be reduced further once we account for the other sub-xdss that can be relevant at the level of ~1% or so for medium-large datasets (and much more for tiny/demo/test datasets), like the pointing_xds and the syscal_xds. I'd suggest opening an issue to add estimates for those sub-xdss, as a follow-up to this issue/branch.
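
For the tests mentioned in the issue checklist, the "never underestimate, bounded overestimation" property could be asserted directly. In a real test the pairs would come from calling the estimator and measuring an actual conversion (e.g. with memory_profiler as sketched earlier); here they are hard-coded from the table above purely for illustration, and the 10% tolerance is an assumption:

```python
import pytest

# (estimated GiB, effectively used GiB) for the largest partition, taken from
# the table above (uid___A002_Xf096a9_X9fd_target.ms and the LOFAR MS).
CASES = [
    (166.50, 156.35),
    (78.682, 74.027),
]

@pytest.mark.parametrize("estimated_gib, used_gib", CASES)
def test_estimate_is_safe_but_not_wasteful(estimated_gib, used_gib):
    # Never underestimate ...
    assert estimated_gib >= used_gib
    # ... and keep the overestimation within a bounded margin.
    assert estimated_gib <= 1.10 * used_gib
```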
