Potential N^2 issue in large file read #1279

Open · 4 of 12 tasks · gsjaardema opened this issue Jan 9, 2019 · 26 comments
@gsjaardema
Contributor

gsjaardema commented Jan 9, 2019

Please provide as much of the following information as you can, as applicable to the issue being reported. Naturally, not all information is relevant to every issue, but the more information we have to start, the better!

Environment Information

Feel free to skip this if the issue is related to documentation, a feature request, or general discussion.

  • What platform are you using? (please provide specific distribution/version in summary)
    • Linux
    • Windows
    • OSX
    • Other
    • NA
  • 32 and/or 64 bit?
    • 32-bit
    • 64-bit
  • What build system are you using?
    • autotools (configure)
    • cmake
  • Can you provide a sample netCDF file or C code to recreate the issue?
    • Yes (please attach to this issue, thank you!)
    • No
    • Not at this time

Summary of Issue

I have a file with 10,000 dims and 7,000 vars. If I do a nccopy with the file in netcdf-3 classic format, it takes 0.422 seconds. If I do the same nccopy except that the file is in netcdf-4 classic format, it takes 13.360 seconds. These are both for debug, non-optimized builds.

I ran the nccopy under valgrind and it points to the rec_match_dimscales function as taking 32.68% of the time (inclusive, i.e. including the functions it calls). It calls ncindexth and nclistget each about 60,000,000 times.

It looks like rec_match_dimscales loops over the vars and then, inside that loop, loops over the dims to find matches for the var's dimscales. This results in O(nvars * ndims) scaling behavior.

Maybe it is possible to preprocess the dims data so that a var's dimscale dims can be matched without a linear search through the dims list? Perhaps a hash on the objno and fileno would permit O(1) access instead of O(N^2)...
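For illustration, a minimal sketch of that idea in C: pre-index the dims once by their HDF5 (fileno, objno) pair so each dimscale match becomes an O(1) probe. All names here are hypothetical, not netcdf-c internals; the table is assumed to be sized up front to more slots than there are dims.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: hash the dims by (fileno, objno) once, so each
 * dimscale match is an O(1) probe instead of a scan over all dims. */
typedef struct DimEntry {
    unsigned long fileno[2]; /* HDF5 ids of the dimscale dataset */
    unsigned long objno[2];
    void *dim;               /* would be NC_DIM_INFO_T* in the library */
    int used;
} DimEntry;

typedef struct DimTable {
    DimEntry *slots;
    size_t cap;              /* power of two, sized up front from ndims */
} DimTable;

static uint64_t hash_ids(const unsigned long f[2], const unsigned long o[2]) {
    /* FNV-1a over the four id words. */
    uint64_t h = 1469598103934665603ULL;
    unsigned long v[4] = { f[0], f[1], o[0], o[1] };
    for (int i = 0; i < 4; i++) { h ^= v[i]; h *= 1099511628211ULL; }
    return h;
}

static int same_ids(const DimEntry *e, const unsigned long f[2],
                    const unsigned long o[2]) {
    return e->fileno[0] == f[0] && e->fileno[1] == f[1] &&
           e->objno[0] == o[0] && e->objno[1] == o[1];
}

/* Insert one dim; assumes the table always has a free slot. */
static void dimtable_put(DimTable *t, const unsigned long f[2],
                         const unsigned long o[2], void *dim) {
    size_t i = (size_t)(hash_ids(f, o) & (t->cap - 1));
    while (t->slots[i].used)               /* linear probing */
        i = (i + 1) & (t->cap - 1);
    memcpy(t->slots[i].fileno, f, sizeof t->slots[i].fileno);
    memcpy(t->slots[i].objno, o, sizeof t->slots[i].objno);
    t->slots[i].dim = dim;
    t->slots[i].used = 1;
}

static void *dimtable_get(const DimTable *t, const unsigned long f[2],
                          const unsigned long o[2]) {
    size_t i = (size_t)(hash_ids(f, o) & (t->cap - 1));
    while (t->slots[i].used) {
        if (same_ids(&t->slots[i], f, o))
            return t->slots[i].dim;        /* matching dimscale found */
        i = (i + 1) & (t->cap - 1);
    }
    return NULL;                           /* no match for this dimscale */
}
```

Built once before the per-var matching loop, dimtable_get would replace the inner scan over all 10,000 dims.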

The attached file was compressed with gzip.

mix-nc4.g.gz

These are all with current github master.

@gsjaardema
Contributor Author

A correction to the above -- debug nccopy time for netcdf-3 is 0.7 to 0.8 seconds.

Optimized shows:

ceerws2703a:build(master)> time ../bin/nccopy mix-nc4.g tmp-4.g (3 times)
	0m11.405s
	0m11.775s
	0m11.799s

ceerws2703a:build(master)> time ../bin/nccopy mix-nc3.g tmp-3.g (3 times)
	0m0.545s
	0m0.529s
	0m0.589s

@edhartnett
Contributor

Can you try with the code in PR #1262? You must recreate your test file with the library after applying #1262. Then you will avoid the dimscale matching.

@gsjaardema
Contributor Author

I tried it on my laptop (not the system used above) and it only dropped times from about 10.6 to 9.9 seconds. I will try on the system above later this week or early next week.

@DennisHeimbigner
Collaborator

Part of the problem is that nccopy is, in the end, going to access all of the metadata in order to copy it. In that situation there is no way to get better speed than the underlying HDF5 allows. My speculation is that doing the copy using only the HDF5 library would not be significantly faster. If there is an HDF5 copy utility, we could test that out.

@DennisHeimbigner
Collaborator

Actually I see there is an h5copy utility. Might be interesting to
do a complete copy and time it. If it is much faster, then it is doing
some kind of optimization of which we are not aware.

@edhartnett
Contributor

edhartnett commented Jan 10, 2019

Greg, did you recreate your file with the new version of netcdf?

Use h5dump to confirm that all datasets in the file have the _Netcdf4Coordinates attribute, which lists the dimids for that dataset.
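For a programmatic check, something like this against the HDF5 C API should also work (a sketch; the file and dataset names are placeholders, and error checks are omitted):

```c
#include <hdf5.h>
#include <stdio.h>

/* Check whether a dataset carries the hidden _Netcdf4Coordinates attribute. */
int main(void) {
    hid_t file = H5Fopen("file.nc", H5F_ACC_RDONLY, H5P_DEFAULT); /* placeholder name */
    hid_t dset = H5Dopen2(file, "/some_var", H5P_DEFAULT);        /* placeholder name */
    htri_t has = H5Aexists(dset, "_Netcdf4Coordinates");
    printf("_Netcdf4Coordinates present: %s\n", has > 0 ? "yes" : "no");
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```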

@gsjaardema
Contributor Author

Using valgrind, the patch reduced the overall execution time from 18.289 to 15.364 seconds, about 16%, which is good.

I haven't found a good HDF5 utility to check how fast it can read/write the files. As best I can tell, h5copy copies only selected datasets.

@edhartnett
Contributor

Can you use a profiler and see where all the time is being spent?

It should not be in rec_match_dimscales(). So where is it?

@gsjaardema
Contributor Author

The percentage of time in rec_match_dimscales() went from 32.7% to 17.8%, but it is still the most expensive function (including the functions it calls, nclistget and ncindexth).

I've attached a screen shot showing the before and after profile sorted by self time.

[screenshot: before and after profiles, sorted by self time]

I recreated the file using nccopy old_file.nc new_file.nc and then did all subsequent runs using new_file.nc. I do see the _Netcdf4Coordinates attribute using h5dump.

@gsjaardema
Contributor Author

It looks like the _Netcdf4Coordinates attribute exists on all vars that have more than one dimension, but if the variable has only a single dimension, it does not have that attribute. In this file, probably about half are multi-dim and about half are ndims=1.

@gsjaardema
Contributor Author

If I modify the library to write the _Netcdf4Coordinates attribute for all variables, regenerate the file, and rerun valgrind, the time drops from 15.364 to 12.871 seconds, another 16%. rec_match_dimscales() now accounts for only 0.01% of the time. The top functions are now HDF5- or malloc/free-related.

[screenshot: profile after writing _Netcdf4Coordinates for all variables, sorted by self time]

(I think the memcpy, which is 50% and coming from nclistsetalloc, could be reduced with smarter reallocation, though probably at a cost in memory use...)

@edhartnett
Contributor

What do you mean "modify to write the _Netcdf4Coordinates attribute for all variables"? This should be happening automatically.

@gsjaardema
Contributor Author

No, currently it only happens for multi-dimensional variables, according to both the documentation and the code. See nc4hdf.c:

```diff
@@ -1012,7 +1012,7 @@ var_create_dataset(NC_GRP_INFO_T *grp, NC_VAR_INFO_T *var, nc_bool_t write_dimid
    /* Always write the hidden coordinates attribute, which lists the
     * dimids of this var. When present, this speeds opens. When not
     * present, dimscale matching is used. */
-   if (var->ndims > 1)
+   /*   if (var->ndims > 1) */
       if ((retval = write_coord_dimids(var)))
          BAIL(retval);
```

@edhartnett
Contributor

OK, thanks for catching that, it should be used even for 1D vars. I will fix that.

Let me construct a test case with 5000 dimensions and see what I can see. ;-)

@DennisHeimbigner
Collaborator

DennisHeimbigner commented Jan 14, 2019

I got h5copy to copy the file using this command:

h5copy -i input.h5 -o output.h5 -f ref -s / -d /x

In order to avoid a name conflict (God knows why), I had to add the extra group /x to receive the content of / in the input file. But it seems to copy everything.

@DennisHeimbigner
Collaborator

If nclistsetalloc is the problem, then it is likely because its default list
allocation size is too small. It starts by allocating 16 elements and
then doubles every time it needs to increase.
The ncindex code probably is the ultimate problem and needs to use
a larger default starting size.

@edhartnett
Contributor

I am modifying the recently-added tst_wrf_reads.c to emulate this case. That way we can try a few things...

@gsjaardema
Contributor Author

gsjaardema commented Jan 14, 2019 via email

@DennisHeimbigner
Collaborator

Oops, you are correct. It doubles only if no size is specified; otherwise it grows to exactly the size specified.
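A simplified sketch of the growth rule as just described (not the actual nclist source; names here are illustrative): doubling applies only when no explicit size is requested.

```c
#include <stdlib.h>
#include <string.h>

#define DEFAULTALLOC 16

typedef struct NCSketchList {
    size_t alloc;   /* slots allocated */
    size_t length;  /* slots in use */
    void **content;
} NCSketchList;

/* Grow the backing array. If sz == 0, double the current allocation
 * (starting from DEFAULTALLOC); otherwise grow to exactly sz slots.
 * Doubling happens only when no explicit size is requested. */
static int sketch_setalloc(NCSketchList *l, size_t sz) {
    void **newcontent;
    if (sz == 0)
        sz = l->alloc ? 2 * l->alloc : DEFAULTALLOC;
    if (sz <= l->alloc)
        return 1; /* already big enough */
    newcontent = calloc(sz, sizeof(void *));
    if (!newcontent)
        return 0;
    if (l->length > 0) /* the memcpy that shows up in the profile above */
        memcpy(newcontent, l->content, l->length * sizeof(void *));
    free(l->content);
    l->content = newcontent;
    l->alloc = sz;
    return 1;
}
```

With exact-size growth, a caller that repeatedly requests one more slot pays a memcpy of the whole array each time, which is consistent with the memcpy cost seen in the profile.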

@DennisHeimbigner
Collaborator

To be precise, regarding

> O(nvars * ndims) scale behavior.

ndims here is the rank of each var, so it is not O(nvars * alldims) but rather much less.

@edwardhartnett
Contributor

Once your changes in #2047 are merged, this should improve a lot...

@gsjaardema
Contributor Author

@DennisHeimbigner I disagree: without the coord dimids on the 1D variables, the behavior scales as O(nvars * ndims) where ndims == alldims, not the rank of each var. You can see this with the 10,000 dims and 7,000 vars calling ncindexth and nclistget each about 60,000,000 times (7,000 × 10,000 = 70,000,000, the right order of magnitude).

I think it actually scales as O((1D nvars) * ndims).

@DennisHeimbigner
Collaborator

Odd. I was looking at the code for rec_match_dimscales and it appears to iterate over the dim ids associated with the variable and then do a direct lookup of the corresponding dimension object.

@gsjaardema
Contributor Author

I think that if it has the dim ids associated with the variable, then it can do the direct lookup, but if it is a 1D variable, it has to search all of the dims for the dimid. That is the speedup of #2047: it gives us the coordinate dimid attribute on all variables, so we get the direct lookup, instead of the current behavior where, without the patch, 1D variables have no way of storing the dimid and therefore there is no direct lookup.

Maybe I am misunderstanding your question/assertion.

@DennisHeimbigner
Collaborator

If you have the NC_VAR_INFO_T* pointer to the var info,
then the field "dimids" should have the set of dimids associated
with the variable whatever the rank.
So if you are seeing nvars*ndims behavior, there is something seriously
wrong with rec_match_dimscales. I will take a closer look.

@edwardhartnett
Contributor

IIRC, rec_match_dimscales() should be called only when our own hidden attribute metadata does not know the dimids. In that case, rec_match_dimscales() does a very slow recursive search through the file for matching dimscales.

Since something like 4.6.x, the dim info is always written to our internal hidden attribute metadata, but before that it was not.
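To make the two paths concrete, here is a hedged sketch of the open-time logic being described; the function and type names are placeholders, not the actual netcdf-c internals.

```c
#include <stdio.h>

/* Hypothetical sketch of the open-time logic described above. */
typedef struct Var {
    int ndims;
    int dimids[8];      /* filled in by one of the two paths below */
    int has_coord_attr; /* stands in for "has _Netcdf4Coordinates" */
} Var;

/* Fast path: read dimids straight from the hidden attribute. */
static int read_coord_dimids_attr(Var *var) {
    return var->has_coord_attr; /* stub: pretend the attr was parsed */
}

/* Slow path: stands in for the recursive dimscale search that scans
 * all dims for each var lacking the attribute (the N^2 behavior). */
static void match_dimscales_recursively(Var *var) {
    printf("slow dimscale matching for var with rank %d\n", var->ndims);
}

static void resolve_var_dims(Var *var) {
    if (read_coord_dimids_attr(var))
        return;                        /* O(rank) per var */
    match_dimscales_recursively(var);  /* triggers the O(nvars*alldims) scan */
}

int main(void) {
    Var with_attr = { 2, {0, 1}, 1 };
    Var without_attr = { 1, {2}, 0 };  /* e.g. a pre-fix 1D variable */
    resolve_var_dims(&with_attr);
    resolve_var_dims(&without_attr);
    return 0;
}
```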

@WardF WardF modified the milestones: 4.8.0, 4.8.2 Aug 30, 2021
@WardF WardF modified the milestones: 4.8.2, 4.9.1 Jun 15, 2022
@WardF WardF modified the milestones: 4.9.1, 4.9.2 Feb 13, 2023
@WardF WardF modified the milestones: 4.9.2, 4.9.3 May 16, 2023