
Process fit_curve needs a lot of time #53

Open

ValentinaHutter opened this issue Oct 6, 2021 · 2 comments

@ValentinaHutter
Collaborator

The fit_curve process works for both small and large spatial extents, but it takes significantly longer for large ones.
While fit_curve is calculating the parameters, the data must not be chunked along the temporal dimension. We therefore tried chunking along the spatial dimensions instead, but this did not speed the process up.
With an extent of 'x': (11.390419, 11.501999), 'y': (46.311778, 46.373875), 'time': ['2016-09-01', '2018-09-01'], 'measurements': ['B01', 'B02', 'B03', 'B04', 'B07'] and 'dask_chunks': {'bands': 1, 'time': 150, 'x': 1000, 'y': 1000}, applying the fit_curve process takes just under an hour.
With the same extent but 'dask_chunks': {'bands': 1, 'time': 150, 'x': 250, 'y': 250}, the process took almost 2 hours. So chunking the dataset along the spatial dimensions does not work the way it should.
By contrast, a much smaller extent like 'x': (11.436012, 11.43804), 'y': (46.346286, 46.34833) takes about a minute.
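
For reference, a minimal sketch of how such a cube might be loaded, assuming an Open Data Cube-style loader as the parameter names above suggest. The product name is hypothetical; only the extents, bands, and chunk sizes come from the report (dc.load returns one data variable per band, so there is no 'bands' dimension to chunk here):

```python
import datacube

dc = datacube.Datacube()

# Hypothetical product name; extents, bands, and chunk sizes from the report above.
ds = dc.load(
    product="s2_l2a",
    x=(11.390419, 11.501999),
    y=(46.311778, 46.373875),
    time=("2016-09-01", "2018-09-01"),
    measurements=["B01", "B02", "B03", "B04", "B07"],
    dask_chunks={"time": 150, "x": 1000, "y": 1000},  # just under an hour
    # dask_chunks={"time": 150, "x": 250, "y": 250},  # almost 2 hours
)
```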

@clausmichele
Member

Thanks for the tests, I'll also try out some alternatives locally to see if there is room for improvement.

@clausmichele
Member

Here is some info about the tests I ran:

  1. Using a good estimate for the initial parameters makes a huge difference: a sample case I'm trying takes ~16 seconds with initial parameters [0,0,0] and ~6 seconds with [2000,0,0] (see the sketch after the timings below).
  2. Resampling the data to a weekly average (aggregate_temporal_period with reducer=mean) decreases the number of samples in the time series and therefore slightly reduces the time required for fitting. The result is very similar, but the performance gain is not worth it.
  3. The most important thing to check is the chunk size of the input data:

chunks={'time': -1, 'x': 8, 'y': 8}
...
CPU times: user 16.6 s, sys: 866 ms, total: 17.4 s
Wall time: 17.4 s

chunks={'time': -1, 'x': 64, 'y': 64}
...
CPU times: user 1.96 s, sys: 484 ms, total: 2.44 s
Wall time: 4.8 s

chunks={'time': -1, 'x': 128, 'y': 128}
...
CPU times: user 1.55 s, sys: 488 ms, total: 2.04 s
Wall time: 3.8 s

chunks={'time': 1, 'x': 128, 'y': 128} with apply_ufunc option 'allow_rechunk': True
...
CPU times: user 7.27 s, sys: 786 ms, total: 8.06 s
Wall time: 11.8 s

So the data must be chunked only along the spatial dimensions and not along the temporal dimension, keeping the rechunking option set to False: 'allow_rechunk': False.
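
To make points 1 and 3 concrete, here is a minimal sketch of a fit along these lines. It is not the implementation in this repository: the three-parameter seasonal model, the 128x128 chunking, and the [2000, 0, 0] initial guess are assumptions taken from the numbers above.

```python
import numpy as np
import xarray as xr
from scipy.optimize import curve_fit

def model(t, a, b, c):
    # Assumed seasonal model with three parameters (a yearly
    # sine/cosine around a baseline), in the spirit of fit_curve.
    omega = 2 * np.pi / 31557600.0  # one year in seconds
    return a + b * np.cos(omega * t) + c * np.sin(omega * t)

def fit_pixel(y, t, p0):
    # Fit a single time series; real data would also need NaN masking.
    popt, _ = curve_fit(model, t, y, p0=p0)
    return popt

def fit_cube(data, p0=(2000.0, 0.0, 0.0)):
    # Chunk only along the spatial dimensions: each pixel's full time
    # series has to sit in a single chunk for the fit.
    data = data.chunk({"time": -1, "x": 128, "y": 128})
    t = data["time"].values.astype("datetime64[s]").astype(float)
    return xr.apply_ufunc(
        fit_pixel,
        data,
        kwargs={"t": t, "p0": p0},
        input_core_dims=[["time"]],
        output_core_dims=[["param"]],
        vectorize=True,
        dask="parallelized",
        output_dtypes=[np.float64],
        # allow_rechunk=True would instead let dask merge the time
        # chunks on the fly, which is what made the last run above slow.
        dask_gufunc_kwargs={"allow_rechunk": False,
                            "output_sizes": {"param": 3}},
    )
```

For the weekly resampling in point 2, the plain xarray equivalent would be something like data.resample(time="1W").mean().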
