This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Make our initialization sampling more robust #180

Closed
kaituo opened this issue Jun 26, 2020 · 16 comments
Labels: enhancement (New feature or request)

Comments

kaituo (Member) commented Jun 26, 2020

Is your feature request related to a problem? Please describe.
Currently, we sample 24 points from the history for cold start, with neighboring samples 60 points apart. If there is any data hole, we stop moving forward. For example, if the 3rd sample is missing, we end up with only 61 data points (with interpolation and other optimizations). This causes long initialization.

Describe the solution you'd like
We can still move forward when there is a data hole and interpolate only between samples that are present. Continuous shingles will then be used for cold start.
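
For illustration, a minimal sketch of the proposed behavior, assuming a hypothetical helper (these class and method names are not the plugin's actual API): interpolate only between two samples that are both present, and let a hole start a new continuous segment instead of aborting cold start.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch, not the plugin's actual API: interpolate only between present samples. */
public class ColdStartInterpolation {

    /**
     * @param samples     the sampled values, Optional.empty() where the query found no data
     * @param pointsApart distance between neighboring samples (e.g. 60 intervals)
     * @return continuous segments of interpolated points; a hole simply ends one segment
     *         and the next present pair starts another, instead of aborting cold start
     */
    static List<double[]> interpolatePresentSamples(List<Optional<Double>> samples, int pointsApart) {
        List<double[]> segments = new ArrayList<>();
        List<Double> current = new ArrayList<>();
        for (int i = 0; i + 1 < samples.size(); i++) {
            Optional<Double> left = samples.get(i);
            Optional<Double> right = samples.get(i + 1);
            if (left.isPresent() && right.isPresent()) {
                // both endpoints exist: fill the gap with pointsApart linear steps;
                // the right endpoint is added when it becomes the next pair's left, or below
                double start = left.get();
                double step = (right.get() - start) / pointsApart;
                for (int j = 0; j < pointsApart; j++) {
                    current.add(start + j * step);
                }
            } else if (!current.isEmpty()) {
                // a hole: close the current segment with its last present sample (the left endpoint)
                current.add(left.get());
                segments.add(current.stream().mapToDouble(Double::doubleValue).toArray());
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            // the series ended on a present pair; close the final segment with the last sample
            current.add(samples.get(samples.size() - 1).get());
            segments.add(current.stream().mapToDouble(Double::doubleValue).toArray());
        }
        return segments;
    }
}
```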

kaituo added the enhancement label on Jun 26, 2020
wnbts (Contributor) commented Jun 29, 2020

Sampling-based data collection suffers from missing data. To eliminate this issue, the robust solution is to collect all points by scanning.

kaituo (Member, Author) commented Jun 29, 2020

Why do we have to go to the other extreme and collect all of the points? Even with all of the samples, we may still suffer from missing data. Also, what do you think of my proposed solution?

wnbts (Contributor) commented Jun 29, 2020

The goal should be to make data collection robust enough that it won't become a recurring problem in the future. The proposed solution is a step in that direction but is still vulnerable to missing data points, so the problem will recur.

wnbts (Contributor) commented Jun 29, 2020

Regarding the "extreme" comment, this is not an extreme use because:

a. the workload is limited to a specified number of samples, which is far less than what visualization/exploration uses;
b. the workload occurs only once per model;
c. on sparse streams, the process results in no aggregation work; and
d. the same approach is used elsewhere in Elasticsearch.

kaituo (Member, Author) commented Jun 29, 2020

> Regarding the "extreme" comment, this is not an extreme use because: a. the workload is limited to a specified number of samples, which is far less than what visualization/exploration uses; b. the workload occurs only once per model; c. on sparse streams, the process results in no aggregation work; and d. the same approach is used elsewhere in Elasticsearch.

a. How many samples are you going to collect? 1440 samples?
b. Agreed.
c. You still need to deal with sparse data when collecting all data points, e.g., when more than 25% of the samples are missing.
d. Agreed.

I meant we don't need to second-guess ourselves so much. The current solution is fast, cheap, and relatively accurate. You and Chris have spent a lot of time getting the current approach to work; my proposal adds a minor improvement on top of it.

wnbts (Contributor) commented Jun 30, 2020

Thanks. I believe an exhaustive search over a reasonably limited range is the right solution to this issue and to any issues of this nature in the future. The initial assumption that the data stream is dense and continuous enough for data points to be easily and correctly interpolated has proven false, and real-world data increasingly shows the opposite. Continuing with that bias does not just slow down data collection and worsen the user experience; it also hurts the quality of the data and the model.

The model needs max(24 hours/interval, 256) samples. To allow for some missing data points, the request can ask for max(24 hours/interval, 512) samples.

It is true that for very sparse data even an exhaustive search does not guarantee enough data points, but it is a best effort guaranteed by the system.
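
For concreteness, a hedged sketch of that arithmetic; the constant and method names below are mine, not necessarily what the plugin uses.

```java
import java.time.Duration;

/** Hypothetical sketch of the sample-count arithmetic described above. */
public class ColdStartSampleCount {

    static final int MIN_TRAIN_SAMPLES = 256; // samples the model needs at minimum
    static final int MIN_QUERY_SAMPLES = 512; // extra headroom for missing points

    /** Samples the model needs: max(24 hours / interval, 256). */
    static long trainSamples(Duration interval) {
        return Math.max(Duration.ofHours(24).toMillis() / interval.toMillis(), MIN_TRAIN_SAMPLES);
    }

    /** Samples to request: max(24 hours / interval, 512), to tolerate some missing points. */
    static long querySamples(Duration interval) {
        return Math.max(Duration.ofHours(24).toMillis() / interval.toMillis(), MIN_QUERY_SAMPLES);
    }

    public static void main(String[] args) {
        // a 1-minute interval: 24h covers 1440 points, which already exceeds both minimums
        System.out.println(trainSamples(Duration.ofMinutes(1)));  // 1440
        System.out.println(querySamples(Duration.ofMinutes(1)));  // 1440
        // a 10-minute interval: 24h covers only 144 points, so the minimums apply
        System.out.println(trainSamples(Duration.ofMinutes(10))); // 256
        System.out.println(querySamples(Duration.ofMinutes(10))); // 512
    }
}
```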

kaituo (Member, Author) commented Jun 30, 2020

How would you deal with missing data in your exhaustive search?

wnbts (Contributor) commented Jun 30, 2020

Since missing data points from an exhaustive search are known to be missing, they are filtered out of the training data.

ylwu-amzn (Contributor) commented Jul 7, 2020

> Since missing data points from an exhaustive search are known to be missing, they are filtered out of the training data.

Just to make sure I understand correctly: we plan to query the last max(24 hours/interval, 512) data points and then feed all of them into the model directly, without accounting for missing data points? If the first data point is present but the following 200 data points are missing, should we still feed all the data points into the model? Should we train the model on non-consecutive data points?

wnbts (Contributor) commented Jul 7, 2020

> Just to make sure I understand correctly: we plan to query the last max(24 hours/interval, 512) data points and then feed all of them into the model directly, without accounting for missing data points? If the first data point is present but the following 200 data points are missing, should we still feed all the data points into the model? Should we train the model on non-consecutive data points?

If only the first point is present and the rest are missing, the first point is unusable and discarded. The model is trained on consecutive data points, as currently designed.

ylwu-amzn (Contributor) commented Jul 7, 2020

> If only the first point is present and the rest are missing, the first point is unusable and discarded. The model is trained on consecutive data points, as currently designed.

For example, we get the last 512 data points; the first 100 data points exist, [101, 399] are missing, and [400, 512] exist. Will we only feed [400, 512] into the model since those data points are consecutive?

What if [400, 510] exist but 511 and 512 are missing? Should we drop all the data points? How about using some interpolation to fill 511 and 512, as Kaituo suggested, and then feeding [400, 512] into the model for training?

wnbts (Contributor) commented Jul 7, 2020

> For example, we get the last 512 data points; the first 100 data points exist, [101, 399] are missing, and [400, 512] exist. Will we only feed [400, 512] into the model since those data points are consecutive?
>
> What if [400, 510] exist but 511 and 512 are missing? Should we drop all the data points? How about using some interpolation to fill 511 and 512, as Kaituo suggested, and then feeding [400, 512] into the model for training?

In the first example, [0, 100] and [400, 512] will be used.

In the second example, [400, 512] will be used.

As long as a data block is continuous, the block will be used.

ylwu-amzn (Contributor) commented

> As long as a data block is continuous, the block will be used.

Does "the block is continuous" mean there are at least two consecutive data points? Trying to understand the algorithm for the cases below:
1. [0] exists, [1, 399] missing, [400, 512] exist: will we feed [400, 512] into the model?
2. [0, 1] exist, [2, 399] missing, [400, 512] exist: will we feed [0, 1] and [400, 512] into the model?
3. [0, 1] exist, [2, 399] missing, [400, 510] exist, [511, 512] missing: will we feed [0, 1] and [400, 510] into the model?

wnbts (Contributor) commented Jul 8, 2020

> Does "the block is continuous" mean there are at least two consecutive data points? Trying to understand the algorithm for the cases below:
> 1. [0] exists, [1, 399] missing, [400, 512] exist: will we feed [400, 512] into the model?
> 2. [0, 1] exist, [2, 399] missing, [400, 512] exist: will we feed [0, 1] and [400, 512] into the model?
> 3. [0, 1] exist, [2, 399] missing, [400, 510] exist, [511, 512] missing: will we feed [0, 1] and [400, 510] into the model?

A continuous block means there are at least shingle-size consecutive data points (sketched below).

1. Correct.
2. Since the shingle size is 8, only [400, 512] is used.
3. For the same reason, only [400, 510] is used.
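
A hedged sketch of the rule described in this reply (class and method names are hypothetical; the shingle size of 8 is taken from the thread): split the queried range into continuous runs of present points and keep only runs at least one shingle long.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not the plugin's actual code: keep only blocks at least one shingle long. */
public class ContinuousBlocks {

    static final int SHINGLE_SIZE = 8; // per the discussion above

    /**
     * @param points the queried range, with Double.NaN marking a missing point
     * @return continuous runs of present points whose length is at least SHINGLE_SIZE
     */
    static List<double[]> usableBlocks(double[] points) {
        List<double[]> blocks = new ArrayList<>();
        int start = -1; // start index of the current run of present points, or -1 if no run is open
        for (int i = 0; i <= points.length; i++) {
            boolean present = i < points.length && !Double.isNaN(points[i]);
            if (present && start < 0) {
                start = i; // a new continuous run begins
            } else if (!present && start >= 0) {
                if (i - start >= SHINGLE_SIZE) {
                    // the run is long enough to form at least one shingle; keep it
                    blocks.add(Arrays.copyOfRange(points, start, i));
                }
                start = -1; // the run ends at a missing point (or at the end of the range)
            }
        }
        return blocks;
    }
}
```

Under this rule, case 2 above keeps only [400, 512] because the [0, 1] run is shorter than one shingle, matching the answers given here.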

ylwu-amzn (Contributor) commented

> A continuous block means there are at least shingle-size consecutive data points.
>
> 1. Correct.
> 2. Since the shingle size is 8, only [400, 512] is used.
> 3. For the same reason, only [400, 510] is used.

Thanks. Using data blocks whose size is >= the shingle size makes sense. Is it reasonable to lower the limit, e.g., require the data block size to be >= 90% of the shingle size and interpolate the other 10% of missing data points? That way we can handle more use cases and still get a reasonable AD result. We may need some testing to verify the AD result and confirm which percentage is the bottom line.

One edge case: 7 consecutive data points exist, then one data point is missing, then another 7 consecutive data points exist, and so on:

[0, 6] exist,
[7] missing,
[8, 14] exist,
[15] missing,
[16, 22] exist,
...

In this case the data is not very sparse, yet if we require data block size >= shingle size, we will drop all of the data blocks.
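
A hedged sketch of the relaxation suggested here (the helper name is hypothetical, and the thread does not say whether the shipped fix works this way): fill an isolated single missing point by linear interpolation between its neighbors before splitting into blocks, so the pattern above still yields usable blocks.

```java
/** Hypothetical sketch: fill isolated single-point gaps so near-continuous data still forms blocks. */
public class GapFilling {

    /**
     * @param points the queried range, with Double.NaN marking a missing point
     * @return a copy in which any single missing point between two present neighbors
     *         is replaced by the average of those neighbors (a one-point linear interpolation)
     */
    static double[] fillIsolatedGaps(double[] points) {
        double[] filled = points.clone();
        for (int i = 1; i + 1 < filled.length; i++) {
            boolean gapIsIsolated = Double.isNaN(points[i])
                && !Double.isNaN(points[i - 1])
                && !Double.isNaN(points[i + 1]);
            if (gapIsIsolated) {
                // linear interpolation of a single missing point is just the midpoint of its neighbors
                filled[i] = (points[i - 1] + points[i + 1]) / 2.0;
            }
        }
        return filled;
    }
}
```

Applied before a block-selection pass like the one sketched earlier, the edge case above becomes one continuous block of 23 points ([0, 22]) instead of three runs of 7 that would all be dropped.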

wnbts (Contributor) commented Jul 14, 2020

The change has been checked in. Closing the issue.

wnbts closed this as completed on Jul 14, 2020