Last datapoint is ignored in the calculation of average(time_weight('locf', TIME, value)) #732

Closed
kgyrtkirk opened this issue Mar 9, 2023 · 6 comments · Fixed by #740
Labels: bug (Something isn't working)
@kgyrtkirk

Relevant system information:

  • OS: timescale/timescaledb-ha:pg14.6-ts2.9.0-latest (Docker image)
  • PostgreSQL version: 14.6
  • TimescaleDB Toolkit version: 1.12.1
  • Installation method: docker

Describe the bug
Reported in https://stackoverflow.com/questions/75680213/time-weighted-average-in-timescaledb-using-last-observation-carried-forward

An incorrect average is computed from time_weight; it appears the last datapoint is not counted.

To Reproduce

DROP TABLE IF EXISTS t;


CREATE TABLE t(TIME TIMESTAMP WITH TIME ZONE NOT NULL,
               value float, k integer);


INSERT INTO t
VALUES ('2020-01-01 00:00:00', 1, 0),
       ('2020-01-01 00:00:01', 1, 0),
       ('2020-01-01 23:00:01', 1000, 1),
       ('2020-01-01 23:59:59', 1000, 2);


SELECT 0 AS k,
       time_bucket('1 days', TIME) AS timebucket,
       average(time_weight('locf', TIME, value)), (time_weight('locf', TIME, value))
FROM t
WHERE (TIME BETWEEN TIMESTAMP '2020-01-01 00:00:00+00:00' AND TIMESTAMP '2020-01-02 00:00:00+00:00')
  AND k <= 0
GROUP BY timebucket
UNION ALL
SELECT 1,
       time_bucket('1 days', TIME) AS timebucket,
       average(time_weight('locf', TIME, value)), (time_weight('locf', TIME, value))
FROM t
WHERE (TIME BETWEEN TIMESTAMP '2020-01-01 00:00:00+00:00' AND TIMESTAMP '2020-01-02 00:00:00+00:00')
  AND k <= 1
GROUP BY timebucket
UNION ALL
SELECT 2,
       time_bucket('1 days', TIME) AS timebucket,
       average(time_weight('locf', TIME, value)), (time_weight('locf', TIME, value))
FROM t
WHERE (TIME BETWEEN TIMESTAMP '2020-01-01 00:00:00+00:00' AND TIMESTAMP '2020-01-02 00:00:00+00:00')
  AND k <= 2
GROUP BY timebucket;

Expected behavior
For k=1 and k=2 the same result set would be expected; instead, the k=1 average is equal to the k=0 case.

Actual behavior

 k |       timebucket       |      average      |                                                               time_weight                                                                
---+------------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------
 0 | 2020-01-01 00:00:00+00 |                 1 | (version:1,first:(ts:"2020-01-01 00:00:00+00",val:1),last:(ts:"2020-01-01 00:00:01+00",val:1),weighted_sum:1000000,method:LOCF)
 1 | 2020-01-01 00:00:00+00 |                 1 | (version:1,first:(ts:"2020-01-01 00:00:00+00",val:1),last:(ts:"2020-01-01 23:00:01+00",val:1000),weighted_sum:82801000000,method:LOCF)
 2 | 2020-01-01 00:00:00+00 | 42.60235650875589 | (version:1,first:(ts:"2020-01-01 00:00:00+00",val:1),last:(ts:"2020-01-01 23:59:59+00",val:1000),weighted_sum:3680801000000,method:LOCF)
(3 rows)
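As a sanity check, the weighted_sum values above can be reproduced by hand: under LOCF each value is weighted by the time until the next point, so the final point gets zero weight, which is exactly the reported bug. A minimal Python sketch of that arithmetic (not the toolkit's implementation):

```python
from datetime import datetime

def locf_weighted_sum_us(points):
    """LOCF weighted sum in microseconds; points is a sorted list of
    (timestamp, value) pairs. The last point contributes zero weight."""
    total = 0.0
    for (t0, v0), (t1, _) in zip(points, points[1:]):
        total += v0 * (t1 - t0).total_seconds() * 1_000_000
    return total

ts = datetime.fromisoformat
pts = [
    (ts("2020-01-01 00:00:00"), 1.0),
    (ts("2020-01-01 00:00:01"), 1.0),
    (ts("2020-01-01 23:00:01"), 1000.0),
    (ts("2020-01-01 23:59:59"), 1000.0),
]

for k in (0, 1, 2):
    subset = pts[: k + 2]  # the k <= 0 query sees two rows, k <= 1 three, k <= 2 four
    wsum = locf_weighted_sum_us(subset)
    span_us = (subset[-1][0] - subset[0][0]).total_seconds() * 1_000_000
    print(k, int(wsum), wsum / span_us)  # matches the weighted_sum and average columns
```

This reproduces weighted sums 1000000, 82801000000, and 3680801000000, and averages 1, 1, and approximately 42.602: the 1000-valued points only gain weight once a *later* point exists to carry them to.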
kgyrtkirk added the bug label on Mar 9, 2023
@WireBaron
Contributor

WireBaron commented Mar 13, 2023

Unfortunately the time_weighted_average aggregate has no knowledge of the bounds of the time_bucket used to construct it. All it can do is assume that the last point it has is the end of the data. You can work around this by using the interpolated_average call to specify a range, interpolated_avg(time_weight(...), '2020-01-01 00:00:00+00', '1 day', NULL, NULL), but this is something we're looking at making more user friendly moving forward.
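To see why supplying a range helps: with explicit bounds, the last value can be carried forward to the end of the bucket instead of being dropped. A hand-rolled sketch of those semantics (the function name here is made up for illustration; this is not the toolkit's code):

```python
from datetime import datetime

def locf_avg_with_bounds(points, start, end):
    """Time-weighted LOCF average where the last value is carried through `end`."""
    total = 0.0
    for (t0, v0), (t1, _) in zip(points, points[1:]):
        total += v0 * (t1 - t0).total_seconds()
    last_t, last_v = points[-1]
    total += last_v * (end - last_t).total_seconds()  # the LOCF tail the plain average misses
    return total / (end - start).total_seconds()

ts = datetime.fromisoformat
pts = [
    (ts("2020-01-01 00:00:00"), 1.0),
    (ts("2020-01-01 00:00:01"), 1.0),
    (ts("2020-01-01 23:00:01"), 1000.0),
    (ts("2020-01-01 23:59:59"), 1000.0),
]
print(locf_avg_with_bounds(pts, ts("2020-01-01 00:00:00"),
                           ts("2020-01-02 00:00:00")))  # ≈ 42.6134
```

With the day's full bounds, the final 1000-valued reading is weighted over its remaining second rather than ignored, so the average changes slightly from the k=2 row above.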

@WireBaron
Contributor

Hmm, trying out the example I gave you, it looks like the interpolation will still consider the data to end at the last point if there isn't a next aggregate. Unfortunately this makes the workaround even uglier: you have to replace the average call in the example above with the following:

interpolated_average(time_weight('locf', TIME, value), '2020-01-01 00:00:00+00', '1 day', NULL, time_weight('locf', '2020-01-02 00:00:00+00', 0))

This isn't really a reasonable approach here, so I'll work with the team and see if we can't get something much more useful into our next release. Perhaps you can help inform what that looks like: which of the following approaches seems more reasonable to you?

extrapolate_average(time_weight('locf', TIME, value), '2020-01-01 00:00:00+00:00', '2020-01-02 00:00:00+00:00')

or

average(time_weight('locf', TIME, value).with_bounds('2020-01-01 00:00:00+00:00', '2020-01-02 00:00:00+00:00'))

@Timsgmlr

Timsgmlr commented Mar 16, 2023

Hey @WireBaron, I'm the one that opened the above-mentioned Stackoverflow question. My specific use case is calculating time-weighted averages over a span of a couple of months with daily time buckets. With your proposed solution it would be hard to realize that behavior, because you would have to somehow dynamically determine the bounds for each day. For my use case the best solution would be if extrapolated_average just took the time bucket into account and calculated the time_weight over the whole interval.

@kgyrtkirk
Author

@Timsgmlr - sorry for opening this issue myself; I just wanted to help. Unfortunately I can't change the reporter, but next time I'll just suggest opening an issue here instead of creating one.

@Timsgmlr

> @Timsgmlr - sorry for opening this issue myself; I just wanted to help. Unfortunately I can't change the reporter, but next time I'll just suggest opening an issue here instead of creating one.

Hey, don't worry about it, I'm glad you directly opened the issue. I just wanted to make clear why I'm answering, when technically the question was directed to you.

@WireBaron
Contributor

I worked on this a bit with the rest of the team, and it doesn't look like either extrapolated_average or with_bounds is really going to be very useful. The root of the problem is that the aggregate has no knowledge of either the values in adjacent buckets or even the bounds of the bucket it's being created with. Ultimately we'd like to somehow share some of this information through the postgres plan nodes, but it will take a bit of work to figure out what's actually possible there and what it means for the toolkit extension (so far we've been avoiding using postgres hooks).

There are a couple of improvements we do want to make to interpolated_average in the meantime at least. First, we do want to fix the current behavior where locf stops at the last data point and make sure it extends through the end of the interpolation interval even if there's no following aggregate. Second, we'd like to default the previous and next arguments to null to make this easier to use in cases like the example above.

@Timsgmlr - if you're using time_bucket to bucket your values, the arguments for interpolated_average should just be the time bucketed timestamp and bucket width.

WITH time_weights AS (
    SELECT
        time_bucket('1d', ts) as bucket,
        time_weight('locf', ts, val) as agg
    FROM data
    GROUP BY 1
)
SELECT
    AVG(
        interpolated_average(agg, bucket, '1d', 
            LAG(agg) OVER (ORDER BY bucket), 
            LEAD(agg) OVER (ORDER BY bucket))
    )
FROM time_weights;

That being said, it's likely better to rollup the aggregates before calling average or interpolated_average. This will automatically combine the aggregates correctly, and will also deal with missing days, which is an outstanding issue for interpolating calls.

WITH time_weights AS (
    SELECT
        time_bucket('1d', ts) as bucket,
        time_weight('locf', ts, val) as agg
    FROM data
    GROUP BY 1
)
SELECT
    average(rollup(agg))
FROM time_weights;
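For intuition about why the rollup approach works, combining two adjacent summaries adds their weighted sums plus an LOCF "bridge" that carries the first summary's last value across the gap between them, recovering the weight each per-bucket aggregate could not see on its own. A simplified Python model of that combine step (the toolkit's actual state and merge logic live in its Rust code):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimeWeightSummary:
    # Simplified model of a time_weight state: the first and last observed
    # points plus the accumulated LOCF weighted sum (value * seconds).
    first_ts: datetime
    first_val: float
    last_ts: datetime
    last_val: float
    weighted_sum: float

def rollup_pair(a, b):
    """Combine two summaries, a strictly before b, bridging the gap
    between a's last point and b's first point with a's last value."""
    assert a.last_ts <= b.first_ts
    bridge = a.last_val * (b.first_ts - a.last_ts).total_seconds()
    return TimeWeightSummary(a.first_ts, a.first_val,
                             b.last_ts, b.last_val,
                             a.weighted_sum + bridge + b.weighted_sum)

def average(s):
    return s.weighted_sum / (s.last_ts - s.first_ts).total_seconds()

# Split the example data into two summaries and recombine them.
ts = datetime.fromisoformat
first_half = TimeWeightSummary(ts("2020-01-01 00:00:00"), 1.0,
                               ts("2020-01-01 00:00:01"), 1.0, 1.0)
second_half = TimeWeightSummary(ts("2020-01-01 23:00:01"), 1000.0,
                                ts("2020-01-01 23:59:59"), 1000.0, 3_598_000.0)
combined = rollup_pair(first_half, second_half)
print(combined.weighted_sum, average(combined))  # recovers the k=2 row: ≈ 42.602
```

Combining the two partial summaries reproduces the full-series weighted sum and average from the table above, which is why rollup handles bucket boundaries (and missing days) that per-bucket average cannot.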
