Last datapoint is ignored in the calculation of average(time_weight('locf', TIME, value)) #732

Closed
kgyrtkirk opened this issue Mar 9, 2023 · 6 comments · Fixed by #740
Labels: bug (Something isn't working)
@kgyrtkirk

Relevant system information:

  • OS: timescale/timescaledb-ha:pg14.6-ts2.9.0-latest (Docker image)
  • PostgreSQL version: 14.6
  • TimescaleDB Toolkit version: 1.12.1
  • Installation method: docker

Describe the bug
Reported in https://stackoverflow.com/questions/75680213/time-weighted-average-in-timescaledb-using-last-observation-carried-forward

An incorrect average is computed from time_weight; it appears the last datapoint is not counted.

To Reproduce

DROP TABLE IF EXISTS t;


CREATE TABLE t(TIME TIMESTAMP WITH TIME ZONE NOT NULL,
               value float, k integer);


INSERT INTO t
VALUES ('2020-01-01 00:00:00', 1, 0),
       ('2020-01-01 00:00:01', 1, 0),
       ('2020-01-01 23:00:01', 1000, 1),
       ('2020-01-01 23:59:59', 1000, 2);


SELECT 0 AS k,
       time_bucket('1 days', TIME) AS timebucket,
       average(time_weight('locf', TIME, value)), (time_weight('locf', TIME, value))
FROM t
WHERE (TIME BETWEEN TIMESTAMP '2020-01-01 00:00:00+00:00' AND TIMESTAMP '2020-01-02 00:00:00+00:00')
  AND k <= 0
GROUP BY timebucket
UNION ALL
SELECT 1,
       time_bucket('1 days', TIME) AS timebucket,
       average(time_weight('locf', TIME, value)), (time_weight('locf', TIME, value))
FROM t
WHERE (TIME BETWEEN TIMESTAMP '2020-01-01 00:00:00+00:00' AND TIMESTAMP '2020-01-02 00:00:00+00:00')
  AND k <= 1
GROUP BY timebucket
UNION ALL
SELECT 2,
       time_bucket('1 days', TIME) AS timebucket,
       average(time_weight('locf', TIME, value)), (time_weight('locf', TIME, value))
FROM t
WHERE (TIME BETWEEN TIMESTAMP '2020-01-01 00:00:00+00:00' AND TIMESTAMP '2020-01-02 00:00:00+00:00')
  AND k <= 2
GROUP BY timebucket;

Expected behavior
For k=1 and k=2 the same result set would be expected; instead, the k=1 average is equal to the k=0 case.

Actual behavior

 k |       timebucket       |      average      |                                                               time_weight                                                                
---+------------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------
 0 | 2020-01-01 00:00:00+00 |                 1 | (version:1,first:(ts:"2020-01-01 00:00:00+00",val:1),last:(ts:"2020-01-01 00:00:01+00",val:1),weighted_sum:1000000,method:LOCF)
 1 | 2020-01-01 00:00:00+00 |                 1 | (version:1,first:(ts:"2020-01-01 00:00:00+00",val:1),last:(ts:"2020-01-01 23:00:01+00",val:1000),weighted_sum:82801000000,method:LOCF)
 2 | 2020-01-01 00:00:00+00 | 42.60235650875589 | (version:1,first:(ts:"2020-01-01 00:00:00+00",val:1),last:(ts:"2020-01-01 23:59:59+00",val:1000),weighted_sum:3680801000000,method:LOCF)
(3 rows)
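As a sanity check, the weighted_sum values above can be reproduced by hand: under LOCF each value is weighted by the time until the next point, so the final point gets zero weight, which is exactly the reported bug. A minimal Python sketch of that arithmetic (not the toolkit's implementation):

```python
from datetime import datetime

def locf_weighted_sum_us(points):
    """LOCF weighted sum in microseconds; points is a sorted list of
    (timestamp, value) pairs. The last point contributes zero weight."""
    total = 0.0
    for (t0, v0), (t1, _) in zip(points, points[1:]):
        total += v0 * (t1 - t0).total_seconds() * 1_000_000
    return total

ts = datetime.fromisoformat
pts = [
    (ts("2020-01-01 00:00:00"), 1.0),
    (ts("2020-01-01 00:00:01"), 1.0),
    (ts("2020-01-01 23:00:01"), 1000.0),
    (ts("2020-01-01 23:59:59"), 1000.0),
]

for k in (0, 1, 2):
    subset = pts[: k + 2]  # the k <= 0 query sees two rows, k <= 1 three, k <= 2 four
    wsum = locf_weighted_sum_us(subset)
    span_us = (subset[-1][0] - subset[0][0]).total_seconds() * 1_000_000
    print(k, int(wsum), wsum / span_us)  # matches the weighted_sum and average columns
```

This reproduces weighted sums 1000000, 82801000000, and 3680801000000, and averages 1, 1, and approximately 42.602: the 1000-valued points only gain weight once a *later* point exists to carry them to.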
kgyrtkirk added the bug label on Mar 9, 2023
@WireBaron
Contributor

WireBaron commented Mar 13, 2023

Unfortunately the time_weighted_average aggregate has no knowledge of the bounds of the time_bucket used to construct it. All it can do is assume that the last point it has is the end of the data. You can work around this by using the interpolated_average call to specify a range, interpolated_avg(time_weight(...), '2020-01-01 00:00:00+00', '1 day', NULL, NULL), but this is something we're looking at making more user friendly moving forward.
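To see why supplying a range helps: with explicit bounds, the last value can be carried forward to the end of the bucket instead of being dropped. A hand-rolled sketch of those semantics (the function name here is made up for illustration; this is not the toolkit's code):

```python
from datetime import datetime

def locf_avg_with_bounds(points, start, end):
    """Time-weighted LOCF average where the last value is carried through `end`."""
    total = 0.0
    for (t0, v0), (t1, _) in zip(points, points[1:]):
        total += v0 * (t1 - t0).total_seconds()
    last_t, last_v = points[-1]
    total += last_v * (end - last_t).total_seconds()  # the LOCF tail the plain average misses
    return total / (end - start).total_seconds()

ts = datetime.fromisoformat
pts = [
    (ts("2020-01-01 00:00:00"), 1.0),
    (ts("2020-01-01 00:00:01"), 1.0),
    (ts("2020-01-01 23:00:01"), 1000.0),
    (ts("2020-01-01 23:59:59"), 1000.0),
]
print(locf_avg_with_bounds(pts, ts("2020-01-01 00:00:00"),
                           ts("2020-01-02 00:00:00")))  # ≈ 42.6134
```

With the day's full bounds, the final 1000-valued reading is weighted over its remaining second rather than ignored, so the average changes slightly from the k=2 row above.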

@WireBaron
Contributor

Hmm, trying out the example I gave you, it looks like the interpolation will still consider the data to end at the last point if there isn't a next aggregate. Unfortunately this makes the workaround even uglier: you have to replace the average call in the example above with the following:

interpolated_average(time_weight('locf', TIME, value), '2020-01-01 00:00:00+00', '1 day', NULL, time_weight('locf', '2020-01-02 00:00:00+00', 0))

This isn't really a reasonable approach here, so I'll work with the team and see if we can't get something much more useful into our next release. Perhaps you can help inform what that looks like: which of the following approaches seems more reasonable to you?

extrapolate_average(time_weight('locf', TIME, value), '2020-01-01 00:00:00+00:00', '2020-01-02 00:00:00+00:00')

or

average(time_weight('locf', TIME, value).with_bounds('2020-01-01 00:00:00+00:00', '2020-01-02 00:00:00+00:00'))

@Timsgmlr

Timsgmlr commented Mar 16, 2023

Hey @WireBaron, I'm the one that opened the above-mentioned Stackoverflow question. My specific use case is calculating time-weighted averages over a span of a couple of months with daily time buckets. With your proposed solution it would be hard to realize that behavior, because you would have to somehow dynamically determine the bounds for each day. For my use case the best solution would be if extrapolated_average just took the time bucket into account and calculated the time_weight over the whole interval.

@kgyrtkirk
Author

@Timsgmlr - sorry for opening this issue myself; I just wanted to help. Unfortunately I can't change the reporter, but next time I'll just suggest opening an issue here instead of creating one.

@Timsgmlr

> @Timsgmlr - sorry for opening this issue myself; I just wanted to help. Unfortunately I can't change the reporter, but next time I'll just suggest opening an issue here instead of creating one.

Hey, don't worry about it, I'm glad you directly opened the issue. I just wanted to make clear why I'm answering, when technically the question was directed to you.

@WireBaron
Contributor

I worked on this a bit with the rest of the team, and it doesn't look like either extrapolated_average or with_bounds is really going to be very useful. The root of the problem is that the aggregate has no knowledge of either the values in adjacent buckets or even the bounds of the bucket it's being created with. Ultimately we'd like to somehow share some of this information through the postgres plan nodes, but it will take a bit of work to figure out what's actually possible there and what it means for the toolkit extension (so far we've been avoiding using postgres hooks).

There are a couple of improvements we do want to make to interpolated_average in the meantime at least. First, we do want to fix the current behavior where locf stops at the last data point and make sure it extends through the end of the interpolation interval even if there's no following aggregate. Second, we'd like to default the previous and next arguments to null to make this easier to use in cases like the example above.

@Timsgmlr - if you're using time_bucket to bucket your values, the arguments for interpolated_average should just be the time bucketed timestamp and bucket width.

WITH time_weights AS (
    SELECT
        time_bucket('1d', ts) as bucket,
        time_weight('locf', ts, val) as agg
    FROM data
    GROUP BY 1
)
SELECT
    AVG(
        interpolated_average(agg, bucket, '1d', 
            LAG(agg) OVER (ORDER BY bucket), 
            LEAD(agg) OVER (ORDER BY bucket))
    )
FROM time_weights;

That being said, it's likely better to rollup the aggregates before calling average or interpolated_average. This will automatically combine the aggregates correctly, and will also deal with missing days, which is an outstanding issue for interpolating calls.

WITH time_weights AS (
    SELECT
        time_bucket('1d', ts) as bucket,
        time_weight('locf', ts, val) as agg
    FROM data
    GROUP BY 1
)
SELECT
    average(rollup(agg))
FROM time_weights;
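For intuition about why the rollup approach works, combining two adjacent summaries adds their weighted sums plus an LOCF "bridge" that carries the first summary's last value across the gap between them, recovering the weight each per-bucket aggregate could not see on its own. A simplified Python model of that combine step (the toolkit's actual state and merge logic live in its Rust code):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimeWeightSummary:
    # Simplified model of a time_weight state: the first and last observed
    # points plus the accumulated LOCF weighted sum (value * seconds).
    first_ts: datetime
    first_val: float
    last_ts: datetime
    last_val: float
    weighted_sum: float

def rollup_pair(a, b):
    """Combine two summaries, a strictly before b, bridging the gap
    between a's last point and b's first point with a's last value."""
    assert a.last_ts <= b.first_ts
    bridge = a.last_val * (b.first_ts - a.last_ts).total_seconds()
    return TimeWeightSummary(a.first_ts, a.first_val,
                             b.last_ts, b.last_val,
                             a.weighted_sum + bridge + b.weighted_sum)

def average(s):
    return s.weighted_sum / (s.last_ts - s.first_ts).total_seconds()

# Split the example data into two summaries and recombine them.
ts = datetime.fromisoformat
first_half = TimeWeightSummary(ts("2020-01-01 00:00:00"), 1.0,
                               ts("2020-01-01 00:00:01"), 1.0, 1.0)
second_half = TimeWeightSummary(ts("2020-01-01 23:00:01"), 1000.0,
                                ts("2020-01-01 23:59:59"), 1000.0, 3_598_000.0)
combined = rollup_pair(first_half, second_half)
print(combined.weighted_sum, average(combined))  # recovers the k=2 row: ≈ 42.602
```

Combining the two partial summaries reproduces the full-series weighted sum and average from the table above, which is why rollup handles bucket boundaries (and missing days) that per-bucket average cannot.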
