-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DISTINCT does not return correct values #6615
Comments
This looks like a duplicate of #6495 which should be in 0.13. Please reopen if this is still a problem after upgrading. Thanks! |
@madshall was #6495 related to your issue? I'm trying to get the timestamps of the first occurrence of distinct values, using Influx 1.2.0, and seeing the same thing you are, where the timestamps are equal to exactly now() - 1h. According to the distinct function docs, it says:
I guess according to the docs, what you and I are seeing is the expected behavior, but I'm wondering if there's a workaround to have the timestamps be the first occurrence of the value, when using distinct. |
@stevenviola Looks like DISTINCT in INFLUXDB doesn't do what SQL-thinking people would expect. I think I made my field both field and tag and did |
Yeah, it feels like it might be possible with a subquery to get the distinct values, and then be able to query for the first times those values appeared, but I'm not sure how elegant or resource intensive that is. Running your command returns the timestamp of the first row of the oldest value, but not the timestamps of the other values for |
I'm also looking for an elegant way to do grab the first time a distinct value exists. The documentation seems to suggest that in a previous version, DISTINCT returned the first timestamp. Does anyone know why this was changed? @stevenviola how effective was the subquery workaround? I was also considering using INTO to downsample, but I feel like that runs into most of the same problems as a subquery and takes up unnecessary space. |
@clarkj, I think I was incorrect in saying that a subquery would do the job, but @madshall had a good hack, by storing the values in the field and tag, grouping by the tag, and then taking the first value in each group to get the first timestamp for each distinct value. If you're able to get an INTO statement which does the trick, I'd me interested in seeing it, as that might be easier for me to implement than altering the tags that are set when writing data. |
@stevenviola I've been using a version of @madshall 's workaround. It's turned into a rather complex implementation. For context: Each scan has I want to use this to determine the first time a distinct customerId scans (i.e. when they arrived at the resort). I then I want to aggregate the results of this over time, so I can visualize the rate of customer arrival over the day in Grafana. If
Unfortunately, it does not. As a result, we have to group by customerId. This is problematic because customerId is a field. Because the high cardinality of total customers, I can't store customerID as a tag on entry. I've created a new database with a 24 hour retention policy that clears at 2am every morning (the resort is only open 8am-9pm, so this is not problematic). I don't think INTO allows transformations, so I'm between using Kapacitor to query+add to the daily DB while adding customerID as a tag or (temporarily) having Telegraf just write to the long-term database with ID being a field and to the daily database with ID being a field and a tag (Below default cardinality limit of InfluxDB as there will never be more than 10,000 distinct customer visits, and therefore ids, in a day). In this new daily database, our data now looks like:
At this point, I used @madshall 's workaround: select first(custIdField) from scanDaily groupBy(custIdTag) This returns the correct results. I would think I could use this in a subquery to calculate the total number of first scans on a time interval, but even the simpler query
returns a completely incorrect count (may be related to #7885). So next, I have a continuous query that does the equivalent of:
Finally, I can query the count of first distinct customerId's (i.e. arrival time of customer):
So I'm preforming 1 transformation (customerId as a field in the main db to a daily copy where customerId is a tag) and 1 continuous query to get my data in a form that can be queried to build a chart in Grafana. All this for something that could be replaced by a DISTINCT function that returns the first timestamp instead of epoch 0. This is even more important because of the inability to GROUP BY fields #7200. In short, a couple different issues make this particular use case extremely convoluted to implement. If anyone sees a simpler way of going about this, let me know, but I think the combination of having to turn a high cardinality unique ID into a tag AND the current bug with querying subqueries with a group by tag are to blame for a lot of the difficulty. |
Count(Distinct("field")) works, at least on grafana. |
Bug report
System info: 0.12.2, Linux 2.6.32-504.16.2.el6.x86_64 SMP Tue Mar 10 17:01:00 EDT 2015 x86_64 GNU/Linux
Steps to reproduce:
Case 1
Expected behavior: unique values for "activeClient" field are returned along with correct timestapms of first occurrence
Actual behavior: Nothing is returned
Case 2
Expected behavior: unique values for "activeClient" field are returned along with correct timestapms of first occurrence
Actual behavior: Unique values for "activeClient" field are returned, but they all have the same timestamp that is equal to exactly now() - 1h, which is not true as the values appear in database one after another
The text was updated successfully, but these errors were encountered: