Query's `query_hash` is incorrect for queries with parameters #1909

redshirtrob · 2017-08-05T05:43:37Z

Issue Summary

When using parameters in queries, the fully rendered query's hash_code does not match the hash_code stored in the queries table. Consequently, the latest_query_data_id does not get updated.

If the query is created with a parameter, saved, run, used in a dashboard, and that dashboard is subsequently shared, the shared dashboard will crash because there will be no cached data for the query, i.e. latest_query_data_id is empty.

Steps to Reproduce

Create a new query with a parameter, e.g.

SELECT ts_bucket,
       COUNT(*) AS records
FROM db.table
WHERE ts_date >= CURRENT_DATE - interval '{{day}}' DAY
  AND ts_bucket >= CURRENT_TIMESTAMP - interval '24' hour
GROUP BY ts_bucket

Save the query
Observe the data stored in the queries table, i.e. hash_code and latest_query_data_id

=> select id,query_hash,latest_query_data_id from queries where id=7;
 id |            query_hash            | latest_query_data_id
----+----------------------------------+----------------------
  7 | 7d6b4ae0f8b0838f73c83f19dedf602e |                  186

Execute the query
Observe the log messages. Note the logged hash_code does NOT match the table's value.

[2017-08-05 05:10:55,082][PID:637][INFO][root] Inserting job for c453626aef08a7d0e8506e90ac5bf4f0 with metadata={'Username': u'email', 'Query ID': 7}

Observe the data stored in the queries table is unchanged, especially latest_query_data_id

> select id,query_hash,latest_query_data_id from queries where id=7;
 id |            query_hash            | latest_query_data_id
----+----------------------------------+----------------------
  7 | 7d6b4ae0f8b0838f73c83f19dedf602e |                  186

Remove the parameter from the query and replace it with the actual value

SELECT ts_bucket,
       COUNT(*) AS records
FROM db.table
WHERE ts_date >= CURRENT_DATE - interval '1' DAY
  AND ts_bucket >= CURRENT_TIMESTAMP - interval '24' hour
GROUP BY ts_bucket

Save the query
Observe the new values in the database. Note that query_hash now matches the value from the log message in Step 5.

=> select id,query_hash,latest_query_data_id from queries where id=7;
 id |            query_hash            | latest_query_data_id
----+----------------------------------+----------------------
  7 | c453626aef08a7d0e8506e90ac5bf4f0 |                  188

Execute the query
Observe the log messages. Note the logged hash_code now matches the table's value.

[2017-08-05 05:20:12,731][PID:1387][INFO][root] Inserting job for c453626aef08a7d0e8506e90ac5bf4f0 with metadata={'Username': u'email', 'Query ID': 7}

Observe the values in the database. Note that the latest_query_data_id has been updated.

=> select id,query_hash,latest_query_data_id from queries where id=7;
 id |            query_hash            | latest_query_data_id
----+----------------------------------+----------------------
  7 | c453626aef08a7d0e8506e90ac5bf4f0 |                  189

I expect parameterized queries to behave just like any other queries. When they succeed the data should be linked to the queries table, and the latest results should be available in any shared dashboards.

Technical details:

I believe the simplest solution is to pass along a query_hash field in the metadata parameter to tasks:queries.py::enqueue_query(). This way QueryExecutor could choose the metadata version when calling models.QueryResult.store_result(). The query_hash field in metadata should contain the same value as the hash_code field in the queries table.

My main concern with replacing the value of the query_hash formal parameter to enqueue_query is that it's used in a bunch of places in the redis_pipeline and I'm not sure what kind of unintended side-effects that might cause.

Redash Version: 1.0.3+b2850
Browser/OS: Chrome/macOS
How did you install Redash: Docker

The text was updated successfully, but these errors were encountered:

arikfr · 2017-08-06T11:17:09Z

Thank you for the elaborate description. Another possible solution is to have the query_hash represent the query with its default parameter(s), then the rest of the system will work as intended.

WDYT?

redshirtrob · 2017-08-06T18:32:14Z

@arikfr I'm not sure I understand how that will work. I assume you mean the query_hash value stored in the queries table will be the hash of the default rendered Query? If so, I think there's still an issue in that enqueue_query generates the hash based on the text sent from the FE. This could be a Query rendered with non-default values.

def enqueue_query(query, data_source, user_id, scheduled_query=None, metadata={}):
    query_hash = gen_query_hash(query)
    logging.info("Inserting job for %s with metadata=%s", query_hash, metadata)

OTOH, perhaps I'm misunderstanding something about the architecture. I realize there are quite a few moving parts here and I'm still working my way through the code.

Anyway, I want to propose another approach. Ultimately, the Query and QueryResult are linked via models.QueryResult.store_result(), specifically this bit of code:

        queries = db.session.query(Query).filter(
            Query.query_hash == query_hash,
            Query.data_source == data_source)
        for q in queries:
            q.latest_query_data = query_result
            db.session.add(q)

Suppose we altered the interfaces a bit to pass along a query_id as well (I've done a preliminary scan, and this seems quite possible). Then, we could change that bit of code to:

        queries = db.session.query(Query).filter(
            Query.query_hash == query_hash or Query.id == query_id, # Change here
            Query.data_source == data_source)
        for q in queries:
            q.latest_query_data = query_result
            db.session.add(q)

I believe this would preserve existing behavior (in that "duplicate" queries are updated with the latest data) and that the source parameterized query also gets linked.

I'm happy to continue kicking this around. And, once we land on an approach, I'm happy to send out a PR.

arikfr · 2017-08-06T19:33:05Z

I assume you mean the query_hash value stored in the queries table will be the hash of the default rendered Query?

Exactly.

If so, I think there's still an issue in that enqueue_query generates the hash based on the text sent from the FE. This could be a Query rendered with non-default values.

It definitely can be and in this case there is no reason to update the query's latest result reference. latest_query_data is mostly used as an optimization to avoid looking up the last result. It's also used for shared dashboards. For the optimization, it doesn't matter much, but for the shared dashboards it will result in unexpected behavior if it suddenly starts to reference "random" results rather than the default one (because the shared dashboards only shows the default one).

redshirtrob · 2017-08-07T00:05:47Z

Ah, it's the point about shared dashboards that I was missing. Thanks for the explanation.

In that case I think your solution is appropriate. IIUC, saving the query would also need to render the template with default values and generate the hash from that. Is that what you mean?

arikfr · 2017-08-07T07:35:51Z

In that case I think your solution is appropriate. IIUC, saving the query would also need to render the template with default values and generate the hash from that. Is that what you mean?

Yes.
Not sure I'm 100% comfortable with this, but it feels like a reasonable solution for this issue (and not a hard one either).

redshirtrob · 2017-08-07T15:09:10Z

Here's one other thing to consider, that I tried to call out in my initial report. If you create a brand new query with a parameter, such that the first query save is already templated, you can never share this query in a dashboard because latest_query_data can't get set.

Anyway, I can understand being somewhat queasy with this change, but I think it addresses a pretty clear bug. And, as you said, it's a pretty small change.

I'll craft a PR.

arikfr · 2017-08-09T07:37:16Z

Thanks, @redshirtrob.

remil1000 · 2023-05-31T08:11:01Z

Hi @redshirtrob & @arikfr
do you think #6063 could do the trick ?

I'm not super familiar with the shared dashboard you were talking about, but I tested the above PR through simple and parameterised queries and included those in a dashboard and so far it seems work as intended

arikfr mentioned this issue Aug 8, 2017

Server Code 500 on Public Dashboard URL mozilla/redash#166

Closed

arikfr added the Backend label Dec 22, 2017

arikfr added this to the Next milestone Feb 5, 2018

arikfr added the Feature: Parameters label Jan 15, 2019

arikfr mentioned this issue Jan 15, 2019

Complete redo of permissions in Redash #3284

Open

arikfr mentioned this issue Sep 25, 2019

Queries with parameters aren't refreshed on schedule #4182

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query's `query_hash` is incorrect for queries with parameters #1909

Query's `query_hash` is incorrect for queries with parameters #1909

redshirtrob commented Aug 5, 2017 •

edited

Loading

arikfr commented Aug 6, 2017

redshirtrob commented Aug 6, 2017

arikfr commented Aug 6, 2017

redshirtrob commented Aug 7, 2017

arikfr commented Aug 7, 2017

redshirtrob commented Aug 7, 2017

arikfr commented Aug 9, 2017

remil1000 commented May 31, 2023

Query's query_hash is incorrect for queries with parameters #1909

Query's query_hash is incorrect for queries with parameters #1909

Comments

redshirtrob commented Aug 5, 2017 • edited Loading

Issue Summary

Steps to Reproduce

Technical details:

arikfr commented Aug 6, 2017

redshirtrob commented Aug 6, 2017

arikfr commented Aug 6, 2017

redshirtrob commented Aug 7, 2017

arikfr commented Aug 7, 2017

redshirtrob commented Aug 7, 2017

arikfr commented Aug 9, 2017

remil1000 commented May 31, 2023

Query's `query_hash` is incorrect for queries with parameters #1909

Query's `query_hash` is incorrect for queries with parameters #1909

redshirtrob commented Aug 5, 2017 •

edited

Loading