-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use SQL to query flex fields, and related Album/Item data #4746
Conversation
56c79b7
to
319b430
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neato!! This is very exciting; I'm really glad we (as a project) are finally getting around to getting serious about making flexattr queries fast. When we built flexattr support long ago, I was pretty happy with the "slow query" mechanism as a way to make functionality progress without worrying about performance, but I always knew we'd want to come back and make them actually work in SQL through some combination of query cleverness and schema updates. We're doing that at last!
To summarize the current strategy: we're doing a two-phase SQL query. First, use a LEFT JOIN
on the flexattr table to get IDs for potential matches, and then stringifying those ID results into a WHERE
clause for a second-stage SQL query. This seems like a reasonable way to go because it is not clear how you would write a single JOIN
that would harmonize queries that operate on flexible and fixed attributes. I don't yet 100% understand the machinery around how we enumerate the fields involved in a query, but I'll keep diving into that.
Here's one somewhat low-level question: could we do this with a nested select instead of two SQL queries? SQLite supports expressions like item.id IN (SELECT ...)
. I haven't used this "subquery" functionality much so I don't know if it fits the bill, but it would be cool if we could avoid fetching all the IDs, stringifying them, and then feeding them back into the second-stage query syntax.
beets/dbcore/db.py
Outdated
if table == "items": | ||
joins.append(join_tmpl.format("album_attributes", "items.album_id")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[longer-term comment] I know this is a tall order, but it would be nice someday if dbcore
didn't need to know anything about beets models. (It's meant to be a mostly-generic database abstraction layer.) This may be unavoidable, at least in some form, for dealing with the "item-to-album fallback." But I just wanted to put that lofty goal out here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe, I like your thinking! @wisp3rwind made the same note here: #4362 (comment) :)
Luckily in this design that's an easy adjustment, see 5690946
beets/dbcore/db.py
Outdated
@@ -670,26 +654,22 @@ class Results: | |||
constructs LibModel objects that reflect database rows. | |||
""" | |||
|
|||
def __init__(self, model_class, rows, db, flex_rows, | |||
query=None, sort=None): | |||
def __init__(self, model_class, rows, db, flex_rows, sort=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very much a strategic thing about backwards compatibility… but what would you think about leaving the "slow query" infrastructure in place here as a kind of fallback? In particular, I think it is important to have in case plugins provide slow queries that cannot be implemented in the new style. We can leave this functionality here and just gradually reduce our reliance on it—and someday in the future remove it after it's truly been proven worthless.
This kind of fallback could also support an incremental approach to your current work. For instance, if querying multiple flexattrs turns out to be kinda complicated, we can leave those as "slow" queries and they will at least keep working (and their performance will be no worse than before). Then we can address them in a future iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good point. Indeed, we shouldn't make a hard switch - hopefully this might also take care of the lslimit
testing failures, so that this PR can be merged in smoothly before we get back to #4731. Still getting used to maintaining public APIs :)
I think it is important to have in case plugins provide slow queries that cannot be implemented in the new style.
I reckon it should eventually be replaced entirely - I think it will be easier to maintain it when there's just a single way of doing this?
I wouldn't be against having this as a constraint for plugin developers - they may try and implement the filter in SQL; if their query is really only possible in Python, we could provide them with the add_db_function
(or equivalent) that we discussed the other day as a fallback? Of course, we'd have to figure out the design before we replace it completely.
This should ensure that all queries are fast in the future, methinks 🙂 What do you think about deprecating this logic, and making this clear in the changelog?
319b430
to
fdb0f59
Compare
Thanks to your blog post which gave some nice pointers about what needs improving here! I do completely agree about what does not need improving here :)
It may be possible - this would also offer an additional performance gain. I'll have a look how does it play together with having to query two flexible attribute tables! I guess those can be |
aa03146
to
88c3e14
Compare
@sampsyo see the new commits: aggregation ended up significantly simplifying the machinery behind the new queries - this now uses just one query. We can now query multiple flex attributes! The drawback is that we're losing the ability to query related flexible attributes - for example, the simplicity here does not allow There are some tests failing that have to do with the types - I'll have a deeper look at them later today/tomorrow. |
9864a1e
to
8eaee99
Compare
That is awesome!!!! Regarding the issue of "related" queries, it seems to me like these run in opposite directions:
|
@sampsyo I apologize - I used completely wrong examples in my previous comment 🤦🏽♂️ see the clarification below
In summary,
|
Just had a couple of runs comparing
The new queries have a more or less constant speed; the command speed is largely determined by the number of things that need printing. See the table below for the summary of timings and the numbers of matched results. You can see that |
24b8af9
to
878aa07
Compare
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
@sampsyo would you still want this to be merged? I could try and rebase it off master |
Improving flexible field queries would be huge... right now they are painfully slow for a large library. |
17838bc
to
3e0e639
Compare
3e0e639
to
698bb1f
Compare
0ba83c4
to
d432b50
Compare
d432b50
to
f49c2f4
Compare
This will be help with testing each of the documents which do not any more depend on the 'global' `current_app` and `request`. These two can now be provided at the time the objects are instantiated.
f49c2f4
to
ef4a4b5
Compare
We need this knowledge since the fields will tell whether we need to join any related tables.
Use `json_group_object` SQLite function to aggregate flexible attributes into `flex_attrs` field. Register SQLite converter `json.loads` to automatically convert the JSON string to a Python dictionary. Remove the code that had this task previously. Add json1 extension to ensure JSON support for users that may be using an ancient SQLite version.
For a flexible attribute query, replace the `col_name` property with a function call that extracts that attribute from the `field_attrs` field introduced in the earlier commit. Additionally, for boolean, numeric and date queries CAST the value to NUMERIC SQLite affinity to ensure that our queries like 'flex:1..5' and 'flex:true' continue working fine. This removes the concept of 'slow query', since every query for any field now has an SQL clause.
It seems like previously filtering by flexible attributes did not work - I'd receive '{"data": []}' trying to GET `/aura/tracks?filter[play_count]=11` Now this works, not only for tracks, but for `/aura/artists` and `/aura/albums` too. Additionally, this improves `/aura/tracks` response time significantly. I tried loading the default of 500 tracks from my library: On `master`, it took ~20s After this commit, it takes under 1s.
ef4a4b5
to
d88b3f2
Compare
Superseded by #5240 |
In #4746 I was making a small adjustment in beetsplug/aura.py and found that the module wasn't tested. So this PR adds some high-level tests to act a safeguard for any future adjustments.
Another and (hopefully) final attempt to improve querying speed. Fixes #4360 Fixes #3515 and possibly more issues to do with slow queries. This PR supersedes #4746. ## What's been done The `album` and `item` tables are joined, and corresponding data from `item_attributes` and `album_attributes` is merged and made available for filtering. This enables to achieve the following: - [x] Faster album path queries, `beet list -a path::some/path` - [x] Faster flexible attributes queries, both albums and tracks, `beet list play_count:10` - [x] (New) Ability to filter albums with track-level (and vice-versa) **db** field queries, `beet list -a title:something` - [x] (New) Ability to filter tracks with album-level **flexible** field queries, `beet list artpath:cover` - [x] (New) Ability to filter albums with track-level **flexible** field queries, `beet list -a art_source:something` ## Benchmarks  You can see that now querying speed is more or less constant regardless of the query, and the speed is mostly influenced by how many results need to be printed out  Compare this with what we had previously  ## Later #5318 #5319
Description
This PR depends on #4741 and #4744 which convert some queries to be SQL-compatible.
In short, it enables filtering flexible attributes and related items or album
through SQL. This speeds up flexible attributes lookup and enables queries like
This is largely the same logic that is behind #4362. I realized that the original PR ended
up too complicated with too much of not-so-strictly-required refactorings which makes it hard to review. This PR skips the noise and contains only the bits that are required.
Note: I will be adding basic tests for the new behaviour within the next days. It'd be good to hear what sort of expectations people have here - happy to test them too!
Right before submitting the PR I stumbled upon a limitation: a query with more than one flexible attribute filter gives empty results.
Completely forgot to take this into account - the way
items
and, sayitem_attributes
are currently joined yields a row for eachitems
anditem_attributes
combination, something like (simplified)items.id
item_attributes.key
item_attributes.value
skip_count
play_count
When we try to filter using a single flexible attribute, one of these rows will match (if found, of course) so it works as expected.
If we try to filter by two, we obviously fail. This can be tackled by (1) aggregating the attribtues or (2) performing
EXISTS
check within a subquery. Will test whichever is more performant and will add it in the next commit.To Do
docs/
to describe it.)docs/changelog.rst
near the top of the document.)