Use SQL to query flex fields, and related Album/Item data #4746

snejus · 2023-04-08T18:31:07Z

Description

This PR depends on #4741 and #4744 which convert some queries to be SQL-compatible.

In short, it enables filtering flexible attributes and related items or album
through SQL. This speeds up flexible attributes lookup and enables queries like

beet list album_flex:something  # list all items under the albums that have album_flex:something
beet list -a item_flex:something  # list all albums which have at least one item with item_flex:something

This is largely the same logic that is behind #4362. I realized that the original PR ended
up too complicated with too much of not-so-strictly-required refactorings which makes it hard to review. This PR skips the noise and contains only the bits that are required.

Note: I will be adding basic tests for the new behaviour within the next days. It'd be good to hear what sort of expectations people have here - happy to test them too!

Right before submitting the PR I stumbled upon a limitation: a query with more than one flexible attribute filter gives empty results.

Completely forgot to take this into account - the way items and, say
item_attributes are currently joined yields a row for each items and
item_attributes combination, something like (simplified)

items.id item_attributes.key item_attributes.value

1 skip_count 10

1 play_count 5

When we try to filter using a single flexible attribute, one of these rows will match (if found, of course) so it works as expected.

If we try to filter by two, we obviously fail. This can be tackled by (1) aggregating the attribtues or (2) performing EXISTS check within a subquery. Will test whichever is more performant and will add it in the next commit.

To Do

Documentation. (If you've add a new command-line flag, for example, find the appropriate page under docs/ to describe it.)
Changelog. (Add an entry to docs/changelog.rst near the top of the document.)
Tests. (Encouraged but not strictly required.)

sampsyo

Neato!! This is very exciting; I'm really glad we (as a project) are finally getting around to getting serious about making flexattr queries fast. When we built flexattr support long ago, I was pretty happy with the "slow query" mechanism as a way to make functionality progress without worrying about performance, but I always knew we'd want to come back and make them actually work in SQL through some combination of query cleverness and schema updates. We're doing that at last!

To summarize the current strategy: we're doing a two-phase SQL query. First, use a LEFT JOIN on the flexattr table to get IDs for potential matches, and then stringifying those ID results into a WHERE clause for a second-stage SQL query. This seems like a reasonable way to go because it is not clear how you would write a single JOIN that would harmonize queries that operate on flexible and fixed attributes. I don't yet 100% understand the machinery around how we enumerate the fields involved in a query, but I'll keep diving into that.

Here's one somewhat low-level question: could we do this with a nested select instead of two SQL queries? SQLite supports expressions like item.id IN (SELECT ...). I haven't used this "subquery" functionality much so I don't know if it fits the bill, but it would be cool if we could avoid fetching all the IDs, stringifying them, and then feeding them back into the second-stage query syntax.

sampsyo · 2023-04-09T17:14:55Z

beets/dbcore/db.py

+        if table == "items":
+            joins.append(join_tmpl.format("album_attributes", "items.album_id"))


[longer-term comment] I know this is a tall order, but it would be nice someday if dbcore didn't need to know anything about beets models. (It's meant to be a mostly-generic database abstraction layer.) This may be unavoidable, at least in some form, for dealing with the "item-to-album fallback." But I just wanted to put that lofty goal out here.

Hehe, I like your thinking! @wisp3rwind made the same note here: #4362 (comment) :)

Luckily in this design that's an easy adjustment, see 5690946

sampsyo · 2023-04-09T17:18:02Z

beets/dbcore/db.py

@@ -670,26 +654,22 @@ class Results:
    constructs LibModel objects that reflect database rows.
    """

-    def __init__(self, model_class, rows, db, flex_rows,
-                 query=None, sort=None):
+    def __init__(self, model_class, rows, db, flex_rows, sort=None):


This is very much a strategic thing about backwards compatibility… but what would you think about leaving the "slow query" infrastructure in place here as a kind of fallback? In particular, I think it is important to have in case plugins provide slow queries that cannot be implemented in the new style. We can leave this functionality here and just gradually reduce our reliance on it—and someday in the future remove it after it's truly been proven worthless.

This kind of fallback could also support an incremental approach to your current work. For instance, if querying multiple flexattrs turns out to be kinda complicated, we can leave those as "slow" queries and they will at least keep working (and their performance will be no worse than before). Then we can address them in a future iteration.

That's a very good point. Indeed, we shouldn't make a hard switch - hopefully this might also take care of the lslimit testing failures, so that this PR can be merged in smoothly before we get back to #4731. Still getting used to maintaining public APIs :)

I think it is important to have in case plugins provide slow queries that cannot be implemented in the new style.

I reckon it should eventually be replaced entirely - I think it will be easier to maintain it when there's just a single way of doing this?

I wouldn't be against having this as a constraint for plugin developers - they may try and implement the filter in SQL; if their query is really only possible in Python, we could provide them with the add_db_function (or equivalent) that we discussed the other day as a fallback? Of course, we'd have to figure out the design before we replace it completely.

This should ensure that all queries are fast in the future, methinks 🙂 What do you think about deprecating this logic, and making this clear in the changelog?

snejus · 2023-04-10T10:05:54Z

Neato!! This is very exciting; I'm really glad we (as a project) are finally getting around to getting serious about making flexattr queries fast. When we built flexattr support long ago, I was pretty happy with the "slow query" mechanism as a way to make functionality progress without worrying about performance, but I always knew we'd want to come back and make them actually work in SQL through some combination of query cleverness and schema updates. We're doing that at last!

Thanks to your blog post which gave some nice pointers about what needs improving here! I do completely agree about what does not need improving here :)

Here's one somewhat low-level question: could we do this with a nested select instead of two SQL queries? SQLite supports expressions like item.id IN (SELECT ...). I haven't used this "subquery" functionality much so I don't know if it fits the bill, but it would be cool if we could avoid fetching all the IDs, stringifying them, and then feeding them back into the second-stage query syntax.

It may be possible - this would also offer an additional performance gain. I'll have a look how does it play together with having to query two flexible attribute tables! I guess those can be UNIONed. :)

snejus · 2023-04-11T13:11:33Z

@sampsyo see the new commits: aggregation ended up significantly simplifying the machinery behind the new queries - this now uses just one query.

We can now query multiple flex attributes!

The drawback is that we're losing the ability to query related flexible attributes - for example, the simplicity here does not allow beet list -a title:hello or beet list artpath:art_path. Though, I think for now this should be OK - we'd in any case need to have a wider discussion about how queries like beet list -a play_count:10 should work - do we here check play_count for every item, or do we sum/aggregate them?

There are some tests failing that have to do with the types - I'll have a deeper look at them later today/tomorrow.

sampsyo · 2023-04-18T00:55:02Z

That is awesome!!!!

Regarding the issue of "related" queries, it seems to me like these run in opposite directions:

beet list -a title:hello: Querying albums based on item metadata. I actually would have guessed that this currently would not work, unless there was an album out there with a flexattr called title (which would be weird). But maybe I am misremembering what we support here?
beet list artpath:art_path: Querying items based on album metadata. This actually seems more common, and we added support for doing stuff like beet list albumartist:Beatles in Add fallback for item access to album's attributes #2988. It would be great to avoid breaking this, although Modifying album fields propagate to items? #4742 (comment) also points toward a larger change that would obviate this kind of "item-to-album fallback."

snejus · 2023-04-18T08:23:09Z

@sampsyo I apologize - I used completely wrong examples in my previous comment 🤦🏽‍♂️ see the clarification below

The drawback is that we're losing the ability to query related flexible attributes - for example, the simplicity here does not allow ~~beet list -a title:hello~~ beet list -a item_flex_attr:hello or ~~beet list artpath:art_path~~ beet list album_flex_attr:hello.

In summary,

✅ related model field lookups are supported
- ✅ beet list -a title:hello
- ✅ beet list artpath:hello
⛔ related flex attr lookups are not supported
- ⛔ beet list -a item_flex_attr:hello
- ⛔ beet list album_flex_attr:hello

snejus · 2023-04-18T10:08:46Z

Just had a couple of runs comparing master to this branch with various queries, see the results below

Previously, the more filters were provided on the command line, the longer would the query take. Now, the behavior is the opposite - every argument either speeds up the query or keeps it the same.
You can see master ran -a title:a very quickly - that's because it didn't return anything.

The new queries have a more or less constant speed; the command speed is largely determined by the number of things that need printing. See the table below for the summary of timings and the numbers of matched results. You can see that albums query tends to be slightly quicker than items.

stale · 2023-09-17T01:10:27Z

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2024-03-17T13:13:40Z

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

snejus · 2024-03-19T03:21:19Z

@sampsyo would you still want this to be merged? I could try and rebase it off master

arsaboo · 2024-03-19T11:27:52Z

Improving flexible field queries would be huge... right now they are painfully slow for a large library.

This will be help with testing each of the documents which do not any more depend on the 'global' `current_app` and `request`. These two can now be provided at the time the objects are instantiated.

We need this knowledge since the fields will tell whether we need to join any related tables.

Use `json_group_object` SQLite function to aggregate flexible attributes into `flex_attrs` field. Register SQLite converter `json.loads` to automatically convert the JSON string to a Python dictionary. Remove the code that had this task previously. Add json1 extension to ensure JSON support for users that may be using an ancient SQLite version.

For a flexible attribute query, replace the `col_name` property with a function call that extracts that attribute from the `field_attrs` field introduced in the earlier commit. Additionally, for boolean, numeric and date queries CAST the value to NUMERIC SQLite affinity to ensure that our queries like 'flex:1..5' and 'flex:true' continue working fine. This removes the concept of 'slow query', since every query for any field now has an SQL clause.

It seems like previously filtering by flexible attributes did not work - I'd receive '{"data": []}' trying to GET `/aura/tracks?filter[play_count]=11` Now this works, not only for tracks, but for `/aura/artists` and `/aura/albums` too. Additionally, this improves `/aura/tracks` response time significantly. I tried loading the default of 500 tracks from my library: On `master`, it took ~20s After this commit, it takes under 1s.

snejus · 2024-05-09T11:37:26Z

Superseded by #5240

In #4746 I was making a small adjustment in beetsplug/aura.py and found that the module wasn't tested. So this PR adds some high-level tests to act a safeguard for any future adjustments.

Another and (hopefully) final attempt to improve querying speed. Fixes #4360 Fixes #3515 and possibly more issues to do with slow queries. This PR supersedes #4746. ## What's been done The `album` and `item` tables are joined, and corresponding data from `item_attributes` and `album_attributes` is merged and made available for filtering. This enables to achieve the following: - [x] Faster album path queries, `beet list -a path::some/path` - [x] Faster flexible attributes queries, both albums and tracks, `beet list play_count:10` - [x] (New) Ability to filter albums with track-level (and vice-versa) **db** field queries, `beet list -a title:something` - [x] (New) Ability to filter tracks with album-level **flexible** field queries, `beet list artpath:cover` - [x] (New) Ability to filter albums with track-level **flexible** field queries, `beet list -a art_source:something` ## Benchmarks ![image](https://github.com/beetbox/beets/assets/16212750/3136bc23-33ea-4f07-b93b-ff768b79fb0d) You can see that now querying speed is more or less constant regardless of the query, and the speed is mostly influenced by how many results need to be printed out ![image](https://github.com/beetbox/beets/assets/16212750/bfb037fb-b8d3-46ca-9c30-f851d2af1887) Compare this with what we had previously ![image](https://github.com/beetbox/beets/assets/16212750/38108879-a404-4cdf-90cc-5563e4a75f9f) ## Later #5318 #5319

snejus mentioned this pull request Apr 8, 2023

Add -l flag to list command to limit query results #4731

Closed

3 tasks

snejus force-pushed the only-fast-filtering branch from 56c79b7 to 319b430 Compare April 8, 2023 19:12

sampsyo reviewed Apr 9, 2023

View reviewed changes

snejus force-pushed the only-fast-filtering branch from 319b430 to fdb0f59 Compare April 9, 2023 20:30

snejus force-pushed the only-fast-filtering branch from aa03146 to 88c3e14 Compare April 11, 2023 13:06

snejus requested a review from sampsyo April 11, 2023 13:12

snejus force-pushed the only-fast-filtering branch 6 times, most recently from 9864a1e to 8eaee99 Compare April 17, 2023 06:56

snejus force-pushed the only-fast-filtering branch 3 times, most recently from 24b8af9 to 878aa07 Compare April 19, 2023 01:31

stale bot added the stale label Sep 17, 2023

sampsyo removed the stale label Sep 21, 2023

stale bot added the stale label Mar 17, 2024

stale bot removed the stale label Mar 19, 2024

snejus force-pushed the only-fast-filtering branch 2 times, most recently from 17838bc to 3e0e639 Compare May 3, 2024 22:25

snejus force-pushed the only-fast-filtering branch from 3e0e639 to 698bb1f Compare May 3, 2024 22:28

snejus marked this pull request as draft May 3, 2024 23:06

snejus mentioned this pull request May 3, 2024

Querying improvements / removing 'slow' queries #4362

Closed

10 tasks

snejus force-pushed the only-fast-filtering branch 2 times, most recently from 0ba83c4 to d432b50 Compare May 5, 2024 17:19

snejus mentioned this pull request May 7, 2024

Test aura plugin #5233

Closed

snejus force-pushed the only-fast-filtering branch from d432b50 to f49c2f4 Compare May 7, 2024 17:15

Feed in app context and args into Document to allow testing

6ee7346

This will be help with testing each of the documents which do not any more depend on the 'global' `current_app` and `request`. These two can now be provided at the time the objects are instantiated.

snejus force-pushed the only-fast-filtering branch from f49c2f4 to ef4a4b5 Compare May 7, 2024 19:18

snejus added 11 commits May 7, 2024 21:14

Test fetching each of the documents

c1518db

Dedupe get_attribute_converter

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23
Expired

Verified
Learn about vigilant mode

e9ce625

Make every Query return info about queried fields

8ad0bdc

We need this knowledge since the fields will tell whether we need to join any related tables.

Make LazyClassProperty / cached_classproperty reusable

a619321

Add support for filtering relations

347f92c

Add ability to debug queries

eebacfa

Fix querying fields present in both tables

d3ae74a

Enable querying related flexible attributes

796a86b

snejus force-pushed the only-fast-filtering branch from ef4a4b5 to d88b3f2 Compare May 7, 2024 20:29

This was referenced May 9, 2024

Test aura #5239

Merged

Make queries fast, filter all flexible attributes #5240

Merged

snejus closed this May 9, 2024

snejus added a commit that referenced this pull request Jun 10, 2024

Test aura (#5239)

0966e3c

In #4746 I was making a small adjustment in beetsplug/aura.py and found that the module wasn't tested. So this PR adds some high-level tests to act a safeguard for any future adjustments.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use SQL to query flex fields, and related Album/Item data #4746

Use SQL to query flex fields, and related Album/Item data #4746

snejus commented Apr 8, 2023 •

edited

Loading

sampsyo left a comment

sampsyo Apr 9, 2023

snejus Apr 17, 2023

sampsyo Apr 9, 2023

snejus Apr 17, 2023

snejus commented Apr 10, 2023

snejus commented Apr 11, 2023 •

edited

Loading

sampsyo commented Apr 18, 2023

snejus commented Apr 18, 2023 •

edited

Loading

snejus commented Apr 18, 2023 •

edited

Loading

stale bot commented Sep 17, 2023

stale bot commented Mar 17, 2024

snejus commented Mar 19, 2024 •

edited

Loading

arsaboo commented Mar 19, 2024

snejus commented May 9, 2024

		if table == "items":
		joins.append(join_tmpl.format("album_attributes", "items.album_id"))

`items.id`	`item_attributes.key`	`item_attributes.value`
1	`skip_count`	10
1	`play_count`	5

Use SQL to query flex fields, and related Album/Item data #4746

Use SQL to query flex fields, and related Album/Item data #4746

Conversation

snejus commented Apr 8, 2023 • edited Loading

Description

To Do

sampsyo left a comment

Choose a reason for hiding this comment

sampsyo Apr 9, 2023

Choose a reason for hiding this comment

snejus Apr 17, 2023

Choose a reason for hiding this comment

sampsyo Apr 9, 2023

Choose a reason for hiding this comment

snejus Apr 17, 2023

Choose a reason for hiding this comment

snejus commented Apr 10, 2023

snejus commented Apr 11, 2023 • edited Loading

sampsyo commented Apr 18, 2023

snejus commented Apr 18, 2023 • edited Loading

snejus commented Apr 18, 2023 • edited Loading

stale bot commented Sep 17, 2023

stale bot commented Mar 17, 2024

snejus commented Mar 19, 2024 • edited Loading

arsaboo commented Mar 19, 2024

snejus commented May 9, 2024

snejus commented Apr 8, 2023 •

edited

Loading

snejus commented Apr 11, 2023 •

edited

Loading

snejus commented Apr 18, 2023 •

edited

Loading

snejus commented Apr 18, 2023 •

edited

Loading

snejus commented Mar 19, 2024 •

edited

Loading