Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SQL to query flex fields, and related Album/Item data #4746

Closed
wants to merge 12 commits into from

Conversation

snejus
Copy link
Member

@snejus snejus commented Apr 8, 2023

Description

This PR depends on #4741 and #4744 which convert some queries to be SQL-compatible.

In short, it enables filtering flexible attributes and related items or album
through SQL. This speeds up flexible attributes lookup and enables queries like

beet list album_flex:something  # list all items under the albums that have album_flex:something
beet list -a item_flex:something  # list all albums which have at least one item with item_flex:something

This is largely the same logic that is behind #4362. I realized that the original PR ended
up too complicated with too much of not-so-strictly-required refactorings which makes it hard to review. This PR skips the noise and contains only the bits that are required.


Note: I will be adding basic tests for the new behaviour within the next days. It'd be good to hear what sort of expectations people have here - happy to test them too!

Right before submitting the PR I stumbled upon a limitation: a query with more than one flexible attribute filter gives empty results.

  • Completely forgot to take this into account - the way items and, say
    item_attributes are currently joined yields a row for each items and
    item_attributes combination, something like (simplified)

    items.id item_attributes.key item_attributes.value
    1 skip_count 10
    1 play_count 5

    When we try to filter using a single flexible attribute, one of these rows will match (if found, of course) so it works as expected.

    If we try to filter by two, we obviously fail. This can be tackled by (1) aggregating the attribtues or (2) performing EXISTS check within a subquery. Will test whichever is more performant and will add it in the next commit.

To Do

  • Documentation. (If you've add a new command-line flag, for example, find the appropriate page under docs/ to describe it.)
  • Changelog. (Add an entry to docs/changelog.rst near the top of the document.)
  • Tests. (Encouraged but not strictly required.)

Copy link
Member

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neato!! This is very exciting; I'm really glad we (as a project) are finally getting around to getting serious about making flexattr queries fast. When we built flexattr support long ago, I was pretty happy with the "slow query" mechanism as a way to make functionality progress without worrying about performance, but I always knew we'd want to come back and make them actually work in SQL through some combination of query cleverness and schema updates. We're doing that at last!

To summarize the current strategy: we're doing a two-phase SQL query. First, use a LEFT JOIN on the flexattr table to get IDs for potential matches, and then stringifying those ID results into a WHERE clause for a second-stage SQL query. This seems like a reasonable way to go because it is not clear how you would write a single JOIN that would harmonize queries that operate on flexible and fixed attributes. I don't yet 100% understand the machinery around how we enumerate the fields involved in a query, but I'll keep diving into that.

Here's one somewhat low-level question: could we do this with a nested select instead of two SQL queries? SQLite supports expressions like item.id IN (SELECT ...). I haven't used this "subquery" functionality much so I don't know if it fits the bill, but it would be cool if we could avoid fetching all the IDs, stringifying them, and then feeding them back into the second-stage query syntax.

Comment on lines 1080 to 1088
if table == "items":
joins.append(join_tmpl.format("album_attributes", "items.album_id"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[longer-term comment] I know this is a tall order, but it would be nice someday if dbcore didn't need to know anything about beets models. (It's meant to be a mostly-generic database abstraction layer.) This may be unavoidable, at least in some form, for dealing with the "item-to-album fallback." But I just wanted to put that lofty goal out here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, I like your thinking! @wisp3rwind made the same note here: #4362 (comment) :)

Luckily in this design that's an easy adjustment, see 5690946

@@ -670,26 +654,22 @@ class Results:
constructs LibModel objects that reflect database rows.
"""

def __init__(self, model_class, rows, db, flex_rows,
query=None, sort=None):
def __init__(self, model_class, rows, db, flex_rows, sort=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very much a strategic thing about backwards compatibility… but what would you think about leaving the "slow query" infrastructure in place here as a kind of fallback? In particular, I think it is important to have in case plugins provide slow queries that cannot be implemented in the new style. We can leave this functionality here and just gradually reduce our reliance on it—and someday in the future remove it after it's truly been proven worthless.

This kind of fallback could also support an incremental approach to your current work. For instance, if querying multiple flexattrs turns out to be kinda complicated, we can leave those as "slow" queries and they will at least keep working (and their performance will be no worse than before). Then we can address them in a future iteration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good point. Indeed, we shouldn't make a hard switch - hopefully this might also take care of the lslimit testing failures, so that this PR can be merged in smoothly before we get back to #4731. Still getting used to maintaining public APIs :)

I think it is important to have in case plugins provide slow queries that cannot be implemented in the new style.

I reckon it should eventually be replaced entirely - I think it will be easier to maintain it when there's just a single way of doing this?

I wouldn't be against having this as a constraint for plugin developers - they may try and implement the filter in SQL; if their query is really only possible in Python, we could provide them with the add_db_function (or equivalent) that we discussed the other day as a fallback? Of course, we'd have to figure out the design before we replace it completely.

This should ensure that all queries are fast in the future, methinks 🙂 What do you think about deprecating this logic, and making this clear in the changelog?

@snejus snejus force-pushed the only-fast-filtering branch from 319b430 to fdb0f59 Compare April 9, 2023 20:30
@snejus
Copy link
Member Author

snejus commented Apr 10, 2023

Neato!! This is very exciting; I'm really glad we (as a project) are finally getting around to getting serious about making flexattr queries fast. When we built flexattr support long ago, I was pretty happy with the "slow query" mechanism as a way to make functionality progress without worrying about performance, but I always knew we'd want to come back and make them actually work in SQL through some combination of query cleverness and schema updates. We're doing that at last!

Thanks to your blog post which gave some nice pointers about what needs improving here! I do completely agree about what does not need improving here :)

Here's one somewhat low-level question: could we do this with a nested select instead of two SQL queries? SQLite supports expressions like item.id IN (SELECT ...). I haven't used this "subquery" functionality much so I don't know if it fits the bill, but it would be cool if we could avoid fetching all the IDs, stringifying them, and then feeding them back into the second-stage query syntax.

It may be possible - this would also offer an additional performance gain. I'll have a look how does it play together with having to query two flexible attribute tables! I guess those can be UNIONed. :)

@snejus snejus force-pushed the only-fast-filtering branch from aa03146 to 88c3e14 Compare April 11, 2023 13:06
@snejus
Copy link
Member Author

snejus commented Apr 11, 2023

@sampsyo see the new commits: aggregation ended up significantly simplifying the machinery behind the new queries - this now uses just one query.

We can now query multiple flex attributes!

The drawback is that we're losing the ability to query related flexible attributes - for example, the simplicity here does not allow beet list -a title:hello or beet list artpath:art_path. Though, I think for now this should be OK - we'd in any case need to have a wider discussion about how queries like beet list -a play_count:10 should work - do we here check play_count for every item, or do we sum/aggregate them?

There are some tests failing that have to do with the types - I'll have a deeper look at them later today/tomorrow.

@snejus snejus requested a review from sampsyo April 11, 2023 13:12
@snejus snejus force-pushed the only-fast-filtering branch 6 times, most recently from 9864a1e to 8eaee99 Compare April 17, 2023 06:56
@sampsyo
Copy link
Member

sampsyo commented Apr 18, 2023

That is awesome!!!!

Regarding the issue of "related" queries, it seems to me like these run in opposite directions:

  • beet list -a title:hello: Querying albums based on item metadata. I actually would have guessed that this currently would not work, unless there was an album out there with a flexattr called title (which would be weird). But maybe I am misremembering what we support here?
  • beet list artpath:art_path: Querying items based on album metadata. This actually seems more common, and we added support for doing stuff like beet list albumartist:Beatles in Add fallback for item access to album's attributes #2988. It would be great to avoid breaking this, although Modifying album fields propagate to items? #4742 (comment) also points toward a larger change that would obviate this kind of "item-to-album fallback."

@snejus
Copy link
Member Author

snejus commented Apr 18, 2023

@sampsyo I apologize - I used completely wrong examples in my previous comment 🤦🏽‍♂️ see the clarification below

The drawback is that we're losing the ability to query related flexible attributes - for example, the simplicity here does not allow beet list -a title:hello beet list -a item_flex_attr:hello or beet list artpath:art_path beet list album_flex_attr:hello.

In summary,

  • ✅ related model field lookups are supported
    • beet list -a title:hello
    • beet list artpath:hello
  • ⛔ related flex attr lookups are not supported
    • beet list -a item_flex_attr:hello
    • beet list album_flex_attr:hello

@snejus
Copy link
Member Author

snejus commented Apr 18, 2023

Just had a couple of runs comparing master to this branch with various queries, see the results below

image

  • Previously, the more filters were provided on the command line, the longer would the query take. Now, the behavior is the opposite - every argument either speeds up the query or keeps it the same.
  • You can see master ran -a title:a very quickly - that's because it didn't return anything.

The new queries have a more or less constant speed; the command speed is largely determined by the number of things that need printing. See the table below for the summary of timings and the numbers of matched results. You can see that albums query tends to be slightly quicker than items.

image

@snejus snejus force-pushed the only-fast-filtering branch 3 times, most recently from 24b8af9 to 878aa07 Compare April 19, 2023 01:31
@stale
Copy link

stale bot commented Sep 17, 2023

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 17, 2023
@sampsyo sampsyo removed the stale label Sep 21, 2023
Copy link

stale bot commented Mar 17, 2024

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 17, 2024
@snejus
Copy link
Member Author

snejus commented Mar 19, 2024

@sampsyo would you still want this to be merged? I could try and rebase it off master

@arsaboo
Copy link
Contributor

arsaboo commented Mar 19, 2024

Improving flexible field queries would be huge... right now they are painfully slow for a large library.

@stale stale bot removed the stale label Mar 19, 2024
@snejus snejus force-pushed the only-fast-filtering branch 2 times, most recently from 17838bc to 3e0e639 Compare May 3, 2024 22:25
@snejus snejus force-pushed the only-fast-filtering branch from 3e0e639 to 698bb1f Compare May 3, 2024 22:28
@snejus snejus marked this pull request as draft May 3, 2024 23:06
@snejus snejus force-pushed the only-fast-filtering branch 2 times, most recently from 0ba83c4 to d432b50 Compare May 5, 2024 17:19
@snejus snejus mentioned this pull request May 7, 2024
@snejus snejus force-pushed the only-fast-filtering branch from d432b50 to f49c2f4 Compare May 7, 2024 17:15
This will be help with testing each of the documents which do not
any more depend on the 'global' `current_app` and `request`. These two
can now be provided at the time the objects are instantiated.
@snejus snejus force-pushed the only-fast-filtering branch from f49c2f4 to ef4a4b5 Compare May 7, 2024 19:18
snejus added 11 commits May 7, 2024 21:14

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
We need this knowledge since the fields will tell whether we need to
join any related tables.
Use `json_group_object` SQLite function to aggregate flexible attributes
into `flex_attrs` field.

Register SQLite converter `json.loads` to automatically convert the JSON
string to a Python dictionary. Remove the code that had this task
previously.

Add json1 extension to ensure JSON support for users that may be using
an ancient SQLite version.
For a flexible attribute query, replace the `col_name` property with
a function call that extracts that attribute from the `field_attrs`
field introduced in the earlier commit.

Additionally, for boolean, numeric and date queries CAST the value to
NUMERIC SQLite affinity to ensure that our queries like 'flex:1..5' and
'flex:true' continue working fine.

This removes the concept of 'slow query', since every query for any
field now has an SQL clause.
It seems like previously filtering by flexible attributes did not work
- I'd receive '{"data": []}' trying to GET `/aura/tracks?filter[play_count]=11`

Now this works, not only for tracks, but for `/aura/artists` and
`/aura/albums` too.

Additionally, this improves `/aura/tracks` response time significantly.
I tried loading the default of 500 tracks from my library:

On `master`, it took ~20s
After this commit, it takes under 1s.
@snejus snejus force-pushed the only-fast-filtering branch from ef4a4b5 to d88b3f2 Compare May 7, 2024 20:29
@snejus
Copy link
Member Author

snejus commented May 9, 2024

Superseded by #5240

@snejus snejus closed this May 9, 2024
snejus added a commit that referenced this pull request Jun 10, 2024
In #4746 I was making a small adjustment in beetsplug/aura.py and found
that the module wasn't tested. So this PR adds some high-level tests to
act a safeguard for any future adjustments.
snejus added a commit that referenced this pull request Jun 17, 2024
Another and (hopefully) final attempt to improve querying speed.

Fixes #4360 
Fixes #3515
and possibly more issues to do with slow queries.

This PR supersedes #4746.

## What's been done
The `album` and `item` tables are joined, and corresponding data from
`item_attributes` and `album_attributes` is merged and made available
for filtering. This enables to achieve the following:

- [x] Faster album path queries, `beet list -a path::some/path` 
- [x] Faster flexible attributes queries, both albums and tracks, `beet
list play_count:10`
- [x] (New) Ability to filter albums with track-level (and vice-versa)
**db** field queries, `beet list -a title:something`
- [x] (New) Ability to filter tracks with album-level **flexible** field
queries, `beet list artpath:cover`
- [x] (New) Ability to filter albums with track-level **flexible** field
queries, `beet list -a art_source:something`

## Benchmarks

![image](https://github.com/beetbox/beets/assets/16212750/3136bc23-33ea-4f07-b93b-ff768b79fb0d)

You can see that now querying speed is more or less constant regardless
of the query, and the speed is mostly influenced by how many results
need to be printed out

![image](https://github.com/beetbox/beets/assets/16212750/bfb037fb-b8d3-46ca-9c30-f851d2af1887)

Compare this with what we had previously

![image](https://github.com/beetbox/beets/assets/16212750/38108879-a404-4cdf-90cc-5563e4a75f9f)

## Later
#5318
#5319
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants