-
-
Notifications
You must be signed in to change notification settings - Fork 889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tag posts and comments with language (fixes #440) #2269
Conversation
36f7e96
to
12295cd
Compare
crates/db_schema/Cargo.toml
Outdated
@@ -22,6 +22,7 @@ serde = { version = "1.0.136", features = ["derive"] } | |||
url = { version = "2.2.2", features = ["serde"] } | |||
strum = "0.24.0" | |||
strum_macros = "0.24.0" | |||
iso639-1 = { git = "https://github.com/Nutomic/iso639.git", branch = "add-strum", features = ["strum"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only works with cargo 1.60 seems like the dash in crate name is causing trouble. Thats why ci is failling :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main issues for me is how joins to the local_user_discussion_language
table can work correctly.
I think its as simple as :
select * from post p
left join local_user lu on lu.person_id = person_id_join
left join local_user_discussion_language ludl
on ludl.local_user_id = lu.id and
(ludl.lang = 'all' or p.lang = 'und' or ludl.lang = p.lang)
...
I can handle all the lemmy-UI stuff BTW
Another concern I have, is that to me, language is something that we'd likely only see or care about at high levels first. IE most likely instances will be centered on one language. After that, the next likelihood is that communities will be in one language. After that, comes posts, then comments.
@@ -0,0 +1,4 @@ | |||
alter table local_user rename column lang to interface_language; | |||
alter table local_user add column discussion_languages varchar(3)[] not null default '{}'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never do lists as a single column in SQL, this needs to be a separate joinable table: local_user_discussion_language
. (id, local_user_id, lang) unique (local_user_id, lang)
That's crucial since we'll have to join to that table.
Also why is is varchar(3), shouldn't it be 2 characters? Is that for the und
one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think a table like local_user_discussion_language(id, local_user_id, lang)
makes sense, as we are dealing with a fixed list of languages, so it would just waste space. Instead i would make a table like all_discussion_languages(id, lang)
, and populate that with a code migration.
Yes, und
stands for "undetermined language", and is part of ISO_639-2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this more, we should really have a language
table: (id, language_code), and both the local_user_discussion_language, and the post table, should reference the lang_id
, so that joins are more optimized.
Then we can do proper foreign-key constraints, and the joins should be fast.
crates/api_common/src/post.rs
Outdated
@@ -17,6 +17,7 @@ pub struct CreatePost { | |||
pub body: Option<String>, | |||
pub honeypot: Option<String>, | |||
pub nsfw: Option<bool>, | |||
pub language: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is only for posts, not communities or comments?
crates/api_crud/src/user/create.rs
Outdated
@@ -149,6 +149,7 @@ impl PerformCrud for Register { | |||
password_encrypted: Some(data.password.to_string()), | |||
show_nsfw: Some(data.show_nsfw), | |||
email_verified: Some(false), | |||
discussion_languages: Some(LanguageIdentifier::all_languages()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the und
tag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, new users start with all languages enabled. So they will see content in all languages, and have to change the setting to disable languages they dont understand. I will can a param in Register
for this, so lemmy-ui can initialize it based on browser language.
@@ -58,6 +59,7 @@ mod safe_settings_type { | |||
show_new_post_notifs, | |||
email_verified, | |||
accepted_application, | |||
discussion_languages, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will have to be moved to the view, which can do the join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how this is compiling, because discussion_languages isn't a column, its a table.
crates/db_schema/src/schema.rs
Outdated
@@ -162,6 +162,7 @@ table! { | |||
show_new_post_notifs -> Bool, | |||
email_verified -> Bool, | |||
accepted_application -> Bool, | |||
discussion_languages -> Array<Text>, | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL rule #1: never use arrays in a column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think we should blindly follow such a rule. In this case an array makes a lot of sense, because the items are tightly coupled with the local_user table, and wont have relations to any other tables. Also there is no reason to access the elements one by one, but always together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above. You have to use an intermediary table to avoid the eq_in query, which it isn't possible to index, and are always going to be slow.
In general you should never use arrays in columns in SQL (goes against all the original intentions of normalized data vectors). The one case I could see using them, is if you aren't using them to join anything, but that isn't the case here.
crates/db_views/src/post_view.rs
Outdated
@@ -260,6 +242,17 @@ impl<'a> PostQueryBuilder<'a> { | |||
self | |||
} | |||
|
|||
pub fn set_params_for_user(mut self, user: &Option<LocalUserView>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this much better.
crates/db_views/src/post_view.rs
Outdated
read_post_listings_with_person_after_block.show_bot_accounts = Some(true); | ||
read_post_listings_with_person_after_block.my_person_id = Some(inserted_person.id); | ||
let read_post_listings_with_person_after_block = | ||
read_post_listings_with_person_after_block.list().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add all the language stuff to these tests also. IE, have a user's preferences be All, one lang, or several langs, and make sure the queries return the correct results.
Another thing: maybe language preferences should really only count for GetPosts
, not Search
or GetPost
, so that you can still search for posts in other languages, in which case we might need a flag on the PostQueryBuilder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is extremely long and confusing, i will see if i can figure that out.
GetPost
doesnt use PostQueryBuilder
. And the same point about Search
could be made regarding other params like show_read_posts, which is also not supported. It would have to be a separate PR.
This is really only the first part of the language tags implementation, in order to figure out how things can work. Next step will be comments (should be very easy, more or less copypaste of this pr). After that, i will add language settings for community/site (so that you can only post there in allowed languages). Technically, none of that should require breaking changes. Nonetheless, it can change the way Lemmy works a lot, so i would release all of it together as 0.17.0, to make sure things work well together. Edit: btw you havent commented how you would like the list of languages to be returned by the api, either as part of /api/v3/site, or as a new endpoint. This list is necessary for the dropdown in user settings. I dont want to implement it one way and then redo everything. Also, it would probably only return language identifiers (en, fr), not human readable language names as the iso639-1 crate doesnt have them (but could be added via pr). |
The list of possible languages should probably come back with GetSite. |
Sorry, I'd def prefer the languages come back as part of /api/v3/site . |
28c24e9
to
8973d21
Compare
I decided to change this and not rely on an external crate for the language list. Instead i'm storing it in the database. This will make it easier to change languages later, without having to make pull requests for another crate, or adding extra logic in our code.
Okay will do. |
60cc67a
to
6ca1630
Compare
@dessalines This is worth a review now, as it seems to be feature complete (except for comment language, which i will add a bit later). For now there are still some things i am unhappy about. First, post has a field containing the LanguageId, which makes a lot of sense for the database. But that id is also exposed over the api, which seems like a bad idea. It should have the language code (en) instead. Similar for local_user.discussion_languages, which also contains the database id and not the language code. Then there is of course the problem of array fields in the database. I dont think performance is a big concern for those, because i limited the array size to 5 elements. I could of course add a table for |
347bc44
to
64ba6f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, post has a field containing the LanguageId, which makes a lot of sense for the database. But that id is also exposed over the api, which seems like a bad idea. It should have the language code (en) instead. Similar for local_user.discussion_languages, which also contains the database id and not the language code.
In the views, do an inner join to the language
table, to get back their codes. Now that it'll be a separate table tho, that's pry not necessary.
The main thing is that we really do need to have a separate table, and do a proper join to it, and maybe even add an index for language_id
on the post table, in order to make that join fast.
); | ||
|
||
alter table local_user rename column lang to interface_language; | ||
alter table local_user add column discussion_languages integer[] not null default '{}'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still the main issue: you need to add a local_user_language
table, to make all the post queries joinable, rather than using eq_in
)
(id, local_user_id, language_id)
@@ -73,4 +75,5 @@ pub struct LocalUserSettings { | |||
pub show_new_post_notifs: bool, | |||
pub email_verified: bool, | |||
pub accepted_application: bool, | |||
pub discussion_languages: Vec<LanguageId>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Into its own table.
crates/db_schema/src/schema.rs
Outdated
@@ -361,6 +362,7 @@ table! { | |||
thumbnail_url -> Nullable<Text>, | |||
ap_id -> Varchar, | |||
local -> Bool, | |||
language -> Int4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
language_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
alter table local_user rename column lang to interface_language; | ||
alter table local_user add column discussion_languages integer[] not null default '{}'; | ||
|
||
alter table post add column language integer not null default 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
language_id , and this needs to reference / constraint to the language table.
@@ -117,6 +119,25 @@ impl Perform for SaveUserSettings { | |||
}) | |||
.await? | |||
.map_err(|e| LemmyError::from_error_message(e, "user_already_exists"))?; | |||
let discussion_languages: Option<Vec<LanguageId>> = | |||
if let Some(discussion_languages) = data.discussion_languages.clone() { | |||
if discussion_languages.len() > 5 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't worry about limiting the number.
Also, since it might be necessary to do an inner_join when doing the post query, you might need to add all languages to local_user_language
, if none is given, so as to return all possible results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the limit. The loop with Language::read_id_from_code()
can probably also be removed, but i havent done that yet.
@dessalines Okay i added the local_user_language table and made the other changes as requested. Main thing left is the join/filter in post_view.rs, because i never really learned how joins work. Confusing error messages from diesel dont help either. I think it would be easiest if you could push a commit to get that working directly to my branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we get to the postquerybuilder stuff ( which might get a little messy ), this still needs to have a LocalUserDiscussionLanguageView
, that comes back with MyUserInfo
. It needs to join to Language and LocalUser.
Check out admin_purge_community_view
for a really simple example of how the joins and tuples work. You can copy paste that and tweak the tables.
pm me if you need any help there, or run into any issues.
Some(language_ids) | ||
} else { | ||
None | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be inserting / upserting into the local_user_language
table, which is still missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, as weird as it sounds, the API should probably take in a vector of language_id
:
- Its super easy to get those, since they will come back with GetSite.
- The
local_user_language
only needs (local_user_id, language_id), so you can do upserts there without have to do a pointless select from the language table.
crates/api_crud/src/post/update.rs
Outdated
let language = blocking(context.pool(), move |conn| { | ||
Language::read_id_from_code_opt(conn, language) | ||
}) | ||
.await??; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
crates/db_views/src/comment_view.rs
Outdated
recipient: a.3, | ||
post: a.4, | ||
community: a.5, | ||
counts: a.6, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dessalines You should use into_iter whenever possible, as that will take the value instead of reference, and allow you to avoid all these unnecessary clones. I already fixed this in all the db views.
Its also pretty annoying that this conversion method has to be copy-pasted all over the place, but there seems to be no way to avoid it as diesel can only load these queries into tuples and not structs.
2722142
to
d64c517
Compare
Okay i added LocalUserDiscussionLanguageView, using LocalUserSettings to avoid loading password_encrypted column. Now i'm not sure how to continue with PostView. I also made the api changes you mentioned, but anyway that can wait for later as the compiler cant check any of that as long as db code fails to compile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things, but also I'll pull this down to see what's wrong with the postquerybuilder
crates/db_views_actor/src/lib.rs
Outdated
@@ -9,6 +9,8 @@ pub mod community_person_ban_view; | |||
#[cfg(feature = "full")] | |||
pub mod community_view; | |||
#[cfg(feature = "full")] | |||
pub mod local_user_discusion_language_view; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two ss in discussion
create table local_user_language ( | ||
id serial primary key, | ||
local_user_id int references local_user on update cascade on delete cascade not null, | ||
language_id int references language_id on update cascade on delete cascade not null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be references language
, I'm guessing this migration fails.
crates/db_views_actor/src/structs.rs
Outdated
@@ -70,3 +72,9 @@ pub struct PersonViewSafe { | |||
pub person: PersonSafe, | |||
pub counts: PersonAggregates, | |||
} | |||
|
|||
#[derive(Debug, Serialize, Deserialize, Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in db_views/src/structs , since that's where the other local_user ones are.
language::all_columns, | ||
)) | ||
.filter(local_user::id.eq(local_user_id)) | ||
.load::<LocalUserDiscussionLanguageViewTuple>(conn)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, looks good.
let mut language_ids = vec![]; | ||
for l in discussion_languages { | ||
language_ids.push( | ||
blocking(context.pool(), move |conn| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading this from the DB is no longer necessary, but you'll need to do upserts to the local_user_language table based on the provided language_ids.
crates/api_common/src/post.rs
Outdated
@@ -17,6 +17,7 @@ pub struct CreatePost { | |||
pub body: Option<String>, | |||
pub honeypot: Option<String>, | |||
pub nsfw: Option<bool>, | |||
pub language: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can also be language_id
, since you have these language_ids in the front end.
crates/api_crud/src/post/create.rs
Outdated
@@ -91,6 +92,25 @@ impl PerformCrud for CreatePost { | |||
let (embed_title, embed_description, embed_video_url) = metadata_res | |||
.map(|u| (u.title, u.description, u.embed_video_url)) | |||
.unwrap_or_default(); | |||
let language = data.language.clone(); | |||
let mut language = blocking(context.pool(), move |conn| { | |||
Language::read_id_from_code_opt(conn, language) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once this is language_id, you won't need to read anything from the DB.
crates/db_views/src/post_view.rs
Outdated
@@ -170,6 +176,7 @@ pub struct PostQueryBuilder<'a> { | |||
show_bot_accounts: Option<bool>, | |||
show_read_posts: Option<bool>, | |||
saved_only: Option<bool>, | |||
languages: Option<Vec<LanguageId>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Languages isn't necessary, because you only need my_person_id
, which can join to the local_user_languages
table.
BTW don't worry about my comments above for now, I'm making a lot of these fixes in a separate branch. |
Okay merged the other pr, and just made a small change to update user languages in a single transaction, instead of various separate db calls. Should be ready to review/merge now. Edit: unless i add language tagging for comments to this pr as well. But its alreafy quite big so i would just make anothet one. |
No need to worry about tagging for comments. If you would tho, add some unit tests for several test cases:
|
You are right, a user with no languages configured currently gets no posts at all back. Any hint how to fix it? I also need to fix federation tests. Edit: looks like federation tests are broken for the same reason (user with no languages doesnt get any results from PostQueryBuilder). |
// Filter out the rows with missing languages | ||
if self.my_local_user_id.is_some() { | ||
query = query.filter(local_user_language::id.is_not_null()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the if body needs to be something like this:
query = query.filter(sql(
"TRUE GROUP BY (post.id, person.id, community.id, community_person_ban.id, post_aggregates.id, \
community_follower.id, post_saved.id, post_read.id, person_block.id, post_like.score, language.id,\
local_user_language.id, community_block.person_id) \
HAVING count(local_user_language.id) = 0 or local_user_language.id is not null"))
Main part is the having clause, if all local_user_language.id values are null, dont filter. Unfortunately it doesnt work correctly, even if a user has one language configured, it returns posts for all languages. The group by is necessary because otherwise postgres throws errors like column "community_block.person_id" must appear in the GROUP BY clause or be used in an aggregate function
. And having/group by are only implemented in diesel 2.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a look this weekend and try to get it working.
* Some fixes for all / missing user languages. * Adding back in transaction.
I fixed the conflicts, was quite complicated. |
.unwrap(); | ||
|
||
// only one french language post should be returned | ||
assert_eq!(1, post_listing_french.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This keeps failing after rebase and i cant figure out why. Already checked that the user languages are set correctly to french, and that the db queries are unchanged. It just keeps returning all 4 posts here, seemingly ignoring the user languages for no reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, forgot to pass local_user_id. Maybe later we should change it so the builder takes LocalUserView, so we dont have to pass all these things separately, and easily forget some.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. We'll need to performance test a lot of these recent changes before we do the next release also.
I'll start on the front end for this once I get back from vacation in a week or so.
Here is what i have done so far:
This still needs to be done:
Note that this only adds language tags for posts. Once this is working, it will be easy to add them for comments as well.