Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tag posts and comments with language (fixes #440) #2269

Merged
merged 25 commits into from
Aug 18, 2022
Merged

Conversation

Nutomic
Copy link
Member

@Nutomic Nutomic commented May 17, 2022

Here is what i have done so far:

  • post now has language field (handled via federation and api)
  • renamed local_user.lang to interface_language
  • added field local_user.discussion_languages
    • initialized with all languages on signup, handled for user settings change
    • migration to set discussion_languages for existing users to all languages
  • use iso639 crate for language list
    • uses 2 character codes (en, fr, ru)
    • forked the crate to add strum for iterating over languages enum, and getting list of all languages
  • added support for "undetermined" language (used as default for posts)
  • change PostQueryBuilder to consider discussion_languages of current user
    • so that user will only see posts tagged with specified discussion_languages
    • also changed query builder to take LocalUserView, instead of taking params one by one
  • list of all languages is stored in database table (easier to make changes this way)
  • api/v3/site includes list of all languages

This still needs to be done:

  • lemmy-ui needs a user settings input for discussion language
    • should be similar to peertube
    • need to get a list of language names from somewhere
    • show a warning if "unknown" is disabled, as this will hide content from most platforms which dont support language tags
  • lemmy-ui needs a dropdown on post creation page, to indicate post language
    • if the user only has one discussion_language specified, this input can be hidden
  • test federation with lemmy/peertube

Note that this only adds language tags for posts. Once this is working, it will be easy to add them for comments as well.

@Nutomic Nutomic force-pushed the language-tagging branch 4 times, most recently from 36f7e96 to 12295cd Compare May 18, 2022 12:25
@@ -22,6 +22,7 @@ serde = { version = "1.0.136", features = ["derive"] }
url = { version = "2.2.2", features = ["serde"] }
strum = "0.24.0"
strum_macros = "0.24.0"
iso639-1 = { git = "https://github.com/Nutomic/iso639.git", branch = "add-strum", features = ["strum"] }
Copy link
Member Author

@Nutomic Nutomic May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only works with cargo 1.60 seems like the dash in crate name is causing trouble. Thats why ci is failling :/

Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issues for me is how joins to the local_user_discussion_language table can work correctly.

I think its as simple as :

select * from post p
left join local_user lu on lu.person_id = person_id_join
left join local_user_discussion_language ludl 
  on ludl.local_user_id = lu.id and 
  (ludl.lang = 'all' or p.lang = 'und' or ludl.lang = p.lang)
...

I can handle all the lemmy-UI stuff BTW

Another concern I have, is that to me, language is something that we'd likely only see or care about at high levels first. IE most likely instances will be centered on one language. After that, the next likelihood is that communities will be in one language. After that, comes posts, then comments.

@@ -0,0 +1,4 @@
alter table local_user rename column lang to interface_language;
alter table local_user add column discussion_languages varchar(3)[] not null default '{}';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never do lists as a single column in SQL, this needs to be a separate joinable table: local_user_discussion_language. (id, local_user_id, lang) unique (local_user_id, lang)

That's crucial since we'll have to join to that table.

Also why is is varchar(3), shouldn't it be 2 characters? Is that for the und one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think a table like local_user_discussion_language(id, local_user_id, lang) makes sense, as we are dealing with a fixed list of languages, so it would just waste space. Instead i would make a table like all_discussion_languages(id, lang), and populate that with a code migration.

Yes, und stands for "undetermined language", and is part of ISO_639-2.

Copy link
Member

@dessalines dessalines May 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, we should really have a language table: (id, language_code), and both the local_user_discussion_language, and the post table, should reference the lang_id, so that joins are more optimized.

Then we can do proper foreign-key constraints, and the joins should be fast.

@@ -17,6 +17,7 @@ pub struct CreatePost {
pub body: Option<String>,
pub honeypot: Option<String>,
pub nsfw: Option<bool>,
pub language: Option<String>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is only for posts, not communities or comments?

@@ -149,6 +149,7 @@ impl PerformCrud for Register {
password_encrypted: Some(data.password.to_string()),
show_nsfw: Some(data.show_nsfw),
email_verified: Some(false),
discussion_languages: Some(LanguageIdentifier::all_languages()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the und tag?

Copy link
Member Author

@Nutomic Nutomic May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, new users start with all languages enabled. So they will see content in all languages, and have to change the setting to disable languages they dont understand. I will can a param in Register for this, so lemmy-ui can initialize it based on browser language.

@@ -58,6 +59,7 @@ mod safe_settings_type {
show_new_post_notifs,
email_verified,
accepted_application,
discussion_languages,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will have to be moved to the view, which can do the join.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this is compiling, because discussion_languages isn't a column, its a table.

@@ -162,6 +162,7 @@ table! {
show_new_post_notifs -> Bool,
email_verified -> Bool,
accepted_application -> Bool,
discussion_languages -> Array<Text>,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL rule #1: never use arrays in a column.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think we should blindly follow such a rule. In this case an array makes a lot of sense, because the items are tightly coupled with the local_user table, and wont have relations to any other tables. Also there is no reason to access the elements one by one, but always together.

https://stackoverflow.com/q/20219503

https://stackoverflow.com/q/6843912

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above. You have to use an intermediary table to avoid the eq_in query, which it isn't possible to index, and are always going to be slow.

In general you should never use arrays in columns in SQL (goes against all the original intentions of normalized data vectors). The one case I could see using them, is if you aren't using them to join anything, but that isn't the case here.

crates/db_views/src/post_view.rs Show resolved Hide resolved
@@ -260,6 +242,17 @@ impl<'a> PostQueryBuilder<'a> {
self
}

pub fn set_params_for_user(mut self, user: &Option<LocalUserView>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this much better.

read_post_listings_with_person_after_block.show_bot_accounts = Some(true);
read_post_listings_with_person_after_block.my_person_id = Some(inserted_person.id);
let read_post_listings_with_person_after_block =
read_post_listings_with_person_after_block.list().unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add all the language stuff to these tests also. IE, have a user's preferences be All, one lang, or several langs, and make sure the queries return the correct results.

Another thing: maybe language preferences should really only count for GetPosts, not Search or GetPost, so that you can still search for posts in other languages, in which case we might need a flag on the PostQueryBuilder.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is extremely long and confusing, i will see if i can figure that out.

GetPost doesnt use PostQueryBuilder. And the same point about Search could be made regarding other params like show_read_posts, which is also not supported. It would have to be a separate PR.

@Nutomic
Copy link
Member Author

Nutomic commented May 19, 2022

This is really only the first part of the language tags implementation, in order to figure out how things can work. Next step will be comments (should be very easy, more or less copypaste of this pr). After that, i will add language settings for community/site (so that you can only post there in allowed languages).

Technically, none of that should require breaking changes. Nonetheless, it can change the way Lemmy works a lot, so i would release all of it together as 0.17.0, to make sure things work well together.

Edit: btw you havent commented how you would like the list of languages to be returned by the api, either as part of /api/v3/site, or as a new endpoint. This list is necessary for the dropdown in user settings. I dont want to implement it one way and then redo everything. Also, it would probably only return language identifiers (en, fr), not human readable language names as the iso639-1 crate doesnt have them (but could be added via pr).

@dessalines
Copy link
Member

The list of possible languages should probably come back with GetSite.

@dessalines
Copy link
Member

Edit: btw you havent commented how you would like the list of languages to be returned by the api, either as part of /api/v3/site, or as a new endpoint.

Sorry, I'd def prefer the languages come back as part of /api/v3/site .

@Nutomic Nutomic force-pushed the language-tagging branch from 28c24e9 to 8973d21 Compare June 21, 2022 13:57
@Nutomic
Copy link
Member Author

Nutomic commented Jun 21, 2022

I decided to change this and not rely on an external crate for the language list. Instead i'm storing it in the database. This will make it easier to change languages later, without having to make pull requests for another crate, or adding extra logic in our code.

Sorry, I'd def prefer the languages come back as part of /api/v3/site .

Okay will do.

@Nutomic Nutomic force-pushed the language-tagging branch from 60cc67a to 6ca1630 Compare June 22, 2022 13:15
@Nutomic
Copy link
Member Author

Nutomic commented Jun 22, 2022

@dessalines This is worth a review now, as it seems to be feature complete (except for comment language, which i will add a bit later). For now there are still some things i am unhappy about. First, post has a field containing the LanguageId, which makes a lot of sense for the database. But that id is also exposed over the api, which seems like a bad idea. It should have the language code (en) instead. Similar for local_user.discussion_languages, which also contains the database id and not the language code.

Then there is of course the problem of array fields in the database. I dont think performance is a big concern for those, because i limited the array size to 5 elements. I could of course add a table for local_user_languages, but that would make things a lot more complicated. Plus there would have to be similar tables for community_languages and site_language.

@Nutomic Nutomic force-pushed the language-tagging branch from 347bc44 to 64ba6f3 Compare June 22, 2022 19:37
Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, post has a field containing the LanguageId, which makes a lot of sense for the database. But that id is also exposed over the api, which seems like a bad idea. It should have the language code (en) instead. Similar for local_user.discussion_languages, which also contains the database id and not the language code.

In the views, do an inner join to the language table, to get back their codes. Now that it'll be a separate table tho, that's pry not necessary.

The main thing is that we really do need to have a separate table, and do a proper join to it, and maybe even add an index for language_id on the post table, in order to make that join fast.

);

alter table local_user rename column lang to interface_language;
alter table local_user add column discussion_languages integer[] not null default '{}';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still the main issue: you need to add a local_user_language table, to make all the post queries joinable, rather than using eq_in)

(id, local_user_id, language_id)

crates/db_views/src/post_view.rs Show resolved Hide resolved
@@ -73,4 +75,5 @@ pub struct LocalUserSettings {
pub show_new_post_notifs: bool,
pub email_verified: bool,
pub accepted_application: bool,
pub discussion_languages: Vec<LanguageId>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Into its own table.

crates/db_schema/src/schema.rs Show resolved Hide resolved
@@ -361,6 +362,7 @@ table! {
thumbnail_url -> Nullable<Text>,
ap_id -> Varchar,
local -> Bool,
language -> Int4,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

language_id

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

alter table local_user rename column lang to interface_language;
alter table local_user add column discussion_languages integer[] not null default '{}';

alter table post add column language integer not null default 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

language_id , and this needs to reference / constraint to the language table.

@@ -117,6 +119,25 @@ impl Perform for SaveUserSettings {
})
.await?
.map_err(|e| LemmyError::from_error_message(e, "user_already_exists"))?;
let discussion_languages: Option<Vec<LanguageId>> =
if let Some(discussion_languages) = data.discussion_languages.clone() {
if discussion_languages.len() > 5 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't worry about limiting the number.

Also, since it might be necessary to do an inner_join when doing the post query, you might need to add all languages to local_user_language, if none is given, so as to return all possible results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the limit. The loop with Language::read_id_from_code() can probably also be removed, but i havent done that yet.

@Nutomic
Copy link
Member Author

Nutomic commented Jul 11, 2022

@dessalines Okay i added the local_user_language table and made the other changes as requested. Main thing left is the join/filter in post_view.rs, because i never really learned how joins work. Confusing error messages from diesel dont help either. I think it would be easiest if you could push a commit to get that working directly to my branch.

Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we get to the postquerybuilder stuff ( which might get a little messy ), this still needs to have a LocalUserDiscussionLanguageView, that comes back with MyUserInfo. It needs to join to Language and LocalUser.

Check out admin_purge_community_view for a really simple example of how the joins and tuples work. You can copy paste that and tweak the tables.

pm me if you need any help there, or run into any issues.

crates/api/src/local_user/save_settings.rs Outdated Show resolved Hide resolved
Some(language_ids)
} else {
None
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be inserting / upserting into the local_user_language table, which is still missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, as weird as it sounds, the API should probably take in a vector of language_id:

  • Its super easy to get those, since they will come back with GetSite.
  • The local_user_language only needs (local_user_id, language_id), so you can do upserts there without have to do a pointless select from the language table.

let language = blocking(context.pool(), move |conn| {
Language::read_id_from_code_opt(conn, language)
})
.await??;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

crates/api_crud/src/post/create.rs Outdated Show resolved Hide resolved
crates/api_common/src/site.rs Show resolved Hide resolved
crates/api_common/src/person.rs Outdated Show resolved Hide resolved
recipient: a.3,
post: a.4,
community: a.5,
counts: a.6,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dessalines You should use into_iter whenever possible, as that will take the value instead of reference, and allow you to avoid all these unnecessary clones. I already fixed this in all the db views.

Its also pretty annoying that this conversion method has to be copy-pasted all over the place, but there seems to be no way to avoid it as diesel can only load these queries into tuples and not structs.

@Nutomic Nutomic force-pushed the language-tagging branch from 2722142 to d64c517 Compare July 13, 2022 10:53
@Nutomic
Copy link
Member Author

Nutomic commented Jul 13, 2022

Okay i added LocalUserDiscussionLanguageView, using LocalUserSettings to avoid loading password_encrypted column. Now i'm not sure how to continue with PostView.

I also made the api changes you mentioned, but anyway that can wait for later as the compiler cant check any of that as long as db code fails to compile.

Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things, but also I'll pull this down to see what's wrong with the postquerybuilder

@@ -9,6 +9,8 @@ pub mod community_person_ban_view;
#[cfg(feature = "full")]
pub mod community_view;
#[cfg(feature = "full")]
pub mod local_user_discusion_language_view;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two ss in discussion

create table local_user_language (
id serial primary key,
local_user_id int references local_user on update cascade on delete cascade not null,
language_id int references language_id on update cascade on delete cascade not null,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be references language , I'm guessing this migration fails.

@@ -70,3 +72,9 @@ pub struct PersonViewSafe {
pub person: PersonSafe,
pub counts: PersonAggregates,
}

#[derive(Debug, Serialize, Deserialize, Clone)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in db_views/src/structs , since that's where the other local_user ones are.

language::all_columns,
))
.filter(local_user::id.eq(local_user_id))
.load::<LocalUserDiscussionLanguageViewTuple>(conn)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, looks good.

let mut language_ids = vec![];
for l in discussion_languages {
language_ids.push(
blocking(context.pool(), move |conn| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this from the DB is no longer necessary, but you'll need to do upserts to the local_user_language table based on the provided language_ids.

@@ -17,6 +17,7 @@ pub struct CreatePost {
pub body: Option<String>,
pub honeypot: Option<String>,
pub nsfw: Option<bool>,
pub language: Option<String>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be language_id, since you have these language_ids in the front end.

@@ -91,6 +92,25 @@ impl PerformCrud for CreatePost {
let (embed_title, embed_description, embed_video_url) = metadata_res
.map(|u| (u.title, u.description, u.embed_video_url))
.unwrap_or_default();
let language = data.language.clone();
let mut language = blocking(context.pool(), move |conn| {
Language::read_id_from_code_opt(conn, language)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once this is language_id, you won't need to read anything from the DB.

@@ -170,6 +176,7 @@ pub struct PostQueryBuilder<'a> {
show_bot_accounts: Option<bool>,
show_read_posts: Option<bool>,
saved_only: Option<bool>,
languages: Option<Vec<LanguageId>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Languages isn't necessary, because you only need my_person_id, which can join to the local_user_languages table.

@dessalines
Copy link
Member

BTW don't worry about my comments above for now, I'm making a lot of these fixes in a separate branch.

@Nutomic
Copy link
Member Author

Nutomic commented Jul 28, 2022

Okay merged the other pr, and just made a small change to update user languages in a single transaction, instead of various separate db calls. Should be ready to review/merge now.

Edit: unless i add language tagging for comments to this pr as well. But its alreafy quite big so i would just make anothet one.

@Nutomic Nutomic marked this pull request as ready for review July 28, 2022 21:16
@dessalines
Copy link
Member

No need to worry about tagging for comments. If you would tho, add some unit tests for several test cases:

  • A user who has no language preferences, make sure they get everything back. (This might be broken)
  • A user who has one language
  • A user who has multiple languages

@Nutomic
Copy link
Member Author

Nutomic commented Aug 1, 2022

You are right, a user with no languages configured currently gets no posts at all back. Any hint how to fix it?

I also need to fix federation tests.

Edit: looks like federation tests are broken for the same reason (user with no languages doesnt get any results from PostQueryBuilder).

// Filter out the rows with missing languages
if self.my_local_user_id.is_some() {
query = query.filter(local_user_language::id.is_not_null());
}
Copy link
Member Author

@Nutomic Nutomic Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the if body needs to be something like this:

query = query.filter(sql(
    "TRUE GROUP BY (post.id, person.id, community.id, community_person_ban.id, post_aggregates.id, \
    community_follower.id, post_saved.id, post_read.id, person_block.id, post_like.score, language.id,\
    local_user_language.id, community_block.person_id) \
    HAVING count(local_user_language.id) = 0 or local_user_language.id is not null"))

Main part is the having clause, if all local_user_language.id values are null, dont filter. Unfortunately it doesnt work correctly, even if a user has one language configured, it returns posts for all languages. The group by is necessary because otherwise postgres throws errors like column "community_block.person_id" must appear in the GROUP BY clause or be used in an aggregate function. And having/group by are only implemented in diesel 2.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look this weekend and try to get it working.

dessalines and others added 2 commits August 18, 2022 09:45
* Some fixes for all / missing user languages.

* Adding back in transaction.
@Nutomic
Copy link
Member Author

Nutomic commented Aug 18, 2022

I fixed the conflicts, was quite complicated.

.unwrap();

// only one french language post should be returned
assert_eq!(1, post_listing_french.len());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This keeps failing after rebase and i cant figure out why. Already checked that the user languages are set correctly to french, and that the db queries are unchanged. It just keeps returning all 4 posts here, seemingly ignoring the user languages for no reason.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, forgot to pass local_user_id. Maybe later we should change it so the builder takes LocalUserView, so we dont have to pass all these things separately, and easily forget some.

Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. We'll need to performance test a lot of these recent changes before we do the next release also.

I'll start on the front end for this once I get back from vacation in a week or so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants