Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor index string fallbacks for Contributor Index #25834

Merged
merged 2 commits into from
Jan 18, 2023

Conversation

bryophyta
Copy link
Contributor

@bryophyta bryophyta commented Jan 16, 2023

What does this change?

Replaces a chain of orElse statements with string concatenation. The previous structure didn't test whether tags.lastName was an empty string. Passing an empty string to tagPages.alphaIndexKey doesn't crash the job, but it was leading to an error being logged.

@jamesgorrie's PR #25832 will downgrade the error message to an info message.

This PR should remove a significant proportion of the events which were triggering the logging statement in the first place. Using string concatenation here means that we don't need to explicitly test whether lastName is empty; if it is empty then the first character in firstName will be used for the index (and likewise if firstName is also empty). This is a less 'lazy' approach as it will often involve evaluating values that don't get used, but the cost of eagerness here seems low because it just involves unwrapping some Options and concatenating three strings.

This PR should also affect the contributor indexes that are built by this job, because all the contributors with no lastName should now be indexed by their firstName or webTitle instead of the fallback index key of 1-9. It looks like this was the intention of the original code, and in theory should be a more 'useful' index. However, it will still provide the 'wrong' bucketing and alphabetical ordering for those contributors who have no lastName set through human error (as opposed to the case where their name doesn't fit within the convention of a first and last name; in the latter cases ideally the affordances provided here would be used by editorial to produce the conventionally correct ordering based on the specifics of the case).

Does this change need to be reproduced in dotcom-rendering ?

  • No
  • Yes (please indicate your plans for DCR Implementation)

Tests

I deployed this to CODE last night and the errors seem to have dropped off since that deployment:
image

Also, @ioannakok and I have verified that when the job runs with the code in this PR it does rebuild the indexes in S3 with contributors allocated as we'd expect (i.e. most of them have moved from the 1-9 bucket into an alphabet bucket):

Before After
the 1-9 page only has many entries for contributors: Frank, David Aaronovitch, Abahachi, etc. the 1-9 page only has one entry now: Studio 20 NYU

@bryophyta bryophyta requested a review from a team as a code owner January 16, 2023 18:25
* Concatenating these three strings therefore allows us to use firstName and webTitle as fallbacks
* if lastName is None or "".
* */
val indexString = tag.lastName.getOrElse("").trim + tag.firstName.getOrElse("").trim + tag.webTitle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a risk of these being duplicated between contributors? If so do we know the effect that this might have?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question! my understanding is that these values are just used to bucket contributors by the first letter in their names, if available. (More context on this functionality can be found in a previous PR from @ioannakok: #25176) So there shouldn't be any need for the indexString value to be unique, if I've understood the functionality correctly.

@ioannakok
Copy link
Contributor

Perhaps this is a separate issue and I'm misunderstanding something here but even in the current situation I am not sure why contributors with both first name and last name end up in the 1-9 index: https://www.theguardian.com/index/contributors/1-9

@bryophyta
Copy link
Contributor Author

bryophyta commented Jan 17, 2023

Perhaps this is a separate issue and I'm misunderstanding something here but even in the current situation I am not sure why contributors with both first name and last name end up in the 1-9 index: theguardian.com/index/contributors/1-9

It's a good question! I'm not 100% sure but I believe it's generally going to be because their information has been put in "wrong", so that for instance the webTitle has the full name, but the lastName field has been left blank. e.g. the 1-9 page includes David Aaronovitch, but if you look at the CAPI contributor response we see that there's no lastName set for this contributor:

image

@ioannakok
Copy link
Contributor

Perhaps this is a separate issue and I'm misunderstanding something here but even in the current situation I am not sure why contributors with both first name and last name end up in the 1-9 index: theguardian.com/index/contributors/1-9

It's a good question! I'm not 100% sure but I believe it's generally going to be because their information has been put in "wrong", so that for instance the webTitle has the full name, but the lastName field has been left blank. e.g. the 1-9 page includes David Aaronovitch, but if you look at the CAPI contributor response we see that there's no lastName set for this contributor:

image

Ah, I see - thank you!

* Concatenating these three strings therefore allows us to use firstName and webTitle as fallbacks
* if lastName is None or "".
* */
val indexString = tag.lastName.getOrElse("").trim + tag.firstName.getOrElse("").trim + tag.webTitle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val indexString = tag.lastName.getOrElse("").trim + tag.firstName.getOrElse("").trim + tag.webTitle
val indexString = tag.lastName.getOrElse("").trim.concat(tag.firstName.getOrElse("").trim).concat(tag.webTitle)

Co-authored-by: Ioanna Kokkini <ioannakok@users.noreply.github.com>
Copy link
Contributor

@ioannakok ioannakok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@bryophyta bryophyta merged commit 3768ce5 into main Jan 18, 2023
@bryophyta bryophyta deleted the pf/refactor-contributor-index branch January 18, 2023 10:58
Copy link
Contributor

@jamesgorrie jamesgorrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good - but just had a thought about making these indexes simpler in general. Could we just use the webTitle?

Looking at the Tag model - webTitle is required and almost always seems to be some combination of names.

Looking at where we render the text, it is also the webTitle with some sort of deduping.

@bryophyta
Copy link
Contributor Author

bryophyta commented Jan 18, 2023

Could we just use the webTitle?

I did wonder about this, but thought it was best in this PR to focus on the bug and keep the original intent of the functionality.

I think there's definitely a case to be made for using webTitle, but in most cases it would mean breaking the convention that where people's names fit the pattern of first and last names, then we order them alphabetically by last name. As noted above, not all names do fit this pattern, but I take it that this is the rationale behind the current functionality. Changing this feels to me like it would be an editorial decision of some sort, though maybe I'm wrong about this and it's actually something we could just do!

Any thoughts on this @jamesgorrie ?

@prout-bot
Copy link
Collaborator

Seen on FRONTS-PROD, ADMIN-PROD (merged by @bryophyta 3 hours, 15 minutes and 14 seconds ago)

@jamesgorrie
Copy link
Contributor

then we order them alphabetically by last name.

I would love to speak to the person who thought an index sorted on multiple, unsurfaced criteria would be easier to search. Saying that, it's a good point of not knowing the intent.

Let's go digging for that.

I am going to take a guess it was:

  • You normally sort by last name, let's do that
  • Shizzle, we don't always have that, let's fallback to firstname
  • Oh my, sometimes that's missing, fallback to webTitle

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants