Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-connect a Project to a RemoteRepository #8229

Closed
humitos opened this issue Jun 2, 2021 · 24 comments
Closed

Re-connect a Project to a RemoteRepository #8229

humitos opened this issue Jun 2, 2021 · 24 comments
Assignees
Labels
Feature New feature Needed: design decision A core team decision is required

Comments

@humitos
Copy link
Member

humitos commented Jun 2, 2021

With the work done in Read the Docs for Business about VCS Auth (see https://docs.readthedocs.io/en/stable/commercial/single-sign-on.html#sso-with-vcs-provider-github-bitbucket-or-gitlab) where all the permissions are based on Projects being linked/related to RemoteRepository and with the work that we are doing to migrate this feature + organization to Read the Docs Community, it will be almost mandatory to support a way for users to "(Re) Connect a Project to a RemoteRepository" to be able to use the VCS Auth feature.

There are some things to consider here:

  • we only have Project.repo (e.g. https://github.com/rtfd/sphinx-notfound-page.git) as a source
  • RemoteRepository has fields: clone_url, ssh_url and html_url we can compare against
  • these URLs won't exactly match in some different scenarios:
    • the GH project is imported living at rtfd/sphinx-notfound-page and then moved to readthedocs/sphinx-notfound-page
    • the URL used to import the repository has trailing /
    • the URL used to import it is SSH and ends without .git
    • https:// vs. http://
    • maybe other cases

We will need to manipulate a little the URL to handle these cases:

  1. find out the "final URL" for the repository
    • if it's an HTTP URL we can make a request and follow redirects until we get the final URL
  2. add/remove the trailing /
  3. add/remove the final .git
  4. change http:// to https://
  5. others?

Then we can query our database to find out the correct RemoteRepository:

# all the generated URLs after manipulating `project.repo`
urls = [
  # https protocol
  'https://github.com/rtfd/sphinx-notfound-page.git',
  'https://github.com/rtfd/sphinx-notfound-page',
  'https://github.com/rtfd/sphinx-notfound-page/',

  # Final URL with https protocol
  'https://github.com/readthedocs/sphinx-notfound-page.git',
  'https://github.com/readthedocs/sphinx-notfound-page',
  'https://github.com/readthedocs/sphinx-notfound-page/',
]

# this query is not super fast
remote_repository = RemoteRepository.objects.filter(
  Q(html_url__in=urls),
  Q(clone_url__in=urls),
  Q(ssh_url__in=urls),
).first()

if remote_repository:
  project.remote_repository = remote_repository
  project.save()
else:
  # show a message to the user that we weren't able to find out the 
  # remote repository and they may need to re-sync their repositories

I'm not sure that this will work in all the cases, tho

Also, note that we currently have an OneToOneField relation between Project - RemoteRepository (see

project = models.OneToOneField(
Project,
on_delete=models.SET_NULL,
related_name='remote_repository',
null=True,
blank=True,
)
). This has to be changed to a ForeignKey on the Project model so we can allow the same RemoteRepository connected to multiple Projects. Otherwise, a non-owner of a GH repository could manually import the official GH repository under a different slug making the official one be disconnected from the RemoteRepository.

Front logo Front conversations

@humitos humitos added Feature New feature Needed: design decision A core team decision is required labels Jun 2, 2021
@humitos
Copy link
Member Author

humitos commented Jun 2, 2021

We also need to define how the UI for this feature would look like. I took a look at the Admin tabs and I didn't find a great place to put a button for this. So, this may be a good point for the new templates as well.

@humitos
Copy link
Member Author

humitos commented Feb 3, 2022

Note that for corporate we will have more URLs options here because they could use git@github.com:<organization>/<repository>/ as well.

@humitos
Copy link
Member Author

humitos commented Feb 7, 2022

This issue is related to #8229 as well.

@stsewd
Copy link
Member

stsewd commented Feb 8, 2022

Don't know if this is related to the issue, but the remote repo gets out of sync when the repo URL is changed. Happened on https://readthedocs.org/dashboard/dev/

In [5]: p.remote_repository
Out[5]: <RemoteRepository: Remote repository: https://github.com/astrojuanlu/rtd-dev-docs>

That repo was first, but then we changed it to https://github.com/readthedocs/readthedocs.org/.

But even weirder, the build status was being sent correctly a couple of weeks ago.

@ericholscher
Copy link
Member

This issue is related to #8229 as well.

@humitos that's a link back to this issue :)

It seems like there is something more we can be doing here, but need to figure out what it is.

@humitos
Copy link
Member Author

humitos commented Feb 9, 2022

@humitos that's a link back to this issue :)

@ericholscher I think I wanted to link to #8715 , but I had a different issue in my clipboard 😓

@humitos
Copy link
Member Author

humitos commented Jun 1, 2022

I'd like to prioritize this if possible and find a proper solution for this case. I've bitten multiple times by this and I'm jumping into the Django shell and connecting the RemoteRepository by hand to my projects.

@ericholscher
Copy link
Member

@humitos added this to Q3, so it doesn't get lost 👍 I agree this is important, and we can probably add it to an upcoming sprint.

@humitos
Copy link
Member Author

humitos commented Jun 27, 2022

@stsewd

Don't know if this is related to the issue, but the remote repo gets out of sync when the repo URL is changed

I think you already solved this in #9178, so I think this is not a problem anymore.

@humitos
Copy link
Member Author

humitos commented Jun 27, 2022

OK, I'm back to this issue and some things have changed/fixed already. I will write down a small proposal to be able to move forward with this with the minor UI / UX changes. We could create a better UI with the new templates.

My proposal is as follows:

  1. add "(Re)-connect" button on the "Integrations" admin tab, next to "Add integration" 1
  2. only show this button if we have an exact match for this project 2 and the user has admin permissions on it
  3. the exact match will be based on Project.repo and RemoteRepository.<...>_urls (clone_, ssh_, html_)
    • for http/https URL, we will find the final URL following the redirects (supports organization/repository renames)
    • for SSH URLs we may need to convert them to its https version first or use __startswith filter (to allow ending .git) over the ssh_url 3
  4. once the button is pressed, we will execute the following code:
import requests

def get_final_url(project):
    final_url = None
    url = project.repo
    if '@' in url:
       url = convert_to_https(url)
    if url.startswith('http'):
        try:
            reponse = requests.head(project.repo, allow_redirects=True)
            final_url = response.url
        except requests.exceptions.TooManyRedirects:
            pass
    return final_url

def reconnect_remote_repository(project, user):
    final_url = get_final_url(project)
    if final_url:
        remote_repository = RemoteRepository.objects.filter(
            http_url=final_url,
            remote_repository_relations__user=user,
            remote_repository_relations__admin=True,
        ).first()
        if remote_repository:
            project.remote_repository = remote_repository
            project.save()

Footnotes

  1. I didn't find a better place to put it, but I think it's fine for now. On new templates, it will be shown in a better place.

  2. The second iteration could allow users to link to a different one (does it make sense, anyway?)

  3. I didn't find a way to get the final URL from an SSH URL 😢

@ericholscher
Copy link
Member

This makes sense to me. It seems like there should be a better way to store and query this information, so we don't need translate the format in such weird ways, but that's a future-looking improvement.

@humitos
Copy link
Member Author

humitos commented Jun 28, 2022

I thought a little more about this (while working on the PR: #9381) and I realize that we may be able to perform this action automatically instead of asking the users to do it. I mean, if we are going to allow only one possibility to connect the project and it will always match the repository URL: we can do it by ourselves without the user interaction, right?

What are the cases where doing this automatically does not make sense?

  • "forks" will have a different URL, so they will be linked to the fork repository
  • users importing other people's repositories without their concent. It will be connected to the official repository. We are not managing permissions at VCS level in community, so users will keep being admin users.
    • this point may need more thinking and some decisions
    • does it make sense to have admin users on Read the Docs that do not have access to the VCS repository?
  • would we want to connect the project to a different repository (e.g. http://github.com/org/repository to http://github.com/org/another-repository)? why? what is the use case for this?

It seems we could iterate over the Projects that are not spam, find the last URL and check if we have it registered for one of the RemoteRepository objects stored in our database. If that's the case, we link them to each other.

@humitos
Copy link
Member Author

humitos commented Jun 28, 2022

@ericholscher

It seems like there should be a better way to store and query this information, so we don't need translate the format in such weird ways, but that's a future-looking improvement.

We could update the Project.repo field in bulk if we want:

  • for those that have a RemoteRepository connected, the proper field is RemoteRepository.clone_url. It will be the HTTPS version if the repository is public and SSH one if it's private
  • for those that are not connected, we could use the requests.head(..., allow_redirects=True) method to find the current valid URL

@ericholscher
Copy link
Member

ericholscher commented Jun 28, 2022

We could update the Project.repo field in bulk if we want

Not a terrible idea, but I do worry about overwriting the users data. Perhaps having another field like canonical_repo or something that we control as a stopgap? I haven't thought deeply about it, but I do worry a bit about changing things that might break in weird ways (eg. a user has some custom format that works for some reason)

It seems we could iterate over the Projects that are not spam, find the last URL and check if we have it registered for one of the RemoteRepository objects stored in our database. If that's the case, we link them to each other.

I like this idea. The more we can do to stop things getting out of sync automatically the better in my mind.

@humitos
Copy link
Member Author

humitos commented Jun 29, 2022

I created a script to test the automated way and I think it could work fine: https://gist.github.com/humitos/414c8ff7a4c16776d23e108aef184ce8

These are some of the projects that it will connect (output from the script):

Connecting: read-the-docs
  slug: read-the-docs
  remoterepository:
    readthedocs/readthedocs.org
    156
  url: https://github.com/readthedocs/readthedocs.org

Connecting: sphinxcontrib-doxylink
  slug: sphinxcontrib-doxylink
  remoterepository:
    sphinx-contrib/doxylink
    1428
  url: https://github.com/sphinx-contrib/doxylink

Connecting: django-layar
  slug: django-layar
  remoterepository:
    jamesturk/django-layar
    145279
  url: https://github.com/jamesturk/django-layar

Connecting: django-native-tags
  slug: django-native-tags
  remoterepository:
    justquick/django-native-tags
    1344867
  url: https://github.com/justquick/django-native-tags

Connecting: django-formwizard
  slug: django-formwizard
  remoterepository:
    stephrdev/django-formwizard
    1159789
  url: https://github.com/stephrdev/django-formwizard

....

Connecting: envbuilder
RemoteRepository not found for this project. url=https://github.com/jasonbaker/envbuilder

Connecting: wilson
RemoteRepository not found for this project. url=https://github.com/chrisdickinson/wilson

Connecting: cli
RemoteRepository not found for this project. url=https://bitbucket.org/wcmaier/cli

All of them look accurate to me.


Since our RemoteRepository database is not up-to-date, we won't be able to find many of them when trying to connect. So, we could re-sync all the users before performing the linking: ~140k (User.objects.filter(socialaccount__isnull=False).count()). They are a lot. However, this won't block us due to the rate limit since we are using a user-specific token, which will count
these requests on behalf of the user, instead of our application.

This solves the problem of re-connecting all the Project to RemoteRepository automatically. However, this does not solve the problem of a "Manually imported project that a user wants to connect later". This brings me the question: do we want to expose this to users? Why? Why they should be aware of this connection? What's the UX we want to build around this?


Note that most of our projects do not have their RemoteRepository connected:

# Projects with remote repository connected
In [17]: (
    ...:     Project.objects.annotate(spam_score=Sum("spam_rules__value"))
    ...:     .filter(
    ...:         spam_score__lt=settings.RTD_SPAM_THRESHOLD_DENY_ON_ROBOTS,
    ...:         remote_repository__isnull=False,
    ...:     )
    ...:     .count()
    ...: )
Out[17]: 4834

# Project without remote repository
In [18]: (
    ...:     Project.objects.annotate(spam_score=Sum("spam_rules__value"))
    ...:     .filter(
    ...:         spam_score__lt=settings.RTD_SPAM_THRESHOLD_DENY_ON_ROBOTS,
    ...:         remote_repository__isnull=True,
    ...:     )
    ...:     .count()
    ...: )
Out[18]: 117026

@ericholscher
Copy link
Member

ericholscher commented Jun 29, 2022

I think we likely want to resync all users & re-connect the RemoteRepo. If we can keep this data in a mostly working order, it gives us a lot of interesting paths forward, and also is a prerequisite for doing widescale SSO on .org

This solves the problem of re-connecting all the Project to RemoteRepository automatically. However, this does not solve the problem of a "Manually imported project that a user wants to connect later". This brings me the question: do we want to expose this to users? Why? Why they should be aware of this connection? What's the UX we want to build around this?

In general we should not expose this to the user. It's an implementation detail for sure. If we need to expose something like this to the user, it should almost certainly be "You need to connect your GitHub repository so that we can perform this operation" with the RemoteRepo creation happening automatically if it's already connected.

@humitos
Copy link
Member Author

humitos commented Jun 30, 2022

@ericholscher

it should almost certainly be "You need to connect your GitHub repository so that we can perform this operation" with the RemoteRepo creation happening automatically if it's already connected.

I understand you wanted to say "You need to connect your GitHub account so that we can perform this operation". If that's the case, I agree.

@humitos
Copy link
Member Author

humitos commented Jun 30, 2022

@ericholscher

I think we likely want to resync all users & re-connect the RemoteRepo. If we can keep this data in a mostly working order, it gives us a lot of interesting paths forward, and also is a prerequisite for doing widescale SSO on .org

Cool! I will start by re-syncing all the users' accounts first. My idea is to spin up a new web-celery instance, listening to a different queue (e.g. resync-oauth) and trigger all the tasks there. This is to avoid conflicting with production instances / queues / others.

  • Community
  • Commercial

Once we have all this data re-synced, I will come back to my script that shows what are going to be the connections, review it again and run it after some QA.

@ericholscher
Copy link
Member

Sounds good. Excited to be moving forward on this! 🚀

@humitos
Copy link
Member Author

humitos commented Jul 4, 2022

Now that we have all the user accounts synced, I updated the script and did another run as a test to show what are the "Project <-> RemoteRepository" will be linked automatically. Here are the results for 500 projects: https://onenr.io/0LwGL0yBvR6. Note that from 500 repositories checked, we found matches only for 37. I may run a higher sample of projects, but so far it seems accurate and safe to do for me.

@humitos
Copy link
Member Author

humitos commented Jul 4, 2022

This script won't solve the problem of auto-connect them on these situations:

  • a user connects a social account that matches projects already imported
  • a user imports manually a project with a Project.repo URL that does not match exactly, but following the redirects, it does
  • a user re-sync their connections and a newer/updated RemoteRepository matches the URL immediately (or after following the redirects)

Probably, these are not all the situations and there are more, but at least these are good to start talking and thinking about the approach we want to follow for auto-connect. I'm not sure we will be able to cover all the possible cases (unless we completely disable "Manual Import") but I'd be happy reducing the "unlinking" as much as we can.

@ericholscher
Copy link
Member

Agreed. At least getting 80% of the projects in the easy cases is a huge win. The edge cases will always be weird/broken in some way.

humitos added a commit that referenced this issue Jul 7, 2022
Trigger a daily task to compare user's last login isoweekday with today's
isoweekday for all active users. If they matches, we resync the
`RemoteRepository` for this user.

This logic is the same as "resync `RemoteRepository` once a week per each user".

We consider active users those that have logged in at least once in the last 90
days.

Related: #8229
Related: #9409
humitos added a commit that referenced this issue Jul 7, 2022
Trigger a daily task to compare user's last login isoweekday with today's
isoweekday for all active users. If they matches, we resync the
`RemoteRepository` for this user.

This logic is the same as "resync `RemoteRepository` once a week per each user".

We consider active users those that have logged in at least once in the last 90
days.

Related: #8229
Related: #9409
humitos added a commit that referenced this issue Jul 11, 2022
* OAuth: resync `RemoteRepository` weekly for active users

Trigger a daily task to compare user's last login isoweekday with today's
isoweekday for all active users. If they matches, we resync the
`RemoteRepository` for this user.

This logic is the same as "resync `RemoteRepository` once a week per each user".

We consider active users those that have logged in at least once in the last 90
days.

Related: #8229
Related: #9409

* Crontab schedule

Co-authored-by: Eric Holscher <25510+ericholscher@users.noreply.github.com>

* OAuth: re-sync `RemoteRepository` in 1 Celery process only

Weekly resync will use only one Celery process to avoid backing up our queue.

Co-authored-by: Eric Holscher <25510+ericholscher@users.noreply.github.com>
@humitos
Copy link
Member Author

humitos commented Jul 14, 2022

The script has finished on community and we ended up with 9121 project connected after running the script, from 4834 that we had before.

On commercial, I triggered a dryrun of the script and we have around ~250 projects that we auto-reconnect. So, I think it's worth doing it as well (see the progress in https://onenr.io/0oR809B9XRG)

I think the only missing piece from this task is handling the situations mentioned in #8229 (comment), in an attempt to auto-connect them on manual import or account reconnect when finding an exact match.

@humitos
Copy link
Member Author

humitos commented Jul 19, 2022

Marking this issue as done. I created #9437 to track the missing parts of it. Let me know if you have something else in mind that I should check/do, tho.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature New feature Needed: design decision A core team decision is required
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants