This repository has been archived by the owner on Feb 3, 2018. It is now read-only.

Handling slightly different URLs to same project #27

Closed
mattfarina opened this issue May 10, 2016 · 13 comments

@mattfarina

How does the source manager handle the case of two URLs that point to the same location? For example:

  • https://github.com/foo/bar
  • git@github.com:foo/bar

This case can arise in a global cache where two different projects specify the same location two different ways.

In some systems (e.g., private git installs) there may be a difference in the available branches (e.g., private branches) and in the access that entails. We don't need that to leak.

How does the source manager handle this so it can be used in the GOPATH and scanning?

Glide handles this by creating a key for the cache that's based on the URI. So, the two examples above would be two different cache entries.
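
A minimal sketch of what a URL-derived cache key could look like. This is not Glide's actual code: `CacheKey` and its sanitizing rule are assumptions made for illustration. The point is only that the key comes from the raw URL string, so two spellings of the same location land in two cache entries.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// unsafeRun matches runs of characters that are awkward in a directory name.
var unsafeRun = regexp.MustCompile(`[^A-Za-z0-9._-]+`)

// CacheKey is a hypothetical helper: it turns the raw URL string into a
// directory-safe key without trying to work out what the URL points at.
func CacheKey(rawURL string) string {
	return strings.Trim(unsafeRun.ReplaceAllString(rawURL, "-"), "-")
}

func main() {
	for _, u := range []string{"https://github.com/foo/bar", "git@github.com:foo/bar"} {
		fmt.Printf("%-28s -> %s\n", u, CacheKey(u))
	}
}
```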

@sdboyer sdboyer changed the title from "Handling slightly different URIs to same project" to "Handling slightly different URLs to same project" May 17, 2016
@sdboyer
Owner

sdboyer commented May 17, 2016

Sorry, with a swamp of github notifications and a conference last week, I totally missed these...

The SM assumes (or will - I haven't dealt with it just yet) that there is a canonical expression of a URI that underlies each URL, and stores them in the same location designated by that URI. I don't have it handy, but there's a standard transform for git, at least, that derives this URI; @technosophos pointed me to one in Deis a while back.
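
For illustration, here is a rough sketch of that kind of transform. It is not the Deis code referred to above; `Canonical` and its rules (drop the scheme, the user, and a trailing `.git`, lowercase the host) are assumptions about what a reasonable normalizer for git URLs might do.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// scpLike matches git's scp-style syntax, e.g. git@github.com:foo/bar.
var scpLike = regexp.MustCompile(`^(?:[^@/:]+@)?([^/:]+):([^/].*)$`)

// Canonical is a hypothetical normalizer that reduces a git URL to "host/path".
func Canonical(raw string) (string, error) {
	var host, path string
	if m := scpLike.FindStringSubmatch(raw); m != nil && !strings.Contains(raw, "://") {
		// scp-style: [user@]host:path
		host, path = m[1], m[2]
	} else {
		u, err := url.Parse(raw)
		if err != nil {
			return "", err
		}
		if u.Host == "" {
			return "", fmt.Errorf("no host in %q", raw)
		}
		host, path = u.Hostname(), u.Path
	}
	path = strings.TrimSuffix(strings.Trim(path, "/"), ".git")
	return strings.ToLower(host) + "/" + path, nil
}

func main() {
	for _, raw := range []string{
		"https://github.com/foo/bar",
		"git@github.com:foo/bar",
		"ssh://git@github.com/foo/bar.git",
	} {
		c, err := Canonical(raw)
		fmt.Println(raw, "->", c, err)
	}
}
```

Under these rules all three spellings collapse to `github.com/foo/bar`, which is exactly the behavior being questioned in this issue.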

There's an express assumption here that you can't have different branches or tags from different URLs corresponding to the same URI. That's not a capability that basic git services provide, at least. Though...gitolite might? Is that what you're referring to by "private git installs"? Also, do bzr or hg, or one of their major hosting platforms allow that?

Even so, because the cache's location is configurable, and is managed using the current user's permissions, I'm not sure that a 'leak' is so problematic here.

But I can see how this might be a problem. Examples of how and where it actually comes up - basically, what vcs hosting circumstances - would help me assess.

@sdboyer
Owner

sdboyer commented May 18, 2016

also, I'm sorta loath to link #10 b/c it's just my braindump, but...well, it's where I sorta braindumped through this. might be helpful for discussion.

@mattfarina
Author

I've only got a moment so I'll come back to this later but...

gitolite does have access rules for branches. This is one case and I do know of people using gitolite. I need to do some digging for other cases.

@sdboyer
Owner

sdboyer commented May 19, 2016

Rawr. Indeed they do - thanks for the link.

OK, reading through those docs...just shooting from the hip, it seems like there are two, not mutually exclusive scenarios to consider:

  1. One URL for a URI may be missing some of the branches/tags furnished by a different URL for the same URI.
  2. One URL for a URI has branches or tags pointing to some particular revs, and another URL for that URI has those same branches or tags pointing to different revisions.
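
The two scenarios can be told apart mechanically once the refs seen through each URL are listed. A small sketch, assuming the refs have already been fetched into maps from name to revision; `Refs` and `Compare` are hypothetical names, not part of vsolver:

```go
package main

import "fmt"

// Refs maps a branch or tag name to the revision it points at.
type Refs map[string]string

// Compare reports how the refs seen through two URLs relate.
// missing: some ref is visible through only one of the URLs (scenario 1).
// conflicting: a ref exists on both sides but points at different revisions (scenario 2).
func Compare(a, b Refs) (missing, conflicting bool) {
	for name, rev := range a {
		other, ok := b[name]
		switch {
		case !ok:
			missing = true
		case other != rev:
			conflicting = true
		}
	}
	for name := range b {
		if _, ok := a[name]; !ok {
			missing = true
		}
	}
	return missing, conflicting
}

func main() {
	full := Refs{"master": "abc123", "private-wip": "def456"}
	restricted := Refs{"master": "abc123"}
	diverged := Refs{"master": "999999"}

	fmt.Println(Compare(full, restricted))     // only scenario 1 applies
	fmt.Println(Compare(restricted, diverged)) // only scenario 2 applies
}
```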

The latter case is the one that unequivocally requires a different place on disk...and, really, is kind of a violation of the concept of a URI in the first place - if the same properties have different values, then it's a different resource.

The former case - which, if I'm grokking correctly, is the one gitolite's access controls enable - is more like just seeing some of the resource's properties. And it actually might not be a reason to keep the URIs separately. Here's my thinking:

While vsolver itself doesn't internally enforce that these URLs are coming from a manifest file, or that that file is committed to version control and being shared by a team, there's basically no reason to use a vsolver-based tool unless that's the case. So, assuming that, we can infer a chain of things:

  1. Special, credentialed access doesn't happen without an organization of some kind. So, any manifest (say, from project A), that names a dep (say, for project B) with a URL which requires special credentials to access necessarily entails that A and B are gonna be within the sphere of influence of one team, or at least multiple teams under one organizational umbrella.
  2. If A and B are owned, directly or indirectly, by the same group of people, then these people are already aware that their e.g., gitolite setup allows for the possibility that some team members might be able to see only a subset of available branches or tags.
  3. If they're not aware, either they have a rude awakening coming sooner or later independent of their package manager, or it's because their team has been kept intentionally in the dark. In the latter case, this is all irrelevant, because no one on the team working with A or B knows that there's stuff they can't see, and they see all the same stuff...so everything just works like normal.
  4. As the team is either aware of these variations, or the variations don't matter, they would quickly learn to make choices in the version constraints A's manifest specifies for B so that none of the branches/tags visible only to some team members would ever get used.

That's all really just a way of saying that there's no point in distributing a manifest - which is explicitly intended to be as widely shared as the containing repository - to people who won't be able to use that manifest. Really, I think this is actually the same problem as distributing a manifest with a totally private repository, just with slightly different, more uneven failure modes.

What's salient here is, providing a different on-disk storage location for a different URI doesn't actually solve a problem. The cache isn't enforced to be per-user, but that's the expectation, so unless you're doing sub-user level multi-tenancy (in which case, not our problem), if the cache is already there from a different URI, it's because the user had legit access.

There's a bunch more rabbit holes I could chase, but it already seems like the sort of thing people would have to intentionally screw up in order to really have it go wrong.

OH SHIT

Lemme invert it real quick: gitolite's actually not a great example, because gitolite over ssh won't have a per-user URL. HTTP would, but it doesn't matter, because it means sameness of URL is irrelevant as a predictor of whether a URI will provide a complete, incomplete, or empty resource set.

Like I said, this is just shooting from the hip, but that last bit in particular feels pretty definitive. If I've missed something, let me know...but I think the real decider here is if we can find a case where a different URL for a URI can produce branches/tags pointing at different revisions. That'll mean we definitely need to separate storage.

@mattfarina
Author

Two quick things...

  1. There are other tools doing finer grained access control as well (here's an example). I was hoping to take gitolite's waning popularity into account but there are others.
  2. Technically, git@example.com:foo/bar.git and https://example.com/foo/bar.git are different URIs. They may lead to the same-ish location but they are different. In Glide we create a cache key based on that and store them in separate locations. This can lead to a little duplication but disk space is generally cheap so I'm not sure it's a big deal. Wonder if this is the safer way to go.

@sdboyer
Owner

sdboyer commented May 23, 2016

There are other tools doing finer grained access control as well (here's an example). I was hoping to take gitolite's waning popularity into account but there are others.

OK, but as my previous line of thought laid out, permission-level differences don't seem like a good reason to separate otherwise-equivalent URIs. If you can think of some holes in that line of thinking...

Technically, git@example.com:foo/bar.git and https://example.com/foo/bar.git are different URIs.

So we risk getting into a really pedantic discussion here, but...well, I'm not sure how meaningful it is to call those "different URIs." The spec notes that a URI scheme being named after a protocol "does not imply that use of these URIs will result in access to the resource via the named protocol." And it's certainly quite clear that the goal is to "allow uniform semantic interpretation of common syntactic conventions across different types of resource identifiers."

This doesn't necessitate that we should consider these URL/URIs that are only differentiated by protocol as pointing to the same resource. But it does mean that's a valid way to look at it - which means it's a question of whether the application we're working with looks at it that way.

Git, as far as I know, does expect them to be equivalent, though it's certainly possible to configure a git hosting service that violates the constraint. It seems that helm, at least, figures that normalizing is safe for git. idk about hg/bzr/svn, but I'd be very surprised if they didn't try to maintain this invariant.

This is why I'm looking for examples of a hosting service that actually does violate the equivalency relationship between different URIs where the identifier component is the same.

They may lead to the same-ish location but they are different. In Glide we create a cache key based on that and store them in separate locations. This can lead to a little duplication but disk space is generally cheap so I'm not sure it's a big deal. Wonder if this is the safer way to go.

It's certainly safer, but it's more complex for vsolver than for glide. We have to maintain a permanent cache, access to atomic parts of that cache, metadata about that cache, and still make the code therein available for cross-project analysis. Doing this would require keeping all local clones in a separate directory structure based on full URL, then dynamically moving them into place when (cross-project) analysis is required. That then incurs new cleanup requirements, which also incurs new startup sanity-checking requirements. And more complications for any attempts to convert the SourceManager into something that can operate in a more server daemon-like context.

There's a decent chance this all ends up being necessary at some point. But it's a lot of extra complexity right now, all to solve a problem that still strikes me as remote.

@sdboyer
Owner

sdboyer commented Jul 8, 2016

I've changed my mind on this. While I still think the above rationale applies (that is, I don't find the different-access-over-different-protocols argument convincing), it's now pretty much moot.

My reason for not wanting to do this was because I was hoping we could keep a complete, correct $GOPATH in the cache dir populated with repositories and code for comparative static analysis purposes. Introducing different paths for different URLs would break that basic scheme. However, what I realized while writing docs is that it's already not possible to keep a $GOPATH like that, not because of slightly different URLs, but because of very different URLs - as in, forks.

We need, for example, https://github.com/sdboyer/semver to be able to reside temporarily at GOPATH/src/github.com/Masterminds/semver if the user has specified that they want to swap in the fork for the main. As we've all struggled with since time immemorial with Go, having that fork at the wrong path just doesn't work, and that problem would replay itself for anything we do based on go/build.

So, being that keeping repos on a separate path and moving them into place on $GOPATH when the need arises is a requirement anyway, there's little reason (except some space and cache efficiency) to not just separate out URLs.

That said, I don't think I'll make e.g., https://github.com/sdboyer/semver and git@github.com:sdboyer/semver a satisfiability check failure, because of the reasons outlined here.

@sdboyer sdboyer added this to the MVP milestone Jul 19, 2016
@mattfarina
Author

Don't you control the directory you scan for dependencies? For example, when a reference to github.com/Masterminds/semver comes in, can you then scan for the dependencies in $CACHEPATH/src/https-github.com-sdboyer-semver instead? Keeping the mapping internal?

@sdboyer
Owner

sdboyer commented Jul 20, 2016

tl;dr - my thinking's updated since even that last comment, and i don't think repo movement is necessary at all, as you say.


yep yep, i do control the dir, which is why the current approach was OK back when we discussed in mid-May. what i was worried about was the cross-package analysis (#67) - so, for example, needing to have both github.com/Masterminds/semver and github.com/sdboyer/gps checked out, at the right version, in order to verify type-level compatibility between the APIs.

that worry was based on the assumption that the type analysis would ultimately come back to go/build, which would need a well-formed GOPATH in order to hop across trees (gh/sdboyer/gps -> gh/Masterminds/semver) correctly.

however, having now banged out the package tree parsing logic and spent a lot of time in the guts of build.ImportDir(), it's clear that we'll be much better off anyway just rewriting what we need directly with go/parser. that will extend to the implementation of #67, which means GOPATH doesn't matter at all anymore, so there's no need to move repos around. they can just live at something like what you suggested - $CACHEPATH/src/https-github.com-sdboyer-semver - and the type checker in #67 will be embedded in the SourceManager that already knows how to deal with ProjectIdentifier - and those have everything we need to negotiate the proper root import path onto the right network name.
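
a rough sketch of how that mapping could look. the `projectID` type below is only a stand-in for the ProjectIdentifier idea, not the real type; its field names, the HTTPS default, and the key format are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// projectID is a stand-in: the import path root used in source code, plus the
// URL the code is actually fetched from (which may be a fork).
type projectID struct {
	Root        string // e.g. github.com/Masterminds/semver
	NetworkName string // e.g. https://github.com/sdboyer/semver; empty means derive from Root
}

var unsafeRun = regexp.MustCompile(`[^A-Za-z0-9._-]+`)

// cacheDir returns the on-disk location for id under cachePath. The directory
// is named after the network name, so it never has to match the import path.
func cacheDir(cachePath string, id projectID) string {
	src := id.NetworkName
	if src == "" {
		src = "https://" + id.Root // assumption: default to HTTPS on the root
	}
	key := strings.Trim(unsafeRun.ReplaceAllString(src, "-"), "-")
	return filepath.Join(cachePath, "src", key)
}

func main() {
	fork := projectID{
		Root:        "github.com/Masterminds/semver",
		NetworkName: "https://github.com/sdboyer/semver",
	}
	// Imports of fork.Root are resolved by scanning this directory.
	fmt.Println(fork.Root, "->", cacheDir("/cache", fork))
}
```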

@mattfarina
Author

I'll need to look again but I think you can use $CACHEPATH/src/https-github.com-sdboyer-semver and use go/build.

@sdboyer
Owner

sdboyer commented Jul 20, 2016

you can, if you're walking the directory tree yourself and call e.g. build.ImportDir() on the directory you want to scan.
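
a small sketch of that usage: `build.ImportDir` is pointed at a directory that is not on GOPATH and only reports what the package there imports. the temp-dir layout and the `scanImports`/`demo` helpers are made up for the example.

```go
package main

import (
	"fmt"
	"go/build"
	"os"
	"path/filepath"
)

// scanImports returns the imports declared by the package in dir. The
// directory does not need to sit under GOPATH for this to work.
func scanImports(dir string) ([]string, error) {
	pkg, err := build.ImportDir(dir, 0)
	if err != nil {
		return nil, err
	}
	return pkg.Imports, nil
}

// demo builds a throwaway cache layout, writes one Go file into it, and scans it.
func demo() ([]string, error) {
	cache, err := os.MkdirTemp("", "cache")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(cache)

	dir := filepath.Join(cache, "src", "https-github.com-sdboyer-gps")
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return nil, err
	}
	src := "package gps\n\nimport _ \"github.com/Masterminds/semver\"\n"
	if err := os.WriteFile(filepath.Join(dir, "gps.go"), []byte(src), 0o644); err != nil {
		return nil, err
	}
	return scanImports(dir)
}

func main() {
	imports, err := demo()
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Println(imports)
}
```

this lists the import paths but does not resolve them; finding the directory that holds the imported package is the part that would need a GOPATH or a mapping of your own.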

the problem i was anticipating was that, when jumping between packages, anything go/build-based would necessarily need a proper GOPATH in order to find the target of an import.

it was an imaginary problem, though, because no lib exists that does this sort of type checking, afaik. so, i have to write it anyway (while hoisting a bunch of private code out of the toolchain). writing it myself means i'll be in control of what directory corresponds to what import path - thus, no problem.

@sdboyer
Owner

sdboyer commented Aug 8, 2016

I'm actually going to move this out of MVP. While it's what I'm actively progressing towards, it's necessarily part of a much larger set of changes (#83) which aren't strictly necessary to move forward. It'll mean a bit of an interface hiccup for glide when I do get that merged, but the overall effect shouldn't be too significant.

@sdboyer sdboyer removed this from the MVP milestone Aug 8, 2016
@sdboyer sdboyer added this to the v0.10.0 milestone Aug 15, 2016
@sdboyer
Owner

sdboyer commented Aug 16, 2016

With #83 in, this should now be all wrapped up.

@sdboyer sdboyer closed this as completed Aug 16, 2016