diff --git a/rfc/0001-oauth-access-tokens.rst b/rfc/0001-oauth-access-tokens.rst new file mode 100644 index 0000000..09ef552 --- /dev/null +++ b/rfc/0001-oauth-access-tokens.rst @@ -0,0 +1,345 @@ +=========================================== +OAuth 2.0 Authorization in the Hail Service +=========================================== + +.. author:: Daniel Goldstein +.. date-accepted:: Leave blank. This will be filled in when the proposal is accepted. +.. ticket-url:: Leave blank. This will eventually be filled with the + ticket URL which will track the progress of the + implementation of the feature. +.. implemented:: Leave blank. This will be filled in with the first Hail version which + implements the described feature. +.. header:: This proposal is `discussed at this pull request `_. + **After creating the pull request, edit this file again, update the + number in the link, and delete this bold sentence.** +.. sectnum:: +.. contents:: +.. role:: python(code) + +Motivation +========== + +This proposal focuses on the way by which users of Hail Batch and Hail Query-on-Batch +(from now on referred to as the Hail Service) authorize programmatic access to the hail API. + +The Hail Service authenticates users using the OAuth2 protocol, relying on either +GCP IAM or Azure AD as the identity providers. However, while the Hail Service +relies on these identity providers for authentication, it currently does *not* use them +to authorize access to Hail APIs. The Hail ``auth`` service acts as an Authorization +Server for the Hail API, minting long-lived tokens after the OAuth2 flow that are persisted +on user machines. Minting our own tokens imposes a maintenance and security burden +on the Hail team and any operators of a Hail Service. + +This proposal deprecates the use of hail-minted tokens in favor of using +access tokens from the identity providers listed above to authorize API access.
+This removes the security burden of minting and protecting our own authorization +tokens while reducing code complexity since cloud access tokens are already +used within the hail codebase to access cloud APIs. + +Proposed Change Specification +============================= + +Currently, requests to the Hail APIs send a hail-minted Bearer token in the +``Authorization`` header of HTTP requests. This token is stored in a well-known +location on the user's disk. +For user machines, this file is persisted during the login flow ``hailctl auth login``. +For use in Batch jobs, the tokens are stored in Kubernetes secrets and delivered +to the Batch Worker as part of the job spec. + +This proposal adds the ability for HTTP requests from hail clients to send +OAuth2 access tokens in the ``Authorization`` header instead of hail-minted +tokens. The ``auth`` service will: + +- Assert the validity, expiration and audience of access tokens and associate + them with users of the system. +- Support hail-minted bearer tokens for backwards compatibility with old clients. + +Hail clients will be updated to use access tokens in requests to Hail APIs. How +they do so is described in the following subsections. + + +Overview of Relevant OAuth2 Background +-------------------------------------- + +Prior to discussing the details of the implementation, it is worth covering some +background on OAuth2. Note that much of this functionality is encapsulated in the +Google OAuth and AAD client libraries that we use, but a thorough understanding +is valuable to ensure that we are using them properly. + +We'll consider four primary entities in an OAuth2 interaction: + +- The user/identity +- The client (e.g. 
the hail python library) +- The Authorization server (Google IAM or AAD) +- The API/Resource server (the hail service) + + +The first step in this interaction is a login flow initiated by the client, +during which the user must authenticate with the Authorization Server and +authorize the client to act on the user's behalf with a given set of capabilities (scopes). +From that point on, the client will be able to request tokens with a subset of +the approved scopes without manual approval. Note that this step does not apply +to robot identities. + +The next step occurs when the client wants to perform some operation against the +Resource Server. To do so it must first request an access token from the Authorization +Server. Three important factors to note about the access token are: + +- The scopes the token is granted. These specify what kinds of operations the + token is authorized to execute. These prevent the token from being used for + operations it was not intended for. +- The identity represented by that token. This is either the user or robot identity + and in JWTs can be uniquely identified by the + `sub `_ claim. This prevents + the token from being used to act on a different identity's behalf. +- The "intended audience" of the token. What this means exactly varies between + Google and Azure, but in both cases is represented by the + `aud `_ claim and prevents + the token from being used against a different Resource Server from that for which + it was generated. + +The client should then request a token with the minimal set of scopes required to +perform the desired operation (in our case just enough to identify the user) and with +an audience that will be accepted by the Resource Server. It then sends this token +in the ``Authorization`` header of requests to the Resource Server.
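As a minimal sketch of this last step, the following shows a client attaching an access token as a Bearer credential. The function name ``build_authorized_request`` and the example URL are hypothetical and are not part of the hail codebase; the token is assumed to have already been obtained from the Authorization Server. The request is only constructed here, not sent.

```python
import urllib.request


def build_authorized_request(url: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying an OAuth2 access token.

    Hypothetical helper for illustration only; the access token is assumed
    to have been obtained from the identity provider beforehand.
    """
    if not access_token:
        raise ValueError("an access token is required")
    # The token travels as a Bearer credential in the Authorization header.
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
    )


# Placeholder URL and token, used only to show the resulting header.
request = build_authorized_request("https://hail.example.invalid/api", "example-token")
```

The Resource Server then reads the same header and extracts the token that follows the ``Bearer`` prefix.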
+ +When the Resource Server receives the request, it can verify the validity and +expiration of the token, identify the user through the ``sub`` claim, and finally +accept the token only if its ``aud`` claim is one that the Resource Server recognizes +and permits. This way tokens from that user that were generated and intended +for other systems cannot be replayed against this Resource Server. + + +Unfortunately Google and Azure have slightly different approaches to this interaction. +Both scenarios will involve installing an OAuth2 client credential on the user's machine +to be used by the hail python library, and they will involve similar changes to the ``auth`` +service. However, their implementations vary slightly when it comes to the audience +claim, so the process to obtain access tokens will look slightly different. +The following sections detail how that process would work with those two identity providers. + + +Google Implementation +--------------------- + +When a client application requests an access token from Google IAM, the ``aud`` +claim is always set to the unique ID of the client. On a user's machine, ``aud`` +would be the client ID of the OAuth2 Client used to obtain that credential. For +service accounts, it would be the unique ID of the service account in IAM. Note +that in the service account case ``aud == sub``, but not in the case of the hail +python library acting on behalf of a user. + +I find this unintuitive, but I suppose this can be interpreted as "the intended +recipient of this token is the application that requested it, and Resource Servers +should maintain a list of trusted applications". + +Thus, when the ``auth`` service validates an access token, it must assert that +the ``aud`` claim is *either* the Client ID for the python library OAuth2 Client +or the unique ID of a hail-owned service account in the system. Doing so protects +against client applications that we don't control impersonating human users to our +system. 
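The audience check described above can be sketched as follows. This is an illustration under stated assumptions, not the ``auth`` service's implementation: ``identify_caller``, ``UntrustedToken`` and the example IDs are hypothetical names, the claims are assumed to have already been decoded into a mapping (for Google, by submitting the opaque token to a Google API), and ``aud`` is assumed to be a single string.

```python
import time
from typing import Mapping, Optional, Set


class UntrustedToken(Exception):
    """Raised when a token's claims do not satisfy the Resource Server."""


def identify_caller(
    claims: Mapping[str, object],
    trusted_audiences: Set[str],
    now: Optional[float] = None,
) -> str:
    """Return the ``sub`` claim if the token is unexpired and its audience is trusted.

    Hypothetical sketch: ``claims`` is an already-decoded set of token claims
    and ``trusted_audiences`` holds the python library's OAuth2 Client ID plus
    the unique IDs of hail-owned service accounts.
    """
    now = time.time() if now is None else now

    # Reject tokens that are expired or that carry no expiration at all.
    exp = claims.get("exp")
    if exp is None or float(exp) <= now:
        raise UntrustedToken("token is expired or has no expiration")

    # Accept only tokens issued to a client that we control.
    if claims.get("aud") not in trusted_audiences:
        raise UntrustedToken("token was issued to an untrusted audience")

    sub = claims.get("sub")
    if not sub:
        raise UntrustedToken("token does not identify a subject")
    return str(sub)
```

For example, with ``trusted_audiences = {"hail-python-client-id"}``, claims carrying a different ``aud`` are rejected even when they are unexpired and belong to a registered user, which is the replay protection described above.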
+ +Another detail of note is that Google IAM access tokens are *opaque*, so in order +to decode them the ``auth`` server must submit them to a Google API. The ``auth`` +service should take care to properly cache requests. + + +Azure Implementation +--------------------- + +Azure, however, interprets "intended recipient" as the Resource Server for which +a token is destined, and infers that recipient based on the scopes requested +by the client. For example, requesting the scope ``https://management.azure.com/.default`` +results in tokens whose ``aud`` claim is the ID of the Azure Resource Manager API. In order to use +non-Azure Resource Servers, AAD allows you to create custom scopes. We register +a custom scope like ``api://`` with the AAD OAuth2 Client application +and then any code that requests that scope will receive a token whose ``aud`` +claim is the ID of that OAuth2 Client application. + +This simplifies the work of the ``auth`` service, as there is a single audience +it must trust. However, it means that we must communicate this custom scope to +all our environments. + +As opposed to the opaque access tokens in Google, Azure access tokens are JWTs. +That means they can be decoded and cryptographically validated by the ``auth`` +service without making a network request. + + +User Machine Configuration Changes +---------------------------------- + +If we remove hail-minted tokens, the hail python client needs a mechanism +for requesting access tokens on behalf of the user. The way to do this is to have +a Desktop OAuth2 client credential that lives on the user's machine that administers +the OAuth2 flow and is later used to request tokens. + +Instead of depositing a ``tokens.json`` file during the login flow, +``hailctl auth login`` will instead result in the following file placed in the +user's configuration directory at ``$XDG_CONFIG_HOME/hail/identity.json``. + +.. code-block:: json + + { + "idp": "Google" | "Microsoft", + ... 
Optional IDP-Specific OAuth2 client secret ... + } + +This file contains the identity provider the user used to log into the Hail +Service and an OAuth2 client credential file issued by the Hail Service +for that identity provider along with the refresh token. This client credential +will be used in future requests by the client to obtain scoped access tokens +from the identity provider that are intended for the Hail Service. In Azure, +this will include the custom scope that the client needs for requests. + +For further information on the details of the OAuth2 flow, see the User Login +Flow Changes section. + +If a user does not reauthenticate after updating their hail version, +the client will continue to use the extant ``tokens.json`` file. + + +Batch Job Configuration Changes +------------------------------- +Batch jobs do not authenticate through an OAuth2 flow in the way that human users do. +The service account keys or metadata server available in batch jobs both provide +ways to easily obtain access tokens. All that the job needs to know is which identity +provider it should use. Batch jobs will then be provided with the ``HAIL_IDENTITY_PROVIDER`` +environment variable which is interpreted by the client application as the following +identity config: ``{"idp": "$HAIL_IDENTITY_PROVIDER"}``. Without the presence of a +specific OAuth2 client to use for generating tokens, the hail library will fall +back to the latent credentials in the environment, e.g. ``GOOGLE_APPLICATION_CREDENTIALS`` +or the metadata server. + +In Azure, there will be another environment variable ``HAIL_AZURE_OAUTH_SCOPE`` +that clients must use to obtain an appropriate audience claim. + + +User Login Flow Changes +----------------------- + +Currently, ``hailctl auth login`` performs a sort of mixed desktop and server +OAuth2 login flow, which occurs in the following sequence: + +1. User executes ``hailctl auth login`` via the command line +2. 
The user's machine prompts the hail ``auth`` service to initiate a login flow + by making a request to ``/api/v1alpha/login``. The ``auth`` service responds + with an authorization URL that ``hailctl`` then opens in a browser. +3. The user authenticates and provides user consent +4. The OAuth2 provider authenticates the user and sends a callback to ``localhost`` + with an authorization code. +5. ``hailctl`` sends that authorization code to the ``auth`` service, which uses + it to complete the OAuth flow, receiving an ID token, an access token and a refresh token. +6. The ``auth`` service uses the ID token to identify the user and assert that the + user has an account with the Service. +7. The ``auth`` service mints a token that it sends in the response to ``hailctl``. +8. ``hailctl`` persists the token for future authorization of API calls to the Service. + + +The proposed ``hailctl auth login`` flow is as follows: + +1. User executes ``hailctl auth login`` via the command line +2. ``hailctl`` obtains the OAuth2 client credentials from a well-known, public + endpoint on the ``auth`` API. +3. ``hailctl`` performs the full Desktop OAuth flow on the user's machine, + persisting the ``refresh_token`` it receives at the end of the flow along with + the OAuth2 client credentials. +4. ``hailctl`` attempts to access the ``/userinfo`` endpoint on the ``auth`` service + to confirm that the logged in user is registered with the Hail service. + + +The programmatic OAuth2 flow will use a different OAuth2 client than that used +in the typical Web flow. When conducting a web-based flow, the OAuth2 client credentials +can be kept secret by the server and Google can verify that the request to initiate a +login flow is coming from a source that owns the OAuth2 client. As such, it is valuable to +keep the OAuth2 client actually secret. However, this guarantee does not exist in the world of +Desktop applications, as client secrets stored on user devices *cannot be considered secret*. 
+In order to preserve the integrity of the web-based login, it is best to maintain a separate +OAuth2 client that is issued specifically for desktop applications. There is also an intuitive +argument for why we should generate two OAuth clients, as the hail python library and the hail +web service are two distinct applications, and we could in the future want different scopes +in those two environments. + +It is worth noting that attackers with access to the user's filesystem can use the +``refresh_token`` to create access tokens. That being said, the access tokens +that an attacker could obtain from this OAuth2 secret can only be used outside of the Hail +Service to obtain the user's email. If an attacker wanted additional scopes they would need +to initiate an OAuth2 flow which would require manual user consent for the elevated permissions. +More realistically, an attacker can just as easily obtain ``gcloud`` access tokens that are likely +to be far more privileged. So it is reasonable to say that we are not introducing new +vulnerabilities to the user's machine. + + +Effect and Interactions +----------------------- + +It is worth comparing the privileges obtained in both the current and proposed scenario +to determine if there are any increased risks under the new regime. + +For hail-minted access tokens: + +- An attacker who obtains a token can fully impersonate a user to the Hail Service +- The token is *only* authorized to access the Hail Service +- Tokens can be explicitly revoked by the user by executing ``hailctl auth logout`` + +For hail-audience client secret: + +- An attacker can just as easily access the client secret as they can the hail tokens. + The attacker can then generate access tokens. +- The audience claim of these access tokens will be the hail python package, so these + tokens can only be used against the Hail Service. +- Unlike the hail-minted tokens, the Bearer tokens in requests are short-lived + access tokens. 
So any access tokens that might be leaked are unlikely to pose + a security risk. +- The client can dynamically configure the validity period for access tokens it + generates. +- The credentials can be invalidated by the user revoking the refresh token. This + will be a side effect of ``hailctl auth logout``. + + +Alternatives +------------ + +An alternative to persisting a hail-owned client secret on the user's machine +is to use the latent credentials from ``gcloud`` Application Default Credentials. +However, this is seen as an abuse of the OAuth2 model. Using Application Default +Credentials would require that the ``auth`` service accept tokens with the +``gcloud`` audience claim. It would obviate the need to authenticate with the +Hail Service and any entity with a gcloud-generated user access token +would be able to impersonate the user to the Hail Service. Additionally, the +Hail Service, if compromised, could impersonate the user to other APIs that +accept the ``gcloud`` audience claim. + +Another alternative is simply to not change our authorization model. Doing nothing +would leave Hail Service operators with the management of token secrets. It would +also make it more difficult to integrate hail services into other +environments that use access-token based authentication such as the Terra platform. + +Not an alternative, but an extension to this model could be encrypting and protecting +access to the OAuth2 client secret using something like Apple Keychain or equivalent +on other operating systems. The user would then be prompted to enter their password +when ``hailctl`` attempts to access the file and would therefore make it obvious to +the user if other applications try to do the same. Given that even ``gcloud`` does +not do this, we are leaving it out of this initial proposal. + + +Unresolved Questions +-------------------- + +It is as yet unclear whether regular rotation of client secrets stored on +client devices should be performed. 
If that should be the case, we could do so +without much effort because the Hail Service distributes the client secrets in +the first place. We would simply need to configure the ``hailctl`` client to reinitiate +a login flow when the credential expires or is revoked. + +It is also unclear whether there is any way to somehow restrict the audience of +service account access tokens in Google as you can in Azure. I think this is a minor +concern as the tokens we'll generate for hail auth will be strictly scoped. + + +Endorsements +------------- +(Optional) This section provides an opportunity for any third parties to express their +support for the proposal, and to say why they would like to see it adopted. +It is not mandatory to have any endorsements at all, but the more substantial +the proposal is, the more desirable it is to offer evidence that there is +significant demand from the community. This section is one way to provide +such evidence.