[Activity dashboard] Pull PyPI downloads data nightly #664

Closed
klai95 opened this issue Sep 19, 2022 · 10 comments · Fixed by #710
klai95 commented Sep 19, 2022

In #638, we introduced API endpoints that connect to PyPI downloads data. As part of that work, we did a one-time pull of data from Snowflake. Now, let's make that a recurring cron job that runs once every 24 hours.

Additional details

  • Even though activity dashboard functionality is currently behind a feature flag, let's develop this data integration such that it runs successfully in all our environments (dev, staging and prod)
  • The underlying data should be updated once a day, and @bnelson-czi's job runs at 3am PT for ~10 minutes, so 5am PT or later is a good time to target.
  • We will be introducing other activity dashboard metrics in the future, so consider a folder/naming structure such as /activity-dashboard-metrics/pypi-downloads.csv

Earlier notes from Kevin

  • Follow-up work: improve the Snowflake code and connect it to a cron job so that we don't need to keep re-running it manually; see Activity Dashboard Prototype #662 for more.
  • As such, we can integrate the read and write of data between Snowflake and S3 into our existing infrastructure, enabling automatic updates. Furthermore, additional data sources can be integrated seamlessly should we ever need to pull from new external sources.
@klai95 klai95 changed the title Activity Dashboard - Move environment variables Activity Dashboard - Follow-up work Sep 22, 2022
@klai95 klai95 changed the title Activity Dashboard - Follow-up work Activity Dashboard Follow-up work Sep 29, 2022
@klai95 klai95 changed the title Activity Dashboard Follow-up work Activity Dashboard Follow-up Work Sep 29, 2022
@richaagarwal richaagarwal changed the title Activity Dashboard Follow-up Work [Activity dashboard] Pull PyPI downloads data nightly Sep 30, 2022
@richaagarwal richaagarwal moved this to Backlog in napari hub backlog Sep 30, 2022
@richaagarwal richaagarwal moved this from Backlog to Ready in napari hub backlog Oct 3, 2022
klai95 commented Oct 4, 2022

Below is the code to generate the Snowflake data; as part of this issue, I will need to inject this SQL code directly into the backend codebase.

-- Monthly download counts per plugin, restricted to organic pip installs.
select file_project, date_trunc('month', timestamp) as month, count(*) as num_downloads
from
(
  select country_code, project, project_type, file_project, file_version, file_type, details_installer_name, details_installer_version,
      -- Concatenate every detail field so cloud/CI fingerprints can be found by substring match.
      CONCAT(
        DETAILS,
        DETAILS_INSTALLER, DETAILS_INSTALLER_NAME, DETAILS_INSTALLER_VERSION,
        DETAILS_PYTHON,
        DETAILS_IMPLEMENTATION_NAME, DETAILS_IMPLEMENTATION_VERSION,
        DETAILS_DISTRO, DETAILS_DISTRO_NAME, DETAILS_DISTRO_VERSION, DETAILS_DISTRO_ID, DETAILS_DISTRO_LIBC, DETAILS_DISTRO_LIBC_LIB, DETAILS_DISTRO_LIBC_VERSION,
        DETAILS_SYSTEM, DETAILS_SYSTEM_NAME, DETAILS_SYSTEM_RELEASE,
        DETAILS_CPU,
        DETAILS_OPENSSL_VERSION,
        DETAILS_SETUPTOOLS_VERSION,
        DETAILS_RUSTC_VERSION
      ) as details_all,
      -- Classify each download: CI usage on cloud providers, installer-driven downloads (bandersnatch mirrors, pip), or other.
      CASE
          WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' OR details_all like '%gcp%' THEN 'ci usage'
          WHEN details_installer_name in ('bandersnatch', 'pip') THEN details_installer_name
          ELSE 'other'
      END AS download_type,
      to_timestamp(timestamp) as timestamp
  from imaging.pypi.downloads
  -- Snowflake allows select-list aliases (download_type) to be referenced in WHERE.
  where download_type = 'pip'
  and project_type = 'plugin'
  order by timestamp desc
)
group by 1,2
order by NUM_DOWNLOADS desc, MONTH, FILE_PROJECT
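
For reference, a minimal sketch of how the backend could run this query, assuming the snowflake-connector-python package; the connection parameters are placeholders, not the project's actual configuration:

import snowflake.connector

def fetch_pypi_downloads(user: str, password: str, account: str, query: str) -> list:
    # Connect to Snowflake and return the raw rows from the query above.
    # user/password/account are placeholders supplied by the caller.
    conn = snowflake.connector.connect(user=user, password=password, account=account)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()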

klai95 commented Oct 4, 2022

1. Add workspace variables in Terraform (happy-sci-imaging->Workspaces->Variables). We want to use this process to add sensitive Snowflake credentials such as the user and password.
2. Add code changes to set up Snowflake access (credentials, CloudWatch cadence, etc.) in the backend codebase.
3. Pull PyPI downloads data nightly, writing the data to S3 without writing the file locally (see the sketch below).

Note: No additional lambda will be created; this work will live in the existing backend lambda.
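
For step 3, a minimal sketch of the S3 write, assuming boto3; the bucket and function name are placeholders, and the key follows the folder structure proposed above:

import csv
import io

import boto3

def write_rows_to_s3(rows, bucket, key="activity-dashboard-metrics/pypi-downloads.csv"):
    # Serialize the query results to CSV entirely in memory.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["file_project", "month", "num_downloads"])  # header row
    writer.writerows(rows)
    # Upload the in-memory CSV straight to S3; no local file is written.
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8"))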

bnelson-czi commented:

@klai95 One minor update. Change...
WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' THEN 'ci usage'
...to...
WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' OR details_all like '%gcp%' THEN 'ci usage'

This classifies any GCP installs as CI usage.

klai95 commented Oct 6, 2022

@richaagarwal richaagarwal moved this from Ready to In Progress in napari hub backlog Oct 10, 2022
klai95 commented Oct 19, 2022

Pinpointed the necessary files for the Terraform changes; will look into how to make those changes on my own for now:

  • https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/envs/prod/happy-napari-hub/variables.tf
  • https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/envs/prod/happy-napari-hub/main.tf
  • https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/modules/happy-napari-hub/variables.tf
  • https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/modules/happy-napari-hub/main.tf

richaagarwal commented Oct 21, 2022

@klai95 Thanks for the update, Kevin! Let's make sure to configure our dev and staging environments as well, and test on dev or staging first.

Tagging @manasaV3 as I think she'll be reaching out to you to pair on this!

manasaV3 commented:

Based on discussions from our code pairing session, @klai95 and I decided to use Secrets Manager to store the credentials for Snowflake. The reason for this was to improve the security of the credentials by preventing them from being readable as plain environment variables on the lambda.

This will be achieved by having the lambda fetch the credentials from Secrets Manager before making the Snowflake query.

Steps we have identified to implement this:

1. Create a new secret to allow for storing credentials

Currently we are planning on creating the secret manually, with aws/secretsmanager as the encryption key. The credentials will be added to it as key/value pairs. We will name the secret something that can be extended to support other credentials in the environment if needed in the future. Proposed structure of the secret:

{"snowflake_user":"XXXXXXX", "snowflake_password":"YYYYYYY"}

We could also explore adding this via Terraform in the future, since we could leverage Terraform variables for it. This would allow for a single place to manage secrets.
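
For illustration, a programmatic equivalent of the manual creation step, assuming boto3 and a hypothetical secret name (omitting KmsKeyId means Secrets Manager encrypts with the default aws/secretsmanager key):

import json

import boto3

# Hypothetical secret name; the real name would follow the extensible
# naming scheme described above.
boto3.client("secretsmanager").create_secret(
    Name="napari-hub/snowflake-credentials",
    SecretString=json.dumps({"snowflake_user": "XXXXXXX", "snowflake_password": "YYYYYYY"}),
)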

2. Manually add the secret name to the existing environment config

The environment config is currently maintained as happy/env-<env-name>-config in Secrets Manager. The data is stored as JSON, and we can add additional information about the secrets as follows:

...
"secrets": {
    "credetials": {"name": "foo/bar", "arn": "arn:aws:secretsmanager:us-west-2:12345678910:secret:foo/bar-assw7uY" }
},
...

This is based on the assumption that happy/env-<env-name>-config is currently being maintained manually. This assumption will have to be verified before making updates to staging/prod environments.

3. Accessing the secret's details in terraform

We can access the secret's details from the env-config secret that is already being fetched and decoded, and assign them to the locals here:

credentials_secrets_arn = try(local.secret["secrets"]["credentials"]["arn"], "")
credentials_secrets_name = try(local.secret["secrets"]["credentials"]["name"], "")

4. Updating the backend lambda access policy

To allow the lambda to fetch secret values from Secrets Manager, we have to update the existing execution role, adding a policy that allows getting the secret value. The current execution role definition can be found here. The proposed policy addition:

  statement {
    # Allow the backend lambda's execution role to read the credentials secret.
    actions = [
      "secretsmanager:GetSecretValue"
    ]
    resources = [
      local.credentials_secrets_arn
    ]
  }

Reference to IAM policy documentation:
https://docs.aws.amazon.com/secretsmanager/latest/userguide/auth-and-access_examples.html#auth-and-access_examples_read

5. Add the secret name to the lambda environment variables

To avoid hard-coding the secret name and to allow for environment-specific secrets, we can pass the secret name to the lambda as an environment variable. We have to update the Terraform here to add that.

We can use local.credentials_secrets_name to add the secret name as an environment variable.

6. Fetching credentials from Secrets Manager in code

Before making the call to Snowflake as referenced here, we can call Secrets Manager to fetch the username and password.

Reference to api: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/secretsmanager.html#SecretsManager.Client.get_secret_value

We could have the Secrets Manager calls live in a separate class, to allow them to be reused in other use cases in the future. Proposed method signature:

# Takes a secret name as input and returns a dictionary of the key values in the secret
def get_secrets_key_pair(secret_name: str) -> dict:
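
A minimal sketch of that method using boto3, assuming the secret stores a JSON document of key/value pairs as proposed in step 1:

import json

import boto3

def get_secrets_key_pair(secret_name: str) -> dict:
    # Takes a secret name as input and returns a dictionary of the
    # key/value pairs in the secret.
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])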

richaagarwal commented:

This is based on the assumption that happy/env-<env-name>-config is currently being maintained manually. This assumption will have to be verified before making updates to staging/prod environments.

@manasaV3 It's definitely worth verifying, but I don't believe those are being maintained manually. I believe all values are set in Terraform once and then generated in AWS.

Would it be possible to follow the current process of setting credentials via Terraform, and then propose a change as a separate issue? I'm hesitant to have different variables be set in different places, and want to keep this ticket as narrowly scoped as possible. Our infra team can also help with setting up the variables according to our current process if needed.

manasaV3 commented Oct 21, 2022

I don't believe those are being maintained manually. I believe all values are set in Terraform once and then generated in AWS.

This is good to know.

My reasoning for suggesting Secrets Manager access in code over Terraform was that having the DB credentials easily accessible didn't feel like best practice. But I do see your point about variables being sourced from different places. Also, not having clarity on the sourcing of the env-config secret might take some more time to unblock.
What are your thoughts on making this a separate issue that not only tackles cleaning up the environment variables, but also removes the duplication of credentials across Secrets Manager and Terraform?

P.S. As we are going the route of having the credentials surface in lambda env variables, we should ensure that the Snowflake user is specific to our team's workflows, and that rotating its password at a later point wouldn't impact any other workflows.

richaagarwal commented:

Broke out the credentials piece into its own issue - #695 - so that we can focus specifically on the python/cron job piece in this ticket!
