[Activity dashboard] Pull PyPI downloads data nightly #664

Closed
klai95 opened this issue Sep 19, 2022 · 10 comments · Fixed by #710
klai95 commented Sep 19, 2022

In #638, we introduced API endpoints that connect to PyPI downloads data. As part of that work, we did a one-time pull of data from Snowflake. Now, let's make that a recurring cron job that runs once every 24 hours.

Additional details

  • Even though activity dashboard functionality is currently behind a feature flag, let's develop this data integration such that it runs successfully in all our environments (dev, staging and prod)
  • The underlying data should be updated once a day, and @bnelson-czi's job runs at 3am PT for ~10 minutes, so 5am PT or later is a good time to target.
  • We will be introducing other activity dashboard metrics in the future, so consider a folder/naming structure such as /activity-dashboard-metrics/pypi-downloads.csv

Earlier notes from Kevin

  • Follow-up work: improve the Snowflake code and connect it to a cron job so that we don't need to keep re-running it manually; see Activity Dashboard Prototype #662 for more.
  • As such, we can integrate the read and write of data between Snowflake and S3 into our existing infrastructure, enabling automatic updates. Furthermore, additional data sources can be integrated seamlessly should we ever need to pull from new external sources.
@klai95 klai95 changed the title Activity Dashboard - Move environment variables Activity Dashboard - Follow-up work Sep 22, 2022
@klai95 klai95 changed the title Activity Dashboard - Follow-up work Activity Dashboard Follow-up work Sep 29, 2022
@klai95 klai95 changed the title Activity Dashboard Follow-up work Activity Dashboard Follow-up Work Sep 29, 2022
@richaagarwal richaagarwal changed the title Activity Dashboard Follow-up Work [Activity dashboard] Pull PyPI downloads data nightly Sep 30, 2022
@richaagarwal richaagarwal moved this to Backlog in napari hub backlog Sep 30, 2022
@richaagarwal richaagarwal moved this from Backlog to Ready in napari hub backlog Oct 3, 2022
klai95 commented Oct 4, 2022

Below is the code to generate the Snowflake data; as part of this issue, I will need to inject this SQL code directly into the backend codebase.

-- Monthly download counts per plugin, restricted to organic pip installs.
select file_project, date_trunc('month', timestamp) as month, count(*) as num_downloads
from
(
  select country_code, project, project_type, file_project, file_version, file_type, details_installer_name, details_installer_version,
      -- Concatenate every detail field so cloud/CI fingerprints can be found by substring match.
      CONCAT(
        DETAILS,
        DETAILS_INSTALLER, DETAILS_INSTALLER_NAME, DETAILS_INSTALLER_VERSION,
        DETAILS_PYTHON,
        DETAILS_IMPLEMENTATION_NAME, DETAILS_IMPLEMENTATION_VERSION,
        DETAILS_DISTRO, DETAILS_DISTRO_NAME, DETAILS_DISTRO_VERSION, DETAILS_DISTRO_ID, DETAILS_DISTRO_LIBC, DETAILS_DISTRO_LIBC_LIB, DETAILS_DISTRO_LIBC_VERSION,
        DETAILS_SYSTEM, DETAILS_SYSTEM_NAME, DETAILS_SYSTEM_RELEASE,
        DETAILS_CPU,
        DETAILS_OPENSSL_VERSION,
        DETAILS_SETUPTOOLS_VERSION,
        DETAILS_RUSTC_VERSION
      ) as details_all,
      -- Classify each download: CI usage on cloud providers, installer-driven downloads (bandersnatch mirrors, pip), or other.
      CASE
          WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' OR details_all like '%gcp%' THEN 'ci usage'
          WHEN details_installer_name in ('bandersnatch', 'pip') THEN details_installer_name
          ELSE 'other'
      END AS download_type,
      to_timestamp(timestamp) as timestamp
  from imaging.pypi.downloads
  -- Snowflake allows select-list aliases (download_type) to be referenced in WHERE.
  where download_type = 'pip'
  and project_type = 'plugin'
  order by timestamp desc
)
group by 1,2
order by NUM_DOWNLOADS desc, MONTH, FILE_PROJECT
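
For reference, a minimal sketch of how the backend could run this query, assuming the snowflake-connector-python package; the connection parameters are placeholders, not the project's actual configuration:

import snowflake.connector

def fetch_pypi_downloads(user: str, password: str, account: str, query: str) -> list:
    # Connect to Snowflake and return the raw rows from the query above.
    # user/password/account are placeholders supplied by the caller.
    conn = snowflake.connector.connect(user=user, password=password, account=account)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()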

klai95 commented Oct 4, 2022

1. Add workspace variables in Terraform (happy-sci-imaging->Workspaces->Variables). We want to use this process to add sensitive Snowflake credentials such as the user and password.
2. Add code changes to set up Snowflake access (credentials, CloudWatch cadence, etc.) in the backend codebase.
3. Pull PyPI downloads data nightly, writing the data to S3 without writing the file locally (see the sketch below).

Note: No additional lambda will be created; this work will live in the existing backend lambda.
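
For step 3, a minimal sketch of the S3 write, assuming boto3; the bucket and function name are placeholders, and the key follows the folder structure proposed above:

import csv
import io

import boto3

def write_rows_to_s3(rows, bucket, key="activity-dashboard-metrics/pypi-downloads.csv"):
    # Serialize the query results to CSV entirely in memory.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["file_project", "month", "num_downloads"])  # header row
    writer.writerows(rows)
    # Upload the in-memory CSV straight to S3; no local file is written.
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8"))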

bnelson-czi commented:

@klai95 One minor update. Change...
WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' THEN 'ci usage'
...to...
WHEN details_all like '%amzn%' OR details_all like '%aws%' OR details_all like '%-azure%' OR details_all like '%gcp%' THEN 'ci usage'

This classifies any GCP installs as CI usage.

klai95 commented Oct 6, 2022

@richaagarwal richaagarwal moved this from Ready to In Progress in napari hub backlog Oct 10, 2022
klai95 commented Oct 19, 2022

Pinpointed the necessary files for the Terraform changes; will look into how to make those changes on my own for now:

  • https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/envs/prod/happy-napari-hub/variables.tf
  • https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/envs/prod/happy-napari-hub/main.tf
  • https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/modules/happy-napari-hub/variables.tf
  • https://github.com/chanzuckerberg/sci-imaging-infra/blob/main/terraform/modules/happy-napari-hub/main.tf

richaagarwal commented Oct 21, 2022

@klai95 Thanks for the update, Kevin! Let's make sure to configure our dev and staging environments as well, and test on dev or staging first.

Tagging @manasaV3 as I think she'll be reaching out to you to pair on this!

manasaV3 commented:

Based on discussions from our code pairing session, @klai95 and I decided to use Secrets Manager to store the credentials for Snowflake. The reason for this was to improve the security of the credentials by preventing them from being readable as plain environment variables on the lambda.

This will be achieved by having the lambda fetch the credentials from Secrets Manager before making the Snowflake query.

Steps we have identified to implement this:

1. Create a new secret to allow for storing credentials

Currently we are planning on creating the secret manually, with aws/secretsmanager as the encryption key. The credentials will be added to it as key/value pairs. We will name the secret something that can be extended to support other credentials in the environment if needed in the future. Proposed structure of the secret:

{"snowflake_user":"XXXXXXX", "snowflake_password":"YYYYYYY"}

We could also explore adding this via Terraform in the future, since we could leverage Terraform variables for it. This would allow for a single place to manage secrets.
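
For illustration, a programmatic equivalent of the manual creation step, assuming boto3 and a hypothetical secret name (omitting KmsKeyId means Secrets Manager encrypts with the default aws/secretsmanager key):

import json

import boto3

# Hypothetical secret name; the real name would follow the extensible
# naming scheme described above.
boto3.client("secretsmanager").create_secret(
    Name="napari-hub/snowflake-credentials",
    SecretString=json.dumps({"snowflake_user": "XXXXXXX", "snowflake_password": "YYYYYYY"}),
)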

2. Manually add the secret name to the existing environment config

The environment config is currently maintained as happy/env-<env-name>-config in Secrets Manager. The data is stored as JSON, and we can add additional information about the secrets as follows:

...
"secrets": {
    "credetials": {"name": "foo/bar", "arn": "arn:aws:secretsmanager:us-west-2:12345678910:secret:foo/bar-assw7uY" }
},
...

This is based on the assumption that happy/env-<env-name>-config is currently being maintained manually. This assumption will have to be verified before making updates to staging/prod environments.

3. Accessing the secret's details in terraform

We can access the secret's details from the env-config secret that is already being fetched and decoded, and assign them to the locals here:

credentials_secrets_arn = try(local.secret["secrets"]["credentials"]["arn"], "")
credentials_secrets_name = try(local.secret["secrets"]["credentials"]["name"], "")

4. Updating the backend lambda access policy

To allow the lambda to fetch secret values from Secrets Manager, we have to update the existing execution role, adding a policy that allows getting the secret value. The current execution role definition can be found here. The proposed policy addition:

  statement {
    # Allow the backend lambda's execution role to read the credentials secret.
    actions = [
      "secretsmanager:GetSecretValue"
    ]
    resources = [
      local.credentials_secrets_arn
    ]
  }

Reference to IAM policy documentation:
https://docs.aws.amazon.com/secretsmanager/latest/userguide/auth-and-access_examples.html#auth-and-access_examples_read

5. Add the secret name to the lambda environment variables

To avoid hard-coding the secret name and to allow for environment-specific secrets, we can pass the secret name to the lambda as an environment variable. We have to update the Terraform here to add that.

We can use local.credentials_secrets_name to add the secret name as an environment variable.

6. Fetching credentials from Secrets Manager in code

Before making the call to Snowflake as referenced here, we can call Secrets Manager to fetch the username and password.

Reference to api: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/secretsmanager.html#SecretsManager.Client.get_secret_value

We could have the Secrets Manager calls live in a separate class, to allow them to be reused in other use cases in the future. Proposed method signature:

# Takes a secret name as input and returns a dictionary of the key values in the secret
def get_secrets_key_pair(secret_name: str) -> dict:
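
A minimal sketch of that method using boto3, assuming the secret stores a JSON document of key/value pairs as proposed in step 1:

import json

import boto3

def get_secrets_key_pair(secret_name: str) -> dict:
    # Takes a secret name as input and returns a dictionary of the
    # key/value pairs in the secret.
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])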

richaagarwal commented:

This is based on the assumption that happy/env-<env-name>-config is currently being maintained manually. This assumption will have to be verified before making updates to staging/prod environments.

@manasaV3 It's definitely worth verifying, but I don't believe those are being maintained manually. I believe all values are set in Terraform once and then generated in AWS.

Would it be possible to follow the current process of setting credentials via Terraform, and then propose a change as a separate issue? I'm hesitant to have different variables be set in different places, and want to keep this ticket as narrowly scoped as possible. Our infra team can also help with setting up the variables according to our current process if needed.

manasaV3 commented Oct 21, 2022

I don't believe those are being maintained manually. I believe all values are set in Terraform once and then generated in AWS.

This is good to know.

My reasoning for suggesting Secrets Manager access in code over Terraform was that having the DB credentials easily accessible didn't feel like best practice. But I do see your point about variables being sourced from different places. Also, not having clarity on the sourcing of the env-config secret might take some more time to unblock.
What are your thoughts on making this a separate issue that not only tackles cleaning up the environment variables, but also removes the duplication of credentials across Secrets Manager and Terraform?

P.S. As we are going the route of having the credentials surface in lambda env variables, we should ensure that the Snowflake user is specific to our team's workflows, and that rotating its password at a later point wouldn't impact any other workflows.

richaagarwal commented:

Broke out the credentials piece into its own issue - #695 - so that we can focus specifically on the python/cron job piece in this ticket!
