From 78a85eab061741cc8f2e15e7b266b3f1fb835ba9 Mon Sep 17 00:00:00 2001 From: stevenleggdfe <51697598+stevenleggdfe@users.noreply.github.com> Date: Wed, 29 Jun 2022 17:18:48 +0100 Subject: [PATCH] Update README.md - Add SQL query to generate table inside of asking installers to type out the schema and partitioning options themselves. This also includes some default metadata for the events table. - Merge in documentation from Apply for QTS service's experience --- README.md | 206 +++++++++++++++++++++++++++++++++++++++- create-events-table.sql | 26 +++++ 2 files changed, 227 insertions(+), 5 deletions(-) create mode 100644 create-events-table.sql diff --git a/README.md b/README.md index 54e701e4..26fc311f 100644 --- a/README.md +++ b/README.md @@ -19,6 +19,10 @@ one for keeping your field configuration up to date. To set the gem up follow the steps in "Configuration", below. +## See also + +[dfe-analytics-dataform](https://github.com/DFE-Digital/dfe-analytics-dataform) provides a JavaScript package designed to generate SQL queries executed in [Dataform](https://dataform.co/) that transform data streamed into BigQuery by this gem into useful tables for quicker analysis and visualisation. + ## Names and jargon A Rails model is an analytics **Entity**. @@ -70,7 +74,199 @@ bundle install ## Configuration -### 1. Configure BigQuery connection, feature flags etc + +### 1. Get a BigQuery project setup and add initial owners + +Ask in Slack on the `#twd_data_insights` channel for someone to help you +procure a BigQuery instance in the `digital.education.gov.uk` Google Cloud +Organisation. + +Ask for your `@digital.education.gov.uk` Google account to be setup as an owner +via the IAM and Admin settings. Add other team members as necessary. + +#### Set up billing + +You also need to set up your BigQuery instance with paid billing. 
This is
+because `dfe-analytics` uses streaming, and streaming isn't allowed in the free
+tier:
+
+```bash
+accessDenied: Access Denied: BigQuery BigQuery: Streaming insert is not allowed
+in the free tier
+```
+
+### 2. Create a dataset and table
+
+You should create separate datasets for each environment (dev, qa, preprod, prod etc.).
+
+1. Open your project's BigQuery instance
+2. Go to the Analysis -> SQL Workspace section
+3. Click the three dots next to the project name and select "Create data set"
+4. Name it something like `APPLICATIONNAME_events_ENVIRONMENT`, such as `applyforqts_events_production`, and set the location to `europe-west2 (London)`
+5. Select your new dataset
+6. Open a new query execution tab
+7. Paste and run [create-events-table.sql](https://github.com/DFE-Digital/dfe-analytics/blob/main/create-events-table.sql) to create a blank events table for dfe-analytics to stream data into (editing as appropriate)
+
+### 4. Create custom roles
+
+1. Go to IAM and Admin settings > Roles
+1. Click on "+ Create role"
+1. Create the 3 roles outlined below
+
+#### Analyst
+
+| Field             | Value                                                     |
+| ----------------- | --------------------------------------------------------- |
+| Title             | **BigQuery Analyst Custom**                               |
+| Description       | Assigned to accounts used by analysts and SQL developers. |
+| ID                | `bigquery_analyst_custom`                                 |
+| Role launch stage | General Availability                                      |
+| + Add permissions | See below                                                 |
+
+Permissions for bigquery_analyst_custom + bigquery.datasets.get + bigquery.datasets.getIamPolicy + bigquery.datasets.updateTag + bigquery.jobs.create + bigquery.jobs.get + bigquery.jobs.list + bigquery.jobs.listAll + bigquery.models.export + bigquery.models.getData + bigquery.models.getMetadata + bigquery.models.list + bigquery.routines.get + bigquery.routines.list + bigquery.savedqueries.create + bigquery.savedqueries.delete + bigquery.savedqueries.get + bigquery.savedqueries.list + bigquery.savedqueries.update + bigquery.tables.createSnapshot + bigquery.tables.export + bigquery.tables.get + bigquery.tables.getData + bigquery.tables.getIamPolicy + bigquery.tables.list + bigquery.tables.restoreSnapshot + resourcemanager.projects.get +
+ +#### Developer + +| Field | Value | +| ----------------- | ---------------------------------------- | +| Title | **BigQuery Developer Custom** | +| Description | Assigned to accounts used by developers. | +| ID | `bigquery_developer_custom` | +| Role launch stage | General Availability | +| + Add permissions | See below | + +
+Permissions for bigquery_developer_custom + bigquery.connections.create + bigquery.connections.delete + bigquery.connections.get + bigquery.connections.getIamPolicy + bigquery.connections.list + bigquery.connections.update + bigquery.connections.updateTag + bigquery.connections.use + bigquery.datasets.create + bigquery.datasets.delete + bigquery.datasets.get + bigquery.datasets.getIamPolicy + bigquery.datasets.update + bigquery.datasets.updateTag + bigquery.jobs.create + bigquery.jobs.delete + bigquery.jobs.get + bigquery.jobs.list + bigquery.jobs.listAll + bigquery.jobs.update + bigquery.models.create + bigquery.models.delete + bigquery.models.export + bigquery.models.getData + bigquery.models.getMetadata + bigquery.models.list + bigquery.models.updateData + bigquery.models.updateMetadata + bigquery.models.updateTag + bigquery.routines.create + bigquery.routines.delete + bigquery.routines.get + bigquery.routines.list + bigquery.routines.update + bigquery.routines.updateTag + bigquery.savedqueries.create + bigquery.savedqueries.delete + bigquery.savedqueries.get + bigquery.savedqueries.list + bigquery.savedqueries.update + bigquery.tables.create + bigquery.tables.createSnapshot + bigquery.tables.delete + bigquery.tables.deleteSnapshot + bigquery.tables.export + bigquery.tables.get + bigquery.tables.getData + bigquery.tables.getIamPolicy + bigquery.tables.list + bigquery.tables.restoreSnapshot + bigquery.tables.setCategory + bigquery.tables.update + bigquery.tables.updateData + bigquery.tables.updateTag + resourcemanager.projects.get +
+ +#### Appender + +| Field | Value | +| ----------------- | ---------------------------------------------------------- | +| Title | **BigQuery Appender Custom** | +| Description | Assigned to accounts used by appenders (apps and scripts). | +| ID | `bigquery_appender_custom` | +| Role launch stage | General Availability | +| + Add permissions | See below | + +
+Permissions for bigquery_appender_custom + bigquery.datasets.get + bigquery.tables.get + bigquery.tables.updateData +
+ +### 5. Create an appender service account + +1. Go to [IAM and Admin settings > Create service account](https://console.cloud.google.com/projectselector/iam-admin/serviceaccounts/create?supportedpurview=project) +1. Name it like "Appender NAME_OF_SERVICE ENVIRONMENT" e.g. "Appender ApplyForQTS Production" +1. Add a description, like "Used when developing locally." +1. Grant the service account access to the project, use the "BigQuery Appender Custom" role you set up earlier + +### 6. Get an API JSON key :key: + +1. Access the service account you previously set up +1. Go to the keys tab, click on "Add key > Create new key" +1. Create a JSON private key + +The full contents of this JSON file is your `BIGQUERY_API_JSON_KEY`. + +### 7. Set up environment variables + +Putting the previous things together, to finish setting up `dfe-analytics`, you +need these environment variables: + +```bash +BIGQUERY_TABLE_NAME=events +BIGQUERY_PROJECT_ID=your-bigquery-project-name +BIGQUERY_DATASET=your-bigquery-dataset-name +BIGQUERY_API_JSON_KEY= +``` + +### 8. Configure BigQuery connection, feature flags etc ```bash bundle exec rails generate dfe:analytics:install @@ -86,7 +282,7 @@ The `dfe:analytics:install` generator will also initialize some empty config fil | `config/analytics_pii.yml` | List all fields we will obfuscate before sending to BigQuery. This should be a subset of fields in `analytics.yml` | | `config/analytics_blocklist.yml` | Autogenerated file to list all fields we will NOT send to BigQuery, to support the `analytics:check` task | -### 2. Check your fields +### 9. Check your fields A good place to start is to run @@ -110,7 +306,7 @@ config but missing from the model. **It's recommended to run this task regularly - at least as often as you run database migrations. Consider enhancing db:migrate to run it automatically.** -### 3. Enable callbacks +### 10. Enable callbacks Mix in the following modules. 
It's recommended to include them at the highest possible level in the inheritance hierarchy of your controllers and @@ -149,7 +345,7 @@ web request and model update. While you’re setting things up consider setting the config options `async: false` and `log_only: true` to take ActiveJob and BigQuery (respectively) out of the loop. -### 4. Adding specs +### Adding specs #### Testing modes @@ -190,7 +386,7 @@ end ``` -See the list of existing event types below for what kinds of event types can be used with the above matchers. +See the list of existing event types below for what kinds of event types can be used with the above matchers. ## Existing DfE Analytics event types diff --git a/create-events-table.sql b/create-events-table.sql new file mode 100644 index 00000000..5c20fbd8 --- /dev/null +++ b/create-events-table.sql @@ -0,0 +1,26 @@ +CREATE TABLE + /* Update your-project-name and your-dataset-name before running this query */ + `your-project-name.your-dataset-name.events` ( occurred_at TIMESTAMP NOT NULL OPTIONS(description="The timestamp at which the event occurred in the application."), + event_type STRING NOT NULL OPTIONS(description="The type of the event, for example web_request. This determines the schema of the data which will be included in the data field."), + user_id STRING OPTIONS(description="If a user was logged in when they sent a web request event that is this event, then this is the UID of this user."), + request_uuid STRING OPTIONS(description="Unique ID of the web request, if this event is a web request event"), + request_method STRING OPTIONS(description="Whether this web request was a GET or POST request, if this event is a web request event."), + request_path STRING OPTIONS(description="The path, starting with a / and excluding any query parameters, of this web request, if this event is a web request"), + request_user_agent STRING OPTIONS(description="The user agent of this web request, if this event is a web request. 
Allows a user's browser and operating system to be identified."),
+ request_referer STRING OPTIONS(description="The URL of any page the user was viewing when they initiated this web request, if this event is a web request. This is the full URL, including protocol (https://) and any query parameters, if the browser shared these with our application as part of the web request. It is very common for this referer to be truncated for referrals from external sites."),
+ request_query ARRAY < STRUCT < key STRING OPTIONS(description="Name of the query parameter e.g. if the URL ended ?foo=bar then this will be foo."), value ARRAY < STRING > OPTIONS(description="Contents of the query parameter e.g. if the URL ended ?foo=bar then this will be bar.") > > OPTIONS(description="ARRAY of STRUCTs, each with a key and a value. Contains any query parameters that were sent to the application as part of this web request, if this event is a web request."),
+ response_content_type STRING OPTIONS(description="Content type of any data that was returned to the browser following this web request, if this event is a web request. For example, 'text/html; charset=utf-8'. Image views, for example, may have a non-text/html content type."),
+ response_status STRING OPTIONS(description="HTTP response code returned by the application in response to this web request, if this event is a web request. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Status."),
+ DATA ARRAY < STRUCT < key STRING OPTIONS(description="Name of the field in the database."), value ARRAY < STRING > OPTIONS(description="Contents of the field in the database after it was created or updated, or just before it was imported or destroyed.") > > OPTIONS(description="ARRAY of STRUCTs, each with a key and a value. Contains a set of data points appropriate to the event_type of this event. For example, if this event was an entity create, update, delete or import event, data will contain the values of each field in the database after this event took place - according to the settings in the analytics.yml configured for this instance of dfe-analytics. Values may be anonymised as a one-way hash, depending on configuration settings."),
+ entity_table_name STRING OPTIONS(description="If event_type was an entity create, update, delete or import event, the name of the table in the database that this entity is stored in. NULL otherwise."),
+ event_tags ARRAY < STRING > OPTIONS(description="Currently left blank for future use."),
+ anonymised_user_agent_and_ip STRING OPTIONS(description="One-way hash of a combination of the user's IP address and user agent, if this event is a web request. Can be used to identify the user anonymously, even when user_id is not set. Cannot be used to identify the user over a time period of longer than about a month, because of IP address changes and browser updates."),
+ environment STRING OPTIONS(description="The application environment that the event was streamed from."),
+ namespace STRING OPTIONS(description="The namespace of the instance of dfe-analytics that streamed this event. For example this might identify the name of the service that streamed the event.") )
+PARTITION BY
+ DATE(occurred_at)
+CLUSTER BY
+ event_type OPTIONS (description="Events streamed into BigQuery from the application")
+ /* You could add extra info here, like which environment and which application */
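Once the gem is streaming into this table, a quick sanity check is a query along the following lines. This is a sketch, not part of the patch: swap in your real project and dataset names for the `your-project-name.your-dataset-name` placeholders used above.

```sql
-- Sketch: count the last 7 days of events by type, to confirm data is arriving.
-- Replace your-project-name and your-dataset-name with your real values.
SELECT
  event_type,
  COUNT(*) AS event_count
FROM
  `your-project-name.your-dataset-name.events`
WHERE
  DATE(occurred_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY
  event_type
ORDER BY
  event_count DESC
```

Because the table is partitioned by `DATE(occurred_at)`, the `WHERE` clause above prunes old partitions rather than scanning the whole table, which keeps query costs down.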