Update README.md
- Add SQL query to generate the events table instead of asking installers to type out the schema and partitioning options themselves. This also includes some default metadata for the events table.
- Merge in documentation from the Apply for QTS service's experience
stevenleggdfe authored and duncanjbrown committed Jul 1, 2022
1 parent 52fc43d commit 78a85ea
Showing 2 changed files with 227 additions and 5 deletions.
206 changes: 201 additions & 5 deletions README.md
@@ -19,6 +19,10 @@ one for keeping your field configuration up to date.

To set the gem up follow the steps in "Configuration", below.

## See also

[dfe-analytics-dataform](https://github.com/DFE-Digital/dfe-analytics-dataform) provides a JavaScript package designed to generate SQL queries executed in [Dataform](https://dataform.co/) that transform data streamed into BigQuery by this gem into useful tables for quicker analysis and visualisation.

## Names and jargon

A Rails model is an analytics **Entity**.
@@ -70,7 +74,199 @@ bundle install

## Configuration

### 1. Get a BigQuery project set up and add initial owners

Ask in Slack on the `#twd_data_insights` channel for someone to help you
procure a BigQuery instance in the `digital.education.gov.uk` Google Cloud
Organisation.

Ask for your `@digital.education.gov.uk` Google account to be set up as an owner
via the IAM and Admin settings. Add other team members as necessary.
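
If your team ends up doing this yourselves, the equivalent `gcloud` command is sketched below (the project name and account email are placeholders):

```bash
# Grant a Google account the Owner role on the project (placeholder names)
gcloud projects add-iam-policy-binding your-project-name \
  --member="user:firstname.lastname@digital.education.gov.uk" \
  --role="roles/owner"
```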

#### Set up billing

You also need to set up your BigQuery instance with paid billing. This is
because `dfe-analytics` uses streaming inserts, which aren't allowed in the
free tier. Without billing enabled you'll see errors like:

```
accessDenied: Access Denied: BigQuery BigQuery: Streaming insert is not allowed
in the free tier
```

### 2. Create a dataset and table

You should create a separate dataset for each environment (dev, qa, preprod, prod, etc.).

1. Open your project's BigQuery instance
2. Go to the Analysis -> SQL Workspace section
3. Click the three dots next to the project name and select "Create data set"
4. Name it something like `APPLICATIONNAME_events_ENVIRONMENT`, such as `applyforqts_events_production`, and set the location to `europe-west2 (London)`
5. Select your new dataset
6. Open a new query execution tab.
7. Paste and run [create-events-table.sql](https://github.com/DFE-Digital/dfe-analytics/blob/main/create-events-table.sql) (editing the project and dataset names as appropriate) to create a blank events table for dfe-analytics to stream data into. A command-line alternative for the dataset creation steps is sketched after this list.
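
If you prefer the command line, the dataset creation in steps 3 and 4 can also be done with the `bq` tool from the Google Cloud SDK. A minimal sketch, assuming the SDK is installed and authenticated (names are placeholders):

```bash
# Create a dataset in the London region (adjust project and dataset names)
bq mk \
  --location=europe-west2 \
  --dataset \
  your-project-name:applyforqts_events_production
```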

### 3. Create custom roles

1. Go to IAM and Admin settings > Roles
1. Click on "+ Create role"
1. Create the 3 roles outlined below

#### Analyst

| Field | Value |
| ----------------- | -------------------------------------------------- |
| Title | **BigQuery Analyst Custom** |
| Description | Assigned to accounts used by analysts and SQL developers. |
| ID | `bigquery_analyst_custom` |
| Role launch stage | General Availability |
| + Add permissions | See below |

<details>
<summary>Permissions for bigquery_analyst_custom</summary>
bigquery.datasets.get
bigquery.datasets.getIamPolicy
bigquery.datasets.updateTag
bigquery.jobs.create
bigquery.jobs.get
bigquery.jobs.list
bigquery.jobs.listAll
bigquery.models.export
bigquery.models.getData
bigquery.models.getMetadata
bigquery.models.list
bigquery.routines.get
bigquery.routines.list
bigquery.savedqueries.create
bigquery.savedqueries.delete
bigquery.savedqueries.get
bigquery.savedqueries.list
bigquery.savedqueries.update
bigquery.tables.createSnapshot
bigquery.tables.export
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.getIamPolicy
bigquery.tables.list
bigquery.tables.restoreSnapshot
resourcemanager.projects.get
</details>

#### Developer

| Field | Value |
| ----------------- | ---------------------------------------- |
| Title | **BigQuery Developer Custom** |
| Description | Assigned to accounts used by developers. |
| ID | `bigquery_developer_custom` |
| Role launch stage | General Availability |
| + Add permissions | See below |

<details>
<summary>Permissions for bigquery_developer_custom</summary>
bigquery.connections.create
bigquery.connections.delete
bigquery.connections.get
bigquery.connections.getIamPolicy
bigquery.connections.list
bigquery.connections.update
bigquery.connections.updateTag
bigquery.connections.use
bigquery.datasets.create
bigquery.datasets.delete
bigquery.datasets.get
bigquery.datasets.getIamPolicy
bigquery.datasets.update
bigquery.datasets.updateTag
bigquery.jobs.create
bigquery.jobs.delete
bigquery.jobs.get
bigquery.jobs.list
bigquery.jobs.listAll
bigquery.jobs.update
bigquery.models.create
bigquery.models.delete
bigquery.models.export
bigquery.models.getData
bigquery.models.getMetadata
bigquery.models.list
bigquery.models.updateData
bigquery.models.updateMetadata
bigquery.models.updateTag
bigquery.routines.create
bigquery.routines.delete
bigquery.routines.get
bigquery.routines.list
bigquery.routines.update
bigquery.routines.updateTag
bigquery.savedqueries.create
bigquery.savedqueries.delete
bigquery.savedqueries.get
bigquery.savedqueries.list
bigquery.savedqueries.update
bigquery.tables.create
bigquery.tables.createSnapshot
bigquery.tables.delete
bigquery.tables.deleteSnapshot
bigquery.tables.export
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.getIamPolicy
bigquery.tables.list
bigquery.tables.restoreSnapshot
bigquery.tables.setCategory
bigquery.tables.update
bigquery.tables.updateData
bigquery.tables.updateTag
resourcemanager.projects.get
</details>

#### Appender

| Field | Value |
| ----------------- | ---------------------------------------------------------- |
| Title | **BigQuery Appender Custom** |
| Description | Assigned to accounts used by appenders (apps and scripts). |
| ID | `bigquery_appender_custom` |
| Role launch stage | General Availability |
| + Add permissions | See below |

<details>
<summary>Permissions for bigquery_appender_custom</summary>
bigquery.datasets.get
bigquery.tables.get
bigquery.tables.updateData
</details>
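
Custom roles can also be created with `gcloud`; a sketch for the appender role, using the three permissions listed above (the project name is a placeholder):

```bash
# Create the custom appender role with the minimal streaming permissions
gcloud iam roles create bigquery_appender_custom \
  --project=your-project-name \
  --title="BigQuery Appender Custom" \
  --description="Assigned to accounts used by appenders (apps and scripts)." \
  --permissions=bigquery.datasets.get,bigquery.tables.get,bigquery.tables.updateData \
  --stage=GA
```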

### 4. Create an appender service account

1. Go to [IAM and Admin settings > Create service account](https://console.cloud.google.com/projectselector/iam-admin/serviceaccounts/create?supportedpurview=project)
1. Name it like "Appender NAME_OF_SERVICE ENVIRONMENT" e.g. "Appender ApplyForQTS Production"
1. Add a description, like "Used when developing locally."
1. Grant the service account access to the project using the "BigQuery Appender Custom" role you set up earlier (see the `gcloud` sketch below)
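
Equivalently, a `gcloud` sketch of this step (the service account ID and project name are placeholders; the custom role from the previous section must already exist):

```bash
# Create the appender service account (placeholder names throughout)
gcloud iam service-accounts create appender-applyforqts-production \
  --project=your-project-name \
  --display-name="Appender ApplyForQTS Production"

# Bind the custom appender role to the new service account
gcloud projects add-iam-policy-binding your-project-name \
  --member="serviceAccount:appender-applyforqts-production@your-project-name.iam.gserviceaccount.com" \
  --role="projects/your-project-name/roles/bigquery_appender_custom"
```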

### 5. Get an API JSON key :key:

1. Access the service account you previously set up
1. Go to the keys tab, click on "Add key > Create new key"
1. Create a JSON private key

The full contents of this JSON file are your `BIGQUERY_API_JSON_KEY`.
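
The key can also be created from the command line (a sketch; the service account email is a placeholder):

```bash
# Download a JSON key for the appender service account
gcloud iam service-accounts keys create bigquery-api-key.json \
  --iam-account=appender-applyforqts-production@your-project-name.iam.gserviceaccount.com
```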

### 6. Set up environment variables

Putting the previous steps together, to finish setting up `dfe-analytics` you
need these environment variables:

```bash
BIGQUERY_TABLE_NAME=events
BIGQUERY_PROJECT_ID=your-bigquery-project-name
BIGQUERY_DATASET=your-bigquery-dataset-name
BIGQUERY_API_JSON_KEY=<contents of the JSON, make sure to strip or escape newlines>
```
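
One way to collapse the JSON key onto a single line is with `jq` (a sketch, assuming the key was saved as `bigquery-api-key.json`):

```bash
# Compact the JSON key onto one line so it survives .env-style parsing
export BIGQUERY_API_JSON_KEY=$(jq -c . bigquery-api-key.json)
```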

### 7. Configure BigQuery connection, feature flags etc

```bash
bundle exec rails generate dfe:analytics:install
@@ -86,7 +282,7 @@ The `dfe:analytics:install` generator will also initialize some empty config files:
| `config/analytics_pii.yml` | List all fields we will obfuscate before sending to BigQuery. This should be a subset of fields in `analytics.yml` |
| `config/analytics_blocklist.yml` | Autogenerated file to list all fields we will NOT send to BigQuery, to support the `analytics:check` task |

### 8. Check your fields

A good place to start is to run

@@ -110,7 +306,7 @@ config but missing from the model.
**It's recommended to run this task regularly - at least as often as you run
database migrations. Consider enhancing db:migrate to run it automatically.**
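
A minimal sketch of that `db:migrate` enhancement in your app's `Rakefile`, assuming the check task is exposed as `dfe:analytics:check` (adjust to whatever the task is called in your app):

```ruby
# Rakefile — append the analytics field check to db:migrate (sketch).
# Place this after Rails.application.load_tasks so db:migrate exists.
Rake::Task["db:migrate"].enhance do
  Rake::Task["dfe:analytics:check"].invoke
end
```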

### 9. Enable callbacks

Mix in the following modules. It's recommended to include them at the
highest possible level in the inheritance hierarchy of your controllers and
@@ -149,7 +345,7 @@ web request and model update. While you’re setting things up consider setting
the config options `async: false` and `log_only: true` to take ActiveJob and
BigQuery (respectively) out of the loop.
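
For example, in the initializer generated earlier (a sketch using the two options named above; the file path may differ in your app):

```ruby
# config/initializers/dfe_analytics.rb — development-friendly settings (sketch)
DfE::Analytics.configure do |config|
  config.async    = false # run event jobs inline rather than via ActiveJob
  config.log_only = true  # log events locally instead of sending to BigQuery
end
```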

### Adding specs

#### Testing modes

@@ -190,7 +386,7 @@ end

```

See the list of existing event types below for what kinds of event types can be used with the above matchers.

## Existing DfE Analytics event types

26 changes: 26 additions & 0 deletions create-events-table.sql
@@ -0,0 +1,26 @@
CREATE TABLE
/* Update your-project-name and your-dataset-name before running this query */
`your-project-name.your-dataset-name.events` ( occurred_at TIMESTAMP NOT NULL OPTIONS(description="The timestamp at which the event occurred in the application."),
event_type STRING NOT NULL OPTIONS(description="The type of the event, for example web_request. This determines the schema of the data which will be included in the data field."),
user_id STRING OPTIONS(description="If a user was logged in when they sent a web request event that is this event, then this is the UID of this user."),
request_uuid STRING OPTIONS(description="Unique ID of the web request, if this event is a web request event"),
request_method STRING OPTIONS(description="Whether this web request was a GET or POST request, if this event is a web request event."),
request_path STRING OPTIONS(description="The path, starting with a / and excluding any query parameters, of this web request, if this event is a web request"),
request_user_agent STRING OPTIONS(description="The user agent of this web request, if this event is a web request. Allows a user's browser and operating system to be identified"),
request_referer STRING OPTIONS(description="The URL of any page the user was viewing when they initiated this web request, if this event is a web request. This is the full URL, including protocol (https://) and any query parameters, if the browser shared these with our application as part of the web request. It is very common for this referer to be truncated for referrals from external sites."),
request_query ARRAY < STRUCT <key STRING NOT NULL OPTIONS(description="Name of the query parameter e.g. if the URL ended ?foo=bar then this will be foo."),
value ARRAY < STRING > OPTIONS(description="Contents of the query parameter e.g. if the URL ended ?foo=bar then this will be bar.") > > OPTIONS(description="ARRAY of STRUCTs, each with a key and a value. Contains any query parameters that were sent to the application as part of this web request, if this event is a web request."),
response_content_type STRING OPTIONS(description="Content type of any data that was returned to the browser following this web request, if this event is a web request. For example, 'text/html; charset=utf-8'. Image views, for example, may have a non-text/html content type."),
response_status STRING OPTIONS(description="HTTP response code returned by the application in response to this web request, if this event is a web request. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Status."),
data ARRAY < STRUCT <key STRING NOT NULL OPTIONS(description="Name of the field in the entity_table_name table in the database after it was created or updated, or just before it was imported or destroyed."),
value ARRAY < STRING > OPTIONS(description="Contents of the field in the database after it was created or updated, or just before it was imported or destroyed.") > > OPTIONS(description="ARRAY of STRUCTs, each with a key and a value. Contains a set of data points appropriate to the event_type of this event. For example, if this event was an entity create, update, delete or import event, data will contain the values of each field in the database after this event took place - according to the settings in the analytics.yml configured for this instance of dfe-analytics. Values may be anonymised as a one-way hash, depending on configuration settings."),
entity_table_name STRING OPTIONS(description="If event_type was an entity create, update, delete or import event, the name of the table in the database that this entity is stored in. NULL otherwise."),
event_tags ARRAY < STRING > OPTIONS(description="Currently left blank for future use."),
anonymised_user_agent_and_ip STRING OPTIONS(description="One-way hash of a combination of the user's IP address and user agent, if this event is a web request. Can be used to identify the user anonymously, even when user_id is not set. Cannot be used to identify the user over a time period longer than about a month, because of IP address changes and browser updates."),
environment STRING OPTIONS(description="The application environment that the event was streamed from."),
namespace STRING OPTIONS(description="The namespace of the instance of dfe-analytics that streamed this event. For example this might identify the name of the service that streamed the event.") )
PARTITION BY
DATE(occurred_at)
CLUSTER BY
event_type OPTIONS (description="Events streamed into BigQuery from the application")
/* You could add extra info here, like which environment and which application */
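
Once events are streaming, a quick sanity check might look like the query below (an illustrative sketch; substitute your project and dataset names). Filtering on the partitioning column keeps the scan cheap:

```sql
-- Count events by type over the last 7 days (placeholder table name)
SELECT
  event_type,
  COUNT(*) AS event_count
FROM `your-project-name.your-dataset-name.events`
WHERE DATE(occurred_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY event_type
ORDER BY event_count DESC;
```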
