diff --git a/README.md b/README.md
index ac52cdf2..220f58f1 100644
--- a/README.md
+++ b/README.md
@@ -162,13 +162,13 @@ The `dfe:analytics:install` generator will also initialize some empty config fil

| Filename                              | Purpose                                                                                                              |
|---------------------------------------|--------------------------------------------------------------------------------------------------------------------|
-| `config/analytics.yml`                | List all fields we will send to BigQuery                                                                             |
-| `config/analytics_pii.yml`            | List all fields we will obfuscate before sending to BigQuery. This should be a subset of fields in `analytics.yml`   |
-| `config/analytics_blocklist.yml`      | Autogenerated file to list all fields we will NOT send to BigQuery, to support the `analytics:check` task            |
-| `config/analytics_custom_events.yml`  | Optional file including list of all custom event names
-
-**It is imperative that you perform a full check of those fields are being sent, and exclude those containing personally-identifiable information (PII) in `config/analytics_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained.**
+| `config/analytics.yml`                | List all fields we will send to BigQuery                                                                             |
+| `config/analytics_pii.yml`            | List all fields we will obfuscate before sending to BigQuery. This should be a subset of fields in `analytics.yml`   |
+| `config/analytics_hidden_pii.yml`     | List all fields we will send separately to BigQuery where they will be hidden. This should be a subset of fields in `analytics.yml` |
+| `config/analytics_blocklist.yml`      | Autogenerated file to list all fields we will NOT send to BigQuery, to support the `analytics:check` task            |
+| `config/analytics_custom_events.yml`  | Optional file including list of all custom event names                                                               |
+**It is imperative that you perform a full check of the fields that are being sent, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained.**

When you first install the gem, none of your fields will be listed in `analytics.yml`, so no data will be sent to BigQuery. To get started, generate a blocklist using this command:

@@ -177,7 +177,7 @@ bundle exec rails dfe:analytics:regenerate_blocklist
```

Work through `analytics_blocklist.yml` to move entries into `analytics.yml` and
-optionally also to `analytics_pii.yml`.
+optionally also to `analytics_hidden_pii.yml`.
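The constraints implied by these files can be sketched in plain Ruby. This is a hedged illustration only: the `candidates` table and its field names below are hypothetical and not part of dfe-analytics. The two properties checked are the ones the documentation states: both PII files should be subsets of `analytics.yml`, and a field should not appear in both `analytics_pii.yml` and `analytics_hidden_pii.yml` (the configuration check in this change rejects such overlap).

```ruby
require 'yaml'

# Hypothetical config contents -- the 'candidates' table and its
# fields are illustrative only, not part of dfe-analytics itself.
analytics = YAML.safe_load(<<~YML)['shared']
  shared:
    candidates:
      - id
      - course_name
      - email_address
YML

pii = YAML.safe_load(<<~YML)['shared']
  shared:
    candidates:
      - course_name
YML

hidden_pii = YAML.safe_load(<<~YML)['shared']
  shared:
    candidates:
      - email_address
YML

# Both PII lists must be subsets of the fields sent to BigQuery...
subset_ok = (pii['candidates'] - analytics['candidates']).empty? &&
            (hidden_pii['candidates'] - analytics['candidates']).empty?
raise 'PII field missing from analytics.yml' unless subset_ok

# ...and a field may not be both obfuscated and hidden.
disjoint_ok = (pii['candidates'] & hidden_pii['candidates']).empty?
raise 'field appears in both PII lists' unless disjoint_ok
```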
When you boot your app, DfE::Analytics will raise an error if there are
fields in your field configuration which are present in the database but
@@ -256,7 +256,7 @@ it might be necessary to add a primary key to the table and to update the releva

## Custom events

-If you wish to send custom analytics event, create a file `config/analytics_custom_events.yml` containing an array of your custom events types under a `shared` key like:
+If you wish to send custom analytics events, for example if you have data about emails sent, server-side validation errors, API query data, or data relating to searches performed, create a file `config/analytics_custom_events.yml` containing an array of your custom event types under a `shared` key like:

```yaml
shared:
@@ -275,6 +275,26 @@ event = DfE::Analytics::Event.new
.with_data(some: 'custom details about event')
```

+If you need to include hidden PII, you can use the `hidden_data` key, which will send all fields listed under it separately to BigQuery, where they will be hidden.
+
+```ruby
+event = DfE::Analytics::Event.new
+  .with_type(:some_custom_event)
+  .with_user(current_user)
+  .with_request_details(request)
+  .with_namespace('some_namespace')
+  .with_data(
+    data:
+    {
+      some: 'custom details about event'
+    },
+    hidden_data: {
+      some_hidden: 'some data to be hidden',
+      more_hidden: 'more data to be hidden',
+    }
+  )
+```
+
Once all the events have been constructed, simply send them to your analytics:

```ruby
@@ -389,7 +409,7 @@ See the list of existing event types below for what kinds of event types can be
The different types of events that DfE Analytics send are:

- `web_request` - sent after a controller action is performed using controller callbacks
-- `create_entity` - sent after an object is created using model callbacks
+- `create_entity` - sent after an object is created using model callbacks
- `update_entity` - sent after an object is updated using model callbacks
- `delete_entity` - sent after an object is deleted using model callbacks
- `import_entity` - sent for each object imported using the DfE Analytics import rake tasks
diff --git a/docs/create-events-table.sql b/docs/create-events-table.sql
index 5c20fbd8..f129eb1d 100644
--- a/docs/create-events-table.sql
+++ b/docs/create-events-table.sql
@@ -14,6 +14,8 @@ CREATE TABLE
response_status STRING OPTIONS(description="HTTP response code returned by the application in response to this web request, if this event is a web request. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Status."),
DATA ARRAY < STRUCT OPTIONS(description="Contents of the field in the database after it was created or updated, or just before it was imported or destroyed.") > > OPTIONS(description="ARRAY of STRUCTs, each with a key and a value. Contains a set of data points appropriate to the event_type of this event.
For example, if this event was an entity create, update, delete or import event, data will contain the values of each field in the database after this event took place - according to the settings in the analytics.yml configured for this instance of dfe-analytics. Value be anonymised as a one way hash, depending on configuration settings."), + hidden_DATA ARRAY < STRUCT OPTIONS(description="Contents of the field in the database after it was created or updated, or just before it was imported or destroyed.") > > OPTIONS(description="Defined in the same way as the DATA ARRAY of STRUCTs, except containing fields configured to be hidden in analytics_hidden_pii.yml"), entity_table_name STRING OPTIONS(description="If event_type was an entity create, update, delete or import event, the name of the table in the database that this entity is stored in. NULL otherwise."), event_tags ARRAY < STRING > OPTIONS(description="Currently left blank for future use."), anonymised_user_agent_and_ip STRING OPTIONS(description="One way hash of a combination of the user's IP address and user agent, if this event is a web request. Can be used to identify the user anonymously, even when user_id is not set. Cannot be used to identify the user over a time period of longer than about a month, because of IP address changes and browser updates."), diff --git a/docs/google_cloud_bigquery_setup.md b/docs/google_cloud_bigquery_setup.md index ce42c01f..2b9b0c61 100644 --- a/docs/google_cloud_bigquery_setup.md +++ b/docs/google_cloud_bigquery_setup.md @@ -71,8 +71,7 @@ requires more manual work especially when it comes to adding permissions. - -#### Analyst Role +#### Basic Role This role is used for analysts or other users who don't need to write to or modify data in BigQuery. @@ -80,7 +79,7 @@ modify data in BigQuery.
Using the GCloud CLI

``` bash
-gcloud iam roles create bigquery_analyst_custom --title="BigQuery Analyst Custom" --description="Assigned to accounts used by analysts and SQL developers." --permissions=bigquery.datasets.get,bigquery.datasets.getIamPolicy,bigquery.datasets.updateTag,bigquery.jobs.create,bigquery.jobs.get,bigquery.jobs.list,bigquery.jobs.listAll,bigquery.models.export,bigquery.models.getData,bigquery.models.getMetadata,bigquery.models.list,bigquery.routines.get,bigquery.routines.list,bigquery.savedqueries.create,bigquery.savedqueries.delete,bigquery.savedqueries.get,bigquery.savedqueries.list,bigquery.savedqueries.update,bigquery.tables.createSnapshot,bigquery.tables.export,bigquery.tables.get,bigquery.tables.getData,bigquery.tables.getIamPolicy,bigquery.tables.list,bigquery.tables.restoreSnapshot,resourcemanager.projects.get --project=YOUR_PROJECT_ID
+gcloud iam roles create bigquery_basic_custom --title="BigQuery Basic Custom" --description="Assigned to accounts used by analysts." --permissions=bigquery.connections.get,bigquery.dataPolicies.maskedGet,bigquery.datasets.get,bigquery.datasets.getIamPolicy,bigquery.datasets.updateTag,bigquery.jobs.create,bigquery.jobs.get,bigquery.jobs.list,bigquery.jobs.listAll,bigquery.models.export,bigquery.models.getData,bigquery.models.getMetadata,bigquery.models.list,bigquery.readsessions.create,bigquery.readsessions.getData,bigquery.readsessions.update,bigquery.routines.get,bigquery.routines.list,bigquery.savedqueries.create,bigquery.savedqueries.delete,bigquery.savedqueries.get,bigquery.savedqueries.list,bigquery.savedqueries.update,bigquery.tables.createSnapshot,bigquery.tables.export,bigquery.tables.get,bigquery.tables.getData,bigquery.tables.getIamPolicy,bigquery.tables.list,bigquery.tables.restoreSnapshot,datacatalog.entries.get,datacatalog.entries.list,datacatalog.entryGroups.get,datacatalog.entryGroups.list,datacatalog.tagTemplates.get,datacatalog.tagTemplates.getTag,datacatalog.taxonomies.get,datacatalog.taxonomies.list,datalineage.events.get,datalineage.events.list,datalineage.locations.searchLinks,datalineage.processes.get,datalineage.processes.list,datalineage.runs.get,datalineage.runs.list,iam.serviceAccounts.actAs,iam.serviceAccounts.get,iam.serviceAccounts.list,pubsub.topics.get,resourcemanager.projects.get --project=YOUR_PROJECT_ID
```
@@ -89,15 +88,17 @@ gcloud iam roles create bigquery_analyst_custom --title="BigQuery Analyst Custom

| Field             | Value                                                     |
|-------------------|-----------------------------------------------------------|
-| Title             | **BigQuery Analyst Custom**                               |
-| Description       | Assigned to accounts used by analysts and SQL developers. |
-| ID                | `bigquery_analyst_custom`                                 |
+| Title             | **BigQuery Basic Custom**                                 |
+| Description       | Assigned to accounts used by analysts or other users who don't need to write to or modify data in BigQuery. |
+| ID                | `bigquery_basic_custom`                                   |
| Role launch stage | General Availability                                      |
| + Add permissions | See below                                                 |

-##### Permissions for `bigquery_analyst_custom`
+##### Permissions for `bigquery_basic_custom`

```
+bigquery.connections.get
+bigquery.dataPolicies.maskedGet
bigquery.datasets.get
bigquery.datasets.getIamPolicy
bigquery.datasets.updateTag
@@ -109,6 +110,9 @@ bigquery.models.export
bigquery.models.getData
bigquery.models.getMetadata
bigquery.models.list
+bigquery.readsessions.create
+bigquery.readsessions.getData
+bigquery.readsessions.update
bigquery.routines.get
bigquery.routines.list
bigquery.savedqueries.create
@@ -123,20 +127,39 @@ bigquery.tables.getData
bigquery.tables.getIamPolicy
bigquery.tables.list
bigquery.tables.restoreSnapshot
+datacatalog.entries.get
+datacatalog.entries.list
+datacatalog.entryGroups.get
+datacatalog.entryGroups.list
+datacatalog.tagTemplates.get
+datacatalog.tagTemplates.getTag
+datacatalog.taxonomies.get
+datacatalog.taxonomies.list
+datalineage.events.get
+datalineage.events.list
+datalineage.locations.searchLinks
+datalineage.processes.get
+datalineage.processes.list
+datalineage.runs.get
+datalineage.runs.list
+iam.serviceAccounts.actAs
+iam.serviceAccounts.get
+iam.serviceAccounts.list
+pubsub.topics.get
resourcemanager.projects.get
```

-#### Developer Role
+#### Advanced Role

-This role is used for developers or other users who need to be able to write to
+This role is used for Dataform SQL developers or other users who need to be able to write to
or modify data in BigQuery.
Using the GCloud CLI

``` bash
-gcloud iam roles create bigquery_developer_custom --title="BigQuery Developer Custom" --description="Assigned to accounts used by developers." --permissions=bigquery.connections.create,bigquery.connections.delete,bigquery.connections.get,bigquery.connections.getIamPolicy,bigquery.connections.list,bigquery.connections.update,bigquery.connections.updateTag,bigquery.connections.use,bigquery.datasets.create,bigquery.datasets.delete,bigquery.datasets.get,bigquery.datasets.getIamPolicy,bigquery.datasets.update,bigquery.datasets.updateTag,bigquery.jobs.create,bigquery.jobs.delete,bigquery.jobs.get,bigquery.jobs.list,bigquery.jobs.listAll,bigquery.jobs.update,bigquery.models.create,bigquery.models.delete,bigquery.models.export,bigquery.models.getData,bigquery.models.getMetadata,bigquery.models.list,bigquery.models.updateData,bigquery.models.updateMetadata,bigquery.models.updateTag,bigquery.routines.create,bigquery.routines.delete,bigquery.routines.get,bigquery.routines.list,bigquery.routines.update,bigquery.routines.updateTag,bigquery.savedqueries.create,bigquery.savedqueries.delete,bigquery.savedqueries.get,bigquery.savedqueries.list,bigquery.savedqueries.update,bigquery.tables.create,bigquery.tables.createSnapshot,bigquery.tables.delete,bigquery.tables.deleteSnapshot,bigquery.tables.export,bigquery.tables.get,bigquery.tables.getData,bigquery.tables.getIamPolicy,bigquery.tables.list,bigquery.tables.restoreSnapshot,bigquery.tables.setCategory,bigquery.tables.update,bigquery.tables.updateData,bigquery.tables.updateTag,resourcemanager.projects.get --project=YOUR_PROJECT_ID
+gcloud iam roles create bigquery_advanced_custom --title="BigQuery Advanced Custom" --description="Assigned to accounts used by Dataform SQL developers who need to be able to write to or modify data in BigQuery." --permissions=aiplatform.notebookRuntimeTemplates.apply,aiplatform.notebookRuntimeTemplates.get,aiplatform.notebookRuntimeTemplates.getIamPolicy,aiplatform.notebookRuntimeTemplates.list,aiplatform.notebookRuntimes.assign,aiplatform.notebookRuntimes.get,aiplatform.notebookRuntimes.list,aiplatform.operations.list,bigquery.config.get,bigquery.connections.create,bigquery.connections.delete,bigquery.connections.get,bigquery.connections.getIamPolicy,bigquery.connections.list,bigquery.connections.update,bigquery.connections.updateTag,bigquery.connections.use,bigquery.datasets.create,bigquery.datasets.delete,bigquery.datasets.get,bigquery.datasets.getIamPolicy,bigquery.datasets.update,bigquery.datasets.updateTag,bigquery.jobs.create,bigquery.jobs.delete,bigquery.jobs.get,bigquery.jobs.list,bigquery.jobs.listAll,bigquery.jobs.update,bigquery.models.create,bigquery.models.delete,bigquery.models.export,bigquery.models.getData,bigquery.models.getMetadata,bigquery.models.list,bigquery.models.updateData,bigquery.models.updateMetadata,bigquery.models.updateTag,bigquery.readsessions.create,bigquery.readsessions.getData,bigquery.readsessions.update,bigquery.routines.create,bigquery.routines.delete,bigquery.routines.get,bigquery.routines.list,bigquery.routines.update,bigquery.routines.updateTag,bigquery.savedqueries.create,bigquery.savedqueries.delete,bigquery.savedqueries.get,bigquery.savedqueries.list,bigquery.savedqueries.update,bigquery.tables.create,bigquery.tables.createSnapshot,bigquery.tables.delete,bigquery.tables.deleteSnapshot,bigquery.tables.export,bigquery.tables.get,bigquery.tables.getData,bigquery.tables.getIamPolicy,bigquery.tables.list,bigquery.tables.restoreSnapshot,bigquery.tables.setCategory,bigquery.tables.update,bigquery.tables.updateData,bigquery.tables.updateTag,datacatalog.categories.fineGrainedGet,datacatalog.entries.get,datacatalog.entries.list,datacatalog.entryGroups.get,datacatalog.entryGroups.list,datacatalog.tagTemplates.get,datacatalog.tagTemplates.getTag,datacatalog.taxonomies.get,datacatalog.taxonomies.list,dataform.compilationResults.create,dataform.compilationResults.get,dataform.compilationResults.list,dataform.compilationResults.query,dataform.locations.get,dataform.locations.list,dataform.releaseConfigs.create,dataform.releaseConfigs.delete,dataform.releaseConfigs.get,dataform.releaseConfigs.list,dataform.releaseConfigs.update,dataform.repositories.commit,dataform.repositories.computeAccessTokenStatus,dataform.repositories.create,dataform.repositories.delete,dataform.repositories.fetchHistory,dataform.repositories.fetchRemoteBranches,dataform.repositories.get,dataform.repositories.getIamPolicy,dataform.repositories.list,dataform.repositories.queryDirectoryContents,dataform.repositories.readFile,dataform.repositories.setIamPolicy,dataform.repositories.update,dataform.workflowConfigs.create,dataform.workflowConfigs.delete,dataform.workflowConfigs.get,dataform.workflowConfigs.list,dataform.workflowConfigs.update,dataform.workflowInvocations.cancel,dataform.workflowInvocations.create,dataform.workflowInvocations.delete,dataform.workflowInvocations.get,dataform.workflowInvocations.list,dataform.workflowInvocations.query,dataform.workspaces.commit,dataform.workspaces.create,dataform.workspaces.delete,dataform.workspaces.fetchFileDiff,dataform.workspaces.fetchFileGitStatuses,dataform.workspaces.fetchGitAheadBehind,dataform.workspaces.get,dataform.workspaces.getIamPolicy,dataform.workspaces.installNpmPackages,dataform.workspaces.list,dataform.workspaces.makeDirectory,dataform.workspaces.moveDirectory,dataform.workspaces.moveFile,dataform.workspaces.pull,dataform.workspaces.push,dataform.workspaces.queryDirectoryContents,dataform.workspaces.readFile,dataform.workspaces.removeDirectory,dataform.workspaces.removeFile,dataform.workspaces.reset,dataform.workspaces.searchFiles,dataform.workspaces.setIamPolicy,dataform.workspaces.writeFile,datalineage.events.get,datalineage.events.list,datalineage.locations.searchLinks,datalineage.processes.get,datalineage.processes.list,datalineage.runs.get,datalineage.runs.list,iam.serviceAccounts.actAs,iam.serviceAccounts.get,iam.serviceAccounts.list,logging.buckets.get,logging.buckets.list,logging.exclusions.get,logging.exclusions.list,logging.links.get,logging.links.list,logging.locations.get,logging.locations.list,logging.logEntries.list,logging.logMetrics.get,logging.logMetrics.list,logging.logServiceIndexes.list,logging.logServices.list,logging.logs.list,logging.operations.get,logging.operations.list,logging.queries.create,logging.queries.delete,logging.queries.get,logging.queries.list,logging.queries.listShared,logging.queries.update,logging.sinks.get,logging.sinks.list,logging.usage.get,logging.views.get,logging.views.list,pubsub.topics.get,resourcemanager.projects.get --project=YOUR_PROJECT_ID
```
@@ -145,15 +168,24 @@ gcloud iam roles create bigquery_developer_custom --title="BigQuery Developer Cu | Field | Value | | ----------------- | ---------------------------------------- | -| Title | **BigQuery Developer Custom** | -| Description | Assigned to accounts used by developers. | -| ID | `bigquery_developer_custom` | +| Title | **BigQuery Advanced Custom** | +| Description | Assigned to accounts used by Dataform SQL developers who need to be able to write to or modify data in BigQuery. | +| ID | `bigquery_advanced_custom` | | Role launch stage | General Availability | | + Add permissions | See below | -##### Permissions for `bigquery_developer_custom` +##### Permissions for `bigquery_advanced_custom` ``` +aiplatform.notebookRuntimeTemplates.apply +aiplatform.notebookRuntimeTemplates.get +aiplatform.notebookRuntimeTemplates.getIamPolicy +aiplatform.notebookRuntimeTemplates.list +aiplatform.notebookRuntimes.assign +aiplatform.notebookRuntimes.get +aiplatform.notebookRuntimes.list +aiplatform.operations.list +bigquery.config.get bigquery.connections.create bigquery.connections.delete bigquery.connections.get @@ -183,6 +215,9 @@ bigquery.models.list bigquery.models.updateData bigquery.models.updateMetadata bigquery.models.updateTag +bigquery.readsessions.create +bigquery.readsessions.getData +bigquery.readsessions.update bigquery.routines.create bigquery.routines.delete bigquery.routines.get @@ -208,6 +243,111 @@ bigquery.tables.setCategory bigquery.tables.update bigquery.tables.updateData bigquery.tables.updateTag +datacatalog.categories.fineGrainedGet +datacatalog.entries.get +datacatalog.entries.list +datacatalog.entryGroups.get +datacatalog.entryGroups.list +datacatalog.tagTemplates.get +datacatalog.tagTemplates.getTag +datacatalog.taxonomies.get +datacatalog.taxonomies.list +dataform.compilationResults.create +dataform.compilationResults.get +dataform.compilationResults.list +dataform.compilationResults.query +dataform.locations.get 
+dataform.locations.list +dataform.releaseConfigs.create +dataform.releaseConfigs.delete +dataform.releaseConfigs.get +dataform.releaseConfigs.list +dataform.releaseConfigs.update +dataform.repositories.commit +dataform.repositories.computeAccessTokenStatus +dataform.repositories.create +dataform.repositories.delete +dataform.repositories.fetchHistory +dataform.repositories.fetchRemoteBranches +dataform.repositories.get +dataform.repositories.getIamPolicy +dataform.repositories.list +dataform.repositories.queryDirectoryContents +dataform.repositories.readFile +dataform.repositories.setIamPolicy +dataform.repositories.update +dataform.workflowConfigs.create +dataform.workflowConfigs.delete +dataform.workflowConfigs.get +dataform.workflowConfigs.list +dataform.workflowConfigs.update +dataform.workflowInvocations.cancel +dataform.workflowInvocations.create +dataform.workflowInvocations.delete +dataform.workflowInvocations.get +dataform.workflowInvocations.list +dataform.workflowInvocations.query +dataform.workspaces.commit +dataform.workspaces.create +dataform.workspaces.delete +dataform.workspaces.fetchFileDiff +dataform.workspaces.fetchFileGitStatuses +dataform.workspaces.fetchGitAheadBehind +dataform.workspaces.get +dataform.workspaces.getIamPolicy +dataform.workspaces.installNpmPackages +dataform.workspaces.list +dataform.workspaces.makeDirectory +dataform.workspaces.moveDirectory +dataform.workspaces.moveFile +dataform.workspaces.pull +dataform.workspaces.push +dataform.workspaces.queryDirectoryContents +dataform.workspaces.readFile +dataform.workspaces.removeDirectory +dataform.workspaces.removeFile +dataform.workspaces.reset +dataform.workspaces.searchFiles +dataform.workspaces.setIamPolicy +dataform.workspaces.writeFile +datalineage.events.get +datalineage.events.list +datalineage.locations.searchLinks +datalineage.processes.get +datalineage.processes.list +datalineage.runs.get +datalineage.runs.list +iam.serviceAccounts.actAs +iam.serviceAccounts.get 
+iam.serviceAccounts.list
+logging.buckets.get
+logging.buckets.list
+logging.exclusions.get
+logging.exclusions.list
+logging.links.get
+logging.links.list
+logging.locations.get
+logging.locations.list
+logging.logEntries.list
+logging.logMetrics.get
+logging.logMetrics.list
+logging.logServiceIndexes.list
+logging.logServices.list
+logging.logs.list
+logging.operations.get
+logging.operations.list
+logging.queries.create
+logging.queries.delete
+logging.queries.get
+logging.queries.list
+logging.queries.listShared
+logging.queries.update
+logging.sinks.get
+logging.sinks.list
+logging.usage.get
+logging.views.get
+logging.views.list
+pubsub.topics.get
resourcemanager.projects.get
```
@@ -246,6 +386,24 @@ bigquery.tables.updateData
+
+### 4. Create a policy tag
+We use a BigQuery 'policy tag' to label some fields in some tables in BigQuery
+as 'hidden', restrict access to these fields and mask data in these fields to
+users without access. Policy tag(s) exist within a group known as a 'taxonomy'.
+
+To create the 'hidden' policy tag required by dfe-analytics:
+1. Enable the "BigQuery Data Policy API": search for this from the 'Enable APIs
+   and services' screen, accessible from the 'Enabled APIs and services' screen
+   within the 'APIs and services' section of GCP, and click 'Enable'.
+2. Open BigQuery, open the 'Policy tags' screen and click 'Create taxonomy'.
+3. Use this screen to create a policy tag named 'hidden' within a taxonomy named
+   something like 'project-restricted-access' (replacing 'project' with something
+   meaningful to your GCP project). Ensure the taxonomy is within the
+   europe-west2 (London) region.
+4. Click the 'Manage data policies' button to open the Masking rules screen. Under
+   'Data policy name 1' type 'hidden' and under 'Masking rule 1' select
+   'Hash (SHA256)'. Click Submit.
+ ## Dataset and Table Setup `dfe-analytics` inserts events into a table in BigQuery with a pre-defined @@ -299,6 +457,12 @@ Once the dataset is ready you need to create the `events` table in it: into the query editor. 3. Edit your project and dataset names in the query editor. 4. Run the query to create a blank events table. +5. Label the hidden_DATA field with the 'hidden' policy tag to restrict + access to it: Navigate to the newly created table in BigQuery using the + left hand sidebar. Click 'Edit Schema'. Expand the 'hidden_DATA' field + and select the checkbox next to the 'value' element within it. Click + 'Add policy tag' and select the 'hidden' policy tag in the taxonomy for + your project. Click Save. BigQuery allows you to copy a table to a new dataset, so now is a good time to create all the datasets you need and copy the blank `events` table to each of @@ -332,6 +496,3 @@ Ensure you have the email address of the service account handy for this. principals" box. 4. Select the "BigQuery Appender Custom" role you created previously. 5. Click "SAVE" to finish. 
-
-
-
diff --git a/lib/dfe/analytics.rb b/lib/dfe/analytics.rb
index ab6eb15e..219f177f 100644
--- a/lib/dfe/analytics.rb
+++ b/lib/dfe/analytics.rb
@@ -138,6 +138,12 @@ def self.allowlist_pii
    Rails.application.config_for(:analytics_pii)
  end

+  def self.hidden_pii
+    Rails.application.config_for(:analytics_hidden_pii)
+  rescue RuntimeError
+    { 'shared' => {} }
+  end
+
  def self.blocklist
    Rails.application.config_for(:analytics_blocklist)
  end
@@ -185,16 +191,29 @@ def self.models_for_entity(entity)
  def self.extract_model_attributes(model, attributes = nil)
    # if no list of attrs specified, consider all attrs belonging to this model
    attributes ||= model.attributes
-    table_name = model.class.table_name
+    table_name = model.class.table_name.to_sym

-    exportable_attrs = allowlist[table_name.to_sym].presence || []
-    pii_attrs = allowlist_pii[table_name.to_sym].presence || []
+    exportable_attrs = (allowlist[table_name].presence || []).map(&:to_sym)
+    hidden_pii_attrs = (hidden_pii[table_name].presence || []).map(&:to_sym)
+    pii_attrs = (allowlist_pii[table_name].presence || []).map(&:to_sym)
+
+    # Validation in fields.rb ensures attributes do not appear in both allowlist_pii and hidden_pii
    exportable_pii_attrs = exportable_attrs & pii_attrs
+    exportable_hidden_pii_attrs = exportable_attrs & hidden_pii_attrs
+
+    # Exclude both pii and hidden attributes from allowed_attributes
+    allowed_attrs_to_include = exportable_attrs - (exportable_pii_attrs + exportable_hidden_pii_attrs)

-    allowed_attributes = attributes.slice(*exportable_attrs&.map(&:to_s))
-    obfuscated_attributes = attributes.slice(*exportable_pii_attrs&.map(&:to_s))
+    allowed_attributes = attributes.slice(*allowed_attrs_to_include&.map(&:to_s))
+    obfuscated_attributes = attributes.slice(*exportable_pii_attrs.map(&:to_s))
+                                      .transform_values { |value| pseudonymise(value) }
+    hidden_attributes = attributes.slice(*exportable_hidden_pii_attrs&.map(&:to_s))

-    allowed_attributes.deep_merge(obfuscated_attributes.transform_values { |value| pseudonymise(value) })
+    # Allowed attributes (which currently includes the allowlist_pii) must be kept separate from hidden_attributes
+    model_attributes = {}
+    model_attributes.merge!(data: allowed_attributes.deep_merge(obfuscated_attributes)) if allowed_attributes.any? || obfuscated_attributes.any?
+    model_attributes.merge!(hidden_data: hidden_attributes) if hidden_attributes.any?
+    model_attributes
  end

  def self.anonymise(value)
diff --git a/lib/dfe/analytics/entities.rb b/lib/dfe/analytics/entities.rb
index ee3499b2..0544a7b6 100644
--- a/lib/dfe/analytics/entities.rb
+++ b/lib/dfe/analytics/entities.rb
@@ -9,24 +9,26 @@ module Entities
      attr_accessor :event_tags

      after_create do
-        data = DfE::Analytics.extract_model_attributes(self)
-        send_event('create_entity', data) if data.any?
+        extracted_attributes = DfE::Analytics.extract_model_attributes(self)
+        send_event('create_entity', extracted_attributes) if extracted_attributes.any?
      end

      after_destroy do
-        data = DfE::Analytics.extract_model_attributes(self)
-        send_event('delete_entity', data) if data.any?
+        extracted_attributes = DfE::Analytics.extract_model_attributes(self)
+        send_event('delete_entity', extracted_attributes) if extracted_attributes.any?
      end

      after_update do
-        # in this after_update hook we don’t have access to the new fields via
+        # in this after_update hook we don't have access to the new fields via
        # #attributes — we need to dig them out of saved_changes which stores
        # them in the format { attr: ['old', 'new'] }
-        interesting_changes = DfE::Analytics.extract_model_attributes(
+        updated_attributes = DfE::Analytics.extract_model_attributes(
          self, saved_changes.transform_values(&:last)
        )
-        send_event('update_entity', DfE::Analytics.extract_model_attributes(self).merge(interesting_changes)) if interesting_changes.any?
+ allowed_attributes = DfE::Analytics.extract_model_attributes(self).deep_merge(updated_attributes) + + send_event('update_entity', allowed_attributes) if updated_attributes.any? end end diff --git a/lib/dfe/analytics/event.rb b/lib/dfe/analytics/event.rb index 7d2e7f6a..b460b12e 100644 --- a/lib/dfe/analytics/event.rb +++ b/lib/dfe/analytics/event.rb @@ -73,7 +73,8 @@ def with_entity_table_name(table_name) end def with_data(hash) - @event_hash.deep_merge!(data: hash_to_kv_pairs(hash)) + @event_hash.deep_merge!(data: hash_to_kv_pairs(hash[:data])) if hash.include?(:data) + @event_hash.deep_merge!(hidden_data: hash_to_kv_pairs(hash[:hidden_data])) if hash.include?(:hidden_data) self end @@ -109,6 +110,8 @@ def convert_value_to_json(value) end def hash_to_kv_pairs(hash) + return [] if hash.nil? + hash.map do |(key, values)| if Array.wrap(values).any?(&:nil?) message = "an array field contains nulls - event: #{@event_hash} key: #{key} values: #{values}" diff --git a/lib/dfe/analytics/event_matcher.rb b/lib/dfe/analytics/event_matcher.rb index addab722..0cf940cc 100644 --- a/lib/dfe/analytics/event_matcher.rb +++ b/lib/dfe/analytics/event_matcher.rb @@ -16,6 +16,8 @@ def matched? private def filter_matched?(filter, nested_fields = []) + return false if filter.nil? || filter.values.any?(&:nil?) + filter.all? do |field, filter_value| fields = nested_fields + [field] @@ -31,9 +33,14 @@ def filter_matched?(filter, nested_fields = []) def field_matched?(filter_value, nested_fields) event_value = event_value_for(nested_fields) - regexp = Regexp.new(filter_value) + return false if event_value.nil? 
+ + # Convert values to strings for comparison + filter_value_str = filter_value.to_s + event_value_str = event_value.to_s - regexp.match?(event_value) + regexp = Regexp.new(filter_value_str) + regexp.match?(event_value_str) end def event_value_for(nested_fields) diff --git a/lib/dfe/analytics/fields.rb b/lib/dfe/analytics/fields.rb index bf4dabea..f5eedecb 100644 --- a/lib/dfe/analytics/fields.rb +++ b/lib/dfe/analytics/fields.rb @@ -45,6 +45,15 @@ def self.check! HEREDOC end + if overlapping_pii_fields.any? + errors << <<~HEREDOC + PII configuration error detected! The following fields are listed in both hidden_pii and allowlist_pii. + Fields must only be present in one. Please update the configuration to resolve the conflict: + + #{overlapping_pii_fields.to_yaml} + HEREDOC + end + configuration_errors = errors.join("\n\n----------------\n\n") raise(ConfigurationError, configuration_errors) if errors.any? @@ -58,6 +67,27 @@ def self.allowlist DfE::Analytics.allowlist end + def self.hidden_pii + DfE::Analytics.hidden_pii || {} + end + + def self.allowlist_pii + DfE::Analytics.allowlist_pii || {} + end + + def self.overlapping_pii_fields + overlapping_fields = [] + hidden_pii.each do |entity, fields| + next if fields.blank? + + if allowlist_pii[entity] + overlapping = fields & allowlist_pii[entity] + overlapping_fields.concat(overlapping) unless overlapping.blank? 
+ end + end + overlapping_fields + end + def self.database DfE::Analytics.all_entities_in_application .reduce({}) do |list, entity| diff --git a/lib/dfe/analytics/initialisation_events.rb b/lib/dfe/analytics/initialisation_events.rb index 02781fd1..86403b60 100644 --- a/lib/dfe/analytics/initialisation_events.rb +++ b/lib/dfe/analytics/initialisation_events.rb @@ -27,7 +27,7 @@ def send_initialisation_events initialise_analytics_event = DfE::Analytics::Event.new .with_type('initialise_analytics') - .with_data(initialise_analytics_data) + .with_data(data: initialise_analytics_data) .as_json DfE::Analytics::SendEvents.perform_for([initialise_analytics_event]) diff --git a/lib/dfe/analytics/railtie.rb b/lib/dfe/analytics/railtie.rb index 95a93f44..5721181a 100644 --- a/lib/dfe/analytics/railtie.rb +++ b/lib/dfe/analytics/railtie.rb @@ -4,9 +4,10 @@ module DfE module Analytics # Railtie class Railtie < Rails::Railtie - config.before_initialize do + initializer 'dfe.analytics.configure_params' do |app| i18n_files = File.expand_path("#{File.dirname(__FILE__)}/../../../config/locales/en.yml") I18n.load_path << i18n_files + app.config.filter_parameters += [:hidden_data] end initializer 'dfe.analytics.insert_middleware' do |app| diff --git a/lib/dfe/analytics/send_events.rb b/lib/dfe/analytics/send_events.rb index 281674e5..6b5af636 100644 --- a/lib/dfe/analytics/send_events.rb +++ b/lib/dfe/analytics/send_events.rb @@ -30,18 +30,44 @@ def self.perform_for(events) def perform(events) if DfE::Analytics.log_only? # Use the Rails logger here as the job's logger is set to :warn by default - Rails.logger.info("DfE::Analytics: #{events.inspect}") + events.each { |event| Rails.logger.info("DfE::Analytics: #{mask_hidden_data(event).inspect}") } else - if DfE::Analytics.event_debug_enabled? events .select { |event| DfE::Analytics::EventMatcher.new(event).matched? 
} - .each { |event| Rails.logger.info("DfE::Analytics processing: #{event.inspect}") } + .each { |event| Rails.logger.info("DfE::Analytics processing: #{mask_hidden_data(event).inspect}") } end DfE::Analytics.config.azure_federated_auth ? DfE::Analytics::BigQueryApi.insert(events) : DfE::Analytics::BigQueryLegacyApi.insert(events) end end + + private + + def mask_hidden_data(event) + masked_event = event.deep_dup.with_indifferent_access + return event unless masked_event&.key?(:hidden_data) + + mask_hidden_data_values(masked_event) + end + + def mask_hidden_data_values(event) + hidden_data = event[:hidden_data] + + hidden_data.each { |data| mask_data(data) } if hidden_data.is_a?(Array) + + event + end + + def mask_data(data) + return unless data.is_a?(Hash) + + data[:value] = ['HIDDEN'] if data[:value].present? + + return unless data[:key].is_a?(Hash) && data[:key][:value].present? + + data[:key][:value] = ['HIDDEN'] + end end end end diff --git a/lib/dfe/analytics/services/entity_table_checks.rb b/lib/dfe/analytics/services/entity_table_checks.rb index b8a464c7..711b6a21 100644 --- a/lib/dfe/analytics/services/entity_table_checks.rb +++ b/lib/dfe/analytics/services/entity_table_checks.rb @@ -108,7 +108,7 @@ def build_event_for(entity_name, entity_type, entity_tag, order_column) .with_type(entity_type) .with_entity_table_name(entity_name) .with_tags([entity_tag]) - .with_data(entity_table_check_data(entity_name, order_column)) + .with_data(data: entity_table_check_data(entity_name, order_column)) .as_json end end diff --git a/lib/generators/dfe/analytics/install_generator.rb b/lib/generators/dfe/analytics/install_generator.rb index 7fa1b70a..2baca88e 100644 --- a/lib/generators/dfe/analytics/install_generator.rb +++ b/lib/generators/dfe/analytics/install_generator.rb @@ -12,6 +12,7 @@ def install create_file 'config/analytics.yml', { 'shared' => {} }.to_yaml create_file 'config/analytics_pii.yml', { 'shared' => {} }.to_yaml + create_file 
'config/analytics_hidden_pii.yml', { 'shared' => {} }.to_yaml create_file 'config/analytics_blocklist.yml', { 'shared' => {} }.to_yaml end diff --git a/spec/dfe/analytics/entities_spec.rb b/spec/dfe/analytics/entities_spec.rb index fcb06439..ad680896 100644 --- a/spec/dfe/analytics/entities_spec.rb +++ b/spec/dfe/analytics/entities_spec.rb @@ -1,14 +1,16 @@ # frozen_string_literal: true RSpec.describe DfE::Analytics::Entities do - let(:interesting_fields) { [] } + let(:allowlist_fields) { [] } let(:pii_fields) { [] } + let(:hidden_pii_fields) { [] } with_model :Candidate do table do |t| t.string :email_address t.string :last_name t.string :first_name + t.string :dob end end @@ -18,13 +20,17 @@ allow(DfE::Analytics).to receive(:enabled?).and_return(true) allow(DfE::Analytics).to receive(:allowlist).and_return({ - Candidate.table_name.to_sym => interesting_fields + Candidate.table_name.to_sym => allowlist_fields }) allow(DfE::Analytics).to receive(:allowlist_pii).and_return({ Candidate.table_name.to_sym => pii_fields }) + allow(DfE::Analytics).to receive(:hidden_pii).and_return({ + Candidate.table_name.to_sym => hidden_pii_fields + }) + # autogenerate a compliant blocklist allow(DfE::Analytics).to receive(:blocklist).and_return(DfE::Analytics::Fields.generate_blocklist) @@ -33,7 +39,7 @@ describe 'create_entity events' do context 'when fields are specified in the analytics file' do - let(:interesting_fields) { ['id'] } + let(:allowlist_fields) { ['id'] } it 'includes attributes specified in the settings file' do Candidate.create(id: 123) @@ -85,7 +91,7 @@ end context 'and the specified fields are listed as PII' do - let(:interesting_fields) { ['email_address'] } + let(:allowlist_fields) { ['email_address'] } let(:pii_fields) { ['email_address'] } it 'hashes those fields' do @@ -102,7 +108,7 @@ end context 'and other fields are listed as PII' do - let(:interesting_fields) { ['id'] } + let(:allowlist_fields) { ['id'] } let(:pii_fields) { ['email_address'] } it 'does 
not include the fields only listed as PII' do @@ -119,7 +125,7 @@ end context 'when no fields are specified in the analytics file' do - let(:interesting_fields) { [] } + let(:allowlist_fields) { [] } it 'does not send create_entity events at all' do Candidate.create @@ -128,11 +134,27 @@ .with([a_hash_including({ 'event_type' => 'create_entity' })]) end end + + context 'when fields are specified in the analytics and hidden_pii file' do + let(:allowlist_fields) { %w[email_address dob] } + let(:hidden_pii_fields) { %w[dob] } + + it 'sends event with separated allowed and hidden data' do + Candidate.create(email_address: 'foo@bar.com', dob: '20062000') + + expect(DfE::Analytics::SendEvents).to have_received(:perform_later) + .with([a_hash_including({ + 'event_type' => 'create_entity', + 'data' => array_including(a_hash_including('key' => 'email_address', 'value' => ['foo@bar.com'])), + 'hidden_data' => array_including(a_hash_including('key' => 'dob', 'value' => ['20062000'])) + })]) + end + end end describe 'update_entity events' do context 'when fields are specified in the analytics file' do - let(:interesting_fields) { %w[email_address first_name] } + let(:allowlist_fields) { %w[email_address first_name] } it 'sends update events for fields we care about' do entity = Candidate.create(email_address: 'foo@bar.com', first_name: 'Jason') @@ -173,7 +195,7 @@ end context 'when no fields are specified in the analytics file' do - let(:interesting_fields) { [] } + let(:allowlist_fields) { [] } it 'does not send update events at all' do entity = Candidate.create @@ -185,10 +207,37 @@ })]) end end + + context 'when fields are specified in the analytics and hidden_pii file' do + let(:candidate) { Candidate.create(email_address: 'name@example.com', dob: '20062000') } + let(:allowlist_fields) { %w[email_address dob] } + let(:hidden_pii_fields) { %w[dob] } + + it 'sends events with updated allowed field but without original hidden data' do + candidate.update(email_address: 
'updated@example.com') + + expect(DfE::Analytics::SendEvents).to have_received(:perform_later) + .with([a_hash_including({ + 'event_type' => 'update_entity', + 'data' => array_including(a_hash_including('key' => 'email_address', 'value' => ['updated@example.com'])) + })]) + end + + it 'sends events with updated allowed field and with updated hidden data' do + candidate.update(email_address: 'updated@example.com', dob: '21062000') + + expect(DfE::Analytics::SendEvents).to have_received(:perform_later) + .with([a_hash_including({ + 'event_type' => 'update_entity', + 'data' => array_including(a_hash_including('key' => 'email_address', 'value' => ['updated@example.com'])), + 'hidden_data' => array_including(a_hash_including('key' => 'dob', 'value' => ['21062000'])) + })]) + end + end end describe 'delete_entity events' do - let(:interesting_fields) { ['email_address'] } + let(:allowlist_fields) { ['email_address'] } it 'sends events when objects are deleted' do entity = Candidate.create(email_address: 'boo@example.com') @@ -203,5 +252,22 @@ ] })]) end + + context 'when fields are specified in the analytics and hidden_pii file' do + let(:allowlist_fields) { %w[email_address dob] } + let(:hidden_pii_fields) { %w[dob] } + + it 'sends event indicating deletion with allowed and hidden data' do + entity = Candidate.create(email_address: 'to@be.deleted', dob: '21062000') + entity.destroy + + expect(DfE::Analytics::SendEvents).to have_received(:perform_later) + .with([a_hash_including({ + 'event_type' => 'delete_entity', + 'data' => array_including(a_hash_including('key' => 'email_address')), + 'hidden_data' => array_including(a_hash_including('key' => 'dob', 'value' => ['21062000'])) + })]) + end + end end end diff --git a/spec/dfe/analytics/event_matcher_spec.rb b/spec/dfe/analytics/event_matcher_spec.rb index bcee423d..5de5e427 100644 --- a/spec/dfe/analytics/event_matcher_spec.rb +++ b/spec/dfe/analytics/event_matcher_spec.rb @@ -127,5 +127,26 @@ end end end + + describe 
'.field_matched?' do + let(:logging) do + { + event_filters: [ + { + event_type: 'update_entity', + entity_table_name: 'course_options', + data: { + key: 'course_id' + } + } + ] + } + end + + it 'returns false when event_value is nil' do + allow(subject).to receive(:event_value_for).and_return(nil) + expect(subject.send(:field_matched?, 'course_id', 'data')).to be false + end + end end end diff --git a/spec/dfe/analytics/event_spec.rb b/spec/dfe/analytics/event_spec.rb index d5b7b740..f3b87d9e 100644 --- a/spec/dfe/analytics/event_spec.rb +++ b/spec/dfe/analytics/event_spec.rb @@ -61,6 +61,7 @@ end describe 'data pairs' do + let(:event) { described_class.new } let(:has_as_json_class) do Struct.new(:colour, :is_cat) do def as_json @@ -72,43 +73,49 @@ def as_json end end - it 'converts booleans to strings' do - event = described_class.new - output = event.with_data(key: true).as_json - expect(output['data'].first['value']).to eq ['true'] + def find_data_pair(output, key) + output['data'].find { |pair| pair['key'] == key } end - it 'converts hashes to strings' do - event = described_class.new - output = event.with_data(key: { equality_and_diversity: { ethnic_background: 'Irish' } }).as_json - expect(output['data'].first['value']).to eq ['{"equality_and_diversity":{"ethnic_background":"Irish"}}'] + it 'converts data types to their string representations' do + boolean_output = event.with_data(data: { boolean_key: true }, hidden_data: {}).as_json + expect(find_data_pair(boolean_output, 'boolean_key')['value']).to eq(['true']) end - it 'strips out nil values' do - event = described_class.new - output = event.with_data(key: ['A', nil, 'B']).as_json - expect(output['data'].first['value']).to eq %w[A B] + it 'converts hashes to strings' do + hash_output = event.with_data(data: { hash_key: { equality_and_diversity: { ethnic_background: 'Irish' } } }, hidden_data: {}).as_json + expect(find_data_pair(hash_output, 'hash_key')['value']).to 
eq(['{"equality_and_diversity":{"ethnic_background":"Irish"}}'])
   end
 
-  it 'logs a warning when stripping out nil values' do
+  it 'strips out nil values and logs a warning' do
     expect(Rails.logger).to receive(:warn).with(/DfE::Analytics an array field contains nulls/)
-
-    event = described_class.new
-    event.with_data(key: ['A', nil, nil]).as_json
+    nil_values_output = event.with_data(data: { key_with_nil: ['A', nil, nil] }, hidden_data: {}).as_json
+    expect(find_data_pair(nil_values_output, 'key_with_nil')['value']).to eq(['A'])
   end
 
   it 'handles objects that have JSON-friendly structures' do
-    event = described_class.new
-    output = event.with_data(as_json_object: has_as_json_class.new(:green, true)).as_json
+    output = event.with_data(data: { as_json_object: has_as_json_class.new('green', true) }, hidden_data: {}).as_json
+    expect(output['data'].first['value']).to eq ['{"colour":"green","is_cat":true}']
   end
 
   it 'handles arrays of JSON-friendly structures' do
-    event = described_class.new
-    output = event.with_data(
-      as_json_object: [has_as_json_class.new(:green, true)]
-    ).as_json
-    expect(output['data'].first['value']).to eq ['{"colour":"green","is_cat":true}']
+    output = event.with_data(data: { as_json_object: [has_as_json_class.new('green', true)] }, hidden_data: {}).as_json
+
+    expect(output['data']).not_to be_nil
+    expect(output['data']).not_to be_empty
+
+    found_key_value_pair = output['data'].find { |pair| pair['key'] == 'as_json_object' }
+    expect(found_key_value_pair).not_to be_nil
+    expect(found_key_value_pair['value']).to eq(['{"colour":"green","is_cat":true}'])
+  end
+
+  it 'behaves correctly when with_data is called with empty data and hidden_data' do
+    event.with_data(data: {}, hidden_data: {})
+    updated_event_hash = event.as_json
+    expect(updated_event_hash['data']).to eq([])
+    expect(updated_event_hash['hidden_data']).to eq([])
   end
 end
 
@@ -209,6 +216,35 @@ def as_json
     end
   end
 
+  describe 'custom
events with hidden_data' do + let(:type) { 'some_custom_event' } + + before do + allow(DfE::Analytics).to receive(:custom_events).and_return [type] + end + + it 'includes hidden_data in the event payload' do + event = DfE::Analytics::Event.new + .with_type(type) + .with_request_details(fake_request) + .with_namespace('some_namespace') + .with_data( + data: { some: 'custom details about event' }, + hidden_data: { some_hidden: 'some data to be hidden' } + ) + output = event.as_json + + visible_data = output['data'].find { |d| d['key'] == 'some' } + hidden_data = output['hidden_data'].find { |d| d['key'] == 'some_hidden' } + + expect(visible_data).not_to be_nil + expect(visible_data['value']).to eq(['custom details about event']) + + expect(hidden_data).not_to be_nil + expect(hidden_data['value']).to eq(['some data to be hidden']) + end + end + def fake_request(overrides = {}) attrs = { uuid: '123', diff --git a/spec/dfe/analytics/fields_spec.rb b/spec/dfe/analytics/fields_spec.rb index 545acbc8..9fba2cb9 100644 --- a/spec/dfe/analytics/fields_spec.rb +++ b/spec/dfe/analytics/fields_spec.rb @@ -5,11 +5,12 @@ t.string :email_address t.string :first_name t.string :last_name + t.string :dob end end - let(:existing_allowlist) { { Candidate.table_name.to_sym => ['email_address'] } } - let(:existing_blocklist) { { Candidate.table_name.to_sym => ['id'] } } + let(:existing_allowlist) { { Candidate.table_name.to_sym => %w[email_address] } } + let(:existing_blocklist) { { Candidate.table_name.to_sym => %w[id] } } before do allow(DfE::Analytics).to receive(:allowlist).and_return(existing_allowlist) @@ -46,7 +47,7 @@ describe '.conflicting_fields' do context 'when fields conflict' do - let(:existing_allowlist) { { Candidate.table_name.to_sym => %w[email_address id first_name] } } + let(:existing_allowlist) { { Candidate.table_name.to_sym => %w[email_address id first_name dob] } } let(:existing_blocklist) { { Candidate.table_name.to_sym => %w[email_address first_name] } } it 
'returns the conflicting fields' do @@ -111,5 +112,50 @@ end end end + + describe 'handling of hidden PII fields' do + let(:existing_allowlist) { { Candidate.table_name.to_sym => %w[id dob email_address] } } + let(:hidden_pii) { { Candidate.table_name.to_sym => %w[dob] } } + let(:allowlist_pii) { { Candidate.table_name.to_sym => %w[id email_address] } } + let(:existing_blocklist) { { Candidate.table_name.to_sym => %w[first_name last_name] } } + + before do + allow(DfE::Analytics).to receive(:hidden_pii).and_return(hidden_pii) + allow(DfE::Analytics).to receive(:allowlist_pii).and_return(allowlist_pii) + end + + describe '.hidden_pii' do + let(:hidden_pii) { { Candidate.table_name.to_sym => %w[dob] } } + it 'returns all the fields in the analytics_hidden_pii.yml file' do + expect(described_class.hidden_pii).to eq(hidden_pii) + end + end + + describe '.check!' do + context 'when there are no overlapping fields in hidden_pii and allowlist_pii' do + it 'does not raise an error' do + expect { DfE::Analytics::Fields.check! }.not_to raise_error + end + end + + context 'when there are overlapping fields in hidden_pii and allowlist_pii' do + let(:allowlist_pii) { { Candidate.table_name.to_sym => %w[id dob email_address] } } + + it 'raises an error' do + error_message = /PII configuration error detected! The following fields are listed in both hidden_pii and allowlist_pii/ + expect { DfE::Analytics::Fields.check! }.to raise_error(DfE::Analytics::ConfigurationError, error_message) + end + end + + context 'when hidden PII fields are improperly managed' do + let(:existing_allowlist) { { Candidate.table_name.to_sym => %w[email_address] } } + let(:allowlist_pii) { { Candidate.table_name.to_sym => %w[email_address] } } + + it 'raises an error about hidden PII fields not in allowlist' do + expect { described_class.check! 
}.to raise_error(DfE::Analytics::ConfigurationError, /New database field detected/) + end + end + end + end end end diff --git a/spec/dfe/analytics/load_entity_batch_spec.rb b/spec/dfe/analytics/load_entity_batch_spec.rb index 16b964d9..8a2c3c20 100644 --- a/spec/dfe/analytics/load_entity_batch_spec.rb +++ b/spec/dfe/analytics/load_entity_batch_spec.rb @@ -4,6 +4,7 @@ with_model :Candidate do table do |t| t.string :email_address + t.string :dob end end @@ -12,7 +13,7 @@ allow(DfE::Analytics::SendEvents).to receive(:perform_now) allow(DfE::Analytics).to receive(:allowlist).and_return({ - Candidate.table_name.to_sym => ['email_address'] + Candidate.table_name.to_sym => %w[email_address] }) end @@ -37,7 +38,7 @@ perform_enqueued_jobs do c = Candidate.create(email_address: '12345678910') c2 = Candidate.create(email_address: '12345678910') - stub_const('DfE::Analytics::LoadEntityBatch::BQ_BATCH_MAX_BYTES', 250) + stub_const('DfE::Analytics::LoadEntityBatch::BQ_BATCH_MAX_BYTES', 300) described_class.perform_now('Candidate', [c.id, c2.id], entity_tag) @@ -48,7 +49,7 @@ it 'doesn’t split a batch unless it has to' do c = Candidate.create(email_address: '12345678910') c2 = Candidate.create(email_address: '12345678910') - stub_const('DfE::Analytics::LoadEntityBatch::BQ_BATCH_MAX_BYTES', 500) + stub_const('DfE::Analytics::LoadEntityBatch::BQ_BATCH_MAX_BYTES', 550) described_class.perform_now('Candidate', [c.id, c2.id], entity_tag) @@ -74,5 +75,53 @@ expect(DfE::Analytics::SendEvents).to have_received(:perform_now).once end end + + context 'when both allowed data and hidden data is present' do + before do + allow(DfE::Analytics).to receive(:allowlist).and_return({ + Candidate.table_name.to_sym => %w[email_address dob] + }) + + allow(DfE::Analytics).to receive(:hidden_pii).and_return({ + Candidate.table_name.to_sym => %w[dob] + }) + end + + it 'includes both allowed and hidden data in the event when present' do + candidate = Candidate.create(email_address: 'test@example.com', 
dob: '20062000') + + described_class.new.perform(model_class, [candidate.id], entity_tag) + + expect(DfE::Analytics::SendEvents).to have_received(:perform_now) do |events| + expect(events.first['data']).not_to be_empty + expect(events.first['hidden_data']).not_to be_empty + end + end + + it 'splits a batch when the batch is too big, including hidden data' do + perform_enqueued_jobs do + c = Candidate.create(email_address: '12345678910', dob: '12072000') + c2 = Candidate.create(email_address: '12345678910', dob: '12072000') + + stub_const('DfE::Analytics::LoadEntityBatch::BQ_BATCH_MAX_BYTES', 300) + + described_class.perform_now('Candidate', [c.id, c2.id], entity_tag) + + expect(DfE::Analytics::SendEvents).to have_received(:perform_now).twice + end + end + + it 'does not split a batch if the payload size is below the threshold' do + perform_enqueued_jobs do + c = Candidate.create(email_address: '12345678910') + c2 = Candidate.create(email_address: '12345678910') + stub_const('DfE::Analytics::LoadEntityBatch::BQ_BATCH_MAX_BYTES', 1000) + + described_class.perform_now('Candidate', [c.id, c2.id], entity_tag) + + expect(DfE::Analytics::SendEvents).to have_received(:perform_now).once + end + end + end end end diff --git a/spec/dfe/analytics/send_events_spec.rb b/spec/dfe/analytics/send_events_spec.rb index 7fdb3920..39996d77 100644 --- a/spec/dfe/analytics/send_events_spec.rb +++ b/spec/dfe/analytics/send_events_spec.rb @@ -15,18 +15,120 @@ let(:events) { [event.as_json] } + let(:hidden_pii_event) do + { + 'entity_table_name' => 'user_profiles', + 'event_type' => 'update_entity', + 'data' => [ + { 'key' => 'email', 'value' => 'user@example.com' }, + { 'key' => 'phone_number', 'value' => '1234567890' } + ], + 'hidden_data' => [ + { 'key' => 'dob', 'value' => '20/06/1990' }, + { 'key' => 'first_name', 'value' => 'Sarah' } + ] + } + end + describe '#perform' do subject(:perform) { described_class.new.perform(events) } context 'when "log_only" is set' do before do 
allow(DfE::Analytics).to receive(:log_only?).and_return true + allow(Rails.logger).to receive(:info) end it 'does not go call bigquery apis' do expect(DfE::Analytics::BigQueryLegacyApi).not_to receive(:insert).with(events) perform end + + it 'logs events with all sensitive data masked' do + expect(Rails.logger).to receive(:info) do |log_message| + expect(log_message).to include('"key"=>"dob", "value"=>["HIDDEN"]') + expect(log_message).to include('"key"=>"first_name", "value"=>["HIDDEN"]') + expect(log_message).to include('"key"=>"email", "value"=>"user@example.com"') + expect(log_message).to include('"key"=>"phone_number", "value"=>"1234567890"') + end + + described_class.new.perform([hidden_pii_event]) + end + end + + describe 'Masking hidden_pii when event_debug_enabled?' do + subject(:perform) { described_class.new.perform(events) } + + let(:hidden_pii_event) do + { + 'entity_table_name' => 'user_profiles', + 'event_type' => 'update_entity', + 'data' => [ + { 'key' => 'email', 'value' => 'user@example.com' }, + { 'key' => 'phone_number', 'value' => '1234567890' } + ], + 'hidden_data' => [ + { 'key' => 'dob', 'value' => '20/06/1990' }, + { 'key' => 'first_name', 'value' => 'Sarah' } + ] + } + end + + let(:event_debug_filters) do + { + event_filters: [ + { + event_type: 'update_entity', + entity_table_name: 'user_profiles', + data: { + key: 'dob', + value: '20/06/1990' + } + }, + { + event_type: 'update_entity', + entity_table_name: 'user_profiles', + data: { + key: 'first_name', + value: 'Sarah' + } + }, + { + event_type: 'update_entity', + entity_table_name: 'user_profiles', + data: { + key: 'email', + value: 'user@example.com' + } + }, + { + event_type: 'update_entity', + entity_table_name: 'user_profiles', + data: { + key: 'phone_number', + value: '1234567890' + } + } + ] + } + end + + before do + allow(DfE::Analytics).to receive(:event_debug_filters).and_return(event_debug_filters) + allow(DfE::Analytics::BigQueryLegacyApi).to receive(:insert) + 
allow(Rails.logger).to receive(:info) + end + + it 'masks sensitive data in the log output' do + expect(Rails.logger).to receive(:info) do |log_message| + expect(log_message).to include('"key"=>"dob", "value"=>["HIDDEN"]') + expect(log_message).to include('"key"=>"first_name", "value"=>["HIDDEN"]') + expect(log_message).to include('"key"=>"email", "value"=>"user@example.com"') + expect(log_message).to include('"key"=>"phone_number", "value"=>"1234567890"') + end + + described_class.new.perform([hidden_pii_event]) + end end describe 'logging events for event debug' do diff --git a/spec/dfe/analytics_spec.rb b/spec/dfe/analytics_spec.rb index 147c60da..5013ab28 100644 --- a/spec/dfe/analytics_spec.rb +++ b/spec/dfe/analytics_spec.rb @@ -177,6 +177,61 @@ end end + describe '.extract_model_attributes' do + with_model :Candidate do + table do |t| + t.string :email_address + t.string :hidden_data + t.integer :age + end + end + + before do + allow(DfE::Analytics).to receive(:allowlist).and_return({ + Candidate.table_name.to_sym => %w[email_address hidden_data age] + }) + allow(DfE::Analytics).to receive(:allowlist_pii).and_return({ + Candidate.table_name.to_sym => %w[email_address] + }) + allow(DfE::Analytics).to receive(:hidden_pii).and_return({ + Candidate.table_name.to_sym => %w[hidden_data age] + }) + end + + let(:candidate) { Candidate.create(email_address: 'test@example.com', hidden_data: 'secret', age: 50) } + + it 'correctly separates and obfuscates attributes' do + result = described_class.extract_model_attributes(candidate) + + expect(result[:data].keys).to include('email_address') + expect(result[:data]['email_address']).to_not eq(candidate.email_address) + + expect(result[:hidden_data]['hidden_data']).to eq('secret') + expect(result[:hidden_data]['age']).to eq(50) + end + + it 'correctly separates allowed and hidden attributes' do + result = described_class.extract_model_attributes(candidate) + + expect(result[:data].keys).to include('email_address') + 
expect(result[:data]).not_to have_key('hidden_data') + expect(result[:data]).not_to have_key('age') + + expect(result[:hidden_data]['hidden_data']).to eq('secret') + expect(result[:hidden_data]['age']).to eq(50) + end + + it 'does not error if no hidden data is sent' do + candidate = Candidate.create(email_address: 'test@example.com') + allow(DfE::Analytics).to receive(:allowlist).and_return(Candidate.table_name.to_sym => %w[email_address]) + + result = described_class.extract_model_attributes(candidate) + expect(result[:data].keys).to include('email_address') + expect(result[:hidden_data]).to be_nil.or be_empty + expect { DfE::Analytics.extract_model_attributes(candidate) }.not_to raise_error + end + end + describe '.parse_maintenance_window' do context 'with a valid maintenance window' do before do diff --git a/spec/dummy/config/analytics_hidden_pii.yml b/spec/dummy/config/analytics_hidden_pii.yml new file mode 100644 index 00000000..3a42532a --- /dev/null +++ b/spec/dummy/config/analytics_hidden_pii.yml @@ -0,0 +1,2 @@ +--- +shared: {}
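The masking added to `DfE::Analytics::SendEvents` above can be illustrated in isolation. The sketch below is a simplified, dependency-free version of `mask_hidden_data`: it deep-copies the event (via `Marshal`, a stand-in for Rails' `deep_dup`) and replaces each top-level `value` under `hidden_data` with `['HIDDEN']`, leaving both `data` and the original event untouched. The nested-key masking branch from the diff is omitted here for brevity.

```ruby
# Simplified sketch of SendEvents#mask_hidden_data (no gem/Rails dependency).
# Assumes events are plain hashes with string keys, as in the specs above.
def mask_hidden_data(event)
  masked = Marshal.load(Marshal.dump(event)) # deep copy of a plain hash
  hidden = masked['hidden_data']
  return event unless hidden.is_a?(Array)

  hidden.each do |pair|
    next unless pair.is_a?(Hash)

    # Replace the sensitive value; keep the key so logs stay inspectable.
    pair['value'] = ['HIDDEN'] if pair['value']
  end
  masked
end

event = {
  'event_type' => 'update_entity',
  'data' => [{ 'key' => 'email', 'value' => ['user@example.com'] }],
  'hidden_data' => [{ 'key' => 'dob', 'value' => ['20/06/1990'] }]
}

masked = mask_hidden_data(event)
puts masked['hidden_data'].first['value'].inspect # => ["HIDDEN"]
puts event['hidden_data'].first['value'].inspect  # original unchanged => ["20/06/1990"]
```

Because only the copy is mutated, the unmasked event still reaches BigQuery; masking applies solely to log output.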
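The configuration check added to `DfE::Analytics::Fields` enforces that a field appears in at most one of `analytics_pii.yml` and `analytics_hidden_pii.yml`. A standalone sketch of that set intersection, where the hashes stand in for the parsed YAML files (entity name mapped to field list):

```ruby
# Sketch of Fields.overlapping_pii_fields: returns field names listed for
# the same entity in both hidden_pii and allowlist_pii. Input hashes here
# are hypothetical stand-ins for the parsed config files.
def overlapping_pii_fields(hidden_pii, allowlist_pii)
  hidden_pii.each_with_object([]) do |(entity, fields), overlaps|
    next if fields.nil? || fields.empty?

    # Array#& gives the intersection of the two field lists.
    overlaps.concat(fields & (allowlist_pii[entity] || []))
  end
end

hidden    = { candidates: %w[dob], courses: [] }
allowlist = { candidates: %w[dob email_address] }

puts overlapping_pii_fields(hidden, allowlist).inspect # => ["dob"]
```

A non-empty result makes `Fields.check!` raise `ConfigurationError`, as exercised in `fields_spec.rb` above.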