Send config events #75

Merged
merged 6 commits into from
May 26, 2023
29 changes: 22 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

**👉 Send every web request and database update to BigQuery**

**✋ Skip or anonymise fields containing PII**
**✋ Skip or pseudonymise fields containing PII**. For an explanation of pseudonymisation, see [ICO Guidance](https://ico.org.uk/media/about-the-ico/consultations/4019579/chapter-3-anonymisation-guidance.pdf)

**✌️ Configure and forget**

Expand Down Expand Up @@ -250,15 +250,27 @@ If a field other than `id` is required for the user identifier, then a custom us
DfE::Analytics.config.user_identifier = proc { |user| user&.id }
```

#### User ID anonymisation
#### User ID pseudonymisation

The `user_id` in the web request event will not be anonymised by default. This can be changed by updating the configuration option in `config/initializers/dfe_analytics.rb`:
The `user_id` in the web request event will not be pseudonymised by default. This can be changed by updating the configuration option in `config/initializers/dfe_analytics.rb`:

```ruby
DfE::Analytics.config.anonymise_web_request_user_id = false
DfE::Analytics.config.pseudonymise_web_request_user_id = false
```

Anonymisation of `user_id` would be required if the source field in the schema is in `analytics_pii.yml` so that analysts can join the IDs together. If the `user_id` is not in `analytics_pii.yml` but is in `analytics.yml` then `user_id` anonymisation would *not* be required so that the IDs could still be joined together.
Pseudonymisation of `user_id` would be required if the source field in the schema is in `analytics_pii.yml` so that analysts can join the IDs together. If the `user_id` is not in `analytics_pii.yml` but is in `analytics.yml` then `user_id` pseudonymisation would *not* be required so that the IDs could still be joined together.

### Data Pseudonymisation Algorithm

Generally, all PII data should be pseudonymised, including data that directly or indirectly references PII, for example database IDs.

The `dfe-analytics` gem also pseudonymises such data, if it is configured to do so. If you are pseudonymising database IDs in your code (in custom events for example), then you should use the same hashing algorithm for pseudonymisation that the gem uses in order to allow joining of pseudonymised data across different database tables.

The following method should be used in your code for pseudonymisation:

```ruby
DfE::Analytics.pseudonymise(value)
```

Contributor

@asatwal worth adding a note here to say that the GoogleSQL equivalent of this is TO_HEX(SHA256(value)), and thus SQL can be used to anonymise data, but (obviously) not to de-anonymise it?

Contributor

I think that could usefully go into a comment in the anonymise method? I like that callers just use anonymise and don't have to worry about the algorithm. Not that we're likely to change it, but it feels hygienic.

Collaborator Author

@asatwal asatwal May 19, 2023

Agree. I'm trying to keep too much technical detail out of the README anyway, as it's getting very long.
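
The equivalence raised in this thread, and the README's point about using one algorithm so that pseudonymised IDs can be joined across tables, can be sketched in plain Ruby. The snippet below is a minimal, self-contained illustration that mirrors the `pseudonymise` body shown in this PR's diff (`Digest::SHA2.hexdigest(value.to_s)`). The module name `PseudonymiseSketch` and the sample IDs are hypothetical and are not part of the gem.

```ruby
require 'digest'

# Hypothetical stand-in for DfE::Analytics.pseudonymise, copied from the
# method body in this PR's diff. Digest::SHA2 defaults to 256 bits, so the
# result is the same hex string that Digest::SHA256 produces.
module PseudonymiseSketch
  def self.pseudonymise(value)
    Digest::SHA2.hexdigest(value.to_s)
  end
end

# The same source ID hashed in two places (for example an entity event and a
# custom event) yields the same digest, which is what makes joins possible.
# Because of `to_s`, the integer 42 and the string '42' hash identically.
entity_side = PseudonymiseSketch.pseudonymise(42)
custom_side = PseudonymiseSketch.pseudonymise('42')

puts entity_side == custom_side
puts entity_side == Digest::SHA256.hexdigest('42')
puts entity_side.length
```

Calling the shared method rather than hashing by hand keeps callers independent of the algorithm, which is the design preference expressed in the thread above.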

### Adding specs

Expand Down Expand Up @@ -432,8 +444,11 @@ Please note that page caching is project specific and each project must carefull
> It could be nice to have tests to prove that connectivity to GCP still works after an update, but we aren't setup for that yet.
3. (Optional) Verify committed `CHANGELOG.md` changes and alter if necessary: `git show`
4. Push the branch: `git push origin v${NEW_VERSION}-release`, e.g. `git push origin v1.3.0-release`
5. Push the tags: `git push --tags`
6. Cut a PR on GitHub with the label `version-release`, and merge once approved
5. Cut a PR on GitHub with the label `version-release`, and wait for approval
6. Once the PR is approved, push the tags immediately prior to merging: `git push --tags`
Contributor

Thank you — that was bugging me!

7. Merge the PR.

IMPORTANT: Pushing the tags will immediately make the release available, even on an unmerged branch. Therefore, push the tags to GitHub only when the PR is approved and immediately prior to merging the PR.

## License

Expand Down
4 changes: 2 additions & 2 deletions config/locales/en.yml
Expand Up @@ -53,9 +53,9 @@ en:
return the identifier for the user. This is useful for systems with
users that don't use the id field.
default: proc { |user| user&.id }
anonymise_web_request_user_id:
pseudonymise_web_request_user_id:
description: |
Whether to anonymise the user_id field in the web request event.
Whether to pseudonymise the user_id field in the web request event.
default: false
rack_page_cached:
description: |
Expand Down
38 changes: 22 additions & 16 deletions lib/dfe/analytics.rb
Expand Up @@ -12,6 +12,7 @@
require 'dfe/analytics/load_entities'
require 'dfe/analytics/load_entity_batch'
require 'dfe/analytics/requests'
require 'dfe/analytics/initialise'
require 'dfe/analytics/version'
require 'dfe/analytics/middleware/request_identity'
require 'dfe/analytics/middleware/send_cached_page_request_event'
Expand Down Expand Up @@ -58,7 +59,7 @@ def self.config
enable_analytics
environment
user_identifier
anonymise_web_request_user_id
pseudonymise_web_request_user_id
rack_page_cached
]

Expand All @@ -68,20 +69,20 @@ def self.config
def self.configure
yield(config)

config.enable_analytics ||= proc { true }
config.bigquery_table_name ||= ENV['BIGQUERY_TABLE_NAME']
config.bigquery_project_id ||= ENV['BIGQUERY_PROJECT_ID']
config.bigquery_dataset ||= ENV['BIGQUERY_DATASET']
config.bigquery_api_json_key ||= ENV['BIGQUERY_API_JSON_KEY']
config.bigquery_retries ||= 3
config.bigquery_timeout ||= 120
config.environment ||= ENV.fetch('RAILS_ENV', 'development')
config.log_only ||= false
config.async ||= true
config.queue ||= :default
config.user_identifier ||= proc { |user| user&.id }
config.anonymise_web_request_user_id ||= false
config.rack_page_cached ||= proc { |_rack_env| false }
config.enable_analytics ||= proc { true }
config.bigquery_table_name ||= ENV['BIGQUERY_TABLE_NAME']
config.bigquery_project_id ||= ENV['BIGQUERY_PROJECT_ID']
config.bigquery_dataset ||= ENV['BIGQUERY_DATASET']
config.bigquery_api_json_key ||= ENV['BIGQUERY_API_JSON_KEY']
config.bigquery_retries ||= 3
config.bigquery_timeout ||= 120
config.environment ||= ENV.fetch('RAILS_ENV', 'development')
config.log_only ||= false
config.async ||= true
config.queue ||= :default
config.user_identifier ||= proc { |user| user&.id }
config.pseudonymise_web_request_user_id ||= false
config.rack_page_cached ||= proc { |_rack_env| false }
end

def self.initialize!
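
One Ruby detail worth keeping in mind when reading the `configure` defaults in this hunk: `||=` assigns whenever the current value is `nil` or `false`, so an explicit `false` is replaced by a truthy default such as the one on `config.async`. The sketch below demonstrates this language behaviour with a hypothetical `SketchConfig` struct (not the gem's config object) and two hypothetical helper methods. It illustrates the behaviour only and does not describe how the gem should be changed.

```ruby
# Hypothetical stand-in for the gem's config object.
SketchConfig = Struct.new(:async, :log_only)

# Mirrors the `||=` style used in the diff.
def apply_defaults_with_or_equals(config)
  config.async ||= true      # nil AND false are both replaced by true
  config.log_only ||= false  # harmless: nil and false both end up false
  config
end

# Alternative: only fill in a default when nothing was set.
def apply_defaults_when_nil(config)
  config.async = true if config.async.nil?
  config.log_only = false if config.log_only.nil?
  config
end

opted_out = SketchConfig.new(false, nil) # caller explicitly sets async = false

puts apply_defaults_with_or_equals(opted_out.dup).async # explicit false is overwritten
puts apply_defaults_when_nil(opted_out.dup).async       # explicit false is kept
```

Options whose default is `false`, such as `pseudonymise_web_request_user_id`, are unaffected by this, because `nil` and `false` both resolve to `false`.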
Expand Down Expand Up @@ -177,10 +178,15 @@ def self.extract_model_attributes(model, attributes = nil)
allowed_attributes = attributes.slice(*exportable_attrs&.map(&:to_s))
obfuscated_attributes = attributes.slice(*exportable_pii_attrs&.map(&:to_s))

allowed_attributes.deep_merge(obfuscated_attributes.transform_values { |value| anonymise(value) })
allowed_attributes.deep_merge(obfuscated_attributes.transform_values { |value| pseudonymise(value) })
end

def self.anonymise(value)
pseudonymise(value)
end

def self.pseudonymise(value)
# Google SQL equivalent of this is TO_HEX(SHA256(value))
Digest::SHA2.hexdigest(value.to_s)
end

Expand Down
17 changes: 11 additions & 6 deletions lib/dfe/analytics/event.rb
Expand Up @@ -5,7 +5,14 @@
module DfE
module Analytics
class Event
EVENT_TYPES = %w[web_request create_entity update_entity delete_entity import_entity].freeze
EVENT_TYPES = %w[
web_request
create_entity
update_entity
delete_entity
import_entity
initialise_analytics
].freeze

def initialize
time_zone = 'London'
Expand All @@ -24,9 +31,7 @@ def with_type(type)
allowed_types = EVENT_TYPES + DfE::Analytics.custom_events
raise 'Invalid analytics event type' unless allowed_types.include?(type.to_s)

@event_hash.merge!(
event_type: type
)
@event_hash.merge!(event_type: type)

self
end
Expand Down Expand Up @@ -119,7 +124,7 @@ def hash_to_kv_pairs(hash)
end

def anonymised_user_agent_and_ip(rack_request)
DfE::Analytics.anonymise(rack_request.user_agent.to_s + rack_request.remote_ip.to_s) if rack_request.remote_ip.present?
DfE::Analytics.pseudonymise(rack_request.user_agent.to_s + rack_request.remote_ip.to_s) if rack_request.remote_ip.present?
end

def ensure_utf8(str)
Expand All @@ -128,7 +133,7 @@ def ensure_utf8(str)

def user_identifier_for(user)
user_id = DfE::Analytics.user_identifier(user)
user_id = DfE::Analytics.anonymise(user_id) if user_id.present? && DfE::Analytics.config.anonymise_web_request_user_id
user_id = DfE::Analytics.pseudonymise(user_id) if user_id.present? && DfE::Analytics.config.pseudonymise_web_request_user_id

user_id
end
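
The effect of the renamed flag on `user_identifier_for` can be sketched outside the gem. In the snippet below, `sketch_user_identifier` and its `pseudonymise_flag:` keyword are hypothetical names standing in for the method and the `pseudonymise_web_request_user_id` option, and a plain emptiness check is used as a rough stand-in for ActiveSupport's `present?`.

```ruby
require 'digest'

# Hypothetical helper: returns the raw ID unless the flag is set and an ID exists.
def sketch_user_identifier(user_id, pseudonymise_flag:)
  has_id = !user_id.nil? && !user_id.to_s.strip.empty? # rough stand-in for present?
  return user_id unless has_id && pseudonymise_flag

  # Same hashing call as the pseudonymise method in this PR's diff.
  Digest::SHA2.hexdigest(user_id.to_s)
end

puts sketch_user_identifier(123, pseudonymise_flag: false)        # raw ID, the default behaviour
puts sketch_user_identifier(123, pseudonymise_flag: true).length  # length of the hex digest
puts sketch_user_identifier(nil, pseudonymise_flag: true).inspect # a missing ID passes through
```

Going by the condition in `user_identifier_for`, the web request `user_id` is only hashed when the option is truthy, so a service that wants this behaviour would set `DfE::Analytics.config.pseudonymise_web_request_user_id = true` in its initializer.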
Expand Down
55 changes: 55 additions & 0 deletions lib/dfe/analytics/initialise.rb
@@ -0,0 +1,55 @@
# frozen_string_literal: true

module DfE
module Analytics
# DfE Analytics initialisation event
# - Event should only be sent once, but NOT on startup as this causes errors on some services
# - Event contains the dfe analytics version, config and other items
class Initialise
# Disable rubocop class variable warnings for class - class variable required to control sending of event
# rubocop:disable Style:ClassVars
@@initialise_event_sent = false # rubocop:disable Style:ClassVars

def self.trigger_initialise_event
new.send_initialise_event
end

def self.initialise_event_sent?
@@initialise_event_sent
end

def self.initialise_event_sent=(value)
@@initialise_event_sent = value # rubocop:disable Style:ClassVars
end

def send_initialise_event
return unless DfE::Analytics.enabled?

initialise_event = DfE::Analytics::Event.new
.with_type('initialise_analytics')
.with_data(initialisation_data)
.as_json

if DfE::Analytics.async?
DfE::Analytics::SendEvents.perform_later([initialise_event])
else
DfE::Analytics::SendEvents.perform_now([initialise_event])
end

@@initialise_event_sent = true # rubocop:disable Style:ClassVars
end

private

def initialisation_data
{
analytics_version: DfE::Analytics::VERSION,
config: {
pseudonymise_web_request_user_id: DfE::Analytics.config.pseudonymise_web_request_user_id
}
}
end
# rubocop:enable Style:ClassVars
end
end
end
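
The send-once guard in `Initialise` relies on a class variable that is checked in `SendEvents.do` and set after the event has been queued. The following is a minimal sketch of the same once-only idea using a class-level instance variable guarded by a `Mutex`, so that two concurrent callers cannot both pass the check. `OnceOnlySketch` and its block-based API are hypothetical and are not how the gem implements it.

```ruby
# Hypothetical once-only trigger; the block stands in for sending the event.
class OnceOnlySketch
  @sent = false
  @lock = Mutex.new

  class << self
    def sent?
      @lock.synchronize { @sent }
    end

    # Runs the block at most once across all callers; returns true if it ran.
    def trigger
      @lock.synchronize do
        return false if @sent

        yield
        @sent = true
      end
    end

    # Lets specs reset the state between examples.
    def reset!
      @lock.synchronize { @sent = false }
    end
  end
end

calls = 0
3.times { OnceOnlySketch.trigger { calls += 1 } }
puts calls
```

In this sketch the flag is only set after the block returns, so a block that raises leaves the guard unset and a later call can retry. Whether that is desirable for the initialise event is a design choice the PR does not discuss.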
3 changes: 3 additions & 0 deletions lib/dfe/analytics/send_events.rb
Expand Up @@ -4,6 +4,9 @@ module DfE
module Analytics
class SendEvents < AnalyticsJob
def self.do(events)
# The initialise event is a one-off event that must be sent to BigQuery once only
DfE::Analytics::Initialise.trigger_initialise_event unless DfE::Analytics::Initialise.initialise_event_sent?

events = events.map { |event| event.is_a?(Event) ? event.as_json : event }

if DfE::Analytics.async?
Expand Down
2 changes: 1 addition & 1 deletion spec/dfe/analytics/entities_spec.rb
Expand Up @@ -64,7 +64,7 @@
it 'sends events that are valid according to the schema' do
Candidate.create

expect(DfE::Analytics::SendEvents).to have_received(:perform_later) do |payload|
expect(DfE::Analytics::SendEvents).to have_received(:perform_later).once do |payload|
schema = DfE::Analytics::EventSchema.new.as_json
schema_validator = JSONSchemaValidator.new(schema, payload.first)

Expand Down
12 changes: 6 additions & 6 deletions spec/dfe/analytics/event_spec.rb
Expand Up @@ -125,7 +125,7 @@ def as_json
describe 'with_user' do
let(:regular_user_class) { Struct.new(:id) }

it 'uses user.id by default without anonymisation' do
it 'uses user.id by default without pseudonymisation' do
event = described_class.new
id = rand(1000)
output = event.with_user(regular_user_class.new(id)).as_json
Expand All @@ -148,18 +148,18 @@ def as_json
end
end

context 'anonymisation of user_id' do
context 'pseudonymisation of user_id' do
before do
allow(DfE::Analytics.config).to receive(:anonymise_web_request_user_id).and_return(true)
allow(DfE::Analytics.config).to receive(:pseudonymise_web_request_user_id).and_return(true)
end

it 'anonymises the user id' do
it 'pseudonymises the user id' do
event = described_class.new
uuid = SecureRandom.uuid
anonymised_uuid = Digest::SHA2.hexdigest(uuid)
pseudonymised_uuid = Digest::SHA2.hexdigest(uuid)
output = event.with_user(regular_user_class.new(uuid)).as_json

expect(output['user_id']).to eq anonymised_uuid
expect(output['user_id']).to eq pseudonymised_uuid
end
end
end
Expand Down
31 changes: 31 additions & 0 deletions spec/dfe/analytics/initialise_spec.rb
@@ -0,0 +1,31 @@
# frozen_string_literal: true

RSpec.describe DfE::Analytics::Initialise do
before do
allow(DfE::Analytics::SendEvents).to receive(:perform_later)
allow(DfE::Analytics).to receive(:enabled?).and_return(true)
end

describe 'trigger_initialise_event ' do
it 'includes the expected attributes' do
described_class.trigger_initialise_event

expect(DfE::Analytics::SendEvents).to have_received(:perform_later)
.with([a_hash_including({
'event_type' => 'initialise_analytics',
'data' => [
{ 'key' => 'analytics_version', 'value' => [DfE::Analytics::VERSION] },
{ 'key' => 'config',
'value' => ['{"pseudonymise_web_request_user_id":false}'] }
]
})])
end
end

describe '.initialise_event_sent=' do
it 'allows setting of the class variable' do
described_class.initialise_event_sent = true
expect(described_class.initialise_event_sent?).to eq(true)
end
end
end
8 changes: 8 additions & 0 deletions spec/dfe/analytics_spec.rb
Expand Up @@ -5,6 +5,14 @@
expect(DfE::Analytics::VERSION).not_to be nil
end

it 'supports the pseudonymise method' do
expect(DfE::Analytics.pseudonymise('foo_bar')).to eq('4928cae8b37b3d1113f5e01e60c967df6c2b9e826dc7d91488d23a62fec715ba')
end

it 'supports the anonymise method for backwards compatibility' do
expect(DfE::Analytics.anonymise('foo_bar')).to eq('4928cae8b37b3d1113f5e01e60c967df6c2b9e826dc7d91488d23a62fec715ba')
end

it 'has documentation entries for all the config options' do
config_options = DfE::Analytics.config.members

Expand Down