Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send config events #75

Merged
merged 6 commits into from
May 26, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 17 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -260,6 +260,18 @@ DfE::Analytics.config.anonymise_web_request_user_id = false

Anonymisation of `user_id` would be required if the source field in the schema is in `analytics_pii.yml` so that analysts can join the IDs together. If the `user_id` is not in `analytics_pii.yml` but is in `analytics.yml` then `user_id` anonymisation would *not* be required so that the IDs could still be joined together.

### Data Anonymisation Algorithm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal should we label this pseudonymisation rather than anonymisation to match ICO terminology? Technically the approach we take is not true anonymisation, so under GDPR is still PII. The advantage is masking which helps with security, rather than eliminating the need for it.

And also in other places where we've used the term.

See https://ico.org.uk/media/about-the-ico/consultations/4019579/chapter-3-anonymisation-guidance.pdf for more info

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenleggdfe agree that pseudonymisation would be better terminology. However, historically it's been called anonymisation in the gem and there are methods such as DfE::Analytics.anonymise that are used externally and config variables that would need to be updated as well as all the other documentation. For consistency we would have to update all these and this would be part of a bigger change with external dependencies, so I don't want to tackle this just yet.

Strictly speaking, this is just a hashing algorithm, so I would say it's more correct to call this "hashing" rather than "anonymisation", but that's a different story.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenleggdfe I decided to action this review comment in the end as this is a sensitive and important area.

The only item I didn't change is the anonymised_user_agent_and_ip field in event data for web requests. I thought this item would require a big migration and also this data doesn't identify a user right ? Only a user session ?

I've kept the DfE:Analytics.anonymise for backwards compatibility. We can deprecate in future releases. I believe the only team that use this is tv, although I think it's best to send out deprecation warnings and remove in a more release friendly way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me - thank you!


Generally all PII data should be anonymised, including data that directly or indirect references PII, for example database IDs.

The `dfe-analytics` gem also anonymises such data, if it is configured to do so. If you are anonymising database IDs in your code (in custom events for example), then you should uses the same hashing algorithm for anonymisation that the gem uses.

The following method should be used in your code for anonymisation:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal worth explaining why? i.e. "... to ensure that pseudonymised data can still be joined to other pseudonymised data"


```ruby
DfE::Analytics.anonymise(value)
```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal worth adding a note here to say that the GoogleSQL equivalent of this is TO_HEX(SHA256(value)), and thus SQL can be used to anonymise data, but (obviously) not to de-anonymise it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that could usefully go into a comment in the anonymise method? I like that callers just use anonymise and don't have to worry about the algorithm. Not that we're likely to change it, but it feels hygienic.

Copy link
Collaborator Author

@asatwal asatwal May 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I'm trying to keep too much technical details out of the READ.me anyway as it's getting very long.

### Adding specs

#### Testing modes
Expand Down Expand Up @@ -432,8 +444,11 @@ Please note that page caching is project specific and each project must carefull
> It could be nice to have tests to prove that connectivity to GCP still works after an update, but we aren't setup for that yet.
3. (Optional) Verify committed `CHANGELOG.md` changes and alter if necessary: `git show`
4. Push the branch: `git push origin v${NEW_VERSION}-release`, e.g. `git push origin v1.3.0-release`
5. Push the tags: `git push --tags`
6. Cut a PR on GitHub with the label `version-release`, and merge once approved
5. Cut a PR on GitHub with the label `version-release`, and wait for approval
6. Once the PR is approved push the tags, immediately prior to merging: `git push --tags`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you — that was bugging me!

7. Merge the PR.

IMPORTANT: Pushing the tags will immediately make the release available even on a unmerged branch. Therefore, push the tags to Github only when the PR is approved and immediately prior to merging the PR.

## License

Expand Down
4 changes: 4 additions & 0 deletions lib/dfe/analytics.rb
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
require 'dfe/analytics/load_entities'
require 'dfe/analytics/load_entity_batch'
require 'dfe/analytics/requests'
require 'dfe/analytics/initialise'
require 'dfe/analytics/version'
require 'dfe/analytics/middleware/request_identity'
require 'dfe/analytics/middleware/send_cached_page_request_event'
Expand Down Expand Up @@ -103,6 +104,9 @@ def self.initialize!
model.include(DfE::Analytics::Entities)
end
end

DfE::Analytics::Initialise.trigger_initialise_event
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have an init routine in railtie.rb — could we use that?

Copy link
Collaborator Author

@asatwal asatwal May 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In railtie.rb, I find the App isn't fully initialised, although it depends on where in railtie the methods are called. I suppose it could be called in the after_initialize hook after the DfE::Analytics.initialize!call, after which the config is fully initialised, as we need the config loaded before calling trigger_initialise_event.

I'll give that a try.

So I did try the suggestion of moving to railtie.rb and it does work. However, it's not possible to have collaboration tests, so if the call to DfE::Analytics.initialise.trigger_initialise_event ever got removed, the tests would not pick that up. For this reason I've left the trigger called in DfE::Analytics.initialize!.


rescue ActiveRecord::PendingMigrationError
Rails.logger.info('Database requires migration; DfE Analytics not initialized')
rescue ActiveRecord::ActiveRecordError
Expand Down
13 changes: 9 additions & 4 deletions lib/dfe/analytics/event.rb
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,14 @@
module DfE
module Analytics
class Event
EVENT_TYPES = %w[web_request create_entity update_entity delete_entity import_entity].freeze
EVENT_TYPES = %w[
web_request
create_entity
update_entity
delete_entity
import_entity
analytics_initialise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asatwal maybe ignore this (!) but given we've used verb_noun form for the entity event types what would you think of using initialise_analytics for this new event type for consistency?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenleggdfe yes initialise_analytics sounds better !

].freeze

def initialize
time_zone = 'London'
Expand All @@ -24,9 +31,7 @@ def with_type(type)
allowed_types = EVENT_TYPES + DfE::Analytics.custom_events
raise 'Invalid analytics event type' unless allowed_types.include?(type.to_s)

@event_hash.merge!(
event_type: type
)
@event_hash.merge!(event_type: type)

self
end
Expand Down
34 changes: 34 additions & 0 deletions lib/dfe/analytics/initialise.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# frozen_string_literal: true

module DfE
module Analytics
# Send event on dfe analytics startup during initialisation
# - Event contains the dfe analytics version, config and other items
class Initialise
def self.trigger_initialise_event
new.send_initialise_event
end

def send_initialise_event
return unless DfE::Analytics.enabled?

initialise_event = DfE::Analytics::Event.new
.with_type('analytics_initialise')
.with_data(initialise_data)

DfE::Analytics::SendEvents.do([initialise_event.as_json])
end

private

def initialise_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this initialisation_data?

{
analytics_version: DfE::Analytics::VERSION,
config: {
anonymise_web_request_user_id: DfE::Analytics.config.anonymise_web_request_user_id
}
}
end
end
end
end
4 changes: 2 additions & 2 deletions spec/dfe/analytics/entities_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -64,7 +64,7 @@
it 'sends events that are valid according to the schema' do
Candidate.create

expect(DfE::Analytics::SendEvents).to have_received(:perform_later) do |payload|
expect(DfE::Analytics::SendEvents).to have_received(:perform_later).twice do |payload|
schema = DfE::Analytics::EventSchema.new.as_json
schema_validator = JSONSchemaValidator.new(schema, payload.first)

Expand Down Expand Up @@ -162,7 +162,7 @@
entity = Candidate.create
entity.update(email_address: 'bar@baz.com')

expect(DfE::Analytics::SendEvents).to have_received(:perform_later).twice do |payload|
expect(DfE::Analytics::SendEvents).to have_received(:perform_later).thrice do |payload|
schema = DfE::Analytics::EventSchema.new.as_json
schema_validator = JSONSchemaValidator.new(schema, payload.first)

Expand Down
24 changes: 24 additions & 0 deletions spec/dfe/analytics/initialise_spec.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# frozen_string_literal: true

RSpec.describe DfE::Analytics::Initialise do
before do
allow(DfE::Analytics::SendEvents).to receive(:perform_later)
allow(DfE::Analytics).to receive(:enabled?).and_return(true)
end

describe 'trigger_initialise_event ' do
it 'includes the expected attributes' do
described_class.trigger_initialise_event

expect(DfE::Analytics::SendEvents).to have_received(:perform_later)
.with([a_hash_including({
'event_type' => 'analytics_initialise',
'data' => [
{ 'key' => 'analytics_version', 'value' => [DfE::Analytics::VERSION] },
{ 'key' => 'config',
'value' => ['{"anonymise_web_request_user_id":false}'] }
]
})])
end
end
end
8 changes: 8 additions & 0 deletions spec/dfe/analytics_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -176,4 +176,12 @@
end
end
end

it 'triggers the initialisation event' do
allow(DfE::Analytics::Initialise).to receive(:trigger_initialise_event)

DfE::Analytics.initialize!

expect(DfE::Analytics::Initialise).to have_received(:trigger_initialise_event)
end
end