Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to straighten out language and improve dfe:analytics:check #30

Merged
merged 8 commits into from
Jul 25, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .rubocop.yml
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,9 @@ Layout/MultilineMethodCallIndentation:
Style/FrozenStringLiteralComment:
Enabled: false

Metrics/ModuleLength:
Enabled: false

# Offense count: 2
# This cop supports safe auto-correction (--auto-correct).
# Configuration parameters: AllowHeredoc, AllowURI, URISchemes, IgnoreCopDirectives, IgnoredPatterns.
Expand Down
26 changes: 14 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# DfE::Analytics

**👉 Send every web request and model update to BigQuery**
**👉 Send every web request and database update to BigQuery**

**✋ Skip or anonymise fields containing PII**

Expand All @@ -13,9 +13,9 @@ This gem provides an _opinionated integration_ with Google BigQuery.
Once it is set up, every web request and database update (as permitted by
configuration) will flow to BigQuery.

It also provides a Rake task for backfilling BigQuery with models created
It also provides a Rake task for backfilling BigQuery with entities created
before you started sending events (see **Importing existing data** below), and
one for keeping your field configuration up to date.
another for keeping your field configuration up to date.

To set the gem up follow the steps in "Configuration", below.

Expand All @@ -25,11 +25,13 @@ To set the gem up follow the steps in "Configuration", below.

## Names and jargon

A Rails model is an analytics **Entity**.
A Rails model is an analytics **Entity**. All models are entities, but not all
entities are models — for example, an entity could be an association in a
many-to-many join table.

A change to a model (including creation and deletion) is an analytics
**Event**. When a model changes we send the entire new state of the model as
part of the event.
A change to a entity (update, creation or deletion) is an analytics **Event**.
When an entity changes we send the entire new state of the entity as part of
the event.

A web request is also an analytics **Event**.

Expand All @@ -48,7 +50,7 @@ sequenceDiagram
Controller->>Model: Update model
Model->>Analytics: after_update hook
Analytics-->>RequestStore: Retrieve request UUID
Analytics->>ActiveJob: enqueue Event with serialized model state and request UUID
Analytics->>ActiveJob: enqueue Event with serialized entity state and request UUID
Controller->>Analytics: after_action to send request event
Analytics->>ActiveJob: enqueue Event with serialized request and request UUID
Controller->>Client: 200 OK
Expand Down Expand Up @@ -306,8 +308,8 @@ bundle exec rails dfe:analytics:check
```

This will let you know whether there are any fields in your field configuration
which are present in the model but missing from the config, or present in the
config but missing from the model.
which are present in the database but missing from the config, or present in the
config but missing from the database.

**It's recommended to run this task regularly - at least as often as you run
database migrations. Consider enhancing db:migrate to run it automatically.**
Expand Down Expand Up @@ -412,10 +414,10 @@ Run
bundle exec rails dfe:analytics:import_all_entities
```

To reimport just one model, run:
To reimport just one entity, run:

```bash
bundle exec rails dfe:analytics:import_entity[ModelName]
bundle exec rails dfe:analytics:import_entity[entity_name]
```

## Contributing
Expand Down
41 changes: 27 additions & 14 deletions lib/dfe/analytics.rb
Original file line number Diff line number Diff line change
Expand Up @@ -79,15 +79,15 @@ def self.enabled?
end

def self.allowlist
@allowlist ||= Rails.application.config_for(:analytics)
Rails.application.config_for(:analytics)
end

def self.allowlist_pii
@allowlist_pii ||= Rails.application.config_for(:analytics_pii)
Rails.application.config_for(:analytics_pii)
end

def self.blocklist
@blocklist ||= Rails.application.config_for(:analytics_blocklist)
Rails.application.config_for(:analytics_blocklist)
end

def self.environment
Expand All @@ -102,20 +102,16 @@ def self.async?
config.async
end

def self.time_zone
'London'
def self.entities_for_analytics
allowlist.keys & all_entities_in_application
end

def self.models_for_analytics
Rails.application.eager_load!

tables_to_models = ActiveRecord::Base.descendants
.reject(&:abstract_class?)
.to_h { |m| [m.table_name.to_sym, m.name] }
def self.all_entities_in_application
entity_model_mapping.keys.map(&:to_sym)
end

allowlist.map do |table_name, _|
tables_to_models[table_name]
end
def self.model_for_entity(entity)
entity_model_mapping.fetch(entity.to_s)
end

def self.extract_model_attributes(model, attributes = nil)
Expand All @@ -136,5 +132,22 @@ def self.extract_model_attributes(model, attributes = nil)
def self.anonymise(value)
Digest::SHA2.hexdigest(value.to_s)
end

def self.entity_model_mapping
# ActiveRecord::Base.descendants will collect every model in the
# application, including internal models Rails uses to represent
# has_and_belongs_to_many relationships without their own models. We map
# these back to table_names which are equivalent to dfe-analytics
# "entities".
@entity_model_mapping ||= begin
Rails.application.eager_load!

ActiveRecord::Base.descendants
.reject(&:abstract_class?)
.index_by(&:table_name)
end
end

private_class_method :entity_model_mapping
end
end
6 changes: 1 addition & 5 deletions lib/dfe/analytics/event.rb
Original file line number Diff line number Diff line change
Expand Up @@ -114,11 +114,7 @@ def hash_to_kv_pairs(hash)
end

def anonymised_user_agent_and_ip(rack_request)
anonymise(rack_request.user_agent.to_s + rack_request.remote_ip.to_s) if rack_request.remote_ip.present?
end

def anonymise(text)
Digest::SHA2.hexdigest(text)
DfE::Analytics.anonymise(rack_request.user_agent.to_s + rack_request.remote_ip.to_s) if rack_request.remote_ip.present?
end

def ensure_utf8(str)
Expand Down
59 changes: 35 additions & 24 deletions lib/dfe/analytics/fields.rb
Original file line number Diff line number Diff line change
Expand Up @@ -11,41 +11,52 @@ def self.allowlist
DfE::Analytics.allowlist
end

def self.database
DfE::Analytics.all_entities_in_application
.reduce({}) do |list, entity|
attrs = DfE::Analytics.model_for_entity(entity).attribute_names
list[entity] = attrs

list
end
end

def self.generate_blocklist
diff_model_attributes_against(allowlist)[:missing]
diff_between(database, allowlist)
end

def self.unlisted_fields
diff_model_attributes_against(allowlist, blocklist)[:missing]
diff_between(database, allowlist, blocklist)
end

def self.surplus_fields
diff_model_attributes_against(allowlist)[:surplus]
diff_between(allowlist, database)
end

def self.diff_model_attributes_against(*lists)
Rails.application.eager_load!
ActiveRecord::Base.descendants
.reject { |model| model.name.include? 'ActiveRecord' } # ignore internal AR classes
.reduce({ missing: {}, surplus: {} }) do |diff, next_model|
table_name = next_model.table_name&.to_sym

if table_name.present?
attributes_considered = # then combine to get all the attrs we deal with
lists.map do |list|
# for each list of model attrs, look up the attrs for this model
list.fetch(table_name, [])
end.reduce(:concat)
missing_attributes = next_model.attribute_names - attributes_considered
surplus_attributes = attributes_considered - next_model.attribute_names

diff[:missing][table_name] = missing_attributes if missing_attributes.any?
def self.conflicting_fields
diff_between(allowlist, diff_between(allowlist, blocklist))
end

diff[:surplus][table_name] = surplus_attributes if surplus_attributes.any?
end
# extract and concatenate the fields associated with an entity in 1 or
# more entity->field lists
def self.extract_entity_attributes_from_lists(entity, *lists)
lists.map do |list|
list.fetch(entity, [])
end.reduce(:|)
end

diff
end
# returns keys and values present in leftmost list and not present in any
# of the other lists
#
# diff_between({a: [1, 2]}, {a: [2, 3]}) => {a: [1]}
def self.diff_between(primary_list, *lists_to_compare)
primary_list.reduce({}) do |diff, (entity, attrs_in_primary_list)|
attrs_in_lists_to_compare = extract_entity_attributes_from_lists(entity, *lists_to_compare)
differing_attrs = attrs_in_primary_list - attrs_in_lists_to_compare
diff[entity] = differing_attrs if differing_attrs.any?

diff
end
end
end
end
Expand Down
14 changes: 7 additions & 7 deletions lib/dfe/analytics/load_entities.rb
Original file line number Diff line number Diff line change
Expand Up @@ -5,26 +5,26 @@ module Analytics
class LoadEntities
DEFAULT_BATCH_SIZE = 200

def initialize(model_name:, batch_size: DEFAULT_BATCH_SIZE)
@model_name = model_name
@model_class = Object.const_get(model_name)
def initialize(entity_name:, batch_size: DEFAULT_BATCH_SIZE)
@entity_name = entity_name
@batch_size = batch_size.to_i
end

def run
Rails.logger.info("Processing data for #{@model_class.name} with row count #{@model_class.count}")
model = DfE::Analytics.model_for_entity(@entity_name)
Rails.logger.info("Processing data for #{@entity_name} with row count #{model.count}")

batch_number = 0

@model_class.order(:id).in_batches(of: @batch_size) do |relation|
model.order(:id).in_batches(of: @batch_size) do |relation|
batch_number += 1

ids = relation.pluck(:id)

DfE::Analytics::LoadEntityBatch.perform_later(@model_class, ids, batch_number)
DfE::Analytics::LoadEntityBatch.perform_later(model, ids, batch_number)
end

Rails.logger.info "Enqueued #{batch_number} batches of #{@batch_size} #{@model_name} for importing to BigQuery"
Rails.logger.info "Enqueued #{batch_number} batches of #{@batch_size} #{@entity_name} for importing to BigQuery"
end
end
end
Expand Down
17 changes: 16 additions & 1 deletion lib/dfe/analytics/tasks/fields.rake
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ namespace :dfe do
task check: :environment do
surplus_fields = DfE::Analytics::Fields.surplus_fields
unlisted_fields = DfE::Analytics::Fields.unlisted_fields
conflicting_fields = DfE::Analytics::Fields.conflicting_fields

surplus_fields_failure_message = <<~HEREDOC
Database field removed! Please remove it from analytics.yml and then run
Expand All @@ -26,11 +27,25 @@ namespace :dfe do
#{unlisted_fields.to_yaml}
HEREDOC

conflicting_fields_failure_message = <<~HEREDOC
Conflict detected between analytics.yml and analytics_blocklist.yml!

The following fields exist in both files. To remove from the blocklist, run:

bundle exec rails dfe:analytics:regenerate_blocklist

Conflicting fields:

#{conflicting_fields.to_yaml}
HEREDOC

puts unlisted_fields_failure_message if unlisted_fields.any?

puts surplus_fields_failure_message if surplus_fields.any?

raise if surplus_fields.any? || unlisted_fields.any?
puts conflicting_fields_failure_message if conflicting_fields.any?

raise if surplus_fields.any? || unlisted_fields.any? || conflicting_fields.any?
end

desc 'Generate a new field blocklist containing all the fields not listed for sending to Bigquery'
Expand Down
8 changes: 4 additions & 4 deletions lib/dfe/analytics/tasks/import_entities.rake
Original file line number Diff line number Diff line change
Expand Up @@ -2,14 +2,14 @@ namespace :dfe do
namespace :analytics do
desc 'Send Analytics events for the (allowlisted) state of all records in the database'
task :import_all_entities, %i[batch_size] => :environment do |_, args|
models_for_analytics.each do |model_name|
DfE::Analytics::LoadEntities.new(model_name: model_name, **args).run
entities_for_analytics.each do |entity_name|
DfE::Analytics::LoadEntities.new(entity_name: entity_name, **args).run
end
end

desc 'Send Analytics events for the state of all records in a specified model'
task :import_entity, %i[model_name batch_size] => :environment do |_, args|
abort('You need to specify a model name as an argument to the Rake task, eg dfe:analytics:import_entity[Model]') unless args[:model_name]
task :import_entity, %i[entity_name batch_size] => :environment do |_, args|
abort('You need to specify a model name as an argument to the Rake task, eg dfe:analytics:import_entity[Model]') unless args[:entity_name]

DfE::Analytics::LoadEntities.new(args).run
end
Expand Down
23 changes: 23 additions & 0 deletions spec/dfe/analytics/fields_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,29 @@
end
end

describe '.conflicting_fields' do
context 'when fields conflict' do
let(:existing_allowlist) { { candidates: %w[email_address id first_name], institutions: %w[id] } }
let(:existing_blocklist) { { candidates: %w[email_address first_name] } }

it 'returns the conflicting fields' do
conflicts = described_class.conflicting_fields
expect(conflicts.keys).to eq(%i[candidates])
expect(conflicts[:candidates]).to eq(%w[email_address first_name])
end
end

context 'when there are no conflicts' do
let(:existing_allowlist) { { candidates: %w[email_address], institutions: %w[id] } }
let(:existing_blocklist) { { candidates: %w[id] } }

it 'returns nothing' do
conflicts = described_class.conflicting_fields
expect(conflicts).to be_empty
end
end
end

describe '.generate_blocklist' do
it 'returns all the fields in the model that aren’t in the allowlist' do
fields = described_class.generate_blocklist[:candidates]
Expand Down
6 changes: 3 additions & 3 deletions spec/dfe/analytics/load_entities_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -25,10 +25,10 @@
end
end

it 'sends a model’s fields to BQ' do
it 'sends a entity’s fields to BQ' do
Candidate.create(email_address: 'known@address.com')

described_class.new(model_name: 'Candidate').run
described_class.new(entity_name: 'candidates').run

expect(DfE::Analytics::SendEvents).to have_received(:perform_later) do |payload|
schema = DfE::Analytics::EventSchema.new.as_json
Expand All @@ -46,7 +46,7 @@
Candidate.create
Candidate.create

described_class.new(model_name: 'Candidate', batch_size: 2).run
described_class.new(entity_name: 'candidates', batch_size: 2).run

expect(DfE::Analytics::SendEvents).to have_received(:perform_later).once
end
Expand Down
Loading