Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to straighten out language and improve dfe:analytics:check #30

Merged
merged 8 commits into from
Jul 25, 2022
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 14 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# DfE::Analytics

**👉 Send every web request and model update to BigQuery**
**👉 Send every web request and database update to BigQuery**

**✋ Skip or anonymise fields containing PII**

Expand All @@ -13,9 +13,9 @@ This gem provides an _opinionated integration_ with Google BigQuery.
Once it is set up, every web request and database update (as permitted by
configuration) will flow to BigQuery.

It also provides a Rake task for backfilling BigQuery with models created
It also provides a Rake task for backfilling BigQuery with entities created
before you started sending events (see **Importing existing data** below), and
one for keeping your field configuration up to date.
another for keeping your field configuration up to date.

To set the gem up follow the steps in "Configuration", below.

Expand All @@ -25,11 +25,13 @@ To set the gem up follow the steps in "Configuration", below.

## Names and jargon

A Rails model is an analytics **Entity**.
A Rails model is an analytics **Entity**. All models are entities, but not all
entities are models — for example, an entity could be an association in a
many-to-many join table.

A change to a model (including creation and deletion) is an analytics
**Event**. When a model changes we send the entire new state of the model as
part of the event.
A change to a entity (update, creation or deletion) is an analytics **Event**.
When an entity changes we send the entire new state of the entity as part of
the event.

A web request is also an analytics **Event**.

Expand All @@ -48,7 +50,7 @@ sequenceDiagram
Controller->>Model: Update model
Model->>Analytics: after_update hook
Analytics-->>RequestStore: Retrieve request UUID
Analytics->>ActiveJob: enqueue Event with serialized model state and request UUID
Analytics->>ActiveJob: enqueue Event with serialized entity state and request UUID
Controller->>Analytics: after_action to send request event
Analytics->>ActiveJob: enqueue Event with serialized request and request UUID
Controller->>Client: 200 OK
Expand Down Expand Up @@ -306,8 +308,8 @@ bundle exec rails dfe:analytics:check
```

This will let you know whether there are any fields in your field configuration
which are present in the model but missing from the config, or present in the
config but missing from the model.
which are present in the database but missing from the config, or present in the
config but missing from the database.

**It's recommended to run this task regularly - at least as often as you run
database migrations. Consider enhancing db:migrate to run it automatically.**
Expand Down Expand Up @@ -412,10 +414,10 @@ Run
bundle exec rails dfe:analytics:import_all_entities
```

To reimport just one model, run:
To reimport just one entity, run:

```bash
bundle exec rails dfe:analytics:import_entity[ModelName]
bundle exec rails dfe:analytics:import_entity[entity_name]
```

## Contributing
Expand Down
35 changes: 24 additions & 11 deletions lib/dfe/analytics.rb
Original file line number Diff line number Diff line change
Expand Up @@ -102,20 +102,16 @@ def self.async?
config.async
end

def self.time_zone
'London'
def self.entities_for_analytics
allowlist.keys & all_entities_in_application
end

def self.models_for_analytics
Rails.application.eager_load!

tables_to_models = ActiveRecord::Base.descendants
.reject(&:abstract_class?)
.to_h { |m| [m.table_name.to_sym, m.name] }
def self.all_entities_in_application
entity_model_mapping.keys.map(&:to_sym)
end

allowlist.map do |table_name, _|
tables_to_models[table_name]
end
def self.model_for_entity(entity)
entity_model_mapping.fetch(entity.to_s)
end

def self.extract_model_attributes(model, attributes = nil)
Expand All @@ -136,5 +132,22 @@ def self.extract_model_attributes(model, attributes = nil)
def self.anonymise(value)
Digest::SHA2.hexdigest(value.to_s)
end

def self.entity_model_mapping
# ActiveRecord::Base.descendants will collect every model in the
# application, including internal models Rails uses to represent
# has_and_belongs_to_many relationships without their own models. We map
# these back to table_names which are equivalent to dfe-analytics
# "entities".
@entity_model_mapping ||= begin
Rails.application.eager_load!

ActiveRecord::Base.descendants
.reject(&:abstract_class?)
.index_by(&:table_name)
end
end

private_class_method :entity_model_mapping
end
end
6 changes: 1 addition & 5 deletions lib/dfe/analytics/event.rb
Original file line number Diff line number Diff line change
Expand Up @@ -114,11 +114,7 @@ def hash_to_kv_pairs(hash)
end

def anonymised_user_agent_and_ip(rack_request)
anonymise(rack_request.user_agent.to_s + rack_request.remote_ip.to_s) if rack_request.remote_ip.present?
end

def anonymise(text)
Digest::SHA2.hexdigest(text)
DfE::Analytics.anonymise(rack_request.user_agent.to_s + rack_request.remote_ip.to_s) if rack_request.remote_ip.present?
end

def ensure_utf8(str)
Expand Down
34 changes: 15 additions & 19 deletions lib/dfe/analytics/fields.rb
Original file line number Diff line number Diff line change
Expand Up @@ -24,25 +24,21 @@ def self.surplus_fields
end

def self.diff_model_attributes_against(*lists)
Rails.application.eager_load!
ActiveRecord::Base.descendants
.reject { |model| model.name.include? 'ActiveRecord' } # ignore internal AR classes
.reduce({ missing: {}, surplus: {} }) do |diff, next_model|
table_name = next_model.table_name&.to_sym

if table_name.present?
attributes_considered = # then combine to get all the attrs we deal with
lists.map do |list|
# for each list of model attrs, look up the attrs for this model
list.fetch(table_name, [])
end.reduce(:concat)
missing_attributes = next_model.attribute_names - attributes_considered
surplus_attributes = attributes_considered - next_model.attribute_names

diff[:missing][table_name] = missing_attributes if missing_attributes.any?

diff[:surplus][table_name] = surplus_attributes if surplus_attributes.any?
end
DfE::Analytics.all_entities_in_application
.reduce({ missing: {}, surplus: {} }) do |diff, entity|
attributes_considered = lists.map do |list|
# for each list of model attrs, look up the attrs for this model
list.fetch(entity, [])
end.reduce(:concat)

model = DfE::Analytics.model_for_entity(entity)

missing_attributes = model.attribute_names - attributes_considered
surplus_attributes = attributes_considered - model.attribute_names

diff[:missing][entity] = missing_attributes if missing_attributes.any?

diff[:surplus][entity] = surplus_attributes if surplus_attributes.any?
Copy link
Contributor

@misaka misaka Jul 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this bit a bit hard to digest. I think if you're going to be refactoring this bit, it would make sense to try to make it easier to comprehend, if possible. Maybe something like this (NB: untested code):

      def self.entity_attributes_from_lists(entity, lists)
        lists.map do |list|
          # for each list of model attrs, look up the attrs for this model
          list.fetch(entity, [])
        end.reduce(:concat)
      end

      def self.entities_with_missing_attributes_from_lists(*lists)
        DfE::Analytics.all_entities_in_application
          .reduce({}) do |entities_missing_attributes, entity|
            attributes_considered = entity_attributes_from_lists(entity, lists)
            model = DfE::Analytics.model_for_entity(entity)

            entities_missing_attributes[entity] = model.attribute_names - attributes_considered
            entities_missing_attributes
          end.reject { |_entity, attributes| attributes.empty? }
      end

      def self.entities_with_surplus_attributes_from_lists(*lists)
        DfE::Analytics.all_entities_in_application
          .reduce({}) do |entities_surplus_attributes, entity|
            attributes_considered = entity_attributes_from_lists(entity, lists)
            model = DfE::Analytics.model_for_entity(entity)

            entities_surplus_attributes[entity] =  attributes_considered - model.attribute_names
            entities_surplus_attributes
          end.reject { |_entity, attributes| attributes.empty? }
      end


diff
end
Expand Down
14 changes: 7 additions & 7 deletions lib/dfe/analytics/load_entities.rb
Original file line number Diff line number Diff line change
Expand Up @@ -5,26 +5,26 @@ module Analytics
class LoadEntities
DEFAULT_BATCH_SIZE = 200

def initialize(model_name:, batch_size: DEFAULT_BATCH_SIZE)
@model_name = model_name
@model_class = Object.const_get(model_name)
def initialize(entity_name:, batch_size: DEFAULT_BATCH_SIZE)
@entity_name = entity_name
@batch_size = batch_size.to_i
end

def run
Rails.logger.info("Processing data for #{@model_class.name} with row count #{@model_class.count}")
model = DfE::Analytics.model_for_entity(@entity_name)
Rails.logger.info("Processing data for #{@entity_name} with row count #{model.count}")

batch_number = 0

@model_class.order(:id).in_batches(of: @batch_size) do |relation|
model.order(:id).in_batches(of: @batch_size) do |relation|
batch_number += 1

ids = relation.pluck(:id)

DfE::Analytics::LoadEntityBatch.perform_later(@model_class, ids, batch_number)
DfE::Analytics::LoadEntityBatch.perform_later(model, ids, batch_number)
end

Rails.logger.info "Enqueued #{batch_number} batches of #{@batch_size} #{@model_name} for importing to BigQuery"
Rails.logger.info "Enqueued #{batch_number} batches of #{@batch_size} #{@entity_name} for importing to BigQuery"
end
end
end
Expand Down
8 changes: 4 additions & 4 deletions lib/dfe/analytics/tasks/import_entities.rake
Original file line number Diff line number Diff line change
Expand Up @@ -2,14 +2,14 @@ namespace :dfe do
namespace :analytics do
desc 'Send Analytics events for the (allowlisted) state of all records in the database'
task :import_all_entities, %i[batch_size] => :environment do |_, args|
models_for_analytics.each do |model_name|
DfE::Analytics::LoadEntities.new(model_name: model_name, **args).run
entities_for_analytics.each do |entity_name|
DfE::Analytics::LoadEntities.new(entity_name: entity_name, **args).run
end
end

desc 'Send Analytics events for the state of all records in a specified model'
task :import_entity, %i[model_name batch_size] => :environment do |_, args|
abort('You need to specify a model name as an argument to the Rake task, eg dfe:analytics:import_entity[Model]') unless args[:model_name]
task :import_entity, %i[entity_name batch_size] => :environment do |_, args|
abort('You need to specify a model name as an argument to the Rake task, eg dfe:analytics:import_entity[Model]') unless args[:entity_name]

DfE::Analytics::LoadEntities.new(args).run
end
Expand Down
6 changes: 3 additions & 3 deletions spec/dfe/analytics/load_entities_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -25,10 +25,10 @@
end
end

it 'sends a model’s fields to BQ' do
it 'sends a entity’s fields to BQ' do
Candidate.create(email_address: 'known@address.com')

described_class.new(model_name: 'Candidate').run
described_class.new(entity_name: 'candidates').run

expect(DfE::Analytics::SendEvents).to have_received(:perform_later) do |payload|
schema = DfE::Analytics::EventSchema.new.as_json
Expand All @@ -46,7 +46,7 @@
Candidate.create
Candidate.create

described_class.new(model_name: 'Candidate', batch_size: 2).run
described_class.new(entity_name: 'candidates', batch_size: 2).run

expect(DfE::Analytics::SendEvents).to have_received(:perform_later).once
end
Expand Down
6 changes: 3 additions & 3 deletions spec/dfe/analytics_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -14,16 +14,16 @@
end
end

describe '#models_for_analytics' do
describe '#entities_for_analytics' do
before do
allow(DfE::Analytics).to receive(:allowlist).and_return({
candidates: %i[id],
institutions: %i[id] # table name for the School model, which doesn’t follow convention
})
end

it 'returns the Rails models in the allowlist' do
expect(DfE::Analytics.models_for_analytics).to eq %w[Candidate School]
it 'returns the entities in the allowlist' do
expect(DfE::Analytics.entities_for_analytics).to eq %i[candidates institutions]
end
end
end