Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix __typename prevent subscription deduplication #652

Conversation

darrenclark
Copy link
Contributor

This fixes an issue where identical subscription GraphQL queries would not be deduplicated, i.e. would end up with different doc IDs here:

hash = :crypto.hash(:sha256, :erlang.term_to_binary(blueprint)) |> Base.encode16()
doc_id = "__absinthe__:doc:#{hash}"

The anonymous function used when resolving __typename ends being "different" when run on separate processes, preventing identical subscription queries that use __typename from being deduplicated (when run from separate processes)


The root cause of the issue:

defmodule Example do
  def print(msg), do: IO.puts msg

  def make_print_func(), do: &Example.print/1

  def make_anon_print_func(), do: &print/1
end

a = Example.make_print_func()
b = Task.await(Task.async(&Example.make_print_func/0))

IO.puts "a == b?  #{a == b}"
IO.puts ":erlang.term_to_binary(a) == :erlang.term_to_binary(b)?  #{:erlang.term_to_binary(a) == :erlang.term_to_binary(b)}"

anon_a = Example.make_anon_print_func()
anon_b = Task.await(Task.async(&Example.make_anon_print_func/0))

IO.puts "anon_a == anon_b?  #{anon_a == anon_b}"
IO.puts ":erlang.term_to_binary(anon_a) == :erlang.term_to_binary(anon_b)?  #{:erlang.term_to_binary(anon_a) == :erlang.term_to_binary(anon_b)}"

ends up producing:

a == b?  true
:erlang.term_to_binary(a) == :erlang.term_to_binary(b)?  true
anon_a == anon_b?  true
:erlang.term_to_binary(anon_a) == :erlang.term_to_binary(anon_b)?  false

I noticed this issue also happens with Dataloader / Dataloader.Ecto because of these anonymous functions:

https://github.com/absinthe-graphql/dataloader/blob/a61f550a28820d2befcd94411f7b03163f30db30/lib/dataloader/ecto.ex#L190-L191

(was able to workaround that on my end though, as I just provided non-anonymous functions for query and run_batch in my code)


@benwilson512 How do you feel about:

  • Replacing :crypto.hash(:sha256, :erlang.term_to_binary(blueprint)) |> Base.encode16()? (maybe to ignore anonymous functions, or simply use the query's string representation, not sure)

and/or

  • Allowing projects to generate their own doc IDs? I believe the project I work on could do some optimizations specific to our schema

I'd be happy to take on both of those tasks, just wanted to get your thoughts before I started on them

The anonymous function ends being "different" when run on separate
processes, preventing identical subscription queries that use
__typename from being deduplicated (when run from separate processes)
@benwilson512
Copy link
Contributor

Hey @darrenclark thanks for taking a look at this!

In Absinthe 1.5 what I want to do is focus on letting people generate their own doc ids. The current system has a couple of flaws, one of which you've found.

My thought is that we'd probably eliminate "by default" deduplication. If you want to have things deduplicated then you need to provide some kind of identifier. Identifiers really have two parts:

  1. The document itself. A sha256 of the original query string is probably OK here and I'm not really sure it needs to be configurable. I want to avoid giving people obvious foot guns.
  2. The context. This is the part we would require from people. Right now I do the term to binary and sha but depending on the circumstance this is often too narrow. It's too narrow in the sense that term equality can fail at times when there's still a logical equality (two different loads of the user out of the database may not have term equality even though they're the same user). We also want to allow people to have control over deduplication. IE perhaps the context id can just be "logged-in" for any valid user instead of always using the current user id. This provides substantial performance improvements.

A PR that works on these two things would be incredible.

@darrenclark
Copy link
Contributor Author

@benwilson512 Perfect, we can definitely look into this. It should open up some significant optimization opportunities for us.

Any thoughts on the interface into this? Something similar to the document adapter interface (https://hexdocs.pm/absinthe/adapters.html)? Absinthe.Subscription.DocumentIdentification or something?

With that in mind, should I close this PR? Given the plans for 1.5, I'm not sure it adds anything of value.

@benwilson512
Copy link
Contributor

benwilson512 commented Dec 14, 2018

@darrenclark yeah I think we should drop this particular PR in favor of the above approach.

for the current user style id I was thinking it could be as simple as adding a key to the return value of the subscription config function IE:

config args, %{context: %{current_user: %{id: id}} ->
  {:ok, topic: args.whatever, id: id}
end

Or for a version that treated any logged in user the same

config args, %{context: %{current_user: _}} ->
  {:ok, topic: args.whatever, id: "authenticated"}
end

Then the full subscription id would be "authenticated:#{sha256(variables_json <> query_string)}". Do we need to make the document id be configurable? Are there situations where two documents that don't have the same query string sha should be treated as identical?

@darrenclark
Copy link
Contributor Author

darrenclark commented Dec 14, 2018

I have a contrived example of what I was hoping to support:

subscription PriceChange(productIds: [ID!]!) {
  priceDidChange(productIds: $ids) {
    product {
      __typename
      id
      name
      price
    }
  }
}

producing a stream of JSON objects something like:

{
  "event": {
     "__typename": "Product",
     "id": "Product:3242",
     "price": 10.5
  }
}

{
  "event": {
     "__typename": "Product",
     "id": "Product:12",
     "price": 11.25
  }
}

{
  "event": {
     "__typename": "Product",
     "id": "Product:924",
     "price": 5.05
  }
}

I was hoping that we'd be able to give all these documents the same ID:

# only whitespace differs
subscription PriceChange(productIds: [ID!]!) {
  priceDidChange(productIds: $productIds) { product { __typename id name price } }
}

# same fields, different order
subscription PriceChange(productIds: [ID!]!) {
  priceDidChange(productIds: $productIds) { product { id __typename price name } }
}

# different operation name / variables
subscription PriceDidChangeSubscription(ids: [ID!]!) {
  priceDidChange(productIds: $ids) { product { __typename id name price } }
}

# fragments
subscription PriceChange(productIds: [ID!]!) {
  priceDidChange(productIds: $ids) { product { ...ProductInfo } }
}
fragment ProductInfo on Product { id __typename price name }

It'd also be nice for us to be able to ignore the arguments for the subscription priceDidChange field (though maybe not all subscription fields in our schema), and rely on the provided topic:s to ensure each client got the data updates they were interested in.

Any thoughts on:

  • no id: provided, uses existing functionality
  • id: "authenticated" provided, does "authenticated:#{sha256(variables_json <> query_string)}"
  • when value provided to id: is a function, simply calls it to get the document ID

@darrenclark
Copy link
Contributor Author

Or rather, it would be nice to have the option to have full control over the IDs being generated, so that we could account for those situations.

I can only really see us accounting for those situations if we see a lot of duplication due to minor changes like those.

@benwilson512
Copy link
Contributor

This gets a little tricky. Notably, these all have completely different blueprint representations, so the current functionality today treats these as all different. And if you have a serializer that respects the order of the fields as queried then they really ARE different. The results of one are not substitutable as the results of the other.

However, Absinthe can sort of punt on these questions I suppose if we also allow a document_id: option that lets the user do whatever they wish:

config args, %{context: %{current_user: _}, document: document} ->
  {:ok, topic: args.whatever, id: "authenticated", document_id: compute_id(document)}
end

Notably, document: isn't actually a key in the resolution struct yet but it pretty easily could be. You'd just pass in the root blueprint struct, nice and simple.

@darrenclark
Copy link
Contributor Author

And if you have a serializer that respects the order of the fields as queried then they really ARE different

Ah yes, good to know (I had no idea they mentioned it in the spec hah)

However, Absinthe can sort of punt on these questions I suppose if we also allow a document_id: option that lets the user do whatever they wish:

The final ID would end up being "#{id}:#{document_id}" then? Users would provide id: and document_id: options?

This would work well for our use case

@benwilson512
Copy link
Contributor

benwilson512 commented Dec 14, 2018

The final ID would end up being "#{id}:#{document_id}" then? Users would provide id: and document_id: options?

Yes, which we'll refer to as the subscription_id. It may make sense to think of the subscription id as the composite of a context_id and a document_id. Having id simpliciter may be confusing.

@benwilson512
Copy link
Contributor

If no document_id is provided Absinthe will just hash the query string + variables. If no context_id is provided we could probably get away with :erlang.unique_integer.

@darrenclark
Copy link
Contributor Author

Cool, sounds good!

@darrenclark
Copy link
Contributor Author

Closing as per #652 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants