
Update Similarity Transformer Architecture to compute both embeddings #233

Closed
wants to merge 5 commits

Conversation

activatedgeek
Contributor

@activatedgeek commented Aug 23, 2018

This PR takes steps towards a better architecture by incorporating a conditional graph operation to allow switching between the code embedding network and the string embedding network.
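The conditional graph operation described here can be sketched with `tf.cond`, which builds both branches into the graph but executes only the one selected by a scalar boolean predicate. This is a minimal sketch, not the PR's actual code; the `embed_query`/`embed_code` stand-ins below are hypothetical, whereas the real networks are transformer encoders.

```python
import tensorflow as tf

# Hypothetical stand-ins for the two embedding networks; in the real
# model each branch runs a transformer encoder over the inputs.
def embed_query(features):
    return features * 2.0

def embed_code(features):
    return features + 1.0

def conditional_embed(features, use_code_embedding):
    # tf.cond selects which embedding network runs based on the
    # scalar boolean predicate `use_code_embedding`.
    return tf.cond(
        use_code_embedding,
        true_fn=lambda: embed_code(features),
        false_fn=lambda: embed_query(features),
    )

x = tf.constant([1.0, 2.0])
query_emb = conditional_embed(x, tf.constant(False))  # query branch
code_emb = conditional_embed(x, tf.constant(True))    # code branch
```

The same inputs produce different embeddings depending only on the flag, which is the switching behavior this PR is after.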


if 'targets' in features:
def embed_string():
Contributor

nit: embed_string -> embed_query?

@jlewi
Contributor

jlewi commented Oct 1, 2018

@activatedgeek What's the status of this PR?

Is there any way to write a unittest to make sure that this works as expected?

What if we mock out/override encode? Would that allow us to replace a transformer encoder with something known (maybe just pass the features through?)? So we would know what the expected output is.
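The mocking idea suggested here can be sketched with `unittest.mock`: replace `encode` with a pass-through so the expected output is known exactly. The `SimilarityModel` class below is a hypothetical minimal stand-in, not the real `SimilarityTransformer`, whose `encode()` runs a transformer encoder.

```python
import unittest
from unittest import mock

# Hypothetical minimal stand-in for the similarity model; the real
# encode() runs a transformer encoder over the selected feature.
class SimilarityModel:
    def encode(self, features, input_key):
        raise NotImplementedError("real transformer encoder goes here")

    def infer(self, features):
        # Prediction is the concatenation of the query and code embeddings.
        return self.encode(features, "inputs") + self.encode(features, "targets")

class InferTest(unittest.TestCase):
    def test_infer_concatenates_both_embeddings(self):
        # Mock out encode so it just passes the selected feature through,
        # making the expected output of infer() exactly known.
        with mock.patch.object(SimilarityModel, "encode",
                               side_effect=lambda features, key: features[key]):
            out = SimilarityModel().infer({"inputs": [1, 2], "targets": [3, 4]})
        self.assertEqual(out, [1, 2, 3, 4])
```

With the encoder mocked out, the test pins down only the wiring (which features feed which branch, and how the outputs are combined), independent of the transformer's weights.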

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: gaocegege

If they are not already assigned, you can assign the PR to them by writing /assign @gaocegege in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@activatedgeek
Contributor Author

I've actually been using this branch to test out my variations and particularly to get a solution out for #259.

@@ -73,5 +72,9 @@ def encode(self, features, input_key):

def infer(self, features=None, **kwargs):
del kwargs

if 'targets' not in features:
Contributor

Why do we need to check whether targets is in features?
Is the prediction in inference mode the concatenation of the code and query embeddings?
If it is, do we still need this if statement? The user can control which embeddings are computed by:

  1. setting inputs and targets to zeros or non-zero values, and
  2. choosing which elements of the output vector they look at.

Contributor Author

The eval step during training also uses this function and deliberately does not send in targets. I added this check to make sure that path still works. Certainly not ideal, though.

Contributor Author

It is kind of the client's responsibility to slice and dice the I/O vectors. This part mostly safeguards the eval step of T2T.

Contributor

Can you put that in a comment?

Contributor

I think this is causing problems during inference using TFServing.

When I send predictions to TFServing I'm getting b'{ "error": "You must feed a value for placeholder tensor \'Placeholder\' with dtype int64 and shape [?,?,1,1]\n\t [[{{node Placeholder}} = Placeholder_output_shapes=[[?,?,1,1]], dtype=DT_INT64, shape=[?,?,1,1], _device=\"/job:localhost/replica:0/task:0/device:CPU:0\"]]" }'

@@ -14,16 +14,19 @@ def get_encoder(problem_name, data_dir):
return problem.feature_info["inputs"].encoder


def encode_query(encoder, query_str):
def encode_query(encoder, query_str, embed_code=False):
Contributor

Can we get rid of embed_code now that we are concatenating the two vectors and computing both embeddings?

Contributor Author

This actually leads to a bunch of changes here and even the Dataflow step before this. Shall we defer this change to the final PR after we make sure the model works?

Contributor

Defer which change: the whole PR, or just removing embed_code?

@jlewi
Contributor

jlewi commented Oct 15, 2018

@activatedgeek It looks like this PR is making two changes

  1. Change the loss function to fix [code_search] Fix the loss function #259
  2. Fix model export to allow [code_search] Fix model export for computing code embeddings #260 to compute embeddings

I believe #2 will necessitate changes to other parts of the code because the output of inference will now be a vector
[query_embeddings, code_embeddings]

So would it make sense to get an initial version of this PR merged so we can begin making the other changes to the code?

@jlewi
Contributor

jlewi commented Oct 15, 2018

@activatedgeek And thank you so much for continuing to work on this! Great to see this coming along.

@activatedgeek
Contributor Author

@jlewi I think we can go ahead and merge this.

I'm happy to be working on this, albeit slower than I'd like.

@jlewi changed the title [WIP] Update Similarity Transformer Architecture → Update Similarity Transformer Architecture to compute both embeddings Oct 16, 2018
@jlewi
Contributor

jlewi commented Oct 22, 2018

Regarding your earlier comment

This actually leads to a bunch of changes here and even the Dataflow step before this. Shall we defer this change to the final PR after we make sure the model works?

Was that referring to getting rid of embed_code, or to this PR as a whole?

I assume once we merge this PR a lot of the Dataflow code will break and need to be updated?

Any suggestions about how we can go about fixing the code?

Would it be possible to start writing unittests to verify the various Dataflow transforms are working with the new model? Perhaps we could support that by adding a test utility function to emit dummy models for the new model architecture?

features = {
"inputs": tf.train.Feature(int64_list=tf.train.Int64List(value=encoded_str)),
"targets": tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
"embed_code": tf.constant(embed_code, dtype=tf.bool)
Contributor

This looks like a bug. I think a Feature proto can only hold Int64List, BytesList, and FloatList values.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto

Should "inputs" be the concatenation of the encoded query and the encoded targets now that we are treating output in prediction as concatenation of the two vectors?
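A corrected version of the snippet under review could store the flag as a 0/1 int64 feature instead of a `tf.constant`, since a `Feature` proto only carries `BytesList`, `FloatList`, or `Int64List`. This is a hedged sketch; `encoded_str` here is a made-up token sequence, whereas the real values come from the problem's text encoder.

```python
import tensorflow as tf

# Hypothetical encoded query; in the real pipeline this comes from the
# problem's text encoder.
encoded_str = [5, 17, 3, 1]
embed_code = True

# A Feature proto can only carry BytesList, FloatList, or Int64List,
# so the boolean flag is stored as a 0/1 int64 rather than a tf.constant.
features = {
    "inputs": tf.train.Feature(int64_list=tf.train.Int64List(value=encoded_str)),
    "targets": tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    "embed_code": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[int(embed_code)])),
}
example = tf.train.Example(features=tf.train.Features(feature=features))
serialized = example.SerializeToString()
```

The serialized `Example` round-trips cleanly, which the original `tf.constant` value would not.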

jlewi pushed a commit to jlewi/examples that referenced this pull request Nov 2, 2018
Fix Model export to support computing code embeddings: Fix kubeflow#260

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code and computes
  the embeddings appropriately.

* Originally based on kubeflow#233 by @activatedgeek

Loss function improvements

* See kubeflow#259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in kubeflow#233
  and this pulls in some of those changes.

Add manual tests

* Related to kubeflow#258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.
jlewi added a commit to jlewi/examples that referenced this pull request Nov 2, 2018
@jlewi
Contributor

jlewi commented Nov 2, 2018

I think we can close this in favor of #291.

I originally based #291 on this PR, but I think I fixed the issues with using the conditionals.

  1. The inputs will always be provided by the "inputs" feature; we will just use a different embedding variable scope.
  2. We need to reduce the feature used as the predicate to a rank-0 tensor (a scalar). I think this may have caused the problems during eval.
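The predicate reduction in point 2 can be sketched as follows, assuming the flag arrives as a batched int64 feature of shape [batch, 1] (a minimal sketch, not the code in #291):

```python
import tensorflow as tf

# A batched "embed_code" feature arrives with a shape like [batch, 1],
# but tf.cond requires a rank-0 (scalar) boolean predicate.
embed_code_feature = tf.constant([[1], [1]], dtype=tf.int64)

# Reduce to a single scalar: nonzero anywhere means "embed code".
predicate = tf.reduce_any(tf.not_equal(embed_code_feature, 0))

result = tf.cond(predicate,
                 lambda: tf.constant("code"),
                 lambda: tf.constant("query"))
```

Without the reduction, passing the rank-2 feature directly as the predicate fails, which is consistent with the conditional-op problems described during eval.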

@jlewi
Contributor

jlewi commented Nov 2, 2018

Actually, in #291, per @cwbeitel's suggestion, I reverted the loss function changes; better to do them in a separate PR after experimenting with them.

k8s-ci-robot pushed a commit that referenced this pull request Nov 2, 2018
* Fix model export, loss function, and add some manual tests.


* * Skip the test when running under prow because it's a manual test.
* Fix some lint errors.

* * Fix lint and skip tests.

* Fix lint.

* * Fix lint
* Revert loss function changes; we can do that in a follow on PR.

* * Run generate_data as part of the test rather than reusing a cached
  vocab and processed input file.

* Modify SimilarityTransformer so we can overwrite the number of shards
  used easily to facilitate testing.

* Comment out py-test for now.
@activatedgeek
Contributor Author

I think this PR has diverged enough to be closed and all the newer code is more relevant anyways. Let me close this.

yixinshi pushed a commit to yixinshi/examples that referenced this pull request Nov 30, 2018
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this pull request Dec 6, 2018
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this pull request Apr 1, 2019