
Update Similarity Transformer Architecture to compute both embeddings #233

Closed
wants to merge 5 commits

Conversation

activatedgeek
Contributor

@activatedgeek commented Aug 23, 2018

This PR takes steps towards a better architecture by incorporating a conditional graph operation to allow switching between the code embedding network and the string embedding network.
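The conditional graph operation described here can be sketched with `tf.cond`, which builds both branches into the graph but executes only the one selected by a scalar boolean predicate. This is a minimal sketch, not the PR's actual code; the `embed_query`/`embed_code` stand-ins below are hypothetical, whereas the real networks are transformer encoders.

```python
import tensorflow as tf

# Hypothetical stand-ins for the two embedding networks; in the real
# model each branch runs a transformer encoder over the inputs.
def embed_query(features):
    return features * 2.0

def embed_code(features):
    return features + 1.0

def conditional_embed(features, use_code_embedding):
    # tf.cond selects which embedding network runs based on the
    # scalar boolean predicate `use_code_embedding`.
    return tf.cond(
        use_code_embedding,
        true_fn=lambda: embed_code(features),
        false_fn=lambda: embed_query(features),
    )

x = tf.constant([1.0, 2.0])
query_emb = conditional_embed(x, tf.constant(False))  # query branch
code_emb = conditional_embed(x, tf.constant(True))    # code branch
```

The same inputs produce different embeddings depending only on the flag, which is the switching behavior this PR is after.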


if 'targets' in features:
def embed_string():
Contributor

nit: embed_string -> embed_query?

@jlewi
Contributor

jlewi commented Oct 1, 2018

@activatedgeek What's the status of this PR?

Is there any way to write a unittest to make sure that this works as expected?

What if we mock out/override encode? Would that allow us to replace a transformer encoder with something known (maybe just pass the features through?)? So we would know what the expected output is.
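The mocking idea suggested here can be sketched with `unittest.mock`: replace `encode` with a pass-through so the expected output is known exactly. The `SimilarityModel` class below is a hypothetical minimal stand-in, not the real `SimilarityTransformer`, whose `encode()` runs a transformer encoder.

```python
import unittest
from unittest import mock

# Hypothetical minimal stand-in for the similarity model; the real
# encode() runs a transformer encoder over the selected feature.
class SimilarityModel:
    def encode(self, features, input_key):
        raise NotImplementedError("real transformer encoder goes here")

    def infer(self, features):
        # Prediction is the concatenation of the query and code embeddings.
        return self.encode(features, "inputs") + self.encode(features, "targets")

class InferTest(unittest.TestCase):
    def test_infer_concatenates_both_embeddings(self):
        # Mock out encode so it just passes the selected feature through,
        # making the expected output of infer() exactly known.
        with mock.patch.object(SimilarityModel, "encode",
                               side_effect=lambda features, key: features[key]):
            out = SimilarityModel().infer({"inputs": [1, 2], "targets": [3, 4]})
        self.assertEqual(out, [1, 2, 3, 4])
```

With the encoder mocked out, the test pins down only the wiring (which features feed which branch, and how the outputs are combined), independent of the transformer's weights.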

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: gaocegege

If they are not already assigned, you can assign the PR to them by writing /assign @gaocegege in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@activatedgeek
Contributor Author

I've actually been using this branch to test out my variations and particularly to get a solution out for #259.

@@ -73,5 +72,9 @@ def encode(self, features, input_key):

def infer(self, features=None, **kwargs):
del kwargs

if 'targets' not in features:
Contributor

Why do we need to check whether targets is in features?
Is the prediction in inference mode the concatenation of the code and query embeddings?
If it is, do we still need this if statement? The user can control which embeddings are computed by:

  1. setting inputs and targets to zeros or non-zero values, and
  2. choosing which elements of the output vector they look at.

Contributor Author

The eval step during training also uses this function and deliberately does not send in targets. I added this check to make sure that path still works. Certainly not ideal, though.

Contributor Author

It is kind of the client's responsibility to slice and dice the I/O vectors. This part mostly safeguards the eval step of T2T.

Contributor

Can you put that in a comment?

Contributor

I think this is causing problems during inference using TFServing.

When I send predictions to TFServing I'm getting b'{ "error": "You must feed a value for placeholder tensor \'Placeholder\' with dtype int64 and shape [?,?,1,1]\n\t [[{{node Placeholder}} = Placeholder_output_shapes=[[?,?,1,1]], dtype=DT_INT64, shape=[?,?,1,1], _device=\"/job:localhost/replica:0/task:0/device:CPU:0\"]]" }'

@@ -14,16 +14,19 @@ def get_encoder(problem_name, data_dir):
return problem.feature_info["inputs"].encoder


def encode_query(encoder, query_str):
def encode_query(encoder, query_str, embed_code=False):
Contributor

Can we get rid of embed_code now that we are concatenating the two vectors and computing both embeddings?

Contributor Author

This actually leads to a bunch of changes here and even the Dataflow step before this. Shall we defer this change to the final PR after we make sure the model works?

Contributor

Defer which change: the whole PR, or just removing embed_code?

@jlewi
Contributor

jlewi commented Oct 15, 2018

@activatedgeek It looks like this PR is making two changes

  1. Change the loss function to fix [code_search] Fix the loss function #259
  2. Fix model export to allow [code_search] Fix model export for computing code embeddings #260 to compute embeddings

I believe #2 will necessitate changes to other parts of the code because the output of inference will now be a vector
[query_embeddings, code_embeddings]

So would it make sense to get an initial version of this PR merged so we can begin making the other changes to the code?

@jlewi
Contributor

jlewi commented Oct 15, 2018

@activatedgeek And thank you so much for continuing to work on this! Great to see this coming along.

@activatedgeek
Contributor Author

@jlewi I think we can go ahead and merge this.

I'm happy to be working on this, albeit slower than I'd like.

@jlewi changed the title [WIP] Update Similarity Transformer Architecture → Update Similarity Transformer Architecture to compute both embeddings Oct 16, 2018
@jlewi
Contributor

jlewi commented Oct 22, 2018

Regarding your earlier comment

This actually leads to a bunch of changes here and even the Dataflow step before this. Shall we defer this change to the final PR after we make sure the model works?

Was that referring to getting rid of embed_code, or to this PR as a whole?

I assume once we merge this PR a lot of the Dataflow code will break and need to be updated?

Any suggestions about how we can go about fixing the code?

Would it be possible to start writing unittests to verify the various Dataflow transforms are working with the new model? Perhaps we could support that by adding a test utility function to emit dummy models for the new model architecture?

features = {
"inputs": tf.train.Feature(int64_list=tf.train.Int64List(value=encoded_str)),
"targets": tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
"embed_code": tf.constant(embed_code, dtype=tf.bool)
Contributor

This looks like a bug. I think a Feature proto can only hold Int64List, BytesList, and FloatList values.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto

Should "inputs" be the concatenation of the encoded query and the encoded targets now that we are treating output in prediction as concatenation of the two vectors?
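A corrected version of the snippet under review could store the flag as a 0/1 int64 feature instead of a `tf.constant`, since a `Feature` proto only carries `BytesList`, `FloatList`, or `Int64List`. This is a hedged sketch; `encoded_str` here is a made-up token sequence, whereas the real values come from the problem's text encoder.

```python
import tensorflow as tf

# Hypothetical encoded query; in the real pipeline this comes from the
# problem's text encoder.
encoded_str = [5, 17, 3, 1]
embed_code = True

# A Feature proto can only carry BytesList, FloatList, or Int64List,
# so the boolean flag is stored as a 0/1 int64 rather than a tf.constant.
features = {
    "inputs": tf.train.Feature(int64_list=tf.train.Int64List(value=encoded_str)),
    "targets": tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    "embed_code": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[int(embed_code)])),
}
example = tf.train.Example(features=tf.train.Features(feature=features))
serialized = example.SerializeToString()
```

The serialized `Example` round-trips cleanly, which the original `tf.constant` value would not.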

jlewi pushed a commit to jlewi/examples that referenced this pull request Nov 2, 2018
Fix Model export to support computing code embeddings: Fix kubeflow#260

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code and computes
  the embeddings appropriately.

* Originally based on kubeflow#233 by @activatedgeek

Loss function improvements

* See kubeflow#259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in kubeflow#233
  and this pulls in some of those changes.

Add manual tests

* Related to kubeflow#258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.
jlewi added a commit to jlewi/examples that referenced this pull request Nov 2, 2018
@jlewi
Contributor

jlewi commented Nov 2, 2018

I think we can close this in favor of #291.

I originally based #291 on this PR, but I think I fixed the issues with using the conditionals.

  1. The inputs will always be provided by the "inputs" feature; we will just use a different embedding variable scope.
  2. We need to reduce the feature used as the predicate to a rank-0 tensor (a scalar). I think this may have caused the problems during eval.
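The predicate reduction in point 2 can be sketched as follows, assuming the flag arrives as a batched int64 feature of shape [batch, 1] (a minimal sketch, not the code in #291):

```python
import tensorflow as tf

# A batched "embed_code" feature arrives with a shape like [batch, 1],
# but tf.cond requires a rank-0 (scalar) boolean predicate.
embed_code_feature = tf.constant([[1], [1]], dtype=tf.int64)

# Reduce to a single scalar: nonzero anywhere means "embed code".
predicate = tf.reduce_any(tf.not_equal(embed_code_feature, 0))

result = tf.cond(predicate,
                 lambda: tf.constant("code"),
                 lambda: tf.constant("query"))
```

Without the reduction, passing the rank-2 feature directly as the predicate fails, which is consistent with the conditional-op problems described during eval.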

@jlewi
Contributor

jlewi commented Nov 2, 2018

Actually, in #291, per @cwbeitel's suggestion, I reverted the loss function changes; better to do them in a separate PR after experimenting with them.

k8s-ci-robot pushed a commit that referenced this pull request Nov 2, 2018
* Fix model export, loss function, and add some manual tests.


* * Skip the test when running under prow because it's a manual test.
* Fix some lint errors.

* * Fix lint and skip tests.

* Fix lint.

* * Fix lint
* Revert loss function changes; we can do that in a follow on PR.

* * Run generate_data as part of the test rather than reusing a cached
  vocab and processed input file.

* Modify SimilarityTransformer so we can overwrite the number of shards
  used easily to facilitate testing.

* Comment out py-test for now.
@activatedgeek
Contributor Author

I think this PR has diverged enough to be closed and all the newer code is more relevant anyways. Let me close this.

yixinshi pushed a commit to yixinshi/examples that referenced this pull request Nov 30, 2018
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this pull request Dec 6, 2018
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this pull request Apr 1, 2019