
[code_search] Unittest for similarity and loss computation using TF.Eager #258

Closed
jlewi opened this issue Sep 30, 2018 · 2 comments
jlewi commented Sep 30, 2018

Here:

string_embedding_norm = tf.nn.l2_normalize(string_embedding, axis=1)

We construct a simple network to compute the loss for the query and code embeddings.

We can pull that code out into a class method and then write a unit test for it using TF Eager.
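The step being discussed (l2-normalize both embeddings, then score them) can be checked without a full graph. Below is a NumPy stand-in that mirrors the semantics of `tf.nn.l2_normalize(x, axis=1)` followed by a row-wise dot product; the function names and test values are hypothetical, not the actual model code, but they show the kind of assertion a TF Eager unit test would make:

```python
import numpy as np

def l2_normalize(x, axis=1, epsilon=1e-12):
    # Mirrors tf.nn.l2_normalize: x / sqrt(max(sum(x**2), epsilon)) along axis.
    norm = np.sqrt(np.maximum(np.sum(np.square(x), axis=axis, keepdims=True), epsilon))
    return x / norm

def cosine_similarity(query_embedding, code_embedding):
    # Row-wise cosine similarity between query and code embeddings:
    # normalize each row, then take the elementwise dot product.
    q = l2_normalize(query_embedding, axis=1)
    c = l2_normalize(code_embedding, axis=1)
    return np.sum(q * c, axis=1)

# Unit-test style checks: identical embeddings score 1.0,
# orthogonal embeddings score 0.0.
queries = np.array([[3.0, 4.0], [1.0, 0.0]])
codes = np.array([[3.0, 4.0], [0.0, 1.0]])
sims = cosine_similarity(queries, codes)
assert np.allclose(sims, [1.0, 0.0])
```

With the computation pulled out into a method like this, the eager test reduces to feeding small hand-computed tensors and asserting on the result, with no session or graph setup.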

@jlewi jlewi added priority/p1 area/example/code_search The code search example labels Sep 30, 2018
cwbeitel commented Oct 3, 2018

jlewi pushed a commit to jlewi/examples that referenced this issue Nov 2, 2018
Fix Model export to support computing code embeddings: Fix kubeflow#260

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code and computes
  the embeddings appropriately.

* Originally based on kubeflow#233 by @activatedgeek

Loss function improvements

* See kubeflow#259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in kubeflow#233
  and this pulls in some of those changes.

Add manual tests

* Related to kubeflow#258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.
jlewi added a commit to jlewi/examples that referenced this issue Nov 2, 2018 (same commit message as above)
k8s-ci-robot pushed a commit that referenced this issue Nov 2, 2018
* Fix model export, loss function, and add some manual tests.

Fix Model export to support computing code embeddings: Fix #260

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code and computes
  the embeddings appropriately.

* Originally based on #233 by @activatedgeek

Loss function improvements

* See #259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in #233
  and this pulls in some of those changes.

Add manual tests

* Related to #258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.

* Skip the test when running under prow because it's a manual test.
* Fix some lint errors.

* Fix lint and skip tests.

* Fix lint.

* Revert loss function changes; we can do that in a follow-on PR.

* Run generate_data as part of the test rather than reusing a cached
  vocab and processed input file.

* Modify SimilarityTransformer so we can override the number of shards
  used easily to facilitate testing.

* Comment out py-test for now.
yixinshi pushed a commit to yixinshi/examples that referenced this issue Nov 30, 2018 (same commit message as above)
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this issue Dec 6, 2018 (same commit message as above)
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this issue Apr 1, 2019 (same commit message as above)
stale bot commented Jun 27, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot closed this as completed Jul 4, 2019