Use conditionals and add test for code search #291
Conversation
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. |
54db8d9
to
7c2ecdd
CLAs look good, thanks! |
Fix Model export to support computing code embeddings: Fix kubeflow#260

* The previous exported model was always using the embeddings trained for the search query.
* But we need to be able to compute embedding vectors for both the query and code.
* To support this we add a new input feature "embed_code" and conditional ops. The exported model uses the value of the embed_code feature to determine whether to treat the inputs as a query string or code and computes the embeddings appropriately.
* Originally based on kubeflow#233 by @activatedgeek

Loss function improvements

* See kubeflow#259 for a long discussion about different loss functions.
* @activatedgeek was experimenting with different loss functions in kubeflow#233 and this pulls in some of those changes.

Add manual tests

* Related to kubeflow#258
* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name to distinguish it from the version inside the Tensor2Tensor library.
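The embed_code dispatch described in the commit message can be sketched roughly as follows. This is a minimal, hypothetical stand-in using numpy rather than the real TensorFlow towers; encode_query, encode_code, and the toy encoders are assumptions for illustration, not code from this PR (the real export uses conditional TF ops keyed on the "embed_code" input feature):

```python
import numpy as np

def encode_query(tokens):
    # Stand-in for the query (docstring) encoder tower.
    return np.tanh(np.mean(tokens, axis=0))

def encode_code(tokens):
    # Stand-in for the code encoder tower.
    return np.tanh(np.max(tokens, axis=0))

def embed(tokens, embed_code):
    # Mirrors the exported model's conditional: the scalar
    # "embed_code" feature selects which encoder runs.
    return encode_code(tokens) if embed_code else encode_query(tokens)

tokens = np.random.RandomState(0).randn(5, 8)
q = embed(tokens, embed_code=0)
c = embed(tokens, embed_code=1)
```

The same input yields different embeddings depending on the embed_code flag, which is exactly what the sanity check in the tests verifies.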
7c2ecdd
to
59df0dc
/hold cancel This is ready for review. |
@jlewi: GitHub didn't allow me to assign the following users: activatedgeek. Note that only kubeflow members and repo collaborators can be assigned. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @activatedgeek |
@jlewi: GitHub didn't allow me to assign the following users: activatedgeek. Note that only kubeflow members and repo collaborators can be assigned. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
* Fix some lint errors.
Reviewable status: 0 of 19 files reviewed, 6 unresolved discussions (waiting on @jlewi, @zjj2wry, and @cwbeitel)
code_search/src/code_search/t2t/function_docstring.py, line 62 at r2 (raw file):
("func-doc-pairs-000{:02}-of-00100.csv".format(i),) ] for i in range(1)
It will be important to keep track of this going forward. Somehow this got set to 1 early on in my experiments and I was working with a 1/100 fraction of the data until I noticed it. Just making note of it for the future.
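The shard-count pitfall being flagged can be made concrete with a small sketch. dataset_filenames is a hypothetical helper mirroring the quoted list comprehension, not the actual function in function_docstring.py:

```python
def dataset_filenames(num_shards):
    # Shard files follow the pattern "func-doc-pairs-000NN-of-00100.csv".
    # With range(1) only the first shard is read -- 1/100 of the data.
    return ["func-doc-pairs-000{:02}-of-00100.csv".format(i)
            for i in range(num_shards)]

single = dataset_filenames(1)
full = dataset_filenames(100)
```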
code_search/src/code_search/t2t/similarity_transformer.py, line 13 at r2 (raw file):
# We don't use the default name because there is already an older version
# included as part of the T2T library with the default name.
@registry.register_model(MODEL_NAME)
FWIW we could accomplish the same thing by just renaming the class, e.g. CsSimilarityTransformer -> cs_similarity_transformer, but that's not necessarily better.
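As a rough illustration of why renaming the class would work: T2T's registry derives the default registration name from the class name via a CamelCase-to-snake_case conversion. The regex below is an approximation of that behavior for illustration, not T2T's actual helper:

```python
import re

def default_registry_name(cls_name):
    # Approximate T2T-style default naming: insert "_" before each
    # interior capital letter, then lowercase the result.
    return re.sub(r"(?<=.)([A-Z])", r"_\1", cls_name).lower()

name_a = default_registry_name("SimilarityTransformer")
name_b = default_registry_name("CsSimilarityTransformer")
```

So renaming the class changes the registered name implicitly, while @registry.register_model(MODEL_NAME) makes the rename explicit.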
code_search/src/code_search/t2t/similarity_transformer.py, line 37 at r2 (raw file):
code_embedding = self.encode(features, 'targets')
p = tf.nn.sigmoid(tf.matmul(string_embedding, code_embedding,
I agree with Lukasz that this is a substantial change that may or may not be an improvement quality-wise, but as I understand it the goal currently is to have something runnable that can be integrated into the rest of the system while we explore model improvements in parallel.
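For context on what the quoted snippet computes, here is a hedged numpy sketch of a sigmoid-based pairwise similarity loss with in-batch negatives. This is a simplified stand-in for the kind of loss under discussion, not the PR's exact code (the loss changes were ultimately reverted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def similarity_logits(string_emb, code_emb):
    # Pairwise dot products between query and code embeddings;
    # the diagonal holds the matching (positive) pairs.
    return string_emb @ code_emb.T

rng = np.random.RandomState(0)
string_emb = rng.randn(4, 8)
code_emb = rng.randn(4, 8)

p = sigmoid(similarity_logits(string_emb, code_emb))
labels = np.eye(4)  # in-batch negatives: only the diagonal matches
loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
```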
code_search/src/code_search/t2t/similarity_transformer_test.py, line 17 at r2 (raw file):
# code_search must be a top level Python package.
python -m code_searcch.t2t.similarity_transformer_export_test
Ah, code_search is misspelled; I think that was in my code. Also FWIW this file is similarity_transformer_test, not similarity_transformer_export_test. Not too important.
code_search/src/code_search/t2t/similarity_transformer_test.py, line 89 at r2 (raw file):
# TODO(jlewi): We only test datageneration if the files don't exist.
# We should see about running it regularly. The problem I ran into
# was that it downloads the entire dataset which is quite large.
If you have a static tmp dir, such as on NFS, instead of tempfile.mkdtemp() then it won't repeat the raw data downloads.
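A minimal sketch of that idea, assuming a hypothetical fetch_once helper (this is not code from the PR, just an illustration of caching downloads in a fixed directory instead of a fresh tempfile.mkdtemp() per run):

```python
import os
import tempfile
import urllib.request

def fetch_once(url, cache_dir):
    # Download into a fixed cache directory only if the file is absent,
    # so repeated test runs skip the (large) raw-data download.
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# Demo: pre-seed the cache so the call returns without downloading.
cache = tempfile.mkdtemp()
seeded = os.path.join(cache, "data.csv")
with open(seeded, "w") as f:
    f.write("cached")
resolved = fetch_once("http://example.com/data.csv", cache)
```

The trade-off, noted below, is that a shared static directory makes the test less hermetic.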
code_search/src/code_search/t2t/test_data/kf_github_function_docstring-dev-00000-of-00001, line 0 at r2 (raw file):
If these are checked in and similarity_transformer_test sets data_dir to ...file), test_data... data generation will be skipped and some changes to GithubFunctionDocstring may not be tested. Worth noting but up to you to determine the priority presently.
@jlewi Various minor comments, none blocking, ready to approve once the presubmit tests pass, looks like there are some minor linter errors and it can't run the tests b/c tensorflow can't be imported. |
If the tests passed locally it's probably fine, fixing a CI dependency is kind of a separate issue. /lgtm |
* Revert loss function changes; we can do that in a follow on PR.
* Run generate_data as part of the test rather than reusing a cached vocab and processed input file.
* Modify SimilarityTransformer so we can overwrite the number of shards used easily to facilitate testing.
Reviewable status: 0 of 19 files reviewed, 6 unresolved discussions (waiting on @cwbeitel and @zjj2wry)
code_search/src/code_search/t2t/function_docstring.py, line 62 at r2 (raw file):
Previously, cwbeitel (Christopher Beitel) wrote…
It will be important to keep track of this going forward. Somehow this got set to 1 early on in my experiments and I was working with a 1/100 fraction of the data until I noticed it. Just making note of it for the future.
Good catch.
code_search/src/code_search/t2t/similarity_transformer.py, line 37 at r2 (raw file):
Previously, cwbeitel (Christopher Beitel) wrote…
I agree with Lukasz this is a substantial change that may or may not be an improvement vis. quality but as I understand the goal currently is to have something runnable that can be integrated into the rest of the system while we explore model improvements in parallel.
Good point; reverted the loss function changes.
code_search/src/code_search/t2t/similarity_transformer_test.py, line 17 at r2 (raw file):
Previously, cwbeitel (Christopher Beitel) wrote…
ah code_search is misspelled, I think that was in my code, also fwiw this file is similarity_transformer_test not similarity_transformer_export_test, not too important
Done.
code_search/src/code_search/t2t/similarity_transformer_test.py, line 89 at r2 (raw file):
Previously, cwbeitel (Christopher Beitel) wrote…
If you have a static tmp dir, such as on NFS, instead of tempfile.mkdtemp() then it won't repeat the raw data downloads.
Changed to always run data generation.
I prefer to check it in if the data is reasonably sized. This makes the test more hermetic and repeatable.
code_search/src/code_search/t2t/test_data/kf_github_function_docstring-dev-00000-of-00001, line 0 at r2 (raw file):
Previously, cwbeitel (Christopher Beitel) wrote…
If these are checked in and similarity_transformer_test sets data_dir to ...file), test_data... data generation will be skipped and some changes to GithubFunctionDocstring may not be tested. Worth noting but up to you to determine the priority presently.
Good point. I changed the test so data generation should now always be run.
/assign @activatedgeek |
@jlewi: GitHub didn't allow me to assign the following users: activatedgeek. Note that only kubeflow members and repo collaborators can be assigned. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @activatedgeek |
@jlewi: GitHub didn't allow me to assign the following users: activatedgeek. Note that only kubeflow members and repo collaborators can be assigned. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Nice. /lgtm |
@jlewi Oh and just noticed the normalization at the end of the inference step. I don't know if that's strictly necessary, given nmslib later finds nearest neighbors using the cosine distance, which according to their manual is the standard form, i.e. equivalent to 1 - tf.matmul(tf.nn.l2_normalize(a, axis=1), tf.nn.l2_normalize(b, axis=1), transpose_b=True). |
@cwbeitel agree it's not strictly necessary, but since our training metric uses normalized vectors it seemed better to also use normalized vectors in inference. Otherwise users could easily forget to normalize at inference time (e.g. if not using nmslib) and that could cause bugs. |
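The equivalence being discussed can be checked directly with a short numpy sketch (numpy stands in for the TensorFlow ops; cosine distance is invariant to pre-normalizing the embeddings, which is why exporting normalized vectors is harmless for nmslib):

```python
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cosine_distance(a, b):
    # 1 - cosine similarity via normalized dot products, matching
    # the tf.nn.l2_normalize formulation quoted above.
    return 1.0 - l2_normalize(a) @ l2_normalize(b).T

rng = np.random.RandomState(0)
a, b = rng.randn(3, 8), rng.randn(3, 8)

d_raw = cosine_distance(a, b)
# Normalizing first (as the exported model does) gives the same distances.
d_normed = cosine_distance(l2_normalize(a), l2_normalize(b))
```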
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
* Fix model export, loss function, and add some manual tests.

Fix Model export to support computing code embeddings: Fix kubeflow#260

* The previous exported model was always using the embeddings trained for the search query.
* But we need to be able to compute embedding vectors for both the query and code.
* To support this we add a new input feature "embed_code" and conditional ops. The exported model uses the value of the embed_code feature to determine whether to treat the inputs as a query string or code and computes the embeddings appropriately.
* Originally based on kubeflow#233 by @activatedgeek

Loss function improvements

* See kubeflow#259 for a long discussion about different loss functions.
* @activatedgeek was experimenting with different loss functions in kubeflow#233 and this pulls in some of those changes.

Add manual tests

* Related to kubeflow#258
* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name to distinguish it from the version inside the Tensor2Tensor library.

* Skip the test when running under prow because it's a manual test.
* Fix some lint errors.
* Fix lint and skip tests.
* Fix lint.
* Revert loss function changes; we can do that in a follow on PR.
* Run generate_data as part of the test rather than reusing a cached vocab and processed input file.
* Modify SimilarityTransformer so we can overwrite the number of shards used easily to facilitate testing.
* Comment out py-test for now.
Fix model export, loss function, and add some manual tests.
Fix Model export to support computing code embeddings: Fix #260
The previous exported model was always using the embeddings trained for
the search query.
But we need to be able to compute embedding vectors for both the query
and code.
To support this we add a new input feature "embed_code" and conditional
ops. The exported model uses the value of the embed_code feature to determine
whether to treat the inputs as a query string or code and computes
the embeddings appropriately.
Originally based on Update Similarity Transformer Architecture to compute both embeddings #233 by @activatedgeek
Add manual tests
Related to [code_search] Unittest for similarity and loss computation using TF.Eager #258
We add a smoke test for T2T steps so we can catch bugs in the code.
We also add a smoke test for serving the model with TFServing.
We add a sanity check to ensure we get different values for the same
input based on which embeddings we are computing.
Change Problem/Model name
to distinguish it from the version inside the Tensor2Tensor library.
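The sanity check described above ("different values for the same input based on which embeddings we are computing") can be sketched as a tiny test. All names here are hypothetical stand-ins for a call to the exported model, not the PR's code:

```python
import numpy as np

def serve(tokens, embed_code):
    # Hypothetical stand-in for querying the exported model: the
    # embed_code feature selects the query tower or the code tower.
    w_query = np.eye(8)
    w_code = -np.eye(8)
    w = w_code if embed_code else w_query
    return np.tanh(np.mean(tokens, axis=0) @ w)

tokens = np.random.RandomState(1).randn(5, 8)
query_emb = serve(tokens, embed_code=0)
code_emb = serve(tokens, embed_code=1)
```

A real version of this check would call the TFServing endpoint twice with the same text and differing embed_code values, then assert the returned vectors differ.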