Use conditionals and add test for code search #291

Merged
merged 7 commits into kubeflow:master on Nov 2, 2018

Conversation

jlewi
Contributor

@jlewi jlewi commented Nov 1, 2018

Fix model export, loss function, and add some manual tests.

Fix Model export to support computing code embeddings: Fix #260

  • The previous exported model was always using the embeddings trained for
    the search query.

  • But we need to be able to compute embedding vectors for both the query
    and code.

  • To support this we add a new input feature "embed_code" and conditional
    ops. The exported model uses the value of the embed_code feature to decide
    whether to treat the inputs as a query string or as code, and computes
    the embeddings accordingly (see the sketch after this list).

  • Originally based on Update Similarity Transformer Architecture to compute both embeddings #233 by @activatedgeek
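
A minimal sketch of the conditional described above (illustrative, not the exact exported graph; it assumes the model's encode(features, key) helper and an integer "embed_code" input feature):

    import tensorflow as tf

    def compute_embedding(model, features):
        # Treat a non-zero embed_code flag as "embed the code branch";
        # otherwise embed the search-query branch.
        embed_code = tf.cast(tf.reshape(features["embed_code"], [-1])[0], tf.bool)
        return tf.cond(
            embed_code,
            lambda: model.encode(features, "targets"),  # code snippet
            lambda: model.encode(features, "inputs"))   # search query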

Add manual tests

Change Problem/Model name

  • Register the problem github_function_docstring with a different name
    to distinguish it from the version inside the Tensor2Tensor library.


@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

@googlebot

CLAs look good, thanks!

Fix Model export to support computing code embeddings: Fix kubeflow#260

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code and computes
  the embeddings appropriately.

* Originally based on kubeflow#233 by @activatedgeek

Loss function improvements

* See kubeflow#259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in kubeflow#233
  and this pulls in some of those changes.

Add manual tests

* Related to kubeflow#258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.
@jlewi
Contributor Author

jlewi commented Nov 2, 2018

/hold cancel
/assign @cwbeitel @activatedgeek

This is ready for review.

@k8s-ci-robot
Contributor

@jlewi: GitHub didn't allow me to assign the following users: activatedgeek.

Note that only kubeflow members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/hold cancel
/assign @cwbeitel @activatedgeek

This is ready for review.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jlewi jlewi changed the title [WIP] Use conditionals and add test for code search Use conditionals and add test for code search Nov 2, 2018
@jlewi
Contributor Author

jlewi commented Nov 2, 2018

/assign @activatedgeek

@k8s-ci-robot
Contributor

@jlewi: GitHub didn't allow me to assign the following users: activatedgeek.

Note that only kubeflow members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @activatedgeek

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@cwbeitel cwbeitel left a comment


Reviewable status: 0 of 19 files reviewed, 6 unresolved discussions (waiting on @jlewi, @zjj2wry, and @cwbeitel)


code_search/src/code_search/t2t/function_docstring.py, line 62 at r2 (raw file):

            ("func-doc-pairs-000{:02}-of-00100.csv".format(i),)
        ]
        for i in range(1)

It will be important to keep track of this going forward. Somehow this got set to 1 early on in my experiments and I was working with a 1/100 fraction of the data until I noticed it. Just making note of it for the future.
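
Illustrative only (not a change made in this comment): with range(1) the problem reads a single shard, roughly 1/100 of the data, whereas enumerating every shard looks like:

    # Hypothetical shard enumeration; 100 matches the -of-00100 suffix above.
    NUM_SHARDS = 100
    data_files = [
        "func-doc-pairs-000{:02}-of-00100.csv".format(i)
        for i in range(NUM_SHARDS)
    ]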


code_search/src/code_search/t2t/similarity_transformer.py, line 13 at r2 (raw file):

# We don't use the default name because there is already an older version
# included as part of the T2T library with the default name.
@registry.register_model(MODEL_NAME)

FWIW, we can accomplish the same thing by just renaming the class, e.g. CsSimilarityTransformer -> cs_similarity_transformer, but it's not necessarily better to do so.
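
A hedged sketch of what registering under an explicit name looks like (MODEL_NAME and the class name here are assumptions, not necessarily the identifiers used in the PR):

    from tensor2tensor.utils import registry
    from tensor2tensor.utils import t2t_model

    MODEL_NAME = "kf_similarity_transformer"  # any name not already registered

    # Registering under an explicit name avoids colliding with the
    # similarity_transformer already bundled with the T2T library.
    @registry.register_model(MODEL_NAME)
    class SimilarityTransformer(t2t_model.T2TModel):
        ...

Renaming the class works because T2T's default registration name is the snake_cased class name.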


code_search/src/code_search/t2t/similarity_transformer.py, line 37 at r2 (raw file):

        code_embedding = self.encode(features, 'targets')

      p = tf.nn.sigmoid(tf.matmul(string_embedding, code_embedding,

I agree with Lukasz that this is a substantial change that may or may not be an improvement with respect to quality, but as I understand it the current goal is to have something runnable that can be integrated into the rest of the system while we explore model improvements in parallel.
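
For reference, a hedged sketch of the two formulations being weighed here (names are illustrative; the loss change is reverted later in this PR, per the reply below):

    import tensorflow as tf

    def sigmoid_similarity(string_embedding, code_embedding):
        # Raw dot products squashed through a sigmoid, as in the quoted line.
        return tf.nn.sigmoid(
            tf.matmul(string_embedding, code_embedding, transpose_b=True))

    def cosine_similarity(string_embedding, code_embedding):
        # Normalized dot products; matches the cosine distance nmslib uses
        # for nearest-neighbor lookup at serving time.
        a = tf.nn.l2_normalize(string_embedding, axis=1)
        b = tf.nn.l2_normalize(code_embedding, axis=1)
        return tf.matmul(a, b, transpose_b=True)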


code_search/src/code_search/t2t/similarity_transformer_test.py, line 17 at r2 (raw file):

code_search must be a top level Python package.
python -m code_searcch.t2t.similarity_transformer_export_test

Ah, code_search is misspelled; I think that was in my code. Also, FWIW, this file is similarity_transformer_test, not similarity_transformer_export_test. Not too important.


code_search/src/code_search/t2t/similarity_transformer_test.py, line 89 at r2 (raw file):

    # TODO(jlewi): We only test datageneration if the files don't exist.
    # We should see about running it regularly. The problem I ran into
    # was that it downloads the entire dataset which is quite large.

If you use a static tmp dir, such as one on NFS, instead of tempfile.mkdtemp(), then it won't repeat the raw data downloads.
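
A rough sketch of that suggestion (the CS_TEST_DATA_DIR environment variable and helper are hypothetical, not part of this PR):

    import os
    import tempfile

    def get_test_tmp_dir():
        # Reuse a fixed directory (e.g. on NFS) when one is configured, so the
        # raw dataset is only downloaded once across test runs; otherwise fall
        # back to a throwaway directory.
        static_dir = os.environ.get("CS_TEST_DATA_DIR")
        if static_dir:
            os.makedirs(static_dir, exist_ok=True)
            return static_dir
        return tempfile.mkdtemp()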


code_search/src/code_search/t2t/test_data/kf_github_function_docstring-dev-00000-of-00001, line 0 at r2 (raw file):

If these are checked in and similarity_transformer_test sets data_dir to ...file), test_data... data generation will be skipped and some changes to GithubFunctionDocstring may not be tested. Worth noting but up to you to determine the priority presently.

@cwbeitel
Contributor

cwbeitel commented Nov 2, 2018

@jlewi Various minor comments, none blocking. Ready to approve once the presubmit tests pass; it looks like there are some minor linter errors, and it can't run the tests because tensorflow can't be imported.

@cwbeitel
Contributor

cwbeitel commented Nov 2, 2018

If the tests passed locally it's probably fine; fixing a CI dependency is kind of a separate issue.

/lgtm

* Revert loss function changes; we can do that in a follow on PR.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Nov 2, 2018

* Run generate_data as part of the test rather than reusing a cached
  vocab and processed input file.

* Modify SimilarityTransformer so we can overwrite the number of shards
  used easily to facilitate testing.
Contributor Author

@jlewi jlewi left a comment


Reviewable status: 0 of 19 files reviewed, 6 unresolved discussions (waiting on @cwbeitel and @zjj2wry)


code_search/src/code_search/t2t/function_docstring.py, line 62 at r2 (raw file):

Previously, cwbeitel (Christopher Beitel) wrote…

It will be important to keep track of this going forward. Somehow this got set to 1 early on in my experiments and I was working with a 1/100 fraction of the data until I noticed it. Just making note of it for the future.

Good catch.


code_search/src/code_search/t2t/similarity_transformer.py, line 37 at r2 (raw file):

Previously, cwbeitel (Christopher Beitel) wrote…

I agree with Lukasz that this is a substantial change that may or may not be an improvement with respect to quality, but as I understand it the current goal is to have something runnable that can be integrated into the rest of the system while we explore model improvements in parallel.

Good point; reverted the loss function changes.


code_search/src/code_search/t2t/similarity_transformer_test.py, line 17 at r2 (raw file):

Previously, cwbeitel (Christopher Beitel) wrote…

Ah, code_search is misspelled; I think that was in my code. Also, FWIW, this file is similarity_transformer_test, not similarity_transformer_export_test. Not too important.

Done.


code_search/src/code_search/t2t/similarity_transformer_test.py, line 89 at r2 (raw file):

Previously, cwbeitel (Christopher Beitel) wrote…

If you use a static tmp dir, such as one on NFS, instead of tempfile.mkdtemp(), then it won't repeat the raw data downloads.

Changed to always run data generation.

I prefer to check it in if the data is reasonably sized. This makes the test more hermetic and repeatable.


code_search/src/code_search/t2t/test_data/kf_github_function_docstring-dev-00000-of-00001, line 0 at r2 (raw file):

Previously, cwbeitel (Christopher Beitel) wrote…

If these are checked in and similarity_transformer_test sets data_dir to ...file), test_data... data generation will be skipped and some changes to GithubFunctionDocstring may not be tested. Worth noting but up to you to determine the priority presently.

Good point. I changed the test so data generation should now always be run.

@jlewi
Contributor Author

jlewi commented Nov 2, 2018

/assign @activatedgeek

@k8s-ci-robot
Contributor

@jlewi: GitHub didn't allow me to assign the following users: activatedgeek.

Note that only kubeflow members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @activatedgeek

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jlewi
Contributor Author

jlewi commented Nov 2, 2018

/assign @activatedgeek

@k8s-ci-robot
Contributor

@jlewi: GitHub didn't allow me to assign the following users: activatedgeek.

Note that only kubeflow members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @activatedgeek

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cwbeitel
Contributor

cwbeitel commented Nov 2, 2018

Nice.

/lgtm

@cwbeitel
Contributor

cwbeitel commented Nov 2, 2018

@jlewi Oh, and I just noticed the normalization at the end of the inference step. I don't know if that's strictly necessary, given that nmslib later finds nearest neighbors using the cosine distance, which according to their manual is the standard form, i.e. equivalent to 1 - tf.matmul(tf.nn.l2_normalize(a, axis=1), tf.nn.l2_normalize(b, axis=1), transpose_b=True).
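
A quick NumPy check of that equivalence (a sketch, separate from the serving code):

    import numpy as np

    def l2_normalize(x, axis=1):
        return x / np.linalg.norm(x, axis=axis, keepdims=True)

    a = np.random.rand(4, 8)
    b = np.random.rand(4, 8)
    # nmslib's cosine distance == 1 - dot product of the l2-normalized vectors.
    cosine_distance = 1.0 - np.matmul(l2_normalize(a), l2_normalize(b).T)
    # If the exported model already emits normalized embeddings, the extra
    # normalization inside this formula is a no-op and the distance is unchanged.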

@jlewi
Contributor Author

jlewi commented Nov 2, 2018

@cwbeitel Agree it's not strictly necessary; but since our training metric uses normalized vectors, it seemed better to also use normalized vectors at inference. Otherwise it seems like users could easily forget to do that at inference time (e.g. if not using nmslib), and that could cause bugs.

@jlewi
Contributor Author

jlewi commented Nov 2, 2018

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit acd8007 into kubeflow:master Nov 2, 2018
yixinshi pushed a commit to yixinshi/examples that referenced this pull request Nov 30, 2018
* Fix model export, loss function, and add some manual tests.

Fix Model export to support computing code embeddings: Fix kubeflow#260

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code and computes
  the embeddings appropriately.

* Originally based on kubeflow#233 by @activatedgeek

Loss function improvements

* See kubeflow#259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in kubeflow#233
  and this pulls in some of those changes.

Add manual tests

* Related to kubeflow#258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.

* * Skip the test when running under prow because it's a manual test.
* Fix some lint errors.

* * Fix lint and skip tests.

* Fix lint.

* * Fix lint
* Revert loss function changes; we can do that in a follow on PR.

* * Run generate_data as part of the test rather than reusing a cached
  vocab and processed input file.

* Modify SimilarityTransformer so we can overwrite the number of shards
  used easily to facilitate testing.

* Comment out py-test for now.
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this pull request Dec 6, 2018
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this pull request Apr 1, 2019