Use conditionals and add test for code search #291
Conversation
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. |
54db8d9
to
7c2ecdd
CLAs look good, thanks! |
Fix Model export to support computing code embeddings: Fix kubeflow#260

* The previous exported model was always using the embeddings trained for the search query.
* But we need to be able to compute embedding vectors for both the query and code.
* To support this we add a new input feature "embed_code" and conditional ops. The exported model uses the value of the embed_code feature to determine whether to treat the inputs as a query string or code and computes the embeddings appropriately.
* Originally based on kubeflow#233 by @activatedgeek

Loss function improvements

* See kubeflow#259 for a long discussion about different loss functions.
* @activatedgeek was experimenting with different loss functions in kubeflow#233 and this pulls in some of those changes.

Add manual tests

* Related to kubeflow#258
* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name to distinguish it from the version inside the Tensor2Tensor library.
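The embed_code dispatch described in the commit message can be sketched roughly as follows. This is a minimal, hypothetical stand-in using numpy rather than the real TensorFlow towers; encode_query, encode_code, and the toy encoders are assumptions for illustration, not code from this PR (the real export uses conditional TF ops keyed on the "embed_code" input feature):

```python
import numpy as np

def encode_query(tokens):
    # Stand-in for the query (docstring) encoder tower.
    return np.tanh(np.mean(tokens, axis=0))

def encode_code(tokens):
    # Stand-in for the code encoder tower.
    return np.tanh(np.max(tokens, axis=0))

def embed(tokens, embed_code):
    # Mirrors the exported model's conditional: the scalar
    # "embed_code" feature selects which encoder runs.
    return encode_code(tokens) if embed_code else encode_query(tokens)

tokens = np.random.RandomState(0).randn(5, 8)
q = embed(tokens, embed_code=0)
c = embed(tokens, embed_code=1)
```

The same input yields different embeddings depending on the embed_code flag, which is exactly what the sanity check in the tests verifies.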
7c2ecdd
to
59df0dc
/hold cancel This is ready for review. |
@jlewi: GitHub didn't allow me to assign the following users: activatedgeek. Note that only kubeflow members and repo collaborators can be assigned. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @activatedgeek |
@jlewi: GitHub didn't allow me to assign the following users: activatedgeek. Note that only kubeflow members and repo collaborators can be assigned. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
* Fix some lint errors.
Reviewable status: 0 of 19 files reviewed, 6 unresolved discussions (waiting on @jlewi, @zjj2wry, and @cwbeitel)
code_search/src/code_search/t2t/function_docstring.py, line 62 at r2 (raw file):
("func-doc-pairs-000{:02}-of-00100.csv".format(i),) ] for i in range(1)
It will be important to keep track of this going forward. Somehow this got set to 1 early on in my experiments and I was working with a 1/100 fraction of the data until I noticed it. Just making note of it for the future.
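The shard-count pitfall being flagged can be made concrete with a small sketch. dataset_filenames is a hypothetical helper mirroring the quoted list comprehension, not the actual function in function_docstring.py:

```python
def dataset_filenames(num_shards):
    # Shard files follow the pattern "func-doc-pairs-000NN-of-00100.csv".
    # With range(1) only the first shard is read -- 1/100 of the data.
    return ["func-doc-pairs-000{:02}-of-00100.csv".format(i)
            for i in range(num_shards)]

single = dataset_filenames(1)
full = dataset_filenames(100)
```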
code_search/src/code_search/t2t/similarity_transformer.py, line 13 at r2 (raw file):
# We don't use the default name because there is already an older version
# included as part of the T2T library with the default name.
@registry.register_model(MODEL_NAME)
FWIW we could accomplish the same thing by just renaming the class, e.g. CsSimilarityTransformer -> cs_similarity_transformer, but that's not necessarily better.
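As a rough illustration of why renaming the class would work: T2T's registry derives the default registration name from the class name via a CamelCase-to-snake_case conversion. The regex below is an approximation of that behavior for illustration, not T2T's actual helper:

```python
import re

def default_registry_name(cls_name):
    # Approximate T2T-style default naming: insert "_" before each
    # interior capital letter, then lowercase the result.
    return re.sub(r"(?<=.)([A-Z])", r"_\1", cls_name).lower()

name_a = default_registry_name("SimilarityTransformer")
name_b = default_registry_name("CsSimilarityTransformer")
```

So renaming the class changes the registered name implicitly, while @registry.register_model(MODEL_NAME) makes the rename explicit.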
code_search/src/code_search/t2t/similarity_transformer.py, line 37 at r2 (raw file):
code_embedding = self.encode(features, 'targets')
p = tf.nn.sigmoid(tf.matmul(string_embedding, code_embedding,
I agree with Lukasz that this is a substantial change that may or may not be an improvement quality-wise, but as I understand it the goal currently is to have something runnable that can be integrated into the rest of the system while we explore model improvements in parallel.
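For context on what the quoted snippet computes, here is a hedged numpy sketch of a sigmoid-based pairwise similarity loss with in-batch negatives. This is a simplified stand-in for the kind of loss under discussion, not the PR's exact code (the loss changes were ultimately reverted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def similarity_logits(string_emb, code_emb):
    # Pairwise dot products between query and code embeddings;
    # the diagonal holds the matching (positive) pairs.
    return string_emb @ code_emb.T

rng = np.random.RandomState(0)
string_emb = rng.randn(4, 8)
code_emb = rng.randn(4, 8)

p = sigmoid(similarity_logits(string_emb, code_emb))
labels = np.eye(4)  # in-batch negatives: only the diagonal matches
loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
```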
code_search/src/code_search/t2t/similarity_transformer_test.py, line 17 at r2 (raw file):
# code_search must be a top level Python package.
python -m code_searcch.t2t.similarity_transformer_export_test
Ah, code_search is misspelled; I think that was in my code. Also FWIW this file is similarity_transformer_test, not similarity_transformer_export_test. Not too important.
code_search/src/code_search/t2t/similarity_transformer_test.py, line 89 at r2 (raw file):
# TODO(jlewi): We only test datageneration if the files don't exist.
# We should see about running it regularly. The problem I ran into
# was that it downloads the entire dataset which is quite large.
If you have a static tmp dir, such as on NFS, instead of tempfile.mkdtemp() then it won't repeat the raw data downloads.
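A minimal sketch of that idea, assuming a hypothetical fetch_once helper (this is not code from the PR, just an illustration of caching downloads in a fixed directory instead of a fresh tempfile.mkdtemp() per run):

```python
import os
import tempfile
import urllib.request

def fetch_once(url, cache_dir):
    # Download into a fixed cache directory only if the file is absent,
    # so repeated test runs skip the (large) raw-data download.
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# Demo: pre-seed the cache so the call returns without downloading.
cache = tempfile.mkdtemp()
seeded = os.path.join(cache, "data.csv")
with open(seeded, "w") as f:
    f.write("cached")
resolved = fetch_once("http://example.com/data.csv", cache)
```

The trade-off, noted below, is that a shared static directory makes the test less hermetic.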
code_search/src/code_search/t2t/test_data/kf_github_function_docstring-dev-00000-of-00001, line 0 at r2 (raw file):
If these are checked in and similarity_transformer_test sets data_dir to ...file), test_data... data generation will be skipped and some changes to GithubFunctionDocstring may not be tested. Worth noting but up to you to determine the priority presently.
@jlewi Various minor comments, none blocking, ready to approve once the presubmit tests pass, looks like there are some minor linter errors and it can't run the tests b/c tensorflow can't be imported. |
If the tests passed locally it's probably fine, fixing a CI dependency is kind of a separate issue. /lgtm |
* Revert loss function changes; we can do that in a follow on PR.
* Run generate_data as part of the test rather than reusing a cached vocab and processed input file.
* Modify SimilarityTransformer so we can overwrite the number of shards used easily to facilitate testing.
Reviewable status: 0 of 19 files reviewed, 6 unresolved discussions (waiting on @cwbeitel and @zjj2wry)
code_search/src/code_search/t2t/function_docstring.py, line 62 at r2 (raw file):
Previously, cwbeitel (Christopher Beitel) wrote…
It will be important to keep track of this going forward. Somehow this got set to 1 early on in my experiments and I was working with a 1/100 fraction of the data until I noticed it. Just making note of it for the future.
Good catch.
code_search/src/code_search/t2t/similarity_transformer.py, line 37 at r2 (raw file):
Previously, cwbeitel (Christopher Beitel) wrote…
I agree with Lukasz this is a substantial change that may or may not be an improvement vis. quality but as I understand the goal currently is to have something runnable that can be integrated into the rest of the system while we explore model improvements in parallel.
Good point; reverted the loss function changes.
code_search/src/code_search/t2t/similarity_transformer_test.py, line 17 at r2 (raw file):
Previously, cwbeitel (Christopher Beitel) wrote…
ah code_search is misspelled, I think that was in my code, also fwiw this file is similarity_transformer_test not similarity_transformer_export_test, not too important
Done.
code_search/src/code_search/t2t/similarity_transformer_test.py, line 89 at r2 (raw file):
Previously, cwbeitel (Christopher Beitel) wrote…
If you have a static tmp dir, such as on NFS, instead of tempfile.mkdtemp() then it won't repeat the raw data downloads.
Changed to always run data generation.
I prefer to check it in if the data is reasonably sized. This makes the test more hermetic and repeatable.
code_search/src/code_search/t2t/test_data/kf_github_function_docstring-dev-00000-of-00001, line 0 at r2 (raw file):
Previously, cwbeitel (Christopher Beitel) wrote…
If these are checked in and similarity_transformer_test sets data_dir to ...file), test_data... data generation will be skipped and some changes to GithubFunctionDocstring may not be tested. Worth noting but up to you to determine the priority presently.
Good point. I changed the test so data generation should now always be run.
/assign @activatedgeek |
@jlewi: GitHub didn't allow me to assign the following users: activatedgeek. Note that only kubeflow members and repo collaborators can be assigned. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @activatedgeek |
@jlewi: GitHub didn't allow me to assign the following users: activatedgeek. Note that only kubeflow members and repo collaborators can be assigned. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Nice. /lgtm |
@jlewi Oh and just noticed the normalization at the end of the inference step. I don't know if that's strictly necessary, given nmslib later finds nearest neighbors using the cosine distance, which according to their manual is the standard form, i.e. equivalent to 1 - tf.matmul(tf.nn.l2_normalize(a, axis=1), tf.nn.l2_normalize(b, axis=1), transpose_b=True). |
@cwbeitel agree it's not strictly necessary, but since our training metric uses normalized vectors it seemed better to also use normalized vectors in inference. Otherwise users could easily forget to normalize at inference time (e.g. if not using nmslib) and that could cause bugs. |
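The equivalence being discussed can be checked directly with a short numpy sketch (numpy stands in for the TensorFlow ops; cosine distance is invariant to pre-normalizing the embeddings, which is why exporting normalized vectors is harmless for nmslib):

```python
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cosine_distance(a, b):
    # 1 - cosine similarity via normalized dot products, matching
    # the tf.nn.l2_normalize formulation quoted above.
    return 1.0 - l2_normalize(a) @ l2_normalize(b).T

rng = np.random.RandomState(0)
a, b = rng.randn(3, 8), rng.randn(3, 8)

d_raw = cosine_distance(a, b)
# Normalizing first (as the exported model does) gives the same distances.
d_normed = cosine_distance(l2_normalize(a), l2_normalize(b))
```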
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
* Fix model export, loss function, and add some manual tests.

Fix Model export to support computing code embeddings: Fix kubeflow#260

* The previous exported model was always using the embeddings trained for the search query.
* But we need to be able to compute embedding vectors for both the query and code.
* To support this we add a new input feature "embed_code" and conditional ops. The exported model uses the value of the embed_code feature to determine whether to treat the inputs as a query string or code and computes the embeddings appropriately.
* Originally based on kubeflow#233 by @activatedgeek

Loss function improvements

* See kubeflow#259 for a long discussion about different loss functions.
* @activatedgeek was experimenting with different loss functions in kubeflow#233 and this pulls in some of those changes.

Add manual tests

* Related to kubeflow#258
* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name to distinguish it from the version inside the Tensor2Tensor library.

* Skip the test when running under prow because it's a manual test.
* Fix some lint errors.
* Fix lint and skip tests.
* Fix lint.
* Revert loss function changes; we can do that in a follow on PR.
* Run generate_data as part of the test rather than reusing a cached vocab and processed input file.
* Modify SimilarityTransformer so we can overwrite the number of shards used easily to facilitate testing.
* Comment out py-test for now.
Fix model export, loss function, and add some manual tests.
Fix Model export to support computing code embeddings: Fix #260
The previous exported model was always using the embeddings trained for
the search query.
But we need to be able to compute embedding vectors for both the query
and code.
To support this we add a new input feature "embed_code" and conditional
ops. The exported model uses the value of the embed_code feature to determine
whether to treat the inputs as a query string or code and computes
the embeddings appropriately.
Originally based on Update Similarity Transformer Architecture to compute both embeddings #233 by @activatedgeek
Add manual tests
Related to [code_search] Unittest for similarity and loss computation using TF.Eager #258
We add a smoke test for T2T steps so we can catch bugs in the code.
We also add a smoke test for serving the model with TFServing.
We add a sanity check to ensure we get different values for the same
input based on which embeddings we are computing.
Change Problem/Model name
to distinguish it from the version inside the Tensor2Tensor library.
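The sanity check described above ("different values for the same input based on which embeddings we are computing") can be sketched as a tiny test. All names here are hypothetical stand-ins for a call to the exported model, not the PR's code:

```python
import numpy as np

def serve(tokens, embed_code):
    # Hypothetical stand-in for querying the exported model: the
    # embed_code feature selects the query tower or the code tower.
    w_query = np.eye(8)
    w_code = -np.eye(8)
    w = w_code if embed_code else w_query
    return np.tanh(np.mean(tokens, axis=0) @ w)

tokens = np.random.RandomState(1).randn(5, 8)
query_emb = serve(tokens, embed_code=0)
code_emb = serve(tokens, embed_code=1)
```

A real version of this check would call the TFServing endpoint twice with the same text and differing embed_code values, then assert the returned vectors differ.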