Openspiel+Bazel+Tensorflow build failure #172

Closed
lancelot-ch opened this issue Mar 15, 2020 · 78 comments

@lancelot-ch

lancelot-ch commented Mar 15, 2020

Hello,

I tried to use Bazel to build OpenSpiel on its own, and that succeeds. But when I use Bazel to build OpenSpiel inside a TensorFlow source folder, the build fails.

I put the open_spiel folder inside the TensorFlow folder and used TensorFlow's WORKSPACE. The build gives the following errors (it seems that compiling passes, while linking fails with errors on absl). When I removed the .bazelrc file from the TensorFlow source folder, the Bazel build passed again. Could anyone help? Thanks a lot.

Environment: WSL Ubuntu 18.04; OpenSpiel latest version; Bazel 1.2.1; TensorFlow built from source; Python 3.6.

Linking of rule '//tensorflow/open_spiel/games:go_test' failed (Exit 1)
bazel-out/k8-opt/bin/tensorflow/open_spiel/tests/_objs/basic_tests/basic_tests.o:basic_tests.cc:function std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > absl::StrCat<char [25], int, char [17], int, char [2], std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(absl::AlphaNum const&, absl::AlphaNum const&, absl::AlphaNum const&, absl::AlphaNum const&, absl::AlphaNum const&, char const (&) [25], int const&, char const (&) [17], int const&, char const (&) [2], std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&): error: undefined reference to 'absl::strings_internal::CatPieces[abi:cxx11](std::initializer_list<std::basic_string_view<char, std::char_traits<char> > >)'
bazel-out/k8-opt/bin/tensorflow/open_spiel/_objs/spiel/spiel.o:spiel.cc:function open_spiel::SampleAction(std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > const&, absl::BitGenRef): error: undefined reference to 'absl::strings_internal::CatPieces[abi:cxx11](std::initializer_list<std::basic_string_view<char, std::char_traits<char> > >)'
bazel-out/k8-opt/bin/tensorflow/open_spiel/_objs/spiel/spiel.o:spiel.cc:function open_spiel::Game::DeserializeState(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const: error: undefined reference to 'absl::ByChar::Find(std::basic_string_view<char, std::char_traits<char> >, unsigned long) const'
bazel-out/k8-opt/bin/tensorflow/open_spiel/_objs/spiel/spiel.o:spiel.cc:function open_spiel::Game::DeserializeState(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const: error: undefined reference to 'absl::ByChar::Find(std::basic_string_view<char, std::char_traits<char> >, unsigned long) const'
bazel-out/k8-opt/bin/tensorflow/open_spiel/_objs/spiel/spiel.o:spiel.cc:function open_spiel::GameRegisterer::CreateByName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, open_spiel::GameParameter, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, open_spiel::GameParameter> > > const&): error: undefined reference to 'absl::strings_internal::CatPieces[abi:cxx11](std::initializer_list<std::basic_string_view<char, std::char_traits<char> > >)'
bazel-out/k8-opt/bin/tensorflow/open_spiel/_objs/spiel/spiel.o:spiel.cc:function open_spiel::DeserializeGameAndState(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&): error: undefined reference to 'absl::ByChar::Find(std::basic_string_view<char, std::char_traits<char> >, unsigned long) const'
bazel-out/k8-opt/bin/tensorflow/open_spiel/_objs/spiel/spiel.o:spiel.cc:function open_spiel::DeserializeGameAndState(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&): error: undefined reference to 'absl::ByChar::Find(std::basic_string_view<char, std::char_traits<char> >, unsigned long) const'

Best regards,
Lancelot

lancelot-ch changed the title from "Openspiel+bazel+tensorflow build failure" to "Openspiel+Bazel+Tensorflow build failure" on Mar 15, 2020
@lanctot
Collaborator

lanctot commented Mar 15, 2020

Hi @lancelot-ch ,

Yes, I think this is being caused by a recent bug in Abseil. We got around it by pinning the checkout to a specific commit for now, so if you pull from master (and remove the abseil-cpp directory that install.sh clones, then rerun install.sh), I think you'll be fine.

There is also a very quick fix to a single file in Abseil that you can apply, and Abseil itself should be fixed by tomorrow. For details, see: abseil/abseil-cpp#640

@lancelot-ch
Author

Thanks, @lanctot !

I tried the fix, and the problem isn't solved yet. I thought that could be the cause, given that I downloaded abseil-cpp on March 9. However, even after I deleted it and switched to the March 7 version of abseil-cpp as you suggested, the problem is still there.

One thought is that I am using Bazel to build, so a change to a CMake file might not help. Also, I remember encountering this problem in a previous build, where I could circumvent it with the command below.

bazel build --copt=-std=c++17 open_spiel/games/go_test

So might the current problem be caused by TensorFlow, since it also pulls in another version of Abseil? I am confused.

@lanctot
Collaborator

lanctot commented Mar 15, 2020

Good point, agreed: it cannot be due to a change in CMake files in your case.

But I guess it must be related because those are the exact same errors I was getting (but in different files).

@lanctot
Collaborator

lanctot commented Mar 15, 2020

> Therefore the current problem might be induced by Tensorflow? Since it also drags in another version of Abseil? I am confused.

Might be.

I will point the team to this issue, maybe someone will have an idea.

You could also try following up on the abseil-cpp thread, they responded pretty quickly and they might have an idea (even if it may not be directly related to absl).

@derekmauro

When you say you put open_spiel in a TensorFlow folder, what exactly do you mean?

I think the problem is likely that you are accidentally doing a mixed-mode compile somehow. I think TensorFlow by default uses C++14 and it looks like open_spiel uses C++17. Linking will fail in this case because absl::string_view is its own backport type in C++14, but becomes std::string_view in C++17. The differing ABI is probably what is causing this not to link.

This is probably the single most common issue that trips people up.
https://github.com/abseil/abseil-cpp/blob/master/FAQ.md#how-to-i-set-the-c-dialect-used-to-build-abseil
https://github.com/abseil/abseil-cpp/blob/master/FAQ.md#what-is-abi-and-why-dont-you-recommend-using-a-pre-compiled-version-of-abseil
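
One way to check a link log for this kind of mixed-dialect build is to look for undefined absl symbols whose signature mentions std::basic_string_view: the caller then saw absl::string_view as an alias for std::string_view (C++17), which a library built as C++14 would not define. The sketch below is only a heuristic under that assumption; the function name, the regular expression, and the abbreviated sample line are made up for illustration and are not part of Abseil or Bazel.

```python
import re

# Matches GNU ld / gold "undefined reference to ..." messages and captures
# the missing symbol when it lives in the absl namespace.
UNDEFINED_ABSL = re.compile(r"undefined reference to [`'](absl::[^']*)'")


def find_dialect_suspects(linker_log):
    """Return unique undefined absl symbols that mention std::basic_string_view.

    Heuristic only: such a symbol suggests the caller was compiled with
    absl::string_view aliased to std::string_view (C++17 mode).
    """
    suspects = []
    for match in UNDEFINED_ABSL.finditer(linker_log):
        symbol = match.group(1)
        if "std::basic_string_view" in symbol and symbol not in suspects:
            suspects.append(symbol)
    return suspects


if __name__ == "__main__":
    # Abbreviated sample modelled on the log lines in this issue.
    sample = (
        "spiel.o:spiel.cc:function open_spiel::DeserializeGameAndState(...): "
        "error: undefined reference to 'absl::ByChar::Find("
        "std::basic_string_view<char, std::char_traits<char> >, "
        "unsigned long) const'\n"
    )
    for symbol in find_dialect_suspects(sample):
        print(symbol)
```

An empty result does not rule out an ABI mismatch; it only means this particular signature pattern was not found.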

@lancelot-ch
Author

lancelot-ch commented Mar 16, 2020

Thanks, @lanctot @derekmauro

By "put open_spiel in a TensorFlow folder", I mean that I cloned the TensorFlow source code and put open_spiel inside the source folder, e.g. "tensorflow/tensorflow/open_spiel/spiel.cc". I kept the WORKSPACE file from the TensorFlow root folder. I cloned abseil-cpp into the open_spiel folder, as "tensorflow/tensorflow/open_spiel/Abseil-cpp". I used "bazel build --copt=-std=c++17 open_spiel/games/go_test" to build.

My goal is to use OpenSpiel in a C++ program for AlphaZero self-play computation. For network inference, I am thinking of using the TensorFlow C++ API, which has to be built from source, and that leads to the current situation.

Meanwhile, I will experiment with the ABI options and see if I can work around this issue. If there are other ways to achieve my goal, please let me know.

Thanks.

@derekmauro

I'll also point you to https://abseil.io/blog/201901115-options and https://github.com/abseil/abseil-cpp/blob/master/absl/base/options.h. You should be able to make this work with a little tweaking.

@lancelot-ch
Author

Thanks a lot, @derekmauro . I will give it a try.

I am not sure if I can handle it well, though. TensorFlow pulls in an older version of Abseil as well, so I might have to work with that too. Besides, TensorFlow defaults to C++14 and OpenSpiel requires C++17. I had not expected such a difficult scenario...

@qmaai

qmaai commented Mar 20, 2020

@lanctot @lancelot-ch
Thanks for this thread, it saved my day. I have been compiling with gcc/9.2.0, and the Abseil issue was gone after pulling the master head again.

@lanctot
Collaborator

lanctot commented Mar 20, 2020

> Thanks a lot, @derekmauro . I will give it a try.
>
> I am not sure if I can handle it well, though. TensorFlow pulls in an older version of Abseil as well, so I might have to work with that too. Besides, TensorFlow defaults to C++14 and OpenSpiel requires C++17. I had not expected such a difficult scenario...

We are about to release some code in the next few weeks that would require TF inference (note, btw in contrib/ we have some code there already; it's not supported but works internally).

We might spend a bit of effort trying to see how to get this to work because it would be nice to support the TF-compiled code. It's unfortunately not high priority, but we can at least try. I agree the C++14 / C++17 mix is awkward.

One option is to go back down to C++14 until TF allows C++17 (I'm not sure how many C++17 features we truly rely on); this would require some work on our part, probably not much. But it feels like the wrong direction. We will soon need to support TF 2.2 because Ubuntu comes with Python 3.8 (see #166 for details), and it would be really nice if TF 2.2 supported C++17, because then we would not have to mix C++ standards and would not have to go back to an old standard just to support compiling with TF.

@lancelot-ch
Author

lancelot-ch commented Mar 22, 2020

Thanks, @lanctot . I hadn't noticed there is already a TF example; I will read it through. From a quick look, I think I am on a similar route. I also read the link recommended in the example's note. It mentions two options: build with Bazel, or link against the TF API in a .so file.

I took the first path with Bazel and wrote the code below, following code and instructions from other links with some revisions. The TF inference in C++ works on its own, but I cannot get it to work when combined with OpenSpiel, which is what prompted this thread.

I also tried the second path, linking against a compiled TF .so file. But I know little about linking, so I failed several times. I probably need to wait until OpenSpiel makes more progress here.

I am still experimenting with the methods Derek suggested earlier. Every time I change the settings in Abseil's options.h, I have to compile the whole of TF again, so it takes a very long time to try every possible solution. So far I still cannot get it to work. It seems I can only ever satisfy one of OpenSpiel or TF inference.

Bazel+TF inference solution example:

hello.cc
#include <iostream>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

#include "tensorflow/cc/saved_model/loader.h"
#include "tensorflow/cc/saved_model/tag_constants.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/public/session.h"

using tensorflow::Tensor;
using tensorflow::TensorShape;
using tensorflow::Status;
using tensorflow::Session;
using tensorflow::SavedModelBundle;
using tensorflow::SessionOptions;
using tensorflow::RunOptions;

int main() {
  const std::string export_dir = "./tensorflow/test6/model";
  SavedModelBundle bundle;
  SessionOptions session_options;
  RunOptions run_options;

  // Load the model from the SavedModel directory.
  Status status = tensorflow::LoadSavedModel(
      session_options, run_options, export_dir,
      {tensorflow::kSavedModelTagServe}, &bundle);
  if (!status.ok()) {
    std::cout << "Failed to load saved model" << std::endl;
    std::cout << status.ToString() << std::endl;
    return -1;
  }

  // Build a 1x2x2x2 input tensor holding the values 1..8.
  Tensor input(tensorflow::DT_FLOAT, TensorShape({1, 2, 2, 2}));
  auto input_map = input.tensor<float, 4>();
  input_map(0, 0, 0, 0) = 1;
  input_map(0, 0, 0, 1) = 2;
  input_map(0, 0, 1, 0) = 3;
  input_map(0, 0, 1, 1) = 4;
  input_map(0, 1, 0, 0) = 5;
  input_map(0, 1, 0, 1) = 6;
  input_map(0, 1, 1, 0) = 7;
  input_map(0, 1, 1, 1) = 8;

  // Prediction: fetch the policy and value outputs in one Run call.
  std::vector<Tensor> outputs;
  status = bundle.session->Run({{"input", input}},
                               {"output/LogSoftmax", "value/Tanh"}, {},
                               &outputs);
  if (!status.ok()) {
    std::cout << "Failed to run session (output/LogSoftmax:0)" << std::endl;
    std::cout << status.ToString() << std::endl;
    return -1;
  }
  std::cout << outputs[0].DebugString() << std::endl;
  auto dd = outputs[0].tensor<float, 2>();
  std::cout << dd << std::endl;
  std::cout << outputs[1].DebugString() << std::endl;
  auto ee = outputs[1].tensor<float, 2>();
  std::cout << ee << std::endl;

  return 0;
}
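
The eight assignments in hello.cc fill the tensor in row-major order, which gives the same values as test_data in train.py, so the C++ and Python outputs can be compared directly. A plain-Python sketch of that check follows; the helper name and the axis names (n, c, h, w) are assumptions made for illustration, based on the transpose in train.py.

```python
def build_input():
    """Fill a 1x2x2x2 nested list in the same index order as hello.cc."""
    shape = (1, 2, 2, 2)
    tensor = [[[[0.0 for _ in range(shape[3])]
                for _ in range(shape[2])]
               for _ in range(shape[1])]
              for _ in range(shape[0])]
    value = 1.0
    for n in range(shape[0]):
        for c in range(shape[1]):
            for h in range(shape[2]):
                for w in range(shape[3]):
                    # Last index varies fastest, as in the C++ assignments.
                    tensor[n][c][h][w] = value
                    value += 1.0
    return tensor


# Same literal as test_data in train.py.
TEST_DATA = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]]

if __name__ == "__main__":
    print(build_input() == TEST_DATA)  # prints True
```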

BUILD

load("//tensorflow:tensorflow.bzl", "tf_cc_binary")

package(
    default_visibility = ["//tensorflow:internal"],
    licenses = ["notice"],  # Apache 2.0
)

tf_cc_binary(
    name = "Hello",
    srcs = [
        "hello.cc",
    ],
    deps = select({
        "//conditions:default": [
            "//tensorflow/cc:cc_ops",
            "//tensorflow/core:core_cpu",
            "//tensorflow/core:framework",
            "//tensorflow/core:framework_internal",
            "//tensorflow/core:lib",
            "//tensorflow/core:protos_all_cc",
            "//tensorflow/core:tensorflow",
            "//tensorflow/cc/saved_model:tag_constants",
            "//tensorflow/cc/saved_model:loader",
        ],
    }),
)

train.py
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

board_height = 2
board_width = 2
train = np.array([[[[1, 1], [2, 2]], [[3, 3], [4, 4]]]], dtype='float32')

# Channels-first input: [batch, channels, height, width].
x = tf.placeholder(tf.float32, shape=[None, 2, 2, 2], name='input')
# The conv layers below expect channels-last data.
input_state = tf.transpose(x, [0, 2, 3, 1])

conv1 = tf.layers.conv2d(inputs=input_state,
                         filters=32, kernel_size=[3, 3],
                         padding="same", data_format="channels_last",
                         activation=tf.nn.relu)
conv2 = tf.layers.conv2d(inputs=conv1, filters=64,
                         kernel_size=[3, 3], padding="same",
                         data_format="channels_last",
                         activation=tf.nn.relu)
conv3 = tf.layers.conv2d(inputs=conv2, filters=128,
                         kernel_size=[3, 3], padding="same",
                         data_format="channels_last",
                         activation=tf.nn.relu)

# Policy head.
action_conv = tf.layers.conv2d(inputs=conv3, filters=4,
                               kernel_size=[1, 1], padding="same",
                               data_format="channels_last",
                               activation=tf.nn.relu)

# action_conv has 4 filters, so each sample flattens to 4 * H * W values.
action_conv_flat = tf.reshape(
    action_conv, [-1, 4 * board_height * board_width])

prediction = tf.layers.dense(inputs=action_conv_flat,
                             units=board_height * board_width,
                             activation=tf.nn.log_softmax, name="output")

# Value head.
evaluation_conv = tf.layers.conv2d(inputs=conv3, filters=2,
                                   kernel_size=[1, 1],
                                   padding="same",
                                   data_format="channels_last",
                                   activation=tf.nn.relu)
evaluation_conv_flat = tf.reshape(
    evaluation_conv, [-1, 2 * board_height * board_width])
evaluation_fc1 = tf.layers.dense(inputs=evaluation_conv_flat,
                                 units=64, activation=tf.nn.relu)

evaluation_fc2 = tf.layers.dense(inputs=evaluation_fc1,
                                 units=1, activation=tf.nn.tanh, name="value")

# Print the operation names so the output nodes can be looked up from C++.
for op in tf.get_default_graph().get_operations():
    print(op.name)

test_data = np.array([[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]], dtype='float32')
test_label = test_data

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    results = sess.run([prediction, evaluation_fc2], feed_dict={x: test_data})
    print(results)

    # Save the model while the session is still open.
    inputs_dict = {
        "x": x
    }

    outputs_dict = {
        "output/LogSoftmax": prediction, "value/Tanh": evaluation_fc2
    }

    tf.saved_model.simple_save(sess, './model', inputs_dict, outputs_dict)
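
train.py flattens each convolutional head with tf.reshape(..., [-1, k * board_height * board_width]). That keeps one row per sample only when k equals the head's filter count, since a [N, H, W, C] tensor holds C * H * W values per sample. The arithmetic can be checked without TensorFlow; the helper below is a hypothetical illustration, not part of the script.

```python
def rows_after_flatten(batch, height, width, filters, flat_width):
    """Rows produced by reshaping a [batch, H, W, filters] tensor to [-1, flat_width]."""
    total = batch * height * width * filters
    if total % flat_width != 0:
        raise ValueError("flat_width does not divide the tensor size")
    return total // flat_width


if __name__ == "__main__":
    h, w = 2, 2
    # Value head: 2 filters flattened to 2 * h * w columns.
    print(rows_after_flatten(1, h, w, filters=2, flat_width=2 * h * w))  # 1
    # A 4-filter head needs 4 * h * w columns to keep one row per sample.
    print(rows_after_flatten(1, h, w, filters=4, flat_width=4 * h * w))  # 1
    # With only 2 * h * w columns, a 4-filter head yields two rows per sample.
    print(rows_after_flatten(1, h, w, filters=4, flat_width=2 * h * w))  # 2
```

A row count above the batch size means the dense layer after the reshape would emit more than one prediction per input sample.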

@lanctot
Collaborator

lanctot commented Mar 22, 2020

> I also tried the second path, linking against a compiled TF .so file. But I know little about linking, so I failed several times. I probably need to wait until OpenSpiel makes more progress here.

Interesting, I didn't realize this was an option. Can you point me to where you found out about it or to any instructions on how to do it? If we could get it to work without having to compile TF from scratch, I'd far prefer that option. I wonder if the pip packages come with everything you need, though (we would need the headers in addition to the .so).

After a few quick searches, I found that TF actually has a CMake build: https://github.com/tensorflow/tensorflow/tree/9590c4c32dd4346ea5c35673336f5912c6072bf2/tensorflow/contrib/cmake . This is great; it might make it easier to integrate with OpenSpiel. I'll keep you posted if I look at this further and make any progress.

@lancelot-ch
Author

@lanctot The link in the note of OpenSpiel's tf_trajectories.h file, https://tebesu.github.io/posts/Training-a-TensorFlow-graph-in-C++-API, says "There are two ways to compile this: one is bazel and the other is linking against the tensorflow library. I prefer the latter."

https://stackoverflow.com/questions/42898577/list-of-headers-to-use-tensorflow-c-api-using-libtensorflow-cc-so

The above link describes how to link against the libtensorflow_cc.so file, but I just cannot reproduce it on my system. I still think this is a promising way to avoid having to unify the compiler settings of the TF and OpenSpiel source code. It seems an elegant solution: we could use a TensorFlow C++ API just like the TensorFlow Python API and TensorFlow C API that TF provides.

@lancelot-ch
Author

lancelot-ch commented Mar 22, 2020

@lanctot Please also refer to the following two links. One uses Bazel and the other uses CMake. There are some parts I cannot fully follow, so I am still studying them.

https://github.com/FloopCZ/tensorflow_cc

https://github.com/bitbionic/keras-to-tensorflow

@lanctot
Collaborator

lanctot commented Apr 20, 2020

Quick heads-up that I've been playing around with tensorflow_cc; for now just trying to get it to compile with TF2.2 and Ubuntu 20.04 (we need to change OpenSpiel to support these soon so I'm taking the opportunity to try it in this environment).

I've run into some trouble:

It's great that TF can be compiled with CMake; it would make it a lot easier to support compiling with it externally within OpenSpiel.

@lanctot
Collaborator

lanctot commented Apr 22, 2020

OK, I've managed to compile TensorFlow via CMake using tensorflow_cc (and a very new version as well, TF2.2rc2 on Ubuntu 20.04!). Thanks for pointing us to that, @lancelot-ch ; it seems like a great project.

This is the first step to getting OpenSpiel + TF compiling externally together. I can't promise any time lines, but it means we're just a few steps from getting them to work together.

@lancelot-ch
Author

@lanctot Thanks for your wonderful contribution. I tried a few times but with no luck. Recently I have been occupied with other work, but I am looking forward to your next milestones!

@mrdaliri
Contributor

Hi,
I was able to build the latest commit (4300bc4) with TensorFlow v1.15.2 (via TensorflowCC and Bazel 0.26.1) on macOS Catalina (Apple Clang 11.0.3). I had to modify some CMake files and rename some #include statements. Please see the diff here.

However, I'm getting the following error from vpnet_test (ctest --verbose -R vpnet_test):

20: dyld: lazy symbol binding failed: Symbol not found: __ZN10open_spiel10algorithms14CreateGraphDefERKNS_4GameEddRKNSt3__112basic_stringIcNS4_11char_traitsIcEENS4_9allocatorIcEEEESC_SA_iib
20:   Referenced from: ~/open_spiel/build/algorithms/alpha_zero/vpnet_test
20:   Expected in: flat namespace
20:
20: dyld: Symbol not found: __ZN10open_spiel10algorithms14CreateGraphDefERKNS_4GameEddRKNSt3__112basic_stringIcNS4_11char_traitsIcEENS4_9allocatorIcEEEESC_SA_iib
20:   Referenced from: ~/open_spiel/build/algorithms/alpha_zero/vpnet_test
20:   Expected in: flat namespace
20:
1/1 Test #20: vpnet_test .......................Child aborted***Exception:   1.06 sec

Do you have any ideas?

@mrdaliri
Contributor

I also tried on an Ubuntu 18.04 machine with clang version 6.0.0-1ubuntu2. This time, the build failed with the following errors:

CMakeFiles/vpnet_test.dir/vpnet_test.cc.o: In function `open_spiel::algorithms::(anonymous namespace)::BuildModel(open_spiel::Game const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)':
vpnet_test.cc:(.text+0x90a): undefined reference to `open_spiel::algorithms::CreateGraphDef(open_spiel::Game const&, double, double, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, bool)'
vpnet_test.cc:(.text+0xa9d): undefined reference to `open_spiel::algorithms::VPNetModel::VPNetModel(open_spiel::Game const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
CMakeFiles/vpnet_test.dir/vpnet_test.cc.o: In function `open_spiel::algorithms::(anonymous namespace)::TestModelCreation(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
vpnet_test.cc:(.text+0xddf): undefined reference to `open_spiel::algorithms::VPNetModel::Inference(std::vector<open_spiel::algorithms::VPNetModel::InferenceInputs, std::allocator<open_spiel::algorithms::VPNetModel::InferenceInputs> > const&)'
vpnet_test.cc:(.text+0xf5e): undefined reference to `open_spiel::algorithms::VPNetModel::Learn(std::vector<open_spiel::algorithms::VPNetModel::TrainInputs, std::allocator<open_spiel::algorithms::VPNetModel::TrainInputs> > const&)'
CMakeFiles/vpnet_test.dir/vpnet_test.cc.o: In function `open_spiel::algorithms::(anonymous namespace)::TestModelLearnsSimple(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
vpnet_test.cc:(.text+0x14d7): undefined reference to `open_spiel::algorithms::VPNetModel::Inference(std::vector<open_spiel::algorithms::VPNetModel::InferenceInputs, std::allocator<open_spiel::algorithms::VPNetModel::InferenceInputs> > const&)'
vpnet_test.cc:(.text+0x17a2): undefined reference to `open_spiel::algorithms::VPNetModel::Learn(std::vector<open_spiel::algorithms::VPNetModel::TrainInputs, std::allocator<open_spiel::algorithms::VPNetModel::TrainInputs> > const&)'
CMakeFiles/vpnet_test.dir/vpnet_test.cc.o: In function `open_spiel::algorithms::(anonymous namespace)::TestModelLearnsOptimal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<open_spiel::algorithms::VPNetModel::TrainInputs, std::allocator<open_spiel::algorithms::VPNetModel::TrainInputs> > const&)':
vpnet_test.cc:(.text+0x21bf): undefined reference to `open_spiel::algorithms::VPNetModel::Learn(std::vector<open_spiel::algorithms::VPNetModel::TrainInputs, std::allocator<open_spiel::algorithms::VPNetModel::TrainInputs> > const&)'
collect2: error: ld returned 1 exit status
algorithms/alpha_zero/CMakeFiles/vpnet_test.dir/build.make:393: recipe for target 'algorithms/alpha_zero/vpnet_test' failed
make[2]: *** [algorithms/alpha_zero/vpnet_test] Error 1
CMakeFiles/Makefile2:4606: recipe for target 'algorithms/alpha_zero/CMakeFiles/vpnet_test.dir/all' failed

@mrdaliri
Contributor

P.S: The TensorflowCC (Tensorflow v1.15.2) is working without problem in both environments.

@lanctot
Collaborator

lanctot commented Jun 24, 2020

Hi @mrdaliri , yes I think you're not linking to the alpha_zero library that you're defining in open_spiel/algorithms/alpha_zero/CMakeLists.txt.

You're defining a library when you do this:

add_library (alpha_zero OBJECT
  alpha_zero.h
  alpha_zero.cc
  device_manager.h
  vpevaluator.h
  vpevaluator.cc
  vpnet.h
  vpnet.cc
)

But then the executables are not including them because they are not being bundled into OPEN_SPIEL_OBJECTS. You will also need to add a line somewhere around here: https://github.com/deepmind/open_spiel/blob/695fad0ac25383e7f66cb0bb30fa8a4ea07d6bb9/open_spiel/CMakeLists.txt#L154

@mrdaliri
Contributor

Oh I assumed uncommenting was enough. I changed that block to the following:

set (OPEN_SPIEL_OBJECTS
  $<TARGET_OBJECTS:open_spiel_core>
  $<TARGET_OBJECTS:games>
  $<TARGET_OBJECTS:game_transforms>
  $<TARGET_OBJECTS:bridge_double_dummy_solver>
  $<TARGET_OBJECTS:algorithms>
  $<TARGET_OBJECTS:alpha_zero>
  $<TARGET_OBJECTS:utils>
)

Now I'm getting a large number of errors on Ubuntu, all of them about TensorFlow symbols. It seems TensorFlow is not yet linked correctly:

alpha_zero/CMakeFiles/alpha_zero.dir/alpha_zero.cc.o: In function `tensorflow::core::RefCounted::~RefCounted()':
alpha_zero.cc:(.text._ZN10tensorflow4core10RefCountedD2Ev[_ZN10tensorflow4core10RefCountedD2Ev]+0x1a8): undefined reference to `tensorflow::internal::LogMessageFatal::LogMessageFatal(char const*, int)'
alpha_zero.cc:(.text._ZN10tensorflow4core10RefCountedD2Ev[_ZN10tensorflow4core10RefCountedD2Ev]+0x1d5): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
alpha_zero.cc:(.text._ZN10tensorflow4core10RefCountedD2Ev[_ZN10tensorflow4core10RefCountedD2Ev]+0x1f0): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
alpha_zero/CMakeFiles/alpha_zero.dir/alpha_zero.cc.o: In function `open_spiel::algorithms::VPNetModel::~VPNetModel()':
alpha_zero.cc:(.text._ZN10open_spiel10algorithms10VPNetModelD2Ev[_ZN10open_spiel10algorithms10VPNetModelD2Ev]+0x31): undefined reference to `tensorflow::MetaGraphDef::~MetaGraphDef()'
alpha_zero/CMakeFiles/alpha_zero.dir/alpha_zero.cc.o: In function `tensorflow::MetaGraphDef::MetaGraphDef(tensorflow::MetaGraphDef&&)':
alpha_zero.cc:(.text._ZN10tensorflow12MetaGraphDefC2EOS0_[_ZN10tensorflow12MetaGraphDefC2EOS0_]+0x1c): undefined reference to `tensorflow::MetaGraphDef::MetaGraphDef()'
alpha_zero/CMakeFiles/alpha_zero.dir/alpha_zero.cc.o: In function `tensorflow::MetaGraphDef::operator=(tensorflow::MetaGraphDef&&)':
alpha_zero.cc:(.text._ZN10tensorflow12MetaGraphDefaSEOS0_[_ZN10tensorflow12MetaGraphDefaSEOS0_]+0x63): undefined reference to `tensorflow::MetaGraphDef::InternalSwap(tensorflow::MetaGraphDef*)'
alpha_zero.cc:(.text._ZN10tensorflow12MetaGraphDefaSEOS0_[_ZN10tensorflow12MetaGraphDefaSEOS0_]+0x7f): undefined reference to `tensorflow::MetaGraphDef::CopyFrom(tensorflow::MetaGraphDef const&)'
alpha_zero/CMakeFiles/alpha_zero.dir/alpha_zero.cc.o: In function `tensorflow::ConfigProto::ConfigProto(tensorflow::ConfigProto&&)':
alpha_zero.cc:(.text._ZN10tensorflow11ConfigProtoC2EOS0_[_ZN10tensorflow11ConfigProtoC2EOS0_]+0x1c): undefined reference to `tensorflow::ConfigProto::ConfigProto()'
alpha_zero/CMakeFiles/alpha_zero.dir/alpha_zero.cc.o: In function `tensorflow::ConfigProto::operator=(tensorflow::ConfigProto&&)':
alpha_zero.cc:(.text._ZN10tensorflow11ConfigProtoaSEOS0_[_ZN10tensorflow11ConfigProtoaSEOS0_]+0x63): undefined reference to `tensorflow::ConfigProto::InternalSwap(tensorflow::ConfigProto*)'
alpha_zero.cc:(.text._ZN10tensorflow11ConfigProtoaSEOS0_[_ZN10tensorflow11ConfigProtoaSEOS0_]+0x7f): undefined reference to `tensorflow::ConfigProto::CopyFrom(tensorflow::ConfigProto const&)'
alpha_zero/CMakeFiles/alpha_zero.dir/alpha_zero.cc.o: In function `tensorflow::SessionOptions::~SessionOptions()':
alpha_zero.cc:(.text._ZN10tensorflow14SessionOptionsD2Ev[_ZN10tensorflow14SessionOptionsD2Ev]+0x1f): undefined reference to `tensorflow::ConfigProto::~ConfigProto()'
alpha_zero/CMakeFiles/alpha_zero.dir/alpha_zero.cc.o: In function `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >* tensorflow::internal::MakeCheckOpString<long, int>(long const&, int const&, char const*)':
alpha_zero.cc:(.text._ZN10tensorflow8internal17MakeCheckOpStringIliEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc[_ZN10tensorflow8internal17MakeCheckOpStringIliEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc]+0x24): undefined reference to `tensorflow::internal::CheckOpMessageBuilder::CheckOpMessageBuilder(char const*)'
alpha_zero.cc:(.text._ZN10tensorflow8internal17MakeCheckOpStringIliEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc[_ZN10tensorflow8internal17MakeCheckOpStringIliEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc]+0x51): undefined reference to `tensorflow::internal::CheckOpMessageBuilder::ForVar2()'
alpha_zero.cc:(.text._ZN10tensorflow8internal17MakeCheckOpStringIliEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc[_ZN10tensorflow8internal17MakeCheckOpStringIliEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc]+0x75): undefined reference to `tensorflow::internal::CheckOpMessageBuilder::NewString[abi:cxx11]()'
alpha_zero.cc:(.text._ZN10tensorflow8internal17MakeCheckOpStringIliEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc[_ZN10tensorflow8internal17MakeCheckOpStringIliEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc]+0x87): undefined reference to `tensorflow::internal::CheckOpMessageBuilder::~CheckOpMessageBuilder()'
alpha_zero.cc:(.text._ZN10tensorflow8internal17MakeCheckOpStringIliEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc[_ZN10tensorflow8internal17MakeCheckOpStringIliEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc]+0xa3): undefined reference to `tensorflow::internal::CheckOpMessageBuilder::~CheckOpMessageBuilder()'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `open_spiel::algorithms::VPNetModel::VPNetModel(open_spiel::Game const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
vpnet.cc:(.text+0xa8a): undefined reference to `tensorflow::MetaGraphDef::MetaGraphDef()'
vpnet.cc:(.text+0xaab): undefined reference to `tensorflow::SessionOptions::SessionOptions()'
vpnet.cc:(.text+0xdc5): undefined reference to `tensorflow::Env::Default()'
vpnet.cc:(.text+0xdf9): undefined reference to `tensorflow::ReadBinaryProto(tensorflow::Env*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::MessageLite*)'
vpnet.cc:(.text+0xe5d): undefined reference to `tensorflow::internal::LogMessageFatal::LogMessageFatal(char const*, int)'
vpnet.cc:(.text+0xe8d): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0xf18): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0xff6): undefined reference to `tensorflow::internal::LogMessageFatal::LogMessageFatal(char const*, int)'
vpnet.cc:(.text+0x1026): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x1055): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x1089): undefined reference to `tensorflow::NewSession(tensorflow::SessionOptions const&, tensorflow::Session**)'
vpnet.cc:(.text+0x10ed): undefined reference to `tensorflow::internal::LogMessageFatal::LogMessageFatal(char const*, int)'
vpnet.cc:(.text+0x111d): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x114c): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x120b): undefined reference to `tensorflow::internal::LogMessageFatal::LogMessageFatal(char const*, int)'
vpnet.cc:(.text+0x123b): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x126a): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x147a): undefined reference to `tensorflow::internal::LogMessageFatal::LogMessageFatal(char const*, int)'
vpnet.cc:(.text+0x14aa): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x1605): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x163e): undefined reference to `tensorflow::MetaGraphDef::~MetaGraphDef()'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `open_spiel::algorithms::VPNetModel::SaveCheckpoint[abi:cxx11](int)':
vpnet.cc:(.text+0x174d): undefined reference to `tensorflow::Tensor::Tensor(tensorflow::DataType, tensorflow::TensorShape const&)'
vpnet.cc:(.text+0x1ad8): undefined reference to `tensorflow::internal::LogMessageFatal::LogMessageFatal(char const*, int)'
vpnet.cc:(.text+0x1b08): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x1d84): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x1ead): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x1f2f): undefined reference to `tensorflow::Tensor::~Tensor()'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `open_spiel::algorithms::VPNetModel::LoadCheckpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
vpnet.cc:(.text+0x1fbe): undefined reference to `tensorflow::Tensor::Tensor(tensorflow::DataType, tensorflow::TensorShape const&)'
vpnet.cc:(.text+0x2336): undefined reference to `tensorflow::internal::LogMessageFatal::LogMessageFatal(char const*, int)'
vpnet.cc:(.text+0x2363): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x258f): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x2598): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x25aa): undefined reference to `tensorflow::Tensor::~Tensor()'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `open_spiel::algorithms::VPNetModel::Inference(std::vector<open_spiel::algorithms::VPNetModel::InferenceInputs, std::allocator<open_spiel::algorithms::VPNetModel::InferenceInputs> > const&)':
vpnet.cc:(.text+0x2650): undefined reference to `tensorflow::Tensor::Tensor(tensorflow::DataType, tensorflow::TensorShape const&)'
vpnet.cc:(.text+0x26cb): undefined reference to `tensorflow::Tensor::Tensor(tensorflow::DataType, tensorflow::TensorShape const&)'
vpnet.cc:(.text+0x2d38): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x2d6c): undefined reference to `tensorflow::internal::LogMessageFatal::LogMessageFatal(char const*, int)'
vpnet.cc:(.text+0x2d9c): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x2f94): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x3029): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x3304): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x330d): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x3341): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x334a): undefined reference to `tensorflow::Tensor::~Tensor()'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `open_spiel::algorithms::VPNetModel::Learn(std::vector<open_spiel::algorithms::VPNetModel::TrainInputs, std::allocator<open_spiel::algorithms::VPNetModel::TrainInputs> > const&)':
vpnet.cc:(.text+0x33f0): undefined reference to `tensorflow::Tensor::Tensor(tensorflow::DataType, tensorflow::TensorShape const&)'
vpnet.cc:(.text+0x346b): undefined reference to `tensorflow::Tensor::Tensor(tensorflow::DataType, tensorflow::TensorShape const&)'
vpnet.cc:(.text+0x34e9): undefined reference to `tensorflow::Tensor::Tensor(tensorflow::DataType, tensorflow::TensorShape const&)'
vpnet.cc:(.text+0x3560): undefined reference to `tensorflow::Tensor::Tensor(tensorflow::DataType, tensorflow::TensorShape const&)'
vpnet.cc:(.text+0x3fa0): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x3fd4): undefined reference to `tensorflow::internal::LogMessageFatal::LogMessageFatal(char const*, int)'
vpnet.cc:(.text+0x4004): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x4310): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x43a5): undefined reference to `tensorflow::internal::LogMessageFatal::~LogMessageFatal()'
vpnet.cc:(.text+0x450b): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x4517): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x4523): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x452c): undefined reference to `tensorflow::Tensor::~Tensor()'
vpnet.cc:(.text+0x4554): undefined reference to `tensorflow::Tensor::~Tensor()'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o:vpnet.cc:(.text+0x4560): more undefined references to `tensorflow::Tensor::~Tensor()' follow
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `tensorflow::TfCheckOpHelper[abi:cxx11](tensorflow::Status, char const*)':
vpnet.cc:(.text._ZN10tensorflow15TfCheckOpHelperB5cxx11ENS_6StatusEPKc[_ZN10tensorflow15TfCheckOpHelperB5cxx11ENS_6StatusEPKc]+0x38): undefined reference to `tensorflow::TfCheckOpHelperOutOfLine[abi:cxx11](tensorflow::Status const&, char const*)'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `tensorflow::MetaGraphDef::graph_def() const':
vpnet.cc:(.text._ZNK10tensorflow12MetaGraphDef9graph_defEv[_ZNK10tensorflow12MetaGraphDef9graph_defEv]+0x32): undefined reference to `tensorflow::_GraphDef_default_instance_'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `tensorflow::TensorShape::TensorShape()':
vpnet.cc:(.text._ZN10tensorflow11TensorShapeC2Ev[_ZN10tensorflow11TensorShapeC2Ev]+0x11): undefined reference to `tensorflow::TensorShapeBase<tensorflow::TensorShape>::TensorShapeBase()'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `tensorflow::TTypes<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, 1, long>::Scalar tensorflow::Tensor::scalar<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >()':
vpnet.cc:(.text._ZN10tensorflow6Tensor6scalarINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEENS_6TTypesIT_Li1ElE6ScalarEv[_ZN10tensorflow6Tensor6scalarINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEENS_6TTypesIT_Li1ElE6ScalarEv]+0x15): undefined reference to `tensorflow::Tensor::CheckIsAlignedAndSingleElement() const'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `tensorflow::MetaGraphDef::saver_def() const':
vpnet.cc:(.text._ZNK10tensorflow12MetaGraphDef9saver_defEv[_ZNK10tensorflow12MetaGraphDef9saver_defEv]+0x32): undefined reference to `tensorflow::_SaverDef_default_instance_'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>::~pair()':
vpnet.cc:(.text._ZNSt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN10tensorflow6TensorEED2Ev[_ZNSt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN10tensorflow6TensorEED2Ev]+0x1f): undefined reference to `tensorflow::Tensor::~Tensor()'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `tensorflow::TensorShape::TensorShapeBase(std::initializer_list<long long>)':
vpnet.cc:(.text._ZN10tensorflow11TensorShapeCI2NS_15TensorShapeBaseIS0_EEESt16initializer_listIxE[_ZN10tensorflow11TensorShapeCI2NS_15TensorShapeBaseIS0_EEESt16initializer_listIxE]+0x2c): undefined reference to `tensorflow::TensorShapeBase<tensorflow::TensorShape>::TensorShapeBase(std::initializer_list<long long>)'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `tensorflow::TTypes<float, 1, long>::Scalar tensorflow::Tensor::scalar<float>()':
vpnet.cc:(.text._ZN10tensorflow6Tensor6scalarIfEENS_6TTypesIT_Li1ElE6ScalarEv[_ZN10tensorflow6Tensor6scalarIfEENS_6TTypesIT_Li1ElE6ScalarEv]+0x15): undefined reference to `tensorflow::Tensor::CheckIsAlignedAndSingleElement() const'
alpha_zero/CMakeFiles/alpha_zero.dir/vpnet.cc.o: In function `google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type* google::protobuf::internal::RepeatedPtrFieldBase::Mutable<google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler>(int)':

@mrdaliri
Contributor

@lanctot: I've added TensorflowCC example as per your suggestion, and disabled all other alpha_zero targets (alpha_zero library + vpnet_test). It compiles, links and runs perfectly. Please see my fork at commit #9ebbf6 for modified CMake files and tensorflow_cc_test file.

So, perhaps the issue is in one of the alpha_zero files and TensorflowCC (by itself) is working just fine. My guess is some sort of symbol name clash between the external modules (Eigen, Tensorflow and Abseil).

@lanctot
Collaborator

lanctot commented Jun 25, 2020

Ok thanks for doing that.. it will help move us along at least.

I unfortunately have a lot of reviewing to do tonight but maybe tomorrow I can clone your fork and take a look. @tewalds, any ideas?

I'm reluctant to blame Eigen or Abseil only because all the undefined references are coming from Tensorflow itself. So possibly TensorflowCC is not exposing everything we need from TF (doubtful) or it's providing several link targets and we're not using the ones we need. @FloopCZ, do these errors look familiar to you.. or do you have any ideas on what we could be doing wrong?

@lanctot
Collaborator

lanctot commented Jun 25, 2020

Hey @mrdaliri I also noticed TensorflowCC is using TF2.2. I will have my PR that lets us upgrade to TF2.2 (#249) ready to go. I have already imported it, there's just a bunch of work to do on it which I plan to get done tomorrow. So it will likely be in on Friday morning. If we don't have this solved by then I'd like to see what happens if we try this with TF2.2 instead of TF1.15.

@mrdaliri
Contributor

Yes, by default it is using TF 2.2. Just to clarify, I modified the TensorflowCC config file and changed its TF version to 1.15.2. So all tests were running against TF 1.15.2.

@michalsustr
Collaborator

michalsustr commented Jun 25, 2020

I'm trying it out.

@lanctot
Collaborator

lanctot commented Jul 2, 2020

Great let's Google this one function or link error and find out when it was added and/or how to include it in the .so. Almost there!!!

@lanctot
Collaborator

lanctot commented Jul 2, 2020

At this point we can also try posting on TF github. Since it's just down to one error, with luck they might get back to us today.

@mrdaliri
Contributor

mrdaliri commented Jul 2, 2020

Great progress, ok this now maybe comes down to just one extra flag you need to pass when building protobufs.

Out of curiosity, what does the output of ldd on your executable of Michal's binary look like?

Is TensorflowCC linking to a different version of protobufs?

Do you mean his run_tf example? I think it doesn't use protobufs. Here is the ldd output:

	linux-vdso.so.1 (0x00007ffc79f8f000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f21391a9000)
	libtensorflow_cc.so.2 => /usr/local/lib/libtensorflow_cc.so.2 (0x00007f212c2f2000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f212bf69000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f212bd51000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f212b960000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f212b741000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f212b3a3000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f212b19f000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f21395cf000)

What I understand is that Protobuf is NOT included in tensorflow_cc, so we have to install it externally, and then link it manually. (target_link_libraries(tf_trajectories_example ${PROTOBUF_LIBRARIES} TensorflowCC::TensorflowCC))
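
Editorial note: the check discussed above (does the binary link against a separate protobuf library?) can be automated. The following is a minimal Python sketch, not part of OpenSpiel or TensorflowCC. The helper names `linked_libraries` and `links_against` are hypothetical, and the sample text is a subset of the ldd output quoted above.

```python
def linked_libraries(ldd_output):
    """Map each library name in `ldd` output to its resolved path (or None)."""
    libs = {}
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=>" in line:
            name, _, rest = line.partition("=>")
            # Drop the trailing load address, e.g. "(0x00007f21...)".
            path = rest.split("(")[0].strip()
            libs[name.strip()] = path or None
        else:
            # Entries such as linux-vdso or the dynamic loader have no "=>".
            libs[line.split("(")[0].strip()] = None
    return libs


def links_against(ldd_output, fragment):
    """True if any linked library name contains `fragment`."""
    return any(fragment in name for name in linked_libraries(ldd_output))


# Subset of the ldd output quoted in the comment above.
SAMPLE = """\
    linux-vdso.so.1 (0x00007ffc79f8f000)
    libtensorflow_cc.so.2 => /usr/local/lib/libtensorflow_cc.so.2 (0x00007f212c2f2000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f212bf69000)
"""

print(links_against(SAMPLE, "libprotobuf"))
print(links_against(SAMPLE, "libtensorflow_cc"))
```

In practice the text would come from running `ldd` on the built executable; here it is passed in as a string so the sketch stays self-contained.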

@lanctot
Collaborator

lanctot commented Jul 2, 2020

TF uses protobufs, so they might have a subset of protobufs within the TF code (likely an older version).

Ok could I ask you to post a brief message on TF github pointing to the most relevant comment in this thread and quoting the final link error?

I will also post on our internal sites and contact the TF devs, but it would be great if I could include a link to your post.

@mrdaliri
Contributor

mrdaliri commented Jul 2, 2020

Something happened to that ArenaImpl::AllocateAligned function in this PR of the protobuf library: protocolbuffers/protobuf#6869. It dates back to Nov 2019, so I'm trying v3.10.1, which was released in Oct 2019. It's currently compiling and should be done in ~15 min. If it fails, I'll post the issue on the TF repo.

@mrdaliri
Contributor

mrdaliri commented Jul 2, 2020

Great news! Protobuf 3.10.1 did the trick! I was able to run vpnet_test. It ended with a core dump, but it generated some output first:

TestModelCreation: mlp
WARNING:tensorflow:From /home/ubuntu/open_spiel/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0702 13:53:32.102488 140085524649792 deprecation.py:506] From /home/ubuntu/open_spiel/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/ubuntu/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py:280: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0702 13:53:32.153110 140085524649792 deprecation.py:323] From /home/ubuntu/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py:280: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2020-07-02 13:53:32.536587: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-02 13:53:32.543252: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2199280000 Hz
2020-07-02 13:53:32.543927: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x440f1b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-02 13:53:32.543957: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Game: tic_tac_toe()
Model type: mlp(64, 2)
Model size: 14922 variables
Variables:
torso_0_dense/kernel:0: (27, 64)
torso_0_dense/bias:0: (64,)
torso_1_dense/kernel:0: (64, 64)
torso_1_dense/bias:0: (64,)
policy_dense/kernel:0: (64, 64)
policy_dense/bias:0: (64,)
policy/kernel:0: (64, 9)
policy/bias:0: (9,)
value_dense/kernel:0: (64, 64)
value_dense/bias:0: (64,)
value/kernel:0: (64, 1)
value/bias:0: (1,)
2020-07-02 13:53:32.839291: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-02 13:53:32.847541: F tensorflow/core/framework/op.cc:214] Non-OK-status: RegisterAlreadyLocked(deferred_[i]) status: Invalid argument: No attr with name '/cpu:0' for input 'constants'; in OpDef: name: "XlaLaunch" input_arg { name: "constants" description: "/cpu:0" type_attr: "/cpu:0" number_attr: "/cpu:0" type_list_attr: "Tconstants" } input_arg { name: "args" description: "/cpu:0" type_attr: "/cpu:0" number_attr: "/cpu:0" type_list_attr: "Targs" } input_arg { name: "resources" description: "/cpu:0" type: DT_RESOURCE type_attr: "/cpu:0" number_attr: "Nresources" type_list_attr: "/cpu:0" } output_arg { name: "results" description: "/cpu:0" type_attr: "/cpu:0" number_attr: "/cpu:0" type_list_attr: "Tresults" } attr { name: "Tconstants" type: "list(type)" description: "/cpu:0" has_minimum: true } attr { name: "Targs" type: "list(type)" description: "/cpu:0" has_minimum: true } attr { name: "Nresources" type: "int" description: "/cpu:0" has_minimum: true } attr { name: "Tresults" type: "list(type)" description: "/cpu:0" has_minimum: true } attr { name: "function" type: "func" description: "/cpu:0" } summary: "XLA Launch Op. For use by the XLA JIT only." description: "/cpu:0" is_stateful: true
[1] 10383 abort (core dumped) ./build/algorithms/alpha_zero/vpnet_test

@mrdaliri
Contributor

mrdaliri commented Jul 2, 2020

@lanctot is it the expected output?

EDIT: I just realized that it is an error. A similar error occurred when I tried alpha_zero_example (from the examples folder).

@lanctot
Collaborator

lanctot commented Jul 2, 2020

Yeah I saw, could you try tf_trajectories_example? Probably will be the same error. I guess this one will be easier to find/solve with some online search.

@tewalds , do you recognize that error?

@mrdaliri
Contributor

mrdaliri commented Jul 2, 2020

@lanctot tf_trajectories_example throws the following error:

Spiel Fatal Error: InformationStateTensorShape unimplemented.

I changed line 41 of its Python file (https://github.com/deepmind/open_spiel/blob/549e48010a81c023902a39c41319ed08769d3f26/open_spiel/contrib/python/export_graph.py#L41) to info_state_shape = game.observation_tensor_shape() as you told me previously. What should be changed in its C++ files?

@mrdaliri
Contributor

mrdaliri commented Jul 2, 2020

Found this issue tensorflow/tensorflow#38393. It seems that the OP in that issue was using TF_CC with TF 2.2-rc2 (like me). Maybe it got solved in 2.2.0 final version. I have to build TF_CC again which takes some time. In the meantime, I'll try TF_CC docker container which uses 2.2.0 final version.

Side note: my machine is GPU-enabled with CUDA installed. Since the error looks related to XLA, there might be a mismatch between the installed CUDA libraries and what TF 2.2 requires.

@lanctot
Collaborator

lanctot commented Jul 2, 2020

to info_state_shape = game.observation_tensor_shape() as you told me previously. What should be changed in its C++ files?

I will finally fix this today. But basically every instance of information_state_tensor should be changed to observation_tensor.
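
Editorial note: the rename described above can be sketched as a plain text substitution. This is a minimal Python sketch under stated assumptions, not the fix that was actually committed. The helper name `rename_tensor_calls` is hypothetical, and the CamelCase mapping for C++ sources (InformationStateTensor to ObservationTensor) is an assumption extrapolated from the snake_case names in the comment.

```python
def rename_tensor_calls(source):
    """Replace information-state tensor names with observation tensor names.

    Covers the snake_case spelling used in Python and, as an assumption,
    the CamelCase spelling used in C++.
    """
    # Python-style names, including the *_shape() variant.
    source = source.replace("information_state_tensor", "observation_tensor")
    # C++-style names (assumed mapping), including the *Shape() variant.
    return source.replace("InformationStateTensor", "ObservationTensor")


python_line = "info_state_shape = game.information_state_tensor_shape()"
cpp_line = "state->InformationStateTensor(player)"

print(rename_tensor_calls(python_line))
print(rename_tensor_calls(cpp_line))
```

A blind substitution like this is only appropriate for games that implement the observation tensor; the result should be reviewed by hand before committing.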

@mrdaliri
Contributor

mrdaliri commented Jul 2, 2020

to info_state_shape = game.observation_tensor_shape() as you told me previously. What should be changed in its C++ files?

I will finally fix this today. But basically every instance of information_state_tensor should be changed to observation_tensor.

I think I've fixed it now (see my tf_trajectories_cpp branch). It also failed with the same XLA-related runtime errors.

@mrdaliri
Contributor

mrdaliri commented Jul 2, 2020

Same errors with TF_CC Docker (TF2.2).

I'm now re-building TF_CC with TF_ENABLE_XLA=0 to disable XLA completely.

@mrdaliri
Contributor

mrdaliri commented Jul 2, 2020

It doesn't look related to XLA. The issue is with the SetDefaultDevice. Related issues: FloopCZ/tensorflow_cc#136 tensorflow/tensorflow#38393 tensorflow/tensorflow#5379 tensorflow/tensorflow#16291

For tf_trajectories_example, if I comment out that line, it runs without issues (no errors but no output). However, with vpnet_test, commenting out SetDefaultDevice leads to the following error, as reported earlier by @michalsustr:

F /usr/local/include/tensorflow/bazel-bin/tensorflow/include/tensorflow/core/platform/refcount.h:90] Check failed: ref_.load() == 0 (1 vs. 0)

@tewalds
Contributor

tewalds commented Jul 3, 2020

Is there a better way to set the device, so that you can load the same graph onto each device on a system with multiple GPUs?

@mrdaliri
Contributor

mrdaliri commented Jul 3, 2020

New updates:

I re-compiled TF_CC with a modified version of Tensorflow 2.2 (here). I added *protobuf*; to tensorflow/tf_version_script.lds, which (I think) exports the symbols of the bundled protobuf library from the resulting .so file. So there is no need to externally compile, install and link protobuf.
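
Editorial note: the version-script change can be sketched in Python as a small text transformation. This is only an illustration of adding a pattern to the `global:` section of a GNU ld version script. The helper name `add_global_pattern` is hypothetical, and the sample script below is an assumed, simplified layout, not the actual contents of tensorflow/tf_version_script.lds.

```python
def add_global_pattern(script, pattern):
    """Add `pattern;` right after the first `global:` label of a version script.

    Returns the script unchanged if the pattern is already listed.
    """
    entry = pattern + ";"
    lines = script.splitlines()
    if any(line.strip() == entry for line in lines):
        return script

    out = []
    inserted = False
    for line in lines:
        out.append(line)
        if not inserted and line.strip() == "global:":
            # Indent the new entry two spaces deeper than the label.
            indent = line[: len(line) - len(line.lstrip())] + "  "
            out.append(indent + entry)
            inserted = True
    if not inserted:
        raise ValueError("no 'global:' section found")
    return "\n".join(out) + "\n"


# Assumed, simplified version script (illustrative only).
SAMPLE = """\
tensorflow {
  global:
    *tensorflow*;
  local:
    *;
};
"""

patched = add_global_pattern(SAMPLE, "*protobuf*")
print(patched)
```

Patterns under `global:` stay visible to programs linking against the shared library, while `local: *;` hides everything else, which is why a missing pattern can show up as undefined references at link time.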

It also eliminates all errors with SetDefaultDevice. Here is the current status of our three TF-based C++ targets in OpenSpiel:

tf_trajectories_example

It runs perfectly without any modifications.

vpnet_test

It builds and runs without any modifications; however, it still throws the following runtime error:

F /usr/local/include/tensorflow/bazel-bin/tensorflow/include/tensorflow/core/platform/refcount.h:90] Check failed: ref_.load() == 0 (1 vs. 0)

alpha_zero_example

This is an example which uses vpnet and alpha_zero to play TicTacToe. It is also compiled and built without modifications. But it core-dumped because of the following TF check:

F tensorflow/core/framework/tensor.cc:672] Check failed: IsAligned() Aligned and single element

I found this issue (benoitsteiner/tensorflow-opencl#49) which suggests a relation to Eigen. There might be a version mismatch between TF's Eigen and the one we have included in OpenSpiel.

@jeremysalwen
Contributor

jeremysalwen commented Jul 4, 2020

Hello all, I am on Debian 10, and I was able to get the tensorflow_cc_test from mrdaliri's comment (commit 9ebbf6b) building and running on my machine. (Fresh install of cuda 10.2 with cudnn 7.6.5.32, tensorflow_cc v2.2.0, libprotobuf-dev and protobuf-compiler 3.6.1.3-2)

However, when I update to the latest az_cpp_cmake branch (commit aba6e59), vpnet_test builds, but then fails on run with
"Could not load dynamic library 'libcudart.so.10.1'" (Note I have 10.2 installed instead of 10.1) This seems to be some sort of bug with tensorflow trying to link 10.1 even though I built it from source against 10.2. Following this comment, I created a symlink from libcudart.so.10.1=>libcudart.so.10.2 , which removed the error message, but I am still hitting

Game: tic_tac_toe()
Model type: mlp(64, 2)
Model size: 14922 variables
Variables:
torso_0_dense/kernel:0: (27, 64)
torso_0_dense/bias:0: (64,)
torso_1_dense/kernel:0: (64, 64)
torso_1_dense/bias:0: (64,)
policy_dense/kernel:0: (64, 64)
policy_dense/bias:0: (64,)
policy/kernel:0: (64, 9)
policy/bias:0: (9,)
value_dense/kernel:0: (64, 64)
value_dense/bias:0: (64,)
value/kernel:0: (64, 1)
value/bias:0: (1,)
2020-07-03 20:01:10.921455: F tensorflow/core/framework/op.cc:214] Non-OK-status: RegisterAlreadyLocked(deferred_[i]) status: Invalid argument: No attr with name '/cpu:0' for input 'constants'; in OpDef: name: "XlaLaunch" input_arg { name: "constants" description: "/cpu:0" type_attr: "/cpu:0" number_attr: "/cpu:0" type_list_attr: "Tconstants" } input_arg { name: "args" description: "/cpu:0" type_attr: "/cpu:0" number_attr: "/cpu:0" type_list_attr: "Targs" } input_arg { name: "resources" description: "/cpu:0" type: DT_RESOURCE type_attr: "/cpu:0" number_attr: "Nresources" type_list_attr: "/cpu:0" } output_arg { name: "results" description: "/cpu:0" type_attr: "/cpu:0" number_attr: "/cpu:0" type_list_attr: "Tresults" } attr { name: "Tconstants" type: "list(type)" description: "/cpu:0" has_minimum: true } attr { name: "Targs" type: "list(type)" description: "/cpu:0" has_minimum: true } attr { name: "Nresources" type: "int" description: "/cpu:0" has_minimum: true } attr { name: "Tresults" type: "list(type)" description: "/cpu:0" has_minimum: true } attr { name: "function" type: "func" description: "/cpu:0" } summary: "XLA Launch Op. For use by the XLA JIT only." description: "/cpu:0" is_stateful: true
Aborted

EDIT: Just saw your new comment from today, will try out the modified version of tensorflow.

@mrdaliri
Contributor

mrdaliri commented Jul 4, 2020

Hi @jeremysalwen,
Thanks for testing it out!

You need to modify tensorflow/tf_version_script.lds and rebuild your tensorflow_cc. You can switch to my fork of tensorflow_cc (branch mrdaliri-tf-mirror), which is connected to my Tensorflow fork (the modified lds file is in branch r2.2-protobuf). Then run CMake for TF_CC with an extra flag which sets the TF version to the modified branch. Here is the complete set of commands:

git clone https://github.com/mrdaliri/tensorflow_cc 
cd tensorflow_cc
git checkout mrdaliri-tf-mirror
cd tensorflow_cc
mkdir build && cd build
cmake -DTENSORFLOW_TAG=r2.2-protobuf ..
make
sudo make install

@jeremysalwen
Contributor

Hi @mrdaliri I was able to build and install tensorflow_cc using your modified version of tensorflow. I was then able to build open_spiel, but running the vpnet_test fails with

[libprotobuf FATAL google/protobuf/stubs/common.cc:68] This program requires version 3.8.0 of the Protocol Buffer runtime library, but the installed version is 3.6.1.  Please update your library.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  This program requires version 3.8.0 of the Protocol Buffer runtime library, but the installed version is 3.6.1.  Please update your library.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)

I tried uninstalling my libprotobuf-dev debian package, but then openspiel refuses to compile at all (cmake complains). Do you have further modifications to open_spiel to address this?
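
Editorial note: the fatal message above names two versions, the one the generated code was compiled against and the one found at run time. A minimal Python sketch for pulling both out of such a log line follows. The helper name `parse_protobuf_mismatch` is hypothetical, and the regular expression assumes the message wording shown above.

```python
import re

_MISMATCH = re.compile(
    r"requires version (\d+(?:\.\d+)*) of the Protocol Buffer runtime library, "
    r"but the installed version is (\d+(?:\.\d+)*)"
)


def parse_protobuf_mismatch(message):
    """Return (required, installed) as integer tuples, or None if no match."""
    match = _MISMATCH.search(message)
    if match is None:
        return None

    def to_tuple(version):
        return tuple(int(part) for part in version.split("."))

    return to_tuple(match.group(1)), to_tuple(match.group(2))


# Shortened copy of the message quoted above.
LOG = (
    "[libprotobuf FATAL google/protobuf/stubs/common.cc:68] This program "
    "requires version 3.8.0 of the Protocol Buffer runtime library, but the "
    "installed version is 3.6.1.  Please update your library."
)

required, installed = parse_protobuf_mismatch(LOG)
# Tuples compare component-wise, so (3, 8, 0) > (3, 6, 1).
print("runtime library too old:", installed < required)
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating "3.10.1" as older than "3.8.0".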

@lanctot
Collaborator

lanctot commented Jul 8, 2020

@mrdaliri could you add a section to the AlphaZero README.md describing the steps necessary to compile and run TF within OpenSpiel?

Actually it could even be better as a separate independent doc (because maybe it will come up again in different contexts) and for now we can link from the AlphaZero doc and in the header of tf_trajectories.

@mrdaliri
Contributor

mrdaliri commented Jul 9, 2020

Hi @lanctot,
By "compile and run TF within OpenSpiel", do you mean to download/compile/install TF (and Protobuf, if needed) just like an external library, the way we currently do for Eigen and others? So far, I have compiled and installed them externally (outside of OpenSpiel) and then linked the installed shared libs with CMake.

@lanctot
Collaborator

lanctot commented Jul 9, 2020

Yes basically something tidy that people can read and follow to reproduce your success in getting this to work.

@mrdaliri
Contributor

mrdaliri commented Jul 9, 2020

Sure. I'll make a pull request so you can try it out before adding it to master.

@mrdaliri
Contributor

mrdaliri commented Jul 9, 2020

Hi @mrdaliri I was able to build and install tensorflow_cc using your modified version of tensorflow. I was then able to build open_spiel, but running the vpnet_test fails with

[libprotobuf FATAL google/protobuf/stubs/common.cc:68] This program requires version 3.8.0 of the Protocol Buffer runtime library, but the installed version is 3.6.1.  Please update your library.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  This program requires version 3.8.0 of the Protocol Buffer runtime library, but the installed version is 3.6.1.  Please update your library.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)

I tried uninstalling my libprotobuf-dev debian package, but then openspiel refuses to compile at all (cmake complains). Do you have further modifications to open_spiel to address this?

Hi @jeremysalwen, if you have built and installed the modified version of TF, you don't need external Protobuf anymore. Please check out my latest commit on az_cpp_cmake branch. It removes protobuf from CMake files.

P.S. I don't have libprotobuf-dev package on my Ubuntu machine. So you need to uninstall it also. Protobuf is bundled with TF (AFAIK) so if you enable exporting Protobuf symbols (that's what I did in my TF modified version), you shouldn't need any extra libraries.

@alextrudeau

Hi @mrdaliri, when do you think your fixes will be pushed to master in OpenSpiel? Will there also be documentation explaining how to properly compile an AlphaZero instance with the fixes?

Thanks!

@mrdaliri
Contributor

Hi @alextrudeau,
Please see my latest PR (#307). It includes instructions for building against TensorflowCC, in an easy way!

@lanctot
Collaborator

lanctot commented Nov 16, 2020

Thanks @mrdaliri , this PR will be merged tomorrow.
