
Model equivalent to nn4.small2.v1.t7 #108

Closed

mhaghighat opened this issue Jan 10, 2017 · 14 comments

@mhaghighat
Contributor

mhaghighat commented Jan 10, 2017

I wonder why the size of the model files (meta and ckpt) is so big compared to the Torch model (nn4.small2.v1.t7) provided in the OpenFace library?

  • model-20161116-234200.ckpt-80000 (600MB)
  • model-20161116-234200.meta (63MB)

compared to:

  • nn4.small2.v1.t7 (31.5MB)

And is there any model equivalent to the nn4.small2.v1.t7 that is small and can be used in the TensorFlow implementation?

Thanks.

@mhaghighat mhaghighat changed the title Equivalent model to nn4.small2.v1.t7 Model equivalent to nn4.small2.v1.t7 Jan 10, 2017
@davidsandberg
Owner

Hi @mhaghighat,
I haven't looked at the difference between the file sizes, but it's definitely an interesting thing to look at. I guess there are mainly two factors contributing to the difference:

  • The nn4.small2.v1 is a model with fewer parameters compared to the inception_resnet_v1 model (don't know how much smaller though). The nn4.small2.v1 model for tensorflow can be found in the model directory, but I haven't used it for a while. Don't expect the performance to be super-impressive though.
  • The tensorflow checkpoint format is maybe not as size-efficient as the .t7 format. It would be interesting to compare the two formats for the same model.

So my best advice is to try the nn4.small2.v1 model in the facenet repo to see how it performs and what file size you get for the checkpoint file.

@mhaghighat
Contributor Author

Thank you for your reply, David.

The model being more than 20 times larger made me curious whether any redundant data is stored. So I checked the contents of the model file as follows:

import tensorflow as tf

# Restore the graph from the meta file and the latest checkpoint,
# then list the names of all trainable variables
sess = tf.Session()
new_saver = tf.train.import_meta_graph('model-20161116-234200.meta')
new_saver.restore(sess, tf.train.latest_checkpoint('./'))
all_vars = tf.trainable_variables()
for v in all_vars:
    print(v.name)

Below is a screenshot of part of the printed list. You can see several repetitions of the blocks and branches stored with extra _N suffixes. Do you think these might be the result of the variables being recreated and added to the same graph? Please refer to the answer in this StackOverflow post for a similar issue.

Thanks again for your time and support.

[screenshot of the printed variable list]

@davidsandberg
Owner

Ok, what you are seeing is just the structure of the model. A residual network (resnet) consists of a bunch of blocks which in tensorflow are created using slim.repeat. Check out models.inception_resnet_v1 to see how the model is created.
I still think you should compare the same model in Torch and Tensorflow to figure out why the size differs that much.
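For illustration (a toy sketch, not the actual facenet code), slim.repeat stacks the same layer under numbered sub-scopes, which is where the _N style suffixes in the variable names come from:

import tensorflow as tf
import tensorflow.contrib.slim as slim

# Three identical 3x3 convolutions created with a single slim.repeat call
inputs = tf.placeholder(tf.float32, [None, 32, 32, 3])
net = slim.repeat(inputs, 3, slim.conv2d, 64, [3, 3], scope='conv')

for v in tf.trainable_variables():
    print(v.name)
# -> conv/conv_1/weights:0, conv/conv_1/biases:0, conv/conv_2/weights:0, ...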

@mhaghighat
Contributor Author

Unfortunately, I don't have the FaceScrub and the CASIA-WebFace databases to train the nn4.small2.v1. I wonder if anyone has done it and can share the meta and ckpt files.
I will do it as soon as I can get the databases.

@davidsandberg
Owner

To check the size of the checkpoint you don't need to run any training. Just initialize the model and store the parameters.
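Roughly along these lines (a minimal sketch; the module path, input size, and inference signature are assumptions and may differ in the current repo):

import os
import tensorflow as tf
from models import nn4_small2_v1 as network  # import path is an assumption

# Build the inference graph on a dummy input, initialize, and save a checkpoint
images = tf.placeholder(tf.float32, [None, 96, 96, 3], name='input')
net = network.inference(images, keep_probability=1.0, phase_train=True,
                        weight_decay=0.0)

saver = tf.train.Saver(tf.trainable_variables())
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, './nn4_small2_v1_test.ckpt')

# Compare the resulting file sizes with the 31.5MB nn4.small2.v1.t7
for f in os.listdir('.'):
    if f.startswith('nn4_small2_v1_test.ckpt'):
        print(f, '%.1f MB' % (os.path.getsize(f) / 1e6))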

@crockpotveggies

How many parameters is the resnet model? The nn4.small2.v1 is approx 3.7 million parameters.
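For reference, one way to get a count from a restored graph (e.g. after tf.train.import_meta_graph as above) is a sketch like this:

import numpy as np
import tensorflow as tf

# Sum the number of elements over all trainable variables in the current graph
num_params = sum(np.prod(v.get_shape().as_list())
                 for v in tf.trainable_variables())
print('Trainable parameters: %d' % num_params)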

@mhaghighat
Contributor Author

mhaghighat commented Jan 28, 2017

@davidsandberg,

Following your advice, I tried to train with the nn4_small2_v1 model. But the current facenet_train_classifier.py gives an error when using this model. The error is:

ValueError: Variable conv1_7x7/weights already exists, disallowed. Did you mean to set reuse=True in VarScope?

Can you please advise?

@davidsandberg
Owner

This specific problem was fixed when the input pipeline was refactored, so you need to update your repo. But there is still a problem

tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'phase_train' with dtype bool

when global variables are initialized. I'm not sure why this problem happens but it has to do with batch normalization. It can be fixed by changing

phase_train_placeholder = tf.placeholder(tf.bool, name='phase_train')

to

phase_train_placeholder = tf.placeholder_with_default(tf.convert_to_tensor(True, dtype=tf.bool), shape=(), name='phase_train')

And then it seems to work fine.

@lodemo

lodemo commented Mar 2, 2017

I think I have an error related to the one above, but I can't resolve it with that solution.

I froze the 20170216-091149 model with the freeze_graph.py script and used it as in compare.py, only with a different loading routine for the frozen graph (resnet is the name under which it is loaded; see the loading sketch at the end of this comment).

images_placeholder = tf.get_default_graph().get_tensor_by_name("resnet/input:0")
embeddings = tf.get_default_graph().get_tensor_by_name("resnet/embeddings:0")
phase_train_placeholder = tf.placeholder_with_default(tf.convert_to_tensor(True, dtype=tf.bool), shape=(), name='resnet/phase_train')


feed_dict = { images_placeholder: imgs, phase_train_placeholder:False }
emb = session.run(embeddings, feed_dict=feed_dict)

results in the error

InvalidArgumentError: You must feed a value for placeholder tensor 'resnet/phase_train' with dtype bool [[Node: resnet/phase_train = Placeholder[dtype=DT_BOOL, shape=[], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Using the frozen graph with

phase_train_placeholder = tf.get_default_graph().get_tensor_by_name("resnet/phase_train:0")

results in

FailedPreconditionError: Attempting to use uninitialized value resnet/Bottleneck/BatchNorm/moving_variance [[Node: resnet/Bottleneck/BatchNorm/moving_variance/read = Identity[T=DT_FLOAT, _class=["loc:@resnet/Bottleneck/BatchNorm/moving_variance"], _device="/job:localhost/replica:0/task:0/cpu:0"](resnet/Bottleneck/BatchNorm/moving_variance)]]

Then again, using the unfrozen graph works normally.
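For reference, the loading routine for the frozen graph mentioned above is roughly this (a sketch; the .pb file name is just a placeholder):

import tensorflow as tf

# Read the frozen GraphDef from disk (file name is a placeholder)
with tf.gfile.GFile('20170216-091149-frozen.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# Import it under the 'resnet' name scope, so tensors become 'resnet/input:0', etc.
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='resnet')

session = tf.Session(graph=graph)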

@ugtony

ugtony commented Mar 2, 2017

@lodemo
You can check #161 to see how to use a frozen model. The batchNorm error was discussed and solved there.

@lodemo

lodemo commented Mar 2, 2017

Thanks, but I am already using the latest revision, which should include the bug fix, and the loading routine is the same as discussed in #161. (The only difference is name='resnet'; I can try without it.)

For reference, older frozen models (20170117-215115) run fine.

@ugtony

ugtony commented Mar 2, 2017

@lodemo ,
Sorry for misinterpreting your question.
I think the error occurs because freeze_graph.py does not include the newly added Bottleneck layer in the whitelist.

@lodemo

lodemo commented Mar 2, 2017

@ugtony
Thank you for the suggestion. I added the Bottleneck layer to the whitelist, and that seems to have resolved my issue with the latest model. I don't know if it is entirely correct, though.

I added Bottleneck to the condition like this:

if node.name.startswith('InceptionResnetV1') or node.name.startswith('embeddings') or node.name.startswith('phase_train') or node.name.startswith('Bottleneck'):

@ugtony

ugtony commented Mar 3, 2017

Good to know that it helped.
I think @davidsandberg should know about this patch.
