Added NetVLAD model

openvinotoolkit · Jan 26, 2021 · 978ad82
1 parent 59e413f
commit 978ad82

Showing 9 changed files with 233 additions and 0 deletions.
diff --git a/data/dataset_definitions.yml b/data/dataset_definitions.yml
@@ -1142,3 +1142,10 @@ datasets:
       prefix: duration_prediction
       input_suffix: in
       reference_suffix: out
+
+  - name: pitts30k_val
+    data_source: pitts250k
+    reader: pillow_imread
+    annotation_conversion:
+      converter: place_recognition
+      split_file: pitts250k/datasets/pitts30k_val.mat
diff --git a/models/public/index.md b/models/public/index.md
@@ -241,6 +241,14 @@ The task of image translation is to generate the output based on exemplar.
 | -----------| -------------- | ---------------------------------- | -------- | --------- | -------- |
 | CoCosNet   | PyTorch\*      | [cocosnet](./cocosnet/cocosnet.md) | 12.93dB  | 1080.7032 | 167.9141 |
 
+## Place Recognition
+
+The task of place recognition is to quickly and accurately recognize the location of a given query photograph.
+
+| Model Name | Implementation | OMZ Model Name                  | Accuracy | GFlops | mParams |
+| ---------- | ---------------| --------------------------------| -------- | ------ | ------- |
+| NetVLAD    | TensorFlow\*   | [netvlad](./netvlad/netvlad.md) | 82.0321% | 36.6374| 149.0021|
+
 ## Legal Information
 
 [*] Other names and brands may be claimed as the property of others.
diff --git a/models/public/netvlad/accuracy-check.yml b/models/public/netvlad/accuracy-check.yml
@@ -0,0 +1,16 @@
+models:
+  - name: netvlad
+    launchers:
+      - framework: dlsdk
+        adapter: reid
+    datasets:
+      - name: pitts30k_val
+
+        preprocessing:
+          - type: rgb_to_bgr
+          - type: resize
+            dst_height: 200
+            dst_width: 300
+
+        metrics:
+          - type: localization_recall
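Note on `localization_recall`: the metric is only named in this config, not defined in the diff. As a rough illustration, place-recognition recall is commonly computed as the share of queries for which at least one of the top-N retrieved database images lies within a distance threshold of the query location. The sketch below assumes that definition. The function names, the 25 m default threshold, and the brute-force ranking are hypothetical and may differ from what the accuracy checker actually does.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recall_at_n(query_desc, query_pos, db_desc, db_pos, top_n=1, max_dist_m=25.0):
    """Fraction of queries localized correctly (hypothetical helper).

    A query counts as a hit when any of its top_n nearest database
    descriptors belongs to an image taken within max_dist_m of the query.
    The 25 m default is an assumption, not a value taken from this commit.
    """
    hits = 0
    for q_desc, q_pos in zip(query_desc, query_pos):
        # Rank database images by descriptor distance to the query.
        ranked = sorted(range(len(db_desc)),
                        key=lambda i: l2_distance(q_desc, db_desc[i]))
        # Check the geographic distance of the best top_n candidates.
        if any(l2_distance(q_pos, db_pos[i]) <= max_dist_m for i in ranked[:top_n]):
            hits += 1
    return hits / len(query_desc)


# Toy data: 2-D descriptors and 2-D positions in metres.
db_desc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
db_pos = [[0.0, 0.0], [100.0, 0.0], [200.0, 0.0]]
query_desc = [[0.9, 0.1], [0.1, 0.9]]
query_pos = [[5.0, 0.0], [190.0, 0.0]]

print(recall_at_n(query_desc, query_pos, db_desc, db_pos, top_n=1))
print(recall_at_n(query_desc, query_pos, db_desc, db_pos, top_n=3))
```

In the toy data the first query retrieves a nearby image at rank 1, while the second only finds one at rank 3, so recall grows as `top_n` increases.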
diff --git a/models/public/netvlad/model.yml b/models/public/netvlad/model.yml
@@ -0,0 +1,54 @@
+# Copyright (c) 2021 Intel Corporation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+description: >-
+  NetVLAD is a CNN architecture that tackles the problem of large-scale visual place
+  recognition. The architecture uses VGG-16 as the base network, followed by NetVLAD, a new
+  trainable generalized VLAD (Vector of Locally Aggregated Descriptors) layer. It is a place
+  recognition model pretrained on the Pittsburgh <http://www.ok.ctrl.titech.ac.jp/~torii/project/repttile/>
+  dataset.
+
+  The model input is a blob that consists of a single image of shape "1x200x300x3" in RGB
+  order.
+
+  The model output is a descriptor vector of shape "1x4096" that is used as the image representation.
+
+  For details see repository <https://github.com/uzh-rpg/netvlad_tf_open> and paper
+  <https://arxiv.org/pdf/1511.07247.pdf>.
+task_type: place_recognition
+files:
+  - name: netvlad.zip
+    size: 1108966217
+    sha256: a6849eb7e2f9236c8ba87b89c1cf6ce97142296ce71683d8fc843f0569c022ea
+    source: http://rpg.ifi.uzh.ch/datasets/netvlad/vd16_pitts30k_conv5_3_vlad_preL2_intra_white.zip
+  - name: netvlad_tf/layers.py
+    size: 1492
+    sha256: 701fd91892d3ca71316504c088c33c47e7bcd6a091f3157171ed3a0caf1f07b2
+    source: https://github.com/uzh-rpg/netvlad_tf_open/raw/abe37fe9d656bf781cff32caf738efca525b7889/python/netvlad_tf/layers.py
+  - name: netvlad_tf/nets.py
+    size: 2613
+    sha256: c3baa73bd57ac2e83cd24ab8332af93dd66a7b7b950ad6435222c5f4e3b937b4
+    source: https://github.com/uzh-rpg/netvlad_tf_open/raw/abe37fe9d656bf781cff32caf738efca525b7889/python/netvlad_tf/nets.py
+postprocessing:
+  - $type: unpack_archive
+    format: zip
+    file: netvlad.zip
+model_optimizer_args:
+  - --reverse_input_channels
+  - --input_shape=[1,200,300,3]
+  - --input=Placeholder
+  - --output=vgg16_netvlad_pca/l2_normalize_1
+  - --input_model=$conv_dir/model_frozen.pb
+framework: tf
+license: https://raw.githubusercontent.com/uzh-rpg/netvlad_tf_open/master/LICENSE
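Note on the `files` entries: each one pairs a download with an expected `size` and `sha256`. The downloader performs its own verification; the sketch below only illustrates what such a check amounts to. `matches_manifest` is a hypothetical helper, not part of the downloader.

```python
import hashlib
from pathlib import Path


def matches_manifest(path, expected_size, expected_sha256, chunk_size=1 << 20):
    """Return True if the file has the expected byte size and SHA-256 digest.

    Hypothetical helper for illustration; the real downloader has its own logic.
    """
    path = Path(path)
    # Cheap check first: a size mismatch makes hashing unnecessary.
    if path.stat().st_size != expected_size:
        return False
    digest = hashlib.sha256()
    with path.open('rb') as f:
        # Read in chunks so that a large archive is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

For example, the entry for `netvlad_tf/layers.py` would be checked by passing the downloaded file path together with `1492` and the `sha256` string listed in `model.yml`.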
diff --git a/models/public/netvlad/netvlad.md b/models/public/netvlad/netvlad.md
@@ -0,0 +1,94 @@
+# netvlad
+
+## Use Case and High-Level Description
+
+NetVLAD is a CNN architecture that tackles the problem of large-scale visual place recognition. The architecture uses VGG-16 as the base network, followed by NetVLAD, a new trainable generalized VLAD (Vector of Locally Aggregated Descriptors) layer. It is a place recognition model pretrained on the [Pittsburgh](http://www.ok.ctrl.titech.ac.jp/~torii/project/repttile/) dataset.
+
+The model input is a blob that consists of a single image of shape "1x200x300x3" in RGB order.
+
+The model output is a descriptor vector of shape "1x4096" that is used as the image representation.
+
+For details see [repository](https://github.com/uzh-rpg/netvlad_tf_open) and [paper](https://arxiv.org/pdf/1511.07247.pdf).
+
+## Specification
+
+| Metric            | Value             |
+|-------------------|-------------------|
+| Type              | Place recognition |
+| GFLOPs            | 36.6374           |
+| MParams           | 149.0021          |
+| Source framework  | TensorFlow\*      |
+
+## Accuracy
+
+Accuracy metrics are obtained on the validation split of Pitts30k, a smaller subset of the Pittsburgh (Pitts250k) dataset that contains 10k database images in each set (train/test/validation). Images were resized to the input size.
+
+| Metric              | Value   |
+| ------------------- | ------- |
+| localization_recall | 82.0321%|
+
+## Input
+
+### Original model
+
+Image, name - `Placeholder`,  shape - `1,200,300,3`, format is `B,H,W,C` where:
+
+- `B` - batch size
+- `C` - channel
+- `H` - height
+- `W` - width
+
+Channel order is `RGB`.
+
+### Converted model
+
+Image, name - `Placeholder`,  shape - `1,3,200,300`, format is `B,C,H,W` where:
+
+- `B` - batch size
+- `C` - channel
+- `H` - height
+- `W` - width
+
+Channel order is `BGR`.
+
+## Output
+
+### Original model
+
+Floating point embeddings, name - `vgg16_netvlad_pca/l2_normalize_1`,  shape - `1,4096`, output data format  - `B,C`, where:
+
+- `B` - batch size
+- `C` - vector of 4096 floating-point values, the image descriptor
+
+### Converted model
+
+The converted model has the same parameters as the original model.
+
+## Legal Information
+
+The original model is distributed under
+[MIT license](https://raw.githubusercontent.com/uzh-rpg/netvlad_tf_open/master/LICENSE):
+
+```
+MIT License
+
+Copyright (c) 2018 Robotics and Perception Group
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+```
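Note on the Input section of `netvlad.md`: the original model takes `B,H,W,C` data in `RGB` order, while the converted model takes `B,C,H,W` data in `BGR` order. A minimal sketch of that rearrangement on plain nested lists follows. `nhwc_rgb_to_nchw_bgr` is a hypothetical helper written for illustration; real pipelines would normally do this with an array library.

```python
def nhwc_rgb_to_nchw_bgr(batch):
    """Convert batch[b][h][w][c] (RGB) into out[b][c][h][w] (BGR).

    Hypothetical helper: it only illustrates the layout and channel-order
    difference between the original and the converted model inputs.
    """
    converted = []
    for image in batch:
        planes = []
        # Source channel indices 2, 1, 0 pick B, G, R out of an R, G, B pixel.
        for c in (2, 1, 0):
            planes.append([[pixel[c] for pixel in row] for row in image])
        converted.append(planes)
    return converted


# One image, 1 row x 2 columns, each pixel given as (R, G, B).
tiny = [[[[1, 2, 3], [4, 5, 6]]]]
print(nhwc_rgb_to_nchw_bgr(tiny))  # [[[[3, 6]], [[2, 5]], [[1, 4]]]]
```

The result has one plane per channel, with the blue plane first, which matches the `1,3,200,300` `BGR` shape documented for the converted model when applied to a full-size image.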
diff --git a/models/public/netvlad/pre-convert.py b/models/public/netvlad/pre-convert.py
@@ -0,0 +1,51 @@
+#!/usr/bin/env python3
+
+# Copyright (c) 2021 Intel Corporation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import importlib
+import sys
+
+from pathlib import Path
+
+import tensorflow.compat.v1 as tf
+
+NETWORK_NAME = 'vd16_pitts30k_conv5_3_vlad_preL2_intra_white'
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument('input_dir', type=Path)
+    parser.add_argument('output_dir', type=Path)
+    args = parser.parse_args()
+
+    sys.path.append(str(args.input_dir))
+    nets = importlib.import_module('netvlad_tf.nets')
+
+    tf.reset_default_graph()
+    image_batch = tf.placeholder(dtype=tf.float32, shape=[None, None, None, 3])
+    _net_out = nets.vgg16NetvladPca(image_batch)
+    saver = tf.train.Saver()
+
+    sess = tf.Session()
+    saver.restore(sess, str(args.input_dir / NETWORK_NAME / NETWORK_NAME))
+    outputs = ['vgg16_netvlad_pca/l2_normalize_1']
+    graph_def_freezed = tf.graph_util.convert_variables_to_constants(sess, sess.graph.as_graph_def(), outputs)
+
+    tf.io.write_graph(graph_def_freezed, str(args.output_dir), 'model_frozen.pb',
+                      as_text=False)
+
+
+if __name__ == '__main__':
+    main()
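Note on `tf.io.write_graph` in `pre-convert.py`: to the best of my understanding, the function forms its destination by joining `logdir` and `name`, so `name` is normally a bare file name. The sketch below mimics only that join with the standard library, under that assumption; `graph_destination` is a hypothetical stand-in and does not call TensorFlow.

```python
import posixpath


def graph_destination(logdir, name):
    """Mimic the assumed destination path: logdir joined with name.

    Hypothetical stand-in for illustration; POSIX-style paths are used so
    that the result is the same on every platform.
    """
    return posixpath.join(logdir, name)


# A bare file name lands directly inside logdir.
print(graph_destination('out', 'model_frozen.pb'))           # out/model_frozen.pb

# A relative name that already contains logdir repeats the directory.
print(graph_destination('out', 'out/model_frozen.pb'))       # out/out/model_frozen.pb

# An absolute name replaces logdir entirely, so the result is unchanged.
print(graph_destination('/abs/out', '/abs/out/model_frozen.pb'))  # /abs/out/model_frozen.pb
```

Under this assumption, passing a bare file name as `name` behaves the same for relative and absolute output directories.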
diff --git a/tools/accuracy_checker/configs/netvlad.yml b/tools/accuracy_checker/configs/netvlad.yml
@@ -0,0 +1 @@
+../../../models/public/netvlad/accuracy-check.yml
diff --git a/tools/downloader/README.md b/tools/downloader/README.md
@@ -440,6 +440,7 @@ describing a single model. Each such object has the following keys:
   * `monocular_depth_estimation`
   * `object_attributes`
   * `optical_character_recognition`
+  * `place_recognition`
   * `question_answering`
   * `semantic_segmentation`
   * `sound_classification`

diff --git a/tools/downloader/common.py b/tools/downloader/common.py
@@ -67,6 +67,7 @@
     'monocular_depth_estimation',
     'object_attributes',
     'optical_character_recognition',
+    'place_recognition',
     'question_answering',
     'semantic_segmentation',
     'sound_classification',