Skip to content

Latest commit

 

History

History
104 lines (67 loc) · 4.35 KB

File metadata and controls

104 lines (67 loc) · 4.35 KB

vitstr-small-patch16-224

Use-case and high-level description

The vitstr-small-patch16-224 model is small version of the ViTSTR models. ViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform Scene Text Recognition (ViTSTR). Small version of model has an embedding size of 384 and number of heads of 6. Model is able to recognize alphanumeric case sensitive text and special characters.

More details provided in the paper and repository.

Specification

Metric Value
Type Scene Text Recognition
GFLOPs 9.1544
MParams 21.5061
Source framework PyTorch*

Accuracy

Alphanumeric subset of common scene text recognition benchmarks are used. For your convenience you can see dataset size. Note, that we use here ICDAR15 alphanumeric subset without irregular (arbitrary oriented, perspective or curved) texts. See details here, section 4.1. All reported results are achieved without using any lexicon.

Dataset Accuracy Dataset size
ICDAR-03 93.43% 867
ICDAR-13 90.34% 1015
ICDAR-15 75.04% 1811
SVT 85.47% 647
IIIT5K 87.07% 3000

Use accuracy_check [...] --model_attributes <path_to_folder_with_downloaded_model> to specify the path to additional model attributes. path_to_folder_with_downloaded_model is a path to the folder, where the current model is downloaded by Model Downloader tool.

Input

Original model

Image, name: image, shape: 1, 1, 224, 224 in the format B, C, H, W, where:

  • B - batch size
  • C - number of channels
  • H - image height
  • W - image width

Note that the source image should be tight aligned crop with detected text converted to grayscale.

Scale values - [255].

Converted model

Image, name: image, shape: 1, 1, 224, 224 in the format B, C, H, W, where:

  • B - batch size
  • C - number of channels
  • H - image height
  • W - image width

Note that the source image should be tight aligned crop with detected text converted to grayscale.

Output

Original model

Output tensor, name: logits, shape: 1, 25, 96 in the format B, W, L, where:

  • B - batch size
  • W - output sequence length
  • L - confidence distribution across [GO] - special start token for decoder, [s] - special end of sequence character for decoder and characters, listed in enclosed file vocab.txt.

The network output decoding process is pretty easy: get the argmax on L dimension, transform indices to letters and slice the resulting phrase on the first entry of end-of-sequence symbol.

Converted model

Output tensor, name: logits, shape: 1, 25, 96 in the format B, W, L, where:

  • B - batch size
  • W - output sequence length
  • L - confidence distribution across [GO] - special start token for decoder, [s] - special end of sequence character for decoder and characters, listed in enclosed file vocab.txt.

The network output decoding process is pretty easy: get the argmax on L dimension, transform indices to letters and slice the resulting phrase on the first entry of end-of-sequence symbol.

Download a Model and Convert it into OpenVINO™ IR Format

You can download models and if necessary convert them into OpenVINO™ IR format using the Model Downloader and other automation tools as shown in the examples below.

An example of using the Model Downloader:

omz_downloader --name <model_name>

An example of using the Model Converter:

omz_converter --name <model_name>

Demo usage

The model can be used in the following demos provided by the Open Model Zoo to show its capabilities:

Legal Information

The original model is distributed under the Apache License, Version 2.0. A copy of the license is provided in <omz_dir>/models/public/licenses/APACHE-2.0.txt.