[Example] BAAI/bge-large-en-v1.5 & BAAI/bge-base-en-v1.5 examples add (#619)

* initialize the bge-large

* initialize the bge_large

* the bge_base inferences pass
Zhenzhong1 authored Nov 3, 2023
1 parent 8543e2f commit c399e38
Showing 7 changed files with 1,162 additions and 0 deletions.
@@ -0,0 +1,107 @@
Step-by-Step
=======
This document describes the end-to-end workflow for the Hugging Face model [BERT Base Uncased](https://huggingface.co/textattack/bert-base-uncased-MRPC) with the Neural Engine backend.

# Prerequisite
## Prepare Python Environment
Create a Python environment, optionally with `autoconf` for jemalloc support.
```shell
conda create -n <env name> python=3.8 [autoconf]
conda activate <env name>
```

Check that the `gcc` version is higher than 9.0.
```shell
gcc -v
```

Install Intel® Extension for Transformers; for details, please refer to [installation](/docs/installation.md).
```shell
# Install from pypi
pip install intel-extension-for-transformers

# Or, install from source code
cd <intel_extension_for_transformers_folder>
pip install -r requirements.txt
pip install -v .
```

Install the required dependencies for this example:
```shell
cd <intel_extension_for_transformers_folder>/examples/huggingface/pytorch/text-classification/deployment/mrpc/bert_base
pip install -r requirements.txt
```
>**Note**: We recommend installing protobuf <= 3.20.0 if you use onnxruntime <= 1.11.
## Environment Variables (Optional)
```shell
# Preloading libjemalloc.so may improve performance when running inference with multiple instances.
conda install jemalloc==5.2.1 -c conda-forge -y
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libjemalloc.so

# Weight sharing can save memory and may improve performance when running multiple instances.
export WEIGHT_SHARING=1
export INST_NUM=<inst num>
```
>**Note**: This step is optional.
# Inference Pipeline

Neural Engine can parse both ONNX models and Neural Engine IR.
We provide three modes: `accuracy`, `throughput`, and `latency`. In throughput mode, we use multiple instances with 4 cores per instance, occupying one socket.
You can run fp32 model inference by setting `precision=fp32`:
```shell
bash run_bert_base.sh --model=textattack/bert-base-uncased-MRPC --dataset=mrpc --precision=fp32 --mode=throughput
```
Set `precision=int8` to get a post-training quantized (PTQ) int8 model, or `precision=bf16` to get a bf16 model.
```shell
bash run_bert_base.sh --model=textattack/bert-base-uncased-MRPC --dataset=mrpc --precision=int8 --mode=throughput
```
Set `precision=dynamic_int8` to benchmark the dynamically quantized int8 model.
```shell
bash run_bert_base.sh --model=textattack/bert-base-uncased-MRPC --dataset=mrpc --precision=dynamic_int8 --mode=throughput
```


You can also compile the model to Neural Engine IR using the Python API:
```python
from intel_extension_for_transformers.llm.runtime.deprecated.compile import compile
graph = compile('./model_and_tokenizer/int8-model.onnx')
graph.save('./ir')
```
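
Once compiled, the graph can be fed directly with NumPy inputs. Below is a minimal sketch, assuming the model expects the three int32 tensors produced by the dataloader in this example (`input_ids`, `token_type_ids`, `attention_mask`); the dummy values are for illustration only:
```python
import numpy as np
from intel_extension_for_transformers.llm.runtime.deprecated.compile import compile

graph = compile('./model_and_tokenizer/int8-model.onnx')

# Dummy int32 inputs shaped [batch_size, seq_len]; real inputs come from the tokenizer.
batch_size, seq_len = 1, 128
input_ids = np.ones((batch_size, seq_len), dtype=np.int32)
token_type_ids = np.zeros((batch_size, seq_len), dtype=np.int32)
attention_mask = np.ones((batch_size, seq_len), dtype=np.int32)

# inference() returns a dict of output tensors; the first value holds the logits.
outputs = graph.inference([input_ids, token_type_ids, attention_mask])
logits = list(outputs.values())[0]
print(logits.shape)
```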

# Benchmark
If you want to run local ONNX model inference, we provide both a Python API and a C++ API. To use the C++ API, you need to convert the model to Neural Engine IR first.

By setting `--dynamic_quantize` for the FP32 model, you can benchmark the dynamically quantized int8 model.
## Accuracy
Python API command:
```shell
GLOG_minloglevel=2 python run_executor.py --input_model=./model_and_tokenizer/int8-model.onnx --tokenizer_dir=./model_and_tokenizer --mode=accuracy --dataset_name=glue --task_name=mrpc --batch_size=8
```

If you just want a quick start, you can try a small subset of the dataset:
```shell
GLOG_minloglevel=2 python run_executor.py --input_model=./model_and_tokenizer/int8-model.onnx --tokenizer_dir=./model_and_tokenizer --mode=accuracy --dataset_name=glue --task_name=mrpc --batch_size=8 --max_eval_samples=10
```

>**Note**: Accuracy measured on a partial dataset is not representative.
## Performance
Python API command:
```shell
GLOG_minloglevel=2 python run_executor.py --input_model=./model_and_tokenizer/int8-model.onnx --mode=performance --dataset_name=glue --task_name=mrpc --batch_size=1 --seq_len=128
```

You can use the C++ API as well. First compile the model to IR, and then run the C++ executable.

> **Note**: The warmup count below is recommended to be 1/10 of the iteration count and no less than 3.
```shell
export GLOG_minloglevel=2
export OMP_NUM_THREADS=<cpu_cores>
export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX
export UNIFIED_BUFFER=1
numactl -C 0-<cpu_cores-1> neural_engine \
--batch_size=<batch_size> --iterations=<iterations> --w=<warmup> \
--seq_len=128 --config=./ir/conf.yaml --weight=./ir/model.bin
```
@@ -0,0 +1,53 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import math
import numpy as np
from transformers import AutoTokenizer
from datasets import load_dataset

class DataLoader(object):
    def __init__(self, batch_size, seq_len, dataset_name, task_name, data_dir, tokenizer_dir):
        self.batch_size = batch_size
        dataset = load_dataset(dataset_name, task_name, cache_dir=data_dir, split='validation')
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
        # Tokenize the MRPC sentence pairs once up front so batches can be sliced directly.
        self.dataset = dataset.map(lambda e: tokenizer(e['sentence1'], e['sentence2'],
                                   truncation=True, padding='max_length', max_length=seq_len), batched=True)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        # The last batch may contain fewer samples than batch_size.
        if end > len(self.dataset):
            input_ids_data = self.dataset[start:]['input_ids']
            segment_ids_data = self.dataset[start:]['token_type_ids']
            input_mask_data = self.dataset[start:]['attention_mask']
            label_data = self.dataset[start:]['label']
        else:
            input_ids_data = self.dataset[start:end]['input_ids']
            segment_ids_data = self.dataset[start:end]['token_type_ids']
            input_mask_data = self.dataset[start:end]['attention_mask']
            label_data = self.dataset[start:end]['label']

        sample_size = len(input_ids_data) if isinstance(input_ids_data, list) else 1

        return [np.array(input_ids_data).reshape(sample_size, -1).astype('int32'),
                np.array(segment_ids_data).reshape(sample_size, -1).astype('int32'),
                np.array(input_mask_data).reshape(sample_size, -1).astype('int32')], \
               np.array(label_data).reshape(sample_size, -1).astype('int32')

    def __len__(self):
        return math.ceil(len(self.dataset) / self.batch_size)
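
For reference, a minimal usage sketch for this DataLoader, assuming the GLUE MRPC setup from the README (the cache and tokenizer paths below are illustrative):
```python
# Illustrative paths and sizes; adjust to your environment.
dataloader = DataLoader(batch_size=8,
                        seq_len=128,
                        dataset_name='glue',
                        task_name='mrpc',
                        data_dir='./data',
                        tokenizer_dir='./model_and_tokenizer')

inputs, labels = dataloader[0]        # first batch
print(len(dataloader))                # number of batches
print([x.shape for x in inputs])      # [input_ids, token_type_ids, attention_mask], each (batch_size, seq_len)
print(labels.shape)                   # (batch_size, 1)
```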
@@ -0,0 +1,75 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import time
import numpy as np
from tqdm import tqdm
from datasets import load_metric
from executor_dataloader import DataLoader
import sys
import os

common_dir = os.path.join(sys.path[0], "../../../../neural_engine_utils/")
sys.path.append(common_dir)
from common import (log, DummyDataLoader, compute_performance, Neural_Engine_base)


class Neural_Engine(Neural_Engine_base):

    def accuracy(self, batch_size, seq_len, dataset_name, task_name, data_dir, tokenizer_dir):
        # load dataset
        log.info("Load dataset ......")
        dataset = DataLoader(batch_size, seq_len, dataset_name, task_name, data_dir, tokenizer_dir)
        # load metric
        log.info("Load metric ......")
        if dataset_name is not None and task_name is not None:
            metric = load_metric(dataset_name, task_name)
        else:
            metric = load_metric("accuracy")
        # execute
        log.info("Start engine ......")
        for idx in tqdm(range(len(dataset))):
            inputs = dataset[idx][0]
            labels = dataset[idx][1]
            predictions = self.graph.inference(inputs)
            predictions = list(predictions.values())[0]
            predictions = np.argmax(predictions, axis=1)
            metric.add_batch(
                predictions=predictions,
                references=labels,
            )
        # compute metrics
        log.info("Compute metrics ......")
        eval_metric = metric.compute()
        accuracy_value = eval_metric.get("accuracy")
        f1_value = eval_metric.get("f1")
        log.info(f"Accuracy: {accuracy_value}")
        log.info(f"F1: {f1_value}")

    def performance(self, batch_size, seq_len, iteration, warm_up):
        if warm_up >= iteration:
            log.error("Warm up should be less than iteration.")
            raise ValueError()
        # generate dummy dataset
        log.info("Generate dummy dataset ......")
        shape = [batch_size, seq_len]
        dataset = DummyDataLoader(shapes=[shape, shape, shape],
                                  lows=[1, 1, 1],
                                  highs=[128, 1, 1],
                                  dtypes=['int32', 'int32', 'int32'],
                                  iteration=iteration)
        compute_performance(dataset, self.graph, log, self.log_file, warm_up, batch_size, seq_len)
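
A hypothetical driver sketch for this class, assuming the `Neural_Engine_base` constructor takes the ONNX/IR model path and a log file name (treat the constructor arguments as an assumption); the argument values mirror the README commands:
```python
# Hypothetical driver; the constructor signature is assumed from the shared
# Neural_Engine_base helper and may differ in your version.
engine = Neural_Engine("./model_and_tokenizer/int8-model.onnx", "benchmark.log")

# Accuracy on GLUE MRPC, using the tokenizer shipped next to the model.
engine.accuracy(batch_size=8, seq_len=128, dataset_name="glue", task_name="mrpc",
                data_dir=None, tokenizer_dir="./model_and_tokenizer")

# Throughput measurement on dummy data: 10 iterations, 3 of them warm-up.
engine.performance(batch_size=1, seq_len=128, iteration=10, warm_up=3)
```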
@@ -0,0 +1,10 @@
neural-compressor
transformers
accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1
