
NDArray.set() fails on linux with " Inplace update to inference tensor outside InferenceMode" #1774

Closed
demq opened this issue Jul 6, 2022 · 7 comments
Labels
bug Something isn't working

Comments

demq (Contributor) commented Jul 6, 2022

Description

NDArray.set() fails when trying to update tensors in a post-processing stage, with the message:

Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details.

This behavior is observed when running the code with the PyTorch engine on a Linux machine; the same code runs without any errors on a Mac M1 Pro. The workaround is to first duplicate the tensor by calling NDArray.duplicate() and performing the set() on the new tensor.

Expected Behavior

The PyTorch implementation of NDArray should either duplicate the tensor when it is modified outside of InferenceMode, or such tensors should be made explicitly immutable.

Error Message

Exception in thread "main" ai.djl.translate.TranslateException: ai.djl.engine.EngineException: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details.
Caused by: ai.djl.engine.EngineException: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details.
at ai.djl.pytorch.jni.PyTorchLibrary.torchMaskedPut(Native Method)
at ai.djl.pytorch.jni.JniUtils.booleanMaskSet(JniUtils.java:416)
at ai.djl.pytorch.engine.PtNDArrayIndexer.set(PtNDArrayIndexer.java:82)
at ai.djl.ndarray.index.NDArrayIndexer.set(NDArrayIndexer.java:157)
at ai.djl.ndarray.NDArray.set(NDArray.java:469)
at ai.djl.ndarray.NDArray.set(NDArray.java:490)
at processOutput(PtBertQATranslator.java:116)

How to Reproduce?

Create a custom QATranslator, override the processOutput() method like

   public List<QAResult> processOutput(TranslatorContext ctx, NDList list) {
        NDManager manager = ctx.getNDManager();
        NDArray start_logits = list.get(0);
        boolean[] bad_tokens_mask = new boolean[128];
        NDArray nd_bad_tokens_mask = manager.create(bad_tokens_mask);
        start_logits.set(nd_bad_tokens_mask, -10000.); // fails here on Linux
        // ... remainder of post-processing
    }

Steps to reproduce


Create a QA predictor from a model that uses the custom translator and the PyTorch engine, then run predictor.predict() on a Linux machine.

What have you tried to solve it?

Making a duplicate of the output tensor resolves the issue: NDArray startLogits = list.get(0).duplicate();
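Editor's note: the workaround can be sketched as below. This is a minimal illustration, assuming DJL 0.18.0 with the PyTorch engine on the classpath; the class and method names are hypothetical, only duplicate()/set() come from the thread.

```java
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.ndarray.NDManager;
import ai.djl.translate.TranslatorContext;

// Sketch of the workaround: duplicate the inference tensor before mutating it,
// so the in-place set() runs on a normal (non-inference) tensor.
public class WorkaroundSketch {
    public static NDArray maskLogits(TranslatorContext ctx, NDList list) {
        NDManager manager = ctx.getNDManager();
        // duplicate() copies the data into a new tensor that was not created
        // under c10::InferenceMode, so it can be updated in place
        NDArray startLogits = list.get(0).duplicate();
        boolean[] badTokensMask = new boolean[128]; // true = token to suppress
        NDArray ndBadTokensMask = manager.create(badTokensMask);
        startLogits.set(ndBadTokensMask, -10000.); // no longer throws on Linux
        return startLogits;
    }
}
```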

Environment Info

Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:

./gradlew debugEnv
Starting a Gradle Daemon (subsequent builds will be faster)

> Task :integration:debugEnv
[DEBUG] - Registering EngineProvider: XGBoost
[DEBUG] - Registering EngineProvider: MXNet
[DEBUG] - Registering EngineProvider: PyTorch
[DEBUG] - Registering EngineProvider: TensorFlow
[DEBUG] - Found default engine: MXNet
----------- System Properties -----------
java.specification.version: 17
sun.jnu.encoding: UTF-8
java.class.path: /mnt/ssd4tb/user/Software/djl/integration/build/classes/java/main:/mnt/ssd4tb/user/Software/djl/integration/build/resources/main:/home/user/.gradle/caches/modules-2/files-2.1/commons-cli/commons-cli/1.5.0/dc98be5d5390230684a092589d70ea76a147925c/commons-cli-1.5.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-slf4j-impl/2.17.2/183f7c95fc981f3e97d008b363341343508848e/log4j-slf4j-impl-2.17.2.jar:/mnt/ssd4tb/user/Software/djl/basicdataset/build/libs/basicdataset-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/model-zoo/build/libs/model-zoo-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/testing/build/libs/testing-0.18.0-SNAPSHOT.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.testng/testng/7.5/1416a607fae667c14e390b484e8d02b5824c0674/testng-7.5.jar:/mnt/ssd4tb/user/Software/djl/engines/mxnet/mxnet-model-zoo/build/libs/mxnet-model-zoo-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/pytorch/pytorch-model-zoo/build/libs/pytorch-model-zoo-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/pytorch/pytorch-jni/build/libs/pytorch-jni-1.11.0-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/tensorflow/tensorflow-model-zoo/build/libs/tensorflow-model-zoo-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/ml/xgboost/build/libs/xgboost-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/mxnet/mxnet-engine/build/libs/mxnet-engine-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/pytorch/pytorch-engine/build/libs/pytorch-engine-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/tensorflow/tensorflow-engine/build/libs/tensorflow-engine-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/api/build/libs/api-0.18.0-SNAPSHOT.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.slf4j/slf4j-api/1.7.36/6c62681a2f655b49963a5983b8b0950a6120ae14/slf4j-api-1.7.36.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-core/2.17.2/fa43ba4467f5300b16d1e0742934
149bfc5ac564/log4j-core-2.17.2.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-api/2.17.2/f42d6afa111b4dec5d2aea0fe2197240749a4ea6/log4j-api-2.17.2.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-csv/1.9.0/b59d8f64cd0b83ee1c04ff1748de2504457018c1/commons-csv-1.9.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/com.google.code.findbugs/jsr305/3.0.1/f7be08ec23c21485b9b5a1cf1654c2ec8c58168d/jsr305-3.0.1.jar:/home/user/.gradle/caches/modules-2/files-2.1/com.beust/jcommander/1.78/a3927de9bd6f351429bcf763712c9890629d8f51/jcommander-1.78.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.webjars/jquery/3.5.1/2392938e374f561c27c53872bdc9b6b351b6ba34/jquery-3.5.1.jar:/home/user/.gradle/caches/modules-2/files-2.1/ml.dmlc/xgboost4j_2.12/1.6.0/4623e78f614c998b4600c1cc58441ce06d80ba49/xgboost4j_2.12-1.6.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/commons-logging/commons-logging/1.2/4bfc12adfe4842bf07b657f0369c4cb522955686/commons-logging-1.2.jar:/home/user/.gradle/caches/modules-2/files-2.1/com.google.code.gson/gson/2.9.0/8a1167e089096758b49f9b34066ef98b2f4b37aa/gson-2.9.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.11.0/27770efb6329f092f895c7329662d1aa8ee8c0ac/jna-5.11.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-compress/1.21/4ec95b60d4e86b5c95a0e919cb172a0af98011ef/commons-compress-1.21.jar:/mnt/ssd4tb/user/Software/djl/engines/tensorflow/tensorflow-api/build/libs/tensorflow-api-0.18.0-SNAPSHOT.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.tensorflow/tensorflow-core-api/0.4.0/2ac35ca087607cce0e5419953cc1ef0c3a5edaea/tensorflow-core-api-0.4.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.bytedeco/javacpp/1.5.6/1f18a820aadd943577b0b372554f9e35e1232e25/javacpp-1.5.6.jar:/home/user/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protobuf-java/3.19.2/e958ce38f96b612d3819ff1c753d4d70609aea74/protobuf-java-3.19.2.jar:/home/
user/.gradle/caches/modules-2/files-2.1/org.tensorflow/ndarray/0.3.3/1b6d8cc3e3762f6e465b884580d9fc17ab7aeb4/ndarray-0.3.3.jar
java.vm.vendor: Red Hat, Inc.
sun.arch.data.model: 64
user.variant: 
java.vendor.url: https://www.redhat.com/
java.vm.specification.version: 17
os.name: Linux
sun.java.launcher: SUN_STANDARD
sun.boot.library.path: /usr/lib/jvm/java-17-openjdk-17.0.3.0.7-1.fc36.x86_64/lib:/usr/lib/jvm/java-17-openjdk-17.0.3.0.7-1.fc36.x86_64/lib
sun.java.command: ai.djl.integration.util.DebugEnvironment
jdk.debug: release
sun.cpu.endian: little
org.gradle.appname: gradlew
user.language: en
java.specification.vendor: Oracle Corporation
java.version.date: 2022-04-19
java.home: /usr/lib/jvm/java-17-openjdk-17.0.3.0.7-1.fc36.x86_64
ai.djl.logging.level: debug
org.gradle.internal.http.connectionTimeout: 60000
file.separator: /
java.vm.compressedOopsMode: Zero based
line.separator: 

java.vm.specification.vendor: Oracle Corporation
java.specification.name: Java Platform API Specification
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
java.runtime.version: 17.0.3+7
path.separator: :
os.version: 5.17.12-300.fc36.x86_64
java.runtime.name: OpenJDK Runtime Environment
file.encoding: UTF-8
java.vm.name: OpenJDK 64-Bit Server VM
java.vendor.version: 21.9
java.vendor.url.bug: https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=java-17-openjdk&version=36
java.io.tmpdir: /tmp
org.gradle.internal.http.socketTimeout: 120000
java.version: 17.0.3
user.dir: /mnt/ssd4tb/user/Software/djl/integration
os.arch: amd64
java.vm.specification.name: Java Virtual Machine Specification
native.encoding: UTF-8
java.library.path: /usr/local/cuda-10.2/lib64:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
java.vm.info: mixed mode, sharing
java.vendor: Red Hat, Inc.
java.vm.version: 17.0.3+7
sun.io.unicode.encoding: UnicodeLittle
library.jansi.path: /home/user/.gradle/native/jansi/1.18/linux64
java.class.version: 61.0
org.gradle.internal.publish.checksums.insecure: true

--------- Environment Variables ---------
PATH: /usr/local/cuda-10.2/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
LD_LIBRARY_PATH: /usr/local/cuda-10.2/lib64
-------------- Directories --------------
temp directory: /tmp
DJL cache directory: /home/user/.djl.ai
Engine cache directory: /home/user/.djl.ai

------------------ CUDA -----------------
GPU Count: 1
CUDA: 102
ARCH: 52
GPU(0) memory used: 295370752 bytes

----------------- Engines ---------------
DJL version: 0.18.0
Default Engine: MXNet
[WARN ] - No matching cuda flavor for linux found: cu102mkl/sm_52.
[DEBUG] - Loading mxnet library from: /home/user/.djl.ai/mxnet/1.9.0-mkl-linux-x86_64/libmxnet.so
Default Device: cpu()
PyTorch: 2
MXNet: 0
XGBoost: 10
TensorFlow: 3

--------------- Hardware --------------
Available processors (cores): 48
Byte Order: LITTLE_ENDIAN
Free memory (bytes): 2084036752
Maximum memory (bytes): 32178700288
Total memory available to JVM (bytes): 2147483648
Heap committed: 2147483648
Heap nonCommitted: 31391744
GCC: 
gcc (GCC) 12.1.1 20220507 (Red Hat 12.1.1-1)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


BUILD SUCCESSFUL in 8s
44 actionable tasks: 1 executed, 43 up-to-date
demq added the bug label on Jul 6, 2022
KexinFeng (Contributor) commented Jul 12, 2022

@demq
The stack trace shows that you are using the older setter code path:

at ai.djl.pytorch.jni.PyTorchLibrary.torchMaskedPut(Native Method)
at ai.djl.pytorch.jni.JniUtils.booleanMaskSet(JniUtils.java:416)
at ai.djl.pytorch.engine.PtNDArrayIndexer.set(PtNDArrayIndexer.java:82)
at ai.djl.ndarray.index.NDArrayIndexer.set(NDArrayIndexer.java:157)
at ai.djl.ndarray.NDArray.set(NDArray.java:469)
at ai.djl.ndarray.NDArray.set(NDArray.java:490)
at processOutput(PtBertQATranslator.java:116)

Could you try the setter with NDIndex? It uses the new code path:

start_logits.set(new NDIndex("{}", nd_bad_tokens_mask), -10000.);

This is in version 0.18.0, which was just released.
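Editor's note: a self-contained sketch of the suggested NDIndex-based setter, assuming DJL 0.18.0 (the array contents and class name here are illustrative, not from the thread):

```java
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.index.NDIndex;

// Sketch: boolean-mask assignment through NDIndex; the "{}" placeholder
// binds the mask NDArray into the index expression.
public class NDIndexSetSketch {
    public static void main(String[] args) {
        try (NDManager manager = NDManager.newBaseManager()) {
            NDArray logits = manager.create(new float[] {1f, 2f, 3f, 4f});
            NDArray mask = manager.create(new boolean[] {true, false, true, false});
            logits.set(new NDIndex("{}", mask), -10000.);
            System.out.println(logits); // positions where mask is true are now -10000
        }
    }
}
```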

KexinFeng added a commit that referenced this issue Jul 13, 2022
1. Fix issue #1773 and issue #1774
In NDArray.set(NDArray index, Number value), add support for setting array values with integer indices, as shown in the use cases #1773 and #1774.

2. Fix an issue in the build-from-source documentation
demq (Contributor, Author) commented Jul 14, 2022

The behavior is unchanged when I use NDArray.set(NDIndex, Number); the error comes from the way PyTorch restricts modification of tensors after inference:
Caused by: ai.djl.engine.EngineException: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details.

KexinFeng (Contributor) commented Jul 14, 2022

I see. OK, if this is a bug in PyTorch, could you report an issue to PyTorch to see if they can solve it from their side?

frankfliu (Contributor) commented

@demq
Are you able to reproduce the issue in Python?

demq (Contributor, Author) commented Jul 15, 2022

To clarify why I think this issue is a bug in DJL:

  1. PyTorch behaves as expected, since tensors created while in c10::InferenceMode are made immutable outside of InferenceMode: https://pytorch.org/cppdocs/notes/inference_mode.html

DJL appears to enable InferenceMode for "newer" versions of torch: https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-native/src/main/native/ai_djl_pytorch_jni_PyTorchLibrary_inference.cc

struct JITCallGuard {
#ifdef V1_10_X
  torch::autograd::AutoGradMode no_autograd_guard{false};
  torch::NoGradGuard no_grad;
#else
  c10::InferenceMode guard;
  torch::jit::GraphOptimizerEnabledGuard no_optimizer_guard{false};
#endif
};

V1_10_X is defined when PT_OLD_VERSION is set, which in turn happens in
https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-native/build.cmd

if "%VERSION%" == "1.10.0" (
    set PT_OLD_VERSION=1
)
if "%VERSION%" == "1.9.1" (
    set PT_OLD_VERSION=1
)

I suppose the M1 build of DJL is compiled with V1_10_X defined, so the c10::InferenceMode guard; is not used, while the Linux build has it.

  2. The NDArray.set() documentation does not mention this behavior: https://javadoc.io/static/ai.djl/api/0.18.0/ai/djl/ndarray/NDArray.html#set(ai.djl.ndarray.index.NDIndex,ai.djl.ndarray.NDArray)

DJL needs to either document this behavior, or ensure the tensors can be modified after inference in the PyTorch implementation so that the function behaves the same for all engines.
This could be done, for example, by:

  • Sub-optimal: switching back from c10::InferenceMode guard; to torch::NoGradGuard no_grad;
  • Returning duplicates of the tensors created outside of InferenceMode
  • Doing something better than what I have proposed.

KexinFeng (Contributor) commented Jul 17, 2022

@demq Thank you so much for this detailed investigation! The purpose of the InferenceMode guard is to protect the array from being changed by autograd in inference mode, and we should try to stay consistent with that. But in your case, where you want to modify the inference tensor, we will need to think about how to resolve it.

KexinFeng (Contributor) commented Jul 21, 2022

I have updated the documentation of processOutput to note this behaviour, so users implementing a post-processor will see it there.

I didn't add the duplication inside DJL, but left it to users, since it is good to keep the default behaviour the same as the engine's.
