ONNX Embedding Model Thread-Safety Issue #23555

Closed

alfredogangemi opened this issue Jan 31, 2025 · 17 comments
Labels
api:Java — issues related to the Java API
model:transformer — issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.

Comments

@alfredogangemi

Bug description

When using the default ONNX embedding model in Spring AI (all-MiniLM-L6-v2), running the embedding process asynchronously with a ThreadPoolTaskExecutor results in inconsistent behavior and occasional runtime exceptions. The issue does not occur when executing the process synchronously.

Environment

Java Version: 17
Spring Boot Version: Latest
Spring AI Version: 1.0.0-M5
ONNX Model: all-MiniLM-L6-v2

Steps to reproduce

Handle the embedding process asynchronously with a ThreadPoolTaskExecutor (see the sketch below).
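
For illustration, the async setup looks roughly like the following minimal sketch; the class, method, and executor bean names are hypothetical, not the actual project code.

// Minimal sketch of the async setup being described (names are illustrative).
// The @Async method is dispatched onto a ThreadPoolTaskExecutor and each task
// calls the ONNX-backed embedding model configured by Spring AI.
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

@Service
public class AsyncEmbeddingService {

    private final EmbeddingModel embeddingModel; // auto-configured ONNX model (all-MiniLM-L6-v2)

    public AsyncEmbeddingService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    @Async("embeddingAsyncExecutor") // hypothetical executor bean, see the configuration sketch further down
    public void embedAsync(String text) {
        var embedding = embeddingModel.embed(text); // intermittently fails when run on pool threads
        // ... write the embedding to the vector store ...
    }
}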

Expected behavior

The embedding process should run correctly across multiple threads.

Observed behavior

  • Running the method synchronously works fine.
  • Running it asynchronously causes intermittent failures.
  • Running the embedding model in a single-threaded executor gives the same error.
  • Switching to OpenAI embeddings works fine, reinforcing the idea that the problem is ONNX-related.

Logs

2025-01-31 17:06:11.001 ERROR [semantic-search-server,,] [EmbeddingAsyncExecutor-1] i.c.w.s.s.etl.DocumentVectorService     : Errore durante l'upload del file: Svizzera.pdf (load DocumentVectorService.java 70)
java.lang.RuntimeException: ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Non-zero status code returned while running Add node. Name:'/encoder/layer.0/attention/self/Add' Status Message: D:\a\_work\1\s\include\onnxruntime\core/common/logging/logging.h:340 onnxruntime::logging::LoggingManager::DefaultLogger Attempt to use DefaultLogger but none has been registered.

	at org.springframework.ai.transformers.TransformersEmbeddingModel.lambda$call$3(TransformersEmbeddingModel.java:351)
	at io.micrometer.observation.Observation.observe(Observation.java:564)
	at org.springframework.ai.transformers.TransformersEmbeddingModel.call(TransformersEmbeddingModel.java:298)
	at org.springframework.ai.embedding.EmbeddingModel.embed(EmbeddingModel.java:91)
	at org.springframework.ai.vectorstore.qdrant.QdrantVectorStore.doAdd(QdrantVectorStore.java:220)
	at org.springframework.ai.vectorstore.observation.AbstractObservationVectorStore.lambda$add$1(AbstractObservationVectorStore.java:91)
	at io.micrometer.observation.Observation.observe(Observation.java:498)
	at org.springframework.ai.vectorstore.observation.AbstractObservationVectorStore.add(AbstractObservationVectorStore.java:91)
	at it.cegeka.wemaind.semantic_search_server.service.etl.DocumentVectorService.load(DocumentVectorService.java:65)
	at it.cegeka.wemaind.semantic_search_server.service.etl.DocumentVectorService.loadAsync(DocumentVectorService.java:42)
	at jdk.internal.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:359)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:196)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
	at org.springframework.aop.interceptor.AsyncExecutionInterceptor.lambda$invoke$0(AsyncExecutionInterceptor.java:114)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Non-zero status code returned while running Add node. Name:'/encoder/layer.0/attention/self/Add' Status Message: D:\a\_work\1\s\include\onnxruntime\core/common/logging/logging.h:340 onnxruntime::logging::LoggingManager::DefaultLogger Attempt to use DefaultLogger but none has been registered.

	at ai.onnxruntime.OrtSession.run(Native Method)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:395)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:242)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:210)
	at org.springframework.ai.transformers.TransformersEmbeddingModel.lambda$call$3(TransformersEmbeddingModel.java:327)
	... 20 common frames omitted

Other info

The JVM crashed in some tests:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007fff9109bef3, pid=6540, tid=18268
#
# JRE version: OpenJDK Runtime Environment (17.0.10+13) (build 17.0.10+13-LTS)
# Java VM: OpenJDK 64-Bit Server VM (17.0.10+13-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, windows-amd64)
# Problematic frame:
# C  [onnxruntime.dll+0x71bef3]
[thread 29912 also had an error]
#
# No core dump will be written. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\Users\alfredog\IdeaProjects\Microservices\semantic-search-server\hs_err_pid6540.log
#
# If you would like to submit a bug report, please visit:
#   https://bell-sw.com/support
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

I understand that this issue is likely caused by the interaction between Spring AI and ONNX Runtime, so I will also open an issue on the Spring AI repository. However, in the meantime, could you provide more information regarding the error encountered?

github-actions bot added the api:Java and model:transformer labels on Jan 31, 2025
@Craigacp
Contributor

Craigacp commented Feb 1, 2025

I've done a bunch of multithreaded testing and multithreaded embedding runs without Spring AI, so I'm a bit surprised that they are managing to trigger this kind of error. It shouldn't be possible to crash the JVM using the ORT Java API, so something's definitely busted in the ORT Java API, just not sure what.

  • How many threads are in the executor and what kind of load is it getting (e.g. how many documents, what's the throughput)?
  • How is the OrtSession configured?
  • What ORT version is being used?
  • Is there a stack trace in the JVM crash error file?

@alfredogangemi
Author

Hi @Craigacp, I'll do my best to provide all the information you requested below.

  • The executor I'm using is configured with a core pool size of 5, a maximum pool size of 10, and a queue capacity of 25 (a configuration sketch follows this list). Each thread is responsible for processing a single document at a time. However, I have also tested configuring the thread pool with only one thread, and I still encounter the same error.
  • OrtSession is used by Spring AI in this class
  • OnnxRuntime version 1.19.2 (from Spring AI 1.0.0-M5)
  • This is the log file of the JVM crash hs_err_pid6540.log
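
A minimal sketch of that executor configuration, assuming Spring's ThreadPoolTaskExecutor (the bean name is illustrative; the thread-name prefix matches the EmbeddingAsyncExecutor-1 thread seen in the logs above):

// Sketch of the executor configuration described above (core 5, max 10, queue 25).
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean(name = "embeddingAsyncExecutor")
    public ThreadPoolTaskExecutor embeddingAsyncExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(5);    // core pool size reported above
        executor.setMaxPoolSize(10);    // maximum pool size
        executor.setQueueCapacity(25);  // queue capacity
        executor.setThreadNamePrefix("EmbeddingAsyncExecutor-");
        executor.initialize();
        return executor;
    }
}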

@Craigacp
Contributor

Craigacp commented Feb 3, 2025

I've made this small harness which should reproduce the error, but it doesn't (at least not on my M4 Pro Mac). Could you run it in your Windows environment and see if it crashes?

import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.io.File;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;
import java.util.stream.LongStream;

public class MultithreadingTest {
    private static final Logger logger = Logger.getLogger(MultithreadingTest.class.getName());

    private static OrtEnvironment env = OrtEnvironment.getEnvironment();

    OrtSession.SessionOptions makeOpts() {
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
        return opts;
    }

    OrtSession makeSession(OrtSession.SessionOptions opts) throws OrtException {
        Path path = new File(this.getClass().getResource("/all-minilm-l6-v2.onnx").getFile()).toPath();
        String modelPath = path.toString();
        OrtSession session = env.createSession(modelPath, opts);
        return session;
    }

    static ThreadFactory createThreadFactory() {
        return (Runnable runnable) -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        };
    }

    public ThreadPoolExecutor createThreadPoolExecutor() {
        RejectedExecutionHandler rejectedExecutionHandler = new ThreadPoolExecutor.AbortPolicy();
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(25);
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS,
                queue, createThreadFactory(), rejectedExecutionHandler);
        return executor;
    }

    //@Test
    public void arrayTest() throws OrtException, ExecutionException, InterruptedException {
        long[][] ids = new long[][]{LongStream.range(100, 600).toArray()};
        long[][] mask = new long[1][500];
        Arrays.fill(mask[0], 1);

        ThreadPoolExecutor executor = createThreadPoolExecutor();
        OrtSession.SessionOptions opts = makeOpts();
            try (OrtSession session = makeSession(opts)) {
                opts.close();
                Runnable r = () -> {
                    try (OnnxTensor inputIds = OnnxTensor.createTensor(env, ids);
                         OnnxTensor attentionMask = OnnxTensor.createTensor(env, mask)) {
                        Map<String, OnnxTensor> input = new HashMap<>();
                        input.put("input_ids", inputIds);
                        input.put("attention_mask", attentionMask);
                        try (OrtSession.Result result = session.run(input)) {
                            float[][][] output = (float[][][]) result.get(0).getValue();
                            logger.info("Output is ["+output.length+"]["+output[0].length+"]["+output[0][0].length+"]");
                        }
                    } catch (OrtException e) {
                        throw new RuntimeException(e);
                    }
                };

                List<Future<?>> futures = new ArrayList<>();
                for (int i = 0; i < 25; i++) {
                    futures.add(executor.submit(r));
                }

                for (Future<?> f : futures) {
                    f.get();
                }
            }

        logger.info("Submitted tasks");
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        logger.info("Shutdown executor");
    }

    public static void main(String[] args) throws OrtException, ExecutionException, InterruptedException {
        MultithreadingTest t = new MultithreadingTest();
        t.arrayTest();
    }

}

I've not replicated the async behaviour, as I've never used Spring or its async support. If you can tell me how to modify the thread pool/submission then maybe it'll trigger it?
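
For comparison, the @Async dispatch visible in the stack trace above (AsyncExecutionInterceptor) roughly amounts to submitting each call to a ThreadPoolTaskExecutor and discarding the returned Future; a minimal sketch of that fire-and-forget pattern (class and method names are hypothetical, pool sizes mirror the harness):

import java.util.concurrent.Future;

import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

public class FireAndForgetSubmit {
    // Approximation of the @Async dispatch: tasks are handed to a ThreadPoolTaskExecutor
    // and the returned Futures are never collected or joined, unlike the harness above.
    public static void submitAll(Runnable work, int taskCount) {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(2);
        taskExecutor.setMaxPoolSize(4);
        taskExecutor.setQueueCapacity(25);
        taskExecutor.initialize();
        for (int i = 0; i < taskCount; i++) {
            Future<?> ignored = taskExecutor.submit(work); // fire and forget
        }
        // no shutdown()/awaitTermination() here either, mirroring the async usage
    }
}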

@alfredogangemi
Author

Got it!

2025-02-03 19:04:32.1621918 [E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running Gather node. Name:'/embeddings/token_type_embeddings/Gather' Status Message: D:\a\_work\1\s\include\onnxruntime\core/framework/op_kernel_context.h:42 onnxruntime::OpKernelContext::Input Missing Input: token_type_ids
(the same ExecuteKernel error line repeats many times, once for each failing task)
 Exception in thread "main" java.util.concurrent.ExecutionException: java.lang.RuntimeException: ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Non-zero status code returned while running Gather node. Name:'/embeddings/token_type_embeddings/Gather' Status Message: D:\a\_work\1\s\include\onnxruntime\core/framework/op_kernel_context.h:42 onnxruntime::OpKernelContext::Input Missing Input: token_type_ids
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007fff85ca52ad, pid=26104, tid=23084
#
# JRE version: OpenJDK Runtime Environment (17.0.10+13) (build 17.0.10+13-LTS)
# Java VM: OpenJDK 64-Bit Server VM (17.0.10+13-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, windows-amd64)
# Problematic frame:
# C  [onnxruntime.dll+0x852ad]
#
# No core dump will be written. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\Users\alfredog\IdeaProjects\Microservices\Microservices\hs_err_pid26104.log

	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at it.cegeka.wemaind.semantic_search_server.util.MultithreadingTest.arrayTest(MultithreadingTest.java:94)
	at it.cegeka.wemaind.semantic_search_server.util.MultithreadingTest.main(MultithreadingTest.java:106)
Caused by: java.lang.RuntimeException: ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Non-zero status code returned while running Gather node. Name:'/embeddings/token_type_embeddings/Gather' Status Message: D:\a\_work\1\s\include\onnxruntime\core/framework/op_kernel_context.h:42 onnxruntime::OpKernelContext::Input Missing Input: token_type_ids

	at it.cegeka.wemaind.semantic_search_server.util.MultithreadingTest.lambda$arrayTest$1(MultithreadingTest.java:84)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Non-zero status code returned while running Gather node. Name:'/embeddings/token_type_embeddings/Gather' Status Message: D:\a\_work\1\s\include\onnxruntime\core/framework/op_kernel_context.h:42 onnxruntime::OpKernelContext::Input Missing Input: token_type_ids

	at ai.onnxruntime.OrtSession.run(Native Method)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:395)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:242)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:210)
	at it.cegeka.wemaind.semantic_search_server.util.MultithreadingTest.lambda$arrayTest$1(MultithreadingTest.java:78)
	... 5 more
#
# If you would like to submit a bug report, please visit:
#   https://bell-sw.com/support
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Process finished with exit code 1

hs_err_pid26104.log

I used your code without changing anything.

@Craigacp
Contributor

Craigacp commented Feb 3, 2025

Hmm, your version of all-MiniLM-L6-v2 needs an extra input (token_type_ids), which causes an error. Could you run this version? I'd like to work out whether the crash is coming from the error-handling pathway or from some other part of ORT.

import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.io.File;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;
import java.util.stream.LongStream;

public class MultithreadingTest {
    private static final Logger logger = Logger.getLogger(MultithreadingTest.class.getName());

    private static OrtEnvironment env = OrtEnvironment.getEnvironment();

    OrtSession.SessionOptions makeOpts() {
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
        return opts;
    }

    OrtSession makeSession(OrtSession.SessionOptions opts) throws OrtException {
        Path path = new File(this.getClass().getResource("/all-minilm-l6-v2.onnx").getFile()).toPath();
        String modelPath = path.toString();
        OrtSession session = env.createSession(modelPath, opts);
        return session;
    }

    static ThreadFactory createThreadFactory() {
        return (Runnable runnable) -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        };
    }

    public ThreadPoolExecutor createThreadPoolExecutor() {
        RejectedExecutionHandler rejectedExecutionHandler = new ThreadPoolExecutor.AbortPolicy();
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(25);
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS,
                queue, createThreadFactory(), rejectedExecutionHandler);
        return executor;
    }

    //@Test
    public void arrayTest() throws OrtException, ExecutionException, InterruptedException {
        long[][] ids = new long[][]{LongStream.range(100, 600).toArray()};
        long[][] mask = new long[1][500];
        Arrays.fill(mask[0], 1);
        long[][] type = new long[1][500];
        Arrays.fill(type[0], 0);

        ThreadPoolExecutor executor = createThreadPoolExecutor();
        OrtSession.SessionOptions opts = makeOpts();
            try (OrtSession session = makeSession(opts)) {
                opts.close();
                Runnable r = () -> {
                    try (OnnxTensor inputIds = OnnxTensor.createTensor(env, ids);
                         OnnxTensor attentionMask = OnnxTensor.createTensor(env, mask);
                         OnnxTensor tokenType = OnnxTensor.createTensor(env, type)) {
                        Map<String, OnnxTensor> input = new HashMap<>();
                        input.put("input_ids", inputIds);
                        input.put("attention_mask", attentionMask);
                        input.put("token_type_ids", tokenType);
                        try (OrtSession.Result result = session.run(input)) {
                            float[][][] output = (float[][][]) result.get(0).getValue();
                            logger.info("Output is ["+output.length+"]["+output[0].length+"]["+output[0][0].length+"]");
                        }
                    } catch (OrtException e) {
                        throw new RuntimeException(e);
                    }
                };

                List<Future<?>> futures = new ArrayList<>();
                for (int i = 0; i < 25; i++) {
                    futures.add(executor.submit(r));
                }

                for (Future<?> f : futures) {
                    f.get();
                }
            }

        logger.info("Submitted tasks");
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        logger.info("Shutdown executor");
    }

    public static void main(String[] args) throws OrtException, ExecutionException, InterruptedException {
        MultithreadingTest t = new MultithreadingTest();
        t.arrayTest();
    }

}

@alfredogangemi
Author

This code seems to work fine. I can provide you with a small Spring project that replicates the problem if you wish. I configured the thread pool with the Spring classes, but I assume your configuration is equivalent.

@Craigacp
Contributor

Craigacp commented Feb 3, 2025

Ok, so maybe the JVM crash is something to do with the error handling pathway when it's overloaded. I'm still confused by the logger error you get when running in async. If you have a demo to drive it in Spring with the async task scheduler that would be helpful. I can try to replicate it in a Windows environment and then get the debugger on it.

@Craigacp
Contributor

Craigacp commented Feb 3, 2025

Also, could you test it with the latest release (1.20.1)?

@yuslepukhin
Member

There are two issues as I see it.
The first is this exception; we need to find out how it is possible:

ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Non-zero status code returned while running Gather node. Name:'/embeddings/token_type_embeddings/Gather' Status Message: D:\a\_work\1\s\include\onnxruntime\core/framework/op_kernel_context.h:42 onnxruntime::OpKernelContext::Input Missing Input: token_type_ids

And the second: why throwing an exception in a multithreaded test produces an access violation.

@Craigacp
Contributor

Craigacp commented Feb 3, 2025

That runtime error comes from @alfredogangemi 's model being a little different from the one I was testing, but this is confusing to me:

Error code - ORT_RUNTIME_EXCEPTION - message: Non-zero status code returned while running Add node. Name:'/encoder/layer.0/attention/self/Add' Status Message: D:\a\_work\1\s\include\onnxruntime\core/common/logging/logging.h:340 onnxruntime::logging::LoggingManager::DefaultLogger Attempt to use DefaultLogger but none has been registered.

I don't know how we can end up without the default logger registered; the Java code shouldn't let you do anything without an OrtEnv being created first.

@alfredogangemi
Author

Hi @Craigacp,
I have created a Spring test project that replicates the error. The operations performed are exactly the same as in my main project, including the files used. You will also find the JVM crash logs and the error file that occurs in other cases.
https://github.com/alfredogangemi/spring-ai-onnx-async

@Craigacp
Contributor

Craigacp commented Feb 4, 2025

The test project replicates the default logger exception for me when running in Maven, and the crash when running in IntelliJ on my Windows box. I'll look into it. My suspicion is that the JVM crash is a consequence of the exceptions and something is weird in the exception handling path, but I've not run it down yet.

@Craigacp
Contributor

Craigacp commented Feb 4, 2025

There's a lot going on here, but one major issue is that Spring seems to throw away the embedding call: nothing holds a strong reference to it, so something gets closed out from under ONNX Runtime when running async. Adding a timeout to the async executor and making the test wait for it causes the test to complete without error:

diff --git a/src/test/java/com/example/demo/DemoApplicationTests.java b/src/test/java/com/example/demo/DemoApplicationTests.java
index 50a6203..0a82d5f 100644
--- a/src/test/java/com/example/demo/DemoApplicationTests.java
+++ b/src/test/java/com/example/demo/DemoApplicationTests.java
@@ -10,12 +10,14 @@ import org.springframework.boot.test.context.SpringBootTest;
 import org.springframework.core.io.ClassPathResource;
 import org.springframework.core.io.InputStreamResource;
 import org.springframework.core.io.Resource;
+import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

 import java.io.File;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStream;
 import java.util.Objects;
+import java.util.concurrent.TimeUnit;
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;

@@ -23,18 +25,23 @@ import java.util.stream.IntStream;
 @Slf4j
 class DemoApplicationTests {

+       @Autowired
+       private ThreadPoolTaskExecutor executor;
+
        @Autowired
        EmbeddingService embeddingService;

        @Test
        @SneakyThrows
-       public void asyncEmbed() throws IOException {
+       public void asyncEmbed() throws IOException, InterruptedException {
                Resource folderResource = new ClassPathResource("files"); //
                File folder = folderResource.getFile();
                for (File file : Objects.requireNonNull(folder.listFiles())) {
-                               log.info("Loading file {}", file.getName());
-                               embeddingService.embed(file);
-                       }
+                       log.info("Loading file {}", file.getName());
+                       embeddingService.embed(file);
                }
+               executor.getThreadPoolExecutor().awaitTermination(60,TimeUnit.SECONDS);
+               log.info("Finished waiting");
        }
+}

I think that's a problem with Spring AI and how it works with async stuff, as the embedding model shouldn't be closed until all the tasks have finished.

@alfredogangemi
Author

Hi @Craigacp ,

First of all, a huge thank you for taking the time to investigate this issue! I really appreciate your help.

Now that I understand the root cause, it makes perfect sense. The issue is related to how Spring Boot handles unit tests. When the test finishes, Spring shuts down the application context, but the async executor threads are still running in the background. Since the embedding model (and possibly other beans) are managed by Spring, they get destroyed before the async tasks can complete, leading to issues when those threads try to access them.
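
A minimal sketch of one way to avoid that race, assuming the async method can return a CompletableFuture (the method name is illustrative and differs from the real EmbeddingService): join the future before the test ends, so the context, and with it the ONNX session, stays alive until the work finishes.

import java.util.concurrent.CompletableFuture;

import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

@Service
public class EmbeddingService {

    @Async
    public CompletableFuture<Void> embedAsync(String text) {
        // ... call the embedding model and write to the vector store ...
        return CompletableFuture.completedFuture(null); // completes when the work above is done
    }
}

// In the test, joining keeps the Spring context (and the ONNX session) alive until the work finishes:
// embeddingService.embedAsync(text).join();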

Thanks again for your support!

@Craigacp
Contributor

Craigacp commented Feb 5, 2025

The reproducer does crash for me on macOS, and while it's due to the async behaviour not waiting properly, I'd really prefer the JVM not to crash when the user uses ORT incorrectly, so I'm still trying to figure out exactly what's causing the crash. Maybe the stdout has gone away?

@yuslepukhin this is the stack trace I get back out of a debug build on macOS, does it make more sense to you? It bottoms out in the logger's ISink::Send method, but I don't know what could be null in there. The crash is intermittent too; sometimes we get an ORT_RUNTIME_EXCEPTION back out complaining the default logger isn't configured. Any idea what could be racing? Unfortunately, getting the JVM to actually dump a core file on macOS is troublesome as I need to re-sign it with fresh entitlements, so I'll have to try this on a Linux box.

C  [libonnxruntime.dylib+0x1a28c28]  onnxruntime::logging::ISink::Send(std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000l>>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, onnxruntime::logging::Capture const&)+0x2c
C  [libonnxruntime.dylib+0x1a28bc8]  onnxruntime::logging::LoggingManager::Log(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, onnxruntime::logging::Capture const&) const+0x50
C  [libonnxruntime.dylib+0x1a27e14]  onnxruntime::logging::Logger::Log(onnxruntime::logging::Capture const&) const+0x28
C  [libonnxruntime.dylib+0x1a27db8]  onnxruntime::logging::Capture::~Capture()+0x40
C  [libonnxruntime.dylib+0x1a27e3c]  onnxruntime::logging::Capture::~Capture()+0x1c
C  [libonnxruntime.dylib+0x155f1e8]  onnxruntime::ExecuteKernel(onnxruntime::StreamExecutionContext&, unsigned long, unsigned long, bool const&, onnxruntime::SessionScope&)+0x4b4
C  [libonnxruntime.dylib+0x14c5304]  onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&)+0x5c
C  [libonnxruntime.dylib+0x15d86c8]  onnxruntime::RunSince(unsigned long, onnxruntime::StreamExecutionContext&, onnxruntime::SessionScope&, bool const&, unsigned long)+0x1e8
C  [libonnxruntime.dylib+0x15659b4]  onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool)::$_0::operator()() const+0x2c
C  [libonnxruntime.dylib+0x156597c]  decltype(std::declval<onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool)::$_0&>()()) std::__1::__invoke[abi:ne180100]<onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool)::$_0&>(onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime
C  [libonnxruntime.dylib+0x1565934]  void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ne180100]<onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool)::$_0&>(onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool)::$_0&)+0x18
C  [libonnxruntime.dylib+0x1565910]  std::__1::__function::__alloc_func<onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool)::$_0, std::__1::allocator<onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool)::$_0>, void ()>::operator()[abi:ne180100]()+0x1c
C  [libonnxruntime.dylib+0x1564918]  std::__1::__function::__func<onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool)::$_0, std::__1::allocator<onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool)::$_0>, void ()>::operator()()+0x1c
C  [libonnxruntime.dylib+0x13d134]  std::__1::__function::__value_func<void ()>::operator()[abi:ne180100]() const+0x44
C  [libonnxruntime.dylib+0x13d0e4]  std::__1::function<void ()>::operator()() const+0x18
C  [libonnxruntime.dylib+0xfa790]  onnxruntime::concurrency::ThreadPool::Schedule(onnxruntime::concurrency::ThreadPool*, std::__1::function<void ()>)+0x90
C  [libonnxruntime.dylib+0x155fa98]  onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool)+0x440
C  [libonnxruntime.dylib+0x160f5f8]  onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, gsl::span<OrtValue const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, std::__1::unordered_map<unsigned long, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__1::hash<unsigned long>, std::__1::equal_to<unsigned long>, std::__1::allocator<std::__1::pair<unsigned long const, std::__1::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection*, bool, onnxruntime::Stream*)+0x1dc
C  [libonnxruntime.dylib+0x160ef9c]  onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollectionHolder&, bool, onnxruntime::Stream*)+0x1f0
C  [libonnxruntime.dylib+0x160ffdc]  onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>&, ExecutionMode, OrtRunOptions const&, onnxruntime::DeviceStreamCollectionHolder&, onnxruntime::logging::Logger const&)+0x8c
C  [libonnxruntime.dylib+0xf7378]  onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, 18446744073709551615ul>, std::__1::vector<OrtValue, std::__1::allocator<OrtValue>>*, std::__1::vector<OrtDevice, std::__1::allocator<OrtDevice>> const*)+0x102c
C  [libonnxruntime.dylib+0xf9ad0]  onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue const* const, 18446744073709551615ul>, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue*, 18446744073709551615ul>)+0x588
C  [libonnxruntime.dylib+0x1b4788]  OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**)+0x188
C  [libonnxruntime4j_jni.dylib+0xdc40]  Java_ai_onnxruntime_OrtSession_run+0x378
j  ai.onnxruntime.OrtSession.run(JJJ[Ljava/lang/String;[JJ[Ljava/lang/String;J[Lai/onnxruntime/OnnxValue;[JJ)[Z+0
j  ai.onnxruntime.OrtSession.run(Ljava/util/Map;Ljava/util/Set;Ljava/util/Map;Lai/onnxruntime/OrtSession$RunOptions;)Lai/onnxruntime/OrtSession$Result;+728
j  ai.onnxruntime.OrtSession.run(Ljava/util/Map;Ljava/util/Set;)Lai/onnxruntime/OrtSession$Result;+7
j  ai.onnxruntime.OrtSession.run(Ljava/util/Map;)Lai/onnxruntime/OrtSession$Result;+6

@yuslepukhin
Member

yuslepukhin commented Feb 5, 2025

The problem is not necessarily caused by a nullptr; that would be easy to detect.
It can simply be an invalid pointer that points to an object that was valid some time ago.
Or the object was destroyed, but the pointer to it was not yet set to nullptr (or the update was not yet visible to another thread) when it was used again. We do not implement singleton resurrection.

The JVM will crash: neither ORT nor the JVM is a managed program.

The intermittent nature of the crash points to a race condition.
And, of course, the occasional ORT exception points to the same thing.

Without reference to this specific issue, in general this may be caused by:

  • The default logger created, but the object is not visible in another thread yet.
  • The logger was destroyed, and there are threads that are still using it.

The Ort::Environment object must be global to the process and must not be auto-closed prematurely (how's my Java terminology? :)
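
In ORT Java terms, that lifetime rule looks roughly like the following minimal sketch (an illustration, not code from this thread): keep the OrtEnvironment for the whole process and only close per-call resources such as sessions and tensors.

import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

public final class OrtHolder {
    // Process-wide environment; never closed while worker threads may still run inference.
    private static final OrtEnvironment ENV = OrtEnvironment.getEnvironment();

    private OrtHolder() {}

    public static OrtSession openSession(String modelPath) throws OrtException {
        // Only the per-session options are scoped; the environment itself is not
        // placed in try-with-resources tied to a single task.
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            return ENV.createSession(modelPath, opts);
        }
    }
}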

@Craigacp
Contributor

Craigacp commented Feb 5, 2025

That all sounds fine. It turns out I'd left myself a note in the commit that changed the behaviour of OrtEnvironment describing exactly how this bug can occur (Java daemon threads run concurrently with shutdown hooks, so those threads can see a partially shut-down OrtEnv), and it can't be prevented from Java - #10670. I'd completely forgotten about that interaction between shutdown hooks and daemon threads.
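
A stand-alone JDK sketch of that interaction (purely illustrative, no ORT involved): daemon threads are not stopped before shutdown hooks run, so a daemon worker can observe shared state while a shutdown hook is tearing it down.

public class DaemonVsShutdownHook {
    // Stand-in for native state (like the OrtEnvironment) that a shutdown hook releases.
    private static volatile Object sharedNativeHandle = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            sharedNativeHandle = null; // analogous to the hook tearing down the environment
            System.out.println("shutdown hook: released shared state");
        }));

        Thread worker = new Thread(() -> {
            while (true) {
                Object handle = sharedNativeHandle;
                if (handle == null) {
                    // Depending on timing this may or may not print before the JVM halts,
                    // but the daemon thread is still running while the hook executes.
                    System.out.println("daemon worker saw torn-down state");
                    return;
                }
            }
        });
        worker.setDaemon(true); // daemon threads run concurrently with shutdown hooks
        worker.start();

        Thread.sleep(100);
        // main returns here; shutdown hooks start while the daemon worker is still spinning
    }
}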
