Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix weird panama integration failure #96715

Merged
merged 5 commits into from
Jun 12, 2023

Conversation

benwtrent
Copy link
Member

@benwtrent benwtrent commented Jun 8, 2023

The issue:

VectorUtil will Load the Panama Vector incubator module if users have it enabled. However, if the first time its loaded is in a restricted security context, it will fail. There are two issues:

The fix:

Eagerly load the module if supported through combining apache/lucene#12362 with org.elasticsearch.bootstrap.Elasticsearch#ensureInitialized.

Logging details & stack traces

Logger failure

[2023-06-08T13:17:36,397][WARN ][stderr                   ] [runTask-0] The system environment variables are not available to Log4j due to security restrictions: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getenv.*")
[2023-06-08T13:17:36,397][WARN ][stderr                   ] [runTask-0] java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getenv.*")
[2023-06-08T13:17:36,397][WARN ][stderr                   ] [runTask-0]         at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:488)
[2023-06-08T13:17:36,397][WARN ][stderr                   ] [runTask-0]         at java.base/java.security.AccessController.checkPermission(AccessController.java:1071)
[2023-06-08T13:17:36,397][WARN ][stderr                   ] [runTask-0]         at java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:411)
[2023-06-08T13:17:36,397][WARN ][stderr                   ] [runTask-0]         at java.base/java.lang.System.getenv(System.java:1202)
[2023-06-08T13:17:36,397][WARN ][stderr                   ] [runTask-0]         at org.apache.logging.log4j@2.19.0/org.apache.logging.log4j.util.EnvironmentPropertySource.containsProperty(EnvironmentPropertySource.java:101)
[2023-06-08T13:17:36,397][WARN ][stderr                   ] [runTask-0]         at org.apache.logging.log4j@2.19.0/org.apache.logging.log4j.util.PropertiesUtil$Environment.get(PropertiesUtil.java:533)

JDK leak:

[2023-06-08T12:56:50,801][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [runTask-0] fatal error in thread [elasticsearch[runTask-0][write][T#10]], exiting java.lang.NoClassDefFoundError: Could not initialize class jdk.incubator.vector.VectorIntrinsics
»       at jdk.incubator.vector/jdk.incubator.vector.AbstractSpecies.loopBound(AbstractSpecies.java:204)
»       at org.apache.lucene.core@9.7.0-snapshot-24df30cca69/org.apache.lucene.util.VectorUtilPanamaProvider.dotProduct(VectorUtilPanamaProvider.java:86)
»       at org.apache.lucene.core@9.7.0-snapshot-24df30cca69/org.apache.lucene.util.VectorUtil.dotProduct(VectorUtil.java:37)
»       at org.apache.lucene.core@9.7.0-snapshot-24df30cca69/org.apache.lucene.index.VectorSimilarityFunction$2.compare(VectorSimilarityFunction.java:54)
»       at org.apache.lucene.core@9.7.0-snapshot-24df30cca69/org.apache.lucene.util.hnsw.HnswGraphSearcher.compare(HnswGraphSearcher.java:333)

In painless here is the trace a user would see:

"jdk.incubator.vector/jdk.incubator.vector.AbstractSpecies.loopBound(AbstractSpecies.java:204)"
"org.apache.lucene.core@9.7.0-snapshot-24df30cca69/org.apache.lucene.util.VectorUtilPanamaProvider.dotProduct(VectorUtilPanamaProvider.java:86)"
"org.apache.lucene.core@9.7.0-snapshot-24df30cca69/org.apache.lucene.util.VectorUtil.dotProduct(VectorUtil.java:37)"
"org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.script.field.vectors.BinaryDenseVector.dotProduct(BinaryDenseVector.java:51)"
"org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.script.field.vectors.BinaryDenseVector.cosineSimilarity(BinaryDenseVector.java:116)"
"org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.script.VectorScoreScriptUtils$FloatCosineSimilarity.cosineSimilarity(VectorScoreScriptUtils.java:271)"
"org.elasticsearch.server@8.9.0-SNAPSHOT/org.elasticsearch.script.VectorScoreScriptUtils$CosineSimilarity.cosineSimilarity(VectorScoreScriptUtils.java:288)"
"cosineSimilarity(params.query, 'vector') + 1.0"
...<snip>...
"Could not initialize class jdk.incubator.vector.VectorIntrinsics","caused_by":{"type":"exception_in_initializer_error","reason":"Exception java.security.AccessControlException: access denied (\"java.util.PropertyPermission\" \"jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK\" \"read\"

@benwtrent benwtrent added >non-issue :Search/Search Search-related issues that do not fall into other categories v8.9.0 labels Jun 8, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 8, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@benwtrent
Copy link
Member Author

@ChrisHegarty do you think this "fix" is better used within the gradle starup scripts or something? I don't know how we can eagerly load a module and call the necessary privileged methods.

@benwtrent
Copy link
Member Author

This change makes me wonder if we really want: #96617

These weird privilege checks concern me a bit from the Painless context.

Also, I am not sure how we can test this thing reliably. Since multiple tests utilize the same JVM processes, the module may already be loaded and cached. How could we test this?

@benwtrent
Copy link
Member Author

benwtrent commented Jun 8, 2023

I have tried adding the following to modules/lang-painless/src/main/plugin-metadata/plugin-security.policy


  // needed for project panama initialization
  permission java.util.PropertyPermission "jdk.incubator.vector.*", "read";

Within the grant block. Even with this, I am still getting:

"type":"no_class_def_found_error","reason":"Could not initialize class jdk.incubator.vector.VectorIntrinsics","caused_by":{"type":"exception_in_initializer_error","reason":"Exception java.security.AccessControlException: access denied (\"java.util.PropertyPermission\" \"jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK\" \"read\") [in thread \"elasticsearch[runTask-0][search][T#2]\"]"

// 1. log4j needs privileged access to get environment variable values to print out the preferred bit size
// 2. loopBound proved by the module makes a privileged call `jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK`
// this is apparently cached as once its loaded, we don't run into issues again
private static final float LOADING_VECTOR_MODULE = VectorUtil.dotProduct(new float[257], new float[257]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not pretty, but necessary. I think the change is ok. If the problem is causing sufficient failures it might be best to push the change as is. However, I would like to revisit this issue with the intention of eventually being in a position to remove this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this to a static init block, and not retain the computed value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping a reference to the float array does not prevent the module from being "unloaded".

I would just initialize the VectorUtil class and save a class reference in a static final.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually computing a vector product should not be needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, computation is needed because the issue is inside the JDK's VectorIntrinsics class, which is not initialised by just loading any API classes. You kinda need to operate on it :-(.

Copy link
Contributor

@ChrisHegarty ChrisHegarty Jun 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initialising VectorUtil would work if VectorUtilPanamaProvider had something like this:

  static {
    int l = AccessController.doPrivileged((PrivilegedAction<Integer>) () ->
       IntVector.fromArray(IntVector.SPECIES_128, new int[] { 1, 2, 3, 4 }, 0).length());
  }

IntVector.fromArray is one way to trigger the initialisation of the JDK's VectorIntrinsics class.

Which is what I would propose, if Lucene wants to workaround this too ?

@ChrisHegarty
Copy link
Contributor

Oh, we have log4j permission problems here too. :-( The permission check flowing out from the JDK is a bug. I'll get that fixed upstream.

@ChrisHegarty
Copy link
Contributor

I have tried adding the following to modules/lang-painless/src/main/plugin-metadata/plugin-security.policy

// needed for project panama initialization
permission java.util.PropertyPermission "jdk.incubator.vector.*", "read";

This is a bit surprising to me. I was hoping that it would be sufficient.

If you get time @benwtrent, then maybe try running with debug enabled to hopefully identify the exact codebase that denies the permission, e.g.

-Djava.security.debug=access,failure,domain or -Dtests.jvm.argline="-Djava.security.debug=access,failure,domain" if using Gradle. (there will be a lot of output, maybe just redirect stdout/stderr to a file and grep )

Also, if there is a relatively straightforward way to reproduce, then I will investigate too. But again, if this is causing much noise please feel free to merge, and we can continue to investigate later.

@benwtrent
Copy link
Member Author

Also, if there is a relatively straightforward way to reproduce, then I will investigate too. But again, if this is causing much noise please feel free to merge, and we can continue to investigate later.

I am going to create a rally track to reproduce. The trick is that the first time the panama module is loaded is when its used in a script. All our rally tracks index the vectors. The module is loaded at index time.

@benwtrent
Copy link
Member Author

OK, trying -Djava.security.debug=access,failure,domain on an assembled jar. But, dang, this is so slow :D

@benwtrent
Copy link
Member Author

@ChrisHegarty so you have steps to reproduce.

On a completely new node/cluster (never have ran vector search before). Do the following console commands:

You may need to increase dims if your preferredBitWidth is larger.

DELETE vectors

PUT vectors
{
    "mappings": {
        "properties": {
            "vector": {
                "type": "dense_vector",
                "dims": 32
            }
        }
    }
}

POST vectors/_doc
{
    "vector": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
}


GET vectors/_search?size=1
{
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
                "params": {
                    "query_vector": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
                }
            }
        }
    }
}

@ChrisHegarty
Copy link
Contributor

Thank you @benwtrent

@ChrisHegarty
Copy link
Contributor

Thanks for the reproducer @benwtrent - I managed to reproduce the issue.

FWIW, here are the permissions that we would need to grant to scripts to allow this to work:

  1. getenv, for logging
  2. property permission read

I'm not overly concerned if we grant scripts these permissions, but we should check with @rjernst.

diff --git a/server/src/main/resources/org/elasticsearch/bootstrap/untrusted.policy b/server/src/main/resources/org/elasticsearch/bootstrap/untrusted.policy
index fd26a66e3439..d644cda2b1f1 100644
--- a/server/src/main/resources/org/elasticsearch/bootstrap/untrusted.policy
+++ b/server/src/main/resources/org/elasticsearch/bootstrap/untrusted.policy
@@ -14,4 +14,8 @@ grant {
   // needed IndyInterface selectMethod (setCallSiteTarget)
   // TODO: clean this up / only give it to engines that really must have it
   permission java.lang.RuntimePermission "getClassLoader";
+
+  // needed if scripts trigger the initial loading of VectorUtil
+  permission java.lang.RuntimePermission "getenv.*";
+  permission java.util.PropertyPermission "jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK", "read";
 };

@ChrisHegarty
Copy link
Contributor

FTR - the JDK bug https://bugs.openjdk.org/browse/JDK-8309727

@ChrisHegarty
Copy link
Contributor

The way I see it, we have two options here:

  1. Force the initialisation of VectorUtils and at least one computation, or
  2. Update painless scripts to allow the permission.

At this point, while crummy, I favour the former (option 1), as it is least impactful.

// 1. log4j needs privileged access to get environment variable values to print out the preferred bit size
// 2. loopBound proved by the module makes a privileged call `jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK`
// this is apparently cached as once its loaded, we don't run into issues again
private static final float LOADING_VECTOR_MODULE = VectorUtil.dotProduct(new float[257], new float[257]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this to a static init block, and not retain the computed value?

@uschindler
Copy link
Contributor

For the other issue. Make sure that log4j is initialized early on startup. This is nothing we can prevent in Lucene.

@uschindler
Copy link
Contributor

(may happen with mmap dir, too).

@ChrisHegarty
Copy link
Contributor

For the other issue. Make sure that log4j is initialized early on startup. This is nothing we can prevent in Lucene.
(may happen with mmap dir, too).

Oh yeah. It kinda surprises me that we've not already initialised log4j at this point. Even logging something from DenseVectorFieldMapper clinit / init would work.

@uschindler
Copy link
Contributor

With my workaround in apache/lucene#12362 we could just initialize the Lucene VectorUtil class in ES core code. So an explicit vector calc is not needed, just: static final Class<VectorUtil> DUMMY = Class.forName("org.apache.lucene.VectorUtil");

@benwtrent
Copy link
Member Author

I have confirmed @uschindler 's fix. Thank you!

I will update this PR once we have updated our Lucene version to a release that contains his fix.

Pausing work on this for now.

@benwtrent benwtrent marked this pull request as draft June 9, 2023 20:07
@DaveCTurner
Copy link
Contributor

We have a few other classes that we eagerly initialise, early on in startup, see org.elasticsearch.bootstrap.Elasticsearch#ensureInitialized. Maybe consider adding this there?

@uschindler
Copy link
Contributor

We have a few other classes that we eagerly initialise, early on in startup, see org.elasticsearch.bootstrap.Elasticsearch#ensureInitialized. Maybe consider adding this there?

The code there looks fine. I like the Lookup::ensureInitialized(Class) used for that, much better than my idea with Class#forName(String). Cool!

It is only important to initialize logging before that, haven't closely looked into the code!

@ChrisHegarty ChrisHegarty mentioned this pull request Jun 12, 2023
@benwtrent
Copy link
Member Author

Combining apache/lucene#12362 with org.elasticsearch.bootstrap.Elasticsearch#ensureInitialized. fixes the issue.

@benwtrent benwtrent marked this pull request as ready for review June 12, 2023 11:42
Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine. I cannot approve PR.

@uschindler
Copy link
Contributor

uschindler commented Jun 12, 2023

Maybe you should also initialize MMapDircetory.class as it could also cause log4j issues. But that one is initialized very early, and it is unlikely that painless does it indirect.

On the other hand: as this is done before security manager is installed, you could remove the permission to access sun.misc classes :-)

@benwtrent benwtrent merged commit a9e807a into elastic:main Jun 12, 2023
@benwtrent benwtrent deleted the feature/bug-fix-panama-integration branch June 12, 2023 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>non-issue :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants