Retain resource URL in java-parser results, improve caching #124

plexus · 2021-09-18T08:52:22Z

The java parser/legacy-parser returns a resource's :file, actually a relative
path as a string, and :path, the relative path converted to absolute. It does
this conversion through an io/resource lookup to find the file system location
of the artifact. There are two issues with this:

The return value of io/resource is discarded, even though it is needed
later on in orchard.java, causing unnecessary resource lookups.
The path is obtained through (.getPath (io/resource path)), which yields
nonsensical results when used on resources that are inside a JAR.

This change keeps the :file returned from java-parser (relative path string),
but removes the :path value in favor of a :resource-url, i.e. the return
value of (io/resource path). It then provides utility functions to work with
this resource URL directly, namely mapping it to a filesystem artifact, or
retrieving its modification time.

java.parser/parser already does a io/resource lookup for the resource it is
parsing, meaning it has a full URL (jar: or file:). By including this URL in the
map it returns callers can do further checks or operations on the resource
without having to re-scan the classpath.

This in turn allows us to simplify and optimize orchard.java/class-info. The
old version would call io/resource on each invocation, even when the result
was already cached, in order to find the artifact's modification time.

This can noticably speed up cider-nrepl's "stacktrace" op, which calls
orchard.java/class-info on each stack frame, in extreme cases taking multiple
seconds to analyze all stack frames and return a result. This leads to rather
jarring UX, with the stacktrace buffering popping up in Emacs seconds after the
evaluation and exception happened. This is exarcerbated when using the nREPL
sideloader, which adds overhead to each resource lookup.

This also makes the return value of java-parser more correct. As mentioned the
previous version would always call (.getPath (io/resource path)), but this
only makes sense for file: resources, not for jar: resources. For file:
URLs .getPath returns the path of the file. For jar: URLs it returns the
nested url+the jar entry, so a file: URL but with a dangling !/jar/entry.

Illustration of why calling getPath on jar: URLs is not useful:

(io/resource "lambdaisland/witchcraft.clj")
;;=> #java.net.URL "file:/srv/mc/witchcraft/src/lambdaisland/witchcraft.clj"

(.getPath (io/resource "lambdaisland/witchcraft.clj"))
;;=> "/srv/mc/witchcraft/src/lambdaisland/witchcraft.clj"

(io/resource "clojure/lang/RT.class")
;;=> #java.net.URL "jar:file:/root/.m2/repository/org/clojure/clojure/1.10.3/clojure-1.10.3.jar!/clojure/lang/RT.class"

(.getPath (io/resource "clojure/lang/RT.class"))
;;=> "file:/root/.m2/repository/org/clojure/clojure/1.10.3/clojure-1.10.3.jar!/clojure/lang/RT.class"

vemv

Awesome! Appreciate the PR desc.

I guess this could have test coverage both directly, and more importantly it's worth checking out the consumer of this defn - surely it lacked test coverage as well.

bbatsov · 2021-09-20T06:37:26Z

Great PR indeed! If we can easily improve the test coverage, that'd be great, but I'm willing to take the improvement regardless.

plexus · 2021-09-20T06:48:13Z

The consumer has tests, but no tests that consider :lastModified. It's a bit of a pain to test, because you'd need a class file that is not inside a jar. I guess you could bundle a pre-compiled .class just for testing, or include a Java file that you compile on the fly.

Not sure if tests for immutable-source-file? would have helped, since it seems the issue is assumptions made about its inputs. You'd only really notice if you had tests that combine the java parser and immutable-source-file?. But I can add a few tests that demonstrate the input it gets, and maybe add a comment or two.

The current behavior is that immutable-source-file? always returns true, (because the io/resource call always returns nil) which means it never expires the cached info about classes. Which for most users will actually be fine, unless they're using something like Virgil to work on Java and Clojure simultaneously. But despite the caching the immutable-source-file? check still happens every time (to decide if it should expire), and that check was needlessly costly.

Actually we could go a step further here and skip the check on class info that is already cached and coming from a JAR, on the assumption that if a class is coming from a JAR now it will still be coming from a JAR in the future.

Seems this also needs a rebase now, I'll get to that.

bbatsov · 2021-09-20T07:15:29Z

Actually we could go a step further here and skip the check on class info that is already cached and coming from a JAR, on the assumption that if a class is coming from a JAR now it will still be coming from a JAR in the future.

I like this idea!

The consumer has tests, but no tests that consider :lastModified. It's a bit of a pain to test, because you'd need a class file that is not inside a jar. I guess you could bundle a pre-compiled .class just for testing, or include a Java file that you compile on the fly.

That's too much hassle for little return, so let's not bother with it.

plexus · 2021-09-20T10:04:20Z

Looking at this a bit closer the problem is really in java-parser

          (assoc ...
                 :file path
                 :path (.getPath (io/resource path)))

io/resource returns a URL. If that's a file: URL then .getPath will indeed return the path on the filesystem, that seems to be mostly what this code is assuming. But it could also be a jar: or even an http:, ftp: or data: URL (depending on how funky your classloader is). The main cases to consider here are file: and jar: URL, but .getPath on a jar: URL pretty much gives you gibberish, because it strips the jar: but leaves the !/... at the end. (jar URLs look like jar:<url>/!<entry>, and so you get <url>/!<entry>). The correct way to parse a jar URL is apparently via JarURLConnection

Also keep in mind that the <url> inside the jar:<url>.. doesn't have to be a file: url, it could also e.g. be http:.

So I think the better solution is actually for java-parser to do

(assoc ... :file path :resource-url (io/resource path))

and skip the .getPath alltogether.

If it's possible that other callers might still be relying on :path we can leave it in the result as well.

plexus · 2021-09-21T06:47:57Z

Glad I slept on this, on closer inspection immutable-source-file? did return the right result, but in a way that unnecessarily calls io/resource even though this is already done in source-info, and in a way that bypasses the cache check.

I've changed the behavior of parser/source-file slightly, for the case where a file is coming from a jar

;; Before
(select-keys (parser/source-info 'clojure.lang.RT) [:file :path :archive?])
{:file "clojure/lang/RT.java", :path "file:/home/arne/.m2/repository/org/clojure/clojure/1.10.1/clojure-1.10.1-sources.jar!/clojure/lang/RT.java"}

;; After
(select-keys (parser/source-info 'clojure.lang.RT) [:file :path :archive?])
{:file "clojure/lang/RT.java", :path "/home/arne/.m2/repository/org/clojure/clojure/1.10.1/clojure-1.10.1-sources.jar", :archive? true}

So that :path is always a filesystem path, just as it is with non-archive resources. I've also added the :archive? key so we don't have to scan for .(jar|zip)!/ later on.

;; non-archive resources
(select-keys (parser/source-info 'orchard.java.DummyClass) [:file :path :archive?])
{:file "orchard/java/DummyClass.java", :path "/home/arne/github/orchard/test/orchard/java/DummyClass.java", :archive? false}

bbatsov · 2021-09-21T12:10:20Z

Seems like a fantastic change to me!

file:/home/arne/.m2/repository/org/clojure/clojure/1.10.1/clojure-1.10.1-sources.jar!/clojure/lang/RT.java

Did you check in CIDER how are those paths currently handled? I wondering if the updated format won't breaking something.

plexus · 2021-09-21T14:27:56Z

I can't immediately find anywhere in CIDER where :path is used, only :file.

Do you have any pointers for things to check? This is specifically for java class info, so things like find-var or find-ns should not be impacted. I ended up here because of its use in stacktrace, and there cider-stacktrace is definitely only checking :file.

The thing is that when info ops return :file has been made absolute, if you look at cider.nrepl.middleware.info.

      (-> info
          (merge
               (when-let [file (:file info)]
                   (info/file-info file))
)
)

Where info/info-file does another resource lookup, this logic is also somewhat flawed

(defn file-info
  [path]
  (let [[resource-relative resource-full] (resource/resource-path-tuple path)]
    (merge {:file (or (file-path path) resource-full path)}
           ;; Classpath-relative path if possible
           (when resource-relative
             {:resource resource-relative}))))

Is this path a classpath-relative path, or a CWD-relative path? resource-path-tuple assumes the former, file-path assumes the latter.

And in this case (when the info is coming from java.parser) we have all this information already...

And then in cider-stacktrace-navigate this is basically ignored, instead it tries to guess first if it can figure out itself where that class source is, even if we just told it where it is...

(let (
         (file (or
                (and (null var) (cider-resolve-java-class class))
                (nrepl-dict-get info "file")
                (button-get button 'file)))
)

bbatsov · 2021-09-21T14:34:44Z

Do you have any pointers for things to check? This is specifically for java class info, so things like find-var or find-ns should not be impacted. I ended up here because of its use in stacktrace, and there cider-stacktrace is definitely only checking :file.

I briefly checked the code and it seems you're right.

Is this path a classpath-relative path, or a CWD-relative path? resource-path-tuple assumes the former, file-path assumes the latter.

I think all relative paths should be classpath-relative.

And then in cider-stacktrace-navigate this is basically ignored, instead it tries to guess first if it can figure out itself where that class source is, even if we just told it where it is...

Too much legacy - I'm pretty sure we've slowly accumulated a lot of small inconsistencies over the years. I agree the current behavior is flawed.

plexus · 2021-09-21T14:59:48Z

ok, in that case let's take this from the source to the consumer step by step, and try to be consistent about the meaning of each key. In the end there are two things that matter: the classpath-relative path, and the full file/jar URL.

The info maps coming from orchard seem to all use :file as the classpath-relative path. However in the info op in cider-nrepl :file is (at least most of the time it seems?) a full URL (file: or jar:) which makes sense. Except if orchard gives it a relative path that it can't resolve, in that case it stays a relative path...

For your own REPL perusal

(require '[orchard.info :as info] '[orchard.java :as java] '[cider.nrepl.middleware.info :as cider-info])

(info/info 'clojure.core 'str)
(info/info-java 'clojure.lang.RT 'baseLoader)
(java/class-info 'clojure.lang.RT)

(cider-info/format-response (cider-info/info {:ns 'clojure.core :sym 'str}))
(cider-info/format-response (cider-info/info {:ns 'cider.nrepl.middleware.info :sym 'info}))
(cider-info/format-response (cider-info/info {:ns 'clojure.lang.RT :sym 'baseLoader}))

My current thinking is this: we look at the places in orchard.info and orchard.java that return info-like maps with a :file key. If the code creating these maps already has done resource resolution then it adds an extra :resource-url as a string (so a file: or jar: url). And we provide helpers for getting the actual filesystem object from there, e.g. for when you need to know the modification time.

The cider.nrepl.middleware.info can use :resource-url as its :file in "file" ops, if it is present. This should save a lot of resource lookups already, and yield more consistent results. Same for nrepl.lookup.

Finally we can see how CIDER uses this information, heuristics should only be used when no :file is provided.

vemv

I'll be interested in reviewing this PR again later, I guess it's a WIP again

(will be nice to have an updated title / desc summarizing the newer findings)

vemv · 2021-09-22T00:24:02Z

src/orchard/java/parser.clj

+  "Given a jar:file:...!/... URL, return the location of the jar file on the
+  filesystem. Returns nil on any other URL."
+  [^URL jar-resource]
+  (when (= "jar" (url-protocol jar-resource))


(assert (= "jar" (url-protocol jar-resource)) ;; debug info: (url-protocol jar-resource))

is just as cheap, but also ensures our assumptions hold true. Specifically, jar-path has only one caller, so it's pretty easy to reason about what our preconditions are

vemv · 2021-09-24T05:01:20Z

~~One last note, from this branch's CI timings, it looks like certain jobs got slower (30s -> 60s)~~

(other jobs are known to take 2m+ minutes)

This is worth keeping in mind, ideally we wouldn't regress on performance. I think that if solving all these problems at once we'd get a fast build for every job, see also #131 (comment)

plexus · 2021-09-27T09:11:56Z

I pushed a new iteration which simply returns the :resource-url from the parser, and then provides some helpers in orchard.misc.io to work on the resource URL. Going to try this locally with a patched cider-nrepl and report back.

Having looked around a bit more it seems it's going to be challenging to standardize the different keys used for things, e.g.

orchard.java.resource

:root : classpath entry, java.io.File
:file : resource location, java.io.File
:relpath : relative resource path, String
:url : resource location, java.net.URL

orchard.java.parser

:file : relative resource path, String

orchard.info/file-info

:file : resource location, java.net.URL
:resource : relative resource path, String

orchard.info/info*

:file : relative resource path, String

plexus · 2021-10-01T09:08:09Z

I think this is good from my end. It's a fairly conservative change so far, it'll get more interesting once cider-nrepl can starting taking advantage of the :resource-url instead of doing another io/resource lookup. That said this should already be useful for any Orchard consumer because of the better caching of java results.

vemv

Is the PR desc up to date?

vemv · 2021-10-01T09:18:35Z

src/orchard/util/io.clj

+  [^java.net.URL url]
+  (.getProtocol url))
+
+(defn in-archive?


file-within-archive? perhaps?

Originally I thought this meant a different thing ("is this an archive")

Isn't this clear from the param? To me it seems obvious that we're checking if an URL is in archive.

To me it seems obvious that we're checking if an URL is in archive.

Hindsight is always 20/20 :) as mentioned at first the name genuinely confused me. If we pick this code up some years later we can forget these meanings, which is why being extra careful wrt clarity seems desirable for a project like this one.

No biggie though!

This is now url-to-file-within-archive? as in "does this URL point at a file within an archive?"

vemv · 2021-10-01T09:20:00Z

src/orchard/util/io.clj

+  [^java.net.URL url]
+  (= "jar" (url-protocol url)))
+
+(defn jar-location


jar-location-of-file-within-archive?

More verbose, but:

consistent w/ my prior suggestion

does not take a generic name for future use cases

I'm fine with a shorter name as the longer seems extremely verbose. I'm all for clear naming, but we don't want to go overboard here.

this is now resource-jarfile, since it returns the .jar File instance for a given resource (resources being URLs).

vemv · 2021-10-01T09:21:45Z

src/orchard/util/io.clj

+      (io/as-file inner-url))))
+
+(defn resource-artifact
+  "Get the fully qualified file name containing the resource identified by the


Get is confusing, lead me to believe it referred to the return value at first

Rephrased, thanks.

vemv · 2021-10-01T09:23:35Z

test/orchard/java/parser_test.clj

+  (:require [orchard.misc :as misc]
+            [clojure.test :refer :all]))
+
+(when (>= orchard.misc/java-api-version 9)


Can you follow instead this pattern? ffb71cd

Done, I added test-newer-jdks analogous to src-newer-jdks

plexus · 2021-10-01T11:30:41Z

Thanks for the input @vemv and @bbatsov . I'll try to make some time to give this another pass.

bbatsov · 2021-10-01T13:06:38Z

src/orchard/info.clj

@@ -198,3 +198,7 @@
  classes. If no source is available, return the relative path as is."
  [^String path]
  {:javadoc (java/resolve-javadoc-path path)})
+
+(comment


Is this some debug code you forgot or it's something you meant to keep?

I was meaning to keep it.

Got rid of it after all, the tests provide enough examples.

bbatsov · 2021-10-25T08:36:56Z

It'd be nice to include this with the next CIDER release. ETA a couple of weeks. I guess if Arne doesn't have time to tweak this further I'll just merge it as is

plexus · 2021-10-28T10:00:53Z

Hey folks, I've been on holiday the last two weeks, and have a good bunch of other stuff on my plate. This is definitely on my stack for soon ™, but I can't give a hard ETA.

bbatsov · 2021-10-28T10:45:04Z

No rush.

plexus · 2021-11-05T08:48:56Z

Rebased, going through the review comments now, and will then update the description.

The java parser/legacy-parser returns a resource's `:file`, actually a relative path as a string, and `:path`, the relative path converted to absolute. It does this conversion through an `io/resource` lookup to find the file system location of the artifact. There are two issues with this: 1. The return value of `io/resource` is discarded, even though it is needed later on in `orchard.java`, causing unnecessary resource lookups. 2. The path is obtained through `(.getPath (io/resource path))`, which yields nonsensical results when used on resources that are inside a JAR. This change keeps the `:file` returned from java-parser (relative path string), but removes the `:path` value in favor of a `:resource-url`, i.e. the return value of `(io/resource path)`. It then provides utility functions to work with this resource URL directly, namely mapping it to a filesystem artifact, or retrieving its modification time. java.parser/parser already does a `io/resource` lookup for the resource it is parsing, meaning it has a full URL (jar: or file:). By including this URL in the map it returns callers can do further checks or operations on the resource without having to re-scan the classpath. This in turn allows us to simplify and optimize `orchard.java/class-info`. The old version would call `io/resource` on each invocation, even when the result was already cached, in order to find the artifact's modification time. This can noticably speed up cider-nrepl's `"stacktrace"` op, which calls `orchard.java/class-info` on each stack frame, in extreme cases taking multiple seconds to analyze all stack frames and return a result. This leads to rather jarring UX, with the stacktrace buffering popping up in Emacs seconds after the evaluation and exception happened. This is exarcerbated when using the nREPL sideloader, which adds overhead to each resource lookup. This also makes the return value of java-parser more correct. As mentioned the previous version would always call `(.getPath (io/resource path))`, but this only makes sense for `file:` resources, not for `jar:` resources. For `file:` URLs `.getPath` returns the path of the file. For `jar:` URLs it returns the nested url+the jar entry, so a `file:` URL but with a dangling `!/jar/entry`. Illustration of why calling `getPath` on `jar:` URLs is not useful: ```clojure (io/resource "lambdaisland/witchcraft.clj") ;;=> #java.net.URL "file:/srv/mc/witchcraft/src/lambdaisland/witchcraft.clj" (.getPath (io/resource "lambdaisland/witchcraft.clj")) ;;=> "/srv/mc/witchcraft/src/lambdaisland/witchcraft.clj" (io/resource "clojure/lang/RT.class") ;;=> #java.net.URL "jar:file:/root/.m2/repository/org/clojure/clojure/1.10.3/clojure-1.10.3.jar!/clojure/lang/RT.class" (.getPath (io/resource "clojure/lang/RT.class")) ;;=> "file:/root/.m2/repository/org/clojure/clojure/1.10.3/clojure-1.10.3.jar!/clojure/lang/RT.class" ```

plexus · 2021-11-05T10:49:21Z

Squashed, addressed the review comments, updated the commit message and PR title/desc.

vemv · 2021-11-05T10:54:51Z

Eyeing in a bit 👀

vemv · 2021-11-05T11:51:56Z

Illustration of why calling getPath on jar: URLs is not useful:

(.getPath (io/resource "clojure/lang/RT.class"))
;;=> "file:/root/.m2/repository/org/clojure/clojure/1.10.3/clojure-1.10.3.jar!/clojure/lang/RT.class"

The string may not be directly useful but it can be successfully, meaningfully parsed with a simple algo. I think it was (is?) done in cider.el.

I also happen to have Emacs code that parses such strings from cider-nrepl.

Regardless, out of caution we can stick to a simple no-breakage policy

src/orchard/util/io.clj

vemv

If you have the chance to QA this (cider.el -> cider-nrepl -> orchard) it would probably be greatly useful. cider.el's test suite is not exactly an e2e one.

This project has a make install command (after which, I reckon, you should run cider-nrepl's own make install)

.clj-kondo/config.edn

plexus · 2021-11-05T13:07:32Z

@vemv feel free to push changes if you deem them necessary

vemv · 2021-11-08T13:02:41Z

Can't push to your branch (permission denied), can you confirm Allow edits is on?

plexus · 2021-11-08T15:21:56Z

plexus · 2021-11-08T15:23:16Z

I'll make a new pr from a branch on the main repo, just not at the computer right now.

vemv · 2021-11-08T15:27:30Z

Thanks! Yeah it wasn't working for w/e reason 🌀

plexus force-pushed the arne/remove-nonsensible-resource-lookup branch from aebf914 to c71adc3 Compare September 18, 2021 08:54

vemv reviewed Sep 18, 2021

View reviewed changes

plexus force-pushed the arne/remove-nonsensible-resource-lookup branch 3 times, most recently from 639c881 to 82bee7c Compare September 21, 2021 06:47

plexus force-pushed the arne/remove-nonsensible-resource-lookup branch 6 times, most recently from 0861687 to 578e5f9 Compare September 21, 2021 07:56

vemv reviewed Sep 22, 2021

View reviewed changes

plexus force-pushed the arne/remove-nonsensible-resource-lookup branch 2 times, most recently from 8c4a051 to 0a1523e Compare September 27, 2021 09:06

vemv requested changes Oct 1, 2021

View reviewed changes

bbatsov reviewed Oct 1, 2021

View reviewed changes

plexus force-pushed the arne/remove-nonsensible-resource-lookup branch from 0a1523e to 6d8e936 Compare November 5, 2021 08:48

plexus force-pushed the arne/remove-nonsensible-resource-lookup branch from 6d8e936 to 3e047e2 Compare November 5, 2021 10:48

plexus changed the title ~~Remove costly io/resource lookup that wasn't doing anything~~ Retain resource URL in java-parser results, improve caching Nov 5, 2021

vemv reviewed Nov 5, 2021

View reviewed changes

src/orchard/util/io.clj Show resolved Hide resolved

vemv reviewed Nov 5, 2021

View reviewed changes

.clj-kondo/config.edn Show resolved Hide resolved

vemv marked this pull request as draft November 5, 2021 13:20

plexus closed this Nov 11, 2021

plexus mentioned this pull request Nov 11, 2021

Retain resource URL in java-parser results, improve caching #139

Merged

5 tasks

Retain resource URL in java-parser results, improve caching #124

Retain resource URL in java-parser results, improve caching #124

Conversation

plexus commented Sep 18, 2021 • edited Loading

vemv left a comment

Choose a reason for hiding this comment

bbatsov commented Sep 20, 2021

plexus commented Sep 20, 2021

bbatsov commented Sep 20, 2021

plexus commented Sep 20, 2021

plexus commented Sep 21, 2021 • edited Loading

bbatsov commented Sep 21, 2021

plexus commented Sep 21, 2021

bbatsov commented Sep 21, 2021

plexus commented Sep 21, 2021

vemv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vemv commented Sep 24, 2021 • edited Loading

plexus commented Sep 27, 2021

plexus commented Oct 1, 2021

vemv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bbatsov Oct 1, 2021 • edited Loading

Choose a reason for hiding this comment

vemv Oct 4, 2021 • edited by bbatsov Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

plexus commented Oct 1, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bbatsov commented Oct 25, 2021

plexus commented Oct 28, 2021 • edited Loading

bbatsov commented Oct 28, 2021

plexus commented Nov 5, 2021

plexus commented Nov 5, 2021

vemv commented Nov 5, 2021

vemv commented Nov 5, 2021

vemv left a comment

Choose a reason for hiding this comment

plexus commented Nov 5, 2021

vemv commented Nov 8, 2021

plexus commented Nov 8, 2021

plexus commented Nov 8, 2021

vemv commented Nov 8, 2021

plexus commented Sep 18, 2021 •

edited

Loading

plexus commented Sep 21, 2021 •

edited

Loading

vemv commented Sep 24, 2021 •

edited

Loading

bbatsov Oct 1, 2021 •

edited

Loading

vemv Oct 4, 2021 •

edited by bbatsov

Loading

plexus commented Oct 28, 2021 •

edited

Loading