Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retain resource URL in java-parser results, improve caching #124

Conversation

plexus
Copy link
Contributor

@plexus plexus commented Sep 18, 2021

The java parser/legacy-parser returns a resource's :file, actually a relative
path as a string, and :path, the relative path converted to absolute. It does
this conversion through an io/resource lookup to find the file system location
of the artifact. There are two issues with this:

  1. The return value of io/resource is discarded, even though it is needed
    later on in orchard.java, causing unnecessary resource lookups.

  2. The path is obtained through (.getPath (io/resource path)), which yields
    nonsensical results when used on resources that are inside a JAR.

This change keeps the :file returned from java-parser (relative path string),
but removes the :path value in favor of a :resource-url, i.e. the return
value of (io/resource path). It then provides utility functions to work with
this resource URL directly, namely mapping it to a filesystem artifact, or
retrieving its modification time.

java.parser/parser already does a io/resource lookup for the resource it is
parsing, meaning it has a full URL (jar: or file:). By including this URL in the
map it returns callers can do further checks or operations on the resource
without having to re-scan the classpath.

This in turn allows us to simplify and optimize orchard.java/class-info. The
old version would call io/resource on each invocation, even when the result
was already cached, in order to find the artifact's modification time.

This can noticably speed up cider-nrepl's "stacktrace" op, which calls
orchard.java/class-info on each stack frame, in extreme cases taking multiple
seconds to analyze all stack frames and return a result. This leads to rather
jarring UX, with the stacktrace buffering popping up in Emacs seconds after the
evaluation and exception happened. This is exarcerbated when using the nREPL
sideloader, which adds overhead to each resource lookup.

This also makes the return value of java-parser more correct. As mentioned the
previous version would always call (.getPath (io/resource path)), but this
only makes sense for file: resources, not for jar: resources. For file:
URLs .getPath returns the path of the file. For jar: URLs it returns the
nested url+the jar entry, so a file: URL but with a dangling !/jar/entry.

Illustration of why calling getPath on jar: URLs is not useful:

(io/resource "lambdaisland/witchcraft.clj")
;;=> #java.net.URL "file:/srv/mc/witchcraft/src/lambdaisland/witchcraft.clj"

(.getPath (io/resource "lambdaisland/witchcraft.clj"))
;;=> "/srv/mc/witchcraft/src/lambdaisland/witchcraft.clj"

(io/resource "clojure/lang/RT.class")
;;=> #java.net.URL "jar:file:/root/.m2/repository/org/clojure/clojure/1.10.3/clojure-1.10.3.jar!/clojure/lang/RT.class"

(.getPath (io/resource "clojure/lang/RT.class"))
;;=> "file:/root/.m2/repository/org/clojure/clojure/1.10.3/clojure-1.10.3.jar!/clojure/lang/RT.class"

@plexus plexus force-pushed the arne/remove-nonsensible-resource-lookup branch from aebf914 to c71adc3 Compare September 18, 2021 08:54
Copy link
Member

@vemv vemv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Appreciate the PR desc.

I guess this could have test coverage both directly, and more importantly it's worth checking out the consumer of this defn - surely it lacked test coverage as well.

@bbatsov
Copy link
Member

bbatsov commented Sep 20, 2021

Great PR indeed! If we can easily improve the test coverage, that'd be great, but I'm willing to take the improvement regardless.

@plexus
Copy link
Contributor Author

plexus commented Sep 20, 2021

The consumer has tests, but no tests that consider :lastModified. It's a bit of a pain to test, because you'd need a class file that is not inside a jar. I guess you could bundle a pre-compiled .class just for testing, or include a Java file that you compile on the fly.

Not sure if tests for immutable-source-file? would have helped, since it seems the issue is assumptions made about its inputs. You'd only really notice if you had tests that combine the java parser and immutable-source-file?. But I can add a few tests that demonstrate the input it gets, and maybe add a comment or two.

The current behavior is that immutable-source-file? always returns true, (because the io/resource call always returns nil) which means it never expires the cached info about classes. Which for most users will actually be fine, unless they're using something like Virgil to work on Java and Clojure simultaneously. But despite the caching the immutable-source-file? check still happens every time (to decide if it should expire), and that check was needlessly costly.

Actually we could go a step further here and skip the check on class info that is already cached and coming from a JAR, on the assumption that if a class is coming from a JAR now it will still be coming from a JAR in the future.

Seems this also needs a rebase now, I'll get to that.

@bbatsov
Copy link
Member

bbatsov commented Sep 20, 2021

Actually we could go a step further here and skip the check on class info that is already cached and coming from a JAR, on the assumption that if a class is coming from a JAR now it will still be coming from a JAR in the future.

I like this idea!

The consumer has tests, but no tests that consider :lastModified. It's a bit of a pain to test, because you'd need a class file that is not inside a jar. I guess you could bundle a pre-compiled .class just for testing, or include a Java file that you compile on the fly.

That's too much hassle for little return, so let's not bother with it.

@plexus
Copy link
Contributor Author

plexus commented Sep 20, 2021

Looking at this a bit closer the problem is really in java-parser

          (assoc ...
                 :file path
                 :path (.getPath (io/resource path)))

io/resource returns a URL. If that's a file: URL then .getPath will indeed return the path on the filesystem, that seems to be mostly what this code is assuming. But it could also be a jar: or even an http:, ftp: or data: URL (depending on how funky your classloader is). The main cases to consider here are file: and jar: URL, but .getPath on a jar: URL pretty much gives you gibberish, because it strips the jar: but leaves the !/... at the end. (jar URLs look like jar:<url>/!<entry>, and so you get <url>/!<entry>). The correct way to parse a jar URL is apparently via JarURLConnection

Also keep in mind that the <url> inside the jar:<url>.. doesn't have to be a file: url, it could also e.g. be http:.

So I think the better solution is actually for java-parser to do

(assoc ... :file path :resource-url (io/resource path))

and skip the .getPath alltogether.

If it's possible that other callers might still be relying on :path we can leave it in the result as well.

@plexus plexus force-pushed the arne/remove-nonsensible-resource-lookup branch 3 times, most recently from 639c881 to 82bee7c Compare September 21, 2021 06:47
@plexus
Copy link
Contributor Author

plexus commented Sep 21, 2021

Glad I slept on this, on closer inspection immutable-source-file? did return the right result, but in a way that unnecessarily calls io/resource even though this is already done in source-info, and in a way that bypasses the cache check.

I've changed the behavior of parser/source-file slightly, for the case where a file is coming from a jar

;; Before
(select-keys (parser/source-info 'clojure.lang.RT) [:file :path :archive?])
{:file "clojure/lang/RT.java", :path "file:/home/arne/.m2/repository/org/clojure/clojure/1.10.1/clojure-1.10.1-sources.jar!/clojure/lang/RT.java"}

;; After
(select-keys (parser/source-info 'clojure.lang.RT) [:file :path :archive?])
{:file "clojure/lang/RT.java", :path "/home/arne/.m2/repository/org/clojure/clojure/1.10.1/clojure-1.10.1-sources.jar", :archive? true}

So that :path is always a filesystem path, just as it is with non-archive resources. I've also added the :archive? key so we don't have to scan for .(jar|zip)!/ later on.

;; non-archive resources
(select-keys (parser/source-info 'orchard.java.DummyClass) [:file :path :archive?])
{:file "orchard/java/DummyClass.java", :path "/home/arne/github/orchard/test/orchard/java/DummyClass.java", :archive? false}

@plexus plexus force-pushed the arne/remove-nonsensible-resource-lookup branch 6 times, most recently from 0861687 to 578e5f9 Compare September 21, 2021 07:56
@bbatsov
Copy link
Member

bbatsov commented Sep 21, 2021

Seems like a fantastic change to me!

file:/home/arne/.m2/repository/org/clojure/clojure/1.10.1/clojure-1.10.1-sources.jar!/clojure/lang/RT.java

Did you check in CIDER how are those paths currently handled? I wondering if the updated format won't breaking something.

@plexus
Copy link
Contributor Author

plexus commented Sep 21, 2021

I can't immediately find anywhere in CIDER where :path is used, only :file.

Do you have any pointers for things to check? This is specifically for java class info, so things like find-var or find-ns should not be impacted. I ended up here because of its use in stacktrace, and there cider-stacktrace is definitely only checking :file.

The thing is that when info ops return :file has been made absolute, if you look at cider.nrepl.middleware.info.

      (-> info
          (merge
               (when-let [file (:file info)]
                   (info/file-info file))
)
)

Where info/info-file does another resource lookup, this logic is also somewhat flawed

(defn file-info
  [path]
  (let [[resource-relative resource-full] (resource/resource-path-tuple path)]
    (merge {:file (or (file-path path) resource-full path)}
           ;; Classpath-relative path if possible
           (when resource-relative
             {:resource resource-relative}))))

Is this path a classpath-relative path, or a CWD-relative path? resource-path-tuple assumes the former, file-path assumes the latter.

And in this case (when the info is coming from java.parser) we have all this information already...

And then in cider-stacktrace-navigate this is basically ignored, instead it tries to guess first if it can figure out itself where that class source is, even if we just told it where it is...

(let (
         (file (or
                (and (null var) (cider-resolve-java-class class))
                (nrepl-dict-get info "file")
                (button-get button 'file)))
)

@bbatsov
Copy link
Member

bbatsov commented Sep 21, 2021

Do you have any pointers for things to check? This is specifically for java class info, so things like find-var or find-ns should not be impacted. I ended up here because of its use in stacktrace, and there cider-stacktrace is definitely only checking :file.

I briefly checked the code and it seems you're right.

Is this path a classpath-relative path, or a CWD-relative path? resource-path-tuple assumes the former, file-path assumes the latter.

I think all relative paths should be classpath-relative.

And then in cider-stacktrace-navigate this is basically ignored, instead it tries to guess first if it can figure out itself where that class source is, even if we just told it where it is...

Too much legacy - I'm pretty sure we've slowly accumulated a lot of small inconsistencies over the years. I agree the current behavior is flawed.

@plexus
Copy link
Contributor Author

plexus commented Sep 21, 2021

ok, in that case let's take this from the source to the consumer step by step, and try to be consistent about the meaning of each key. In the end there are two things that matter: the classpath-relative path, and the full file/jar URL.

The info maps coming from orchard seem to all use :file as the classpath-relative path. However in the info op in cider-nrepl :file is (at least most of the time it seems?) a full URL (file: or jar:) which makes sense. Except if orchard gives it a relative path that it can't resolve, in that case it stays a relative path...

For your own REPL perusal

(require '[orchard.info :as info] '[orchard.java :as java] '[cider.nrepl.middleware.info :as cider-info])

(info/info 'clojure.core 'str)
(info/info-java 'clojure.lang.RT 'baseLoader)
(java/class-info 'clojure.lang.RT)

(cider-info/format-response (cider-info/info {:ns 'clojure.core :sym 'str}))
(cider-info/format-response (cider-info/info {:ns 'cider.nrepl.middleware.info :sym 'info}))
(cider-info/format-response (cider-info/info {:ns 'clojure.lang.RT :sym 'baseLoader}))

My current thinking is this: we look at the places in orchard.info and orchard.java that return info-like maps with a :file key. If the code creating these maps already has done resource resolution then it adds an extra :resource-url as a string (so a file: or jar: url). And we provide helpers for getting the actual filesystem object from there, e.g. for when you need to know the modification time.

The cider.nrepl.middleware.info can use :resource-url as its :file in "file" ops, if it is present. This should save a lot of resource lookups already, and yield more consistent results. Same for nrepl.lookup.

Finally we can see how CIDER uses this information, heuristics should only be used when no :file is provided.

Copy link
Member

@vemv vemv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be interested in reviewing this PR again later, I guess it's a WIP again

(will be nice to have an updated title / desc summarizing the newer findings)

"Given a jar:file:...!/... URL, return the location of the jar file on the
filesystem. Returns nil on any other URL."
[^URL jar-resource]
(when (= "jar" (url-protocol jar-resource))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(assert (= "jar" (url-protocol jar-resource))
  ;; debug info:
  (url-protocol jar-resource))

is just as cheap, but also ensures our assumptions hold true. Specifically, jar-path has only one caller, so it's pretty easy to reason about what our preconditions are

@vemv
Copy link
Member

vemv commented Sep 24, 2021

One last note, from this branch's CI timings, it looks like certain jobs got slower (30s -> 60s)

(other jobs are known to take 2m+ minutes)

This is worth keeping in mind, ideally we wouldn't regress on performance. I think that if solving all these problems at once we'd get a fast build for every job, see also #131 (comment)

@plexus plexus force-pushed the arne/remove-nonsensible-resource-lookup branch 2 times, most recently from 8c4a051 to 0a1523e Compare September 27, 2021 09:06
@plexus
Copy link
Contributor Author

plexus commented Sep 27, 2021

I pushed a new iteration which simply returns the :resource-url from the parser, and then provides some helpers in orchard.misc.io to work on the resource URL. Going to try this locally with a patched cider-nrepl and report back.


Having looked around a bit more it seems it's going to be challenging to standardize the different keys used for things, e.g.

orchard.java.resource

  • :root : classpath entry, java.io.File
  • :file : resource location, java.io.File
  • :relpath : relative resource path, String
  • :url : resource location, java.net.URL

orchard.java.parser

  • :file : relative resource path, String

orchard.info/file-info

  • :file : resource location, java.net.URL
  • :resource : relative resource path, String

orchard.info/info*

  • :file : relative resource path, String

@plexus
Copy link
Contributor Author

plexus commented Oct 1, 2021

I think this is good from my end. It's a fairly conservative change so far, it'll get more interesting once cider-nrepl can starting taking advantage of the :resource-url instead of doing another io/resource lookup. That said this should already be useful for any Orchard consumer because of the better caching of java results.

Copy link
Member

@vemv vemv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the PR desc up to date?

[^java.net.URL url]
(.getProtocol url))

(defn in-archive?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file-within-archive? perhaps?

Originally I thought this meant a different thing ("is this an archive")

Copy link
Member

@bbatsov bbatsov Oct 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this clear from the param? To me it seems obvious that we're checking if an URL is in archive.

Copy link
Member

@vemv vemv Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it seems obvious that we're checking if an URL is in archive.

Hindsight is always 20/20 :) as mentioned at first the name genuinely confused me. If we pick this code up some years later we can forget these meanings, which is why being extra careful wrt clarity seems desirable for a project like this one.

No biggie though!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now url-to-file-within-archive? as in "does this URL point at a file within an archive?"

[^java.net.URL url]
(= "jar" (url-protocol url)))

(defn jar-location
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jar-location-of-file-within-archive?

More verbose, but:

  • consistent w/ my prior suggestion
  • does not take a generic name for future use cases

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with a shorter name as the longer seems extremely verbose. I'm all for clear naming, but we don't want to go overboard here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is now resource-jarfile, since it returns the .jar File instance for a given resource (resources being URLs).

(io/as-file inner-url))))

(defn resource-artifact
"Get the fully qualified file name containing the resource identified by the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get is confusing, lead me to believe it referred to the return value at first

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrased, thanks.

(:require [orchard.misc :as misc]
[clojure.test :refer :all]))

(when (>= orchard.misc/java-api-version 9)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you follow instead this pattern? ffb71cd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I added test-newer-jdks analogous to src-newer-jdks

@plexus
Copy link
Contributor Author

plexus commented Oct 1, 2021

Thanks for the input @vemv and @bbatsov . I'll try to make some time to give this another pass.

@@ -198,3 +198,7 @@
classes. If no source is available, return the relative path as is."
[^String path]
{:javadoc (java/resolve-javadoc-path path)})

(comment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this some debug code you forgot or it's something you meant to keep?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was meaning to keep it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got rid of it after all, the tests provide enough examples.

@bbatsov
Copy link
Member

bbatsov commented Oct 25, 2021

It'd be nice to include this with the next CIDER release. ETA a couple of weeks. I guess if Arne doesn't have time to tweak this further I'll just merge it as is

@plexus
Copy link
Contributor Author

plexus commented Oct 28, 2021

Hey folks, I've been on holiday the last two weeks, and have a good bunch of other stuff on my plate. This is definitely on my stack for soon ™, but I can't give a hard ETA.

@bbatsov
Copy link
Member

bbatsov commented Oct 28, 2021

No rush.

@plexus plexus force-pushed the arne/remove-nonsensible-resource-lookup branch from 0a1523e to 6d8e936 Compare November 5, 2021 08:48
@plexus
Copy link
Contributor Author

plexus commented Nov 5, 2021

Rebased, going through the review comments now, and will then update the description.

The java parser/legacy-parser returns a resource's `:file`, actually a relative
path as a string, and `:path`, the relative path converted to absolute. It does
this conversion through an `io/resource` lookup to find the file system location
of the artifact. There are two issues with this:

1. The return value of `io/resource` is discarded, even though it is needed
later on in `orchard.java`, causing unnecessary resource lookups.

2. The path is obtained through `(.getPath (io/resource path))`, which yields
nonsensical results when used on resources that are inside a JAR.

This change keeps the `:file` returned from java-parser (relative path string),
but removes the `:path` value in favor of a `:resource-url`, i.e. the return
value of `(io/resource path)`. It then provides utility functions to work with
this resource URL directly, namely mapping it to a filesystem artifact, or
retrieving its modification time.

java.parser/parser already does a `io/resource` lookup for the resource it is
parsing, meaning it has a full URL (jar: or file:). By including this URL in the
map it returns callers can do further checks or operations on the resource
without having to re-scan the classpath.

This in turn allows us to simplify and optimize `orchard.java/class-info`. The
old version would call `io/resource` on each invocation, even when the result
was already cached, in order to find the artifact's modification time.

This can noticably speed up cider-nrepl's `"stacktrace"` op, which calls
`orchard.java/class-info` on each stack frame, in extreme cases taking multiple
seconds to analyze all stack frames and return a result. This leads to rather
jarring UX, with the stacktrace buffering popping up in Emacs seconds after the
evaluation and exception happened. This is exarcerbated when using the nREPL
sideloader, which adds overhead to each resource lookup.

This also makes the return value of java-parser more correct. As mentioned the
previous version would always call `(.getPath (io/resource path))`, but this
only makes sense for `file:` resources, not for `jar:` resources. For `file:`
URLs `.getPath` returns the path of the file. For `jar:` URLs it returns the
nested url+the jar entry, so a `file:` URL but with a dangling `!/jar/entry`.

Illustration of why calling `getPath` on `jar:` URLs is not useful:

```clojure
(io/resource "lambdaisland/witchcraft.clj")
;;=> #java.net.URL "file:/srv/mc/witchcraft/src/lambdaisland/witchcraft.clj"

(.getPath (io/resource "lambdaisland/witchcraft.clj"))
;;=> "/srv/mc/witchcraft/src/lambdaisland/witchcraft.clj"

(io/resource "clojure/lang/RT.class")
;;=> #java.net.URL "jar:file:/root/.m2/repository/org/clojure/clojure/1.10.3/clojure-1.10.3.jar!/clojure/lang/RT.class"

(.getPath (io/resource "clojure/lang/RT.class"))
;;=> "file:/root/.m2/repository/org/clojure/clojure/1.10.3/clojure-1.10.3.jar!/clojure/lang/RT.class"
```
@plexus plexus force-pushed the arne/remove-nonsensible-resource-lookup branch from 6d8e936 to 3e047e2 Compare November 5, 2021 10:48
@plexus plexus changed the title Remove costly io/resource lookup that wasn't doing anything Retain resource URL in java-parser results, improve caching Nov 5, 2021
@plexus
Copy link
Contributor Author

plexus commented Nov 5, 2021

Squashed, addressed the review comments, updated the commit message and PR title/desc.

@vemv
Copy link
Member

vemv commented Nov 5, 2021

Eyeing in a bit 👀

@vemv
Copy link
Member

vemv commented Nov 5, 2021

Illustration of why calling getPath on jar: URLs is not useful:

(.getPath (io/resource "clojure/lang/RT.class"))
;;=> "file:/root/.m2/repository/org/clojure/clojure/1.10.3/clojure-1.10.3.jar!/clojure/lang/RT.class"

The string may not be directly useful but it can be successfully, meaningfully parsed with a simple algo. I think it was (is?) done in cider.el.

I also happen to have Emacs code that parses such strings from cider-nrepl.

Regardless, out of caution we can stick to a simple no-breakage policy

Copy link
Member

@vemv vemv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have the chance to QA this (cider.el -> cider-nrepl -> orchard) it would probably be greatly useful. cider.el's test suite is not exactly an e2e one.

This project has a make install command (after which, I reckon, you should run cider-nrepl's own make install)

.clj-kondo/config.edn Show resolved Hide resolved
@plexus
Copy link
Contributor Author

plexus commented Nov 5, 2021

@vemv feel free to push changes if you deem them necessary

@vemv vemv marked this pull request as draft November 5, 2021 13:20
@vemv
Copy link
Member

vemv commented Nov 8, 2021

Can't push to your branch (permission denied), can you confirm Allow edits is on?

@plexus
Copy link
Contributor Author

plexus commented Nov 8, 2021

Screenshot_20211108-162122

@plexus
Copy link
Contributor Author

plexus commented Nov 8, 2021

I'll make a new pr from a branch on the main repo, just not at the computer right now.

@vemv
Copy link
Member

vemv commented Nov 8, 2021

Thanks! Yeah it wasn't working for w/e reason 🌀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants