Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The current Java source file analyser doesn't detect implicit package usages #1

Open
salmonb opened this issue Jul 21, 2022 · 21 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@salmonb
Copy link
Member

salmonb commented Jul 21, 2022

The current Java source analyser is a simple implementation based on regular expressions. It can detect only packages that are explicitly written in the Java source file (such as the ones listed in the imports, or in the code when using a fully qualified class name). However, it's possible to have implicit usages of packages, and the CLI currently doesn't detect them, like for example in the following code:

package mypackage;

import packageA; // Where ClassA is defined in moduleA
import packageC; // Where ClassC is defined in moduleC

public class MyClass {

    public void myMethod() {
        ClassA a = new CLassA();
        ClassC c = a.getB().getC(); // implicit usage of packageB where ClassB is defined <= not detected by the CLI 
    }

}

So it needs to be replaced with a much more elaborate Java syntax parser, able to detect the getB() call and tell what type is the returned value (through an analyse of the ClassA source).

There are probably other cases of implicit package usage, such as those coming from inheritance.

Any volunteer to work on this issue?

@salmonb salmonb added bug Something isn't working help wanted Extra attention is needed labels Jul 21, 2022
@amischler
Copy link

Spoon might help.

@salmonb
Copy link
Member Author

salmonb commented Feb 28, 2023

Thanks for letting me know about this tool, I didn't know it, and it looks very interesting. I will investigate as soon as I will have time.

The About page says Spoon is developed at Inria Lille, and your GitHub profile also says you're from Lille.
Is it just a coincidence? 😜

@amischler
Copy link

Not a coincidence ;-)

I have been using this library in past projects and know some team members.

@AlexBelchUK
Copy link

Hmm I think there is already a Java parser with the JDK as part of the Compiler API. What are the required inputs and outputs needed ? - Would it be that given a set of Java source files to parse, a unique list of import strings would be returned covering the standard imports plus any implicit import as in the example e.g. a.getB().getC() ?

@salmonb
Copy link
Member Author

salmonb commented Mar 23, 2023

Yes, that's exactly it. For each Java file, I need the list of packages explicitly or implicitly used by that file.

Does the JDK Java parser have an API directly usable from Java?

@AlexBelchUK
Copy link

I think it does yes, you can use the compile api parse() method to return a tree of compile unit declarations (classes etc) and then traverse the tree to get other tree parts such as packages etc. I don't know yet about the case of transitive but a little prototype may help discover more detail.

@salmonb
Copy link
Member Author

salmonb commented Mar 23, 2023

Yes, if you can make a prototype, that would be wonderful. It can be a separate project in your own repo to start with.

@AlexBelchUK
Copy link

Hi Bruno, I tried out using the java parser and came to.a stop as the parser could not extract A.getB().grtC() as it had not enough information. I've looked at the Apache BCEL byte code engineering library which works on .class files. I think we can get the information using that. Each class file has a constant pool and a string pool. I think we can use BCEL to get the classes used from the constant pool - it has an index that points to a string in the string pool and then we get the package and class name from the sting pool. The strings need a little manipulation to cope with arrays etc. This means that the cli export would need to pass in .class files after the maven compile phase (so the pre package phase would do)

@salmonb
Copy link
Member Author

salmonb commented Mar 30, 2023

Hi Alex, thanks for this research that saves me a lot of time. I don't think we can use any tool based on byte code such as Apache BCEL, because the whole idea of the WebFX CLI is to create the build chain from an analysis of the source code, and you can't get the byte code without the build chain... Byte code could be ok for third-party libraries but not for the project itself. It's only after a WebFX update pass that the build chain is ready to compile a WebFX project. So the WebFX CLI can't require any byte code as input to do its update pass.

So I think we need to keep investigating alternative Java source parsers.
I hope to have a bit of time over the weekend to start investigating spoon.
Would you mind to investigate this Java parser in the meantime? (it looks like quite popular)
And we can share any progress made?

@AlexBelchUK
Copy link

Can I ask in the case of A.getB().getC() would the source files for A.java, B.java and C.java be supplied ? If this is the case then just parsing each class would build up a list of packages and classes using the prototype I've written so far with further work would do that. I suspect you would need a parser that follows the source path for each file to resolve the A.getB().getC() scenario - quick look at JavaParser I was only supplying a single source file.. Just thinking if there is another way to get the information needed.

@salmonb
Copy link
Member Author

salmonb commented Mar 31, 2023

This is the tricky part!.. It would be inefficient in terms of performance to provide straightaway all the Java sources that may be used by the WebFX project. At the beginning, the input should be the project source files only. Then let's say one of these Java files uses A.getB().getC() and A is not part of the project but part of a library declared in webfx.xml, the parser should at least report at this point the package that class A is belonging to (but not B and C at this point). Ideally, it would be great if the parser had a callback feature to request more sources. Then the WebFX CLI could search which library this package belongs to among the declared libraries (this part is already implemented), and once found, pass it back to the parser so it can keep going, and eventually ask about B & C if these classes are also declared in other libraries. At the end of this process, we should have the complete information.

That callback feature would be the ideal Java Parser API for us. Can you check if the parser you're investigating provides one?

If not, maybe we can try to make a process with several passes. The first pass would be with the project sources only. The result of this first parsing would be partial because of the missing sources, but at least we should be able to collect all external packages. Then the WebFX CLI can resolve these packages, and we start another pass with the scope extended (project + libraries declaring these packages). And we repeat that process until there is no more external packages. It will be less efficient in terms of performance to have several passes, but it should work.

@salmonb
Copy link
Member Author

salmonb commented Apr 2, 2023

Just to share my investigation on Spoon, I didn't go far as I was stopped by a few bugs (see the issues I opened here and here). These bugs are quite basic but I guess this is because Spoon has not been intensively tested with a partial set a sources (probably most people provide a full set).

That said, I'm not concluding that Spoon is not the right tool. It's apparently possible to do incremental parsing (no need to parse everything again when the scope increases), which is a very good point for our use case with several passes. I didn't want to investigate more because of the bugs, but I will restart once they fix them (these issues are also a good opportunity to see how active their project is).

@AlexBelchUK
Copy link

I investigated JavaParser and it looked like up to JDK 13 is supported but beyond that not yet - lacking effort. The code itself is quite similar in some ways to the JDK compiler API regarding source level parsing and class API, there is some statement evaluation but not useful enough for our needs. The java compiler API is better as it is up to date i.e. JDK 20+ I have gone back to my prototype code and I'm working,albeit gradually, to get the prototype code to extract all class, field, method parameter, method variable, method return type, inner declared classes, exceptions, annotations, generic types, records, enum objects, interfaces etc from a file and hold them in a package and class list. I will then resolve packages the classes etc belong to with a set of rules and then have a callable interface to ask cli 'does this package and class exist ?' - if cli can't resolve to a package path and class then it would return null i.e package path / file not found. The parser would then try a different package and class combination based on the imports for the current file. If cli found a file then that would then be passed in to parse and the cycle continue recursively. I would also keep track of visited files to prevent a circularity error. In the webfx.xml I think there would still need to be a manual section for dynamically loaded classes as these would not get parsed and possibly other case where sources were not available but were needed in order to fully resolve a package. As it is Easter I'm away so possibly do some more next weekend.

@I-Al-Istannen
Copy link

@salmonb

These bugs are quite basic but I guess this is because Spoon has not been intensively tested with a partial set a sources (probably most people provide a full set).

Yes and no. Spoon relies on the eclipse compiler (JDT) to analyze the source code and then transforms its internal representation into a more high-level format. If you give it an incomplete classpath, JDT will guess its way through it and spoon tries to fix up the results. That part of spoon is quite error prone compared to having the full classpath though — with an incomplete classpath many java classes are suddenly ambiguous. Historically, these issues were handled whenever they crept up (and were at least partially fixable).

As spoon does not maintain its own compiler, the semantic analysis it can perform on invalid code is limited by what JDT offers. Spoon just massages the problem AST nodes created by JDT to squeeze out a more plausible AST, but it fundamentally cannot and does not always produce correct results. This area of spoon also hasn't seen the most love: Quite often there are semantic clues spoon could use to disambiguate cases.


That being said, Spoon is able to pick up on classes in your classpath. If you have compiled libraries, just supplying them as additional classpath resources and only giving it the project's source code should also lift you into a much better position. Spoon does not need to analyze the whole source of all libraries, just reflectively introspect the types referenced by your code.


Another project you could look at (though the docs are severely lacking) is IntelliJ PSI. I haven't used it much, but I would expect it to also handle incomplete types gracefully. It's the backbone of their IDE, after all.
Spoon is not married to a single AST producer (it uses reflection/JDT currently), but JavaC and PSI were the alternatives the previous maintainers explored at some point.

@salmonb
Copy link
Member Author

salmonb commented Apr 11, 2023

Thanks @I-Al-Istannen for your comment and explanation.

Regarding your suggestion to supply the additional classpath to libraries, it doesn't fit in my use case as I'm writing a tool whose aim is to (re)create the Maven build chain (pom.xml, etc...) of a project from an analysis of its source code alone. So the starting point is necessarily an incomplete classpath (the project source code without the libraries) and the source code analysis must return the external missing packages that my tool will then resolve by finding the appropriate libraries in Maven remote repositories (using some minimal information). So it's a bit the opposite way as a normal usage.

I saw you submitted a PR for the issue I raised, thank you very much for this contribution. I will restart my tests with Spoon once your PR will be merged.

If it's still not matching my needs, I will give a try to IntelliJ PSI. Thanks also for that suggestion.

@AlexBelchUK
Copy link

Bruno, can you see my webfx-prototype project on AlexBelchUK GitHub or do I need to publish to make visible ? It's still work in progress at the moment though

@salmonb
Copy link
Member Author

salmonb commented Apr 11, 2023

@AlexBelchUK, thanks for your previous feedback Alex, I was also unavailable during the Easter weekend. Ok for continuing your investigation with the JDK compiler API 👍

Yes I can see your prototype and clone it, let me know how it works. If you give it A.java and B.java as input (as described in that Spoon issue), will it detect that a.A is using c.C (and therefore a.A is using package c)?

@AlexBelchUK
Copy link

AlexBelchUK commented Apr 11, 2023 via email

@salmonb
Copy link
Member Author

salmonb commented Apr 12, 2023

@AlexBelchUK I tried to use your prototype but got a NPE (ClassDefinitionData.packageClassList was not initialised), did you push the latest version?

I suggest we continue with concrete simple cases, so I created this webfx-cli-cases repository, where I will put some concrete cases we can work on.

The first case webfx-cli-case1 has 3 classes (a.A, b.B and c.C) and no external dependencies (we will work on external dependencies after we make this first case work). a.A is explicitly importing b.B but implicitly using c.C (through b.B). So the goal of this case is to check that the tool can detect that a.A is implicitly using c.C.

I created this webfx-cli-spoon repository, and the class SpoonCase1 analyses webfx-cli-case1. It correctly detects the implicit usage as you can see in the output:

a.A is using:
 - c.C (package c)
 - b.B (package b)
b.B is using:
 - c.C (package c)
c.C is using:

Could you please write a PrototypeCase1 class in your repository that analyses webfx-cli-case1 and produces a similar output showing that it detects the implicit usage of c.C for a.A?

@AlexBelchUK
Copy link

Hi Bruno,

I should try to explain that the code I'm writing is very much 'in-progress' and incomplete,
I just push the (incomplete) progress done so far. I've still to connect the parse and resolve 
classes together, and write more in the parse class. I'm working full time and work 
commitments mean less time to try things out - so slow going. Once I get to
a runnable stage, I'll try the simple case - I did add a set of tests in my prototype project
to check extraction and lookup of classes used in various scenarios so these may be of help
to you. 

It may be that the spoon API may provide you with what you need, it looks to be able to
perform a source file search and resolve - I'd be interested to know if It can resolve a record 
(one of the newer language features) e.g: 

    public record ARecord(B b, C c) { }   

Alex

@salmonb
Copy link
Member Author

salmonb commented Apr 14, 2023

Hi Alex, no worries and no pressure, we all have busy lives. It was just my wrong interpretation that you already had some results to share with your prototype repository, sorry...

Yes, Spoon recognises records. It's based on the eclipse compiler (JDT), as I-Al-Istannen explained above, and it supports Java level up to 19.

Yes the Spoon API is good for the WebFX CLI, but not the runtime, too buggy at the moment when used with incomplete classpath... But this week they fixed one of the 2 issues I submitted (the simplest one), so we will see how it goes.

So, Spoon is uncertain at this stage, and I'm still interested in your prototype. I understand it's a lower level to directly work with the JDK compiler, but if we can make it work, it's better as we won't be dependent on another project to fix the bugs for our use case. Also it will be probably faster, because it looks like Spoon is building a lot of stuff on top of the compiler before giving the report, and the performance of webfx update will be important for WebFX developers.

Let me know if you make any progress, but take your time, I know it's an additional effort to your job, which makes your contribution even more appreciable :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Development

No branches or pull requests

4 participants