Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

toposort use iterator to avoid stackoverflow #12286

Merged
merged 9 commits into from
May 15, 2023

Conversation

tang-hi
Copy link
Contributor

@tang-hi tang-hi commented May 11, 2023

Description

In Issue #12284, I observed that Lucene limits the recursion level in the topoSortStatesRecurse method to avoid a StackOverflow error during automaton topological sorting. I propose we could use an iterative approach instead of recursion.
I've implemented an iterative version as shown below.

private static int topoSortStatesRecurse(
      Automaton a, BitSet visited, int[] states) {
      
    Stack<Integer> stack = new Stack<>();
    stack.push(0); // Assuming that the initial state is 0.
    int upto = 0;
    Transition t = new Transition();

    while (!stack.empty()) {
        int state = stack.pop();

        int count = a.initTransition(state, t);
        for (int i = 0; i < count; i++) {
            a.getNextTransition(t);
            if (!visited.get(t.dest)) {
                visited.set(t.dest);
                stack.push(t.dest);
            }
        }
        states[upto] = state;
        upto++;
    }
    return upto;
}

However, I noticed that the test TestAutomaton.java depends on the order in which recursive calls are made or items are added to and removed from the stack. To maintain the same order as the recursive version in the iterative approach, so I've used a particular technique in the pull request.

To further improve this change, here are my plans:

  1. Remove the trick and modify the test code accordingly.
  2. Perhaps we should check if the automaton contains cycles, and throw an IllegalArgumentException if it does?
  3. Should we continue to limit the size of the Automaton?
    I welcome any suggestions or feedback on this approach.

@mikemccand
Copy link
Member

Thank you @tang-hi! We have tried to build iterative versions of this in the past, but it was conterversial because it was so much more complicated than the existing recursive version. But your iterative solution looks very simple!

However, I noticed that the test TestAutomaton.java depends on the order in which recursive calls are made or items are added to and removed from the stack.

The test really should not be relying on such an implementation detail? By definition, toposort might have many valid outputs for the same input graph. The test should only assert that the returned order indeed ensures that all transitions are only forwards in the sorted states. So I think it's fine to fix this test bug and keep the simpler iterative solution here?

  1. Should we continue to limit the size of the Automaton?

I don't think we should.

  1. Perhaps we should check if the automaton contains cycles, and throw an IllegalArgumentException if it does?

That would be nice, though it really is illegal usage of the API -- can you detect that w/o much added cost in the iterative solution?

@tang-hi
Copy link
Contributor Author

tang-hi commented May 11, 2023

Thank you, @mikemccand, for your valuable feedback. I've made updates to my pull request based on your suggestions.

Regarding the modification of the test code, it appears there's no need for adjustments. The current test is correct. It was my initial implementation that had issues, which caused the test failure.

As for checking if the automaton contains cycles, I have incorporated this check in the updated version. I have used a BitSet to track whether the current state is already on the stack. If we encounter a state that's already on the stack, we've detected a cycle. You can refer to this resource for more information on cycle detection.

While adding the BitSet introduces a small overhead and slightly increases the complexity of the code, I believe the code remains understandable. Please let me know if you have any further suggestions or feedback. 😄

@tang-hi tang-hi marked this pull request as ready for review May 11, 2023 17:08
@rmuir
Copy link
Member

rmuir commented May 12, 2023

i really like it, thanks for doing this!

code is still simple, but more robust (heap instead of stack!) and the cycle detection is a really nice bonus. thanks for the unit test for that.

i made a cosmetic suggestion.

Also, should we add a basic test that the toposort works correctly to TestOperations? TestAutomaton does exercise it too, but I'm not sure how well or how easy to debug if it fails :)

@dweiss
Copy link
Contributor

dweiss commented May 12, 2023

Use ArrayDeque instead of java.util.Stack (synchronized methods)?

@uschindler
Copy link
Contributor

Use ArrayDeque instead of java.util.Stack (synchronized methods)?

We should but this on forbiddenapis.

@tang-hi
Copy link
Contributor Author

tang-hi commented May 12, 2023

@rmuir I have added a unit test to evaluate the toposort functionality. I developed a method that generates a random automaton, and you can include a cycle by passing the hasCycle argument. This will allow us to test both toposort and cycle detection.

@uschindler @dweiss Thank you for your advice. I have made the change from using Stack to Deque

@tang-hi tang-hi requested a review from uschindler May 12, 2023 14:22
@tang-hi tang-hi requested a review from rmuir May 13, 2023 12:43
Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only 2 minor comments, but it's approved in general.

@tang-hi
Copy link
Contributor Author

tang-hi commented May 15, 2023

Maybe we could merge this pull request if there are no other comments.

@uschindler
Copy link
Contributor

I can take care of that, but I would like to get a final "go" by @rmuir. :-)
Uwe

*
* @param a The automaton whose states are to be topologically sorted.
* @param states An int array which stores the states.
* @return the reversed topologically sorted array of state ids
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: this should say 'returns the number of states in the final sorted list' I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, My bad.Already fix that

@uschindler uschindler self-assigned this May 15, 2023
@uschindler uschindler added this to the 9.7.0 milestone May 15, 2023
Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great. thanks again.

@uschindler uschindler merged commit 5d203f8 into apache:main May 15, 2023
asfgit pushed a commit that referenced this pull request May 15, 2023
Co-authored-by: tangdonghai <tangdonghai@meituan.com>
# Conflicts:
#	lucene/CHANGES.txt
@tang-hi
Copy link
Contributor Author

tang-hi commented May 15, 2023

Thank you everyone for your valuable comments on my PR.

@uschindler
Copy link
Contributor

Thanks @tang-hi for the nice work!

@uschindler
Copy link
Contributor

One additional thing as my last comment: We moved from recursive to iterative, but we still have a stack (deque). It is not so limited like the OS stack by the Java VM, but still for some FSA it could consume a lot of heap space.

Any thoughts?

@tang-hi
Copy link
Contributor Author

tang-hi commented May 15, 2023

One additional thing as my last comment: We moved from recursive to iterative, but we still have a stack (deque). It is not so limited like the OS stack by the Java VM, but still for some FSA it could consume a lot of heap space.

Any thoughts?

The only approach I can think of to limit heap consumption is to restrict the size of deque. However, I believe this will only occur in special cases. In addition, if users wish to perform topological sorting on large automata and are prepared to accommodate the associated memory consumption and time costs, I don't think we should impose any limitations.If there are any better suggestions, I am open to them at any time.

@mikemccand
Copy link
Member

I think the RAM consumption is OK, but, we should clearly advertise it in the javadocs for this method? Since we detect cycles we will never have an "attempt to use infinite RAM".

@uschindler
Copy link
Contributor

I think we have the same problem with other methods in this class that were transformed to be iterative earlier.
So we should maybe make a comment about what types of automatons could consume lot of space. But for end users it might be important to have this in our query classes.

@tang-hi
Copy link
Contributor Author

tang-hi commented May 15, 2023

I think adding a comment would be great. I have already submitted a new pull request. @uschindler @mikemccand

@uschindler
Copy link
Contributor

Looks fine, I can merge that later, no need for additional changes.txt.

@uschindler
Copy link
Contributor

uschindler commented May 15, 2023

Hi,
I had to fix the test to use TestUtil#nextInt(min, max) as Java 11 has no two-parameter Random#nextInt(origin, bound). I had to substract 1 from bound as its is inclusive in TestUtil.

@rmuir
Copy link
Member

rmuir commented May 15, 2023

@tang-hi @uschindler @mikemccand I think now that there are no more recursive algorithms in src/java we can now move Operations.MAX_RECURSION_LEVEL to AutomatonTestUtil in src/test? This test helper still uses it only because it has some simple recursive implementations used for validating correctness. And I think it is a good step to be able to move it to just tests!

@mikemccand
Copy link
Member

@tang-hi @uschindler @mikemccand I think now that there are no more recursive algorithms in src/java we can now move Operations.MAX_RECURSION_LEVEL to AutomatonTestUtil in src/test? This test helper still uses it only because it has some simple recursive implementations used for validating correctness. And I think it is a good step to be able to move it to just tests!

Yay, what a great milestone!

@tang-hi
Copy link
Contributor Author

tang-hi commented May 16, 2023

@rmuir Already raise a PR to move that.

hiteshk25 pushed a commit to cowpaths/lucene that referenced this pull request Jul 18, 2023
…dc8ca633e8bcf`) (#20)

* Add next minor version 9.7.0

* Fix SynonymQuery equals implementation (apache#12260)

The term member of TermAndBoost used to be a Term instance and became a
BytesRef with apache#11941, which means its equals impl won't take the field
name into account. The SynonymQuery equals impl needs to be updated
accordingly to take the field into account as well, otherwise synonym
queries with same term and boost across different fields are equal which
is a bug.

* Fix MMapDirectory documentation for Java 20 (apache#12265)

* Don't generate stacktrace in CollectionTerminatedException (apache#12270)

CollectionTerminatedException is always caught and never exposed to users so there's no point in filling
in a stack-trace for it.

* add missing changelog entry for apache#12260

* Add missing author to changelog entry for apache#12220

* Make query timeout members final in ExitableDirectoryReader (apache#12274)

There's a couple of places in the Exitable wrapper classes where
queryTimeout is set within the constructor and never modified. This
commit makes such members final.

* Update javadocs for QueryTimeout (apache#12272)

QueryTimeout was introduced together with ExitableDirectoryReader but is
now also optionally set to the IndexSearcher to wrap the bulk scorer
with a TimeLimitingBulkScorer. Its javadocs needs updating.

* Make TimeExceededException members final (apache#12271)

TimeExceededException has three members that are set within its constructor and never modified. They can be made final.

* DOAP changes for release 9.6.0

* Add back-compat indices for 9.6.0

* `ToParentBlockJoinQuery` Explain Support Score Mode (apache#12245) (apache#12283)

* `ToParentBlockJoinQuery` Explain Support Score Mode

---------

Co-authored-by: Marcus <marcuseagan@gmail.com>

* Simplify SliceExecutor and QueueSizeBasedExecutor (apache#12285)

The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden.

* [Backport] GITHUB-11838 Add api to allow concurrent query rewrite (apache#12197)

* GITHUB-11838 Change API to allow concurrent query rewrite (apache#11840)

Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher)

Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>

Backport of apache#11840

Changes from original:
 - Query keeps `rewrite(IndexReader)`, but it is now deprecated
 - VirtualMethod is used to correct delegate to the overridden methods
 - The changes to `RewriteMethod` type classes are reverted, this increased the backwards compatibility impact. 

------------------------------

### Description
Issue: apache#11838 

#### Updated Proposal
 * Change signature of rewrite to `rewrite(IndexSearcher)`
 * How did I migrate the usage:
   * Use Intellij to do preliminary refactoring for me
   * For test usage, use searcher whenever is available, otherwise create one using `newSearcher(reader)`
   * For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using `new IndexSearcher(reader)`, tried my best to avoid creating it recurrently (Especially in `FieldQuery`)
   * For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher.

* Backport: Concurrent rewrite for KnnVectorQuery (apache#12160) (apache#12288)

* Concurrent rewrite for KnnVectorQuery (apache#12160)


- Reduce overhead of non-concurrent search by preserving original execution
- Improve readability by factoring into separate functions

---------

Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>

* adjusting for backport

---------

Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>

* toposort use iterator to avoid stackoverflow (apache#12286)

Co-authored-by: tangdonghai <tangdonghai@meituan.com>
# Conflicts:
#	lucene/CHANGES.txt

* Fix test to compile with Java 11 after backport of apache#12286

* Update Javadoc for topoSortStates method after apache#12286 (apache#12292)

* Optimize HNSW diversity calculation (apache#12235)

* Minor cleanup and improvements to DaciukMihovAutomatonBuilder (apache#12305)

* GITHUB-12291: Skip blank lines from stopwords list. (apache#12299)

* Wrap Query rewrite backwards layer with AccessController (apache#12308)

* Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (apache#12315)

* Add multi-thread searchability to OnHeapHnswGraph (apache#12257)

* Fix backport error

* [MINOR] Update javadoc in Query class (apache#12233)

- add a few missing full stops
- update wording in the description of Query#equals method

* [Backport] Integrate the Incubating Panama Vector API apache#12311 (apache#12327)

Leverage accelerated vector hardware instructions in Vector Search.

Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Pamana Foreign API. This change expands this mechanism to include the Incubating Pamana Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously).

Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21.
---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Robert Muir <rmuir@apache.org>

* Parallelize knn query rewrite across slices rather than segments (apache#12325)

The concurrent query rewrite for knn vectory query introduced with apache#12160
requests one thread per segment to the executor. To align this with the
IndexSearcher parallel behaviour, we should rather parallelize across
slices. Also, we can reuse the same slice executor instance that the
index searcher already holds, in that way we are using a
QueueSizeBasedExecutor when a thread pool executor is provided.

* Optimize ConjunctionDISI.createConjunction (apache#12328)

This method is showing up as a little hot when profiling some queries.
Almost all the time spent in this method is just burnt on ceremony
around stream indirections that don't inline.
Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min
cost) makes this method far cheaper and easier to read.

* Update changes to be correct with ARM (it is called NEON there)

* GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated (apache#12332)

Preparing to reduce visibility of this class in a future release

* add BitSet.clear() (apache#12268)

# Conflicts:
#	lucene/CHANGES.txt

* Clenaup and update changes and synchronize with 9.x

* Update TestVectorUtilProviders.java (apache#12338)

* Don't generate stacktrace for TimeExceededException (apache#12335)

The exception is package private and never rethrown, we can avoid
generating a stacktrace for it.

* Introduced the Word2VecSynonymFilter (apache#12169)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* Word2VecSynonymFilter constructor null check (apache#12169)

* Use thread-safe search version of HnswGraphSearcher (apache#12246)

Addressing comment received in the PR apache#12246

* Word2VecSynonymProvider to use standard Integer max value for hnsw searches (apache#12235)
We observed this change was not ported previously from main in an old cherry-pick

* Fix searchafter high latency when after value is out of range for segment (apache#12334)

* Make memory fence in `ByteBufferGuard` explicit (apache#12290)

* Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit (apache#12320)

* Add updateDocuments API which accept a query (reopen) (apache#12346)

* GITHUB#11350: Handle backward compatibility when merging segments with different FieldInfo

This commits restores Lucene 9's ability to handle indices created with Lucene 8 where there are discrepancies in FieldInfos, such as different IndexOptions

* [Tessellator] Improve the checks that validate the diagonal between two polygon nodes (apache#12353)

# Conflicts:
#	lucene/CHANGES.txt

* feat: soft delete optimize (apache#12339)

* Better paging when random reads go backwards (apache#12357)

When reading data from outside the buffer, BufferedIndexInput always resets
its buffer to start at the new read position. If we are reading backwards (for example,
using an OffHeapFSTStore for a terms dictionary) then this can have the effect of
re-reading the same data over and over again.

This commit changes BufferedIndexInput to use paging when reading backwards,
so that if we ask for a byte that is before the current buffer, we read a block of data
of bufferSize that ends at the previous buffer start.

Fixes apache#12356

* Work around SecurityManager issues during initialization of vector api (JDK-8309727) (apache#12362)

* Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (apache#12249)

* Implement MMapDirectory with Java 21 Project Panama Preview API (apache#12294)
Backport incl JDK21 apijar file with java.util.Objects regenerated

* remove relic in apijar folder caused by vector additions

* Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (apache#12324)

* Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (apache#12281)


---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>

* Implement VectorUtilProvider with Java 21 Project Panama Vector API (apache#12363) (apache#12365)

This commit enables the Panama Vector API for Java 21. The version of
VectorUtilPanamaProvider for Java 21 is identical to that of Java 20.
As such, there is no specific 21 version - the Java 20 version will be
loaded from the MRJAR.

* Add CHANGES.txt for apache#12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (apache#12368)

Signed-off-by: gashutos <gashutos@amazon.com>

* Move TermAndBoost back to its original location. (apache#12366)

PR apache#12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.

* GITHUB-12252: Add function queries for computing similarity scores between knn vectors (apache#12253)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* hunspell (minor): reduce allocations when processing compound rules (apache#12316)

(cherry picked from commit a454388)

* hunspell (minor): reduce allocations when reading the dictionary's morphological data (apache#12323)

there can be many entries with morph data, so we'd better avoid compiling and matching regexes and even stream allocation

(cherry picked from commit 4bf1b94)

* TestHunspell: reduce the flakiness probability (apache#12351)

* TestHunspell: reduce the flakiness probability

We need to check how the timeout interacts with custom exception-throwing checkCanceled.
The default timeout seems not enough for some CI agents, so let's increase it.

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
(cherry picked from commit 5b63a18)

* This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (apache#12376)

---------

Signed-off-by: gashutos <gashutos@amazon.com>
Co-authored-by: Alan Woodward <romseygeek@apache.org>
Co-authored-by: Luca Cavanna <javanna@apache.org>
Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Armin Braun <me@obrown.io>
Co-authored-by: Mikhail Khludnev <mkhludnev@users.noreply.github.com>
Co-authored-by: Marcus <marcuseagan@gmail.com>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
Co-authored-by: tang donghai <tangdhcs@gmail.com>
Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
Co-authored-by: Jerry Chin <metrxqin@gmail.com>
Co-authored-by: Patrick Zhai <zhai7631@gmail.com>
Co-authored-by: Andrey Bozhko <andybozhko@gmail.com>
Co-authored-by: Chris Hegarty <62058229+ChrisHegarty@users.noreply.github.com>
Co-authored-by: Robert Muir <rmuir@apache.org>
Co-authored-by: Jonathan Ellis <jbellis@datastax.com>
Co-authored-by: Daniele Antuzi <daniele.antuzi@gmail.com>
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
Co-authored-by: Chaitanya Gohel <104654647+gashutos@users.noreply.github.com>
Co-authored-by: Petr Portnov | PROgrm_JARvis <pportnov@ozon.ru>
Co-authored-by: Tomas Eduardo Fernandez Lobbe <tflobbe@apache.org>
Co-authored-by: Ignacio Vera <ivera@apache.org>
Co-authored-by: fudongying <30896830+fudongyingluck@users.noreply.github.com>
Co-authored-by: Chris Fournier <chris.fournier@shopify.com>
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Elia Porciani <e.porciani@sease.io>
Co-authored-by: Peter Gromov <peter@jetbrains.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants