Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection in bad state #49

Open
tomsorgie opened this issue Sep 8, 2015 · 11 comments
Open

Connection in bad state #49

tomsorgie opened this issue Sep 8, 2015 · 11 comments

Comments

@tomsorgie
Copy link

I'm not certain that this is a Bitronix issue; but we've had the following condition occur in our production system.

The problem seems to stem from an issue during either commit or rollback of an Lrc resource. I am not certain whether the fact that it is in Lrc mode is related - since this particular resource is always in that mode. I do not suspect that it is related to Lrc.

Here is a sample stack:

bitronix.tm.internal.BitronixXAException: error preparing non-XA resource
    at bitronix.tm.resource.jdbc.lrc.LrcXAResource.prepare(LrcXAResource.java:229)
    at bitronix.tm.twopc.Preparer$PrepareJob.execute(Preparer.java:157)
    at bitronix.tm.twopc.executor.Job.run(Job.java:68)
    at bitronix.tm.twopc.executor.SyncExecutor.submit(SyncExecutor.java:27)
    at bitronix.tm.twopc.AbstractPhaseEngine.runJobsForPosition(AbstractPhaseEngine.java:120)
    at bitronix.tm.twopc.AbstractPhaseEngine.executePhase(AbstractPhaseEngine.java:84)
    at bitronix.tm.twopc.Preparer.prepare(Preparer.java:88)
    at bitronix.tm.BitronixTransaction.commit(BitronixTransaction.java:281)
    at com.gtnexus.server.mdb.StandardMessageService.run(StandardMessageService.java:202)
    at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:60)
    at com.google.common.util.concurrent.Callables$3.run(Callables.java:93)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLNonTransientConnectionException: [GTN][SQLServer JDBC Driver]Connection reset by peer: socket write error
    at com.gtnexus.jdbc.sqlserverbase.ddcw.b(Unknown Source)
    at com.gtnexus.jdbc.sqlserverbase.ddcw.a(Unknown Source)
    at com.gtnexus.jdbc.sqlserverbase.ddcv.b(Unknown Source)
    at com.gtnexus.jdbc.sqlserverbase.ddcv.a(Unknown Source)
    at com.gtnexus.jdbc.sqlserverbase.ddcv.a(Unknown Source)
    at com.gtnexus.jdbc.sqlserver.tds.ddg.a(Unknown Source)
    at com.gtnexus.jdbc.sqlserver.SQLServerImplConnection.a(Unknown Source)
    at com.gtnexus.jdbc.sqlserver.SQLServerImplConnection.j(Unknown Source)
    at com.gtnexus.jdbc.sqlserverbase.BaseConnection.commit(Unknown Source)
    at bitronix.tm.resource.jdbc.lrc.LrcXAResource.prepare(LrcXAResource.java:225)

After this occurs, the connection/datasource remains associated with this original transaction; but is made available again in the pool. On the next use, that connection will fail the subsequent transaction, complaining that; "resource already started on XID a Bitronix XID". Effectively saying that it cannot participate in a new transaction, because it is already associated with the original one.

i'd like to be able to mark the referenced connection as bad when it encounters the first problem - and ideally remove it from the pool - to make room for a replacement. I am comfortable making changes in the core Bitronix code, but could use an assist in finding the best way to accomplish this -- or any other ideas about how to prevent this problem from occurring.

From everything we can tell, the original issue (which happens very infrequently) does not have an addressable root cause.

Many thanks for any assistance!

@tomsorgie
Copy link
Author

I believe i have found the root cause of our issue; and it does in fact appear to be related to the Lrc code.

The LrcXAResource keeps a direct handle to the xid of the transaction it believes it is working with. This xid is set to null in 3 places. On an end that fails, during commit, or during rollback. In my scenario (a connection error during prepare), the code flows down the rollback path only.

The issue is that since we are in a PREPARED state - we fail one of the state management checks before the try statement that guarantees the xid = null cleanup.

To resolve this issue, i am patching line 301 to null the xid before throwing the exception:

From:

        else if (state == PREPARED) {
            this.state = NO_TX;
            throw new BitronixXAException("resource committed during prepare on XID " + this.xid, XAException.XA_HEURCOM);
        }

To:

        else if (state == PREPARED) {
            this.state = NO_TX;
            String xidStr = this.xid != null ? this.xid.toString() : "null";
            this.xid = null;
            throw new BitronixXAException("resource committed during prepare on XID " + xidStr, XAException.XA_HEURCOM);
        }

We will test this out for a few days - if it appears to function without issue, i will submit a patch branch.

Please let me know if you see any reason why this would not work, or any concerns you have.

thanks!

@lorban
Copy link
Contributor

lorban commented Sep 9, 2015

Hi,

I think your analysis of the problem is right: when the LrcXAResource' state changes to NO_TX, the xid should always be nullified. There are a few places where this isn't happening that should be changed.

But I disagree with your proposed change as it would return the transaction manager a heuristic commit error (telling it your DB committed) while what happened is that the DB rolled back (or the commit call on line 225 wouldn't have thrown).

With your change, the TM will be confused as the call to rollback will throw XAER_HEURCOM that indicates that the resource committed so the TM will throw HeuristicMixedException with an error message indicating that your LRC resource is at fault.

The real problem is that the state machine described in the class' javadoc actually is incomplete. It should be instead:

NO_TX
  |
  | start(TMNOFLAGS)
  |
  |       end(TMFAIL)
STARTED -------------- NO_TX
  |
  | end(TMSUCCESS)
  |
  |    start(TMJOIN)
ENDED ---------------- STARTED
  |\
  | \  commit (one phase)
  |  ----------------- NO_TX
  |   \
  |    \ prepare (exception)
  |     -------------- ROLLED_BACK
  |                    |
  |                    | commit() or
  |                    | rollback()
  |                    |
  |                    NO_TX
  |
  | prepare() (no exception)
  |
  |       commit() or
  |       rollback()
PREPARED ------------- NO_TX

So here's what I believe is the proper fix:

  • A new state (ROLLED_BACK) should be introduced.
  • The state should switch to ROLLED_BACK in the prepare's SQLException catch block.
  • the commit() method should be modified. It should check for the state to be ROLLED_BACK and in that case nullify the xid, switch the state to NO_TX and throw XAException(XAER_NOTA).
  • the rollback method should be modified. It should check for the state to be ROLLED_BACK and in that case nullify the xid and switch the state to NO_TX.
  • Other methods should be modified to check for state ROLLED_BACK and throw XAException(XAER_PROTO) without modifying anything in that case.

The result will be that the LrcXAResource will inform that upon prepare, the resource unilaterally rolled back which will instruct it to rollback all resources (including the LRC one, which will make its rollback method cleanup the XID) and throw RollbackException.

And by the way, the JMS LrcXAResource most certainly suffers from the exact same problem.

@tomsorgie
Copy link
Author

That completely makes sense. I also noticed the issue with my proposed fix regarding how line 225 could not have passed. It turned out that this was in a regrettable area of our system where 2 Lrc resources were in use. So the problem was that the first LrcResource prepared/committed successfully - but then the second had the error that i logged. In the first LrcResource, the state becomes out-of-sync with the xid value.

I recognize that this is a terrible practice (one i inherited), and there really is not reliable response the system can make during this type of error.

None the less, this showed real state management issues in this Lrc code (jdbc & jms) - and I will try to find some time to make a branch that addresses them as described.

@hastingsr713
Copy link

Any update on this. I tried to patch this following these guidelines but it just pushed the problem to another state. Now I am seeing this:

Caused by: bitronix.tm.internal.BitronixXAException: cannot disable autocommit on non-XA connection
at bitronix.tm.resource.jdbc.lrc.LrcXAResource.start(LrcXAResource.java:176)
at bitronix.tm.internal.XAResourceHolderState.start(XAResourceHolderState.java:220)

I am assuming that is because the connection is closed and should be taken out of the pool.

@lorban
Copy link
Contributor

lorban commented Mar 4, 2017

Unfortunately, BTM has been lingering in limbo for many years as I do not have the steam to maintain it anymore and no one stood up to take the mantle.

Are your changes in a publicly accessible repo where I could have a look? Do you have a test that reproduces the problem? Those would be the first steps to get some advice about what goes wrong, which is all I can offer.

@hastingsr713
Copy link

lorban,

Thanks for your candor, I understand. I do not have a test yet to reproduce the problem. My changes so far are here: https://www.dropbox.com/s/811agmyai9hjtoe/LrcXAResource.java?dl=0 I am going to spend some time to understand how transactions work and hopefully I will be able to create the problem using a debugger.

The trigger for the problem is that a connection gets reset causing the state machine to go to the NO_TA state but the connection and the xid remain as mentioned above. I made the changes you recommended and a couple others but now I get the error I reported. I think this is because the connection is closed. If I get something working I will put it in a pull request. I am new to maven.

@tomsorgie
Copy link
Author

@hastingsr713 do you have more details on the actual SQL exception that was thrown at line-176? It seems incorrect that your connection would be LRC - but be set to AutoCommit at start. Do you have multiple LRC resources in a single transaction? If not - do you know what triggered the rollback after PREPARE? Under normal circumstances - setting the state to PREPARED is the last move on what should have been the last-resource. Perhaps you have a resource ordering issue? We have made that mistake in the past.

@lorban i actually wound up making the change i suggested (nulling the xid) in a local patch back in Sept 2015. It did have the desired effect at the time - although we ultimately removed the second LRC resource - removing the root case.

@hastingsr713
Copy link

Unfortunately, this is all third party code for me. I am using JBPM 6.2.1-SNAPSHOT in a cluster using Tomcat with bitronix 2.4.1. I have the code for both and have not reproduced the problem yet. I am trying to work from the LrcXAResource up. I do not have any more details on the SQL Exception. Lorban added that patch later. I'll let you know what I find.

I am also looking at setting enableJdbc4ConnectionTest to true and possibly maxLifeTime as the original issue is caused by the db connection being reset. However these may just mask the issue so I am still trying to dig into this.

@lorban
Copy link
Contributor

lorban commented Mar 7, 2017

@hastingsr713 please keep in mind that BTM 2.x is substantially different from the 3.x code hosted here on Github: there were quite a few changes to the resource pools' internals that do impact their lifecycle, which is exactly what you're investigating.

Any testing you're doing should be done against a 3.x build, with the extra annoyance that no build has ever been published to maven central.

@tomsorgie just in case you feel like maintaining BTM would be something for you, let me know. :)

@hastingsr713
Copy link

Thanks for your input. I try building from the Master and see how that works.

@ysstech
Copy link

ysstech commented May 3, 2018

@lorban
I have met the same problem.
What is the final solution to this problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants