G2-item anomaly with master kills #2125
Comments
I've been exploring different techniques to speed reproduction, and I have a second example of G2 for you:
We killed all three masters:
2019-08-22 18:22:34,180{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :kill-master nil
2019-08-22 18:22:34,591{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :kill-master {"ec2-54-226-85-143.compute-1.amazonaws.com" "", "ec2-35-173-249-185.compute-1.amazonaws.com" "", "ec2-54-172-38-176.compute-1.amazonaws.com" ""}
Then restarted all three nodes, and hit both of these errors three seconds after that start operation completed:
2019-08-22 18:23:11,694{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-master nil
2019-08-22 18:23:12,001{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-master {"ec2-54-226-85-143.compute-1.amazonaws.com" "", "ec2-54-172-38-176.compute-1.amazonaws.com" "", "ec2-35-173-249-185.compute-1.amazonaws.com" ""}
Full logs are here: 20190822T180648.000Z.zip
And can be reproduced on Jepsen 64ea9ea via:
This time it took roughly 4 test runs. Following @spolitov's hunch that this might be related to a timing anomaly, I lowered the clock skew threshold to 1 us (from 50 ms). I also increased the number of writes per key to give us a better picture of how the system evolves over time (which makes the examples much longer), and added some delays to increase the fraction of successful transactions; maybe 90% were aborting in earlier runs.
With improved append test tuning, and more frequent crashes affecting all master processes, we can reproduce G2 anomalies in most 500-second tests. With Jepsen 3ee5900213bfe5b6eecbaf663ebab422a6f6eaae, try:
I can also reproduce G2 anomalies with master pauses! 20190823T211304.000Z.zip
Summary:
For backward compatibility, we have a list of capabilities supported by each tserver. Right after a master restarts, and until it has received heartbeats from the tservers, it does not know their capabilities. So when the meta cache is updated during that window, a tserver with empty capabilities can be stored in it.

We also have a feature where the read time for a transaction with snapshot isolation is picked at the tablet server, while processing the first request from that transaction. But when interacting with a tablet server that does not support picking read time, we set the transaction read time at the proxy layer (i.e. the YCQL or YSQL proxy). This particular code contained a bug: it also set the read time for transactions with serializable isolation. Read time must not be set for a transaction with serializable isolation, because if it is set, the transaction can miss writes made by committed transactions.

The scenario that happened in #2125:
1) All masters are turned off.
2) The meta cache is invalidated, and right after the master restarts the client fetches tablet server information with empty capabilities.
3) When the client tries to send the next transaction request, the buggy code is executed and a read time is picked for the transaction with serializable isolation (the T2 transaction in #2125).
4) Operations executed for this transaction missed values that were recently committed by other transactions, leaving an incorrect state.

So this is not just a serializability violation; it could cause more serious consistency bugs.

Fixed by not setting read time for transactions with serializable isolation.

Extended PgLibPqTest.OnConflict test:
1) Add reads to transaction
2) Find order cycles
Added PgLibPqTest.OnConflictWithKillMaster.

Test Plan: ybd --gtest_filter PgLibPqTest.OnConflict -n 100
Reviewers: mikhail
Reviewed By: mikhail
Subscribers: ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D7108
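For readers following along, the read-time decision described in the commit summary above can be sketched roughly as follows. This is an illustrative Python model only, not YugaByte's actual C++ code: the names `TabletServerInfo`, `PICK_READ_TIME`, `proxy_read_time_buggy`, and `proxy_read_time_fixed` are all hypothetical, and the real capability identifiers and call sites are not shown in this thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Isolation(Enum):
    SNAPSHOT = "snapshot"
    SERIALIZABLE = "serializable"


# Hypothetical capability name; the real identifier is not given in the thread.
PICK_READ_TIME = "pick_read_time"


@dataclass(frozen=True)
class TabletServerInfo:
    # Empty when fetched right after a master restart, before heartbeats arrive.
    capabilities: FrozenSet[str]


def proxy_read_time_buggy(isolation: Isolation, tserver: TabletServerInfo,
                          now: int) -> Optional[int]:
    """Buggy behavior as described: ignores the isolation level."""
    if PICK_READ_TIME in tserver.capabilities:
        return None  # the tablet server picks the read time itself
    return now  # the proxy picks it, even for serializable transactions


def proxy_read_time_fixed(isolation: Isolation, tserver: TabletServerInfo,
                          now: int) -> Optional[int]:
    """Fixed behavior as described: never set read time for serializable."""
    if isolation is Isolation.SERIALIZABLE:
        return None
    if PICK_READ_TIME in tserver.capabilities:
        return None
    return now


if __name__ == "__main__":
    stale = TabletServerInfo(capabilities=frozenset())
    print(proxy_read_time_buggy(Isolation.SERIALIZABLE, stale, now=100))
    print(proxy_read_time_fixed(Isolation.SERIALIZABLE, stale, now=100))
```

In this model, a tserver entry cached with empty capabilities makes the buggy path pin a read time on a serializable transaction, which is the condition step 3 of the scenario describes.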
It looks to me like this is fixed with yugabyte-1.3.1.2-b1-release-centos-x86_64, with mbautin@3e09352!
Thx @aphyr!
In rare cases (1 in ~100 hours of testing), YugaByte DB 1.3.1.0 can exhibit G2 (anti-dependency cycles) in the Jepsen append test workload.
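As a rough illustration of what a G2 anomaly means here: transactions form a dependency graph with write-write (ww), write-read (wr), and read-write (rw, anti-dependency) edges, and G2 is a cycle in that graph containing at least one rw edge. The sketch below is a minimal Python check under that definition; it is not Jepsen's checker, and the function name `has_g2_cycle` and the edge encoding are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

# (from_txn, to_txn, kind), where kind is "ww", "wr", or "rw".
Edge = Tuple[str, str, str]


def has_g2_cycle(edges: List[Edge]) -> bool:
    """Return True if some dependency cycle contains at least one rw edge."""
    graph: Dict[str, List[str]] = {}
    for src, dst, _kind in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    def reachable(start: str, goal: str) -> bool:
        # Iterative depth-first search from start, looking for goal.
        stack: List[str] = [start]
        seen: Set[str] = set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node])
        return False

    # An rw edge src -> dst lies on a cycle iff src is reachable from dst.
    return any(kind == "rw" and reachable(dst, src)
               for src, dst, kind in edges)


if __name__ == "__main__":
    # T1 read key x before T2 appended to it; T2 read key y before T1 appended.
    print(has_g2_cycle([("T1", "T2", "rw"), ("T2", "T1", "rw")]))
```

Two committed transactions that each fail to observe the other's append produce exactly this shape: neither can be ordered before the other in any serial history.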
This anomaly occurred during a kill-master and start-master combo: we killed all three masters from 14:00 to 14:01:
... and started those nodes from 14:06 to 14:25:
Full logs are here.
This anomaly occurred in a 1200 second test with master kills, on a 5-node EC2 cluster. You can reproduce it (extremely slowly) using Jepsen 47fba9e4c5e8477a51e24a5185703bcb596b52ab, and running: