Add logical reasoning samples
wangbin579 committed Sep 5, 2024
1 parent 44abd8f commit c9f547a
Showing 7 changed files with 118 additions and 0 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -32,6 +32,10 @@ Extract some insightful technical points from the book "The Art of Problem-Solvi

[How to Mitigate Performance Fluctuations in MySQL Group Replication?](group_replication_jitter.md)

[logical_reasoning_samples](logical_reasoning_samples.md)

[TCPCopy Introduction](tcpcopy_introduction.md)



# Book Link:
Binary file added images/3a5d10caf25003e7ea4dcace59a181f6.gif
Binary file added images/4309ffee12d2eed2548845d1e1d2e848.png
Binary file added images/5c6f81f69eac7f61744ba3bc035b29e7.png
Binary file added images/6670e70b6d5f0f5152f643c153d13487.png
65 changes: 65 additions & 0 deletions logical_reasoning_samples.md
@@ -0,0 +1,65 @@
# Logical Reasoning in Network Problems

## 1.1 Classic Case 1

Many software professionals lack a solid grounding in logical reasoning about TCP/IP, so they often misclassify ordinary problems as mysterious ones. Some are discouraged by the complexity of TCP/IP networking literature, while others are misled by confusing details in Wireshark. For instance, a DBA facing performance problems might misinterpret packet capture data in Wireshark and erroneously conclude that TCP retransmissions are the cause.

![](images/3a5d10caf25003e7ea4dcace59a181f6.gif)

Figure 1. Packet capture screenshot provided by DBA suspecting retransmission problems.

Since retransmission is suspected, it's essential to understand its nature. Retransmission is fundamentally timeout-based. Confirming whether retransmission is indeed the cause therefore requires timing information, which the screenshot above does not provide. The DBA was asked for a new screenshot that includes timestamps.

![](images/6670e70b6d5f0f5152f643c153d13487.png)

Figure 4-46. Packet capture screenshot with time information added.

When analyzing network packets, timestamp information is crucial for accurate logical reasoning. When two identical packets are only microseconds apart, there are two candidate explanations: a timeout retransmission or a duplicate packet capture. In a typical LAN environment the round-trip time (RTT) is around 100 microseconds, and a TCP retransmission requires at least one RTT. A duplicate that appears after just 1/100th of the RTT is therefore most likely a duplicate capture rather than an actual timeout retransmission.
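
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a Wireshark feature: the function name `classify_duplicate` and the two labels it returns are assumptions made for this example, and the default RTT is the 100-microsecond LAN figure quoted above.

```python
def classify_duplicate(delta_us: float, rtt_us: float = 100.0) -> str:
    """Classify a pair of identical packets by the time gap between them.

    delta_us: gap between the two captured copies, in microseconds.
    rtt_us:   estimated round-trip time of the path, in microseconds
              (100 us is the typical LAN value assumed in the text).
    """
    if delta_us < 0 or rtt_us <= 0:
        raise ValueError("delta_us must be >= 0 and rtt_us must be > 0")
    # A retransmission needs at least one RTT, so a second copy that shows up
    # sooner than that cannot be explained by a timeout retransmission.
    if delta_us < rtt_us:
        return "duplicate capture"
    return "possible retransmission"


print(classify_duplicate(1.0))      # gap is 1/100th of the assumed RTT
print(classify_duplicate(200_000))  # gap of 200 milliseconds
```

A gap of at least one RTT only makes a retransmission possible, not certain, which is why the second label is deliberately cautious.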

## 1.2 Classic Case 2

Another classic case illustrates the importance of logical reasoning in network problem analysis.

One day, a business developer came rushing over to report that a scheduled script using the MySQL database middleware had hung in the early morning hours without getting a response. I checked the error logs of the MySQL database middleware but found no valuable clues, so I asked the developers whether they could reproduce the problem, knowing that a reproducible problem is much easier to solve.

The developers tried multiple times to reproduce the problem but were unsuccessful. However, they made a new discovery: they found that executing the same SQL queries during the day resulted in different response times compared to the early morning. They suspected that when the SQL response was slow, the MySQL database middleware was blocking the session and not returning results to the client.

Based on this insight, the database operations team was asked to modify the script's SQL to simulate a slow SQL response. Even so, the MySQL database middleware returned the results without the hang seen in the early morning hours.

For a while, the root cause couldn't be identified. In the meantime, developers discovered a separate functional problem with the MySQL database middleware, which made the developers and DBA operations even more convinced that the middleware was delaying responses. In reality, these problems were not related to the response times of the MySQL database middleware.

The events of the first day showed that the problem was real. Everyone involved tried to pinpoint the cause, making various guesses, but the true reason remained elusive.

The next day, developers reported that the script problem reoccurred in the early morning, yet they couldn't reproduce it during the day. Developers, feeling pressured as the script was soon to be used online, complained about the situation. My only suggestion was for them to use the script during the day to avoid problems in the early morning. With all suspicions focused on the MySQL database middleware, it was challenging to analyze the problem from other perspectives.

As the developer responsible for the MySQL database middleware, I could not simply overlook such a mysterious problem. Ignoring it could affect subsequent use of the middleware, and leadership was also pressing for a prompt solution. Finally, we decided on a low-cost packet capture approach: while the script ran in the early morning, packets would be captured on the server to show what was happening at that time. The goal was to determine whether the MySQL database middleware failed to send a response at all, or sent a response that the client script never received. If the capture confirmed that the middleware did send a response, the problem could not be attributed to the MySQL database middleware developers.

On the third day, developers reported that the early morning problem did not recur, and the packet capture analysis confirmed this. After careful consideration, it seemed unlikely that the problem lay solely with the MySQL database middleware: a fault in the middleware would not explain why the problem occurred frequently in the early morning and rarely during the day. The only course of action was to wait for the problem to occur again and analyze it based on the packet captures.

On the fourth day, the problem did not surface again.

However, on the fifth day, the problem finally reappeared, bringing hope for resolution.

There were many packet capture files. The first step was to ask the developers for the time at which the problem occurred, and then to search the extensive packet capture data for the SQL query that triggered it. The final result is as follows:

![](images/4309ffee12d2eed2548845d1e1d2e848.png)

Figure 2. Key packet information captured for problem resolution.

The packet capture above (taken on the server) shows that the SQL query was sent at 3 AM. The MySQL database middleware took about 630 seconds (03:10:30.899249 minus 03:00:00.353157) to return the SQL response to the client, so the middleware did indeed respond to the SQL query. However, just 238 microseconds later (03:10:30.899487 minus 03:10:30.899249), the server's TCP layer received a reset packet, which is suspiciously quick. It's important to note that this reset packet cannot be immediately assumed to be from the client.
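
The two intervals can be checked with a short Python calculation. The three timestamps are the ones quoted above; the helper name `gap_seconds` and the variable names are assumptions introduced for this sketch, which also assumes that all timestamps fall on the same day.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"


def gap_seconds(earlier: str, later: str) -> float:
    """Return the number of seconds between two same-day capture timestamps."""
    start = datetime.strptime(earlier, TIME_FORMAT)
    end = datetime.strptime(later, TIME_FORMAT)
    return (end - start).total_seconds()


query_sent = "03:00:00.353157"      # SQL query arrives at the server
response_sent = "03:10:30.899249"   # middleware returns the SQL response
reset_received = "03:10:30.899487"  # server's TCP layer receives the reset

response_delay = gap_seconds(query_sent, response_sent)
reset_delay = gap_seconds(response_sent, reset_received)

print(f"response delay: {response_delay:.6f} s")
print(f"reset after response: {reset_delay * 1e6:.0f} us")
```

The first value is roughly 630.5 seconds and the second is 238 microseconds, matching the figures read from the capture.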

First, it is necessary to establish who sent the reset packet: the client, or an intermediate device along the path. Since packets were captured only on the server side, nothing is known directly about what the client sent or received. The sender therefore has to be inferred by applying logical reasoning to the server-side capture.

Suppose the client sent the reset. That would mean the client's TCP layer no longer recognizes this connection: its state has gone from established to nonexistent. Such a state change would be reported to the client application as a connection problem, and the client script would error out immediately. In reality, however, the client script was still waiting for the response. The assumption leads to a contradiction, so the client did not send the reset. The client's connection was still active, while the corresponding connection on the server side had been terminated by the reset.

Who sent the reset, then? The primary suspect is Amazon's cloud environment. Based on this packet capture analysis, the DBA operations team contacted Amazon customer service and received the following information:

![Image](images/5c6f81f69eac7f61744ba3bc035b29e7.png)

Figure 3. Final response from Amazon customer service.

Customer service's response aligns with the analysis: Amazon's ELB (Elastic Load Balancer, similar to LVS) forcibly terminated the TCP session. According to their feedback, if a response takes longer than the 350-second threshold (630 seconds in this packet capture), the ELB device sends a reset to the responding party (in this case, the server). The client script deployed by the developers did not receive the reset and mistakenly assumed that the server connection was still active. The official recommendations for such cases include using the TCP keepalive mechanism.
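
As a rough illustration of the keepalive recommendation, the sketch below enables TCP keepalive on a Python socket. It is not the fix that was actually deployed in this case: the helper name `enable_keepalive` and the values of 60 seconds, 10 seconds, and 5 probes are assumptions chosen only so that probes start well before the 350-second threshold.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 5) -> None:
    """Turn on TCP keepalive so an idle connection still carries periodic probes.

    idle_s, interval_s and probes are illustrative values, not taken from the
    case; idle_s only needs to stay well below the 350-second threshold.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are platform-specific (they exist on Linux);
    # where one is missing, the operating system default is left in place.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", keepalive_on)
sock.close()
```

Lowering the idle time matters: on Linux the default wait before the first keepalive probe is 7200 seconds, far longer than the 350-second threshold, so switching keepalive on without tuning it would probably not have helped here.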

With the official response obtained, the problem was considered fully solved.

This case illustrates how complex online problems can be. Solving them requires capturing critical information (in this instance, packet capture data) to understand the situation as it occurred. Through logical reasoning and the application of reductio ad absurdum, the root cause was identified.
49 changes: 49 additions & 0 deletions tcpcopy_introduction.md
@@ -0,0 +1,49 @@
# TCPCopy Introduction

TCPCopy [1] is an open-source TCP traffic replication tool, commonly used for reusing and replaying online traffic to assist in performance testing, stress testing, or debugging of server applications. It can replicate production traffic and send it to a test environment without impacting the production environment, allowing for testing of new applications, features, or architectures.

Below are some key concepts and mechanisms of TCPCopy's architecture:

### 1 Traffic Replication Mechanism

The core function of TCPCopy is to capture traffic from the production environment and replicate it to another designated target server (typically a test server). The specific steps are as follows:

- **Capturing Production Traffic:** The TCPCopy client (deployed on the production server) listens on a specific port for TCP traffic (such as HTTP requests) and captures the TCP packets from the traffic.
- **Forwarding Traffic:** The captured data packets are sent by the TCPCopy client to the target test server. The test server, usually simulating applications in the production environment, receives this traffic for testing purposes.

### 2 Traffic Replay on the Target Server

The traffic that TCPCopy sends to the target server is similar to the production traffic. However, the test server's configuration must be isolated from the production server so that the two do not overlap at the application level (for example, by accessing the same backend MySQL database) and the production environment is not affected. Replay is mainly used to verify the system's behavior under production loads, such as:

- Verifying the correctness of the new version of the code.
- Assessing system performance and scalability.
- Examining how new configurations, architectures, or hardware behave under production-level traffic.

### 3 Multi-Node Architecture

In large-scale deployments, TCPCopy often uses a multi-node architecture to distribute the load of traffic replication. Each TCPCopy client can handle a portion of the production server's traffic and, based on a predefined load balancing strategy, replicate the traffic and send it to different test servers. This helps avoid single-point performance bottlenecks.

### 4 Real-Time and Accuracy

TCPCopy replicates traffic in near real-time, but not exactly in real-time.

### 5 Isolation

Traffic is transmitted between the production and test environments, but TCPCopy does not directly modify production environment data or state. This isolation allows large-scale testing without impacting the stability of the production system.

### 6 Compatibility and Scalability

TCPCopy supports a variety of network protocols, especially common application protocols like HTTP and MySQL. It is designed to support large traffic replication tasks and can work efficiently in high-concurrency environments. Additionally, users can combine TCPCopy with other monitoring and analysis tools, such as MySQL Proxy or Wireshark, to further analyze and process the traffic.

### 7 Use Cases

- **Performance Regression Testing**: Before the new version of a system goes live, replay production traffic using TCPCopy to verify if there are any performance regressions.
- **Fault Diagnosis**: Replay production traffic in a test environment to reproduce anomalous behavior observed in production for debugging and troubleshooting.
- **Stress Testing**: Amplify production traffic by replaying it multiple times to assess the system's performance under high load conditions.

## **Summary**

TCPCopy is a simple yet powerful tool, ideal for testing scenarios that require traffic replay from production environments. Its efficient and non-intrusive design makes it an important tool for developers conducting performance tests, fault diagnosis, and system evaluations.

[1] https://github.com/session-replay-tools/tcpcopy.
