Diagnosing common performance problems with the help of TCP graphs

Introduction

Below are some situations in which a TCP test shows unexpected behavior, with suggestions on how to fix each of them.

Growing slowly

In the graph below, you can see that the throughput takes a long time to reach its maximum:


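The reason turns out to be that the slow-start threshold is set too low. Once the congestion window reaches this threshold, TCP leaves slow start and grows the window only linearly, so the remaining ramp-up is slow. If we set the slow-start threshold to infinity and re-run the test, we get the result we want: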
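The effect of the threshold can be illustrated with a small simulation. This is only a sketch of a simplified model (the window doubles per round trip below the threshold and grows by one segment per round trip above it; no loss, no delayed ACKs, no receive-window limit). The function name, the initial window of 10 segments and the target of 1000 segments are assumptions chosen for illustration, not ByteBlower settings.

```python
def rtts_to_reach(target_segments, ssthresh, initial_cwnd=10):
    """Count round trips until the congestion window reaches target_segments.

    Simplified model: exponential growth below ssthresh (slow start),
    one extra segment per round trip at or above it (congestion avoidance).
    """
    cwnd = initial_cwnd
    rtts = 0
    while cwnd < target_segments:
        if cwnd < ssthresh:
            # Slow start: double the window, but do not overshoot ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: linear growth.
            cwnd += 1
        rtts += 1
    return rtts


if __name__ == "__main__":
    # Hypothetical target window of 1000 segments.
    print("low threshold (16 segments):", rtts_to_reach(1000, ssthresh=16))
    print("infinite threshold:         ", rtts_to_reach(1000, ssthresh=float("inf")))
```

In this model, the low threshold needs hundreds of round trips to reach the target window, while the infinite threshold needs only a handful.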

Slower than expected

In this example we expected a throughput near 1 Gbit/s, but it stabilized at around 400 Mbit/s:


A stable but low throughput without retransmissions typically indicates that the window scale factor is set too low.

The solution is to increase the window scale factor in the TCP config. We find that a value of 4 to 6 is normally sufficient. In a high-latency environment, a higher value may be needed. In the above example, the window scale factor was set to 2. If we increase it to 4, we achieve a much higher average throughput:
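The reason a small window scale factor caps the throughput is that TCP can have at most one window of data in flight per round trip. The sketch below estimates that ceiling. It assumes a base receive window of 65535 bytes (the largest value the 16-bit window field can hold) and a hypothetical round-trip time of 5 ms; neither value is taken from the test above, and the function names are illustrative.

```python
def max_window_bytes(window_scale, base_window=65535):
    """Largest window that can be advertised with the given scale factor."""
    return base_window << window_scale


def window_limited_throughput_bps(window_scale, rtt_seconds, base_window=65535):
    """Upper bound on throughput: one full window per round trip, in bit/s."""
    return max_window_bytes(window_scale, base_window) * 8 / rtt_seconds


if __name__ == "__main__":
    rtt = 0.005  # assumed round-trip time of 5 ms
    for scale in (2, 4, 6):
        mbps = window_limited_throughput_bps(scale, rtt) / 1e6
        print(f"window scale {scale}: at most {mbps:.0f} Mbit/s")
```

Under these assumptions a scale factor of 2 limits the flow to roughly 420 Mbit/s, while a scale factor of 4 lifts the ceiling above 1 Gbit/s, so the window is no longer the bottleneck.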

Very large round-trip times

In this graph the throughput is fine, but the round-trip time of more than 100 ms is far too high:

This usually means that the window scale factor was set too high. A high window scale factor leads to a very large transmit window, which lets TCP hand over packets faster than the network card can send them. This causes internal buffers to fill up, resulting in queuing delay. This delay is what causes the large round-trip times.

In our example, the window scale factor was set to the maximum value of 8. Reducing it to a value of 4 to 6 should greatly improve the round-trip times.
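A rough estimate shows why a large window translates into delay: any data in flight beyond what the path itself can hold has to wait in a buffer, and draining that buffer takes time. The sketch below assumes a base window of 65535 bytes, a 1 Gbit/s link, a hypothetical base round-trip time of 1 ms, a sender that fills the whole window, and buffers large enough to hold the excess without dropping packets. The function name and all of these values are illustrative assumptions.

```python
def queuing_delay_seconds(window_scale, link_rate_bps, base_rtt_seconds,
                          base_window=65535):
    """Estimate the delay added by data queued in front of the link.

    Excess data = full window minus the bandwidth-delay product of the path.
    Queuing delay = time the link needs to drain that excess.
    """
    window_bits = (base_window << window_scale) * 8
    bdp_bits = link_rate_bps * base_rtt_seconds
    excess_bits = max(0.0, window_bits - bdp_bits)
    return excess_bits / link_rate_bps


if __name__ == "__main__":
    link = 1e9        # assumed 1 Gbit/s link
    base_rtt = 0.001  # assumed 1 ms round-trip time without queuing
    for scale in (4, 6, 8):
        delay_ms = queuing_delay_seconds(scale, link, base_rtt) * 1000
        print(f"window scale {scale}: about {delay_ms:.1f} ms of queuing delay")
```

With these assumptions, a scale factor of 8 yields well over 100 ms of queuing delay, while a scale factor of 4 stays below 10 ms.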

Unstable throughput with many retransmissions

Unstable throughput is typically caused by packet loss. TCP is designed to slow down when it detects packet loss, so even a small loss rate can severely impact performance. Here are a few things you can try to improve the throughput in this situation:
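To get a feeling for how strongly loss limits throughput, the well-known approximation by Mathis et al. for Reno-style TCP can be used: throughput is at most (MSS / RTT) * (C / sqrt(p)), with C close to sqrt(3/2) and p the loss rate. The sketch below applies it; the segment size of 1460 bytes and the round-trip time of 10 ms are assumptions, and other congestion avoidance algorithms will behave differently.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate):
    """Approximate upper bound on Reno-style TCP throughput under random loss."""
    c = math.sqrt(3.0 / 2.0)  # constant from the Mathis et al. model
    return (mss_bytes * 8 / rtt_seconds) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    mss = 1460   # assumed segment size in bytes
    rtt = 0.010  # assumed round-trip time of 10 ms
    for loss in (0.0001, 0.001, 0.01):
        mbps = mathis_throughput_bps(mss, rtt, loss) / 1e6
        print(f"loss rate {loss:.2%}: at most about {mbps:.0f} Mbit/s")
```

Under these assumptions, a loss rate of only 0.01% already keeps a single flow far below 1 Gbit/s, and every 100-fold increase in loss cuts the bound by another factor of 10.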

Solution 1: Try setting a rate limit to reduce the transmission rate.

Sometimes packet loss is caused by a device that is unable to keep up with the ByteBlower's transmission rate. A quick way to fix this is by introducing a rate limit. You can do this in the TCP configuration tab.

Solution 2: Try using a different congestion avoidance algorithm.

The congestion avoidance algorithm determines how TCP deals with packet loss. TCP interprets packet loss as a sign of network congestion (competition with other flows) and it actively slows down so that other flows are able to get their fair share of the available bandwidth. Different congestion avoidance algorithms work better in different types of network environments. This is a very complex topic and is still actively researched today.

Here are a few simple guidelines that can help you to select the best congestion avoidance algorithm:

  • Cubic is good for high-latency networks.
  • SACK is good for unreliable networks like Wi-Fi.

Solution 3: Capture the network traffic and perform an in-depth analysis using a tool like Wireshark.

Sometimes this is the only way to figure out what’s really going on.