This directory contains a re-run of the ECT(1) tests performed in @heistp/sce-l4s-ect1. More precisely, it provides the L4S test results after fixing a bug reported in the original set of experiments (see commit#xxx).
In the sections that follow, the text of the README is unchanged from its original version, so it might no longer match what is shown in the displayed graphs.
All the data generated by the tests is also available here (in case the pcaps get stripped from this repo): https://drive.google.com/drive/folders/1sj9Ox6nHJN-v3xrK4ginlfO3AewAsy85
Whenever you rely on a heuristic, rather than an explicit signal, you need to establish how often it fails, in both the false-positive and false-negative directions.
You also need to determine how severe the consequences of these failures are, which in this case means checking the degree of unfairness to competing traffic that results, and the impact on the performance of the L4S flow itself. This is what we set out to look for.
First, to give some credit, the “classic AQM detection heuristic” does appear to work in some circumstances, as we can see in the following plot:
Figure 1
When faced with a single-queue Codel or PIE AQM at default parameters, TCP Prague appears to successfully switch into its fallback mode and compete with reasonable fairness. Under good network conditions, it also correctly detects an L4S queue at the bottleneck. It even successfully copes with the tricky case of the bottleneck being changed between DualQ-PI2 and a PIE instance with ECN disabled, though it takes several sawtooth cycles to switch back into L4S mode after DualQ-PI2 is restored to the path. We suspect this represents the expected behaviour of the heuristic, from its authors’ point of view.
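For reference, the kind of single-queue AQM configurations referred to above can be set up with the stock Linux qdiscs; the commands below are illustrative only, and not necessarily the exact parameters used in these tests.

```bash
# Illustrative single-queue AQMs at (near-)default parameters; eth0 is a placeholder device.
tc qdisc replace dev eth0 root codel ecn    # single-queue CoDel with ECN marking
tc qdisc replace dev eth0 root pie ecn      # single-queue PIE with ECN marking
tc qdisc replace dev eth0 root pie noecn    # PIE with ECN disabled (signals by drop only)
```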
However, we didn’t have to expand our search very far to find cases that the heuristic did not cope well with, some of which even appeared to break TCP Prague’s congestion control entirely. That is where our concern lies.
Figure 2
False-negative detections are the most serious when it comes to maintaining “friendly coexistence” with conventional traffic. We found them in three main areas:
Figure 3
The above failure scenarios are not at all exotic; they can be encountered either by accident, in the case of a misconfiguration, or on purpose, when an AQM is configured to prioritize low delay or low memory consumption over utilization. This should cast serious doubt on relying on this heuristic to maintain effective congestion control on the Internet. By contrast, SCE flows encountering these same scenarios behave indistinguishably from normal CUBIC or NewReno flows.
Figure 4
False-positive detections undermine L4S performance, as measured by the criteria of maintaining minimum latency and maximum throughput on suitably fitted networks. We found these in three main areas:
L4S flows affected by a false-positive detection will have their throughput cut to significantly less than the true path capacity, especially if competing at the bottleneck with unaffected L4S flows.
Figure 5
Desensitising of the heuristic appears to occur in the presence of packet drops (see Figure 5). We are not certain why this would have been designed in, although one hypothesis is that it was added to improve behaviour on the “capacity reduction” test we presented at an earlier TSVWG interim meeting. In that test, L4S previously exhibited a lot of packet loss followed by a long recovery period with almost no goodput; now there is still a lot of loss at the reduction stage, but the long recovery period is eliminated.
This desensitising means that TCP Prague remains in L4S mode even when the path signals congestion conventionally, by packet loss rather than ECN marks. The exponential growth of slow-start means that the first loss is experienced before the heuristic has switched over to the classic fallback mode, even when that loss only occurs after filling an 80ms path and a 250ms queue (neither of which is unusual on Internet paths). However, this would not necessarily be a problem as long as packet loss were always treated as a conventional congestion signal and responded to with the conventional Multiplicative Decrease.
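As a rough illustration of why that first loss arrives so early, here is a back-of-envelope calculation; the 50 Mbit/s bottleneck rate, 1500-byte packets, and initial window of 10 are assumptions for the sake of the example, not parameters taken from a specific test.

```bash
# Bytes in flight needed to fill an 80 ms path plus a 250 ms queue at an assumed 50 Mbit/s.
echo "(50*10^6/8) * 0.33" | bc          # ~2 062 500 bytes
echo "(50*10^6/8) * 0.33 / 1500" | bc   # ~1375 full-size packets
# Slow-start roughly doubles cwnd every RTT from an assumed initial window of 10 packets:
# 10, 20, 40, ..., 1280, 2560 - so the queue overflows within about 8 RTTs (< 1 s at 80 ms),
# i.e. before the classic-AQM heuristic has switched over to fallback mode.
echo "2560 > 1375" | bc                 # prints 1 (true)
```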
Unfortunately, that brings us to the final flaw we identified in TCP Prague’s congestion control. When in the classic fallback mode, TCP Prague does indeed respond to loss in essentially the correct manner. However, when in L4S mode, it appears to ignore loss entirely for the purposes of congestion control (see Figure 6). We repeatably observed full utilisation of the receive window in the face of over 90% packet loss. A competing TCP CUBIC flow was completely starved of throughput: exactly the sort of behaviour that occurred during the congestion collapse events of the 1980s, which the AIMD congestion control algorithm was introduced to solve.
Figure 6
This is not effective congestion control.
Foremost among L4S’ key goals is “Consistently ultra low latency”. A precise definition of this is difficult to find in their documentation, but conversations indicate that they aim to achieve under 1ms of peak queue delay. We consider this an unachievable goal on the public Internet, due to the jitter and burstiness of real traffic and real Internet paths. Even the receive path of a typical Ethernet NIC has about 1ms of jitter, due to interrupt latency that is designed in to reduce CPU load.
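That receive-side jitter largely stems from interrupt coalescing; as a hedged illustration, it can be inspected and adjusted with ethtool (the device name and value below are placeholders, not settings used in these tests).

```bash
# Inspect and (optionally) reduce NIC interrupt coalescing; eth0 and the value are examples.
ethtool -c eth0               # show current rx/tx coalescing settings
ethtool -C eth0 rx-usecs 50   # shorter rx interrupt delay, at the cost of more CPU load
```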
Some data supporting this conclusion is included in the appendix, which shows that even over modest geographical distances on wired connections, the path jitter can be larger than the peak delay L4S targets. Over intercontinental distances it is larger still. Yet this jitter has to be accommodated in the queue to maintain full throughput, which is another stated L4S goal.
To accommodate these real-world effects, the SCE reference implementation defaults to 2.5ms target delay (without the low-latency PHB), and accepts short-term delay excursions without excessive congestion signalling.
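For comparison with stock Linux, a CoDel instance can be given a similar delay target; this is only a sketch using the upstream codel qdisc as a stand-in, not the SCE reference qdiscs themselves.

```bash
# Stock CoDel with a 2.5 ms target instead of its 5 ms default (illustrative; eth0 is a placeholder).
tc qdisc replace dev eth0 root codel target 2.5ms
```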
The L4S congestion signalling strategy is much more aggressive, so that encountering this level of jitter causes a severe reduction in throughput - all the more so because this also triggers the classic AQM detection heuristic.
The following two plots (Figure 7 and Figure 8) illustrate the effect of adding a simulated wifi link to a typical 80ms Internet path - first with an SCE setup, then with an L4S one. These plots have the same axis scales. The picture is broadly similar on a 20ms path, too.
Figure 7 Figure 8
A larger question might be: what should “ultra low delay” be defined as, in an Internet context? Perhaps we should refer to what queuing delay is typically observed today. As an extreme outlier, this author has personally experienced over 40 seconds of queue delay, induced by a provisioning shaper at a major ISP. Most good network engineers would agree that even 4 seconds is excessive. A “correctly sized” drop-tail FIFO might reach 400ms during peak traffic hours, when capacity is stretched and available bandwidth per subscriber is lower than normal - so let’s take that as our reference point.
Compared to 400ms, a conventional AQM might show a 99th-percentile delay of 40ms under sustained load. We can reasonably call that “low latency”, as it’s comparable to a single frame time of standard-definition video (at 25 fps), and well within the preferred jitter buffer dimensions of typical VoIP clients. So perhaps “ultra low delay” is reasonably defined as an order of magnitude better than that, at 4ms; that’s comparable to the frame time of a high-end gaming monitor.
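The reference points above follow from simple arithmetic; the 100 Mbit/s provisioned rate, 25 Mbit/s peak-hour share, and 240 Hz monitor refresh rate below are illustrative assumptions of ours, not figures from the tests.

```bash
# Why a drop-tail FIFO sized for the provisioned rate can reach ~400 ms at peak hours
# (assumed: buffer sized to 100 ms at 100 Mbit/s, drained at a 25 Mbit/s per-subscriber share).
echo "(100*10^6/8) * 0.1" | bc               # 1 250 000-byte buffer
echo "scale=2; 1250000 / (25*10^6/8)" | bc   # 0.40 s of queue delay when drained at 25 Mbit/s

# Frame-time comparisons behind the 40 ms and 4 ms reference points.
echo "scale=2; 1000/25"  | bc                # 40 ms: one frame of 25 fps standard-definition video
echo "scale=2; 1000/240" | bc                # ~4.2 ms: one frame on an assumed 240 Hz gaming monitor
```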
Given experience with SCE’s default 2.5ms target delay, we think 4ms peak delay is realistically achievable on a good, short Internet path with full throughput. The Codel AQM we’ve chosen for SCE can already achieve that in favourable conditions, while still obtaining reasonable throughput and latency control when conditions are less than ideal.
There is nothing magical about the codepoint used for this signalling; both L4S and SCE should be able to achieve the same performance if the same algorithms are applied. But SCE aims for an achievable goal with the robustness to permit safe experimentation, and this may fundamentally explain the contrast in the plots above.
In the following results, each table cell contains the links for one run, named as follows: plot, cli.pcap (client capture), srv.pcap (server capture), and teardown.
Bandwidth | RTT | SCE | L4S |
---|---|---|---|
5Mbit | 20ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
5Mbit | 80ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
5Mbit | 160ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
50Mbit | 20ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
50Mbit | 80ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
50Mbit | 160ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
250Mbit | 20ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
250Mbit | 80ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
250Mbit | 160ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
RTT | SCE | L4S |
---|---|---|
20ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
80ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
Bandwidth | RTT | SCE | L4S |
---|---|---|---|
40Mbit | 20ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
40Mbit | 80ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
5Mbit | 20ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
5Mbit | 80ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
qdisc | RTT | SCE |
---|---|---|
cake | 20ms | plot - cli.pcap - srv.pcap - teardown |
cake | 80ms | plot - cli.pcap - srv.pcap - teardown |
twin_codel_af | 20ms | plot - cli.pcap - srv.pcap - teardown |
twin_codel_af | 80ms | plot - cli.pcap - srv.pcap - teardown |
qdisc | RTT | L4S |
---|---|---|
dualpi2 | 20ms | plot - cli.pcap - srv.pcap - teardown |
dualpi2 | 80ms | plot - cli.pcap - srv.pcap - teardown |
Note: the netem jitter parameters are, in order: total added delay, jitter, and correlation (see the example command after the table below).
netem-jitter-params | RTT | SCE | L4S |
---|---|---|---|
2ms 1ms 10% | 80ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
4ms 2ms 10% | 80ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
10ms 5ms 10% | 80ms | plot - cli.pcap - srv.pcap - teardown | plot - cli.pcap - srv.pcap - teardown |
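As a concrete example, a row such as “2ms 1ms 10%” corresponds to a netem invocation like the one below; the device name and qdisc position are illustrative rather than taken from the fl script.

```bash
# 2 ms added delay, 1 ms jitter, 10% correlation between successive delay values.
tc qdisc add dev eth0 root netem delay 2ms 1ms 10%
```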
The test setup consists of a dumbbell configuration (client, middlebox, and server) for both SCE and L4S. For these tests, all results for each setup were produced on a single physical machine using network namespaces. Flent was used for all tests.
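A minimal sketch of such a namespace dumbbell is shown below; the namespace names, addresses, rate, and qdisc are illustrative, and the repository’s fl script does considerably more than this.

```bash
# Client <-> middlebox <-> server dumbbell built from network namespaces (illustrative).
ip netns add client; ip netns add middle; ip netns add server

ip link add veth-c0 type veth peer name veth-c1    # client <-> middlebox link
ip link add veth-s0 type veth peer name veth-s1    # middlebox <-> server link
ip link set veth-c0 netns client; ip link set veth-c1 netns middle
ip link set veth-s0 netns middle; ip link set veth-s1 netns server

ip netns exec client ip addr add 10.0.1.1/24 dev veth-c0
ip netns exec middle ip addr add 10.0.1.2/24 dev veth-c1
ip netns exec middle ip addr add 10.0.2.2/24 dev veth-s0
ip netns exec server ip addr add 10.0.2.1/24 dev veth-s1
for ns in client middle server; do ip netns exec $ns ip link set lo up; done
ip netns exec client ip link set veth-c0 up; ip netns exec middle ip link set veth-c1 up
ip netns exec middle ip link set veth-s0 up; ip netns exec server ip link set veth-s1 up

ip netns exec middle sysctl -w net.ipv4.ip_forward=1
ip netns exec client ip route add 10.0.2.0/24 via 10.0.1.2
ip netns exec server ip route add 10.0.1.0/24 via 10.0.2.2

# Bottleneck shaping and AQM on the middlebox egress toward the client (example: 50 Mbit, codel).
ip netns exec middle tc qdisc add dev veth-c1 root handle 1: htb default 1
ip netns exec middle tc class add dev veth-c1 parent 1: classid 1:1 htb rate 50mbit
ip netns exec middle tc qdisc add dev veth-c1 parent 1:1 codel

# Run a flent test from the client against a netserver in the server namespace.
ip netns exec server netserver
ip netns exec client flent tcp_upload -H 10.0.2.1 -l 60 -o upload.png
```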
For L4S, commit L4STeam/linux@1014c0e45f63 (from Apr 24, 2020) was used.
The single fl script performs the following functions:
If there are more questions, feel free to file an issue.