L4S development hub

SCE-L4S ECT(1) Test Results

This directory contains a re-run of the ECT(1) tests performed at @heistp/sce-l4s-ect1. More precisely, it provides the test results for L4S, after fixing a bug reported in the original set of experiments (see commit#xxx).

In the following sections, the text of the README is unchanged from its original version, and hence might no longer match what is shown in the displayed graphs.

All the data generated by the test is also available there (in case the pcaps get stripped from this repo): https://drive.google.com/drive/folders/1sj9Ox6nHJN-v3xrK4ginlfO3AewAsy85

Key Findings
Elaboration on Key Findings
Full Results
Appendix
1. Test Setup

Key Findings

In the L4S reference implementation, RFC 3168 bottleneck detection is unreliable in at least the following ways:
- False negatives (undetected RFC 3168 bottlenecks) occur with tightened AQM settings for Codel, RED and PIE, resulting in the starvation of competing traffic (in Scenario 2, see results for the aforementioned qdiscs).
- False positives (L4S bottlenecks incorrectly identified as RFC 3168) occur in the presence of about 2ms or more of jitter, resulting in under-utilization (see the L4S results in Scenario 6). Further false positives also occur at low bandwidths, with the same effect (see Scenario 1 at 5Mbit, with 80ms or 160ms RTT).
- Insensitivity to the delay-variation signal occurs when packet loss is experienced. If the detection is currently for L4S, it will remain so, and likewise for RFC 3168. This interacts adversely with dropping AQMs.
In the L4S reference implementation, packet loss is apparently not treated as a congestion signal, unless the detection algorithm has placed it in the RFC 3168 compatible mode. This does not adhere to the principle of effective congestion control (for one example, in Scenario 2, see the pfifo results for L4S).
Ultra-low delay, defined here as queueing delay <= ~1ms, is not achievable for the typically bursty traffic on the open Internet without significant reductions in utilization, and should therefore not be a key selection criterion between the two proposals when it comes to the ECT(1) codepoint decision (in Scenario 5, see Prague utilization in L4S results, compared to twin_codel_af utilization with Codel’s burst-tolerant SCE marking behavior, in the SCE results).
Ultra-low delay is achievable in the SCE architecture on appropriate paths, currently by using DSCP as a classifier to select tightened AQM settings (in Scenario 1, see 50Mbit and 250Mbit cases at 20ms RTT).

Elaboration on Key Findings

Whenever you rely on a heuristic, rather than an explicit signal, you need to establish:

which cases may result in false-positive detections (defined here as detecting a path as a classic AQM when in fact it is providing L4S signalling),
which may result in false-negative detections (defined here as failing to recognise a classic AQM as such), and
what circumstances may result in an unintentional desensitisation of the heuristic.

You also need to determine how severe the consequences of these failures are, which in this case means checking the degree of unfairness to competing traffic that results, and the impact on the performance of the L4S flow itself. This is what we set out to look for.
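Because "positive" here means "detected as classic", the two failure modes are easy to swap by accident. The following is a minimal sketch that labels a single detection result using the definitions above. The type and function names are illustrative only and do not come from the TCP Prague code.

```python
from enum import Enum


class Bottleneck(Enum):
    CLASSIC = "RFC 3168 (classic) AQM"
    L4S = "L4S AQM"


def detection_outcome(actual: Bottleneck, detected: Bottleneck) -> str:
    """Label one detection result using the definitions in this section."""
    if actual is detected:
        return "correct"
    if detected is Bottleneck.CLASSIC:
        # The path provides L4S signalling but was judged to be classic.
        return "false positive"
    # The path is a classic AQM but was not recognised as one.
    return "false negative"


print(detection_outcome(Bottleneck.L4S, Bottleneck.CLASSIC))
print(detection_outcome(Bottleneck.CLASSIC, Bottleneck.L4S))
```

A false negative is the case that harms competing traffic, while a false positive harms the L4S flow itself.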

Utopia

First, to give some credit, the “classic AQM detection heuristic” does appear to work in some circumstances, as we can see in the following plot:

When everything goes well
Figure 1

When faced with a single-queue Codel or PIE AQM at default parameters, TCP Prague appears to successfully switch into its fallback mode and compete with reasonable fairness. Under good network conditions, it also correctly detects an L4S queue at the bottleneck. It even successfully copes with the tricky case of the bottleneck being changed between DualQ-PI2 and a PIE instance with ECN disabled, though it takes several sawtooth cycles to switch back into L4S mode after DualQ-PI2 is restored to the path. We suspect this represents the expected behaviour of the heuristic, from its authors’ point of view.

However, we didn’t have to expand our search very far to find cases that the heuristic did not cope well with, and some of which even appeared to break TCP Prague’s congestion control entirely. That is where our concern lies.

False Negatives

Hunting for the wrong answer
Figure 2

False-negative detections are the most serious, when it comes to maintaining “friendly coexistence” with conventional traffic. We found them in three main areas:

Using RED with a limit of 150000, in which the heuristic can oscillate between detection states (see Figure 2),
Codel and PIE instances tuned for shorter path lengths than default, in which the delay-variance signal that the heuristic relies upon is attenuated (see Figure 3),
Queues which signal congestion with packet-drops instead of ECN marks, including dumb drop-tail FIFOs (both deep and shallow) which represent the majority of queues in today’s Internet, and PIE with ECN support disabled as it is in DOCSIS-3.1 cable modems. We hypothesise this is due to desensitising of the heuristic in the presence of drops, combined with a separate and more serious fault that we’ll discuss later.

Codel 1q 20ms target
Figure 3

The above failure scenarios are not at all exotic, and can be encountered either by accident, in case of a mis-configuration, or on purpose, when an AQM is configured to prioritize low delay or low memory consumption over utilization. This should cast serious doubt over reliance on this heuristic for maintaining effective congestion control on the Internet. By contrast, SCE flows encountering these same scenarios behave indistinguishably from normal CUBIC or NewReno flows.

False Positives

Serialisation killer
Figure 4

False-positive detections undermine L4S performance, as measured by the criteria of maintaining minimum latency and maximum throughput on suitably fitted networks. We found these in three main areas:

Low-capacity paths (see Figure 4 above for a 5Mbps result) introduce enough latency variance via the serialisation delay of individual packets to trigger the heuristic. This prevents L4S from using the full capacity of these links, where full utilisation is especially desirable.
Latency variation introduced by bursty and jittery paths, such as those including a simulated wifi segment, also triggers the heuristic. This occurs even if the wifi link is never the overall bottleneck in the path, and the actual bottleneck has L4S support.
After the bottleneck shifts from a conventional AQM to an L4S one, it takes a number of seconds for the heuristic to notice this, usually over several AIMD sawtooth cycles.

L4S flows affected by a false-positive detection will have their throughput cut to significantly less than the true path capacity, especially if competing at the bottleneck with unaffected L4S flows.
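To give a sense of scale for the low-capacity case in the list above, the sketch below computes the per-packet serialisation delay at the three Scenario 1 bandwidths. The 1500-byte packet size is an assumption for illustration and is not taken from the test configuration.

```python
def serialisation_delay_ms(packet_bytes: int, rate_mbit: float) -> float:
    """Time to clock one packet onto a link, in milliseconds."""
    bits = packet_bytes * 8
    return bits / (rate_mbit * 1_000_000) * 1000


MTU_BYTES = 1500  # assumed packet size; not taken from the test configuration

for rate in (5, 50, 250):  # the Scenario 1 bandwidths, in Mbit/s
    delay = serialisation_delay_ms(MTU_BYTES, rate)
    print(f"{rate:>3} Mbit: {delay:.3f} ms per packet")
```

Under that assumption, a full-size packet occupies a 5Mbit link for about 2.4ms, which is the same order as the roughly 2ms of jitter that the Key Findings report as enough to cause false positives.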

Desensitisation

Ribbed for nobody's pleasure
Figure 5

Desensitising of the heuristic appears to occur in the presence of packet drops (see Figure 5). We are not certain why this would have been designed in, although one hypothesis is that it was added to improve behaviour on the “capacity reduction” test we presented at an earlier TSVWG interim meeting. During that test, we noticed that L4S previously exhibited a lot of packet loss, followed by a long recovery period with almost no goodput. Now, there is still a lot of loss at the reduction stage, but the recovery time is eliminated.

This desensitising means that TCP Prague remains in the L4S mode when in fact the path produces conventional congestion control signals by packet loss instead of ECN marks. The exponential growth of slow-start means that the first loss is experienced before the heuristic has switched over to the classic fallback mode, even if it occurs only after filling an 80ms path and a 250ms queue (which are not unusual on Internet paths). However, this would not necessarily be a problem as long as packet loss is always treated as a conventional congestion signal, and responded to with the conventional Multiplicative Decrease.
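As a rough illustration of how quickly slow-start reaches the first loss, the sketch below counts the window doublings needed to exceed the path BDP plus the queue capacity. It is a simplified textbook model, not TCP Prague's code: the 50Mbit rate, the initial window of 10 packets and the 1500-byte segment size are assumptions, and it ignores the RTT growth caused by the queue filling.

```python
import math


def slow_start_rounds(rate_mbit: float, rtt_ms: float, queue_ms: float,
                      initial_window: int = 10, mss_bytes: int = 1500) -> int:
    """Count RTT rounds of window doubling until the window exceeds
    the path BDP plus the queue capacity (both expressed in packets)."""
    bytes_per_ms = rate_mbit * 1_000_000 / 8 / 1000
    capacity_packets = math.ceil(bytes_per_ms * (rtt_ms + queue_ms) / mss_bytes)
    window, rounds = initial_window, 0
    while window <= capacity_packets:
        window *= 2
        rounds += 1
    return rounds


# 80ms path and 250ms queue, as in the text; 50Mbit is an assumed rate.
print(slow_start_rounds(rate_mbit=50, rtt_ms=80, queue_ms=250))
```

Under these assumptions, the window overshoots the combined capacity after about eight doublings, so any heuristic that needs longer than that to accumulate evidence will still be in L4S mode when the first drop arrives.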

Ignoring Packet Loss

Unfortunately, that brings us to the final flaw in TCP Prague’s congestion control that we identified. When in the classic fallback mode, TCP Prague does indeed respond to loss in essentially the correct manner. However, when in L4S mode, it appears to ignore loss entirely for the purposes of congestion control (see Figure 6). We repeatably observed full utilisation of the receive window in the face of over 90% packet loss. A competing TCP CUBIC flow was completely starved of throughput; exactly the sort of behaviour that occurred during the congestion collapse events of the 1980s, which the AIMD congestion control algorithm was introduced to solve.

Absolutely Comcastic
Figure 6

This is not effective congestion control.

Ultra Low Delay

Foremost in L4S’ key goals is “Consistently ultra low latency”. A precise definition of this is difficult to find in their documentation, but conversations indicate that they aim to achieve under 1ms of peak queue delay. We consider this to be an unachievable goal on the public Internet, due to the jitter and burstiness of real traffic and real Internet paths. Even the receive path of a typical Ethernet NIC has about 1ms of jitter, due to interrupt latency designed in to reduce CPU load.

Some data supporting this conclusion is included in the appendix, which shows that over even modest geographical distances on wired connections, the jitter on the path can be larger than the peak delay L4S targets. Over intercontinental distances it is larger still. But this jitter has to be accommodated in the queue to maintain full throughput, which is another stated L4S goal.

To accommodate these real-world effects, the SCE reference implementation defaults to 2.5ms target delay (without the low-latency PHB), and accepts short-term delay excursions without excessive congestion signalling.

The L4S congestion signalling strategy is much more aggressive, so that encountering this level of jitter causes a severe reduction in throughput - all the more so because this also triggers the classic AQM detection heuristic.

The following two plots (Figure 7 and Figure 8) illustrate the effect of adding a simulated wifi link to a typical 80ms Internet path - first with an SCE setup, then with an L4S one. These plots have the same axis scales. The picture is broadly similar on a 20ms path, too.

Wireless SCE
Figure 7

Wireless L4S
Figure 8

A larger question might be: what should “ultra low delay” be defined as, in an Internet context? Perhaps we should refer to what queuing delay is typically observed today. As an extreme outlier, this author has personally experienced over 40 seconds of queue delay, induced by a provisioning shaper at a major ISP. Most good network engineers would agree that even 4 seconds is excessive. A “correctly sized” drop-tail FIFO might reach 400ms during peak traffic hours, when capacity is stretched and available bandwidth per subscriber is lower than normal - so let’s take that as our reference point.

Compared to 400ms, a conventional AQM might show a 99th-percentile delay of 40ms under sustained load. We can reasonably call that “low latency”, as it’s comparable to a single frame time of standard-definition video (at 25 fps), and well within the preferred jitter buffer dimensions of typical VoIP clients. So perhaps “ultra low delay” is reasonably defined as an order of magnitude better than that, at 4ms; that’s comparable to the frame time of a high-end gaming monitor.
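The reference points used in this argument step down by a factor of ten each time, and the two frame-time comparisons can be checked with simple arithmetic. The sketch below does both. The 240 Hz refresh rate is an assumption standing in for "a high-end gaming monitor", which the text does not specify.

```python
def frame_time_ms(frames_per_second: float) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 / frames_per_second


# Reference delays named in the text, each a factor of ten below the last.
ladder_ms = {
    "extreme outlier (provisioning shaper)": 40_000,
    "excessive": 4_000,
    "correctly sized drop-tail FIFO at peak": 400,
    "conventional AQM, 99th percentile": 40,
    "proposed ultra low delay": 4,
}

for label, delay in ladder_ms.items():
    print(f"{delay:>6} ms  {label}")

print(frame_time_ms(25))   # standard-definition video at 25 fps
print(frame_time_ms(240))  # assumed refresh rate for a gaming monitor
```

At 25 fps one frame lasts 40ms, and at the assumed 240 Hz one frame lasts a little over 4ms, which matches the two comparisons made above.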

Given experience with SCE’s default 2.5ms target delay, we think 4ms peak delay is realistically achievable on a good, short Internet path with full throughput. The Codel AQM we’ve chosen for SCE can already achieve that in favourable conditions, while still obtaining reasonable throughput and latency control when conditions are less than ideal.

There is nothing magical about the codepoint used for this signalling; both L4S and SCE should be able to achieve the same performance if the same algorithms are applied. But SCE aims for an achievable goal with the robustness to permit safe experimentation, and this may fundamentally explain the contrast in the plots above.

Full Results

In the following results, the links are named as follows:

plot: the plot svg
cli.pcap: the client pcap
srv.pcap: the server pcap
teardown: the teardown log, showing qdisc config and stats

Scenario 1: One Flow

Bandwidth	RTT	SCE	L4S
5Mbit	20ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
5Mbit	80ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
5Mbit	160ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
50Mbit	20ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
50Mbit	80ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
50Mbit	160ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
250Mbit	20ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
250Mbit	80ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
250Mbit	160ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown

Scenario 2: Two Flow Competition

RTT	qdisc	SCE
20ms	codel1q	plot - cli.pcap - srv.pcap - teardown
80ms	codel1q	plot - cli.pcap - srv.pcap - teardown
160ms	codel1q	plot - cli.pcap - srv.pcap - teardown
20ms	codel1q(40ms)	plot - cli.pcap - srv.pcap - teardown
80ms	codel1q(40ms)	plot - cli.pcap - srv.pcap - teardown
160ms	codel1q(40ms)	plot - cli.pcap - srv.pcap - teardown
20ms	codel1q(20ms)	plot - cli.pcap - srv.pcap - teardown
80ms	codel1q(20ms)	plot - cli.pcap - srv.pcap - teardown
160ms	codel1q(20ms)	plot - cli.pcap - srv.pcap - teardown
20ms	lfq_cobalt	plot - cli.pcap - srv.pcap - teardown
80ms	lfq_cobalt	plot - cli.pcap - srv.pcap - teardown
160ms	lfq_cobalt	plot - cli.pcap - srv.pcap - teardown
20ms	pfifo(1000)	plot - cli.pcap - srv.pcap - teardown
80ms	pfifo(1000)	plot - cli.pcap - srv.pcap - teardown
160ms	pfifo(1000)	plot - cli.pcap - srv.pcap - teardown
20ms	pfifo(50)	plot - cli.pcap - srv.pcap - teardown
80ms	pfifo(50)	plot - cli.pcap - srv.pcap - teardown
160ms	pfifo(50)	plot - cli.pcap - srv.pcap - teardown
20ms	pie	plot - cli.pcap - srv.pcap - teardown
80ms	pie	plot - cli.pcap - srv.pcap - teardown
160ms	pie	plot - cli.pcap - srv.pcap - teardown
20ms	pie(1000p/20ms)	plot - cli.pcap - srv.pcap - teardown
80ms	pie(1000p/20ms)	plot - cli.pcap - srv.pcap - teardown
160ms	pie(1000p/20ms)	plot - cli.pcap - srv.pcap - teardown
20ms	pie(100p/20ms)	plot - cli.pcap - srv.pcap - teardown
80ms	pie(100p/20ms)	plot - cli.pcap - srv.pcap - teardown
160ms	pie(100p/20ms)	plot - cli.pcap - srv.pcap - teardown
20ms	pie(100p/5ms)	plot - cli.pcap - srv.pcap - teardown
80ms	pie(100p/5ms)	plot - cli.pcap - srv.pcap - teardown
160ms	pie(100p/5ms)	plot - cli.pcap - srv.pcap - teardown
20ms	pie(noecn)	plot - cli.pcap - srv.pcap - teardown
80ms	pie(noecn)	plot - cli.pcap - srv.pcap - teardown
160ms	pie(noecn)	plot - cli.pcap - srv.pcap - teardown
20ms	red(150000)	plot - cli.pcap - srv.pcap - teardown
80ms	red(150000)	plot - cli.pcap - srv.pcap - teardown
160ms	red(150000)	plot - cli.pcap - srv.pcap - teardown
20ms	red(400000)	plot - cli.pcap - srv.pcap - teardown
80ms	red(400000)	plot - cli.pcap - srv.pcap - teardown
160ms	red(400000)	plot - cli.pcap - srv.pcap - teardown
20ms	twin_codel_af	plot - cli.pcap - srv.pcap - teardown
80ms	twin_codel_af	plot - cli.pcap - srv.pcap - teardown
160ms	twin_codel_af	plot - cli.pcap - srv.pcap - teardown

RTT	qdisc	L4S
20ms	codel1q	plot - cli.pcap - srv.pcap - teardown
80ms	codel1q	plot - cli.pcap - srv.pcap - teardown
160ms	codel1q	plot - cli.pcap - srv.pcap - teardown
20ms	codel1q(40ms)	plot - cli.pcap - srv.pcap - teardown
80ms	codel1q(40ms)	plot - cli.pcap - srv.pcap - teardown
160ms	codel1q(40ms)	plot - cli.pcap - srv.pcap - teardown
20ms	codel1q(20ms)	plot - cli.pcap - srv.pcap - teardown
80ms	codel1q(20ms)	plot - cli.pcap - srv.pcap - teardown
160ms	codel1q(20ms)	plot - cli.pcap - srv.pcap - teardown
20ms	dualpi2	plot - cli.pcap - srv.pcap - teardown
80ms	dualpi2	plot - cli.pcap - srv.pcap - teardown
160ms	dualpi2	plot - cli.pcap - srv.pcap - teardown
20ms	pfifo(1000)	plot - cli.pcap - srv.pcap - teardown
80ms	pfifo(1000)	plot - cli.pcap - srv.pcap - teardown
160ms	pfifo(1000)	plot - cli.pcap - srv.pcap - teardown
20ms	pfifo(50)	plot - cli.pcap - srv.pcap - teardown
80ms	pfifo(50)	plot - cli.pcap - srv.pcap - teardown
160ms	pfifo(50)	plot - cli.pcap - srv.pcap - teardown
20ms	pie	plot - cli.pcap - srv.pcap - teardown
80ms	pie	plot - cli.pcap - srv.pcap - teardown
160ms	pie	plot - cli.pcap - srv.pcap - teardown
20ms	pie(1000p/20ms)	plot - cli.pcap - srv.pcap - teardown
80ms	pie(1000p/20ms)	plot - cli.pcap - srv.pcap - teardown
160ms	pie(1000p/20ms)	plot - cli.pcap - srv.pcap - teardown
20ms	pie(100p/20ms)	plot - cli.pcap - srv.pcap - teardown
80ms	pie(100p/20ms)	plot - cli.pcap - srv.pcap - teardown
160ms	pie(100p/20ms)	plot - cli.pcap - srv.pcap - teardown
20ms	pie(100p/5ms)	plot - cli.pcap - srv.pcap - teardown
80ms	pie(100p/5ms)	plot - cli.pcap - srv.pcap - teardown
160ms	pie(100p/5ms)	plot - cli.pcap - srv.pcap - teardown
20ms	pie(noecn)	plot - cli.pcap - srv.pcap - teardown
80ms	pie(noecn)	plot - cli.pcap - srv.pcap - teardown
160ms	pie(noecn)	plot - cli.pcap - srv.pcap - teardown
20ms	red(150000)	plot - cli.pcap - srv.pcap - teardown
80ms	red(150000)	plot - cli.pcap - srv.pcap - teardown
160ms	red(150000)	plot - cli.pcap - srv.pcap - teardown
20ms	red(400000)	plot - cli.pcap - srv.pcap - teardown
80ms	red(400000)	plot - cli.pcap - srv.pcap - teardown
160ms	red(400000)	plot - cli.pcap - srv.pcap - teardown

Scenario 3: Bottleneck Shift

RTT	SCE	L4S
20ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
80ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown

Scenario 4: Capacity Reduction

Bandwidth1	RTT	SCE	L4S
40Mbit	20ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
40Mbit	80ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
5Mbit	20ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
5Mbit	80ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown

Scenario 5: WiFi Burstiness

qdisc	RTT	SCE
cake	20ms	plot - cli.pcap - srv.pcap - teardown
cake	80ms	plot - cli.pcap - srv.pcap - teardown
twin_codel_af	20ms	plot - cli.pcap - srv.pcap - teardown
twin_codel_af	80ms	plot - cli.pcap - srv.pcap - teardown

qdisc	RTT	L4S
dualpi2	20ms	plot - cli.pcap - srv.pcap - teardown
dualpi2	80ms	plot - cli.pcap - srv.pcap - teardown

Scenario 6: Jitter

Note: netem jitter params are: total added delay, jitter and correlation

netem-jitter-params	RTT	SCE	L4S
2ms 1ms 10%	80ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
4ms 2ms 10%	80ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
10ms 5ms 10%	80ms	plot - cli.pcap - srv.pcap - teardown	plot - cli.pcap - srv.pcap - teardown
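For readers unfamiliar with netem, the three values in each row above correspond to the positional arguments of its `delay` option (delay, jitter, correlation). The sketch below builds, without running, a `tc` command for each row. The interface name `mid.r` is hypothetical, and the fl script may apply these settings differently, for example by splitting the total delay across more than one interface.

```python
def netem_command(dev: str, delay: str, jitter: str, correlation: str) -> list[str]:
    """Build (but do not run) a tc command adding netem delay with jitter."""
    return ["tc", "qdisc", "add", "dev", dev, "root",
            "netem", "delay", delay, jitter, correlation]


rows = [("2ms", "1ms", "10%"), ("4ms", "2ms", "10%"), ("10ms", "5ms", "10%")]
for row in rows:
    # "mid.r" is a hypothetical interface name, not one from the fl script.
    print(" ".join(netem_command("mid.r", *row)))
```

Running such a command requires root privileges and replaces any existing root qdisc on the interface, so the sketch only prints the command lines.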

Appendix

Test Setup

The test setup consists of a dumbbell configuration (client, middlebox and server) for both SCE and L4S. For these tests, all results for each of the two were produced on a single physical machine using network namespaces. Flent was used for all tests.
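As a rough idea of what a namespace-based dumbbell involves, the sketch below generates the `ip` commands for three namespaces joined by two veth pairs. It is not taken from the fl script: all namespace and interface names are hypothetical, and address assignment, link bring-up, forwarding and qdisc configuration are omitted.

```python
def dumbbell_commands(client: str = "cli", middle: str = "mid",
                      server: str = "srv") -> list[str]:
    """Return ip(8) commands for a client - middlebox - server topology.

    All namespace and interface names are hypothetical.
    """
    cmds = [f"ip netns add {ns}" for ns in (client, middle, server)]
    # One veth pair on each side of the middlebox.
    for end, mid_if in ((client, f"{middle}.l"), (server, f"{middle}.r")):
        end_if = f"{end}.0"
        cmds += [
            f"ip link add {end_if} type veth peer name {mid_if}",
            f"ip link set {end_if} netns {end}",
            f"ip link set {mid_if} netns {middle}",
        ]
    return cmds


for cmd in dumbbell_commands():
    print(cmd)
```

The commands are only printed here, since creating namespaces requires root privileges.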

For L4S, commit L4STeam/linux@1014c0e45f63 (from Apr 24, 2020) was used.

The single fl script performs the following functions:

updates itself onto the management server and clients
runs tests (./fl run), plots results (./fl plot) and pushes them to a server
acts as a harness for flent, setting up and tearing down the test config
generates this README.md from a template

If there are more questions, feel free to file an issue.