TCP high performance and maximum usable bandwidth

The second half of February saw two main topics on the NANOG list: DS3 performance and satellite latency. The long round trip times of satellite connections wreak havoc on TCP performance. To utilize the available bandwidth, TCP needs to keep sending data without waiting for an acknowledgment for at least a full round trip time. In other words: TCP throughput is limited to the window size divided by the round trip time. The TCP window (the amount of data TCP will send before stopping to wait for an acknowledgment) is limited by two factors: the send buffer on the sending system and the 16 bit window size field in the TCP header. So on a 600 ms RTT satellite link, the header field limits TCP to at most 107 kilobytes per second (850 kbps), and if the sender uses a 16 kilobyte buffer (a fairly common size) this drops to as little as 27 kilobytes per second (215 kbps). Because of the TCP slow start mechanism, it also takes several seconds to reach this speed. Fortunately, RFC 1323, TCP Extensions for High Performance, introduces a "window scale" option that increases the maximum TCP window to 1 GB, provided both ends of the connection allocate enough buffer space.
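The arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, a kilobyte is taken as 1024 bytes, and the 45 Mbps link in the last step is an assumed example, not something from the NANOG discussion.

```python
def max_tcp_throughput(window_bytes, rtt_seconds):
    """Upper bound on throughput in bytes/s: at most one window per round trip."""
    return window_bytes / rtt_seconds


def min_window_scale(required_window_bytes):
    """Smallest RFC 1323 window scale shift (0..14) that covers the given window.

    Returns None if even the maximum shift of 14 is not enough.
    """
    for shift in range(15):
        if (65535 << shift) >= required_window_bytes:
            return shift
    return None


RTT = 0.6  # seconds, the satellite round trip time used in the text

# Limit imposed by the 16 bit window field (65535 bytes).
header_limit = max_tcp_throughput(65535, RTT)
# Limit imposed by a 16 kilobyte send buffer.
buffer_limit = max_tcp_throughput(16 * 1024, RTT)

print("header field limit: %.0f KB/s" % (header_limit / 1024))
print("16 KB buffer limit: %.0f KB/s" % (buffer_limit / 1024))

# Window needed to fill an assumed 45 Mbps link at this RTT
# (the bandwidth-delay product), and the scale shift that allows it.
bdp_bytes = 45e6 / 8 * RTT
print("window needed for 45 Mbps: %.0f bytes, scale shift %s"
      % (bdp_bytes, min_window_scale(bdp_bytes)))
```

The largest window the option allows is 65535 shifted left by 14 bits, which is where the 1 GB maximum comes from.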

The other subject that received a lot of attention, the maximum usable bandwidth of a DS3/T3 line, is also related to TCP performance. When the line gets close to being fully utilized, short data bursts (which are very common in IP) will fill up the send queue. When the queue is full, additional incoming packets are discarded. This is called a "tail drop". If the TCP session which loses a packet doesn't support "fast retransmit", or if several packets from the same session are dropped, this TCP session will go into "slow start" and slow down a lot. This often happens to several TCP sessions at once, so they all perform slow start in lockstep. As a result, they all reach the point where the line can't handle the traffic load at the same time, and another small burst will trigger another round of tail drops.

A possible solution is to use Random Early Detection (RED) queuing rather than First In, First Out (FIFO). RED will start dropping more and more packets as the queue fills up, to trigger TCP congestion avoidance and slow down the TCP sessions more gently. But this only works if there aren't (m)any tail drops, which is unlikely if there is only limited buffer space. Unfortunately, Cisco uses a default queue size of 40 packets. Queuing theory tells us this queue will be filled entirely (on average) at 97% line utilization. So at 97%, even a one packet burst will result in a tail drop. The solution is to increase the queue size, in addition to enabling RED. On a Cisco:

interface ATM0
 random-detect
 hold-queue 500 out

This gives RED the opportunity to start dropping individual packets long before the queue fills up entirely and tail drops occur. The price is a somewhat longer queuing delay. At 99% utilization, there will be an average of 98 packets in the queue, but at 45 Mbps this will only introduce a delay of 9 ms.
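The text does not say which queuing model or packet size is behind these figures. As a rough check, the sketch below assumes a simple M/M/1 queue, where the mean number of packets waiting is u^2 / (1 - u) at utilization u, and an average packet size of 500 bytes. Both are assumptions, and the function names are hypothetical.

```python
import math


def mean_packets_waiting(utilization):
    """Mean number of packets waiting in an M/M/1 queue (assumed model)."""
    return utilization ** 2 / (1.0 - utilization)


def utilization_for_mean_waiting(packets):
    """Utilization at which the M/M/1 mean queue length equals `packets`.

    Solves u^2 / (1 - u) = packets for the root between 0 and 1.
    """
    return (-packets + math.sqrt(packets ** 2 + 4 * packets)) / 2.0


def queuing_delay_ms(packets, avg_packet_bytes, line_bps):
    """Time in milliseconds to transmit the queued packets at line rate."""
    return packets * avg_packet_bytes * 8 / line_bps * 1000.0


AVG_PACKET_BYTES = 500  # assumed average packet size
LINE_BPS = 45e6         # 45 Mbps, as in the text

queued = mean_packets_waiting(0.99)
print("mean queue at 99%%: %.0f packets" % queued)
print("delay at 45 Mbps: %.1f ms"
      % queuing_delay_ms(queued, AVG_PACKET_BYTES, LINE_BPS))
print("utilization where the mean queue reaches 40 packets: %.1f%%"
      % (100 * utilization_for_mean_waiting(40)))
```

Under these assumptions the results land close to the figures quoted above. A different packet size changes the delay proportionally.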

Permalink - posted 2002-03-31

April Fools' Day RFCs

April Fools' Day is coming up again! Don't let it catch you by surprise. Over the years, a number of RFCs have been published on April first, such as...

Read the article - posted 2002-04-01

Packet reordering

During the second week of April there was some discussion on reordering of packets on parallel links at Internet Exchanges. Equipment vendors try very hard to make sure this doesn't happen, but this carries the risk that balancing traffic over parallel links doesn't work as well as it should. It is generally accepted that reordering leads to inefficiency or even slowdowns in TCP implementations, but it seems unlikely reordering will happen much unless hosts are connected at the speed of the parallel links (i.e., Gigabit Ethernet) or there is significant congestion.

Permalink - posted 2002-06-29

Ownership of address space

In the first week of May, a message was posted on the NANOG list by someone who had a dispute with one of his ISPs. When it became obvious this dispute wasn't going to be resolved, the ISP wasn't content with simply no longer providing any service: they also contacted the other ISP this network connected to and asked them to stop routing the /22 from the first ISP's range that the (ex-)customer was using. The second ISP complied and the customer network was cut off from the internet. (This all happened on a Sunday afternoon, so it is likely there is more to the story than what was posted on the NANOG list.)

The surprising thing was that many people on the list didn't think this was a very unreasonable thing to do. It is generally accepted that a network using an ISP's address space should stop using these addresses when it no longer connects to that ISP, but in the cases I have been involved with there was always a reasonable time to renumber. Obviously depending on such a grace period is a very dangerous thing to do. You have been warned.

Permalink - posted 2002-06-30

White House National Strategy to Secure Cyberspace draft

At the end of September, the White House published a National Strategy to Secure Cyberspace. It seems that at the last moment, a lot of text was cut and the 60-odd page PDF document offered for download was made a draft, with the government actively soliciting comments. One of the prime recommendations in the document is:

R4-1	A public-private partnership should refine and accelerate the adoption of improved security for Border Gateway Protocol, Internet Protocol, Domain Name System, and others.

Some people say the government wants Secure BGP (S-BGP) to be adopted. It is unclear how reliable these claims are. In any event, S-BGP has been a draft for two years, with no sign of becoming an RFC or implementations being in the works.

In 2001 4th quarter interdomain routing news I ranted about the general problems with strong crypto in the routing system. It is widely assumed BGP is insecure because "anybody can inject any information into the global routing table." It is true that the protocol itself doesn't offer protection against abuse, but since BGP has many hooks for implementing policies, it is not a big problem to create filters that only allow announcements from customers or peers that are known to be good. However, the Routing Registries that are supposed to be the source of this information aren't always 100% accurate, and although their security has greatly improved over the last few years, it is not inconceivable that someone could enter false information in a routing database.
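To make the filtering idea concrete, here is a minimal sketch of the kind of check such a filter performs, written in Python rather than router configuration. The function names, the /24 length limit and the example prefixes (taken from the documentation ranges) are all assumptions for illustration. In practice the allowed list would come from a routing registry or from customer records.

```python
import ipaddress


def build_filter(allowed_prefixes):
    """Parse the prefixes a customer or peer is known to be allowed to announce."""
    return [ipaddress.ip_network(p) for p in allowed_prefixes]


def accept_announcement(prefix, allowed_networks, max_prefix_len=24):
    """Accept a prefix only if it falls inside an allowed network.

    The max_prefix_len cutoff is an assumed policy that rejects
    overly specific announcements.
    """
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > max_prefix_len:
        return False
    return any(
        net.version == allowed.version and net.subnet_of(allowed)
        for allowed in allowed_networks
    )


# Hypothetical customer that is registered for a single /24.
customer_filter = build_filter(["198.51.100.0/24"])

for announced in ["198.51.100.0/24", "203.0.113.0/24", "198.51.100.128/25"]:
    verdict = "accept" if accept_announcement(announced, customer_filter) else "reject"
    print(announced, verdict)
```

A filter like this is only as good as the list behind it, which is exactly the weakness of inaccurate registry data.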

In an effort to make BGP more secure, S-BGP goes way overboard. Not only are BGP announcements supposed to be cryptographically signed (this wouldn't be the worst idea ever, although it remains to be seen whether it is really necessary), routers along the way are also supposed to sign the data. And the source gets to determine who may or may not announce the prefix any further. I see three main problems with this approach:

The CPU time needed to do all this asymmetric crypto
The storage needed for the signatures and related information
The interface between the public key infrastructure and the routers and the circular dependency between the PKI and the routers

And even if all of these problems can be solved, it gets much, much harder to get a BGP announcement up and running. This will lead to unreachability while people are getting their certificates straightened out. Also, routers in colo facilities aren't the best place to store private keys.

Permalink - posted 2002-10-28