<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>iljitsch.com - network</title>
<link>http://network.iljitsch.com/</link>
<description>Iljitsch van Beijnum's network posts</description>

<item xml:lang="en">
  <title>Upgrading Fiber To The Home to terabit speeds</title>
  <description>Last week, Jaap van Till asked me if BGP would be capable of supporting the terabit class interconnectivity that he foresees we’ll need in the future, possibly due to the rise of artificial intelligence. He explains his reasoning in the blog post &lt;a href=&quot;https://theconnectivist.wordpress.com/2024/03/31/what-link-speeds-will-we-need-for-ai/&quot;&gt;What Link speeds will we need for AI&lt;/a&gt;, where he quotes VAN TILL’s CONJECTURE:
&lt;p&gt;

&lt;blockquote&gt;The network connection Wide Area access speed will grow in time until it matches the internal device BUS speed of the more and more complex processors and datastores.&lt;/blockquote&gt;
&lt;p&gt;

And then concludes that 14 Tbps external links will be required in 2039. Today I can get 4 Gbps where I live. So that means a roughly 70% speed increase per year.
&lt;p&gt;
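
A quick back-of-the-envelope check of that growth rate (assuming 15 years between today’s 4 Gbps and the projected 14 Tbps):

```python
# Hypothetical check of the growth rate implied above:
# 4 Gbps in 2024 growing to 14 Tbps in 2039, i.e. over 15 years.
start_gbps = 4
end_gbps = 14_000
years = 2039 - 2024

growth = (end_gbps / start_gbps) ** (1 / years) - 1
print(f"required annual growth: {growth:.0%}")  # roughly 72% per year
```
&lt;p&gt;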

Let’s first get that BGP question out of the way: I see no problems. 25 years ago I ran BGP over 64 and 128 &lt;b&gt;k&lt;/b&gt;bps links without trouble. Six orders of magnitude later, BGP is still fine, and there is no reason to believe that even faster speeds will be a problem, just as long as the packet loss rates remain minimal.
&lt;p&gt;

But what would terabit class network connectivity at home look like?
&lt;p&gt;

Actually, I think we have all the parts to build this today. With Wavelength Division Multiplexing (WDM), it’s possible to transmit multiple data streams through a single fiber by using slightly different wavelengths/frequencies of infrared laser light. Coarse WDM (CWDM) is relatively cheap and appropriate over shorter distances, with 18 wavelengths standardized over high performance fiber. (Fewer over most existing fiber.) For long distances, dense WDM (DWDM) can use as many as 160 wavelengths over a single fiber pair.
&lt;p&gt;

Bandwidth per wavelength is now 100 or 200 Gbps, and is expected to increase in the future. So anything between, say, 10 x 100 Gbps = 1 Tbps and the 20 Tbps used by modern submarine cables &lt;em&gt;should&lt;/em&gt; be possible. The catch is of course the cost.
&lt;p&gt;
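
As a quick sanity check on those aggregates (the wavelength counts and per-wavelength rates are the figures mentioned above, not any particular product):

```python
# Aggregate WDM capacity for two assumed configurations.
def aggregate_tbps(wavelengths, gbps_each):
    return wavelengths * gbps_each / 1000

print(aggregate_tbps(10, 100))   # 1.0 Tbps: a modest CWDM-style system
print(aggregate_tbps(160, 200))  # 32.0 Tbps: dense DWDM at the high end
```
&lt;p&gt;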

The difficulty is with the transmitting side, as this requires a tuned laser per wavelength. On the receiving side, the wavelengths can be split using a prism and hit a set of wideband receivers. As someone who is definitely not in the business of building this equipment, it seems to me that a system with one or a small number of transmitters, a passive optical bus, and a large(r) set of receivers is definitely something that could enjoy radical performance vs cost improvements over time. And it fits perfectly with the most efficient / high speed way to connect homes to the internet that we have today: PON (&lt;a href=&quot;https://en.wikipedia.org/wiki/Passive_optical_network&quot;&gt;passive optical network&lt;/a&gt;). So just add additional wavelengths to existing PON installations to gain more bandwidth in the downstream direction.
&lt;p&gt;

However, now we have a new challenge: TCP/IP is not a good fit for sending the massive data streams that would make good use of such a network. The problem is that TCP tries to adjust its end-to-end data transmission rate to the available bandwidth. This means it needs to wait for acknowledgments from the receiving side to know whether it can increase its transmission rate, maintain it, or must slow down. Downloading 100 MB over a 1 Tbps link takes less than a millisecond. But even over PON, the round-trip time is a millisecond or two. This means that the bottleneck is the number of round trips TCP requires to reach that full terabit speed. Even if that’s an extremely unrealistic 10 RTTs, the total transmission time is now 11 ms, effectively using less than a tenth of the available bandwidth.
&lt;p&gt;
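
A rough sketch of that arithmetic, with assumed values of a 1 ms RTT and 10 ramp-up round trips:

```python
# Back-of-the-envelope sketch of the numbers above (all values assumed).
link_bps = 1e12           # 1 Tbps link
payload_bits = 100e6 * 8  # 100 MB download
rtt_s = 1e-3              # ~1 ms round-trip time over PON
ramp_rtts = 10            # optimistically few RTTs to reach full speed

serialization_s = payload_bits / link_bps      # 0.8 ms at line rate
total_s = ramp_rtts * rtt_s + serialization_s  # ~10.8 ms in practice
utilization = serialization_s / total_s
print(f"{total_s*1e3:.1f} ms total, {utilization:.0%} of the link used")
```
&lt;p&gt;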

So we need to overhaul TCP/IP for the super high speed stuff and instead use something more like circuit switching / time division multiplexing / token passing. Yes, everything old is new again! So, for instance, reserve ten 100 μs timeslots and transmit ten 10 MB “megapackets”.
&lt;p&gt;
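
As a sanity check on those slot sizes (assuming a 1 Tbps link and 100 μs slots): a 10 MB megapacket fits in one slot with headroom to spare.

```python
# How much fits in one 100 microsecond timeslot at 1 Tbps (assumed figures).
link_bps = 1e12
slot_s = 100e-6

slot_capacity_bytes = link_bps * slot_s / 8
print(f"{slot_capacity_bytes/1e6:.1f} MB per slot")  # prints "12.5 MB per slot"
```
&lt;p&gt;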

So I think all of this is highly doable!
&lt;p&gt;

Well, there is the slight challenge of how to pipe all that bandwidth into your laptop without connecting/disconnecting that fiber all the time. Maybe use eight &lt;a href=&quot;https://en.wikipedia.org/wiki/Thunderbolt_(interface)#Thunderbolt_5&quot;&gt;Thunderbolt 5&lt;/a&gt; interfaces in parallel to reach 960 Gbps?</description>
  <link>http://www.iljitsch.com/2024/04-09-upgrading-fiber-to-the-home-to-terabit-speeds.html</link>
  <guid isPermaLink="true">http://www.iljitsch.com/2024/04-09-upgrading-fiber-to-the-home-to-terabit-speeds.html</guid>
  <pubDate>Tue, 09 Apr 2024 11:16:41 GMT</pubDate>
</item>

<item xml:lang="en">
  <title>Should the datacenter be in the middle?</title>
<description>The other day, I landed on this article: &lt;a href=&quot;https://www.linkedin.com/pulse/focus-subsea-network-architecture-ixps-maxie-reynolds/&quot;&gt;In Focus: Subsea Network Architecture: IXPs&lt;/a&gt;. The article takes some time to arrive at its point: that undersea internet exchanges would be a good idea. The most eye-catching part is a variation on this image:
&lt;p&gt;

&lt;img src=&quot;//www.bgpexpert.com/2023/seadcs.png&quot; width=640 height=331&gt;
&lt;p&gt;

As the article starts out discussing how datacenters have been moving away from large cities to take advantage of opportunities such as space, cheap energy and easier cooling, this image seems to suggest that the blue dots are good locations for datacenters and/or internet exchanges in general. And that&apos;s definitely not the point of the 
&lt;a href=&quot;https://www.alexwg.org/publications/PhysRevE_82-056104.pdf&quot;&gt;paper&lt;/a&gt;
that the image is from.
&lt;p&gt;

That paper is very specifically about the best locations to place servers for high speed algorithmic trading on multiple markets some distance away from each other. This immediately explains why there is nothing around the western US: there are simply no stock exchanges / markets there (the red dots in the image).
&lt;p&gt;

The math is more complicated than that, but presumably, in these cases it helps when the servers executing the trading algorithms are in the middle between the &quot;users&quot;, rather than close to one and further from the other(s).
&lt;p&gt;

If you need data from two places far away from each other, then it&apos;s better when each is 25 milliseconds away, as you can then complete your action in 25 ms plus however long it takes to do your own processing. If you&apos;re close to one so it&apos;s 0 ms for one data source and 50 ms for the other, then the entire action takes at least 50 ms.
&lt;p&gt;
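
The trade-off above can be sketched in a few lines (all delays are hypothetical, in milliseconds):

```python
# Sketch of the latency argument above (all delays assumed, in ms).
def completion_time(delay_a, delay_b, processing=0):
    # You have to wait for the slower of the two live data sources.
    return max(delay_a, delay_b) + processing

print(completion_time(25, 25))  # server in the middle: 25 ms
print(completion_time(0, 50))   # server next to one source: 50 ms
```
&lt;p&gt;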

But is that a common situation?
&lt;p&gt;

In general, you can just copy the data beforehand. So this only applies if you&apos;re using &quot;live&quot; data from two or more locations. Videoconferencing with a number of participants could be an example, where a server receives the video from all the participants, mixes it into a single feed and then sends that single feed out to all the participants. If the server is in the middle, this limits the maximum delay. I guess that could be somewhat helpful. But to the degree that it makes sense to have datacenters in the middle of the ocean? I&apos;m not convinced.</description>
  <link>https://www.bgpexpert.com/2023/09-07-should-the-datacenter-be-in-the-middle.html</link>
  <guid isPermaLink="true">https://www.bgpexpert.com/2023/09-07-should-the-datacenter-be-in-the-middle.html</guid>
  <pubDate>Thu, 07 Sep 2023 08:49:00 GMT</pubDate>
</item>

<item xml:lang="en">
  <title>My BGP minilab</title>
  <description>When I wrote &lt;a href=&quot;https://www.oreilly.com/library/view/bgp/9780596002541/&quot;&gt;my first BGP book&lt;/a&gt;
I painstakingly made the config examples on actual Cisco routers. In my opinion, it&apos;s crucial
to make sure that configuration examples that go in a book actually work.
&lt;p&gt;

So when I started writing &lt;a href=&quot;https://www.inet6consult.com/bgpbook/&quot;&gt;my new BGP book&lt;/a&gt;, I did the same. But
this time, I used open source routing software (&lt;a href=&quot;https://frrouting.org/&quot;&gt;FRRouting&lt;/a&gt;)
running in &lt;a href=&quot;https://www.docker.com/&quot;&gt;Docker&lt;/a&gt; containers.
Basically, those &lt;a href=&quot;https://en.wikipedia.org/wiki/OS-level_virtualization&quot;&gt;containers&lt;/a&gt;
are very lightweight virtual machines.
&lt;p&gt;

This makes it possible to run a dozen virtual routers that start up and shut down in just a few seconds.
So it&apos;s very easy to run different examples by starting the required virtual routers with the
configuration for that example.
&lt;p&gt;

This was super useful when I was &lt;em&gt;writing&lt;/em&gt; the book.
&lt;p&gt;

So I thought it would also be very useful for people &lt;em&gt;reading&lt;/em&gt; the book.
&lt;p&gt;

So I&apos;m making the &quot;BGP minilab&quot; with all the config examples from the book available to my readers.
Download version 2022-11 of the minilab that goes with the first version of the book &lt;a href=&quot;2022-11/bgpminilab.zip&quot;&gt;here&lt;/a&gt;.
&lt;p&gt;

You can also run the examples in the minilab if you don&apos;t have the book.
And you can create your own labs based on these scripts.
&lt;p&gt;

The minilab consists of four scripts:
&lt;p&gt;

&lt;ul&gt;
&lt;li&gt;start: to start an example or lab
&lt;li&gt;connectrouter: to connect to an already running virtual router
&lt;li&gt;stoprouters: to stop all running routers
&lt;li&gt;run-gortr: to run the &lt;a href=&quot;https://hub.docker.com/r/cloudflare/gortr&quot;&gt;GoRTR&lt;/a&gt; RPKI cache
&lt;/ul&gt;
&lt;p&gt;

There are Mac/Linux shell script and Windows PowerShell versions of each script.</description>
  <link>https://www.inet6consult.com/bgpminilab/</link>
  <guid isPermaLink="true">https://www.inet6consult.com/bgpminilab/</guid>
  <pubDate>Fri, 11 Nov 2022 12:15:12 GMT</pubDate>
</item>

<item xml:lang="en">
  <title>Oh SNAP! There is more to Wi-Fi ↔︎ Ethernet than I thought</title>
  <description>The tag line for &lt;a href=&quot;https://www.iljitsch.com/2022/06-06-2012-world-ipv6-launch-the-future-is-forever.html&quot;&gt;World IPv6 Launch ten years ago&lt;/a&gt; was &quot;the future is forever&quot;. You know what else seems to be forever? The past. Let&apos;s talk about IEEE 802 LLC/SNAP encapsulation.
&lt;p&gt;

I always thought when you send IP packets over Wi-Fi, the IP packet would go inside an Ethernet frame, and then the Ethernet frame inside an 802.11 frame. Turns out this is not how it works: there is no Ethernet header inside IEEE 802.11 packets/frames¹.
&lt;p&gt;

What actually happens is that packets are &lt;a href=&quot;https://en.wikipedia.org/wiki/Network_bridge&quot;&gt;bridged&lt;/a&gt; between Ethernet and Wi-Fi. Surprising. But the real shock is that the bridging between Ethernet and Wi-Fi is exactly the same as bridging between Ethernet and &lt;a href=&quot;https://en.wikipedia.org/wiki/Fiber_Distributed_Data_Interface#Frame_format&quot;&gt;FDDI&lt;/a&gt;. (An old 100 Mbps fiber ring technology from when Ethernet was still stuck at 10 Mbps.) It&apos;s all laid out in this ancient RFC from 1988: &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc1042&quot;&gt;RFC 1042&lt;/a&gt;.
&lt;p&gt;

Bridging is the process of translating one OSI layer 2 frame format to another. In this case, a Wi-Fi access point translates between the Wi-Fi &lt;a href=&quot;https://en.wikipedia.org/wiki/IEEE_802.11#Layer_2_–_Datagrams&quot;&gt;802.11 header&lt;/a&gt; and the &lt;a href=&quot;https://en.wikipedia.org/wiki/Ethernet_frame#Ethernet_II&quot;&gt;Ethernet II&lt;/a&gt; header.
&lt;p&gt;

Most of the stuff in a Wi-Fi header or Ethernet header is only relevant to the Wi-Fi and Ethernet drivers, respectively. But higher layers in the networking stack do need to interact with two things from the Ethernet header: the MAC addresses and the &lt;a href=&quot;https://en.wikipedia.org/wiki/EtherType&quot;&gt;ethertype&lt;/a&gt;. So bridging between two network protocols is possible if the MAC addresses and ethertype are compatible.
&lt;p&gt;

Which creates a bit of a problem, as the 802.x family, including 802.11, doesn&apos;t do ethertypes. So how do they make sure different packets, such as IP, ARP, IPX, AppleTalk, Wake-on-LAN, et cetera, are interpreted correctly by the receiver? The first try was the &lt;a href=&quot;https://en.wikipedia.org/wiki/Logical_link_control&quot;&gt;LLC&lt;/a&gt; header. But that didn&apos;t accommodate a sufficiently large number of possible protocols, so the &lt;a href=&quot;https://en.wikipedia.org/wiki/Subnetwork_Access_Protocol&quot;&gt;SNAP&lt;/a&gt; header was added on top of LLC. The ethertype goes inside the SNAP header. So now we have all the information we need to translate between 802.x protocols (such as 802.11 Wi-Fi) and Ethernet II. The slide below, which I copied from &lt;a href=&quot;https://didattica-2000.archived.uniroma2.it//TPI1/deposito/tpi1-1213-03-wlan-v1.pdf&quot;&gt;this presentation&lt;/a&gt;, shows how that works:
&lt;p&gt;

&lt;div class=fulldiv&gt;
&lt;img class=fullimg src=&quot;https://www.iljitsch.com/2022/tpi1-1213-03-wlan-v1.png&quot; width=703 height=488&gt;
&lt;/div&gt;
&lt;p&gt;

(Where DEST and SRC are the MAC addresses, type is the ethertype, and AA AA 03 00.00.00 are the values in the LLC and SNAP headers preceding the ethertype. P is the IP packet. Also see the last image &lt;a href=&quot;https://mrncciew.com/2014/09/22/802-11ac-wireless-packet-captures/&quot;&gt;here&lt;/a&gt; for an actual packet dump.)
&lt;p&gt;
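
For the curious, here is a minimal sketch of that LLC/SNAP encapsulation in Python, using the constants from RFC 1042 (the payload is obviously a stand-in):

```python
import struct

# Build the LLC/SNAP encapsulation described above and recover the
# ethertype from it. Per RFC 1042: DSAP and SSAP are 0xAA, the control
# byte is 0x03, the OUI is 00:00:00, then comes the 16-bit ethertype.
def snap_encapsulate(ethertype, payload):
    llc_snap = struct.pack("!BBB3sH", 0xAA, 0xAA, 0x03, b"\x00\x00\x00", ethertype)
    return llc_snap + payload

def snap_ethertype(frame_body):
    dsap, ssap, ctrl, oui, ethertype = struct.unpack("!BBB3sH", frame_body[:8])
    assert (dsap, ssap, ctrl) == (0xAA, 0xAA, 0x03), "not LLC/SNAP"
    return ethertype

body = snap_encapsulate(0x0800, b"...an IP packet...")  # 0x0800 = IPv4
print(hex(snap_ethertype(body)))  # 0x800
```
&lt;p&gt;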

But wait... isn&apos;t Ethernet IEEE 802.3, so shouldn&apos;t it use LLC/SNAP, too?
&lt;p&gt;

Well, kinda-sorta-but-not-really. &lt;a href=&quot;https://www.wired.com/2011/07/speed-matters/&quot;&gt;Remember&lt;/a&gt; that Ethernet was developed by the DIX consortium (DEC, Intel, Xerox) and then handed off to the IEEE for further tinkering and standardization. As a result, there is a difference between the old Ethernet II header, which has the ethertype, and the 802.3 header, where the same place in the header is actually a &lt;em&gt;length&lt;/em&gt; field. (Which we don&apos;t need: the Ethernet hardware can tell how long frames are just fine by itself.)
&lt;p&gt;
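
Receivers can still tell the two apart because the IEEE partitioned the number space: values of 1536 (0x0600) and up in that field are ethertypes, while values up to 1500 are an 802.3 length. A minimal sketch:

```python
# Disambiguate the shared 16-bit type/length field: 0x0600 and up is an
# ethertype (Ethernet II), 1500 and below is an IEEE 802.3 length field.
def frame_kind(type_or_length):
    if type_or_length >= 0x0600:
        return "Ethernet II (ethertype)"
    if type_or_length > 1500:
        return "invalid"
    return "IEEE 802.3 (length)"

print(frame_kind(0x0800))  # IPv4 ethertype: Ethernet II (ethertype)
print(frame_kind(1400))    # payload length: IEEE 802.3 (length)
```
&lt;p&gt;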

So for IP over actual IEEE 802.3 you need at least the LLC header (which supports IPv4 and ARP) but in practice IP over 802.x always uses LLC+SNAP.
&lt;p&gt;

So IP over Ethernet II is actually a relic from the past. But still quite alive. Novell did move to LLC and SNAP, but this just meant that there were now &lt;a href=&quot;https://en.wikipedia.org/wiki/Internetwork_Packet_Exchange#Frame_formats&quot;&gt;four different frame formats&lt;/a&gt; to choose from in your IPX network, and having four incompatible ways to do the exact same thing is never helpful when running a network.
&lt;p&gt;

&lt;p style=&quot;font-size: 80%&quot;&gt;
¹ Remember, OSI layer 2 has frames, OSI layer 3 packets and OSI layer 4 segments. So the TCP segment goes inside an IP packet and the IP packet inside an Ethernet frame, each layer adding a header and sometimes a trailer with information relevant to that layer.
</description>
  <link>http://www.iljitsch.com/2022/07-21-oh-snap-there-is-more-to-wi-fi-ethernet-than-i-thought.html</link>
  <guid isPermaLink="true">http://www.iljitsch.com/2022/07-21-oh-snap-there-is-more-to-wi-fi-ethernet-than-i-thought.html</guid>
  <pubDate>Thu, 21 Jul 2022 14:00:06 GMT</pubDate>
</item>

<item xml:lang="en">
  <title>OSPF: time to get rid of the totally not so stubby legacy</title>
<description>Recently, I was looking through some networking certification material. A very large part of it was about OSPF. That&apos;s fair, OSPF is probably the most widely used routing protocol in IP networks. But the poor students were subjected to a relentless sequence of increasingly baroquely named features: &lt;em&gt;stub areas&lt;/em&gt;, &lt;em&gt;not-so-stubby-areas&lt;/em&gt;, &lt;em&gt;totally stubby areas&lt;/em&gt;, culminating in &lt;em&gt;totally not-so-stubby areas&lt;/em&gt;.
&lt;p&gt;

Can we please get rid of some of that legacy? And if not from the standard documents or the router implementations, then at least from the certification requirements and training materials?
&lt;p&gt;

&lt;h2&gt;Shortest path first, but not so fast&lt;/h2&gt;
&lt;p&gt;

The Open Shortest Path First routing protocol (OSPF, &lt;a href=&quot;https://www.rfc-editor.org/standards&quot;&gt;Internet Standard&lt;/a&gt; 54) was first defined in &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc1131&quot;&gt;RFC 1131&lt;/a&gt; in 1989. So in internet time, OSPF is truly ancient. The base OSPFv2 specification is over 200 pages, with additional extensions in separate documents spanning the early 1990s to the late 2010s.
&lt;p&gt;

OSPF is powered by Edsger Dijkstra&apos;s &lt;a href=&quot;https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm&quot;&gt;shortest path first&lt;/a&gt; algorithm. SPF is a relatively efficient algorithm for finding the shortest path between two places, in the real world or in a network. Still, in a large network there are a lot of paths to check before you can be sure you&apos;ve found the shortest one. The problem here is that for a network that&apos;s 10 times larger, SPF needs 60 times as long to run. So if a router in a network with, say, 100 routers needs a second to do its SPF calculations after an update, in a network with 1000 routers that takes a minute, and in a network with 10,000 routers an hour.
&lt;p&gt;
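
For reference, here is a minimal sketch of that shortest path first algorithm, run on a made-up five-router topology (the link costs are arbitrary):

```python
import heapq

# Minimal Dijkstra/SPF: compute the cost of the shortest path from one
# router to every other router in the topology.
def spf(graph, source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry, already found a shorter path
        for neighbor, cost in graph[node]:
            nd = d + cost
            if nd >= dist.get(neighbor, float("inf")):
                continue  # no improvement
            dist[neighbor] = nd
            heapq.heappush(heap, (nd, neighbor))
    return dist

topology = {  # hypothetical routers A-E with symmetric link costs
    "A": [("B", 1), ("C", 4)],
    "B": [("A", 1), ("C", 2), ("D", 5)],
    "C": [("A", 4), ("B", 2), ("D", 1)],
    "D": [("B", 5), ("C", 1), ("E", 3)],
    "E": [("D", 3)],
}
print(spf(topology, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4, 'E': 7}
```
&lt;p&gt;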

So in order to make OSPF useful in large networks, you can split your network into different &lt;em&gt;areas&lt;/em&gt;. The SPF calculations are then confined to the routers within each area. So rather than calculate SPF over a 10,000-router network, you could have 100 areas with 100 routers each. Then a router that connects two areas would have to calculate SPF over 100 routers for each of its two areas, so 2 seconds rather than an hour worth of SPF calculations.
&lt;p&gt;

But if each of those 10,000 routers still injects two, three or four address blocks into OSPF, that means the OSPF database will have something like 30,000 entries. So now updating and remembering all those address blocks becomes a bottleneck. Solution: &lt;em&gt;summarize&lt;/em&gt; link advertisements. So if routers in area 35 advertise address blocks 10.35.1.x, 10.35.2.x, … 10.35.95.x, rather than push out all that information to all 10,000 routers throughout the network, the &lt;em&gt;area border routers&lt;/em&gt; for area 35 simply say “10.35.x.x” to the rest of the network.
&lt;p&gt;
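
That summarization is easy to illustrate with Python&apos;s ipaddress module. The blocks below are the hypothetical ones from the example, and the loop finds the shortest prefix covering all of them (in practice an operator would typically just configure the whole 10.35.0.0/16):

```python
import ipaddress

# Hypothetical area 35 address blocks: 10.35.1.0/24 through 10.35.95.0/24.
blocks = [ipaddress.ip_network(f"10.35.{n}.0/24") for n in range(1, 96)]

# Grow the prefix until it covers every block in the area.
summary = blocks[0]
while not all(b.subnet_of(summary) for b in blocks):
    summary = summary.supernet()
print(summary)  # 10.35.0.0/17
```
&lt;p&gt;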

Even better: if an area only connects to the “backbone” area (area 0) and doesn&apos;t learn any routing information from other areas or from outside OSPF, it&apos;s a &lt;em&gt;stub area&lt;/em&gt; that really doesn&apos;t even need to know anything that&apos;s happening in the rest of the network, so let&apos;s give it a &lt;em&gt;default route&lt;/em&gt; to reach the rest of the world.
&lt;p&gt;

&lt;h2&gt;Variations on a stubby theme&lt;/h2&gt;
&lt;p&gt;

Stub areas still have &lt;em&gt;some&lt;/em&gt; OSPF routing information from other areas. We can get rid of that too, and then we have a &lt;em&gt;totally stubby&lt;/em&gt; area.
&lt;p&gt;

On the other hand, maybe we want to import external routing information into OSPF even in our stub area, and then propagate that external information to other areas. This makes for a &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc1587&quot;&gt;not-so-stubby&lt;/a&gt; area.
&lt;p&gt;

And who said you can&apos;t have your cake and eat it: let&apos;s make our totally stubby area not-so-stubby, and we&apos;ll have a &lt;em&gt;totally not-so-stubby&lt;/em&gt; area, guaranteeing certification income for years to come. (See Wikipedia&apos;s &lt;a href=&quot;https://en.wikipedia.org/wiki/Open_Shortest_Path_First#Totally_stubby_area&quot;&gt;page on OSPF&lt;/a&gt; for more details.)
&lt;p&gt;

&lt;h2&gt;Spring cleaning&lt;/h2&gt;
&lt;p&gt;

As protocol designers, we&apos;re really good at adding more capabilities, more options. As network architects and engineers, we&apos;re really good at adding complexity to make our networks do something they won&apos;t do out of the box. But we can&apos;t just keep adding options and complexity without ever taking any of it away. At least not if we want to have a fighting chance at teaching our craft to the next generation so we can retire at some point.
&lt;p&gt;

Our routers/computers are now 1000 times as fast and have 1000 times the memory of the &lt;a href=&quot;http://www.iljitsch.com/2020/11-22-the-one-perfect-sorting-algorithm.html&quot;&gt;68030-based routers/computers&lt;/a&gt; of 1990. OSPF implementations support &lt;a href=&quot;https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/iproute_ospf/configuration/15-sy/iro-15-sy-book/iro-incre-spf.pdf&quot;&gt;incremental SPF&lt;/a&gt;.
&lt;p&gt;

10,000 routers in one area will melt the network operations center long before the SPF calculations melt the router CPUs. I&apos;ve personally worked on a network with 600 routers in area 0 &lt;em&gt;back in 1999&lt;/em&gt;. SPF performance was the least of our concerns.
&lt;p&gt;

So I&apos;m calling it: OSPF areas and summarization are now legacy. New and current OSPF networks should just use a flat area 0 rather than try to micromanage the information flow between areas. Students should no longer have to learn how areas work, and only be informed about the various flavors of stubbiness as an example of humorous naming that doesn&apos;t age well.
</description>
  <link>http://www.iljitsch.com/2022/05-12-ospf-time-to-get-rid-of-the-totally-not-so-stubby-legacy.html</link>
  <guid isPermaLink="true">http://www.iljitsch.com/2022/05-12-ospf-time-to-get-rid-of-the-totally-not-so-stubby-legacy.html</guid>
  <pubDate>Thu, 12 May 2022 10:50:45 GMT</pubDate>
</item>

</channel>
</rss>
