#VelocityConf: Resolution for a Faster Site

After the keynotes today, I decided to dive deep and go to a talk about how DNS resolution affects website performance. I learned some really interesting things about how the different operating systems actually handle DNS resolution, which explains a lot about how slowly things fail when your network goes completely down.

First, I want to say how absolutely terrible it is that Windows resolves DNS entries at half the speed of Mac and Unix. We're talking 600ms vs. 200-300ms. That's pretty insane. Part of that seems to be due to the way that failures are handled and how IPv6 fits into the equation.

It seems that when doing standard DNS lookups in an IPv4-only or IPv6-only world (but not dual stack), a complete resolution failure is going to take at least 20 seconds, no matter what platform you are using. This is largely down to the configured timeouts, which give the operating system time to try three DNS servers in turn and then retry the first one. All of this is done serially, so with the various backoffs it can be pretty painful.
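
To put rough numbers on that, here's a toy Python model of the serial retry sequence. The three-server list and the 5-second per-attempt timeout are my own assumptions based on common resolver defaults, not figures from the talk; they just happen to add up to the 20 seconds mentioned above.

```python
# Toy model of the serial retry behaviour described above. The three-server
# list and the 5-second per-attempt timeout are assumptions based on common
# resolver defaults, not figures from the talk.
ATTEMPT_TIMEOUT = 5.0                    # seconds the resolver waits per query
SERVERS = ["ns1", "ns2", "ns3"]          # an /etc/resolv.conf-style server list

def worst_case_failure_time(servers, timeout):
    # Each server is tried in order, then the first one is retried once,
    # all serially -- so a total outage costs the full sum of the timeouts.
    attempts = servers + [servers[0]]
    return len(attempts) * timeout

print(worst_case_failure_time(SERVERS, ATTEMPT_TIMEOUT))   # -> 20.0 seconds
```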

When you add a dual stack with both IPv4 and IPv6 into the mix, different operating systems behave differently. Everywhere except on the Mac, IPv6 is preferred: Windows requests the IPv6 record first and asks for the IPv4 record only if the IPv6 lookup fails. The Mac at least requests both in parallel and uses whichever it thinks will be fastest. On the other hand, that doesn't exactly encourage people to adopt IPv6, does it?
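
For the curious, here's a minimal sketch of that parallel approach (roughly the Mac-style behavior as I understood it, not anyone's actual implementation), using getaddrinfo to ask for the IPv4 and IPv6 records at the same time and taking whichever answers first. The hostname and port are placeholders.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed

def lookup(host, family):
    # An explicit family makes getaddrinfo roughly "give me the A record"
    # (AF_INET) or "give me the AAAA record" (AF_INET6).
    return family, socket.getaddrinfo(host, 443, family, socket.SOCK_STREAM)

def parallel_lookup(host):
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(lookup, host, family)
               for family in (socket.AF_INET, socket.AF_INET6)]
    try:
        for future in as_completed(futures):
            try:
                return future.result()       # first family to answer wins
            except socket.gaierror:
                continue                     # that family failed; wait for the other
        raise socket.gaierror("no A or AAAA records for " + host)
    finally:
        pool.shutdown(wait=False)            # don't block on the slower lookup

family, addresses = parallel_lookup("example.com")
print(family, addresses[0][4])               # sockaddr from the winning family
```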

Round-robin DNS adds a whole new set of problems once the connection is being established. Figuring out which host to use is tricky, since any of them may be down, and failing over to the next option can take a very long time.
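
As a rough illustration of what a client has to do to cope, here's a sketch that walks the address list from a round-robin answer with a deliberately short connect timeout. The 2-second value is my own assumption; the OS default connect timeout can be far longer, which is exactly where the painfully slow failover comes from.

```python
import socket

def connect_with_failover(host, port, per_host_timeout=2.0):
    # Walk the address list from a round-robin DNS answer and connect to the
    # first host that responds. The 2-second per-host timeout is an assumption,
    # chosen to fail fast instead of waiting out the OS default.
    last_error = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(per_host_timeout)
        try:
            sock.connect(sockaddr)
            sock.settimeout(None)            # back to blocking mode for the caller
            return sock                      # first host that answers wins
        except OSError as exc:
            last_error = exc
            sock.close()                     # dead host; move on to the next record
    raise last_error or OSError("no usable addresses for " + host)
```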

The short takeaway is that DNS resolution is not a trivial part of the request. If you're going for sub-1-second latencies, this can eat a significant portion of your perceived latency before you even know what the user wants.

#VelocityConf: Bits on the Wire

Talk link

This was a really cool talk that walked down the network stack to really show how abstractions make our work possible.

The first thing to understand is that there are tons of abstractions that we rely on every day. Unfortunately, abstractions are rarely perfect: they leak concepts from their implementations into the interfaces they expose. This is a common problem in most software. The other problem is that abstractions, by adding more layers of code, are typically slower and less efficient.

He also made the statement that abstractions create problems you can't see. I don't completely agree with that. This is a rather long argument, so I may write that post later tonight.

Now, when you start dealing with abstractions and distributed computing, it is very important to understand how that changes the problem. One of the teams near me used to have a sign on their door with SOA fallacies. They included things like: "The network is reliable." He recommended a paper that I haven't read called "A Note on Distributed Computing". I will have to look that up. In any case, the general upshot is that the partial failure modes and concurrency in distributed systems can cause headaches for anyone used to a localized world. I cringe to remember the days of working with WebSphere and getting the "success: maybe" log messages.

HTTP

As we've learned, especially when trying to write a REST service, HTTP is hard. It's a really complex spec that was apparently rushed in the first place. The new version of the HTTP/1.1 spec is supposed to be better constructed, split into six parts rather than a single 170-something-page document.

From the wire perspective, the key to HTTP is the headers. With headers, I have my second new tool of the conference: REDbot. This thing is cool. It will tell you everything you're doing wrong with your headers. I think we need one of these at my job. There is also htracr, which will show you the actual network impact of your HTTP requests. As a colleague reminded me this morning, the waterfalls produced in most of our tools show how the browser perceives things, not necessarily how the network behaves.
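
REDbot does the real analysis, but even just dumping the handful of cache- and connection-related headers is a useful habit. A quick standard-library sketch; the URL and the header list are only examples:

```python
from urllib.request import urlopen

# Fetch a page and print the headers that matter most for caching and
# connection reuse. The URL and the header names are just examples.
with urlopen("https://example.com/") as response:
    for name in ("Cache-Control", "Expires", "ETag", "Last-Modified",
                 "Content-Encoding", "Connection"):
        print("%s: %s" % (name, response.headers.get(name, "<not set>")))
```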

Problems with HTTP

Other than its paralyzing complexity, HTTP has problems created by some of its "solutions". The best example is pipelining. Pipelining has been part of the spec for a long time, but no one uses it because the client cannot know the best way to use the connection. That makes it absurd for the spec to put the client in charge of directing the pipelining, which is exactly what it requires.
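
For anyone who hasn't seen it on the wire, pipelining just means writing several requests onto one connection before reading any responses, which then have to come back in order. A bare-bones sketch against a placeholder host (real clients avoid this precisely because they can't predict how servers and intermediaries will handle it):

```python
import socket

# Two requests written back to back on one connection, responses read in
# order -- that is all pipelining is. The host and paths are placeholders.
HOST = "example.com"
requests = (
    "GET / HTTP/1.1\r\nHost: {h}\r\n\r\n"
    "GET /index.html HTTP/1.1\r\nHost: {h}\r\nConnection: close\r\n\r\n"
).format(h=HOST)

sock = socket.create_connection((HOST, 80))
sock.sendall(requests.encode("ascii"))

chunks = []
while True:
    data = sock.recv(4096)
    if not data:
        break
    chunks.append(data)
sock.close()
print(b"".join(chunks)[:300])   # both responses arrive serially on this one socket
```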

Intermediaries

Obviously, there are good and bad intermediaries when dealing with HTTP. The good ones are proxies and gateways. These are configured by the client or the server for very specific reasons understood by the party configuring them.

On the other hand, the bad ones are prevalent and occasionally malicious. They include virus scanners and content modification systems. The fact that there are prevalent specs around content modification scares me. Now I'm really starting to think everything should be done over TLS.

DNS

He also talked at some length about DNS, which was interesting but not necessarily good material. The cool things I found were a tool called dig, alternative DNS services (OpenDNS, Google, and Comodo), and how terrible our ISP routers and modems are. I also learned how easy it is to spoof DNS responses and requests, both of which can be used for very malicious attacks.
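
Pointing queries at one of those alternative resolvers is easy enough to script, too. A small sketch assuming the third-party dnspython package (version 2.x, where the method is resolve()) and using Google's 8.8.8.8 purely as an example:

```python
import dns.resolver   # third-party "dnspython" package, assumed installed

# Query a specific public resolver directly, much like `dig @8.8.8.8`,
# instead of whatever the ISP router hands out over DHCP.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["8.8.8.8"]           # Google Public DNS, as an example
answer = resolver.resolve("example.com", "A")
for record in answer:
    print(record.address)
```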

TCP and IP

The lower-level stuff was all interesting, but I think the slides are pretty good on those. The one thing that I found really interesting was around the congestion windows for TCP. The fact that packet loss is used as the signal that a network is congested must be a nightmare for mobile devices. Even wifi must suffer from this at times. It seems like we need a better mechanism for determining when the network is congested.
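
To make the congestion-window point concrete, here's a toy additive-increase/multiplicative-decrease loop. The window sizes, threshold, and "loss" rounds are made up purely for illustration and aren't meant to model any real TCP stack.

```python
# A toy additive-increase / multiplicative-decrease loop showing why treating
# packet loss as the congestion signal hurts on lossy links. The window sizes,
# threshold, and "loss" rounds are made up purely for illustration.
cwnd = 1.0          # congestion window, in segments
ssthresh = 32.0     # slow-start threshold

for rnd in range(1, 21):
    radio_hiccup = rnd in (8, 15)      # a dropped packet that isn't congestion
    if radio_hiccup:
        ssthresh = cwnd / 2            # TCP assumes the network is full...
        cwnd = max(ssthresh, 1.0)      # ...and backs off anyway
    elif cwnd < ssthresh:
        cwnd *= 2                      # slow start: exponential growth
    else:
        cwnd += 1                      # congestion avoidance: linear growth
    print("round %2d: cwnd = %5.1f" % (rnd, cwnd))
```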

#VelocityConf: RUM for Breakfast - Distilling Insight from Noise

This talk didn't have a lot of new information for me. The most interesting thing in here was actually about how Facebook is trying to map the world of DNS resolvers.

The strategy is really interesting. They randomly inject JavaScript into a page to get a test sample. The JavaScript then hits a randomly-generated domain to download an image. The domain looks something like asdfasgad.dns.facebook.com. By doing this, they randomly assign a datacenter to the DNS entry requested while recording the datacenter used and the IP of the resolver.

When the image is downloaded, the server records the IP of the user as well as the latency. All of this information (user IP, resolver IP, datacenter, and latency) is put into a database. From there, you can start to see which resolvers serve the most traffic and which datacenters are fastest for those resolvers. Obviously, you need an inordinate amount of traffic to get good data here (which Facebook has). The mapping technique is pretty cool, though.
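
Just to capture the shape of the technique for myself, here's a rough sketch. The hostname format (with dns.example.com standing in for the real domain), the field names, and the in-memory "database" are all my own assumptions about what such a pipeline might look like, not Facebook's actual implementation.

```python
import random
import string
import time

# Rough sketch of the measurement loop described above. The hostname format,
# field names, and in-memory "database" are assumptions for illustration only.
def probe_hostname(base="dns.example.com", length=9):
    label = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
    return "%s.%s" % (label, base)       # e.g. "asdfasgad.dns.example.com"

def record_sample(samples, user_ip, resolver_ip, datacenter, started_at):
    # Server side: the image request reveals the user IP, which resolver asked
    # for the random name, which datacenter it was mapped to, and the latency.
    samples.append({
        "user_ip": user_ip,
        "resolver_ip": resolver_ip,
        "datacenter": datacenter,
        "latency_ms": (time.time() - started_at) * 1000.0,
    })

samples = []
print(probe_hostname())   # the injected JavaScript would fetch an image from this name
```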

The rest of the talk focused on how gathering RUM data is very noisy. Many of these ideas come straight from basic statistics (e.g. interquartile-range (IQR) filtering), so I won't go into them here.
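
For reference, the interquartile-range rule is only a few lines. The sample latencies and the usual 1.5 multiplier below are just illustrative (needs Python 3.8+ for statistics.quantiles).

```python
import statistics

def iqr_filter(values, k=1.5):
    # Keep only the points inside [Q1 - k*IQR, Q3 + k*IQR], the textbook
    # interquartile-range rule with the usual k = 1.5 multiplier.
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]

# Example: page-load samples in milliseconds with a couple of junk readings.
samples = [180, 210, 195, 220, 4000, 205, 190, 9000, 215, 225, 230, 200]
print(iqr_filter(samples))   # the 4000ms and 9000ms points fall outside the fence
```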