drplokta: (Default)
[personal profile] drplokta
Warning: the following is excessively technical, and is intended more for the sake of the next poor sod who types "vista dns round robin resolution" into Google than it is for my actual friends list. (Except for a few of you. And you know who you are.) Also, since I want this to be searchable on Google, I can't friends-lock it, so I'm not going to mention who I work for; please don't do so in comments, which are screened for that reason.

So, our website is run out of three separate hosting centres, each of which has its own statically routed IP block (a couple of /27s and a /26). Site 1's block is under 213.x.x.x, site 2 under 146.x.x.x and site 3 under 80.x.x.x. We use F5's Global Traffic Manager (GTM) on BigIP hardware to spread the user load (up to 22 million pages per day) across the three sites (and also Local Traffic Manager to load balance across individual web servers within each site). We cache some user data in the web application, so we use cookies (with an eight hour lifetime) to identify which server you're currently on, and we send you back to that server where possible, even if it's at a different site (we have high-bandwidth private links between the sites).
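That cookie-based stickiness can be sketched in a few lines of Python -- the server names, cookie fields and fallback rule are all invented for illustration, since the real logic lives in the F5 configuration:

```python
import time

SESSION_TTL = 8 * 3600  # the eight-hour cookie lifetime described above

def choose_server(cookie, servers):
    """Send a returning user back to the server named in their cookie
    while it is still fresh; otherwise fall through to the balancer's
    normal choice (represented here by the head of the list)."""
    if cookie and time.time() - cookie["issued"] < SESSION_TTL:
        if cookie["server"] in servers:  # that server is still in rotation
            return cookie["server"]
    return servers[0]
```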

To improve our resilience, GTM is configured to return two A records for each DNS request for hostname www.<domain>.co.uk, with equal weighting between the three sites. This is to enable some web browsers to fail over more quickly to another site if one site fails for some reason. We don't try to do anything clever like sending the user to the site that will give them the quickest response, since most of our users are in the UK and geography's not much of an issue.
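A rough model of what the GTM hands out per query -- the virtual IPs here are made up, since the real blocks are only identified by first octet above:

```python
import random

# made-up virtual IPs standing in for the three sites' address blocks
A_RECORDS = ["213.0.113.10", "146.0.2.10", "80.0.2.10"]

def answer_query():
    """Return two of the three site addresses with equal weighting,
    mimicking the GTM behaviour described above; a client that finds
    the first address dead can fail over to the second."""
    return random.sample(A_RECORDS, 2)
```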

For some time now, we've been noticing higher traffic at site 1 and lower traffic at site 3, and the difference has been slowly increasing. This is odd, since the load should be equally balanced between the three sites.

After several days of analysing log files, it looked like a performance problem at site 3, and to a lesser extent at site 2. We use Javascript to report the total rendering time of 10% of pages served, and those times were longer at site 3 than at site 2, which in turn were longer than at site 1. However, site 3 was slow both for users coming to an IP address at site 3 but being sent to a web server at another site, and also for users coming to an IP address at another site but being sent to a web server at site 3, which didn't seem to make any sense. Users sent between sites 1 and 2 had better performance, even though (as it happened) the private network link between sites 1 and 2 also went through site 3. The obvious conclusion was that users at site 3 were having pages load slowly enough that they loaded fewer of them, which would be bad news -- it could potentially be reducing our total traffic by several million pages per day.

Further analysis of log files then showed that the problem was only affecting Windows Vista users (30% of our total traffic). Other users showed the same performance and traffic at all three sites.

Googling for Vista network performance issues turned up a big red herring about TCP window scaling, which Vista is the first version of Windows to turn on by default, and which can cause performance issues with some routers. This was hard to use as an explanation, given that users on site 1 had good performance, while users coming to site 3 web servers through site 1, using the same router, had poor performance.

So as an experiment, we took site 3 out of the DNS pool altogether for a day. All DNS lookups now returned the addresses for sites 1 & 2. Suddenly, site 2 was just as bad as site 3 had been -- its total number of pages to Vista users went down rather than up, even though its total traffic was up by nearly 50%.

This strongly suggested that for some reason Vista was preferring site 1 to sites 2 and 3, and site 2 to site 3, when choosing an IP address from the round-robin A records presented to it. Some more Googling eventually found RFC 3484, which specifies default address selection for IPv6, part of which is back-ported to IPv4. Vista is apparently the first major client OS to implement it, specifically section 6 rule 9. That rule specifies that the selection of an address from multiple A records is no longer random; instead, the destination address which shares the longest prefix (in bits) with the source address is selected, presumably on the basis that it's in some sense "closer" in the network.
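Rule 9 boils down to a longest-common-prefix comparison on the raw address bits. A minimal sketch, with invented addresses (a real resolver applies the other section 6 rules before this one):

```python
import ipaddress

def common_prefix_bits(a, b):
    """Number of leading bits two IPv4 addresses share; XORing the
    addresses clears exactly the common prefix."""
    x = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - x.bit_length()

def rule9_pick(source, candidates):
    """RFC 3484 section 6 rule 9: of the returned A records, prefer
    the destination sharing the longest prefix with the source."""
    return max(candidates, key=lambda d: common_prefix_bits(source, d))
```

Given a source address of 192.168.1.100, this picks a 213.x.x.x candidate over an 80.x.x.x one every single time -- there is no randomness left.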

Now, this may well make sense in IPv6 (I don't know enough about it to comment), but it's an insane algorithm to use in IPv4. First, the Internet is not laid out that way. As any comic artist can tell you, Europe does have a nice block from 80.0.0.0 to 91.255.255.255, but it also has chunks from 193-195 and 212-213, plus there's lots of geographically random stuff between 128 and 172.

But second, and more important, very few Windows client PCs actually have public IP addresses. If you're behind a NAT gateway, the DNS client in your Windows PC doesn't know the IP address you're using on the Internet, just the local network address you're using in one of the ranges specified by RFC 1918. Now, in theory, that could be in 10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16, but in practice nearly all home routers allocate addresses in the 192.168 range. As it happens, that shares three prefix bits with our site 1 address, one bit with our site 2 address and none with our site 3 address, so any Vista PC on a home network will always prefer site 1 over sites 2 or 3, and site 2 over site 3. This explains the difference in traffic volumes. A user with a slow and dodgy connection may have pages time out, at which point their browser sends them to another IP address, so those users who have inherently worse performance are much more likely to find their way to site 3. Also, the few remaining dialup users actually have public IP addresses, which may well be in the European range from 80.0.0.0 to 91.255.255.255, which shares the most prefix bits with site 3 and thus is more likely to go to site 3. These factors explain the poor performance we saw at site 3.
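For these three blocks the first octet alone settles the comparison, so the ordering a home Vista machine arrives at can be checked directly (sites reduced to their first octets, per the blocks described above):

```python
def common_prefix_bits8(a, b):
    """Leading bits shared by two first octets; enough to rank
    these three blocks against each other."""
    return 8 - (a ^ b).bit_length()

sites = {"site 1": 213, "site 2": 146, "site 3": 80}
# rank the sites the way rule 9 would for a 192.168.x.x source
order = sorted(sites, key=lambda s: common_prefix_bits8(192, sites[s]),
               reverse=True)
print(order)  # ['site 1', 'site 2', 'site 3']
```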

So we're going to have to take a slight hit to our resilience and reduce the number of A records we return for a DNS lookup from two to one. This will affect other large multi-site websites as well -- for example, www.google.com returns three IP addresses in different ranges. And Microsoft have broken the Internet. Again. Although, to be fair, they did have some help this time from the IETF.

(I found this from a discussion on the Debian mailing list about the implementation of RFC3484 in glibc in Debian Etch. They eventually backed it out and only used section 6 rule 9 for destination addresses on the same subnet, which seems like a much better way to do it.)
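The Debian compromise -- keep rule 9 only for destinations on the local subnet, otherwise preserve the randomised round-robin order -- might be sketched like this (illustrative only, not glibc's actual code):

```python
import ipaddress
import random

def pick_destination(source, netmask, candidates):
    """Prefer an on-link candidate, where longest-prefix matching is
    genuinely meaningful; for everything else, fall back to the
    traditional random choice from the round-robin set."""
    net = ipaddress.IPv4Network(f"{source}/{netmask}", strict=False)
    local = [c for c in candidates if ipaddress.IPv4Address(c) in net]
    if local:
        return local[0]
    return random.choice(candidates)
```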

(no subject)

Date: 2009-03-04 12:08 pm (UTC)
andrewducker: (Default)
From: [personal profile] andrewducker
So Vista sucks more because it implemented a standard properly.

Oh, the hilarity.

(no subject)

Date: 2009-03-04 12:26 pm (UTC)
ext_73228: Headshot of Geri Sullivan, cropped from Ultraman Hugo pix (Default)
From: [identity profile] gerisullivan.livejournal.com
Interesting...in all senses of the word, alas. Also, you've written it up so that even this non-technical person could follow it right up until one term in the last paragraph. I had to Google "glibc in Debian Etch." Thank you for explaining it so clearly.

(no subject)

Date: 2009-03-04 12:38 pm (UTC)
From: [identity profile] ffutures.livejournal.com
Unless I'm misunderstanding this badly, wouldn't it make more sense under most circumstances to pick randomly?

(no subject)

Date: 2009-03-04 12:42 pm (UTC)
ext_267: Photo of DougS, who has a round face with thinning hair and a short beard (Elections)
From: [identity profile] dougs.livejournal.com
Here's a vote from one member of an under-represented minority -- at the moment my home IP address is 172.20.51.xx.

And the default for users of Microsoft Small Business Server -- not numerous, but impossible to ignore -- is 10.0.0.xx.

(no subject)

Date: 2009-03-04 09:53 pm (UTC)
From: [identity profile] jib.myopenid.com (from livejournal.com)
My previous ADSL modem (D-Link DSL-G604T) used 10.0.0.xx by default.

(no subject)

Date: 2009-03-04 01:58 pm (UTC)
fanf: (Default)
From: [personal profile] fanf
Oh god. RFC 3484 is a disaster. Bloody architecture astronauts.

(no subject)

Date: 2009-03-04 02:46 pm (UTC)
From: [identity profile] nojay.livejournal.com
This sounds like the sort of problem that only appears when new software and its underlying algorithms actually hits the road and bounces. I can't envisage this effect being discovered in lab tests or even in beta releases since there wouldn't be enough incidents in a given time period to make the effect visible in the statistical noise.

(no subject)

Date: 2009-03-04 08:23 pm (UTC)
fanf: (Default)
From: [personal profile] fanf
This RFC 3484 algorithm caused problems for Debian in December 2007. It should never have been approved, but sadly even some DNS experts think that 15 years of operational dependence on randomized RRset ordering doesn't matter.

(no subject)

Date: 2009-03-04 03:40 pm (UTC)
vampwillow: (Default)
From: [personal profile] vampwillow
Actually, it does make quite a bit of sense on IPv6, because the theory would hold that there is a semi-hierarchical allocation logic. It is a totally crap idea for IPv4 though ;-)

(no subject)

Date: 2009-03-04 08:25 pm (UTC)
fanf: (Default)
From: [personal profile] fanf
IPv6 address allocation is no longer hierarchical. The idea of "easy renumbering" turned out not to be possible in practice, so portable address space became mandatory (i.e. PI allocation), which isn't compatible with topological/hierarchical allocation.

(no subject)

Date: 2009-03-04 06:09 pm (UTC)
ckd: (cpu)
From: [personal profile] ckd
There's a side effect of NAT that interacts badly with other Internet architectural decisions? How utterly precedented.

Nothing to do with NAT

Date: 2009-03-05 12:47 am (UTC)
From: (Anonymous)
It doesn't matter if you have NAT or not. Even if you are using Real IPs, you could be connecting from anywhere. If IBM is using their 9.0.0.0/8 network, they are all going to use the same "round-robin" A record.

http://xkcd.com/195/

(no subject)

Date: 2009-03-04 07:58 pm (UTC)
From: [identity profile] sethb.livejournal.com
Can't you return one address to Vista inquiries and two to sane OS inquiries? That way, only Vista loses the resilience. (And even for Vista, you could return two addresses if the first was already going to be Site 1.)

Anycast

Date: 2009-03-04 08:13 pm (UTC)
From: (Anonymous)
Sounds like you could benefit from an Anycast-based solution rather than round-robin DNS.

Let us know what you see from Windows 7!

Date: 2009-03-04 08:37 pm (UTC)
From: (Anonymous)
I'm very curious to see how many problems like this get addressed by Windows 7. My feelings are "not many". I think they're just throwing lipstick on a pig so they can get to a new OS name ASAP.

(no subject)

Date: 2009-03-04 10:35 pm (UTC)
From: (Anonymous)
Use SRV Records instead of A records.

Good Luck
Marcos Eliziário Santos

Anycast

Date: 2009-03-04 11:10 pm (UTC)
From: (Anonymous)
Can't you just use Anycast?

It'd do a better job. I believe much of the DNS infrastructure uses this in some fashion?

(no subject)

Date: 2009-03-04 11:24 pm (UTC)
From: [identity profile] joel.livejournal.com
Fantastic summary. Thanks for posting this online!

(no subject)

Date: 2009-03-05 01:03 am (UTC)
From: [identity profile] dray.livejournal.com
Interesting observations. I have always considered DNS round-robin to be 'advisory only' at best for load sharing purposes, and 'not a good idea' at worst, given that the DNS itself has no concept of things like application-layer activity on a given server, etc.

It's still certainly a good idea for resiliency purposes, in the sense that if one of the hosts is completely gone or unreachable, the other host(s) will be used. If you want truly balanced load sharing, it seems to me that you have to enforce it yourself through the use of a proper load balancing system of some sort.

Cheers,
-dre

Round-robin real world use

Date: 2009-03-05 01:35 am (UTC)
From: (Anonymous)
I'm curious how many browsers actually use that round-robin failover feature, and how often it actually happens, and when it happens what is the client experience like? I see your mention of "take a slight hit to resilience" and I'm curious what the data shows there.

Bit matching is lame

Date: 2009-03-05 05:50 am (UTC)
From: (Anonymous)
The whole logic of bit matching and choosing a destination based on number of prefix bits matched is pretty lame. It will not work for machines behind NAT.

What is surprising is, this problem with Vista hasn't been reported as extensively as this before.

BTW, you could do something like this: when usage on machine 1 reaches 75% of capacity, redirect new traffic to machine 2 or 3, and reserve the remaining 25% for traffic redirected from machines 2 and 3.

In other words, once machine 1 exceeds 75% of its capacity, it serves only traffic redirected from machine 2 or 3.

Machines 2 and 3 redirect only when they can't serve the traffic themselves. I'm not familiar with how GTM works -- is there a way to do something like this?

Thanks
Saswat Praharaj

(no subject)

Date: 2009-03-05 07:06 am (UTC)
From: [identity profile] bifrosty2k.livejournal.com
OK, I know I am perhaps being overly simplistic -- round-robin DNS is pointless to implement, especially if you're using a GSLB. It actually makes your reliability less than if you were just spewing out one A record with DNS-based GSLB.

Think of it as RAID-0 for your servers - it makes some things better, but it also increases your chances for failure.

(no subject)

Date: 2009-03-05 07:19 am (UTC)
From: [identity profile] alexmc.livejournal.com
Thanks for cheering me up; this is quite funny.

(But I won't be slagging Microsoft off for this since, as you say, they were just following the standard.)

thanks for letting is know

Date: 2009-03-05 12:21 pm (UTC)
From: [identity profile] zaphodb.pip.verisignlabs.com (from livejournal.com)
I found out about your blog entry via Florian's mail to dns-operations:

https://lists.dns-oarc.net/pipermail/dns-operations/2009-March/003614.html

Nice work and very good to know indeed.

Zap

Two cents

Date: 2009-03-05 10:26 pm (UTC)
From: (Anonymous)
I'm not as technical as the rest of you guys, but I am an idea guy. Based on what you've said and your responses to the comments posted here, there seem to be few options, if any, for your problem. If you can't get the results you want with the way you currently operate, you will need to change something. Scrapping everything is not an option, of course, but perhaps your system needs to be altered to operate better with Vista.

Take a step back and look at it from another angle.

Please fix the typo: "RCF..."

Date: 2009-03-05 10:52 pm (UTC)
From: (Anonymous)
Yes, horrible, awful, no good, very bad, etc.

...and you're getting propagated around the intertubes with the "RCF3484" typo.

kb

(no subject)

Date: 2009-03-05 10:59 pm (UTC)
From: [identity profile] headlouse.livejournal.com
If you actually want this searchable via google post it on bloglines or some other blog. LJ entries are pretty invisible to google -- even public posts.

Least of your problems

Date: 2009-03-05 11:33 pm (UTC)
From: (Anonymous)
GTM isn't that resilient in reality. I recommend my ecom clients do BGP and LTM for balance and fail-over. No other way to "be sure".

Megaproxies ignore (or invent their own) TTLs, so when a transit provider (or a transit provider's peer) goes down, all the idiots behind a megaproxy are still resolving to the A record they received from GTM when everything was fine.

ryan@hack.net

IPv6

Date: 2009-03-05 11:35 pm (UTC)
From: (Anonymous)
My understanding of IPv6 is that the address space is large enough to make hierarchical address assignment practical, so choosing the address matching the most high-order bits makes sense there. My guess is that someone thought it would be simplest to use the same algorithm for IPv4 as for IPv6, so the algorithm best suited to IPv6 was implemented.