Issue
In both SLC4 and SLC5, getaddrinfo() in glibc sorts multiple IP addresses returned by DNS according to the algorithm defined in RFC 3484. The implementation is explained by Ulrich Drepper.
The impact at CERN is that while DNS round-robin load-balancing nicely distributes the returned addresses by rotating them, glibc sorts the returned addresses by longest matching prefix with the client address before getaddrinfo() returns them to the application. This is a Good Thing™ in an IPv6 environment, but it violates the principle of least surprise on IPv4 systems: applications will prefer one address over the others, depending on the source address.
Impact: applications will not connect() uniformly to the addresses behind a DNS alias; the distribution is skewed. Currently this affects all applications on SLC4/SLC5 that use getaddrinfo() to resolve DNS names, and it cannot be switched off. The only known workaround is for the application to ignore the order of the returned addresses and pick one at random, instead of always taking the first one and relying on the rotation by the DNS server.
This issue does not affect applications using gethostbyname().
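The workaround mentioned above (ignore the returned order and pick an address at random) could be sketched in Python roughly as follows. This is only an illustration, not code from any CERN application: the helper names pick_address and connect_random are hypothetical, and restricting the lookup to AF_INET is an assumption made to match the IPv4 case discussed here.

```python
import random
import socket


def pick_address(addrinfos, rng=random):
    """Return one sockaddr chosen uniformly from getaddrinfo()-style results.

    `addrinfos` is a list of (family, type, proto, canonname, sockaddr)
    tuples, as returned by socket.getaddrinfo(). The order of the list is
    deliberately ignored.
    """
    # De-duplicate: the same address may be listed once per socket type.
    unique = sorted({info[4] for info in addrinfos})
    if not unique:
        raise ValueError("no addresses to choose from")
    return rng.choice(unique)


def connect_random(host, port):
    """Resolve `host` and connect to a randomly chosen address (hypothetical helper)."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    return socket.create_connection(pick_address(infos)[:2])
```

Note that this sketch does not fall back to another address if the chosen one is unreachable; a real client would probably want to shuffle the list and try each entry in turn.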
Example
lxplus257:~% for i in $( seq 100 ); do host lxplus.cern.ch | sed -n '/address/{p;q}'; done | sort | uniq -c | sort -n
12 lx64slc4.cern.ch has address 137.138.141.148
12 lx64slc4.cern.ch has address 137.138.141.149
12 lx64slc4.cern.ch has address 137.138.141.156
12 lx64slc4.cern.ch has address 137.138.4.19
13 lx64slc4.cern.ch has address 137.138.5.220
13 lx64slc4.cern.ch has address 137.138.5.223
13 lx64slc4.cern.ch has address 137.138.5.233
13 lx64slc4.cern.ch has address 137.138.5.234
lxplus257:~% for i in $( seq 100 ); do ssh -q lxplus.cern.ch hostname ; done | sort | uniq -c
100 lxplus254.cern.ch
lxplus257:~%
In other words, while the DNS nicely rotates the returned addresses, ssh will always go to lxplus254 (137.138.141.156) because this address has the longest matching prefix with the source address of lxplus257 (137.138.141.158): the two share their first 30 bits.
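To see why 137.138.141.156 wins, the prefix comparison can be reproduced with a few lines of Python. This is a simplified sketch of the longest-matching-prefix step only; it is not glibc's implementation, which applies the other RFC 3484 rules first, and the function names are made up for illustration. The addresses are the ones from the example above.

```python
import ipaddress


def common_prefix_len(a, b):
    """Number of leading bits that two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    # The highest set bit of the XOR marks the first differing bit.
    return 32 - diff.bit_length()


def sort_by_prefix(source, candidates):
    """Order candidates by longest matching prefix with `source`, best first."""
    return sorted(candidates,
                  key=lambda c: common_prefix_len(source, c),
                  reverse=True)


source = "137.138.141.158"  # lxplus257
candidates = [
    "137.138.141.148", "137.138.141.149", "137.138.141.156", "137.138.4.19",
    "137.138.5.220", "137.138.5.223", "137.138.5.233", "137.138.5.234",
]

for addr in sort_by_prefix(source, candidates):
    print(common_prefix_len(source, addr), addr)
# The first line printed is "30 137.138.141.156"; the other 137.138.141.x
# addresses share 28 bits and the remaining ones share 16.
```

Whatever order the DNS server returns the eight addresses in, this sort always puts 137.138.141.156 first for a client at 137.138.141.158, which matches the ssh result above.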
--
PeterKelemen - 11 Mar 2009