NSCD dies on SLC5 if used with nss_ldap, doesn't find users
(affects ZUUL, i.e. deploying PLUS accounts via LDAP - probably needs fixing before we have wide-scale deployment)
Contact:
JanVanEldik,
PeterKelemen
Symptom1: lookup failure
we see intermittent failures of passwd lookups, with (generic) messages
like:
lxbrb2001:
Feb 8 20:52:21 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
Feb 8 22:22:21 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
Feb 8 23:50:29 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
Feb 9 07:52:21 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
Feb 9 19:13:03 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
Feb 10 09:03:12 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
Note the interesting timestamps.
This happens AFAIK on all our SLC5 boxes, the rate may be activity dependent. Error message is generic, i.e. all failures will give this..
We don't have a simple, clean, reproducible test for this.
S/w versions:
[root@lxbra2701 ~]# rpm -q nss_ldap nscd
nscd-2.5-24.el5_2.2
nss_ldap-253-15.el5
Symptom2: nscd crashes / dumps core (if configured)
(core dumping - needs local tweaks, done by JvE)
Sep 23 13:14:44 lxbra2701 kernel: nscd[21845]: segfault at 00000000418b1000 rip 00002b5e935e14c4 rsp 00000000417c23c0 error 6
Sep 23 13:16:03 lxbra2701 nscd: 12574 Access Vector Cache (AVC) started
Sep 23 13:16:03 lxbra2701 nscd: 12574 invalid persistent database file "/var/db/nscd/passwd": verification failed
lxbra2701: -rw------- 1 root nscd 26894336 Oct 2 04:03 /core.28267
lxplus257: -rw------- 1 root nscd 25108480 Oct 4 04:03 /core.12522
gdb shows 1000+ deep overwritten stack, recursion? (also possible gdb bug being triggered).
gethostbyname2_r()
as top function.
Peter: (Sachin) cores are useless due to overwritten stack, should attach
gdb to running daemon. Well, that won't work because gdb(1) keeps chewing
100% CPU without progress.
Symptom3: applications bypass nscd and contact LDAP directly
shouldn't happen since will trigger eventual LDAP server overload, and indicates that something is not working between glibc and nscd:
[root@lxdev38 nss_ldap-262]# strace -f -econnect perl -e 'print getpwuid(6927);'
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
connect(3, {sa_family=AF_FILE, path="/var/run/setrans/.setrans-unix"}, 110) = 0
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
(OK until here)
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.166.118")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.166.150")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.28.239")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.140.47")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.143.166")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.166.118")}, 16) = -1 EINPROGRESS (Operation now in progress)
(bad, application should not even know about these servers)
Attaching via GDB to
nscd
at this stage gives (gdb error message ->RH?) no "unusual" backtraces from the threads.
Peter: only seems to happen as root
Moreover attaching nscd doesn't seem to work at all:
[root@refrh5es-ia32e certs]# gdb /usr/sbin/nscd $( pidof nscd )
GNU gdb Red Hat Linux (6.5-37.el5_2.2rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".
Attaching to program: /usr/sbin/nscd, process 10489
(no debugging symbols found)
Loaded symbols for /usr/sbin/nscd
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 47886080697648 (LWP 10489)]
[New Thread 1115932992 (LWP 10497)]
../../gdb/linux-nat.c:1175: internal-error: linux_ptrace_post_attach: Assertion `pid == GET_LWP (ptid)' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n)
Symptom3.b
ldap-nss.c:1319: do_init: Assertion `cfg->ldc_uris[__session.ls_current_uri] != ((void *)0)' failed.
Trigger by running
ldagetusers
on lxplus258. or
perl -e 'my @foo = (getpwuid(1)); @foo = (getpwuid(1));'
probably related to file permission on
/etc/ldap.conf
? why does the first
getpwent
one run..?
Peter: verified, chmod 0644 /etc/ldap.conf makes the assertion failure go
away. Reproduced on vanilla RHEL5.
The same problem affects the startup of the dbus-daemon (reported by Alex I)
Additional info:
- workaround: "nscd" is alarmed and should get restarted automatically.