NSCD dies on SLC5 if used with nss_ldap, doesn't find users

(affects ZUUL, i.e. deploying PLUS accounts via LDAP - probably needs fixing before we have wide-scale deployment)

Contact: JanVanEldik, PeterKelemen

Symptom1: lookup failure

we see intermittent failures of passwd lookups, with (generic) messages
like:
lxbrb2001:
  Feb  8 20:52:21 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
  Feb  8 22:22:21 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
  Feb  8 23:50:29 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
  Feb  9 07:52:21 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
  Feb  9 19:13:03 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable
  Feb 10 09:03:12 lxbrb2001 nscd: nss_ldap: could not search LDAP server - Server is unavailable

Note the interesting timestamps.

This happens AFAIK on all our SLC5 boxes, the rate may be activity dependent. Error message is generic, i.e. all failures will give this..

We don't have a simple, clean, reproducible test for this.

S/w versions:

  [root@lxbra2701 ~]# rpm -q nss_ldap nscd
  nscd-2.5-24.el5_2.2
  nss_ldap-253-15.el5

Symptom2: nscd crashes / dumps core (if configured)

(core dumping - needs local tweaks, done by JvE)

  Sep 23 13:14:44 lxbra2701 kernel: nscd[21845]: segfault at 00000000418b1000 rip 00002b5e935e14c4 rsp 00000000417c23c0 error 6
  Sep 23 13:16:03 lxbra2701 nscd: 12574 Access Vector Cache (AVC) started
  Sep 23 13:16:03 lxbra2701 nscd: 12574 invalid persistent database file "/var/db/nscd/passwd": verification failed

  lxbra2701: -rw------- 1 root nscd 26894336 Oct  2 04:03 /core.28267
  lxplus257: -rw------- 1 root nscd 25108480 Oct  4 04:03 /core.12522

gdb shows 1000+ deep overwritten stack, recursion? (also possible gdb bug being triggered). gethostbyname2_r() as top function.

Peter: (Sachin) cores are useless due to overwritten stack, should attach gdb to running daemon. Well, that won't work because gdb(1) keeps chewing 100% CPU without progress.

Symptom3: applications bypass nscd and contact LDAP directly

shouldn't happen since will trigger eventual LDAP server overload, and indicates that something is not working between glibc and nscd:

[root@lxdev38 nss_ldap-262]# strace -f -econnect perl -e 'print getpwuid(6927);'
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
connect(3, {sa_family=AF_FILE, path="/var/run/setrans/.setrans-unix"}, 110) = 0
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
(OK until here)
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.166.118")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.166.150")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.28.239")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.140.47")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.143.166")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("137.138.166.118")}, 16) = -1 EINPROGRESS (Operation now in progress)
(bad, application should not even know about these servers)

Attaching via GDB to nscd at this stage gives (gdb error message ->RH?) no "unusual" backtraces from the threads.

Peter: only seems to happen as root

Moreover attaching nscd doesn't seem to work at all:

[root@refrh5es-ia32e certs]# gdb /usr/sbin/nscd $( pidof nscd )
GNU gdb Red Hat Linux (6.5-37.el5_2.2rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".

Attaching to program: /usr/sbin/nscd, process 10489
(no debugging symbols found)
Loaded symbols for /usr/sbin/nscd
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 47886080697648 (LWP 10489)]
[New Thread 1115932992 (LWP 10497)]

../../gdb/linux-nat.c:1175: internal-error: linux_ptrace_post_attach: Assertion `pid == GET_LWP (ptid)' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) 

Symptom3.b

ldap-nss.c:1319: do_init: Assertion `cfg->ldc_uris[__session.ls_current_uri] != ((void *)0)' failed.

Trigger by running ldagetusers on lxplus258. or

perl -e 'my @foo = (getpwuid(1)); @foo = (getpwuid(1));'

probably related to file permission on /etc/ldap.conf ? why does the first getpwent one run..?

Peter: verified, chmod 0644 /etc/ldap.conf makes the assertion failure go away. Reproduced on vanilla RHEL5.

The same problem affects the startup of the dbus-daemon (reported by Alex I)

Additional info:

  • workaround: "nscd" is alarmed and should get restarted automatically.
Edit | Attach | Watch | Print version | History: r7 < r6 < r5 < r4 < r3 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r7 - 2009-02-10 - PeterKelemen
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    LinuxSupport All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright &© 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback