Project

General

Profile

Bug #92

Deadlock in gdplogd

Added by Nitesh Mor almost 4 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
gdplogd
Start date:
12/20/2016
Due date:
% Done:

0%


Description

Christopher reports issues with non-responsive gdplogd (see #91). It appears to be a deadlock, here's a backtrace from gdb (running from @(#)libgdp 0.7.2 (2016-12-11 14:25); the git hash is missing for some reason).

root@gdp-01:~# ps -ef | grep gdplogd
root     11621  5467  0 09:22 pts/2    00:00:00 grep --color=auto gdplogd
gdp      22522     1  0 Dec19 ?        00:00:00 /bin/sh /usr/sbin/gdplogd-wrapper.sh
gdp      22526 22522  0 Dec19 ?        00:00:12 /usr/sbin/gdplogd -D20
gdp      22527 22522  0 Dec19 ?        00:00:00 llogger -a /var/log/gdp/gdplogd.log
(gdb) attach 22526
Attaching to process 22526
[New LWP 22529]
[New LWP 22530]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f4d0e50798d in pthread_join (threadid=139968895817472, thread_return=thread_return@entry=0x0) at pthread_join.c:90
90      pthread_join.c: No such file or directory.
(gdb) info threads 
  Id   Target Id         Frame 
* 1    Thread 0x7f4d0f900880 (LWP 22526) "gdplogd" 0x00007f4d0e50798d in pthread_join (threadid=139968895817472, thread_return=thread_return@entry=0x0)
    at pthread_join.c:90
  2    Thread 0x7f4d0cd0a700 (LWP 22529) "gdplogd" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  3    Thread 0x7f4d0c509700 (LWP 22530) "gdplogd" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
(gdb) thread apply all bt

Thread 3 (Thread 0x7f4d0c509700 (LWP 22530)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f4d0e508e42 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x6372c0 <GclCacheMutex>) at ../nptl/pthread_mutex_lock.c:115
#2  0x0000000000422db8 in _ep_thr_mutex_lock (mtx=mtx@entry=0x6372c0 <GclCacheMutex>, file=file@entry=0x427e56 "gdp_gcl_cache.c", line=line@entry=383, 
    name=name@entry=0x427e47 "&GclCacheMutex") at ep_thr.c:232
#3  0x00000000004110ce in _gdp_gcl_cache_reclaim (maxage=300) at gdp_gcl_cache.c:383
#4  0x000000000040ac01 in gcl_reclaim_resources () at logd_gcl.c:203
#5  0x000000000040584b in gdpd_reclaim_resources (fd=<optimized out>, what=<optimized out>, ctx=<optimized out>) at logd.c:82
#6  0x00007f4d0f53aa0c in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#7  0x0000000000414efb in run_event_loop (eli_=0x19c7ae0) at gdp_main.c:565
#8  0x00007f4d0e5066ba in start_thread (arg=0x7f4d0c509700) at pthread_create.c:333
#9  0x00007f4d0e23c82d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7f4d0cd0a700 (LWP 22529)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f4d0e508dbd in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f4cfc07d1e0) at ../nptl/pthread_mutex_lock.c:80
#2  0x0000000000422db8 in _ep_thr_mutex_lock (mtx=mtx@entry=0x7f4cfc07d1e0, file=file@entry=0x427e56 "gdp_gcl_cache.c", line=line@entry=457, 
    name=name@entry=0x427450 "&gcl->mutex") at ep_thr.c:232
#3  0x00000000004112f9 in _gdp_gcl_lock (gcl=gcl@entry=0x7f4cfc07d1e0) at gdp_gcl_cache.c:457
#4  0x0000000000411746 in _gdp_gcl_cache_get (
    gcl_name=gcl_name@entry=0x7f4d0cd09bb0 "J\232H\224\323V\210\252\251a\330\370KP\367\066\302G\363te\274\225F\214t\341\354\017M\032\021/var/swarm/gdp/gcls/_4a", mode=mode@entry=GDP_MODE_ANY) at gdp_gcl_cache.c:208
#5  0x000000000040630d in post_one_log (
    gdpname=0x7f4d0cd09bb0 "J\232H\224\323V\210\252\251a\330\370KP\367\066\302G\363te\274\225F\214t\341\354\017M\032\021/var/swarm/gdp/gcls/_4a", 
    ctx=<optimized out>) at logd_admin.c:296
#6  0x0000000000407112 in disk_foreach (func=0x4062d7 <post_one_log>, ctx=0x0) at logd_disklog.c:2380
#7  0x0000000000405d2b in admin_probe_thread (ctx=<optimized out>) at logd_admin.c:336
#8  0x00000000004238fa in worker_thread (a=<optimized out>) at ep_thr_pool.c:145
#9  0x00007f4d0e5066ba in start_thread (arg=0x7f4d0cd0a700) at pthread_create.c:333
#10 0x00007f4d0e23c82d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7f4d0f900880 (LWP 22526)):
#0  0x00007f4d0e50798d in pthread_join (threadid=139968895817472, thread_return=thread_return@entry=0x0) at pthread_join.c:90
#1  0x0000000000405cd0 in main (argc=<optimized out>, argv=<optimized out>) at logd.c:447
(gdb) 

History

#1 Updated by Nitesh Mor almost 4 years ago

  • Blocks Bug #91: edu.berkeley.eecs.gdp-01.gdplog down or hung? gdp create hanging under Node with Ubuntu, RHEL and Mac OS X added

#2 Updated by Nitesh Mor almost 4 years ago

Trace from another stuck instance

(gdb) thread apply all bt

Thread 8 (Thread 0x7fae9effd700 (LWP 14029)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007faeafe54e42 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x6372c0 <GclCacheMutex>) at ../nptl/pthread_mutex_lock.c:115
#2  0x0000000000422db8 in _ep_thr_mutex_lock (mtx=mtx@entry=0x6372c0 <GclCacheMutex>, file=file@entry=0x427e56 "gdp_gcl_cache.c", line=line@entry=202, name=name@entry=0x427e47 "&GclCacheMutex") at ep_thr.c:232
#3  0x000000000041171e in _gdp_gcl_cache_get (gcl_name=gcl_name@entry=0x7faea800594d "\242\267\210\375{~Y\375ʣ\256\266\061n\251'Os\215\366\363\024\016\362\210\343\315\364ő\205t\r\213\221\362,_\261\235\022\342'g\364 \273\216\335k\347S\266\r\252\"v\005eܳ\001\273", <incomplete sequence \307>, mode=mode@entry=0) at gdp_gcl_cache.c:202
#4  0x0000000000414998 in gdp_pdu_proc_cmd (pdu_=0x7faea8005930) at gdp_main.c:146
#5  0x00000000004238fa in worker_thread (a=<optimized out>) at ep_thr_pool.c:145
#6  0x00007faeafe526ba in start_thread (arg=0x7fae9effd700) at pthread_create.c:333
#7  0x00007faeafb8882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 7 (Thread 0x7fae9f7fe700 (LWP 14028)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007faeafe54e42 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x6372c0 <GclCacheMutex>) at ../nptl/pthread_mutex_lock.c:115
#2  0x0000000000422db8 in _ep_thr_mutex_lock (mtx=mtx@entry=0x6372c0 <GclCacheMutex>, file=file@entry=0x427e56 "gdp_gcl_cache.c", line=line@entry=202, name=name@entry=0x427e47 "&GclCacheMutex") at ep_thr.c:232
#3  0x000000000041171e in _gdp_gcl_cache_get (gcl_name=gcl_name@entry=0x7faea8001fcd "W\246P2\262\256@\231\365\371\363\274Qy\250\021\262\354\340\354\310JJ\276)\027VG\017\006\336\035\061\027\264/\202\033\n\215ڹ\f\a\321I\276\226&-5\032\034\026\316L\320\366\241\321vf", <incomplete sequence \350>, mode=mode@entry=0) at gdp_gcl_cache.c:202
#4  0x0000000000414998 in gdp_pdu_proc_cmd (pdu_=0x7faea8001fb0) at gdp_main.c:146
#5  0x00000000004238fa in worker_thread (a=<optimized out>) at ep_thr_pool.c:145
#6  0x00007faeafe526ba in start_thread (arg=0x7fae9f7fe700) at pthread_create.c:333
#7  0x00007faeafb8882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 6 (Thread 0x7fae9ffff700 (LWP 14027)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007faeafe54e42 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x6372c0 <GclCacheMutex>) at ../nptl/pthread_mutex_lock.c:115
#2  0x0000000000422db8 in _ep_thr_mutex_lock (mtx=mtx@entry=0x6372c0 <GclCacheMutex>, file=file@entry=0x427e56 "gdp_gcl_cache.c", line=line@entry=202, name=name@entry=0x427e47 "&GclCacheMutex") at ep_thr.c:232
#3  0x000000000041171e in _gdp_gcl_cache_get (gcl_name=gcl_name@entry=0x7faea8001d9d "\341\023\346\070A\033\242\350\070\352\261\061jU\374\027\321\335\030\240}h~d\003\070S}sV\340\066\a1\253a\324\316NLIXvr\260}\206", mode=mode@entry=0) at gdp_gcl_cache.c:202
#4  0x0000000000414998 in gdp_pdu_proc_cmd (pdu_=0x7faea8001d80) at gdp_main.c:146
#5  0x00000000004238fa in worker_thread (a=<optimized out>) at ep_thr_pool.c:145
#6  0x00007faeafe526ba in start_thread (arg=0x7fae9ffff700) at pthread_create.c:333
#7  0x00007faeafb8882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7faeace53700 (LWP 12377)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007faeafe54dbd in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7faea00ba720) at ../nptl/pthread_mutex_lock.c:80
#2  0x0000000000422db8 in _ep_thr_mutex_lock (mtx=mtx@entry=0x7faea00ba720, file=file@entry=0x427e56 "gdp_gcl_cache.c", line=line@entry=457, name=name@entry=0x427450 "&gcl->mutex") at ep_thr.c:232
#3  0x00000000004112f9 in _gdp_gcl_lock (gcl=gcl@entry=0x7faea00ba720) at gdp_gcl_cache.c:457
#4  0x0000000000411746 in _gdp_gcl_cache_get (gcl_name=gcl_name@entry=0x7faeace52bb0 "\372p\263'{Y\250?\341\305 A\362\n\376`\225\251\325\355ZE\336YvN\255\306\314\036\264\277/var/swarm/gdp/gcls/_fa", mode=mode@entry=GDP_MODE_ANY) at gdp_gcl_cache.c:208
#5  0x000000000040630d in post_one_log (gdpname=0x7faeace52bb0 "\372p\263'{Y\250?\341\305 A\362\n\376`\225\251\325\355ZE\336YvN\255\306\314\036\264\277/var/swarm/gdp/gcls/_fa", ctx=<optimized out>) at logd_admin.c:296
#6  0x0000000000407112 in disk_foreach (func=0x4062d7 <post_one_log>, ctx=0x0) at logd_disklog.c:2380
#7  0x0000000000405d2b in admin_probe_thread (ctx=<optimized out>) at logd_admin.c:336
#8  0x00000000004238fa in worker_thread (a=<optimized out>) at ep_thr_pool.c:145
#9  0x00007faeafe526ba in start_thread (arg=0x7faeace53700) at pthread_create.c:333
#10 0x00007faeafb8882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7faead654700 (LWP 12376)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007faeafe54e42 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x6372c0 <GclCacheMutex>) at ../nptl/pthread_mutex_lock.c:115
#2  0x0000000000422db8 in _ep_thr_mutex_lock (mtx=mtx@entry=0x6372c0 <GclCacheMutex>, file=file@entry=0x427e56 "gdp_gcl_cache.c", line=line@entry=202, name=name@entry=0x427e47 "&GclCacheMutex") at ep_thr.c:232
#3  0x000000000041171e in _gdp_gcl_cache_get (gcl_name=gcl_name@entry=0x7fae9810c1ad "\307\033\345\307UC\205\r|h-!\033bE/\232\327sc\b\270\345\333'd\334(\315\371w\251P:\330k\021\266\247J\251\306\371Ǜ3)\224\257\347\a\270\345\324\346\031\255\214'Q\246", <incomplete sequence \345\246>, mode=mode@entry=0) at gdp_gcl_cache.c:202
#4  0x0000000000414998 in gdp_pdu_proc_cmd (pdu_=0x7fae9810c190) at gdp_main.c:146
#5  0x00000000004238fa in worker_thread (a=<optimized out>) at ep_thr_pool.c:145
#6  0x00007faeafe526ba in start_thread (arg=0x7faead654700) at pthread_create.c:333
#7  0x00007faeafb8882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7faeade55700 (LWP 27640)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007faeafe54e42 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x6372c0 <GclCacheMutex>) at ../nptl/pthread_mutex_lock.c:115
#2  0x0000000000422db8 in _ep_thr_mutex_lock (mtx=mtx@entry=0x6372c0 <GclCacheMutex>, file=file@entry=0x427e56 "gdp_gcl_cache.c", line=line@entry=383, name=name@entry=0x427e47 "&GclCacheMutex") at ep_thr.c:232
#3  0x00000000004110ce in _gdp_gcl_cache_reclaim (maxage=300) at gdp_gcl_cache.c:383
#4  0x000000000040ac01 in gcl_reclaim_resources () at logd_gcl.c:203
#5  0x000000000040584b in gdpd_reclaim_resources (fd=<optimized out>, what=<optimized out>, ctx=<optimized out>) at logd.c:82
#6  0x00007faeb0e86a0c in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#7  0x0000000000414efb in run_event_loop (eli_=0x1d6eae0) at gdp_main.c:565
#8  0x00007faeafe526ba in start_thread (arg=0x7faeade55700) at pthread_create.c:333
#9  0x00007faeafb8882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7faeae656700 (LWP 27639)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007faeafe54e42 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x6372c0 <GclCacheMutex>) at ../nptl/pthread_mutex_lock.c:115
#2  0x0000000000422db8 in _ep_thr_mutex_lock (mtx=mtx@entry=0x6372c0 <GclCacheMutex>, file=file@entry=0x427e56 "gdp_gcl_cache.c", line=line@entry=202, name=name@entry=0x427e47 "&GclCacheMutex") at ep_thr.c:232
#3  0x000000000041171e in _gdp_gcl_cache_get (gcl_name=gcl_name@entry=0x7faea40bbd1d "\316\356\366c\344\222\021D`47\223\267\304 V^c", mode=mode@entry=0) at gdp_gcl_cache.c:202
#4  0x0000000000414998 in gdp_pdu_proc_cmd (pdu_=0x7faea40bbd00) at gdp_main.c:146
#5  0x00000000004238fa in worker_thread (a=<optimized out>) at ep_thr_pool.c:145
#6  0x00007faeafe526ba in start_thread (arg=0x7faeae656700) at pthread_create.c:333
#7  0x00007faeafb8882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7faeb124c880 (LWP 27634)):
#0  0x00007faeafe5398d in pthread_join (threadid=140388218525440, thread_return=thread_return@entry=0x0) at pthread_join.c:90
#1  0x0000000000405cd0 in main (argc=<optimized out>, argv=<optimized out>) at logd.c:447

#3 Updated by Eric Allman almost 4 years ago

  • Status changed from New to In Progress

This looks like it may be a problem that I fixed already, probably in the existing master branch. It might be commit:6f1724a8, but I"m not sure. I last updated the running gdplogd code on 12/11 and a fair number of locking changes happened after that. No guarantees.

#4 Updated by Nitesh Mor almost 4 years ago

I will deploy the most recent code to gdp-01,02,03,04 and see how it goes. In any case, there is a core dump of a stuck-instance in /home/mor/core.27634 on gdp-01, which might be useful in case this resurfaces again.

#5 Updated by Eric Allman over 3 years ago

  • Status changed from In Progress to Resolved
  • Assignee changed from Nitesh Mor to Eric Allman

I believe this was fixed in commit:b572158d. New code is now running on all four gdp-xx machines, but we need more time to ensure that we are deadlock free.

#6 Updated by Eric Allman over 3 years ago

  • Blocks deleted (Bug #91: edu.berkeley.eecs.gdp-01.gdplog down or hung? gdp create hanging under Node with Ubuntu, RHEL and Mac OS X)

#7 Updated by Eric Allman over 3 years ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF