Bug #91
edu.berkeley.eecs.gdp-01.gdplog down or hung? gdp create hanging under Node with Ubuntu, RHEL and Mac OS X
Description
In summary, it seems that logdname = edu.berkeley.eecs.gdp-01.gdplogd is causing problems, while changing it to edu.berkeley.eecs.gdp-02.gdplogd works.
Below are the details.
The accessors build is hanging during gcl create under Ubuntu, RHEL and Mac OS X.
To replicate:
svn co https://repo.eecs.berkeley.edu/svn-anon/projects/terraswarm/accessors/trunk/accessors
cd accessors/web
ant tests.mocha
The output will end with:
[exec] NodeHost
[exec] GDPLogAppend.js: setup()
[exec] Instantiated accessor GDPLogCreateAppendReadJS.js with class ./gdp/test/auto/GDPLogCreateAppendReadJS.js
[exec] GDPLogCreate.js: initialize()
[exec] GDPLogAppend.js: initialize()
[exec] JavaScriptGDPLogName: ptolemy.actor.lib.jjs.modules.gdp.test.auto.GDPLogSubscribeJS.0.3682781502617527
[exec] GDPLogCreate.js: create() Start.
If I put the ant process in the background, use ps to find the node process, and attach to it with gdb -p, I get:
(gdb) where
#0  0x000000344b20e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000000344b2095d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x000000344b2094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f3ee080ce01 in _ep_thr_mutex_lock (mtx=0x3558390, file=0x7f3ee0811168 "gdp_req.c", line=347, name=0x7f3ee081115c "&req->mutex") at ep_thr.c:232
#4  0x00007f3ee0803a04 in _gdp_req_lock (req=0x3558390) at gdp_req.c:347
#5  0x00007f3ee0803fee in _gdp_req_freeall (reqlist=0x3558580, shutdownfunc=0) at gdp_req.c:307
#6  0x00007f3ee07fd56a in _gdp_gcl_freehandle (gcl=0x3558540) at gdp_gcl_ops.c:132
#7  0x00007f3ee07fb541 in _gdp_gcl_decref (gclp=0x7fff77307ce0) at gdp_gcl_cache.c:603
#8  0x00007f3ee07fd9c7 in _gdp_gcl_create (gclname=0x3532608 "\363\034\362\201\216\217\036\226\207Rf\352\001\375k\357?\331\004\as\204\255\216xReBJ\201\223 ۛ\221b\001^\271\271\365|9\355Ҋw\353\071\223\334o5\355\224ӆ\372Y\203\370\244\254\023U", logdname=<value optimized out>, gmd=0x353ca40, chan=0x353c5b0, reqflags=<value optimized out>, pgcl=0x3532600) at gdp_gcl_ops.c:285
#9  0x00007f3ee07f7cd8 in gdp_gcl_create (gclname=0x3532608 "\363\034\362\201\216\217\036\226\207Rf\352\001\375k\357?\331\004\as\204\255\216xReBJ\201\223 ۛ\221b\001^\271\271\365|9\355Ҋw\353\071\223\334o5\355\224ӆ\372Y\203\370\244\254\023U", logdname=0x3532628 "\233\221b\001^\271\271\365|9\355Ҋw\353\071\223\334o5\355\224ӆ\372Y\203\370\244\254\023U", gmd=0x353ca40, pgcl=0x3532600) at gdp_api.c:331
#10 0x00007f3ee0c2a84c in ffi_call_unix64 () from /tmp/atest/accessors/web/node_modules/ffi/build/Release/ffi_bindings.node
#11 0x00007f3ee0c29ff3 in ffi_call () from /tmp/atest/accessors/web/node_modules/ffi/build/Release/ffi_bindings.node
#12 0x00007f3ee0c22ae4 in FFI::FFICall(Nan::FunctionCallbackInfo<v8::Value> const&) () from /tmp/atest/accessors/web/node_modules/ffi/build/Release/ffi_bindings.node
#13 0x00007f3ee0c22436 in Nan::imp::FunctionCallbackWrapper(v8::FunctionCallbackInfo<v8::Value> const&) () from /tmp/atest/accessors/web/node_modules/ffi/build/Release/ffi_bindings.node
#14 0x0000000000981681 in v8::internal::FunctionCallbackArguments::Call(void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) ()
#15 0x00000000009d5b0b in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::(anonymous namespace)::BuiltinArguments<(v8::internal::BuiltinExtraArguments)1>) ()
#16 0x00000000009d60b1 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#17 0x000033f454e0961b in ?? ()
#18 0x00001eed3f2412b1 in ?? ()
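The manual steps above (background the build, find the node process, attach with gdb -p) can be scripted so that a backtrace of every thread is captured without an interactive gdb session. This is a minimal Python sketch, assuming gdb and pgrep are on the PATH; it uses `pgrep -f` in place of ps, and the helper names (find_pids, gdb_backtrace_cmd, capture_backtraces) are hypothetical, not part of the accessors or GDP tooling.

```python
import subprocess
from typing import List


def find_pids(pattern: str) -> List[int]:
    """Return PIDs whose full command line matches `pattern` (via `pgrep -f`).

    pgrep exits with status 1 when nothing matches; that case yields [].
    """
    result = subprocess.run(["pgrep", "-f", pattern],
                            capture_output=True, text=True)
    if result.returncode not in (0, 1):
        raise RuntimeError(f"pgrep failed: {result.stderr.strip()}")
    return [int(tok) for tok in result.stdout.split()]


def gdb_backtrace_cmd(pid: int) -> List[str]:
    """Build a non-interactive gdb command line: attach to `pid`, print a
    backtrace of every thread, then exit (gdb detaches from an attached
    process when it exits, leaving the process running)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture_backtraces(pid: int, timeout: float = 120.0) -> str:
    """Run gdb against a (presumably hung) process and return its output."""
    result = subprocess.run(gdb_backtrace_cmd(pid), capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr
```

For example, `for pid in find_pids("node"): print(capture_backtraces(pid))`. The pattern "node" is loose and may match unrelated processes, so a more specific pattern is safer. On Ubuntu, attaching to a process that is not a child of gdb may also require root or a relaxed `kernel.yama.ptrace_scope` setting.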
However, the gcl-create command works fine.
I'm not sure what happened; I had a successful run on Dec 12, 2016 at 9:01:07 AM.
I updated the @terraswarm/gdp package after that, but I believe I was still able to run the GDP tests.
I checked out a version of the accessors repository from before yesterday's changes, and that did not change anything.
So, perhaps there is something up with the GDP server?
I changed logdname to edu.berkeley.eecs.gdp-02.gdplogd, and the problem went away.
History
#1 Updated by Nitesh Mor almost 7 years ago
- Assignee changed from Eric Allman to Nitesh Mor
It appears that both gdp-01 and gdp-02 were stuck. I confirmed it to be an issue with the log-server, but there is nothing in the logfiles. For the moment, I have collected a core dump on gdp-02 and restarted gdplogd on both -01 and -02, which should fix it for the moment.
#2 Updated by Anonymous almost 7 years ago
Thanks, this seems to be fixed for me now.
Feel free to close this.
It took me a while to figure out.
It would be helpful if there was a script that would periodically attempt to create logs on the different servers so that we are sure that they are working.
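A minimal sketch of such a checker in Python. The thread does not say which exact command should be used to create a test log, so each server's probe command is supplied by the caller and is an assumption; the function names are hypothetical. A probe that does not finish within the timeout is reported as "hung" instead of blocking the checker, which is the failure mode seen in this bug.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class ProbeResult:
    server: str
    status: str  # "ok", "failed", or "hung"
    detail: str = ""


def probe_server(server: str, command: Sequence[str], timeout: float) -> ProbeResult:
    """Run one probe command; treat a timeout as a hung server."""
    try:
        # subprocess.run kills the child if the timeout expires.
        result = subprocess.run(list(command), capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return ProbeResult(server, "hung", f"no answer within {timeout}s")
    if result.returncode != 0:
        return ProbeResult(server, "failed", result.stderr.strip())
    return ProbeResult(server, "ok")


def probe_all(commands: Dict[str, Sequence[str]], timeout: float = 30.0) -> List[ProbeResult]:
    """Probe each server in turn; `commands` maps server name -> argv list."""
    return [probe_server(name, argv, timeout) for name, argv in commands.items()]


def first_healthy(results: List[ProbeResult]) -> Optional[str]:
    """Return the first server whose probe succeeded, or None."""
    for r in results:
        if r.status == "ok":
            return r.server
    return None


def monitor(commands: Dict[str, Sequence[str]], interval: float = 600.0,
            timeout: float = 30.0, rounds: Optional[int] = None) -> None:
    """Probe all servers every `interval` seconds and report problems on
    stderr; `rounds=None` runs forever."""
    done = 0
    while rounds is None or done < rounds:
        for r in probe_all(commands, timeout):
            if r.status != "ok":
                print(f"{r.server}: {r.status} {r.detail}", file=sys.stderr)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

The commands dict would be keyed by names such as edu.berkeley.eecs.gdp-01.gdplogd and edu.berkeley.eecs.gdp-02.gdplogd. `first_healthy` mirrors the manual workaround of switching logdname from gdp-01 to gdp-02 when one server stops answering.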
I guess gcl-create worked because it was using gdp-04.
#3 Updated by Anonymous almost 7 years ago
The ptII build hung in the same place again, so the problem seems somewhat deterministic. Could someone restart gdp-01 and possibly 02?
#4 Updated by Nitesh Mor almost 7 years ago
I just restarted gdplogd again on gdp-01. From a very quick look, it appears to be some kind of deadlock. I have to admit that I need to brush up on my gdb know-how a little.
(gdb) attach 11222
Attaching to process 11222
[New LWP 11225]
[New LWP 11226]
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
0x00007f1536d7b9dd in pthread_join (threadid=<optimized out>, thread_return=<optimized out>) at pthread_join.c:117
117 pthread_join.c: No such file or directory.
(gdb) thread apply all bt

Thread 3 (LWP 11226):
#0  0x00007f1536d832bd in __lll_timedlock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:199
#1  0x0000000000000008 in ?? ()
#2  0x00007f1536d7ce92 in __GI___pthread_mutex_lock (mutex=0x6372c0 <GclCacheMutex>) at ../nptl/pthread_mutex_lock.c:123
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000001 in ?? ()
#5  0x000000000042790e in ?? ()
#6  0x0000000000000049 in ?? ()
#7  0x0000000000429acf in ?? ()
#8  0x0000000000000007 in ?? ()
#9  0x0000000000000000 in ?? ()

Thread 2 (LWP 11225):
#0  0x00007f1536d832bd in __lll_timedlock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:199
#1  0x000000000000002b in ?? ()
#2  0x00007f1536d7ce0d in __GI___pthread_mutex_lock (mutex=0x7f15283a12b0) at ../nptl/pthread_mutex_lock.c:95
#3  0xd8fd89e6ee48a300 in ?? ()
#4  0x00007f153557dbb0 in ?? ()
#5  0x0000000000000035 in ?? ()
#6  0x0000000000000000 in ?? ()

Thread 1 (LWP 11222):
#0  0x00007f1536d7b9dd in pthread_join (threadid=<optimized out>, thread_return=<optimized out>) at pthread_join.c:117
#1  0x0000000000000000 in ?? ()
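In the backtrace above, Thread 3 is blocked on GclCacheMutex and Thread 2 is blocked on a second mutex, but the trace does not show which thread holds which lock, so the cause is not established here. One common pattern consistent with two threads parked in pthread_mutex_lock is a lock-order inversion: each thread holds one lock and waits for the other. The Python sketch below illustrates that generic pattern and the usual remedy of a single global acquisition order. It is not gdplogd's code, and all names in it are hypothetical.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def acquire_in_order(*locks) -> Iterator[None]:
    """Acquire `locks` in one global order (by id()) and release in reverse.

    If every path that needs several of these locks goes through this helper,
    no two threads can hold them in opposite orders.
    """
    ordered = sorted(set(locks), key=id)
    held = []
    try:
        for lock in ordered:
            lock.acquire()
            held.append(lock)
        yield
    finally:
        for lock in reversed(held):
            lock.release()


def naive_worker(first, second, barrier: threading.Barrier,
                 outcome: Dict[str, bool], key: str, timeout: float) -> None:
    """Take `first`, then `second`: the pattern that can deadlock.

    A timed acquire stands in for a blocking pthread_mutex_lock so that the
    demo reports the deadlock instead of hanging forever.
    """
    with first:
        barrier.wait()  # both threads now hold their first lock
        got = second.acquire(timeout=timeout)
        if got:
            second.release()
        barrier.wait()  # keep `first` until the peer has made its attempt
    outcome[key] = got


def ordered_worker(first, second, outcome: Dict[str, bool], key: str) -> None:
    """Request the same two locks, but acquisition order is fixed globally."""
    with acquire_in_order(first, second):
        outcome[key] = True


def demo(ordered: bool, timeout: float = 0.5) -> Dict[str, bool]:
    """Run two threads that request locks a and b in opposite orders."""
    a, b = threading.Lock(), threading.Lock()
    outcome: Dict[str, bool] = {}
    if ordered:
        jobs = [(ordered_worker, (a, b, outcome, "t1")),
                (ordered_worker, (b, a, outcome, "t2"))]
    else:
        barrier = threading.Barrier(2)
        jobs = [(naive_worker, (a, b, barrier, outcome, "t1", timeout)),
                (naive_worker, (b, a, barrier, outcome, "t2", timeout))]
    threads = [threading.Thread(target=fn, args=args) for fn, args in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome
```

With `demo(ordered=False)` each thread gives up waiting for the lock the other holds, and both report False. With `demo(ordered=True)` both threads finish. The actual gdplogd deadlock was tracked separately in Bug #92, and its real cause and fix may differ from this illustration.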
#5 Updated by Anonymous almost 7 years ago
At this point, the GDP tests are hanging under Java and JavaScript, so I'm temporarily commenting them out.
This is not a huge problem; I'll re-enable them when this is fixed.
#6 Updated by Nitesh Mor almost 7 years ago
Ok, I was poking around on gdp-01 yesterday and installed some packages that gdb needs to do its job a little better. This time I was able to get a sensible backtrace (unlike the last time it hung); I am copying it into a new bug and marking the new bug as a dependency of this one.
Thanks for the bug-report.
#7 Updated by Nitesh Mor almost 7 years ago
- Blocked by Bug #92: Deadlock in gdplogd added
#8 Updated by Eric Allman over 6 years ago
- Status changed from New to Resolved
- Assignee changed from Nitesh Mor to Eric Allman
I believe that this may have been caused by the same bug as issue #92. I'll set this to "resolved" until we've run longer with the newer code.
#9 Updated by Anonymous over 6 years ago
I'm fine with this being closed. The problem seems to have been fixed; the accessor tests that use the JavaScript interface have been using edu.berkeley.eecs.gdp-01.gdplogd for some time now.
#10 Updated by Eric Allman over 6 years ago
- Blocked by deleted (Bug #92: Deadlock in gdplogd)
#11 Updated by Eric Allman over 6 years ago
- Status changed from Resolved to Closed