Bug #91

edu.berkeley.eecs.gdp-01.gdplog down or hung? gdp create hanging under Node with Ubuntu, RHEL and Mac OS X

Added by Anonymous almost 7 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
gdplogd
Start date:
12/17/2016
Due date:
% Done:

0%


Description

In summary, using logdname = edu.berkeley.eecs.gdp-01.gdplogd seems to cause the hang, while changing it to edu.berkeley.eecs.gdp-02.gdplogd works.

Below are the details.

The accessors build is hanging during gcl create under Ubuntu, RHEL and Mac OS X.

To replicate:

svn co https://repo.eecs.berkeley.edu/svn-anon/projects/terraswarm/accessors/trunk/accessors
cd accessors/web
ant tests.mocha

The output will end with:

     [exec]   NodeHost
     [exec] GDPLogAppend.js: setup()
     [exec] Instantiated accessor GDPLogCreateAppendReadJS.js with class ./gdp/test/auto/GDPLogCreateAppendReadJS.js
     [exec] GDPLogCreate.js: initialize()
     [exec] GDPLogAppend.js: initialize()
     [exec] JavaScriptGDPLogName: ptolemy.actor.lib.jjs.modules.gdp.test.auto.GDPLogSubscribeJS.0.3682781502617527
     [exec] GDPLogCreate.js: create() Start.

If I place the ant process in the background, use ps to find the node process, and attach to it with gdb -p, I get:

(gdb) where
#0  0x000000344b20e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000000344b2095d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x000000344b2094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f3ee080ce01 in _ep_thr_mutex_lock (mtx=0x3558390, file=0x7f3ee0811168 "gdp_req.c", line=347,
    name=0x7f3ee081115c "&req->mutex") at ep_thr.c:232
#4  0x00007f3ee0803a04 in _gdp_req_lock (req=0x3558390) at gdp_req.c:347
#5  0x00007f3ee0803fee in _gdp_req_freeall (reqlist=0x3558580, shutdownfunc=0) at gdp_req.c:307
#6  0x00007f3ee07fd56a in _gdp_gcl_freehandle (gcl=0x3558540) at gdp_gcl_ops.c:132
#7  0x00007f3ee07fb541 in _gdp_gcl_decref (gclp=0x7fff77307ce0) at gdp_gcl_cache.c:603
#8  0x00007f3ee07fd9c7 in _gdp_gcl_create (
    gclname=0x3532608 "\363\034\362\201\216\217\036\226\207Rf\352\001\375k\357?\331\004\as\204\255\216xReBJ\201\223 ۛ\221b\001^\271\271\365|9\355Ҋw\353\071\223\334o5\355\224ӆ\372Y\203\370\244\254\023U", logdname=<value optimized out>,
    gmd=0x353ca40, chan=0x353c5b0, reqflags=<value optimized out>, pgcl=0x3532600) at gdp_gcl_ops.c:285
#9  0x00007f3ee07f7cd8 in gdp_gcl_create (
    gclname=0x3532608 "\363\034\362\201\216\217\036\226\207Rf\352\001\375k\357?\331\004\as\204\255\216xReBJ\201\223 ۛ\221b\001^\271\271\365|9\355Ҋw\353\071\223\334o5\355\224ӆ\372Y\203\370\244\254\023U",
    logdname=0x3532628 "\233\221b\001^\271\271\365|9\355Ҋw\353\071\223\334o5\355\224ӆ\372Y\203\370\244\254\023U",
    gmd=0x353ca40, pgcl=0x3532600) at gdp_api.c:331
#10 0x00007f3ee0c2a84c in ffi_call_unix64 ()
   from /tmp/atest/accessors/web/node_modules/ffi/build/Release/ffi_bindings.node
#11 0x00007f3ee0c29ff3 in ffi_call () from /tmp/atest/accessors/web/node_modules/ffi/build/Release/ffi_bindings.node
#12 0x00007f3ee0c22ae4 in FFI::FFICall(Nan::FunctionCallbackInfo<v8::Value> const&) ()
   from /tmp/atest/accessors/web/node_modules/ffi/build/Release/ffi_bindings.node
#13 0x00007f3ee0c22436 in Nan::imp::FunctionCallbackWrapper(v8::FunctionCallbackInfo<v8::Value> const&) ()
   from /tmp/atest/accessors/web/node_modules/ffi/build/Release/ffi_bindings.node
#14 0x0000000000981681 in v8::internal::FunctionCallbackArguments::Call(void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) ()
#15 0x00000000009d5b0b in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::(anonymous namespace)::BuiltinArguments<(v8::internal::BuiltinExtraArguments)1>) ()
#16 0x00000000009d60b1 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#17 0x000033f454e0961b in ?? ()
#18 0x00001eed3f2412b1 in ?? ()
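
The manual steps above (find the node process, attach with gdb, ask for a backtrace) can also be run non-interactively, which makes it easier to capture the state of a hung build. A minimal sketch, assuming gdb is on the PATH and the caller has permission to attach to the process. The helper names gdb_backtrace_command and dump_backtrace are hypothetical and not part of the accessors or GDP code:

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to pid, prints a backtrace
    for every thread, and then exits (batch mode)."""
    return ["gdb", "-p", str(int(pid)), "-batch", "-ex", "thread apply all bt"]


def dump_backtrace(pid, timeout=60):
    """Run gdb against pid and return its combined stdout/stderr as text."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

For example, print(dump_backtrace(pid)) with the pid reported by ps would print output comparable to the "where" listing above, but for all threads.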

However, gcl-create works fine.

I'm not sure what happened; I had a successful run on Dec 12, 2016 9:01:07 AM.

I updated the @terraswarm/gdp package after that, but I believe I was able to run the gdp tests.

I checked out a version of the accessors repo from before yesterday's changes and that did not change anything.

So, perhaps there is something up with the gdp server?

I changed the logdname server to edu.berkeley.eecs.gdp-02.gdplogd and the problem went away.

History

#1 Updated by Nitesh Mor almost 7 years ago

  • Assignee changed from Eric Allman to Nitesh Mor

It appears that both gdp-01 and gdp-02 were stuck. I confirmed that the issue is with the log server, but there is nothing in the log files. I have collected a core dump on gdp-02 and restarted gdplogd on both -01 and -02, which should fix it for the moment.

#2 Updated by Anonymous almost 7 years ago

Thanks, this seems to be fixed for me now.

Feel free to close this.

It took me a while to figure out.

It would be helpful if there were a script that periodically attempted to create logs on the different servers, so that we can be sure they are working.

I guess gcl-create worked because it was using gdp-04.
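
The periodic check suggested above could look roughly like the following. This is only a sketch: it does not call any GDP API itself. Instead it runs a caller-supplied command once per server with a timeout, so that a hung create shows up as a reported failure instead of blocking forever. The function names, the 30-second default, and the idea of building one create command per log daemon name are all assumptions; substitute whatever command actually creates a log at your site.

```python
import subprocess


def probe(command, timeout=30.0):
    """Run command and classify the outcome.

    Returns 'ok' on exit status 0, 'failed' on a non-zero exit status or
    if the command cannot be started, and 'hung' if it exceeds timeout.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return "hung"
    except OSError:
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


def probe_servers(servers, make_command, timeout=30.0):
    """Probe every server name; make_command(server) returns the argv to run."""
    return {server: probe(make_command(server), timeout) for server in servers}
```

Run from cron or a similar scheduler with the list of log daemon names (for example the gdp-01 and gdp-02 names from this report), any server whose status is not "ok" can be flagged before a build hangs on it.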

#3 Updated by Anonymous almost 7 years ago

The ptII build hung in the same place again, so the problem seems somewhat deterministic. Could someone restart gdp-01 and possibly 02?

#4 Updated by Nitesh Mor almost 7 years ago

I just restarted gdplogd again on gdp-01. From a very quick look, it appears to be some kind of deadlock. I have to acknowledge that I need to brush up on my gdb know-how a bit.

(gdb) attach 11222
Attaching to process 11222
[New LWP 11225]
[New LWP 11226]
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
0x00007f1536d7b9dd in pthread_join (threadid=<optimized out>, thread_return=<optimized out>) at pthread_join.c:117
117     pthread_join.c: No such file or directory.
(gdb) thread apply all bt

Thread 3 (LWP 11226):
#0  0x00007f1536d832bd in __lll_timedlock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:199
#1  0x0000000000000008 in ?? ()
#2  0x00007f1536d7ce92 in __GI___pthread_mutex_lock (mutex=0x6372c0 <GclCacheMutex>)
    at ../nptl/pthread_mutex_lock.c:123
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000001 in ?? ()
#5  0x000000000042790e in ?? ()
#6  0x0000000000000049 in ?? ()
#7  0x0000000000429acf in ?? ()
#8  0x0000000000000007 in ?? ()
#9  0x0000000000000000 in ?? ()

Thread 2 (LWP 11225):
#0  0x00007f1536d832bd in __lll_timedlock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:199
#1  0x000000000000002b in ?? ()
#2  0x00007f1536d7ce0d in __GI___pthread_mutex_lock (mutex=0x7f15283a12b0) at ../nptl/pthread_mutex_lock.c:95
#3  0xd8fd89e6ee48a300 in ?? ()
#4  0x00007f153557dbb0 in ?? ()
#5  0x0000000000000035 in ?? ()
#6  0x0000000000000000 in ?? ()

Thread 1 (LWP 11222):
#0  0x00007f1536d7b9dd in pthread_join (threadid=<optimized out>, thread_return=<optimized out>) at pthread_join.c:117
#1  0x0000000000000000 in ?? ()

#5 Updated by Anonymous almost 7 years ago

At this point, the GDP tests are hanging under Java and JavaScript, so I'm temporarily commenting them out.

This is not a huge problem; I'll re-enable them when this is fixed.

#6 Updated by Nitesh Mor almost 7 years ago

OK, I was poking around on gdp-01 yesterday and installed some packages that gdb needs to do its job a little better. This time I can get a sensible backtrace (unlike the last time it was hung). I am copying this to a new bug and marking the new bug as a dependency of this one.

Thanks for the bug-report.

#7 Updated by Nitesh Mor almost 7 years ago

  • Blocked by Bug #92: Deadlock in gdplogd added

#8 Updated by Eric Allman over 6 years ago

  • Status changed from New to Resolved
  • Assignee changed from Nitesh Mor to Eric Allman

I believe that this may have been caused by the same bug as issue #92. I'll set this to "resolved" until we've run longer with the newer code.

#9 Updated by Anonymous over 6 years ago

I'm fine with this being closed. The problem seems to have been fixed: the accessor tests that use the JavaScript interface have been using edu.berkeley.eecs.gdp-01.gdplogd for some time now.

#10 Updated by Eric Allman over 6 years ago

  • Blocked by deleted (Bug #92: Deadlock in gdplogd)

#11 Updated by Eric Allman over 6 years ago

  • Status changed from Resolved to Closed
