Project

General

Profile

Bug #55

GDP log subscribe hangs

Added by Nima Mousavi almost 4 years ago. Updated about 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
libgdp
Start date:
09/26/2016
Due date:
% Done:

0%


Description

I've been using the function calls that have been used in gdp/lang/python/reader_multi_subscribe.py, and I'm facing this issue. After getting data for a while, the running just freezes. I traces the runtime and found that is freezes in __func1 call. (Trace attached)

I didn't face a similar issue when running reader_multi_subscribe.py (at least not while I was watching.). Do you know what may be the source of the problem?
I've tried two different logs so far:
edu.berkeley.eecs.swarmlab.device.c098e5300003
edu.berkeley.eecs.bwrc.device.c098e530005d

Thanks.
Nima

gdp.JPG - Trace (163 KB) Nima Mousavi, 09/26/2016 10:42 AM

between_two_missing_datapoints.txt Magnifier (4.98 KB) Nima Mousavi, 10/23/2016 10:27 PM

between_two_datapoints.txt Magnifier (7.18 KB) Nima Mousavi, 10/23/2016 10:27 PM

log.txt Magnifier (443 KB) Nima Mousavi, 10/23/2016 11:05 PM

49

Related issues

Related to GDP - Bug #75: Subscription times out Closed 10/15/2016

History

#1 Updated by Eric Allman almost 4 years ago

  • Category set to Python API
  • Assignee set to Nitesh Mor

Nitesh, I'm passing this one to you because Nima is seeing the problem in _func1, which is part of the Python API. Feel free to pass it back to me if _func1 is just a symptom of libgdp issues.

#2 Updated by Nitesh Mor almost 4 years ago

  • Category changed from Python API to libgdp
  • Assignee changed from Nitesh Mor to Eric Allman

__func1 is just a name for gdp_event_next. As per the trace in the JPEG file (really?), it is called with a NULL value, which implies there is no timeout and it will wait forever for a new event.

There wasn't any data being logged and I had to restart the MQTT gateways (which is a separate problem altogether), which explains the stuck subscription. Ideally, a production system will use some kind of timeout. (Nima: see source:lang/python/apps/reader_test_multiread.py for an example of how to use a timeout).

There is, however, a possibility that a network failure, or a log-server restart, or any other activity not exposed to the user may cause such a blocking operation to never return. Keeping this open to verify that such condition is handled.

#3 Updated by Nima Mousavi almost 4 years ago

I'm currently pulling data from a log that I'm feeding the data to. So I think the data is being logged in there, it just gets stuck after a few data points collected through subscription.
I added the timeout, and now I'm using this line to get data from the log that I'm generating (new data every 4-5 second).
t = {'tv_sec':0, 'tv_nsec':10*(10**9), 'tv_accuracy':0.0}

newRecord = gdp.GDP_GCL.get_next_event(t)

The problem is that the first data is collected OK, but in the next one "None" is returned. (I assume it has timed out, I tried different values of timeout too)

Previously I was calling get_next_event(None), which would work for around 10-15 data points being collected through subscribe, and then it would get stuck on that line.

By the way, I just tried the same code on SwarmBox, and it isn't getting stuck there.

#4 Updated by Eric Allman almost 4 years ago

  • Status changed from New to Feedback

Nima, could you verify that this is still broken on the latest code base? You'll have to recompile libgdp and your application. I recently made some changes that touch this part of the code, so I want to make sure that there's still a problem before I try to debug it.

#5 Updated by Nima Mousavi almost 4 years ago

As I mentioned in my other bug report, I cannot run gdp-setup.bash on Raspberry Pi, because it is not supported. I run make, which compiles some of the c files, but fails on db.h.
So I'm not sure if the latest changes are applied to the RPi version.

Also, the issue does not exist on SwarmBox (which runs Ubuntu). I was able to run the setup bash script on SwarmBox though, which may or may not be the reason behind it working there.

Another point is that the sample python function for subscribe in gdp/lang/python/apps runs fine as far as I can see on both RPi and SwarmBox.

#6 Updated by Eric Allman almost 4 years ago

Nima, commit:4770b39a should have fixed the libdb compile problem (sorry, I don't think I remembered to tell you that I had fixed it). If that doesn't work, you can use sudo apt-get install libdb-dev.

#7 Updated by Nima Mousavi almost 4 years ago

I tried it again and apparently now it was able to install using the bash script. I still had the same freezing issue though when I tried the code right afterwards.
(Although now I'm receiving Error 600, which I think is a separate problem. I submitted a new bug for this)
Thanks.

#8 Updated by Eric Allman almost 4 years ago

Nima, can you clarify if the problem you are seeing is that your program hangs or gets a 600 error? The 600 error is probably a router problem (which has been acting up again). The hang could be because there are simply no events to read, which is not an error. If there is nothing to read and no timeout, get_next_event is supposed to hang. In particular, the only way to exit reader_multi_subscribe.py is to interrupt the process (probably using ^C), so that's not a bug.

On the other hand, if there is more data being fed to the logs to which reader_multi_subscribe.py has subscribed, but with no output, then there is definitely a bug.

#9 Updated by Nima Mousavi almost 4 years ago

The hanging issue was happening prior to the error 600. So they are separate issues.

There is data being added to the log, as I am writing to a log with one machine, and reading from it on a separate machine which hangs.

#10 Updated by Nitesh Mor almost 4 years ago

  • Related to Bug #75: Subscription times out added

#11 Updated by Nima Mousavi almost 4 years ago

Hello,
I just tried again to reproduce the bug that I reported before (#55) and it is still there. I think it may have a different nature from the bug that has been resolved in #75, as I'm constantly writing to the log (every 5 seconds ish), yet still after a while the subscription gets stuck..
Is there any hope to resolve this?

ioClass.py(238): if self.IOtype == 'SER':
ioClass.py(256): newRecord = gdp.GDP_GCL.get_next_event(None)
--- modulename: GDP_GCL, funcname: get_next_event
GDP_GCL.py(628): return cls._helper_get_next_event(None, timeout)
--- modulename: GDP_GCL, funcname: _helper_get_next_event
GDP_GCL.py(542): __func1 = gdp.gdp_event_next
GDP_GCL.py(545): if __gcl_handle == None:
GDP_GCL.py(546): __func1_arg1_type = c_void_p
GDP_GCL.py(551): if timeout == None:
GDP_GCL.py(552): __timeout = None
GDP_GCL.py(553): __func1_arg2_type = c_void_p
GDP_GCL.py(563): __func1.argtypes = [__func1_arg1_type, __func1_arg2_type]
GDP_GCL.py(564): __func1.restype = POINTER(cls.gdp_event_t)
GDP_GCL.py(566): event_ptr = __func1(__gcl_handle, __timeout)

#12 Updated by Nima Mousavi almost 4 years ago

I tried another approach with 1 second timeout, and ignoring the result if the returned value was None. The code still hangs in the same line:


ioClass.py(256): t = {'tv_sec':0, 'tv_nsec':10*(10**9), 'tv_accuracy':0.0}
ioClass.py(257): newRecord = gdp.GDP_GCL.get_next_event(t)
--- modulename: GDP_GCL, funcname: get_next_event
GDP_GCL.py(628): return cls._helper_get_next_event(None, timeout)
--- modulename: GDP_GCL, funcname: _helper_get_next_event
GDP_GCL.py(542): __func1 = gdp.gdp_event_next
GDP_GCL.py(545): if __gcl_handle == None:
GDP_GCL.py(546): __func1_arg1_type = c_void_p
GDP_GCL.py(551): if timeout == None:
GDP_GCL.py(555): __timeout = GDP_DATUM.EP_TIME_SPEC()
GDP_GCL.py(556): __timeout.tv_sec = c_int64(timeout['tv_sec'])
GDP_GCL.py(557): __timeout.tv_nsec = c_uint32(timeout['tv_nsec'])
GDP_GCL.py(558): __timeout.tv_accuracy = c_float(timeout['tv_accuracy'])
GDP_GCL.py(559): __func1_arg2_type = POINTER(GDP_DATUM.EP_TIME_SPEC)
GDP_GCL.py(563): __func1.argtypes = [__func1_arg1_type, __func1_arg2_type]
GDP_GCL.py(564): __func1.restype = POINTER(cls.gdp_event_t)
GDP_GCL.py(566): event_ptr = __func1(__gcl_handle, __timeout)

#13 Updated by Nitesh Mor almost 4 years ago

I just deployed a more recent version of the code on gdp-0[2,3,4].eecs.berkeley.edu. (still need to upgrade gdp-01.eecs.berkeley.edu). The servers had not been updated in a while.

Could you verify that you still run into this issue with the most recent code? I would ask you to create a new log on a server other than gdp-01. You would do this by using the -s edu.berkeley.eecs.gdp-02.gdplogd flag for gcl-create.

If the issue persists, then could you please share the following:

  • The version of GDP code that you are running. You can find it out by using strings /usr/lib/libgdp.so.0.7 | grep libgdp (or something similar, based on your specific environment/configuration). If you are not running somewhat recent version of the code, I'd ask you to upgrade and try to reproduce the bug.
  • More debugging output. You can include gdp.dbg_set("*=50") in your Python code right after you call gdp.gdp_init().

#14 Updated by Nitesh Mor almost 4 years ago

  • Status changed from Feedback to In Progress

#15 Updated by Nima Mousavi almost 4 years ago

I created a new log in gdp-02 and tested again. It no longer gets stuck indefinitely, but still after a few data points it gets stuck for a while and loses a bunch of data points that are being published at that time.

I ran the command you mentioned on gdp(repo)/gdp/libgdp.so.0.7 and received the following result:
@(#)libgdp 0.7.2 (2016-10-23 21:22) 6370e6f90b1e30cfb1e0bd8e865525747dfa6abe

I added the debug line, and found out that even in cases where data is not received by the subscriber, the event is somehow detected, but the data is not extracted out of it. These logs are the result of setting the debug flag in the subscriber code.

  • When subscription is working correctly, between each two data points I get a log similar to 'between_two_datapoints.txt' (attached) --- In these cases I sometimes see lines such as gdp_pdu_proc_resp <<< done and subscr_poker_thread: poking in between data points. But not all the time.

-When data is being published but nothing is received by the subscriber, between each two published data points I get a log similar to 'between_two_missing_datapoints.txt' (attached)

#16 Updated by Nitesh Mor almost 4 years ago

Ok, this is certainly progress.

Could you please share the entire debug output (rather than a filtered version) for cases where you are missing events? I can see that incoming subscription data is discarded, because the event processing routine fails to find a corresponding subscription request. However, the filtered version does not tell me what happened to that missing request.

The output is sent to STDERR, so it should be fairly easy to redirect it to a file selectively.

#17 Updated by Nima Mousavi almost 4 years ago

I attached the complete log with this post. It starts out being ok, then gets stuck, and then continues to get data.

#18 Updated by Nitesh Mor almost 4 years ago

Okay, it appears that you are using a multiread and a subscription at the same time.

Currently, multiple subscriptions and/or multireads to the same log in the same process don't work. We are aware of this bug and it requires some major changes in the way various components are structured. So, however annoying it might be, it is certainly not going to be fixed in the next few days.

#19 Updated by Nitesh Mor almost 4 years ago

Just a quick update, gdp-01 has been upgraded to the same state as the other three servers. So, you should not have to avoid gdp-01.

#20 Updated by Nima Mousavi almost 4 years ago

Well that explains it then.
I'm now just using subscribe and it seems to be working.
You may want to close this bug and I'll let you know if there were other issues.

Thanks.

#21 Updated by Nitesh Mor almost 4 years ago

  • Status changed from In Progress to Closed

Closing this, since it's a different bug (multiple subscriptions and/or multi-reads to the same log in a single client program don't work).

#22 Updated by Anonymous about 3 years ago

I suggest creating a test case using python.

See gdp/lang/python/test/writer_subscriber2_test.py

If you add a file there and the name has _test in it, then it will be run
automatically.

_Christopher

On Sun, Oct 23, 2016 at 1:05 PM, gdp@berkeley.edu wrote:

Issue #55 has been updated by Nima Mousavi.

I tried another approach with 1 second timeout, and ignoring the result if
the returned value was None. The code still hangs in the same line:

ioClass.py(256): t = {'tv_sec':0, 'tv_nsec':10*(10**9), 'tv_accuracy':0.0}
ioClass.py(257): newRecord = gdp.GDP_GCL.get_next_event(t)
--- modulename: GDP_GCL, funcname: get_next_event
GDP_GCL.py(628): return cls.helper_get_next_event(None, timeout)
--- modulename: GDP_GCL, funcname: helper_get_next_event
GDP_GCL.py(542): _func1 = gdp.gdp_event_next
GDP_GCL.py(545): if __gcl_handle == None:
GDP
GCL.py(546): _func1_arg1_type = cvoid
p
GDP_GCL.py(551): if timeout == None:
GDP_GCL.py(555): _timeout = GDP_DATUM.EP_TIME_SPEC()
GDP_GCL.py(556): _
timeout.tv_sec = c_int64(timeout['tv
sec'])
GDP_GCL.py(557): timeout.tv_nsec = c_uint32(timeout['tv_nsec'])
GDP_GCL.py(558): _timeout.tv_accuracy = c_float(timeout['tv_accuracy'])
GDP_GCL.py(559): _func1arg2
type = POINTER(GDP_DATUM.EP_TIME_SPEC)
GDP_GCL.py(563): __func1.argtypes = [
func1_arg1_type, _func1_arg2_type]
GDP_GCL.py(564): __func1.restype = POINTER(cls.gdp
event_t)

GDP_GCL.py(566): event_ptr = func1(gcl_handle, __timeout)

Bug #55: GDP log subscribe hangs
https://gdp.cs.berkeley.edu/redmine/issues/55#change-286

  • Author: Nima Mousavi
  • Status: Feedback
  • Priority: Normal
  • Assignee: Eric Allman
  • Category: libgdp
  • Target version:

I've been using the function calls that have been used in
gdp/lang/python/reader_multi_subscribe.py, and I'm facing this issue.
After getting data for a while, the running just freezes. I traces the
runtime and found that is freezes in __func1 call. (Trace attached)

I didn't face a similar issue when running reader_multi_subscribe.py (at
least not while I was watching.). Do you know what may be the source of the
problem?
I've tried two different logs so far:
edu.berkeley.eecs.swarmlab.device.c098e5300003
edu.berkeley.eecs.bwrc.device.c098e530005d

Thanks.
Nima
Files gdp.JPG
https://gdp.cs.berkeley.edu/redmine/attachments/download/49/gdp.JPG

(163 KB)

You have received this notification because you have either subscribed to
it, or are involved in it.
To change your notification preferences, please click here:
https://gdp.cs.berkeley.edu/redmine/my/account

--
Christopher Brooks, PMP
Academic Program Manager
CHESS/Ptolemy/TerraSwarm
707.332.0670, cxh@eecs.berkeley.edu, https://ptolemy.eecs.berkeley.edu/~cxh

Also available in: Atom PDF