Bug #76

ERROR: request freed while in use

Added by Syed Hasnain over 3 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
-
Start date:
10/15/2016
Due date:
% Done:

0%


Description

The local GDP seems to be crashing a lot. I see the same errors as in bug #69.

sbuser@swarmnuc015:~$ gdp-reader -s -t tamu.watch.2KMX
Reading GCL 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
2016-10-15 21:07:29.946093 -0500 gdp-reader: _gdp_req_freeall: couldn't acquire req lock: ERROR: request freed while in use [Berkeley:Swarm-GDP:31]

[ERROR] gdp-reader: Cannot open GCL:
    ERROR: Connection timed out [EPLIB:errno:110]
exiting after 0 records with status ERROR: Connection timed out [EPLIB:errno:110]

This can happen randomly, even when writing data to the GDP. The problem above seems to be resolved by restarting gdplogd.

The router log file contains these entries:

2016-10-16 02:11:05.022208 selected: write failed: 0: Success
2016-10-16 02:11:05.022212 selected: write failed: 0: Success
2016-10-16 02:11:05.022216 selected: write failed: 0: Success
2016-10-16 02:11:05.022220 selected: write failed: 0: Success
2016-10-16 02:11:05.022224 selected: write failed: 0: Success
2016-10-16 02:11:05.022228 selected: write failed: 0: Success
2016-10-16 02:11:05.022232 selected: write failed: 0: Success
2016-10-16 02:11:05.022236 selected: write failed: 0: Success
2016-10-16 02:11:05.022240 selected: write failed: 0: Success
2016-10-16 02:11:05.022243 selected: write failed: 0: Success
2016-10-16 02:11:05.022247 selected: write failed: 0: Success
2016-10-16 02:11:05.022251 selected: write failed: 0: Success
2016-10-16 02:11:05.022255 selected: write failed: 0: Success
2016-10-16 02:11:05.022258 selected: write failed: 0: Success
2016-10-16 02:11:05.022262 selected: write failed: 0: Success
2016-10-16 02:11:05.022266 selected: write failed: 0: Success
2016-10-16 02:11:05.022269 selected: write failed: 0: Success
2016-10-16 02:11:05.022273 selected: write failed: 0: Success
2016-10-16 02:11:05.022276 selected: write failed: 0: Success

Here is the gdplogd log file covering a few hours:

2016-10-15 22:34:14.330872 _gdp_chan_open: talking to router at 127.0.0.1:8007
2016-10-15 22:34:16.240442 get_open_handle: opening 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
2016-10-15 22:34:16.243379 get_open_handle: opening guJr0NNjgsxlSrzCMGWDfcCU-3hoVheRh4i9F82JDWg
2016-10-15 22:42:39.207176 get_open_handle: opening 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
2016-10-15 22:42:39.209659 get_open_handle: opening guJr0NNjgsxlSrzCMGWDfcCU-3hoVheRh4i9F82JDWg
2016-10-16 00:35:54.792280 2016-10-15 17:35:54 -0700 Running /usr/sbin/gdplogd -D*=10
2016-10-16 00:35:54.794070 _gdp_lib_init(NULL)
2016-10-16 00:35:54.794093      @(#)libgdp 0.7.2 (2016-10-14 07:32) 2023d77857ad557a53cf8d9b044ed94e7c0228f2
2016-10-16 00:35:54.794351 My GDP routing name = GFh8bl8BQXeuvUYHu-z7mhS7mRMdKj2bmUvaXkwcjMM
2016-10-16 00:35:54.794438 gdp_lib_init: OK
2016-10-16 00:35:54.794450 disk_init: log dir = /var/swarm/gdp/gcls, mode = 0600
2016-10-16 00:35:54.794463 Signature strictness = 0x1
2016-10-16 00:35:54.794545 Trying Zeroconf:
2016-10-16 00:35:54.797634 _gdp_chan_open(127.0.0.1)
2016-10-16 00:35:54.797659 Trying 127.0.0.1
2016-10-16 00:35:54.797771 _gdp_chan_open: talking to router at 127.0.0.1:8007
2016-10-16 00:36:01.547141 get_open_handle: opening 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
2016-10-16 00:37:51.154603 get_open_handle: opening guJr0NNjgsxlSrzCMGWDfcCU-3hoVheRh4i9F82JDWg
2016-10-16 02:11:35.131718 2016-10-15 19:11:35 -0700 Running /usr/sbin/gdplogd -D*=10
2016-10-16 02:11:35.146140 _gdp_lib_init(NULL)
2016-10-16 02:11:35.146149      @(#)libgdp 0.7.2 (2016-10-14 07:32) 2023d77857ad557a53cf8d9b044ed94e7c0228f2
2016-10-16 02:11:35.151879 My GDP routing name = GFh8bl8BQXeuvUYHu-z7mhS7mRMdKj2bmUvaXkwcjMM
2016-10-16 02:11:35.151935 gdp_lib_init: OK
2016-10-16 02:11:35.151941 disk_init: log dir = /var/swarm/gdp/gcls, mode = 0600
2016-10-16 02:11:35.151952 Signature strictness = 0x1
2016-10-16 02:11:35.152028 Trying Zeroconf:
2016-10-16 02:11:35.155993 _gdp_chan_open(127.0.0.1)
2016-10-16 02:11:35.156002 Trying 127.0.0.1
2016-10-16 02:11:35.156066 _gdp_chan_open: talking to router at 127.0.0.1:8007
2016-10-16 02:11:54.580867 _gdp_event_cb: got EOF, 0 bytes left
2016-10-16 02:11:55.581090 Trying Zeroconf:
2016-10-16 02:11:55.584546 _gdp_chan_open(127.0.0.1)
2016-10-16 02:11:55.584563 Trying 127.0.0.1
2016-10-16 02:11:55.584595 _gdp_chan_open: talking to router at 127.0.0.1:8007
2016-10-16 02:12:02.055778 get_open_handle: opening 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
2016-10-16 02:12:02.066733 get_open_handle: opening guJr0NNjgsxlSrzCMGWDfcCU-3hoVheRh4i9F82JDWg

gdplogd.log (32.1 KB) Syed Hasnain, 10/16/2016 03:59 PM

gdp-router-click.log (56.1 KB) Syed Hasnain, 10/16/2016 03:59 PM


Related issues

Duplicates GDP - Bug #69: Read, Write Errors Closed 10/07/2016

History

#1 Updated by Nitesh Mor over 3 years ago

  • Subject changed from Local GDP errors to ERROR: request freed while in use
  • Description updated (diff)

#2 Updated by Nitesh Mor over 3 years ago

  • Description updated (diff)

#3 Updated by Nitesh Mor over 3 years ago

  • Assignee set to Nitesh Mor

#4 Updated by Nitesh Mor over 3 years ago

  • Duplicates Bug #69: Read, Write Errors added

#5 Updated by Nitesh Mor over 3 years ago

I cannot seem to reproduce this bug. It would be helpful to see some more debugging information. Could you please do the following:

  • Increase the debugging output on the log server. You can do this by creating a file /usr/local/etc/gdp.conf.sh with the following on a single line: GDPLOGD_ARGS=-D50. A server restart will be needed in order for this to take effect.
  • Get some debugging output for the gdp-reader by adding a -D *=50 argument (in addition to -s -t) when you run into this issue again.
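For reference, the resulting /usr/local/etc/gdp.conf.sh suggested above would contain just this single line (the exact restart command depends on how gdplogd is run as a service on your system):

    GDPLOGD_ARGS=-D50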

#6 Updated by Syed Hasnain over 3 years ago

The following is debug information from after the GDP crash happened. The crash reported in #77 was caused by another user pressing Ctrl+C.

_gdp_pdu_in: reading 279 data bytes (357 available)
00000068                         7b 22 44 45 56 49 43 45
                                 {  "  D  E  V  I  C  E
00000070 22 3a 20 22 32 4b 4d 58 22 2c 20 22 54 79 70 65
         "  :     "  2  K  M  X  "  ,     "  T  y  p  e
00000080 22 3a 20 22 77 22 2c 20 22 41 43 43 5f 58 22 3a
         "  :     "  w  "  ,     "  A  C  C  _  X  "  :
00000090 20 30 2e 30 31 32 39 30 30 30 30 30 34 33 38 30
            0  .  0  1  2  9  0  0  0  0  0  4  3  8  0
000000a0 39 34 31 33 39 2c 20 22 41 43 43 5f 59 22 3a 20
         9  4  1  3  9  ,     "  A  C  C  _  Y  "  :
000000b0 30 2e 32 31 32 30 39 39 39 39 39 31 38 39 33 37
         0  .  2  1  2  0  9  9  9  9  9  1  8  9  3  7
000000c0 36 38 33 2c 20 22 41 43 43 5f 5a 22 3a 20 30 2e
         6  8  3  ,     "  A  C  C  _  Z  "  :     0  .
000000d0 39 33 37 35 2c 20 22 47 59 52 4f 5f 58 22 3a 20
         9  3  7  5  ,     "  G  Y  R  O  _  X  "  :
000000e0 2d 30 2e 32 32 33 34 36 39 39 38 37 35 31 31 36
         -  0  .  2  2  3  4  6  9  9  8  7  5  1  1  6
000000f0 33 34 38 33 2c 20 22 47 59 52 4f 5f 59 22 3a 20
         3  4  8  3  ,     "  G  Y  R  O  _  Y  "  :
00000100 30 2e 30 37 34 34 39 30 30 30 33 32 38 37 37 39
         0  .  0  7  4  4  9  0  0  0  3  2  8  7  7  9
00000110 32 32 30 36 2c 20 22 47 59 52 4f 5f 5a 22 3a 20
         2  2  0  6  ,     "  G  Y  R  O  _  Z  "  :
00000120 2d 30 2e 30 31 37 31 39 30 30 30 30 30 34 32 33
         -  0  .  0  1  7  1  9  0  0  0  0  0  4  2  3
00000130 31 39 32 39 38 2c 20 22 50 50 47 22 3a 20 31 35
         1  9  2  9  8  ,     "  P  P  G  "  :     1  5
00000140 38 39 33 31 2c 20 22 48 45 41 52 54 5f 52 41 54
         8  9  3  1  ,     "  H  E  A  R  T  _  R  A  T
00000150 45 22 3a 20 37 33 2c 20 22 54 49 4d 45 22 3a 20
         E  "  :     7  3  ,     "  T  I  M  E  "  :
00000160 33 34 38 31 31 31 39 34 37 32 2c 20 22 4d 69 6c
         3  4  8  1  1  1  9  4  7  2  ,     "  M  i  l
00000170 6c 53 65 63 6f 6e 64 73 22 3a 20 32 37 33 7d
         l  S  e  c  o  n  d  s  "  :     2  7  3  }
_gdp_pdu_in: reading 78 signature bytes (78 available)
0000017f                                              30
00000180 4c 02 24 02 f2 ad fd 00 b9 15 c8 d0 e1 c8 57 96
00000190 2e ee 7e ae b0 8e 9f ae 41 04 ca 75 bd 42 e8 03
000001a0 0e 15 00 8b 5e 17 78 02 24 03 e6 5c d1 14 30 38
000001b0 67 26 6f 4e 2f 91 d4 34 76 26 f2 00 4a dc 94 ec
000001c0 e1 6b 4a ec b6 5b 02 d1 0b b7 a9 51 01
_gdp_pdu_in(ACK_DATA_CONTENT) => OK
PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=133=ACK_DATA_CONTENT
        dst=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        src=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256705, dbuf=0x7f0e54001ba0, dlen=279
                ts=2016-10-16T21:55:17.821330000Z
        sigmdalg=0x3, siglen=78, sig=0x7f0e540012b0
        total header=128

*** Processing ack/nak 133=ACK_DATA_CONTENT from socket 7
_gdp_gcl_touch(0x2070490)
gdp_gcl_cache_get: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE =>
        0x2070490 refcnt 569
gdp_pdu_proc_resp(0x2071b80 ACK_DATA_CONTENT) gcl 0x2070490
gdp_pdu_proc_resp: searching gcl 0x2070490 for rid 2
_gdp_req_find(gcl=0x2070490, rid=2)
_gdp_req_find(gcl=0x2070490, rid=2) => 0x2071a80, state IDLE
gdp_pdu_proc_resp: req@0x2071a80:
    nextrec=0, numrecs=0, chan=0x206e8b0
    postproc=(nil), sub_cb=(nil), udata=(nil)
    state=IDLE, stat=OK [67896520 = 0x40c04c8]
    act_ts=2016-10-16T21:55:12.289111000Z
    flags=0x1d7<ASYNCIO,DONE,CLT_SUBSCR,PERSIST,ALLOC_RID,ON_GCL_LIST,ON_CHAN_LIST>
    GCL@0x2070490: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        iomode = 1, refcnt = 569, reqs = 0x2071a80, nrecs = 256138
        flags = 0xa<INCACHE,INUSE>
    PDU@0x7f0e54000db0:
        v=3, ttl=15, rsvd1=0, cmd=133=ACK_DATA_CONTENT
        dst=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        src=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=(nil)
        total header=104
    rPDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=133=ACK_DATA_CONTENT
        dst=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        src=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256705, dbuf=0x7f0e54001ba0, dlen=279
                ts=2016-10-16T21:55:17.821330000Z
        sigmdalg=0x3, siglen=78, sig=0x7f0e540012b0
        total header=128
_gdp_req_dispatch >>> ACK_DATA_CONTENT (133) [gcl->refcnt 569], req@0x2071a80:
    nextrec=0, numrecs=0, chan=0x206e8b0
    postproc=(nil), sub_cb=(nil), udata=(nil)
    state=IDLE, stat=OK [67896520 = 0x40c04c8]
    act_ts=2016-10-16T21:55:18.048155000Z
    flags=0x1d7<ASYNCIO,DONE,CLT_SUBSCR,PERSIST,ALLOC_RID,ON_GCL_LIST,ON_CHAN_LIST>
    GCL@0x2070490: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        iomode = 1, refcnt = 569, reqs = 0x2071a80, nrecs = 256138
        flags = 0xa<INCACHE,INUSE>
    PDU@0x7f0e54000db0:
        v=3, ttl=15, rsvd1=0, cmd=133=ACK_DATA_CONTENT
        dst=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        src=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=(nil)
        total header=104
    rPDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=133=ACK_DATA_CONTENT
        dst=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        src=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256705, dbuf=0x7f0e54001ba0, dlen=279
                ts=2016-10-16T21:55:17.821330000Z
        sigmdalg=0x3, siglen=78, sig=0x7f0e540012b0
        total header=128
ack_success: received ACK_DATA_CONTENT for ACK_DATA_CONTENT
_gdp_pdu_free(0x7f0e54000db0)
_gdp_req_dispatch <<< ACK_DATA_CONTENT [gcl->refcnt 569]
    OK [67896525 = 0x40c04cd]
acknak_from_estat: OK [67896525 = 0x40c04cd] ->
        ACK_DATA_CONTENT
gdp_pdu_proc_resp(ACK_DATA_CONTENT for ACK_DATA_CONTENT): OK [67896525 = 0x40c04cd]
_gdp_event_new => 0x7f0e54001b60
_gdp_event_trigger: adding event 0x7f0e54001b60 (1) to active list
gdp_pdu_proc_resp <<< done
 >>> HR:73      PPG:158931
ACCX:0.012900   ACCY:0.212100   ACCZ:0.937500
gdp_event_free(0x7f0e54001b60)
gdp_datum_free(0x7f0e54000e30)
  ... draining 279 bytes

subscr_poker_thread: poking

subscr_poker_thread: poking

subscr_poker_thread: poking
subscr_resub: refreshing req@0x2071a80
gdp_datum_new => 0x7f0e54000e30

>>> _gdp_invoke(req=0x2071a80 rid=2): CMD_SUBSCRIBE (72), gcl@0x2070490
        datum @ 0x7f0e54000e30: recno 256139, len 4, no timestamp
00000000 00 00 00 00
         ⌧  ⌧  ⌧  ⌧
_gdp_invoke: sending 72, retries=2
_gdp_req_send: req@0x2071a80:
    nextrec=0, numrecs=0, chan=0x206e8b0
    postproc=(nil), sub_cb=(nil), udata=(nil)
    state=ACTIVE, stat=OK [67896520 = 0x40c04c8]
    act_ts=2016-10-16T21:55:18.048239000Z
    flags=0x1d7<ASYNCIO,DONE,CLT_SUBSCR,PERSIST,ALLOC_RID,ON_GCL_LIST,ON_CHAN_LIST>
    GCL@0x2070490: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        iomode = 1, refcnt = 569, reqs = 0x2071a80, nrecs = 256138
        flags = 0xa<INCACHE,INUSE>
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112
_gdp_pdu_out, fd = 7, basemd = (nil):
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112
_gdp_pdu_out: sending PDU:
00000000 03 0f 00 48 d7 4e ce c5 c2 f7 db 01 8e c6 9c fb
00000010 43 f3 55 a0 9d a4 40 c2 02 f7 7a 4f f8 8b 39 f3
00000020 89 73 38 31 01 31 65 a2 d1 f0 00 52 73 42 51 98
00000030 62 c8 3e bc 0a d4 1b 69 bf 00 c8 2e 06 b2 a3 1a
00000040 55 c5 14 8c 00 00 00 02 00 00 02 02 00 00 00 04
00000050 00 00 00 00 00 03 e8 8b
00000058                         00 00 00 00
                                 ⌧  ⌧  ⌧  ⌧
_gdp_invoke: waiting on 0x2071a80
_gdp_req_unsend: req@0x2071a80:
    nextrec=0, numrecs=0, chan=0x206e8b0
    postproc=(nil), sub_cb=(nil), udata=(nil)
    state=ACTIVE, stat=OK [67896520 = 0x40c04c8]
    act_ts=2016-10-16T21:55:18.048239000Z
    flags=0x1d4<CLT_SUBSCR,PERSIST,ALLOC_RID,ON_GCL_LIST,ON_CHAN_LIST>
    GCL@0x2070490: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        iomode = 1, refcnt = 569, reqs = 0x2071a80, nrecs = 256138
        flags = 0xa<INCACHE,INUSE>
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112
_gdp_invoke: sending 72, retries=1
_gdp_req_send: req@0x2071a80:
    nextrec=0, numrecs=0, chan=0x206e8b0
    postproc=(nil), sub_cb=(nil), udata=(nil)
    state=ACTIVE, stat=OK [67896520 = 0x40c04c8]
    act_ts=2016-10-16T21:55:18.048239000Z
    flags=0x154<CLT_SUBSCR,PERSIST,ALLOC_RID,ON_CHAN_LIST>
    GCL@0x2070490: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        iomode = 1, refcnt = 569, reqs = (nil), nrecs = 256138
        flags = 0xa<INCACHE,INUSE>
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112
_gdp_gcl_cache_add(0x2070490): already cached
_gdp_pdu_out, fd = 7, basemd = (nil):
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112
_gdp_pdu_out: sending PDU:
00000000 03 0f 00 48 d7 4e ce c5 c2 f7 db 01 8e c6 9c fb
00000010 43 f3 55 a0 9d a4 40 c2 02 f7 7a 4f f8 8b 39 f3
00000020 89 73 38 31 01 31 65 a2 d1 f0 00 52 73 42 51 98
00000030 62 c8 3e bc 0a d4 1b 69 bf 00 c8 2e 06 b2 a3 1a
00000040 55 c5 14 8c 00 00 00 02 00 00 02 02 00 00 00 04
00000050 00 00 00 00 00 03 e8 8b
00000058                         00 00 00 00
                                 ⌧  ⌧  ⌧  ⌧
_gdp_invoke: waiting on 0x2071a80
_gdp_req_unsend: req@0x2071a80:
    nextrec=0, numrecs=0, chan=0x206e8b0
    postproc=(nil), sub_cb=(nil), udata=(nil)
    state=ACTIVE, stat=OK [67896520 = 0x40c04c8]
    act_ts=2016-10-16T21:55:18.048239000Z
    flags=0x1d4<CLT_SUBSCR,PERSIST,ALLOC_RID,ON_GCL_LIST,ON_CHAN_LIST>
    GCL@0x2070490: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        iomode = 1, refcnt = 569, reqs = 0x2071a80, nrecs = 256138
        flags = 0xa<INCACHE,INUSE>
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112
_gdp_invoke: sending 72, retries=0
_gdp_req_send: req@0x2071a80:
    nextrec=0, numrecs=0, chan=0x206e8b0
    postproc=(nil), sub_cb=(nil), udata=(nil)
    state=ACTIVE, stat=OK [67896520 = 0x40c04c8]
    act_ts=2016-10-16T21:55:18.048239000Z
    flags=0x154<CLT_SUBSCR,PERSIST,ALLOC_RID,ON_CHAN_LIST>
    GCL@0x2070490: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        iomode = 1, refcnt = 569, reqs = (nil), nrecs = 256138
        flags = 0xa<INCACHE,INUSE>
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112
_gdp_gcl_cache_add(0x2070490): already cached
_gdp_pdu_out, fd = 7, basemd = (nil):
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112
_gdp_pdu_out: sending PDU:
00000000 03 0f 00 48 d7 4e ce c5 c2 f7 db 01 8e c6 9c fb
00000010 43 f3 55 a0 9d a4 40 c2 02 f7 7a 4f f8 8b 39 f3
00000020 89 73 38 31 01 31 65 a2 d1 f0 00 52 73 42 51 98
00000030 62 c8 3e bc 0a d4 1b 69 bf 00 c8 2e 06 b2 a3 1a
00000040 55 c5 14 8c 00 00 00 02 00 00 02 02 00 00 00 04
00000050 00 00 00 00 00 03 e8 8b
00000058                         00 00 00 00
                                 ⌧  ⌧  ⌧  ⌧
_gdp_invoke: waiting on 0x2071a80
_gdp_event_trigger_pending(0x2071b68): empty
<<< _gdp_invoke(0x2071a80 rid=2) CMD_SUBSCRIBE: ERROR: Connection timed out [EPLIB:errno:110]
  req@0x2071a80:
    nextrec=0, numrecs=0, chan=0x206e8b0
    postproc=(nil), sub_cb=(nil), udata=(nil)
    state=ACTIVE, stat=OK [67896520 = 0x40c04c8]
    act_ts=2016-10-16T21:55:18.048239000Z
    flags=0x1d4<CLT_SUBSCR,PERSIST,ALLOC_RID,ON_GCL_LIST,ON_CHAN_LIST>
    GCL@0x2070490: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        iomode = 1, refcnt = 569, reqs = 0x2071a80, nrecs = 256138
        flags = 0xa<INCACHE,INUSE>
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112

subscr_resub(107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE) ->
        ERROR: Connection timed out [EPLIB:errno:110]
gdp_datum_free(0x7f0e54000e30)
  ... draining 4 bytes
sbuser@swarmnuc015:~/gdp/apps$
sbuser@swarmnuc015:~/gdp/apps$
subscr_poker_thread: poking
subscr_resub: refreshing req@0x2071a80
gdp_datum_new => 0x7f0e54000e30

>>> _gdp_invoke(req=0x2071a80 rid=2): CMD_SUBSCRIBE (72), gcl@0x2070490
        datum @ 0x7f0e54000e30: recno 256139, len 4, no timestamp
00000000 00 00 00 00
         ⌧  ⌧  ⌧  ⌧
_gdp_invoke: sending 72, retries=2
_gdp_req_send: req@0x2071a80:
    nextrec=0, numrecs=0, chan=0x206e8b0
    postproc=(nil), sub_cb=(nil), udata=(nil)
    state=ACTIVE, stat=OK [67896520 = 0x40c04c8]
    act_ts=2016-10-16T21:55:18.048239000Z
    flags=0x1d4<CLT_SUBSCR,PERSIST,ALLOC_RID,ON_GCL_LIST,ON_CHAN_LIST>
    GCL@0x2070490: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        iomode = 1, refcnt = 569, reqs = 0x2071a80, nrecs = 256138
        flags = 0xa<INCACHE,INUSE>
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112
_gdp_pdu_out, fd = 7, basemd = (nil):
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112
_gdp_pdu_out: sending PDU:
00000000 03 0f 00 48 d7 4e ce c5 c2 f7 db 01 8e c6 9c fb
00000010 43 f3 55 a0 9d a4 40 c2 02 f7 7a 4f f8 8b 39 f3
00000020 89 73 38 31 01 31 65 a2 d1 f0 00 52 73 42 51 98
00000030 62 c8 3e bc 0a d4 1b 69 bf 00 c8 2e 06 b2 a3 1a
00000040 55 c5 14 8c 00 00 00 02 00 00 02 02 00 00 00 04
00000050 00 00 00 00 00 03 e8 8b
00000058                         00 00 00 00
                                 ⌧  ⌧  ⌧  ⌧
_gdp_invoke: waiting on 0x2071a80
_gdp_req_unsend: req@0x2071a80:
    nextrec=0, numrecs=0, chan=0x206e8b0
    postproc=(nil), sub_cb=(nil), udata=(nil)
    state=ACTIVE, stat=OK [67896520 = 0x40c04c8]
    act_ts=2016-10-16T21:55:18.048239000Z
    flags=0x1d4<CLT_SUBSCR,PERSIST,ALLOC_RID,ON_GCL_LIST,ON_CHAN_LIST>
    GCL@0x2070490: 107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        iomode = 1, refcnt = 569, reqs = 0x2071a80, nrecs = 256138
        flags = 0xa<INCACHE,INUSE>
    PDU@0x2071b80:
        v=3, ttl=15, rsvd1=0, cmd=72=CMD_SUBSCRIBE
        dst=107OxcL32wGOxpz7Q_NVoJ2kQMIC93pP-Is584lzODE
        src=ATFlotHwAFJzQlGYYsg-vArUG2m_AMguBrKjGlXFFIw
        rid=2, olen=24, chan=0x206e8b0, seqno=0
        flags=0xa<HAS_RECNO,HAS_TS>
        datum=0x7f0e54000e30, recno=256139, dbuf=0x7f0e54001ba0, dlen=4
                ts=(none)
        sigmdalg=0x0, siglen=0, sig=(nil)
        total header=112

#7 Updated by Nitesh Mor over 3 years ago

Can you please post the debug log from the log-server (/var/log/gdp/gdplogd.log)? It appears that the client cannot renew a subscription and it times out. This is expected behavior when a log-server goes away or the router misbehaves. At this point, we can assume that the client side is working properly.

Also, I do not see the original gdp-reader: _gdp_req_freeall: couldn't acquire req lock: ERROR: request freed while in use [Berkeley:Swarm-GDP:31] error, which makes me believe that this is a completely separate issue.

#8 Updated by Syed Hasnain over 3 years ago

I have attached the gdplogd log as well as the gdp-router-click log. Thank you for all the help.

#9 Updated by Nitesh Mor over 3 years ago

Your log-server is still running with a -D *=10 parameter, which is not nearly enough information for me to figure out the cause of the problem. In addition, you are running an older version of the log-server.

Can you please update your code and make sure you have updated the gdplogd configuration in the file /usr/local/etc/gdp.conf.sh as I suggested?

#10 Updated by Eric Allman over 3 years ago

  • Status changed from New to Feedback
  • Assignee changed from Nitesh Mor to Eric Allman

Syed, if you are still working on this project, could you try again with the latest code from the repo? Many things have been fixed in the last few months, several of which might have caused this.

#11 Updated by Rico Maestro over 3 years ago

Hello,

I think I'm also experiencing this problem. gdplogd works fine at the start, but after some time, writes from the client fail with the following error:

gdp.MISC.EP_STAT_SEV_ERROR: 'ERROR: Operation timed out [EPLIB:errno:60]'

Trying to use gdp-reader or gdp-writer then gives me the original error here (and in bug #69). Sometimes gdplogd crashes (as shown in the log I've provided), and sometimes it doesn't crash but becomes unresponsive.

This is the link for the gdplogd.log: https://drive.google.com/open?id=0BwXJYPKFL4B8bnZodlhranlMRkk. I'm also running relatively up-to-date code (commit:b3e16f5d) for both gdplogd and the clients.

#12 Updated by Eric Allman over 3 years ago

OK, that's a new symptom (and may be a totally different bug). In fact, it may be two independent bugs. The "Operation timed out" message looks like a router problem, since it appears to be directly generated by your local app. By any chance did you notice anything unusual there? Assuming your setup is normal, the router debug log will be in /var/log/gdp/gdp-router-click.log.

But there's also something clearly wrong with gdplogd which I haven't seen before. It looks like it might be a use-after-free memory problem, but I have no idea what object is causing it. I've added some debug code to try to print out some information before aborting. Please upgrade again (and by the way, thanks for running recent code!) and we'll cross our fingers.

#13 Updated by Rico Maestro over 3 years ago

I've been using the router there in Berkeley (gdp-01) for a while now since I haven't updated our router here.

Here is the link for the gdplogd.log: https://drive.google.com/open?id=0BwXJYPKFL4B8UjZtMElNRWR1MFk. However, I can't replicate yesterday's ABORT message, even with the code from before I cloned the new version. Regardless, timeout errors on the client side still happen until I restart gdplogd.

#14 Updated by Eric Allman over 3 years ago

I'm not seeing anything that looks unusual in either your gdplogd.log or our gdp-router-click.log. The only thing I can see that might cause such a problem is if there was a networking dropout that severed the connection between your client and our router. The GDP will try to reconnect in this case, but it will still time out eventually.

One thing you could try is to set the retry parameters. In /etc/ep_adm_params/gdp you could set:

swarm.gdp.invoke.retries=10

This will increase the number of retries to 10 (the default is 3).

Another thing (which will be harder to do) is to set the debugging flags in your application (not gdplogd) to 1, which will cause connection failures to be shown. I'm not sure if the Python API makes this easy, but based on a very quick code inspection you could try:

dbg_set("1")

If I'm reading the Python code right, this should be the equivalent of -D1 or -D*=1 on the command line of the C apps. If that doesn't work, we'll need to pull Nitesh in.

#15 Updated by Nitesh Mor over 3 years ago

Eric Allman wrote:

Another thing (which will be harder to do) is to set the debugging flags in your application (not gdplogd) to 1, which will cause connection failures to be shown. I'm not sure if the Python API makes this easy, but based on a very quick code inspection you could try:

dbg_set("1")

If I'm reading the Python code right, this should be the equivalent of -D1 or -D*=1 on the command line of the C apps. If that doesn't work, we'll need to pull Nitesh in.

This is correct; gdp.dbg_set("*=1") or simply gdp.dbg_set("1") will enable printing of debugging information by the C library. Higher numbers (instead of 1) make the output more detailed. The argument to dbg_set() is treated as a string and passed as-is to ep_dbg_set.

The ideal place to do this is right after import gdp but before any invocation of a gdp function.

#16 Updated by Rico Maestro over 3 years ago

I've already set the retry parameters and the problem still persists. If it were a network problem, I should be able to connect to the GDP once I restart my client. However, when I tried restarting the client, the connection still timed out, although on inspecting the gdplogd log I can see that gdplogd received the request from my client (I'm assuming that is the part labeled _gdp_pdu_in).

Just to give some background, I'm running three clients: two of them (Clients A and B) run on a single machine, while another (Client C) runs on a different machine (which is also on a different subnet). Client A is the reader_test_subscribe.py in the repo. Client B is an app that subscribes to multiple GCLs, filters the data, and writes the filtered data to another GCL. Client C is an app that listens to a lot of sensor nodes and logs their data to different GCLs in different threads. The error happens at the same time for all clients, especially those writing to a GCL.

Log of Client B (debug is *=50): https://drive.google.com/open?id=0BwXJYPKFL4B8VXpwOEtkTWxsZlU
Log of Client A (debug is *=50): https://drive.google.com/open?id=0BwXJYPKFL4B8YW53WVZCQk10Mjg
Log of Client A when I restarted it (debug is *=50): https://drive.google.com/open?id=0BwXJYPKFL4B8M2ltYXd5N1Baalk
Log of gdplogd: https://drive.google.com/open?id=0BwXJYPKFL4B8dXNNa2ppaDNYYWs

I'm also a bit curious about the output in gdplogd's log file. Are the gdp_gcl_cache_reclaim messages normal? I only see those lines when the clients can no longer connect to gdplogd. Also, based on my observations of the log (please correct me if I'm wrong), a _gdp_pdu_in message (for example, when opening a GCL on the client side) is usually followed by a _gdp_req_dispatch message. However, this stops happening once the errors begin.

#17 Updated by Eric Allman over 3 years ago

The more you tell me, the more confused I get. I'm not seeing any attempts by the client(s) to reconnect to the router, but I am seeing attempts to send commands that never get acknowledged, and hence are retried. The problem is that there should be no way that would give a "Timed out" error, which is specific to opening a network connection.

I am seeing the problems with the cache reclaim that you refer to. That's an ordinary task that runs periodically to remove any unused entries from the cache (cache entries consume file descriptors). However, there are a lot of old entries that it can't clean up because it thinks they are still in use (the ISLOCKED flag), and this looks wrong. It wouldn't surprise me if I have another locking problem, but other than code inspection I don't have any idea how to find it.

Re your question, yes, _gdp_pdu_in is normally followed by a _gdp_req_dispatch. I didn't see cases where this was not true, but at the moment I'm down one eye (I had eye surgery yesterday), so it's very hard to read. I should get the bandages off on Saturday, so I'll try to look at this again early next week.

#18 Updated by Rico Maestro over 3 years ago

Hi Eric. I hope you're feeling better now.

I've done some experiments on my own to see when gdplogd stops acknowledging commands. When I run a client that appends data to a GCL continuously, no errors occur. However, when I run another client that subscribes to the same GCL, the writes from the first client fail after around 10-15 minutes (at least in my experiment).

Here is some of the gdplogd log for the second experiment: https://drive.google.com/open?id=0BwXJYPKFL4B8eVRGT0J5OFVqa0U

I've noticed that the message below appeared just before the writes from the other client started failing. Also, after this point, _gdp_pdu_in messages are no longer followed by _gdp_req_dispatch messages.

2017-03-02 03:36:05.778827 sub_notify_all_subscribers: subscription timeout: req@0x98745f0

I hope this helps.

#19 Updated by Eric Allman over 3 years ago

  • Status changed from Feedback to In Progress
  • Priority changed from Normal to High

Progress is being made, but it's not fixed yet. Updates soon, I hope.

#20 Updated by Eric Allman over 3 years ago

I have fixed some locking issues which might be related. Please try the latest version from the repo. Let me know how it goes.

I'll be on vacation starting tomorrow so I may not respond immediately.

#21 Updated by Eric Allman over 3 years ago

  • Status changed from In Progress to Resolved

#22 Updated by Rico Maestro over 3 years ago

Hi, Eric. Apologies for the late feedback. Based on initial tests, it seems that gdplogd is not "crashing" anymore. I'll still run a few more tests, though. Thanks!

#23 Updated by Eric Allman over 3 years ago

  • Status changed from Resolved to Closed

Great, glad to hear it. I'm still working on a few other problems, but they are deadlocks. I think I can safely close this one.
