Chris,


I recall something similar occurring during the lifetime of the late Brian Kantor. Is AMPRGW still running a BSD kernel at 10 Gbps?



- Lynwood


If it helps, here is what the late Brian Kantor, SK, wrote about it at the time:

Fri, Jun 2, 2017 at 5:12 PM

I found a possible source of memory corruption - a hash routine
might have returned a negative number, causing the flow stats
gathering to index off the beginning of a large array that
appears in memory adjacent to the routing table.  This might
have resulted in an entry being stomped on.  I don't know.

But the hash routine won't return negative or too large numbers
anymore.  I did fix that.  We'll see if that prevents the problem
from recurring.  I'd hate to have to go through 16 million route
entries in a core dump.
    - Brian
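
For anyone following along, the kind of bug Brian describes can be sketched in C. This is only an illustration, not the AMPRGW router code: the table size, the array, and the function names (FLOW_BUCKETS, flow_stats, bucket_signed, bucket_unsigned, count_flow) are all made up for the example. It assumes a 32-bit int and the usual two's-complement conversion, where a hash value with the top bit set becomes negative when stored in a signed int, and C's % operator then keeps that sign.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLOW_BUCKETS 1024u            /* hypothetical table size */

uint64_t flow_stats[FLOW_BUCKETS];    /* hypothetical per-bucket counters */

/*
 * Risky pattern: the hash value is squeezed into a signed int.
 * On typical platforms a value with the top bit set turns negative,
 * and in C the result of % takes the sign of the dividend, so the
 * "index" can come out below zero and point before the array.
 */
int bucket_signed(uint32_t hash)
{
    int h = (int)hash;                /* implementation-defined when out of range */
    return h % (int)FLOW_BUCKETS;     /* may be negative */
}

/*
 * Safer pattern: stay unsigned all the way through, so the result
 * is always in the range 0 .. FLOW_BUCKETS - 1.
 */
size_t bucket_unsigned(uint32_t hash)
{
    return (size_t)(hash % FLOW_BUCKETS);
}

/* Count one flow, with a cheap sanity check on the index. */
void count_flow(uint32_t hash)
{
    size_t idx = bucket_unsigned(hash);
    assert(idx < FLOW_BUCKETS);       /* catches a bad index before it corrupts memory */
    flow_stats[idx]++;
}
```

With the signed version, a write such as flow_stats[bucket_signed(h)]++ could land on whatever sits in memory just before the array, which matches the symptom of a neighbouring table entry being stomped on. The unsigned version, plus the range check, keeps every write inside the array.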

On Fri, Jun 02, 2017 at 11:12:16AM -0700, Brian Kantor wrote:
> On Fri, Jun 02, 2017 at 02:03:56PM -0400, lleachii--- via 44Net wrote:
> > From my perspective, I stop seeing all inbound Internet traffic from AMPRGW
> > to 44.60.44.0/24, except for the intermittent data to another subnet (now
> > 44.62.1.81). I transmit, and never receive replies.
> > Although, I'm still able to send and receive traffic to/from the other 44
> > GWs.
>
> Thanks, that helps.  I checked, and the on-disk copy of the routing
> table is still correct even when this is happening, so I'm beginning
> to suspect memory corruption in the router software itself.
>
> As you know, that's difficult to find, but I'm looking through the
> code to make sure I haven't done any of the usual errors.  The next
> time it happens, I'll take a core dump of the running process and
> see if that tells me anything.  Be sure to let me know.
>     - Brian