Brian Kantor wrote:
On Sun, Jan 04, 2015 at 09:23:08PM +0100, Rob Janssen wrote:
The ampr-ripd has a route lifetime of only 600 seconds. Routes are announced every 300 seconds, so when two consecutive announcements are incomplete we lose the route. It happened again this morning at 08:05 local (07:05 UTC). My route was lost again, and recovered at 08:10.
I am at somewhat of a loss to explain why this might have happened; the rip sender logged that it was fetching the proper number of subnet routes (428) from the routing database, and generating the proper number of rip packets. No transmission errors were logged at the time you mention.
It is possible that you did not receive all the packets. They are sent as datagrams so there is nothing to retry or notice if one of them goes missing in transit.
Perhaps it would have been smarter to use a connected mode (TCP) to transmit the routing information. We could convert to doing that, with some significant effort.
I agree that making the timeout much longer than 10 minutes is wise. It might also be wise to control for a large delta in routes received. Logging the number of packets and subnet routes received to syslog might provide some additional data if/when this happens again.
- Brian
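A minimal receiver-side sketch of the three suggestions above (a timeout much longer than 10 minutes, a check for a large drop in routes received, and logging of packet and route counts). This is not ampr-ripd code: the class name RouteTable, the 30-minute lifetime and the 25% drop threshold are assumptions chosen for illustration, and the parsed packets are represented simply as lists of subnet strings.

```python
import logging
import time

# A logging.handlers.SysLogHandler could be attached to this logger
# to send the messages to syslog; that is left out here.
log = logging.getLogger("rip-receiver")

ANNOUNCE_INTERVAL = 300                  # seconds, as stated in the thread
ROUTE_LIFETIME = 6 * ANNOUNCE_INTERVAL   # assumed value, well above 600 s
MAX_DROP_FRACTION = 0.25                 # assumed threshold for a "large delta"


class RouteTable:
    """Hypothetical route table keyed by subnet string."""

    def __init__(self, lifetime=ROUTE_LIFETIME, now=time.monotonic):
        self.lifetime = lifetime
        self.now = now
        self.last_seen = {}     # subnet -> time of last announcement
        self.prev_count = None  # route count of the last normal cycle

    def process_cycle(self, packets):
        """Handle one announcement cycle.

        packets is a list with one inner list of subnets per RIP packet.
        Returns True if the cycle looks incomplete.
        """
        t = self.now()
        routes = {subnet for pkt in packets for subnet in pkt}
        for subnet in routes:
            self.last_seen[subnet] = t

        log.info("received %d packets, %d subnet routes",
                 len(packets), len(routes))

        suspicious = (self.prev_count is not None and
                      len(routes) < self.prev_count * (1 - MAX_DROP_FRACTION))
        if suspicious:
            log.warning("route count dropped from %d to %d",
                        self.prev_count, len(routes))
        else:
            # Only a normal-looking cycle updates the baseline.
            self.prev_count = len(routes)
        return suspicious

    def expire(self):
        """Remove and return routes not announced within the lifetime."""
        t = self.now()
        dead = [s for s, seen in self.last_seen.items()
                if t - seen >= self.lifetime]
        for s in dead:
            del self.last_seen[s]
        return dead
```

With a lifetime of six announcement intervals, a route survives several incomplete cycles in a row, and the logged counts show whether whole packets went missing when it happens.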
What I have observed in the past is that there is a small subset of the routes that appear and disappear in my list quite regularly. I discovered this when I made an auto-adapting filter that allows tunnel traffic only from registered gateways. New items are always inserted at the top, and when I list that filter there are a few gateways that regularly appear at the top of the list. (It is initially loaded in sorted numeric sequence, so this is quite apparent.) For example, 44.140.0.1 is always amongst these. I mentioned it on this mailing list but there was no follow-up on it.
So probably there is something going on that is a bit more systematic than just random packet loss. It could be that your RIP server sends out all packets in one burst without any delay in between, there is some queue length limit somewhere (either locally in your system or along the path to here), and the later packets in the burst have a high chance of getting dropped. That could probably be fixed by putting a small usleep between the packet transmissions, so that the queues can drain. Such a change would be much easier than moving to TCP. Of course TCP would be the more robust solution, but a protocol like RIP should survive some random packet loss. Systematic packet loss is a different story.
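The pacing idea can be sketched as follows. This is an illustration only, not the code of the actual RIP sender: the function name send_paced and the 10 ms default gap are assumptions, and the right gap would have to be found by experiment. If each packet carried the standard RIPv2 maximum of 25 route entries (also an assumption about the sender), 428 routes would come to 18 packets, so even a gap of 10 ms stretches one announcement over less than 0.2 seconds.

```python
import time


def send_paced(sock, packets, dest, gap=0.01, sleep=time.sleep):
    """Send each datagram in packets to dest, pausing between them.

    sock is any object with a sendto(data, address) method, such as a
    UDP socket. gap is the pause in seconds between two transmissions;
    no pause is made before the first packet or after the last one.
    Returns the number of datagrams sent.
    """
    sent = 0
    for i, payload in enumerate(packets):
        if i > 0:
            # Give queues along the path time to drain before the next packet.
            sleep(gap)
        sock.sendto(payload, dest)
        sent += 1
    return sent
```

A real sender would pass a socket created with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) and the list of already encoded RIP packets.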
Maybe something else you can do is drop the extra transmission of RIP packets from the public IP address. I think nobody is really using those (if only because of the funny destination port number), and they only add to this problem by putting even more data in the queue.
Rob