Subject: Re: [44net] Tunnel mesh is (mostly) down
From: "John Wiseman" john.wiseman@cantab.net
Date: 01/04/2015 10:09 AM
To: "'AMPRNet working group'" 44net@hamradio.ucsd.edu
Wouldn't the simplest solution be to modify the rip44 process so it doesn't delete routes that haven't been announced for a while, or at least for a much longer period?
IPIP tunnels and RIP have the major advantage that they allow those who have a dynamic IP address to participate in net44. I feel it is important that we remember that we are radio hams first, and should use solutions that can be used by the majority of hams, not just those network professionals who want to play at being an ISP.
73, John G8BPQ
That sounds like a good idea, John. The ampr-ripd has a route lifetime of only 600 seconds. Routes are announced every 300 seconds, so when two subsequent announces are incomplete we lose the route. It happened again this morning at 08:05 local (07:05 UTC). My route again was lost, and recovered at 08:10.
I think I'll recompile ampr-ripd with a bit larger timeout. Marius, what do you think? Does this have any negative consequences? (e.g. to change EXPTIME from 600 to 3600 or even 7200)
However, aside from that, the source of the problem should be located, or it will bite us again in the future. There could be big packet loss on the link from UCSD, or maybe some buffer overflow internal to the RIP server (I think it sends big bursts of data rather than slowly throttled streams), or it could be the RIP server somehow sends shortened information because it temporarily does not have the full route list, e.g. because the portal sends an incomplete list.
This has to be investigated or else it will come back. I predict. (I mention this because I recently had a long discussion with someone who thinks that we can live with malfunctioning solutions because we are amateurs. I think we should strive for correctly working systems anyway. And that a system that works well in practice is always better than a system that works well in theory but does not work in real life)
Rob
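The timing Rob describes can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from ampr-ripd; the function name is hypothetical. It assumes a route's timer is reset whenever an announcement containing it arrives, and that the route is dropped once its age reaches the lifetime, which matches the observed loss after two incomplete announcements.

```python
import math

def misses_until_expiry(lifetime_s, interval_s):
    """Consecutive missed announcements after which a route ages out.

    Assumption: the route was last refreshed at t=0, announcements are due
    every interval_s seconds, and the route is removed once its age is
    greater than or equal to lifetime_s.
    """
    return math.ceil(lifetime_s / interval_s)

# Announcements every 300 s, with the lifetimes mentioned in the thread.
for lifetime in (600, 3600, 7200):
    print(lifetime, misses_until_expiry(lifetime, 300))
# prints:
# 600 2
# 3600 12
# 7200 24
```

Under these assumptions, raising EXPTIME from 600 to 3600 means a route survives eleven consecutive incomplete announcements instead of one.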
On Sun, Jan 04, 2015 at 09:23:08PM +0100, Rob Janssen wrote:
The ampr-ripd has a route lifetime of only 600 seconds. Routes are announced every 300 seconds, so when two subsequent announces are incomplete we lose the route. It happened again this morning at 08:05 local (07:05 UTC). My route again was lost, and recovered at 08:10.
I am at somewhat of a loss to explain why this might have happened; the rip sender logged that it was fetching the proper number of subnet routes (428) from the routing database, and generating the proper number of rip packets. No transmission errors were logged at the time you mention.
It is possible that you did not receive all the packets. They are sent as datagrams so there is nothing to retry or notice if one of them goes missing in transit.
Perhaps it would have been smarter to use a connected mode (TCP) to transmit the routing information. We could convert to doing that, with some significant effort.
I agree that making the timeout much longer than 10 minutes is wise. It might also be wise to control for a large delta in routes received. Logging the number of packets and subnet routes received to syslog might provide some additional data if/when this happens again.
- Brian
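One possible shape for the logging and delta check suggested above is sketched here. It is not code from ampr-ripd or the amprgw sender: the function name, the logger name, and the 25% threshold are all assumptions made for illustration, and Python's logging module stands in for syslog.

```python
import logging

log = logging.getLogger("rip-receiver")  # hypothetical logger name

def update_looks_complete(previous_routes, packets, routes, max_drop_fraction=0.25):
    """Log what one RIP cycle delivered and flag a suspiciously large drop.

    previous_routes: route count from the last cycle that was accepted.
    packets, routes: datagrams and subnet routes received in this cycle.
    max_drop_fraction: assumed threshold; a real value would need tuning.
    """
    log.info("RIP cycle: %d packets, %d routes (previous: %d)",
             packets, routes, previous_routes)
    if previous_routes == 0:
        return True  # first cycle, nothing to compare against
    drop = (previous_routes - routes) / previous_routes
    if drop > max_drop_fraction:
        log.warning("route count fell by %.0f%%; update may be incomplete",
                    drop * 100)
        return False
    return True
```

A receiver could use the return value to decide whether to age out routes missing from this cycle; the packet counts in any test data are invented, only the figure of 428 routes comes from the thread.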
Hello,
I don't think that increasing the route timeout would have any bad side effects (I think 7200 would be a good value).
But maybe there is another mechanism that could be added to the ampr gateway (and which is already implemented in ampr-ripd): the daemon is capable of force-expiring routes if they are received with metric 15. So sending deleted subnets with metric 15 for a given time AND increasing the normal expire time to higher values (e.g. 10800 seconds, i.e. 3 hours, or even more) would make the system more stable.
Marius, YO2LOJ
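The combination proposed above (a long normal expiry plus explicit withdrawal with metric 15) might look roughly like the following on the receiving side. This is a minimal sketch, not the ampr-ripd source: the function names, the dictionary layout, and the example prefix and gateway addresses are hypothetical.

```python
FORCE_EXPIRE_METRIC = 15   # metric that signals a withdrawn subnet, per the thread
EXPTIME = 10800            # 3 hours, one of the values suggested in the thread

def apply_entry(table, prefix, gateway, metric, now):
    """Apply one received RIP entry to table (prefix -> (gateway, deadline))."""
    if metric == FORCE_EXPIRE_METRIC:
        table.pop(prefix, None)        # withdrawn: remove at once, ignore if absent
        return
    table[prefix] = (gateway, now + EXPTIME)  # refresh or insert

def expire(table, now):
    """Drop routes whose deadline has passed (the slow fallback path)."""
    for prefix in [p for p, (_, deadline) in table.items() if deadline <= now]:
        del table[prefix]

table = {}
apply_entry(table, "44.128.1.0/24", "192.0.2.10", 1, now=0)    # normal announcement
apply_entry(table, "44.128.1.0/24", "192.0.2.10", 15, now=300) # poisoned withdrawal
print(table)  # {}
```

With this split, lost datagrams only delay a refresh, while deliberate deletions still take effect within one announcement cycle.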
Brian Kantor wrote:
It is possible that you did not receive all the packets. They are sent as datagrams so there is nothing to retry or notice if one of them goes missing in transit.
- Brian
What I have observed in the past is that there is a small subset of the routes that appear and disappear in my list quite regularly. I discovered this when I made an auto-adapting filter that allows tunnel traffic only from registered gateways, where new items are always inserted at the top, and when I list that filter there are a few gateways that regularly appear at the top of the list. (it is initially loaded in sorted numeric sequence so this is quite apparent) For example, 44.140.0.1 is always amongst these. I mentioned it on this mailing list but there was no followup on it.
So probably there is something going on that is a bit more systematic than just random packet loss. It could be that your RIP server sends out all packets in one burst without any delay in between, there is some queue length limit somewhere (either locally in your system or along the path to here), and the later packets in the burst have a high chance of getting dropped. That could probably be fixed by putting a small usleep between the packet transmissions, so that the queues can drain. Such a change would be much easier than switching to TCP. Of course TCP would be the more robust solution, but a protocol like RIP should survive some random packet loss. Systematic packet loss is a different story.
Maybe something else you can do is drop the extra transmission of RIP packets from the public IP address. I think nobody is really using those (if not because of the funny destination port number), and they only add to this problem by putting even more data in the queue.
Rob
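The pacing idea above (a short sleep between datagrams so queues can drain) can be sketched as follows. This is not the amprgw sender's code; the function name and the 10 ms default gap are assumptions, and the right gap would have to be found by experiment.

```python
import time

def send_paced(sock, packets, dest, gap_s=0.01):
    """Send each datagram to dest, pausing gap_s seconds between sends.

    sock is any object with a sendto(data, address) method, for example a
    socket.socket created with socket.SOCK_DGRAM. Returns the number sent.
    """
    sent = 0
    for i, payload in enumerate(packets):
        if i:                      # no pause before the first datagram
            time.sleep(gap_s)
        sock.sendto(payload, dest)
        sent += 1
    return sent
```

With a 10 ms gap, a burst of a few dozen RIP datagrams would be spread over well under a second, which is negligible against a 300-second announcement interval.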
Let me get this right... You are talking about the encapsulated RIP transmissions originating from 169.228.66.251 to each public gateway IP?
-----Original Message-----
From: 44net-bounces+marius=yo2loj.ro@hamradio.ucsd.edu [mailto:44net-bounces+marius=yo2loj.ro@hamradio.ucsd.edu] On Behalf Of Rob Janssen
Sent: Monday, January 05, 2015 11:11
To: Brian Kantor
Cc: 44net@hamradio.ucsd.edu
Subject: Re: [44net] Tunnel mesh is (mostly) down
Maybe something else you can do is drop the extra transmission of RIP packets from the public IP address. I think nobody is really using those (if not because of the funny destination port number), and they only add to this problem by putting even more data in the queue.
Rob
_________________________________________
44Net mailing list
44Net@hamradio.ucsd.edu
http://hamradio.ucsd.edu/mailman/listinfo/44net
On Mon, Jan 05, 2015 at 02:27:18PM +0200, Marius Petrescu wrote:
Let me get this right... You are talking about the encapsulated RIP transmissions originating from 169.228.66.251 to each public gateway IP?
Up until a few minutes ago, the amprgw system was sending the RIP data twice - once UNencapsulated to the public gateway IP, once encapsulated.
Since to the best of my knowledge, no one was using the UNencapsulated RIP for anything, I've discontinued sending it.
If I'm wrong and someone is using it for something, I'll turn it back on.
- Brian
Brian Kantor wrote:
I agree that making the timeout much longer than 10 minutes is wise.
I am now running the latest ampr-ripd with 1 hour timeout, to see if there are still problems. (I can see some of the symptoms in the firewall log and the updated filter list)
Rob
Hmmm. Given a 1 hour timeout, then any error would need to be detected and corrected within that hour, or else routes will still be lost. Correct?
It would seem that a timeout of something more like 24 hours would be more practical. I wouldn't think that holding on to old, stale routes (the other end is gone) for 24 hours would be any significant problem. The problem case is if a longer (more specific) prefix is replaced with a shorter (less specific) prefix to a different location. The longer prefix would still win until it eventually times out. But maybe some ability to poison withdrawn routes (as Marius suggested) could be added to eliminate that problem, too?
Michael N6MEF
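The problem case described above follows from longest-prefix matching, which can be demonstrated with Python's standard ipaddress module. The prefixes and gateway addresses below are made up for illustration, and the lookup function is a simple sketch rather than how any particular kernel or daemon does it.

```python
import ipaddress

def lookup(routes, dest):
    """Return the gateway of the most specific prefix containing dest, or None."""
    addr = ipaddress.ip_address(dest)
    best = None
    for prefix, gateway in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway)
    return best[1] if best else None

routes = {
    "44.128.1.0/24": "192.0.2.10",     # hypothetical stale route, withdrawn upstream
    "44.128.0.0/16": "198.51.100.20",  # hypothetical new, less specific route
}

print(lookup(routes, "44.128.1.5"))   # 192.0.2.10 (the stale /24 still wins)
del routes["44.128.1.0/24"]           # timeout or metric-15 poisoning removes it
print(lookup(routes, "44.128.1.5"))   # 198.51.100.20
```

Until the stale /24 is removed, traffic keeps going to the old gateway, which is why a long timeout pairs naturally with explicit withdrawal.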
On Mon, 5 Jan 2015, Rob Janssen wrote:
Brian Kantor wrote:
I agree that making the timeout much longer than 10 minutes is wise.
I am now running the latest ampr-ripd with 1 hour timeout, to see if there are still problems.
My original perl implementation considers expiring old routes only after receiving new routes. In other words, if it stops getting new routes, it'll keep on running with the old ones forever (well, up to the next reboot).
Just a suggestion. :)
- Hessu
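The behaviour described above, where expiry is only considered when an update arrives, can be sketched as below. This is not the original Perl code nor ampr-ripd; the class and method names are hypothetical, and the 600-second lifetime is the value quoted earlier in the thread.

```python
EXPTIME = 600  # seconds, the lifetime quoted earlier in the thread

class RouteTable:
    """Routes age out only as a side effect of processing a new update."""

    def __init__(self):
        self.last_seen = {}  # prefix -> time of the last announcement seen

    def on_update(self, prefixes, now):
        """Refresh announced prefixes, then expire stale ones; return those removed."""
        for prefix in prefixes:
            self.last_seen[prefix] = now
        stale = [p for p, seen in self.last_seen.items() if now - seen >= EXPTIME]
        for prefix in stale:
            del self.last_seen[prefix]
        return stale

table = RouteTable()
table.on_update(["44.128.1.0/24", "44.128.2.0/24"], now=0)
# If no further updates arrive, on_update never runs and both routes stay.
# Incomplete updates are different: they still trigger expiry of what they omit.
table.on_update(["44.128.1.0/24"], now=300)
print(table.on_update(["44.128.1.0/24"], now=600))  # ['44.128.2.0/24']
```

The last call shows that this approach protects against a total loss of announcements but not against updates that arrive with some routes missing.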
My original perl implementation considers expiring old routes only after receiving new routes. In other words, if it stops getting new routes, it'll keep on running with the old ones forever (well, up to the next reboot).
Just a suggestion. :)
The initial problem the OMs are trying to solve is the loss of some, but not all, of the routes (less than 100%).
Your solution only solves a problem where no new RIP announcements are received.
73 de Marc, LX1DUC
The same happens in ampr-ripd. If no new routes are received, the old ones will persist and will be used. @Hessu: I implemented that idea from your perl script, too, since the first version was actually almost a 1 to 1 clone. The initial goal was to get rid of perl for low resource systems...
Marius.
-----Original Message-----
From: 44net-bounces+marius=yo2loj.ro@hamradio.ucsd.edu [mailto:44net-bounces+marius=yo2loj.ro@hamradio.ucsd.edu] On Behalf Of Heikki Hannikainen
Sent: Tuesday, January 06, 2015 23:53
To: AMPRNet working group
Subject: Re: [44net] Tunnel mesh is (mostly) down
On Mon, 5 Jan 2015, Rob Janssen wrote:
Brian Kantor wrote:
I agree that making the timeout much longer than 10 minutes is wise.
I am now running the latest ampr-ripd with 1 hour timeout, to see if there are still problems.
My original perl implementation considers expiring old routes only after receiving new routes. In other words, if it stops getting new routes, it'll keep on running with the old ones forever (well, up to the next reboot).
Just a suggestion. :)
- Hessu