Dead-realm marking feature for Radiator RADIUS servers

Jan Tomášek

8th August 2006

18th August 2008; Note: Today I am publishing an almost identical version of the dead-realm code. The new version is modified in a way which allows simple chaining of several hooks.

In this article I discuss a new method of falling back to a backup RADIUS server in RADIUS proxy based networks such as eduroam. The problem with eduroam-like networks is that peers have no way to determine whether their peer is down or not. The current Radiator implementation tries to guess this by setting a timeout counter and marking a host dead when it does not receive an answer in time. That assumption is bad when the peer has to ask another proxy server, and it can lead to marking all possible peers as dead even though they are not answering only because they are themselves waiting for an answer from someone else.

Some Radiator admins try to use the Round Robin technique with dead host marking disabled, but as I discuss in another article, that approach is not reliable and in fact amounts to abuse of eduroam.

The eduroam network

Figure 1 shows an example of an eduroam network. The example is fictional and reality is a bit more complicated, but for explaining my motivation and the way dead-realm marking works this figure is good enough.

The most important participants in eduroam are the organisations willing to give their visitors access to their WiFi networks. There are five organisations in the figure: orgA.cz, orgB.cz, orgC.de, orgD.de and orgE.bg. Some organisations have two RADIUS servers (only orgA.cz in this figure), some rely on just one.

Roaming on the national level is achieved by NREN level servers, here represented as r.cz, r.de and r.bg. They are connected to the top level servers, r1 and r2 in the figure, which are responsible for international roaming. The top level servers are duplicated to achieve better reliability of the infrastructure; in reality some NRENs have also duplicated their servers with the same purpose in mind.

[Figure]

Figure 1: Example of an eduroam network.

Let's do a thought experiment. Assume that the RADIUS server of organisation orgC.de is down, and that at that moment user@orgC.de is visiting orgB.cz. The RADIUS server of orgB.cz forwards the Access-Request packet to r.cz, which forwards it to r1, and r1 forwards it to r.de. r.de forwards the request to r1.orgC.de, which is down. What happens now?

There will be some retransmits from each host participating in the communication, but that will not help r1.orgC.de get back online. After some time r.cz will mark r1 as dead (it has to do so if it wants to be able to fall back to the backup server). After some more time, depending on the settings of Retries, RetryTimeout, MaxFailedRequests and MaxFailedGraceTime, r2 will also get marked as dead. r1.orgB.cz should never mark r.cz as dead, because there is only one Czech NREN server; likewise r1 and r2 should not mark r.de as dead, as long as there is just one German NREN server.

Now user@orgE.bg arrives at orgB.cz. r1.orgB.cz forwards the request to r.cz, which ignores it because it does not have a working peer. The same happens if user@orgE.bg visits orgA.cz. The Radiator running on r.cz now sits silent and waits until FailureBackoffTime expires. I developed a patch which addresses this problem and resets FailureBackoffTime when there is no known working host, but it was not widely accepted and it still has problems when there is a larger amount of requests for dead organisations.

The dead-realm marking

Now the same situation, but r.cz is running the dead-realm marking configuration.

When user@orgC.de arrives, r.cz first tries to forward to r1, which does not respond. r.cz marks the realm orgC.de as dead on r1. When it receives another Access-Request packet with the realm orgC.de, it forwards it to a host which does not have this realm marked as dead, in this experiment to r2. The result is the same, so r.cz makes a note saying that the realm orgC.de is also dead on r2.

If another packet with the realm orgC.de arrives now, r.cz discovers that both top level servers carry dead-realm marks saying they are dead for the realm orgC.de. It picks the server with the older mark and forwards the packet to it. Thanks to this logic, if requests with the realm orgC.de keep coming, r.cz will keep trying r1, r2, r1, r2...
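
Just to make the selection rule concrete, here is a minimal self-contained Perl sketch of the bookkeeping it implies (the names %dead and pick_host are only illustrative, the real code lives in DeadRealm.pm): every host carries a per-realm timestamp of its last failure, a host without a mark is preferred, otherwise the host with the oldest mark is retried.

# Illustrative model of the host selection described above,
# not the actual DeadRealm.pm code.
use strict;
use warnings;

# $dead{$host}{$realm} = time of the last unanswered request for $realm on $host
my %dead;

sub pick_host {
    my ($realm, @hosts) = @_;
    # Prefer any host which has no dead mark for this realm ...
    for my $host (@hosts) {
        return $host unless exists $dead{$host} && exists $dead{$host}{$realm};
    }
    # ... otherwise retry the host whose mark is oldest, which gives the
    # r1, r2, r1, r2, ... alternation while the realm stays dead everywhere.
    my ($oldest) = sort { $dead{$a}{$realm} <=> $dead{$b}{$realm} } @hosts;
    return $oldest;
}

# Example: orgC.de is already marked dead on r1 (older mark) and r2 (newer mark).
$dead{r1}{'orgC.de'} = time() - 10;
$dead{r2}{'orgC.de'} = time() - 5;
print pick_host('orgC.de', 'r1', 'r2'), "\n";   # prints "r1"
print pick_host('orgE.bg', 'r1', 'r2'), "\n";   # prints "r1" (no dead marks at all)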

Now user@orgE.bg arrives: r1.orgB.cz forwards to r.cz, r.cz forwards to r1, r1 forwards to r.bg and r.bg to r1.orgE.bg; the user gets authenticated and gains access to the network. No delay, no problem, everything works as expected.

What happens when r1 itself goes down? r.cz quickly discovers this and tries to communicate through r2; it will not try r1 again until either r2 goes down or a long timeout (I suggest at least one hour) expires.

Implementation

The implementation in Radiator is quite easy, thanks to the numerous hooks which allow the administrator to change how packets are processed.

I will describe the implementation on configuration examples from the fictional r.cz. First, a timeout is defined after which dead-realm marking tries to recover from using the backup server. Keep this value big so that time is not wasted on a dead server too often; dead-realm marking will try to discover alive servers anyway when necessary.

# Load hooks code
StartupHook sub { require "/etc/radiator/DeadRealm.pm"; }

DefineFormattedGlobalVar	dr_timeout 3600

Now follow the definitions of the two AuthBy RADIUS blocks which define the top level servers.
<AuthBy RADIUS>
        Identifier              r1

        RetryTimeout            3
        Retries                 1
        FailureBackoffTime      0

        UseExtendedIds

        <Host r1>
              AuthPort          1812
              AcctPort          1813
              Secret            testing
        </Host>

        NoReplyHook sub { DeadRealm::noReplyHook(@_); };
        ReplyHook sub { DeadRealm::replyHook(@_); };
</AuthBy>

The NoReplyHook is responsible for marking realms as dead on this host. The ReplyHook removes those dead marks whenever some response about the realm comes back from the host.
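
Conceptually the two hooks only maintain those per-host, per-realm dead marks. The following toy model shows the idea; the function names and the realm extraction are simplified by me and are not the actual DeadRealm.pm code (the real hook argument lists are documented in the Radiator reference manual).

# Toy model of the mark / unmark behaviour of NoReplyHook and ReplyHook,
# not the actual hook code from DeadRealm.pm.
use strict;
use warnings;

my %dead;   # $dead{$host}{$realm} = timestamp of the last failure

# Called when a request forwarded to $host got no answer.
sub mark_dead {
    my ($host, $user_name) = @_;
    my ($realm) = $user_name =~ /\@(.+)$/;
    $dead{$host}{$realm} = time() if defined $realm;
}

# Called when any reply for the realm comes back from $host.
sub mark_alive {
    my ($host, $user_name) = @_;
    my ($realm) = $user_name =~ /\@(.+)$/;
    delete $dead{$host}{$realm} if defined $realm;
}

mark_dead('r1', 'user@orgC.de');    # r1 did not answer for orgC.de
mark_alive('r1', 'user@orgC.de');   # a later reply clears the mark again
print exists $dead{r1}{'orgC.de'} ? "dead\n" : "alive\n";   # prints "alive"
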
<AuthBy RADIUS>
        Identifier              r2

        RetryTimeout            3
        Retries                 1
        FailureBackoffTime      0

        UseExtendedIds

        <Host r2>
              AuthPort          1812
              AcctPort          1813
              Secret            testing
        </Host>

        NoReplyHook sub { DeadRealm::noReplyHook(@_); };
        ReplyHook sub { DeadRealm::replyHook(@_); };
</AuthBy>

Now comes the definition of the Handler which is responsible for forwarding requests to the top level servers. The names of the top level servers are listed in the variable dr_TOPLEVEL_server_list. The TOPLEVEL part of that variable name must match the Identifier inside the Handler.
DefineFormattedGlobalVar  dr_TOPLEVEL_server_list r1,r2
<Handler Realm=/^.+\..+$/>
      Identifier  TOPLEVEL
      <AuthBy INTERNAL>
            RequestHook sub { DeadRealm::chooseServer(@_); }
      </AuthBy>
</Handler>

The logic responsible for choosing the right server to communicate with is implemented in the RequestHook. All hooks are part of the distribution package which resides at http://wiki.eduroam.cz/dead-realm.
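
For illustration only, the naming convention can be resolved roughly like this; the %globals hash is just a stand-in for Radiator's formatted global variables, not a real Radiator API.

# How the RequestHook can derive the candidate host list from the
# Handler's Identifier; %globals stands in for the values set with
# DefineFormattedGlobalVar.
use strict;
use warnings;

my %globals = ( dr_TOPLEVEL_server_list => 'r1,r2', dr_timeout => 3600 );
my $handler_identifier = 'TOPLEVEL';
my @hosts = split /\s*,\s*/, $globals{"dr_${handler_identifier}_server_list"};
# @hosts is now ('r1', 'r2'); these names can be fed to the dead-realm
# selection sketched earlier and the request dispatched to the chosen
# AuthBy RADIUS clause.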

Note: If you need to chain several hooks, you can simply add them to the sub { ... }; definition, for example: sub { DeadRealm::replyHook(@_); NREN::stripAttrs(@_); NREN::patchAccounting(@_); };

Tests

Before developing dead-realm marking I was already using five virtual servers running in VMware Server as my testing lab. I posted my results to the tf-mobility list, used the lab for some other tests showing how bad Round Robin is in an EAP infrastructure, and I also used this reduced "eduroam" lab for testing the dead-realm feature. Testing was done with the identity semik@org from another machine simulating an AP of orgC. There were 200 tests in a row, all performed by rad_eap_test doing PEAP-MSCHAPv2 authentication. Before every test run I restarted all Radiators to get them into a defined state.

[Figure]

Figure 2: Testing lab

During the evaluation of dead-realm marking I ran numerous tests to make sure everything worked as I wished. The most interesting ones are in the table below; a description of each test follows the table.

test   total time   access-accepts             access-rejects   timeouts
       [sec]        [count / avg. duration]    [count]          [count]
1.     109          200 / 0.55                 0                0
2.     138          199 / 0.59                 0                1
3.     110          200 / 0.55                 0                0
4.     131          199 / 0.56                 0                1
5.     204          195 / 0.53                 0                5
6.     289          191 / 0.63                 1                8

Table 1: Results of testing the dead-realm feature

Test no. 1: every site was up and working; there was no communication on r2orgA and r2nren.

Test no. 2: r1orgA and r1nren were down, to test how the system falls back to backups. The first test timed out after 20 seconds, the second passed with a 10 second delay, and the rest were pretty fast.

Test no. 3: every site was up and working. There were "visitors" from orgD and orgE trying to authenticate on r1orgA, r2orgA and r1orgC. The realms orgD and orgE were in fact undefined and ignored by the r1/r2nren servers, which simulated dead sites in the infrastructure. On r2orgA and r2nren there was no communication except that related to the realms orgD and orgE.

Test no. 4: in this test the servers r1orgA and r1nren were down, the same as in test no. 2, but there were also "visitors" from orgD and orgE as in test no. 3. The results are exactly the same as in test no. 2: the infrastructure quickly fell back to the backups and after that worked as quickly as usual.

Test no. 5: this test was designed to examine how well the system switches to the backup during dynamic crashes of r1orgA and r2orgA. The test started with r1orgA down and r2orgA up; after 45 seconds r1orgA was started and r2orgA stopped, and so on, so there was always one working server while the other was down. Those restarts resulted in 5 timeouts, but the rest of the communication was handled as quickly as usual.

Test no. 6: a variation on test no. 5, but r1nren and r2nren were also being restarted, with a period of 80 seconds, while the period of restarting r1orgA and r2orgA was 45 seconds. The test started with r1orgA and r1nren down. The result is 8 timeouts and 1 access reject; in the rest of the communication there were a few delayed but successful tests due to the higher frequency of route switching.

That one access reject in test no. 6 happened on r1orgA. The log file says:

    DEBUG: Handling request with Handler 'Realm=/^orgA\.etest\.cesnet\.cz$|^r1orgA\.etest\.cesnet\.cz$/i'
    DEBUG:  Deleting session for semik@orgA.etest.cesnet.cz, 127.0.0.1, 
    DEBUG: Handling with Radius::AuthFILE: CheckFILE
    DEBUG: Handling with EAP: code 2, 3, 13
    DEBUG: Response type 25
    DEBUG: EAP TLS SSL_accept result: -1, 1, 8576
    ERR: EAP TLS error: -1, 1, 8576,  14840: 1 - error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number

    DEBUG: EAP result: 1, EAP PEAP TLS error
    DEBUG: AuthBy FILE result: REJECT, EAP PEAP TLS error
    INFO: Access rejected for semik@orgA.etest.cesnet.cz: EAP PEAP TLS error
It happened because of the switch from r2orgA to r1orgA: the first server received some pieces of a TLS session unknown to it and rejected them. In general there is no way to make a server wait with crashing until all EAP sessions are terminated. ;)

Configurations, the output from the testing program and Trace 4 logs of all Radiator RADIUS servers participating in the tests are attached in the file test-records.tar.bz2.

Ideas for further development

There is also the possibility of load balancing the links between servers. That can be done by forwarding requests of some users to one server and those of other users to other servers. Radiator in the role of a proxy server is unable to track EAP sessions, so the user identity is probably the finest-grained key usable for distributing the load, as sketched below.
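
As a sketch of that idea (mine, not part of the dead-realm hooks): a stable hash of the user identity could pin every user to one of the available servers, so all packets of one EAP conversation keep going to the same host, which plain Round Robin cannot guarantee.

# Hypothetical sketch of identity-based load distribution.
use strict;
use warnings;
use Digest::MD5 qw(md5);

sub host_for_user {
    my ($user_name, @hosts) = @_;
    # Hash the outer identity into an index, stable across requests.
    my $index = unpack('N', md5($user_name)) % scalar @hosts;
    return $hosts[$index];
}

print host_for_user('user@orgC.de', 'r1', 'r2'), "\n";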

Conclusion

To my current knowledge, the described dead-realm marking is the only right way to configure a Radiator server acting as a proxy server in a network such as eduroam. I am running it on the Czech NREN proxy server. The tests were done on Radiator 3.15 at patch level 1.690.

Because the dead-realm feature is implemented as standard Radiator hooks, there is a big chance it will work with future versions of Radiator, too. As long as CESNET runs Radiator on that server, I will maintain the latest version of this document and of the hooks implementing the dead-realm feature at http://wiki.eduroam.cz/dead-realm. When needed, I can be reached via the contacts on my personal page.