Why not to use Round Robin as a fallback method for backup RADIUS servers

Jan Tomášek

4. August 2006

This article discusses the <AuthBy ROUNDROBIN> feature of Radiator. The feature proxies requests to multiple RADIUS servers and distributes the load equally across all configured servers. Because the distribution of packets is not session aware, it is usable only for simple RADIUS packets, not for EAP streams.

Misuse of this technique seems to be quite common at sites running Radiator RADIUS in eduroam. I hope to show why it is bad and why it should not be used.

Example of eduroam infrastructure

The eduroam infrastructure is usually drawn as a hierarchical tree, with the top showing the interconnection of NRENs and the leaves representing organisations. The image below shows the real links existing between the RADIUS servers of the hypothetical orgA.cz and orgB.de, assuming that each NREN has two servers, the TOP level also has two servers, and orgB.de has two as well.

[Figure]

Figure 1: Possible links between eduroam RADIUS proxy servers.

An EAP session begins at the notebook of user@orgB.de, who is visiting organisation orgA.cz. The user's request for network access starts an exchange of a number of Access-Request and Access-Challenge pairs and terminates with an Access-Accept or Access-Reject packet.

Let us study which servers participate in the communication. The communication begins right after a reset of all systems, and no other communication is running in parallel. These assumptions cannot be fulfilled in the real world, but in a lab it is possible, even though it is hard to synchronise all events.

In the table below, A-Req stands for Access-Request and A-Ch for Access-Challenge.

     visited organisation         CZ NREN     TOP level      DE NREN      home organisation
1.   orgA.cz (send A-Req)     ->  r1.cz   ->  r1         ->  r1.de    ->  r1.orgB.de
2.   orgA.cz                  <-  r1.cz   <-  r1         <-  r1.de    <-  r1.orgB.de (send A-Ch)
3.   orgA.cz (send A-Req)     ->  r2.cz   ->  r1         ->  r2.de    ->  r1.orgB.de
4.   orgA.cz                  <-  r2.cz   <-  r1         <-  r2.de    <-  r1.orgB.de (send A-Ch)
5.   orgA.cz (send A-Req)     ->  r1.cz   ->  r2         ->  r1.de    ->  r2.orgB.de (???)
6a.                                                          r1.de    ->  r1.orgB.de
                                                             ...      <-  r1.orgB.de (send A-Ch)
6b.                                           r2         ->  r2.de    ->  r2.orgB.de (???)
6c.                               r1.cz   ->  r1         ->  r1.de    ->  r2.orgB.de (???)
6d.  orgA.cz (re-send A-Req)  ->  r2.cz   ->  r2         ->  r1.de    ->  r1.orgB.de (???)

Table 1: Flow of one EAP session in eduroam infrastructure
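The rotation behind steps 1, 3 and 5 can be reproduced with a small simulation. This is only a sketch under the same lab assumptions as the table: every server starts from a fresh state, keeps its own independent rotation over its two upstream servers, and no other traffic is running. The names TOPOLOGY, make_selectors and route are illustrative and are not part of Radiator.

```python
from itertools import cycle

# Upstream servers of each proxy, taken from the table above.
TOPOLOGY = {
    "orgA.cz": ["r1.cz", "r2.cz"],
    "r1.cz": ["r1", "r2"],
    "r2.cz": ["r1", "r2"],
    "r1": ["r1.de", "r2.de"],
    "r2": ["r1.de", "r2.de"],
    "r1.de": ["r1.orgB.de", "r2.orgB.de"],
    "r2.de": ["r1.orgB.de", "r2.orgB.de"],
}


def make_selectors(topology):
    """Give every proxy its own round-robin iterator (assumed fresh after reset)."""
    return {server: cycle(upstreams) for server, upstreams in topology.items()}


def route(selectors, start):
    """Follow one Access-Request hop by hop until it reaches a home server."""
    path = [start]
    while path[-1] in selectors:
        path.append(next(selectors[path[-1]]))
    return path


selectors = make_selectors(TOPOLOGY)
# Three consecutive Access-Requests of the same EAP session.
paths = [route(selectors, "orgA.cz") for _ in range(3)]
for number, path in enumerate(paths, start=1):
    print(number, " -> ".join(path))
```

Under these assumptions the first two requests end at r1.orgB.de and the third one ends at r2.orgB.de, which is the situation marked (???) in step 5.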

A possible source of trouble begins in step no. 5. At that moment the Access-Request reaches r2.orgB.de, which knows nothing about the session r1.orgB.de is working on. Both servers query the same user database, but they don't share state. r2.orgB.de should ignore this packet, but in some cases it sends an Access-Reject. See the source Radiator/Radius/EAP_25.pm, function response(). I don't know whether this is right according to RFC 3748 or not; analysis of this problem is out of the scope of this document.

If I assume that r2.orgB.de ignores the packet, then the infrastructure will face numerous retransmits from each RADIUS server participating in the communication. In steps 6a-d the packet will get to the right server (r1.orgB.de), but also again to the wrong one.
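The effect of two home servers that do not share state can be illustrated with a toy model. It is a sketch, not Radiator's actual EAP code: the HomeServer class, the authenticate function and the four round trips per conversation are all assumptions made for the example, and the server here always ignores a packet it has no state for.

```python
from itertools import cycle


class HomeServer:
    """Toy home server: EAP progress lives only in this server's memory."""

    def __init__(self, name):
        self.name = name
        self.progress = {}  # identity -> number of round trips completed here

    def handle(self, identity, step, rounds_needed):
        if step == 0:
            self.progress[identity] = 0  # first packet opens a new conversation
        if self.progress.get(identity) != step:
            return "ignore"  # mid-session packet for a session unknown here
        self.progress[identity] = step + 1
        return "accept" if step + 1 == rounds_needed else "challenge"


def authenticate(pick_server, identity, rounds_needed=4):
    """Send one Access-Request per round trip to whichever server is picked."""
    for step in range(rounds_needed):
        reply = pick_server().handle(identity, step, rounds_needed)
        if reply != "challenge":
            return reply
    return "ignore"  # only reached when no round trips were requested


r1, r2 = HomeServer("r1.orgB.de"), HomeServer("r2.orgB.de")

# Every packet reaches the same server: the conversation completes.
same_server = authenticate(lambda: r1, "user@orgB.de")

# Packets alternate between the two servers: the second one has no state.
rotation = cycle([r1, r2])
alternating = authenticate(lambda: next(rotation), "user@orgB.de")

print(same_server, alternating)
```

In this model the conversation pinned to one server is accepted, while the alternating one stops at the second packet because r2.orgB.de has never seen the session.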

Test in lab

I'm aware that my "analysis" shown above is not very exact. Maybe I missed some serious points, which could be used for picking holes in it and, in the end, distract from the main subject of this article: to convince readers that the Round Robin feature should not be used in an EAP proxy network. It is better to show results from experiments in the lab.

Description of Lab

For the tests I'm using virtual servers powered by VMware Server. I was testing with five servers in the configuration shown in the image below. All tests were performed with rad_eap_test doing PEAP-MSCHAPv2 authentication.

[Figure]

Figure 2: Testing lab

All tests were done by submitting queries from another computer simulating an AP of r1orgC. There were 100 attempts with the identity user@orgA simulating a visitor. rad_eap_test was used as the client.

test   access-accepts            access-reject   timeouts
       [count / avg. duration]   [count]         [count]
1.     99 / 0.85                 0               1
2.     100 / 11.8                0               0
3.     0 / n.a.                  48              52
4.     0 / n.a.                  0               100

Table 2: Results of testing Round Robin

Test no. 1. During this test Round Robin was defined only on r1orgC. All servers were up, doing 100 tests with user@orgA on r1orgC. Everything was working; the logs showed a huge amount of unnecessary retransmits, but for the user it works.

Test no. 2. Round Robin is defined only on r1orgC, as in test no. 1, but r2nren is down and the rest of the servers are up. The purpose of this test is to show how much the authentication process slows down if there is a dead backup server. The whole process was slowed down more than 10 times. There is also a very large number of retransmits. The user is still able to get online.

Test no. 3. Round Robin was defined on r1orgC, r1nren and r2nren; every server was up. The user never gets online. Servers r1orgA and r2orgA usually log one of the following messages:
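The slowdown can be checked directly from the figures in the results table. The short calculation below only restates those numbers; the helper names slowdown and success_rate are made up for the example.

```python
# Figures copied from the results table:
# test -> (access-accepts, avg. duration, access-rejects, timeouts)
RESULTS = {
    1: (99, 0.85, 0, 1),
    2: (100, 11.8, 0, 0),
    3: (0, None, 48, 52),
    4: (0, None, 0, 100),
}


def slowdown(results, baseline, degraded):
    """Ratio of the average durations of two tests."""
    return results[degraded][1] / results[baseline][1]


def success_rate(results, test):
    """Share of attempts that ended with an Access-Accept."""
    accepts, _, rejects, timeouts = results[test]
    return accepts / (accepts + rejects + timeouts)


factor = slowdown(RESULTS, baseline=1, degraded=2)
print(f"test 2 is {factor:.1f}x slower than test 1")
for test in RESULTS:
    print(f"test {test}: {success_rate(RESULTS, test):.0%} accepted")
```

The ratio 11.8 / 0.85 comes out at roughly 13.9, consistent with the statement that the process slowed down more than 10 times.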

   Deleting session for semik@orgA.etest.cesnet.cz, 127.0.0.1,
   Handling with Radius::AuthFILE: CheckFILE
   Handling with EAP: code 2, 104, 6
   Response type 25
   EAP PEAP Nothing to read or write
   AuthBy FILE result: IGNORE, EAP PEAP Nothing to read or write
   Deleting session for semik@orgA.etest.cesnet.cz, 127.0.0.1,
   Handling with Radius::AuthFILE: CheckFILE
   Handling with EAP: code 2, 104, 6
   Response type 25
   TLS not initialised
   AuthBy FILE result: IGNORE, TLS not initialised
   Deleting session for semik@orgA.etest.cesnet.cz, 127.0.0.1,
   Handling with Radius::AuthFILE: CheckFILE
   Handling with EAP: code 2, 3, 13
   Response type 25
   EAP TLS SSL_accept result: -1, 1, 8576
   EAP TLS error: -1, 1, 8576,  14840: 1 - error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number

   EAP result: 1, EAP PEAP TLS error
   AuthBy FILE result: REJECT, EAP PEAP TLS error
   Access rejected for semik@orgA.etest.cesnet.cz: EAP PEAP TLS error

Test no. 4. Round Robin was defined on r1orgC, r1nren and r2nren, as in test no. 3. r2nren was down, the rest were up. Doing 100 tests with user@orgA on r1orgC. The user never gets online.

Conclusion

Round Robin is not intended for use with EAP; this was said by one of the Radiator developers on their support list.

There is a trick which makes it possible to use Round Robin: it is necessary to put an aggregating server in front of the end servers. By end servers I mean r1orgA and r2orgA in my test lab, or r1.orgB.de and r2.orgB.de in the analysis section. But the other parameters seen in the tests show that it is not worth investing in another server.

The result of test no. 2 shows that a Round Robin infrastructure slows down seriously even if only a backup server fails. The result of test no. 4 shows that the infrastructure is unstable if there are multiple servers with a Round Robin configuration.

It's possible to say that any site using a Round Robin configuration is abusing the eduroam infrastructure with unnecessary retransmits, which can lead to failure of user authentication.

I show the proper configuration of an EAP proxy with Radiator in a separate article.