keepalived GARPs vs Windows Server 2008

Windows Server 2008 appears to ignore gratuitous ARPs (GARPs) entirely. There are some sketchy details here:

(It seems like a daft solution to a problem that need not exist if the bizarre “checking for IP already in use” behaviour were simply dropped.)

On top of that, if an ARP entry is in continual use, Windows appears never to drop it from its ARP cache. The intention here is sensible (less delay from having to send ARP requests and wait for responses), but it doesn’t seem to do this by making the odd ARP request and checking for a reply. I’m not sure how it does do it, but I’ve observed that on a router role transition the ARP cache entry just sits there, even though the router no longer answers ARP requests for the old IP. The behaviour appears to be indefinite (I observed an entry stuck for 10 hours).

My first workaround for this was to dump the ARP cache every so often with a scheduled task on the Windows machine, which isn’t very elegant but does at least work. It means either quite a long delay on switchover or a performance hit if you dump the ARP cache frequently, and I wanted to avoid both.

So my second workaround involves two bits of software that let the router force the Windows machines to clear their ARP caches. The first is a Windows service (called ArpFlush) that basically just sits and listens for UDP packets sent to a specific port. When it sees one, it clears the ARP caches for all the interfaces on the machine. The second is a simple application that sends UDP packets to the ArpFlush service and causes the ARP caches to be cleared. To glue this to a role transition in keepalived, I simply attach a notify_master script that sends an ARP flush request to a network broadcast address.
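A rough sketch of the listening side, to make the idea concrete. This is not the ArpFlush source (that is a Windows service built on OW32); it is a minimal Python stand-in. The port number, the function names and the use of netsh to do the actual flush are all assumptions for illustration.

```python
import socket
import subprocess


def flush_arp_cache():
    """Clear the ARP cache on a Windows machine (needs administrator rights)."""
    # Assumption: shelling out to netsh; the real service may use a Win32 API.
    subprocess.run(["netsh", "interface", "ip", "delete", "arpcache"], check=False)


def serve_once(sock, on_flush=flush_arp_cache):
    """Block until one datagram arrives, trigger a flush, return the sender."""
    _data, sender = sock.recvfrom(1024)
    on_flush()
    return sender


def serve_forever(port=9999):  # hypothetical port number
    """Listen on every interface and flush whenever a datagram turns up."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        serve_once(sock)
```

Calling serve_forever() would run the loop. Note that this bare version flushes on any datagram at all, with no check on who sent it or what it contains.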

As a vague attempt at security, the ARP flush request simply contains a password, and the ArpFlush service only accepts packets that contain the correct password. (These random UDP packets definitely ought to be dropped by the firewall if they come from an external source, so I’m not too worried about them.) In addition, since UDP is unreliable, the client sends (by default) 5 UDP packets at 1-second intervals. To avoid clearing the cache multiple times, the ArpFlush service stops listening for (by default) 10 seconds after it sees a flush request.
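The password check, the repeated sends and the quiet period can be sketched like this. Again this is only an illustration in Python, not the real client or service. In particular I am assuming the packet payload is just the password bytes (the real wire format may differ), and the sketch ignores packets during the quiet period rather than actually closing the socket. FlushGate and send_flush_requests are made-up names.

```python
import hmac
import socket
import time


class FlushGate:
    """Decide whether an incoming datagram should trigger a flush."""

    def __init__(self, password, quiet_seconds=10.0, clock=time.monotonic):
        self.password = password.encode()
        self.quiet_seconds = quiet_seconds
        self.clock = clock
        self.quiet_until = None  # no flush seen yet

    def accept(self, payload):
        now = self.clock()
        if self.quiet_until is not None and now < self.quiet_until:
            return False  # still in the quiet period after the last flush
        if not hmac.compare_digest(payload, self.password):
            return False  # wrong password, ignore the packet
        self.quiet_until = now + self.quiet_seconds
        return True


def send_flush_requests(address, port, password, count=5, interval=1.0,
                        sleep=time.sleep):
    """Send the request several times, since any one UDP packet may be lost."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Needed before sending to a broadcast address.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for i in range(count):
            sock.sendto(password.encode(), (address, port))
            if i < count - 1:
                sleep(interval)
```

With the defaults, the five packets span about 4 seconds, which fits inside the 10-second quiet period, so one burst from the router results in one flush per machine.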

If you want to set this up:

  1. Download either the 64-bit or 32-bit version of ArpFlush depending on which version of Windows you are running.
  2. Install the package on the Windows machines. Configure it by editing HKLM\Software\NPSL\ArpFlush (it is probably worth changing the default password of TopSecret!).
  3. Start the service. (The installer doesn’t do this because I couldn’t figure out how to stop it breaking if you didn’t have the CRT installed already.)
  4. Download the source for the client. Compile this with something like:
    gcc -o /usr/local/bin/arpflush arpflush.c
  5. Test the service by running something like arpflush -a -P TopSecret! and then check the application event log on the Windows machine to see if anything happened.
  6. Drop your arpflush command in a script and hook it up to keepalived with notify_master
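
For the last step, the wrapper script could look something like the sketch below. The arpflush path comes from the gcc command in step 4 and the flags are copied as-is from the test command in step 5; the function names and the exit code for a missing binary are my own choices, and a two-line shell script would do the job just as well.

```python
#!/usr/bin/env python3
"""notify_master hook: ask the Windows machines to flush their ARP caches."""
import subprocess
import sys

ARPFLUSH = "/usr/local/bin/arpflush"  # where the gcc command in step 4 put it
PASSWORD = "TopSecret!"               # change to match the registry setting


def build_command(password=PASSWORD, binary=ARPFLUSH):
    # Same flags as the test command in step 5.
    return [binary, "-a", "-P", password]


def notify_master(run=subprocess.run):
    """Run arpflush; return its exit status, or 127 if it cannot be started."""
    try:
        return run(build_command(), check=False).returncode
    except OSError as exc:
        print("could not run arpflush: %s" % exc, file=sys.stderr)
        return 127


if __name__ == "__main__":
    notify_master()
```

Make the script executable and point keepalived at it with a notify_master line inside the relevant vrrp_instance block, for example notify_master /usr/local/bin/arpflush-notify (that path is made up).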

With any luck it should all be funky after that.

Note that the above software is licensed under GPLv2, and you can get a copy of the source code for the Windows service from the svn repository linked on the right. It relies on a set of libraries I use for sockets, services and so on; this is called OW32 and is licensed under LGPLv2. The code for OW32 is in the same place. To build it I used MS VC2005, and WiX for the installer packages.
