- VNet `10.0.0.0/16`
- Subnets:
  - default: `10.0.0.0/24`
  - AzureRouteServer: `10.0.1.0/24`
  - servers: `10.0.2.0/24`
- Create in `RouteServerSubnet`
- Add `10.0.2.10` and `10.0.2.20` as peers on ASN `65010` (via Portal or otherwise). If you don't do this then `exabgp` will be refused a connection.
  - These IP addresses will be used by the primary and secondary VMs.
- Simple low-spec Ubuntu VM
- SSH jumpbox and pretend web client
- On the client/jumpbox VM run the script: http-ping.sh.
- You will see output like:
2021-11-22T14:46:35,983290736+00:00 +
2021-11-22T14:46:37,096042443+00:00 +
That shows: `{Date}T{Time} + {Server ID}`. If nothing appears after the `+`, that's because no reply was received from the VIP (`10.1.0.5`). That's okay, the servers aren't set up yet.
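The script itself is not reproduced here. As a rough idea of what such a poller does, here is a minimal sketch that produces the same line format. It assumes the web server answers plain HTTP on the VIP's default port with its ID as the response body, and it relies on GNU `date` and `sleep`; the function names are illustrative only.

```shell
VIP_URL="http://10.1.0.5/"   # assumption: plain HTTP on port 80

# Format one output line as "{Date}T{Time} + {Server ID}".
format_line() {
  printf '%s + %s\n' "$1" "$2"
}

# Poll the VIP forever; an empty reply leaves nothing after the "+".
http_ping() {
  local reply
  while true; do
    reply="$(curl --silent --max-time 1 "$VIP_URL" || true)"
    format_line "$(date --iso-8601=ns)" "$reply"
    sleep 0.1
  done
}
```

Call `http_ping` to start polling and stop it with Ctrl+C.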
- Simple low-spec Ubuntu VMs
- Placed in `servers` subnet, private IPs only
  - Primary IP = `10.0.2.10`
  - Secondary = `10.0.2.20`
- On the primary and secondary VMs, review the script: server-setup.sh. This installs the dependencies `node` and `exabgp` and configures the VIP with `ifconfig`.
  - Run the script.
- Launch the web server, e.g.
  - Primary server: `ID=PRI node ./web/server.js &`
  - Secondary server: `ID=SEC node ./web/server.js &`
  - That command uses `&` to run the process in the background. Use `fg` to bring it back to the foreground.
- On the primary VM run:
  - primary-exabgp-setup.sh. This creates a file in the current directory called `conf.ini`.
  - Launch exabgp: `exabgp --debug ./conf.ini`
  - At this point the `http-ping` on the client/jumpbox should start to provide responses.
- On the secondary VM run:
  - secondary-exabgp-setup.sh. This creates a file in the current directory called `conf.ini`.
  - Launch exabgp: `exabgp --debug ./conf.ini`

`exabgp` should now be running on both primary and secondary VMs.
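The setup scripts are not shown here, so the following is only a sketch of how a script might generate a `conf.ini` in ExaBGP 4 syntax. The neighbour addresses `10.0.1.4` and `10.0.1.5` and the ASNs 65010 and 65515 appear in the logs further down. Announcing the VIP with the VM's own IP as next hop is an assumption, and `write_conf` is a hypothetical helper.

```shell
# Sketch only: write an ExaBGP-style config that announces the VIP to both
# Route Server instances.
# Usage: write_conf <local-ip> <output-file>
write_conf() {
  local local_ip="$1" out="$2" peer
  : > "$out"                              # start with an empty file
  for peer in 10.0.1.4 10.0.1.5; do       # Route Server instance IPs seen in the logs
    cat >> "$out" <<EOF
neighbor $peer {
    router-id $local_ip;
    local-address $local_ip;
    local-as 65010;
    peer-as 65515;
    static {
        route 10.1.0.5/32 next-hop $local_ip;
    }
}
EOF
  done
}
```

For example, `write_conf 10.0.2.10 ./conf.ini` on the primary and `write_conf 10.0.2.20 ./conf.ini` on the secondary.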
In the normal state, http-ping.sh shows responses like:
...
2021-11-22T14:51:42,102116071+00:00 + PRI
2021-11-22T14:51:42,214647212+00:00 + PRI
...
This shows responses coming from the primary VM.
Now press `Ctrl+C` in the `exabgp` process on the primary VM. Within a couple of seconds the responses should fail over to the secondary VM:
...
2021-11-22T14:53:15,744081418+00:00 + PRI
2021-11-22T14:53:15,856580859+00:00 + PRI
2021-11-22T14:53:15,969329001+00:00 + SEC // failed over ✅
2021-11-22T14:53:16,082586449+00:00 + SEC
2021-11-22T14:53:16,196167799+00:00 + SEC
...
This example snippet shows the VIP's effective route failing over to the secondary VM.
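If the client output has been saved to a file, the moment of the switch can be picked out mechanically rather than by eye. The sketch below assumes the `{timestamp} + {ID}` line format shown above; `find_switches` is an illustrative name and not part of the demo scripts.

```shell
# Print the timestamp and transition whenever the responding server ID differs
# from the previous non-empty reply. Lines with no reply are skipped.
find_switches() {
  awk '$3 != "" && prev != "" && $3 != prev { print $1, prev "->" $3 }
       $3 != "" { prev = $3 }'
}

find_switches <<'EOF'
2021-11-22T14:53:15,856580859+00:00 + PRI
2021-11-22T14:53:15,969329001+00:00 + SEC
2021-11-22T14:53:16,082586449+00:00 + SEC
EOF
```

For the sample lines above this prints `2021-11-22T14:53:15,969329001+00:00 PRI->SEC`.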
These timings have been gathered crudely using aligned timestamps across VMs, nothing fancier than that! Snippets of the terminal output are shared for one attempt in each scenario, and three other timings are given separately.
Failover time: <2 seconds
Start: 09:48:04
End: 09:48:06
Other attempts:
- <5 seconds
- <4 seconds
- <2 seconds
Client
2021-11-23T09:48:06,123153999+00:00 + PRI
2021-11-23T09:48:06,236536190+00:00 + PRI
2021-11-23T09:48:06,354832563+00:00 + SEC
2021-11-23T09:48:06,468471959+00:00 + SEC
2021-11-23T09:48:06,581115537+00:00 + SEC
Primary
^C09:48:04 | 5652 | reactor | ^C received
09:48:04 | 5652 | reactor | performing shutdown
09:48:04 | 5652 | outgoing-1 | connection to 10.0.1.5 closed
09:48:04 | 5652 | outgoing-1 | outgoing-1 10.0.2.10-10.0.1.5, closing connection
09:48:04 | 5652 | outgoing-2 | connection to 10.0.1.4 closed
09:48:04 | 5652 | outgoing-2 | outgoing-2 10.0.2.10-10.0.1.4, closing connection
Secondary
09:48:04 | 4911 | outgoing-1 | received TCP payload ( 19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 001C 02
09:48:04 | 4911 | outgoing-1 | received TCP payload ( 9) 0005 200A 0100 0500 00
09:48:04 | 4911 | outgoing-1 | << message of type UPDATE
09:48:04 | 4911 | parser | parsing UPDATE ( 9) 0005 200A 0100 0500 00
09:48:04 | 4911 | parser | announced NLRI none
09:48:04 | 4911 | parser | NLRI ipv4 unicast without path-information payload 200A 0100 05
09:48:04 | 4911 | parser | withdrawn NLRI 10.1.0.5/32
09:48:04 | 4911 | outgoing-1 | receive-timer 29 second(s) left
09:48:04 | 4911 | peer-2 | << UPDATE #7
09:48:04 | 4911 | peer-2 | UPDATE #7 nlri ( 5) 10.1.0.5/32
09:48:04 | 4911 | outgoing-2 | received TCP payload ( 19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 001C 02
09:48:04 | 4911 | outgoing-2 | received TCP payload ( 9) 0005 200A 0100 0500 00
09:48:04 | 4911 | outgoing-2 | << message of type UPDATE
09:48:04 | 4911 | parser | parsing UPDATE ( 9) 0005 200A 0100 0500 00
09:48:04 | 4911 | parser | announced NLRI none
09:48:04 | 4911 | parser | NLRI ipv4 unicast without path-information payload 200A 0100 05
09:48:04 | 4911 | parser | withdrawn NLRI 10.1.0.5/32
09:48:04 | 4911 | outgoing-2 | receive-timer 29 second(s) left
09:48:04 | 4911 | peer-1 | << UPDATE #7
09:48:04 | 4911 | peer-1 | UPDATE #7 nlri ( 5) 10.1.0.5/32
09:48:05 | 4911 | outgoing-1 | send-timer 8 second(s) left
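The 9-byte UPDATE payload in the secondary's log can be decoded by hand. In the BGP UPDATE layout (RFC 4271) the first two bytes give the withdrawn-routes length, followed by a one-byte prefix length in bits and the prefix bytes, then a two-byte path-attribute length, which is zero here. A small sketch of that decoding, handling only a single IPv4 prefix of length 1 to 32 (`decode_withdrawn` is an illustrative helper):

```shell
# Decode the first withdrawn IPv4 prefix from a BGP UPDATE body given as hex.
decode_withdrawn() {
  local hex bits nbytes prefix
  hex=$(echo "$1" | tr -d ' ')                      # e.g. 0005200A0100050000
  bits=$(printf '%d' "0x$(echo "$hex" | cut -c5-6)") # prefix length in bits
  nbytes=$(( (bits + 7) / 8 ))                      # bytes used by the prefix
  # Take the prefix bytes and pad with zeros to a full 4-byte address.
  prefix=$(printf '%s00000000' "$(echo "$hex" | cut -c7-$((6 + nbytes * 2)))" | cut -c1-8)
  printf '%d.%d.%d.%d/%d\n' \
    "0x$(echo "$prefix" | cut -c1-2)" \
    "0x$(echo "$prefix" | cut -c3-4)" \
    "0x$(echo "$prefix" | cut -c5-6)" \
    "0x$(echo "$prefix" | cut -c7-8)" \
    "$bits"
}

decode_withdrawn "0005 200A 0100 0500 00"
```

This prints `10.1.0.5/32`, matching the `withdrawn NLRI 10.1.0.5/32` line reported by the parser.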
Failover time: <5 seconds
Start: 09:53:42
End: 09:53:47
Other attempts:
- <1 second
- <3 seconds
- <1 second
Client
2021-11-23T09:53:47,117527920+00:00 + SEC
2021-11-23T09:53:47,233164386+00:00 + SEC
2021-11-23T09:53:47,346205116+00:00 + PRI
2021-11-23T09:53:47,459332348+00:00 + PRI
Primary
$ exabgp --debug ./conf.ini
09:53:42 | 6484 | welcome | Thank you for using ExaBGP
09:53:42 | 6484 | version | 4.0.2-1c737d99
09:53:42 | 6484 | interpreter | 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
09:53:42 | 6484 | os | Linux primary 5.4.0-1063-azure #66~18.04.1-Ubuntu SMP Thu Oct 21 09:59:28 UTC 2021 x86_64
Secondary
09:53:43 | 4911 | peer-2 | << UPDATE #8
09:53:43 | 4911 | peer-2 | UPDATE #8 nlri ( 5) 10.1.0.5/32 next-hop 10.0.1.5
09:53:43 | 4911 | outgoing-2 | received TCP payload ( 19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003B 02
09:53:43 | 4911 | outgoing-2 | received TCP payload ( 40) 0000 001F 4001 0100 4002 0A02 0200 00FF EB00 00FD F240 0304 0A00 0104 C008 04FF EDFF ED20 0A01 0005
09:53:43 | 4911 | outgoing-2 | << message of type UPDATE
09:53:43 | 4911 | parser | parsing UPDATE ( 40) 0000 001F 4001 0100 4002 0A02 0200 00FF EB00 00FD F240 0304 0A00 0104 C008 04FF EDFF ED20 0A01 0005
09:53:43 | 4911 | parser | withdrawn NLRI none
09:53:43 | 4911 | parser | attribute origin flag 0x40 type 0x01 len 0x01 payload 00
09:53:43 | 4911 | parser | attribute as-path flag 0x40 type 0x02 len 0x0a payload 0202 0000 FFEB 0000 FDF2
09:53:43 | 4911 | parser | attribute next-hop flag 0x40 type 0x03 len 0x04 payload 0A00 0104
09:53:43 | 4911 | parser | attribute community flag 0xc0 type 0x08 len 0x04 payload FFED FFED
09:53:43 | 4911 | parser | NLRI ipv4 unicast without path-information payload 200A 0100 05
09:53:43 | 4911 | parser | announced NLRI 10.1.0.5/32 next-hop 10.0.1.4
09:53:43 | 4911 | outgoing-2 | receive-timer 29 second(s) left
09:53:43 | 4911 | peer-1 | << UPDATE #8
09:53:43 | 4911 | peer-1 | UPDATE #8 nlri ( 5) 10.1.0.5/32 next-hop 10.0.1.4
09:53:43 | 4911 | outgoing-1 | send-timer 9 second(s) left
`kill` and timing method example:
andrewbryson@primary:~$ ps x | grep exabgp
8655 pts/1 S+ 0:00 /usr/bin/python3 /usr/sbin/exabgp --debug ./conf.ini
8696 pts/0 S+ 0:00 grep --color=auto exabgp
andrewbryson@primary:~$ date; kill -9 8655
Tue Nov 23 10:17:34 UTC 2021
Failover time: <2 seconds
Start: 10:17:34
End: 10:17:36
Other attempts:
- <1 second
- <1 second
- <2 seconds
Good failover times!
Client
2021-11-23T10:17:36,562371918+00:00 + PRI
2021-11-23T10:17:36,674617289+00:00 + SEC
2021-11-23T10:17:36,788237474+00:00 + SEC
Primary
10:17:34 | 8655 | outgoing-1 | receive-timer 25 second(s) left
Killed
Secondary
10:17:34 | 4911 | outgoing-1 | received TCP payload ( 19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 001C 02
10:17:34 | 4911 | outgoing-1 | received TCP payload ( 9) 0005 200A 0100 0500 00
10:17:34 | 4911 | outgoing-1 | << message of type UPDATE
10:17:34 | 4911 | parser | parsing UPDATE ( 9) 0005 200A 0100 0500 00
10:17:34 | 4911 | parser | announced NLRI none
10:17:34 | 4911 | parser | NLRI ipv4 unicast without path-information payload 200A 0100 05
10:17:34 | 4911 | parser | withdrawn NLRI 10.1.0.5/32
10:17:34 | 4911 | outgoing-1 | receive-timer 29 second(s) left
10:17:34 | 4911 | peer-2 | << UPDATE #21
10:17:34 | 4911 | peer-2 | UPDATE #21 nlri ( 5) 10.1.0.5/32
10:17:34 | 4911 | outgoing-2 | received TCP payload ( 19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 001C 02
10:17:34 | 4911 | outgoing-2 | received TCP payload ( 9) 0005 200A 0100 0500 00
10:17:34 | 4911 | outgoing-2 | << message of type UPDATE
10:17:34 | 4911 | parser | parsing UPDATE ( 9) 0005 200A 0100 0500 00
10:17:34 | 4911 | parser | announced NLRI none
10:17:34 | 4911 | parser | NLRI ipv4 unicast without path-information payload 200A 0100 05
10:17:34 | 4911 | parser | withdrawn NLRI 10.1.0.5/32
10:17:34 | 4911 | outgoing-2 | receive-timer 29 second(s) left
10:17:34 | 4911 | peer-1 | << UPDATE #21
10:17:34 | 4911 | peer-1 | UPDATE #21 nlri ( 5) 10.1.0.5/32
- Route changes seem to have a dampening/anti-flapping window of 30 seconds, i.e.
  - Terminate the primary `exabgp`
  - The route switches to the secondary
  - Then very quickly launch `exabgp` on the primary again
  - It will take 30 seconds from the initial failover before the primary becomes the effective route again.
- With equal `as-path` lengths you get load-balanced ECMP behaviour for traffic across the primary and secondary servers.
- https://docs.microsoft.com/en-us/azure/route-server/quickstart-configure-route-server-cli
- Demo inspired by Adam's great work: https://github.com/adstuart/azure-routeserver-anycast