Tuesday, 28 October 2014

The Routing Engine In Linux Kernel

In this discussion we will zoom into the Linux Kernel code to understand what really happens when we work on internet routing. 

What is routing
Essential kernel data structures in routing
How route lookup works
Well known kernel APIs for route lookup
Behind the route configuration commands
Policy based routing
Multipath routing

We keep the latest kernel 3.17.1  as the reference for this discussion. So it’s worth mentioning that the well known traditional 'routing cache' is removed from 3.6 kernel onwards and the routing database is only FIB TRIE now.

What we DONT discuses here!
This article doesnt talk about the routing protocols like RIP, OSPF, BGP, EGP etc. Also we will not focus on the commands used for routing table configuration and management.

What is routing
Routing is the brain of Internet protocol, which allows the packets to cross the LAN boundaries. Lets do not spend much time to discuss the same details that you find here, http://en.wikipedia.org/wiki/Routing. Instead lets peek into the implementation details of it.

Configure a route 
We can add/delete routes to the routing tables using one of the below commands
ip route add dev eth0
route add -net gw
# ip route delete dev eth0
route del -net gw

Please refer the section 'route configuration, behind the curtain' to know how it works.

Scanning the Kernel Code

What are the minimum information expected from a route lookup?
  • The nexthop: The directly connected device to which the concerned packet must handover.
  • The output interface: the interface of this device to reach the nexthop
  • type of the route (based on the destination address in the ip header of the packet in question): Few of the important routing flags are below
    • RTCF_LOCAL: The destination address is local and the packet should terminate on this device. Packet will be given to the kernel method ip_local_deliver().
    • RTCF_BROADCAST and RTCF_MULTICAST: uses when the destination address in broadcast or multicast address resp.
    • refer include/uapi/linux/in_route.h for the rest of the flags.
  • Scope of the route (based on the destination address in the ip header of the packet in question):  As per the comment in rtnetlink.h, 'scope is something like the distance to the destination. The enumerator listed below holds the available scope flags. Where, the NOWHERE are reserved for not existing destinations, HOST is our local addresses, LINK are destinations located on directly attached link and UNIVERSE is everywhere in the Universe'.
        enum rt_scope_t {

  • Additionally, a route entry also gives few more information like MTU, priority, protocol id, metrics etc.

What should be the action on the packet based on this info?

Based on the route lookup result, the packet will be given to ip_local_deliver() in the case of local delivery, ip_forward() in the case of forwarding. In forwarding case, the packet is send to the next hop (Found in the route lookup) via the output interface and the packets continue its journey to the destination address.  

Now let’s see where the above information is stored in kernel.

Essential kernel data structures

Forwarding Information Base (FIB): For any given destination address, the routing subsystem is expected to give a route entry with the above discussed information. Such routing information(either statically configured or dynamically populated by the routing protocols) are stored in a kernel database called FIB.

'struct fib_table' is the data structure represents a FIB table in kernel.

struct fib_table {
struct hlist_node tb_hlist;
u32 tb_id;  /* the table id, between 0 to 255 */
int tb_default;
int tb_num_default; /* number of deafult routes in this table */
unsigned long tb_data[0];
            please find the complete structure definition at include/net/ip_fib.h

At boot time, two FIB tables will be created by default, called RT_TABLE_MAIN and RT_TABLE_LOCAL with table id 244 and 255 resp. More FIB tables will be created when policy based routing is enabled.

Kernel API
Creates a TRIE table
Creates a route entry, struct fib_info
Insets a route to the table
Deletes a route entry from the table

A route entry in kernel is 'struct fib_info', which holds the routing information that we discussed above (out dev, next hop, scope, priority, metric etc)

struct fib_info {
            struct hlist_node fib_hash;
            struct hlist_node fib_lhash;
            int                                 fib_treeref;
            unsigned int                   fib_flags;
            unsigned char                fib_dead;
            unsigned char                fib_protocol; /* routing protocol indicator for this route entry */
            unsigned char                fib_scope;
            unsigned char                fib_type;
            u32                               fib_priority;
            u32                               *fib_metrics;
            int                                 fib_nhs;
            struct fib_nh                  fib_nh[0]; /* Yes, it is an array, pls refer Multipath routing section below */
            please find the complete structure definition at ip_fib.h

The Destination cache used by the routing engine: This improves the performance of Linux routing engine. 'struct dst_entry' holds too many critical information required to process a packet going to each destination. This will be created after the route lookup. And the packet (Struct sk_buff) itself will hold a pointer to this entry, but not to fib_info. So after the route lookup, during the journey of the packet in the kernel network stack, destination cache could be referred at any point using skb_dst() api.

struct dst_entry {
            struct rcu_head              rcu_head;
            struct dst_entry  *child;
            struct net_device       *dev;
            struct  dst_ops           *ops;
            unsigned long                _metrics;
            unsigned long           expires;
            struct dst_entry  *path;
            struct dst_entry  *from;
            struct xfrm_state            *xfrm;
            void                              *__pad1;
            int                                 (*input)(struct sk_buff *);
            int                                 (*output)(struct sk_buff *);

            unsigned short               flags;
            please find the complete structure definition at include/net/dst.h

where input() and output() function pointers will be pointing to ip_local_deliver()/ip_forward() and ip_output() respectively. This id filled based on the route lookup result, as we discussed above.

How it works

How route lookup works?
As in the table below there are quite many kernel APIs available to do the route lookup for us. We can choose the appropriate API based on the available  input data and the direction of the packet (ingress or Egress). All of them boils down to the core FIB lookup api, fib_lookup(). This method returns 'struct fib_result' in successful route lookup. This structure is below

struct fib_result {
            unsigned char    prefixlen;
            unsigned char    nh_sel;
            unsigned char    type;
            unsigned char    scope;
            u32                   tclassid;
            struct fib_info *fi;
            struct fib_table *table;
            struct list_head *fa_head;

            please find the complete structure definition at include/net/ip_fib.h

fib_lookup() method searches both the default tables for a matching route, first searches the RT_TABLE_LOCAL and if there is no match, lookup will be done in the main table (RT_TABLE_MAIN). Using the 'fib_result', the destination cache (dst_entry) will be created for this destination, and input() and output() function pointers will be assigned properly as we discussed above.
fib_lookup() expects 'struct flowi4' as the input argument for the table lookup. This object carries the source and destination address, ToS value and many more as below.

struct flowi4 {
            struct flowi_common      __fl_common;
#define flowi4_oif                      __fl_common.flowic_oif
#define flowi4_iif                       __fl_common.flowic_iif
#define flowi4_mark                  __fl_common.flowic_mark
#define flowi4_tos                      __fl_common.flowic_tos
#define flowi4_scope                  __fl_common.flowic_scope
#define flowi4_proto                  __fl_common.flowic_proto
#define flowi4_flags                   __fl_common.flowic_flags
#define flowi4_secid                   __fl_common.flowic_secid
            /* (saddr,daddr) must be grouped, same order as in IP header */
            __be32                          saddr;
            __be32                          daddr;
            union flowi_uli                uli;
#define fl4_sport                        uli.ports.sport
#define fl4_dport                       uli.ports.dport
#define fl4_icmp_type                uli.icmpt.type
#define fl4_icmp_code                uli.icmpt.code
#define fl4_ipsec_spi                  uli.spi
#define fl4_mh_type                  uli.mht.type
#define fl4_gre_key                   uli.gre_key

            please find the complete structure definition at include/net/flow.h

} __attribute__((__aligned__(BITS_PER_LONG/8)));

A set of well known kernel APIs for route lookup

Kernel API

route configuration, behind the curtain!

Kernel routing tables could be managed using standard tools provided in iproute2 (ip command) & net-tools(route, netstat etc) packages. Where the iproute2 package uses NETLINK sockets to talk to the kernel, and the net-tools package uses IOCTLS.

As you know, NETLINK is an extension of generic socket framework (I will discuss about it in a separate article). NETLINK_ROUTE is the netlink message type used to link the admin commands to the routing subsystem. The most important rtnetlink routing commands and corresponding kernel handlers are noted below.

RTM_NEWROUTE   =>  inet_rtm_newroute()
RTM_DELROUTE     =>  inet_rtm_delroute()

RTM_GETROUTE    =>  inet_dump_fib()

Similarly, socket IOCTLS uses the following commands

#define SIOCADDRT      0x890B              /* add routing table entry           */
#define SIOCDELRT       0x890C              /* delete routing table entry        */
#define SIOCRTMSG      0x890D             /* call to routing system */

ip_rt_ioctl() is the kernel handler for all these IOCTL commands.

Now lets discuss two well known extensions of routing, Policy based routing & multipath routing.

Policy based routing

What is it?
As we discussed, there will be only two FIB tables (LOCAL and MAIN) created at network stack boot time. Policy routing is a feature which allows us to create up to 255 routing tables. This extends the control and flexibility in the routing decisions. Admin have to attach every table with a specific 'rule', so that when a packet arrives in routing framework, it searches for a matching rule and the corresponding table will be picked for route lookup (fib_rules_lookup() is the API to do this). See the example below, with that, any packet comes with 'tos' value 0x02 will hit routing table 190 in route lookup.

/* add a rule for a table 190 */
# ip rule add tos 0x02 table 190
/* now add a route entry to this table*/
# ip route add dev eth0 table 190

A fib rule is represented using struct fib_rule,

struct fib4_rule {
            struct fib_rule                common;
            u8                                 dst_len;
            u8                                 src_len;
            u8                                 tos;
            __be32                          src;
            __be32                          srcmask;
            __be32                          dst;
            __be32                          dstmask;

            please find the complete structure definition at net/ipv4/fib_rules.h

How to enable policy routing?
CONFIG_IP_MULTIPLE_TABLES must be SET in make menuconfig to enable policy routing.

Multipath routing

What is it?

Multipath routing allows us to add more than one nexthop to a single route. Admin can assign weight for each of the next hop using 'ip route' command as below

/* the network could be reached via both and, where the weights are 2 and 4 resp. */

#ip route add nexthop via weight 2 nexthop via weight 4.

As you might have noticed, the 'struct fib_info' (a route entry in kernel) is having two important fields for the nexthop, the 'fib_nhs' and 'struct fib_nh[0]'. where the former carries the number of nexthops of this route and latter is the array of those nexthops.

struct fib_nh {
            struct net_device            *nh_dev;
            struct hlist_node nh_hash;
            struct fib_info                 *nh_parent;
            unsigned int                   nh_flags;
            unsigned char                nh_scope;
            int                                 nh_weight;
            int                                 nh_power;
            __u32                           nh_tclassid;
            int                                 nh_oif;
            __be32                          nh_gw;
            __be32                          nh_saddr;
            int                                 nh_saddr_genid;
            struct rtable __rcu * __percpu *nh_pcpu_rth_output;
            struct rtable __rcu          *nh_rth_input;
            struct fnhe_hash_bucket *nh_exceptions;
            please find the complete structure definition at include/net/ip_fib.h

When this feature is enabled (CONFIG_IP_ROUTE_MULTIPATH during 'make menuconfig') in kernel, fib_select_multipath() must be the kernel API used to select a nexthop for the given destination.

Okay! Now what is the roll of the routing protocols?

The routing protocols like OSPF, RIP, BGP, EGP etc. runs as user space daemons to configure the above tables dynamically. They play an inevitable roll in the internet nodes to handle thousands of routes, where static configuration is almost impossible.

With this I wind up the discussion. Hope it helps you to look into the routing subsystem to dig further.