Kernel Network Stack Made Easy: The Routing Engine In Linux Kernel

In this discussion we will zoom into the Linux Kernel code to understand what really happens when we work on internet routing.

What is routing

Essential kernel data structures in routing

How route lookup works

Well known kernel APIs for route lookup

Behind the route configuration commands

Policy based routing

Multipath routing

We keep the latest kernel 3.17.1 as the reference for this discussion. So it’s worth mentioning that the well known traditional 'routing cache' is removed from 3.6 kernel onwards and the routing database is only FIB TRIE now.

What we DONT discuses here!

This article doesnt talk about the routing protocols like RIP, OSPF, BGP, EGP etc. Also we will not focus on the commands used for routing table configuration and management.

What is routing

Routing is the brain of Internet protocol, which allows the packets to cross the LAN boundaries. Lets do not spend much time to discuss the same details that you find here, http://en.wikipedia.org/wiki/Routing. Instead lets peek into the implementation details of it.

Configure a route

We can add/delete routes to the routing tables using one of the below commands

# ip route add 10.10.10.0/24 dev eth0

# route add -net 10.10.10.0/24 gw 10.10.20.1

# ip route delete 10.10.10.0/24 dev eth0

# route del -net 10.10.10.0/24 gw 10.10.20.1

Please refer the section 'route configuration, behind the curtain' to know how it works.

Scanning the Kernel Code

What are the minimum information expected from a route lookup?

The nexthop: The directly connected device to which the concerned packet must handover.
The output interface: the interface of this device to reach the nexthop
type of the route (based on the destination address in the ip header of the packet in question): Few of the important routing flags are below

RTCF_LOCAL: The destination address is local and the packet should terminate on this device. Packet will be given to the kernel method ip_local_deliver().
RTCF_BROADCAST and RTCF_MULTICAST: uses when the destination address in broadcast or multicast address resp.
refer include/uapi/linux/in_route.h for the rest of the flags.

Scope of the route (based on the destination address in the ip header of the packet in question): As per the comment in rtnetlink.h, 'scope is something like the distance to the destination. The enumerator listed below holds the available scope flags. Where, the NOWHERE are reserved for not existing destinations, HOST is our local addresses, LINK are destinations located on directly attached link and UNIVERSE is everywhere in the Universe'.

enum rt_scope_t {

RT_SCOPE_UNIVERSE=0,

RT_SCOPE_SITE=200,

RT_SCOPE_LINK=253,

RT_SCOPE_HOST=254,

RT_SCOPE_NOWHERE=255

};

Additionally, a route entry also gives few more information like MTU, priority, protocol id, metrics etc.

What should be the action on the packet based on this info?

Based on the route lookup result, the packet will be given to ip_local_deliver() in the case of local delivery, ip_forward() in the case of forwarding. In forwarding case, the packet is send to the next hop (Found in the route lookup) via the output interface and the packets continue its journey to the destination address.

Now let’s see where the above information is stored in kernel.

Essential kernel data structures

Forwarding Information Base (FIB): For any given destination address, the routing subsystem is expected to give a route entry with the above discussed information. Such routing information(either statically configured or dynamically populated by the routing protocols) are stored in a kernel database called FIB.

'struct fib_table' is the data structure represents a FIB table in kernel.

struct fib_table {

struct hlist_node tb_hlist;

u32 tb_id; /* the table id, between 0 to 255 */

int tb_default;

int tb_num_default; /* number of deafult routes in this table */

unsigned long tb_data[0];

......................

please find the complete structure definition at include/net/ip_fib.h

......................

};

At boot time, two FIB tables will be created by default, called RT_TABLE_MAIN and RT_TABLE_LOCAL with table id 244 and 255 resp. More FIB tables will be created when policy based routing is enabled.

Kernel API	Purpose
Fib_trie_table()	Creates a TRIE table
Fib_create_info()	Creates a route entry, struct fib_info
Fib_table_insert()	Insets a route to the table
Fib_table_delete()	Deletes a route entry from the table

A route entry in kernel is 'struct fib_info', which holds the routing information that we discussed above (out dev, next hop, scope, priority, metric etc)

struct fib_info {

struct hlist_node fib_hash;

struct hlist_node fib_lhash;

int fib_treeref;

unsigned int fib_flags;

unsigned char fib_dead;

unsigned char fib_protocol; /* routing protocol indicator for this route entry */

unsigned char fib_scope;

unsigned char fib_type;

u32 fib_priority;

u32 *fib_metrics;

int fib_nhs;

struct fib_nh fib_nh[0]; /* Yes, it is an array, pls refer Multipath routing section below */

......................

please find the complete structure definition at ip_fib.h

......................

};

The Destination cache used by the routing engine: This improves the performance of Linux routing engine. 'struct dst_entry' holds too many critical information required to process a packet going to each destination. This will be created after the route lookup. And the packet (Struct sk_buff) itself will hold a pointer to this entry, but not to fib_info. So after the route lookup, during the journey of the packet in the kernel network stack, destination cache could be referred at any point using skb_dst() api.

struct dst_entry {

struct rcu_head rcu_head;

struct dst_entry *child;

struct net_device *dev;

struct dst_ops *ops;

unsigned long _metrics;

unsigned long expires;

struct dst_entry *path;

struct dst_entry *from;

#ifdef CONFIG_XFRM

struct xfrm_state *xfrm;

#else

void *__pad1;

#endif

int (*input)(struct sk_buff *);

int (*output)(struct sk_buff *);

unsigned short flags;

......................

please find the complete structure definition at include/net/dst.h

......................

}

where input() and output() function pointers will be pointing to ip_local_deliver()/ip_forward() and ip_output() respectively. This id filled based on the route lookup result, as we discussed above.

How it works

How route lookup works?

As in the table below there are quite many kernel APIs available to do the route lookup for us. We can choose the appropriate API based on the available input data and the direction of the packet (ingress or Egress). All of them boils down to the core FIB lookup api, fib_lookup(). This method returns 'struct fib_result' in successful route lookup. This structure is below

struct fib_result {

unsigned char prefixlen;

unsigned char nh_sel;

unsigned char type;

unsigned char scope;

u32 tclassid;

struct fib_info *fi;

struct fib_table *table;

struct list_head *fa_head;

......................

please find the complete structure definition at include/net/ip_fib.h

......................

};

fib_lookup() method searches both the default tables for a matching route, first searches the RT_TABLE_LOCAL and if there is no match, lookup will be done in the main table (RT_TABLE_MAIN). Using the 'fib_result', the destination cache (dst_entry) will be created for this destination, and input() and output() function pointers will be assigned properly as we discussed above.

fib_lookup() expects 'struct flowi4' as the input argument for the table lookup. This object carries the source and destination address, ToS value and many more as below.

struct flowi4 {

struct flowi_common __fl_common;

#define flowi4_oif __fl_common.flowic_oif

#define flowi4_iif __fl_common.flowic_iif

#define flowi4_mark __fl_common.flowic_mark

#define flowi4_tos __fl_common.flowic_tos

#define flowi4_scope __fl_common.flowic_scope

#define flowi4_proto __fl_common.flowic_proto

#define flowi4_flags __fl_common.flowic_flags

#define flowi4_secid __fl_common.flowic_secid

/* (saddr,daddr) must be grouped, same order as in IP header */

__be32 saddr;

__be32 daddr;

union flowi_uli uli;

#define fl4_sport uli.ports.sport

#define fl4_dport uli.ports.dport

#define fl4_icmp_type uli.icmpt.type

#define fl4_icmp_code uli.icmpt.code

#define fl4_ipsec_spi uli.spi

#define fl4_mh_type uli.mht.type

#define fl4_gre_key uli.gre_key

......................

please find the complete structure definition at include/net/flow.h

......................

} __attribute__((__aligned__(BITS_PER_LONG/8)));

A set of well known kernel APIs for route lookup

Kernel API	Definition
Ip_route_input()	Include/net/route.h
Ip_route_input_noref()	Net/ipv4/route.c
Ip_route_input_slow()	Net/ipv4/route.c
Ip_route_input_mc()	Net/ipv4/route.c
Ip_route_output()	Include/net/route.h
Ip_route_output_key()	Include/net/route.h
Ip_route_output_flow()	net/ipv4/route.c
ip_route_output_ports()	Include/net/route.h

route configuration, behind the curtain!

Kernel routing tables could be managed using standard tools provided in iproute2 (ip command) & net-tools(route, netstat etc) packages. Where the iproute2 package uses NETLINK sockets to talk to the kernel, and the net-tools package uses IOCTLS.

As you know, NETLINK is an extension of generic socket framework (I will discuss about it in a separate article). NETLINK_ROUTE is the netlink message type used to link the admin commands to the routing subsystem. The most important rtnetlink routing commands and corresponding kernel handlers are noted below.

RTM_NEWROUTE => inet_rtm_newroute()

RTM_DELROUTE => inet_rtm_delroute()

RTM_GETROUTE => inet_dump_fib()

Similarly, socket IOCTLS uses the following commands

#define SIOCADDRT 0x890B /* add routing table entry */

#define SIOCDELRT 0x890C /* delete routing table entry */

#define SIOCRTMSG 0x890D /* call to routing system */

ip_rt_ioctl() is the kernel handler for all these IOCTL commands.

Now lets discuss two well known extensions of routing, Policy based routing & multipath routing.

Policy based routing

What is it?

As we discussed, there will be only two FIB tables (LOCAL and MAIN) created at network stack boot time. Policy routing is a feature which allows us to create up to 255 routing tables. This extends the control and flexibility in the routing decisions. Admin have to attach every table with a specific 'rule', so that when a packet arrives in routing framework, it searches for a matching rule and the corresponding table will be picked for route lookup (fib_rules_lookup() is the API to do this). See the example below, with that, any packet comes with 'tos' value 0x02 will hit routing table 190 in route lookup.

/* add a rule for a table 190 */

# ip rule add tos 0x02 table 190

/* now add a route entry to this table*/

# ip route add 192.168.1.0/24 dev eth0 table 190

A fib rule is represented using struct fib_rule,

struct fib4_rule {

struct fib_rule common;

u8 dst_len;

u8 src_len;

u8 tos;

__be32 src;

__be32 srcmask;

__be32 dst;

__be32 dstmask;

......................

please find the complete structure definition at net/ipv4/fib_rules.h

......................

};

How to enable policy routing?

CONFIG_IP_MULTIPLE_TABLES must be SET in make menuconfig to enable policy routing.

Multipath routing

What is it?

Multipath routing allows us to add more than one nexthop to a single route. Admin can assign weight for each of the next hop using 'ip route' command as below

/* the network 10.1.1.0 could be reached via both 10.10.10.1 and 172.20.20.1, where the weights are 2 and 4 resp. */

#ip route add 10.1.1.0/24 nexthop via 10.10.10.1 weight 2 nexthop via 172.20.20.1 weight 4.

As you might have noticed, the 'struct fib_info' (a route entry in kernel) is having two important fields for the nexthop, the 'fib_nhs' and 'struct fib_nh[0]'. where the former carries the number of nexthops of this route and latter is the array of those nexthops.

struct fib_nh {

struct net_device *nh_dev;

struct hlist_node nh_hash;

struct fib_info *nh_parent;

unsigned int nh_flags;

unsigned char nh_scope;

#ifdef CONFIG_IP_ROUTE_MULTIPATH

int nh_weight;

int nh_power;

#endif

#ifdef CONFIG_IP_ROUTE_CLASSID

__u32 nh_tclassid;

#endif

int nh_oif;

__be32 nh_gw;

__be32 nh_saddr;

int nh_saddr_genid;

struct rtable __rcu * __percpu *nh_pcpu_rth_output;

struct rtable __rcu *nh_rth_input;

struct fnhe_hash_bucket *nh_exceptions;

......................

please find the complete structure definition at include/net/ip_fib.h

......................

};

When this feature is enabled (CONFIG_IP_ROUTE_MULTIPATH during 'make menuconfig') in kernel, fib_select_multipath() must be the kernel API used to select a nexthop for the given destination.

Okay! Now what is the roll of the routing protocols?

The routing protocols like OSPF, RIP, BGP, EGP etc. runs as user space daemons to configure the above tables dynamically. They play an inevitable roll in the internet nodes to handle thousands of routes, where static configuration is almost impossible.

With this I wind up the discussion. Hope it helps you to look into the routing subsystem to dig further.

Kernel Network Stack Made Easy

Tuesday, 28 October 2014

The Routing Engine In Linux Kernel

Essential kernel data structures

A set of well known kernel APIs for route lookup

route configuration, behind the curtain!

Policy based routing

Multipath routing

No comments:

Post a Comment

Upcoming Articles

Blog Archive