In this discussion we will zoom into the Linux kernel code to understand what really happens when we work with internet routing. We will cover:
- What is routing
- Essential kernel data structures in routing
- How route lookup works
- Well-known kernel APIs for route lookup
- Behind the route configuration commands
- Policy based routing
- Multipath routing
We use kernel 3.17.1 as the reference for this discussion. It is worth mentioning that the well-known traditional 'routing cache' was removed from the 3.6 kernel onwards, and the routing database is now only the FIB TRIE.
What we DON'T discuss here!
This article does not talk about routing protocols like RIP, OSPF, BGP, EGP, etc. We will also not focus on the commands used for routing table configuration and management.
What is routing
Routing is the brain of the Internet Protocol, which allows packets to cross LAN boundaries. Let's not spend time repeating the details you can find at http://en.wikipedia.org/wiki/Routing. Instead, let's peek into the implementation details.
Configure a route
We can add/delete routes to the routing tables using one of the commands below:
# ip route add 10.10.10.0/24 dev eth0
# route add -net 10.10.10.0/24 gw 10.10.20.1
# ip route delete 10.10.10.0/24 dev eth0
# route del -net 10.10.10.0/24 gw 10.10.20.1
Please refer to the section 'route configuration, behind the curtain!' to see how this works.
Scanning the Kernel Code
What is the minimum information expected from a route lookup?
- The nexthop: the directly connected device to which the packet in question must be handed over.
- The output interface: the interface of this device used to reach the nexthop.
- The type of the route (based on the destination address in the IP header of the packet in question). A few of the important routing flags are below:
- RTCF_LOCAL: the destination address is local and the packet should terminate on this device. The packet will be given to the kernel method ip_local_deliver().
- RTCF_BROADCAST and RTCF_MULTICAST: used when the destination address is a broadcast or multicast address respectively.
- Refer to include/uapi/linux/in_route.h for the rest of the flags.
- The scope of the route (based on the destination address in the IP header of the packet in question): as per the comment in rtnetlink.h, 'scope is something like the distance to the destination'. The enum listed below holds the available scope values, where NOWHERE is reserved for not existing destinations, HOST is our local addresses, LINK is for destinations located on a directly attached link, and UNIVERSE is everywhere in the Universe.
enum rt_scope_t {
	RT_SCOPE_UNIVERSE = 0,
	RT_SCOPE_SITE = 200,
	RT_SCOPE_LINK = 253,
	RT_SCOPE_HOST = 254,
	RT_SCOPE_NOWHERE = 255
};
- Additionally, a route entry also carries a few more pieces of information, like the MTU, priority, protocol id, metrics, etc.
What should be the action on the packet based on this info?
Based on the route lookup result, the packet is given to ip_local_deliver() in the case of local delivery, or to ip_forward() in the case of forwarding. In the forwarding case, the packet is sent to the nexthop (found in the route lookup) via the output interface, and it continues its journey to the destination address. A simplified sketch of this handler selection is below.
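This is only a simplified sketch modeled on the logic in ip_route_input_slow() (net/ipv4/route.c); the helper name set_input_handler() is just for illustration, not a kernel API.

#include <linux/rtnetlink.h>	/* RTN_LOCAL */
#include <net/dst.h>		/* struct dst_entry */
#include <net/ip.h>		/* ip_local_deliver(), ip_forward() */

/* Simplified sketch: the route type decides which handler the packet
 * will be given to. */
static void set_input_handler(struct dst_entry *dst, unsigned char rt_type)
{
	if (rt_type == RTN_LOCAL)
		dst->input = ip_local_deliver;	/* terminate on this host */
	else
		dst->input = ip_forward;	/* continue toward the nexthop */
}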
Now let's see where the above information is stored in the kernel.
Essential kernel data structures
Forwarding Information Base (FIB): For any given destination address, the routing subsystem is expected to give a route entry with the information discussed above. Such routing information (either statically configured or dynamically populated by the routing protocols) is stored in a kernel database called the FIB.
'struct fib_table' is the data structure that represents a FIB table in the kernel.
struct fib_table {
	struct hlist_node	tb_hlist;
	u32			tb_id;		/* the table id, between 0 and 255 */
	int			tb_default;
	int			tb_num_default;	/* number of default routes in this table */
	unsigned long		tb_data[0];
	......................
	please find the complete structure definition at include/net/ip_fib.h
	......................
};
At boot time, two FIB tables are created by default, called RT_TABLE_MAIN and RT_TABLE_LOCAL, with table ids 254 and 255 respectively. More FIB tables are created when policy based routing is enabled. A minimal sketch of fetching these tables is below.
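The default tables can be fetched by id with fib_get_table(), declared in include/net/ip_fib.h. A minimal sketch, assuming a kernel-module context and 3.17-era APIs (show_default_tables() is an illustrative name):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <net/ip_fib.h>

/* Look up the two default FIB tables by their well-known ids. */
static void show_default_tables(struct net *net)
{
	struct fib_table *local_tbl, *main_tbl;

	rcu_read_lock();	/* FIB tables are RCU protected */
	local_tbl = fib_get_table(net, RT_TABLE_LOCAL);	/* id 255 */
	main_tbl  = fib_get_table(net, RT_TABLE_MAIN);	/* id 254 */
	if (local_tbl && main_tbl)
		pr_info("local: %u, main: %u\n",
			local_tbl->tb_id, main_tbl->tb_id);
	rcu_read_unlock();
}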
Kernel API          | Purpose
fib_trie_table()    | Creates a TRIE table
fib_create_info()   | Creates a route entry, struct fib_info
fib_table_insert()  | Inserts a route into the table
fib_table_delete()  | Deletes a route entry from the table
A route entry in the kernel is 'struct fib_info', which holds the routing information that we discussed above (output device, nexthop, scope, priority, metrics, etc.).
struct fib_info {
	struct hlist_node	fib_hash;
	struct hlist_node	fib_lhash;
	int			fib_treeref;
	unsigned int		fib_flags;
	unsigned char		fib_dead;
	unsigned char		fib_protocol;	/* routing protocol indicator for this route entry */
	unsigned char		fib_scope;
	unsigned char		fib_type;
	u32			fib_priority;
	u32			*fib_metrics;
	int			fib_nhs;
	struct fib_nh		fib_nh[0];	/* yes, it is an array; please refer to the Multipath routing section below */
	......................
	please find the complete structure definition at include/net/ip_fib.h
	......................
};
The destination cache used by the routing engine: this improves the performance of the Linux routing engine. 'struct dst_entry' holds the critical information required to process a packet going to each destination. It is created after the route lookup, and the packet (struct sk_buff) itself holds a pointer to this entry, not to the fib_info. So after the route lookup, at any point during the packet's journey through the kernel network stack, the destination cache can be referred to using the skb_dst() API, as in the sketch below.
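A minimal sketch of consulting the cached entry; skb_dst() and dst_mtu() are helpers from include/net/dst.h, and mtu_of_route() is an illustrative name. It assumes the skb has already been through route lookup, so skb_dst() is non-NULL:

#include <net/dst.h>

/* Read the path MTU from the destination cache attached to a packet. */
static unsigned int mtu_of_route(const struct sk_buff *skb)
{
	const struct dst_entry *dst = skb_dst(skb);

	/* dst_mtu() reads the MTU metric, falling back to the device MTU */
	return dst_mtu(dst);
}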
struct dst_entry {
	struct rcu_head		rcu_head;
	struct dst_entry	*child;
	struct net_device	*dev;
	struct dst_ops		*ops;
	unsigned long		_metrics;
	unsigned long		expires;
	struct dst_entry	*path;
	struct dst_entry	*from;
#ifdef CONFIG_XFRM
	struct xfrm_state	*xfrm;
#else
	void			*__pad1;
#endif
	int			(*input)(struct sk_buff *);
	int			(*output)(struct sk_buff *);
	unsigned short		flags;
	......................
	please find the complete structure definition at include/net/dst.h
	......................
};
where the input() and output() function pointers point to ip_local_deliver()/ip_forward() and ip_output() respectively. They are filled in based on the route lookup result, as we discussed above.
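The dispatch through these pointers is exactly what the kernel's dst_input()/dst_output() helpers in include/net/dst.h do. A simplified sketch of the input side (my_dst_input() is an illustrative name; the real helper is dst_input()):

#include <net/dst.h>

/* Simplified sketch of dst_input(): after the route lookup, the stack
 * simply calls through the function pointer in the dst_entry, which
 * resolves to ip_local_deliver() or ip_forward(). */
static inline int my_dst_input(struct sk_buff *skb)
{
	return skb_dst(skb)->input(skb);
}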
How it works
How route lookup works?
As shown in the table further below, there are quite a few kernel APIs available to do the route lookup for us. We can choose the appropriate API based on the available input data and the direction of the packet (ingress or egress). All of them boil down to the core FIB lookup API, fib_lookup(). This method fills a 'struct fib_result' on a successful route lookup. The structure is below:
struct fib_result {
	unsigned char		prefixlen;
	unsigned char		nh_sel;
	unsigned char		type;
	unsigned char		scope;
	u32			tclassid;
	struct fib_info		*fi;
	struct fib_table	*table;
	struct list_head	*fa_head;
	......................
	please find the complete structure definition at include/net/ip_fib.h
	......................
};
fib_lookup() searches both default tables for a matching route: it first searches RT_TABLE_LOCAL, and if there is no match, the lookup is done in the main table (RT_TABLE_MAIN). Using the 'fib_result', the destination cache entry (dst_entry) is created for this destination, and its input() and output() function pointers are assigned properly, as we discussed above.
fib_lookup() expects a 'struct flowi4' as the input argument for the table lookup. This object carries the source and destination addresses, the ToS value and more, as below.
struct flowi4 {
	struct flowi_common	__fl_common;
#define flowi4_oif	__fl_common.flowic_oif
#define flowi4_iif	__fl_common.flowic_iif
#define flowi4_mark	__fl_common.flowic_mark
#define flowi4_tos	__fl_common.flowic_tos
#define flowi4_scope	__fl_common.flowic_scope
#define flowi4_proto	__fl_common.flowic_proto
#define flowi4_flags	__fl_common.flowic_flags
#define flowi4_secid	__fl_common.flowic_secid
	/* (saddr,daddr) must be grouped, same order as in IP header */
	__be32			saddr;
	__be32			daddr;
	union flowi_uli		uli;
#define fl4_sport	uli.ports.sport
#define fl4_dport	uli.ports.dport
#define fl4_icmp_type	uli.icmpt.type
#define fl4_icmp_code	uli.icmpt.code
#define fl4_ipsec_spi	uli.spi
#define fl4_mh_type	uli.mht.type
#define fl4_gre_key	uli.gre_key
	......................
	please find the complete structure definition at include/net/flow.h
	......................
} __attribute__((__aligned__(BITS_PER_LONG/8)));
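Tying these together, here is a minimal sketch (kernel-module context, 3.17-era signatures) that builds a flowi4 key for a destination address and resolves it with fib_lookup(); lookup_dst() is an illustrative name:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/rtnetlink.h>
#include <net/ip_fib.h>

/* Resolve a destination address against the FIB; fib_lookup()
 * returns 0 on success and fills 'res'. */
static int lookup_dst(struct net *net, __be32 daddr)
{
	struct flowi4 fl4 = {
		.daddr		= daddr,
		.flowi4_tos	= 0,
		.flowi4_scope	= RT_SCOPE_UNIVERSE,
	};
	struct fib_result res;
	int err;

	rcu_read_lock();
	err = fib_lookup(net, &fl4, &res);
	if (!err)
		pr_info("type %u, scope %u, prefixlen %u\n",
			res.type, res.scope, res.prefixlen);
	rcu_read_unlock();
	return err;
}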
A set of well-known kernel APIs for route lookup
Kernel API               | Defined in
ip_route_input()         | include/net/route.h
ip_route_input_noref()   | net/ipv4/route.c
ip_route_input_slow()    | net/ipv4/route.c
ip_route_input_mc()      | net/ipv4/route.c
ip_route_output()        | include/net/route.h
ip_route_output_key()    | include/net/route.h
ip_route_output_flow()   | net/ipv4/route.c
ip_route_output_ports()  | include/net/route.h
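As an example of the egress side, ip_route_output_key() returns a 'struct rtable', which embeds the dst_entry discussed above. A minimal sketch assuming kernel-module context (egress_lookup() is an illustrative name):

#include <linux/err.h>
#include <linux/kernel.h>
#include <net/route.h>

/* Egress route lookup: resolve a destination and print the output
 * device. The lookup takes a reference on the rtable, so release it. */
static int egress_lookup(struct net *net, __be32 daddr)
{
	struct flowi4 fl4 = { .daddr = daddr };
	struct rtable *rt;

	rt = ip_route_output_key(net, &fl4);
	if (IS_ERR(rt))
		return PTR_ERR(rt);

	pr_info("out dev: %s\n", rt->dst.dev->name);
	ip_rt_put(rt);		/* drop the reference taken by the lookup */
	return 0;
}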
route configuration, behind the curtain!
Kernel routing tables can be managed using the standard tools provided in the iproute2 (ip command) and net-tools (route, netstat, etc.) packages, where the iproute2 package uses NETLINK sockets to talk to the kernel, and the net-tools package uses IOCTLs.
As you know, NETLINK is an extension of the generic socket framework (I will discuss it in a separate article). NETLINK_ROUTE is the netlink message type used to link the admin commands to the routing subsystem. The most important rtnetlink routing commands and their corresponding kernel handlers are noted below.
RTM_NEWROUTE => inet_rtm_newroute()
RTM_DELROUTE => inet_rtm_delroute()
RTM_GETROUTE => inet_dump_fib()
Similarly, socket IOCTLs use the following commands:
#define SIOCADDRT 0x890B /* add routing table entry */
#define SIOCDELRT 0x890C /* delete routing table entry */
#define SIOCRTMSG 0x890D /* call to routing system */
ip_rt_ioctl() is the kernel handler for all these IOCTL
commands.
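To see the IOCTL path end to end, here is a small user-space sketch that takes the same route into ip_rt_ioctl() as net-tools' 'route add' command; the device name and addresses are just examples:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <net/route.h>
#include <arpa/inet.h>

/* Equivalent of: route add -net 10.10.10.0 netmask 255.255.255.0 dev eth0
 * (needs CAP_NET_ADMIN, i.e. run as root) */
int main(void)
{
	struct rtentry rt;
	struct sockaddr_in *sin;
	char dev[] = "eth0";
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&rt, 0, sizeof(rt));

	sin = (struct sockaddr_in *)&rt.rt_dst;
	sin->sin_family = AF_INET;
	inet_pton(AF_INET, "10.10.10.0", &sin->sin_addr);

	sin = (struct sockaddr_in *)&rt.rt_genmask;
	sin->sin_family = AF_INET;
	inet_pton(AF_INET, "255.255.255.0", &sin->sin_addr);

	rt.rt_flags = RTF_UP;
	rt.rt_dev = dev;

	if (ioctl(fd, SIOCADDRT, &rt) < 0)	/* lands in ip_rt_ioctl() */
		perror("SIOCADDRT");
	close(fd);
	return 0;
}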
Now let's discuss two well-known extensions of routing: policy based routing and multipath routing.
Policy based routing
What is it?
As we discussed, only two FIB tables (LOCAL and MAIN) are created at network stack boot time. Policy routing is a feature which allows us to create up to 255 routing tables, which extends the control and flexibility in routing decisions. The admin has to attach every table to a specific 'rule', so that when a packet arrives in the routing framework, a matching rule is searched for and the corresponding table is picked for the route lookup (fib_rules_lookup() is the API that does this). See the example below; with that, any packet that comes in with ToS value 0x02 will hit routing table 190 in the route lookup.
/* add a rule for a table 190 */
# ip rule add tos 0x02 table 190
/* now add a route entry to this table*/
# ip route add 192.168.1.0/24 dev eth0 table 190
An IPv4 FIB rule is represented using 'struct fib4_rule', which embeds the generic 'struct fib_rule':
struct fib4_rule {
	struct fib_rule		common;
	u8			dst_len;
	u8			src_len;
	u8			tos;
	__be32			src;
	__be32			srcmask;
	__be32			dst;
	__be32			dstmask;
	......................
	please find the complete structure definition at net/ipv4/fib_rules.c
	......................
};
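The matching itself is simple; below is a simplified sketch modeled on fib4_rule_match() in net/ipv4/fib_rules.c (the real function takes the generic 'struct fib_rule' and 'struct flowi' and checks a couple of extra flags; fib4_rule_matches() here is an illustrative name):

/* Simplified sketch of the rule-match logic: a rule matches when the
 * masked source/destination addresses and the ToS value all agree
 * with the flowi4 lookup key. */
static int fib4_rule_matches(const struct fib4_rule *r,
			     const struct flowi4 *fl4)
{
	if (((fl4->saddr ^ r->src) & r->srcmask) ||
	    ((fl4->daddr ^ r->dst) & r->dstmask))
		return 0;			/* address mismatch */

	if (r->tos && r->tos != fl4->flowi4_tos)
		return 0;			/* ToS mismatch */

	return 1;
}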
How to enable policy routing?
CONFIG_IP_MULTIPLE_TABLES must be set in 'make menuconfig' to enable policy routing.
Multipath routing
What is it?
Multipath routing allows us to add more than one nexthop to a single route. The admin can assign a weight to each of the nexthops using the 'ip route' command, as below:
/* the network 10.1.1.0/24 could be reached via both 10.10.10.1 and
   172.20.20.1, where the weights are 2 and 4 resp. */
# ip route add 10.1.1.0/24 nexthop via 10.10.10.1 weight 2 nexthop via 172.20.20.1 weight 4
As you might have noticed, 'struct fib_info' (a route entry in the kernel) has two important fields for the nexthop: 'fib_nhs' and the 'struct fib_nh fib_nh[0]' array, where the former carries the number of nexthops of this route and the latter is the array of those nexthops.
struct fib_nh {
	struct net_device	*nh_dev;
	struct hlist_node	nh_hash;
	struct fib_info		*nh_parent;
	unsigned int		nh_flags;
	unsigned char		nh_scope;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
	int			nh_weight;
	int			nh_power;
#endif
#ifdef CONFIG_IP_ROUTE_CLASSID
	__u32			nh_tclassid;
#endif
	int			nh_oif;
	__be32			nh_gw;
	__be32			nh_saddr;
	int			nh_saddr_genid;
	struct rtable __rcu * __percpu *nh_pcpu_rth_output;
	struct rtable __rcu	*nh_rth_input;
	struct fnhe_hash_bucket	*nh_exceptions;
	......................
	please find the complete structure definition at include/net/ip_fib.h
	......................
};
When this feature is enabled in the kernel (CONFIG_IP_ROUTE_MULTIPATH during 'make menuconfig'), fib_select_multipath() is the kernel API used to select a nexthop for the given destination.
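The idea behind the selection can be sketched as a weighted pick over the fib_nh[] array. Note this is a simplification, not the kernel's exact algorithm: the real fib_select_multipath() keeps per-nexthop nh_power credits that are refilled from nh_weight, and pick_nexthop() is an illustrative name:

/* Simplified weighted nexthop selection over fib_info's fib_nh[]
 * array, in the spirit of fib_select_multipath(). 'rnd' is any
 * random number; nexthops with larger weights are picked more often. */
static int pick_nexthop(const struct fib_info *fi, u32 rnd)
{
	u32 total = 0;
	int nhsel;

	for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++)
		total += fi->fib_nh[nhsel].nh_weight;

	rnd %= total;			/* assumes at least one nexthop */
	for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
		if (rnd < (u32)fi->fib_nh[nhsel].nh_weight)
			return nhsel;
		rnd -= fi->fib_nh[nhsel].nh_weight;
	}
	return 0;
}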
Okay! Now what is the role of the routing protocols?
Routing protocols like OSPF, RIP, BGP, EGP, etc. run as user-space daemons and configure the above tables dynamically. They play an inevitable role in Internet nodes that handle thousands of routes, where static configuration is almost impossible.
With this I wind up the discussion. I hope it helps you look into the routing subsystem and dig further.