In this discussion we will zoom into the Linux kernel code to understand what really happens when we work with internet routing. We will cover:
- What is routing
- Essential kernel data structures in routing
- How route lookup works
- Well-known kernel APIs for route lookup
- Behind the route configuration commands
- Policy based routing
- Multipath routing
We use kernel 3.17.1 as the reference for this discussion. It is worth mentioning that the well-known traditional 'routing cache' was removed from the 3.6 kernel onwards, and the routing database is now only the FIB TRIE.
What we DON'T discuss here!
This article does not talk about routing protocols like RIP, OSPF, BGP, EGP, etc. We will also not focus on the commands used for routing table configuration and management.
What is routing
Routing is the brain of the Internet Protocol, which allows packets to cross LAN boundaries. Let's not spend time repeating the details you can find at http://en.wikipedia.org/wiki/Routing. Instead, let's peek into the implementation details.
Configure a route
We can add/delete routes to the routing tables using one of the commands below:
# ip route add 10.10.10.0/24 dev eth0
# route add -net 10.10.10.0/24 gw 10.10.20.1
# ip route delete 10.10.10.0/24 dev eth0
# route del -net 10.10.10.0/24 gw 10.10.20.1
Please refer to the section 'route configuration, behind the curtain!' to see how this works.
Scanning the Kernel Code
What is the minimum information expected from a route lookup?
- The nexthop: the directly connected device to which the packet in question must be handed over.
- The output interface: the interface of this device used to reach the nexthop.
- The type of the route (based on the destination address in the IP header of the packet in question). A few of the important routing flags are below:
- RTCF_LOCAL: the destination address is local and the packet should terminate on this device. The packet will be given to the kernel method ip_local_deliver().
- RTCF_BROADCAST and RTCF_MULTICAST: used when the destination address is a broadcast or multicast address respectively.
- Refer to include/uapi/linux/in_route.h for the rest of the flags.
- The scope of the route (based on the destination address in the IP header of the packet in question): as per the comment in rtnetlink.h, 'scope is something like the distance to the destination'. The enum listed below holds the available scope values, where NOWHERE is reserved for not existing destinations, HOST is our local addresses, LINK is for destinations located on a directly attached link, and UNIVERSE is everywhere in the Universe.
enum rt_scope_t {
	RT_SCOPE_UNIVERSE = 0,
	RT_SCOPE_SITE = 200,
	RT_SCOPE_LINK = 253,
	RT_SCOPE_HOST = 254,
	RT_SCOPE_NOWHERE = 255
};
- Additionally, a route entry also carries a few more pieces of information, like the MTU, priority, protocol id, metrics, etc.
What should be the action on the packet based on this info?
Based on the route lookup result, the packet is given to ip_local_deliver() in the case of local delivery, or to ip_forward() in the case of forwarding. In the forwarding case, the packet is sent to the nexthop (found in the route lookup) via the output interface, and it continues its journey to the destination address. A simplified sketch of this handler selection is below.
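This is only a simplified sketch modeled on the logic in ip_route_input_slow() (net/ipv4/route.c); the helper name set_input_handler() is just for illustration, not a kernel API.

#include <linux/rtnetlink.h>	/* RTN_LOCAL */
#include <net/dst.h>		/* struct dst_entry */
#include <net/ip.h>		/* ip_local_deliver(), ip_forward() */

/* Simplified sketch: the route type decides which handler the packet
 * will be given to. */
static void set_input_handler(struct dst_entry *dst, unsigned char rt_type)
{
	if (rt_type == RTN_LOCAL)
		dst->input = ip_local_deliver;	/* terminate on this host */
	else
		dst->input = ip_forward;	/* continue toward the nexthop */
}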
Now let's see where the above information is stored in the kernel.
Essential kernel data structures
Forwarding Information Base (FIB): For any given destination address, the routing subsystem is expected to give a route entry with the information discussed above. Such routing information (either statically configured or dynamically populated by the routing protocols) is stored in a kernel database called the FIB.
'struct fib_table' is the data structure that represents a FIB table in the kernel.
struct fib_table {
	struct hlist_node	tb_hlist;
	u32			tb_id;		/* the table id, between 0 and 255 */
	int			tb_default;
	int			tb_num_default;	/* number of default routes in this table */
	unsigned long		tb_data[0];
	......................
	please find the complete structure definition at include/net/ip_fib.h
	......................
};
At boot time, two FIB tables are created by default, called RT_TABLE_MAIN and RT_TABLE_LOCAL, with table ids 254 and 255 respectively. More FIB tables are created when policy based routing is enabled. A minimal sketch of fetching these tables is below.
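The default tables can be fetched by id with fib_get_table(), declared in include/net/ip_fib.h. A minimal sketch, assuming a kernel-module context and 3.17-era APIs (show_default_tables() is an illustrative name):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <net/ip_fib.h>

/* Look up the two default FIB tables by their well-known ids. */
static void show_default_tables(struct net *net)
{
	struct fib_table *local_tbl, *main_tbl;

	rcu_read_lock();	/* FIB tables are RCU protected */
	local_tbl = fib_get_table(net, RT_TABLE_LOCAL);	/* id 255 */
	main_tbl  = fib_get_table(net, RT_TABLE_MAIN);	/* id 254 */
	if (local_tbl && main_tbl)
		pr_info("local: %u, main: %u\n",
			local_tbl->tb_id, main_tbl->tb_id);
	rcu_read_unlock();
}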
Kernel API          | Purpose
fib_trie_table()    | Creates a TRIE table
fib_create_info()   | Creates a route entry, struct fib_info
fib_table_insert()  | Inserts a route into the table
fib_table_delete()  | Deletes a route entry from the table
A route entry in the kernel is 'struct fib_info', which holds the routing information that we discussed above (output device, nexthop, scope, priority, metrics, etc.).
struct fib_info {
	struct hlist_node	fib_hash;
	struct hlist_node	fib_lhash;
	int			fib_treeref;
	unsigned int		fib_flags;
	unsigned char		fib_dead;
	unsigned char		fib_protocol;	/* routing protocol indicator for this route entry */
	unsigned char		fib_scope;
	unsigned char		fib_type;
	u32			fib_priority;
	u32			*fib_metrics;
	int			fib_nhs;
	struct fib_nh		fib_nh[0];	/* yes, it is an array; please refer to the Multipath routing section below */
	......................
	please find the complete structure definition at include/net/ip_fib.h
	......................
};
The destination cache used by the routing engine: this improves the performance of the Linux routing engine. 'struct dst_entry' holds the critical information required to process a packet going to each destination. It is created after the route lookup, and the packet (struct sk_buff) itself holds a pointer to this entry, not to the fib_info. So after the route lookup, at any point during the packet's journey through the kernel network stack, the destination cache can be referred to using the skb_dst() API, as in the sketch below.
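A minimal sketch of consulting the cached entry; skb_dst() and dst_mtu() are helpers from include/net/dst.h, and mtu_of_route() is an illustrative name. It assumes the skb has already been through route lookup, so skb_dst() is non-NULL:

#include <net/dst.h>

/* Read the path MTU from the destination cache attached to a packet. */
static unsigned int mtu_of_route(const struct sk_buff *skb)
{
	const struct dst_entry *dst = skb_dst(skb);

	/* dst_mtu() reads the MTU metric, falling back to the device MTU */
	return dst_mtu(dst);
}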
struct dst_entry {
	struct rcu_head		rcu_head;
	struct dst_entry	*child;
	struct net_device	*dev;
	struct dst_ops		*ops;
	unsigned long		_metrics;
	unsigned long		expires;
	struct dst_entry	*path;
	struct dst_entry	*from;
#ifdef CONFIG_XFRM
	struct xfrm_state	*xfrm;
#else
	void			*__pad1;
#endif
	int			(*input)(struct sk_buff *);
	int			(*output)(struct sk_buff *);
	unsigned short		flags;
	......................
	please find the complete structure definition at include/net/dst.h
	......................
};
where the input() and output() function pointers point to ip_local_deliver()/ip_forward() and ip_output() respectively. They are filled in based on the route lookup result, as we discussed above.
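The dispatch through these pointers is exactly what the kernel's dst_input()/dst_output() helpers in include/net/dst.h do. A simplified sketch of the input side (my_dst_input() is an illustrative name; the real helper is dst_input()):

#include <net/dst.h>

/* Simplified sketch of dst_input(): after the route lookup, the stack
 * simply calls through the function pointer in the dst_entry, which
 * resolves to ip_local_deliver() or ip_forward(). */
static inline int my_dst_input(struct sk_buff *skb)
{
	return skb_dst(skb)->input(skb);
}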
How it works
How route lookup works?
As shown in the table further below, there are quite a few kernel APIs available to do the route lookup for us. We can choose the appropriate API based on the available input data and the direction of the packet (ingress or egress). All of them boil down to the core FIB lookup API, fib_lookup(). This method fills a 'struct fib_result' on a successful route lookup. The structure is below:
struct fib_result {
	unsigned char		prefixlen;
	unsigned char		nh_sel;
	unsigned char		type;
	unsigned char		scope;
	u32			tclassid;
	struct fib_info		*fi;
	struct fib_table	*table;
	struct list_head	*fa_head;
	......................
	please find the complete structure definition at include/net/ip_fib.h
	......................
};
fib_lookup() searches both default tables for a matching route: it first searches RT_TABLE_LOCAL, and if there is no match, the lookup is done in the main table (RT_TABLE_MAIN). Using the 'fib_result', the destination cache entry (dst_entry) is created for this destination, and its input() and output() function pointers are assigned properly, as we discussed above.
fib_lookup() expects a 'struct flowi4' as the input argument for the table lookup. This object carries the source and destination addresses, the ToS value and more, as below.
struct flowi4 {
	struct flowi_common	__fl_common;
#define flowi4_oif	__fl_common.flowic_oif
#define flowi4_iif	__fl_common.flowic_iif
#define flowi4_mark	__fl_common.flowic_mark
#define flowi4_tos	__fl_common.flowic_tos
#define flowi4_scope	__fl_common.flowic_scope
#define flowi4_proto	__fl_common.flowic_proto
#define flowi4_flags	__fl_common.flowic_flags
#define flowi4_secid	__fl_common.flowic_secid
	/* (saddr,daddr) must be grouped, same order as in IP header */
	__be32			saddr;
	__be32			daddr;
	union flowi_uli		uli;
#define fl4_sport	uli.ports.sport
#define fl4_dport	uli.ports.dport
#define fl4_icmp_type	uli.icmpt.type
#define fl4_icmp_code	uli.icmpt.code
#define fl4_ipsec_spi	uli.spi
#define fl4_mh_type	uli.mht.type
#define fl4_gre_key	uli.gre_key
	......................
	please find the complete structure definition at include/net/flow.h
	......................
} __attribute__((__aligned__(BITS_PER_LONG/8)));
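Tying these together, here is a minimal sketch (kernel-module context, 3.17-era signatures) that builds a flowi4 key for a destination address and resolves it with fib_lookup(); lookup_dst() is an illustrative name:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/rtnetlink.h>
#include <net/ip_fib.h>

/* Resolve a destination address against the FIB; fib_lookup()
 * returns 0 on success and fills 'res'. */
static int lookup_dst(struct net *net, __be32 daddr)
{
	struct flowi4 fl4 = {
		.daddr		= daddr,
		.flowi4_tos	= 0,
		.flowi4_scope	= RT_SCOPE_UNIVERSE,
	};
	struct fib_result res;
	int err;

	rcu_read_lock();
	err = fib_lookup(net, &fl4, &res);
	if (!err)
		pr_info("type %u, scope %u, prefixlen %u\n",
			res.type, res.scope, res.prefixlen);
	rcu_read_unlock();
	return err;
}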
A set of well-known kernel APIs for route lookup
Kernel API               | Defined in
ip_route_input()         | include/net/route.h
ip_route_input_noref()   | net/ipv4/route.c
ip_route_input_slow()    | net/ipv4/route.c
ip_route_input_mc()      | net/ipv4/route.c
ip_route_output()        | include/net/route.h
ip_route_output_key()    | include/net/route.h
ip_route_output_flow()   | net/ipv4/route.c
ip_route_output_ports()  | include/net/route.h
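As an example of the egress side, ip_route_output_key() returns a 'struct rtable', which embeds the dst_entry discussed above. A minimal sketch assuming kernel-module context (egress_lookup() is an illustrative name):

#include <linux/err.h>
#include <linux/kernel.h>
#include <net/route.h>

/* Egress route lookup: resolve a destination and print the output
 * device. The lookup takes a reference on the rtable, so release it. */
static int egress_lookup(struct net *net, __be32 daddr)
{
	struct flowi4 fl4 = { .daddr = daddr };
	struct rtable *rt;

	rt = ip_route_output_key(net, &fl4);
	if (IS_ERR(rt))
		return PTR_ERR(rt);

	pr_info("out dev: %s\n", rt->dst.dev->name);
	ip_rt_put(rt);		/* drop the reference taken by the lookup */
	return 0;
}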
route configuration, behind the curtain!
Kernel routing tables can be managed using the standard tools provided in the iproute2 (ip command) and net-tools (route, netstat, etc.) packages, where the iproute2 package uses NETLINK sockets to talk to the kernel, and the net-tools package uses IOCTLs.
As you know, NETLINK is an extension of the generic socket framework (I will discuss it in a separate article). NETLINK_ROUTE is the netlink message type used to link the admin commands to the routing subsystem. The most important rtnetlink routing commands and their corresponding kernel handlers are noted below.
RTM_NEWROUTE => inet_rtm_newroute()
RTM_DELROUTE => inet_rtm_delroute()
RTM_GETROUTE => inet_dump_fib()
Similarly, socket IOCTLs use the following commands:
#define SIOCADDRT 0x890B /* add routing table entry */
#define SIOCDELRT 0x890C /* delete routing table entry */
#define SIOCRTMSG 0x890D /* call to routing system */
ip_rt_ioctl() is the kernel handler for all these IOCTL
commands.
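To see the IOCTL path end to end, here is a small user-space sketch that takes the same route into ip_rt_ioctl() as net-tools' 'route add' command; the device name and addresses are just examples:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <net/route.h>
#include <arpa/inet.h>

/* Equivalent of: route add -net 10.10.10.0 netmask 255.255.255.0 dev eth0
 * (needs CAP_NET_ADMIN, i.e. run as root) */
int main(void)
{
	struct rtentry rt;
	struct sockaddr_in *sin;
	char dev[] = "eth0";
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&rt, 0, sizeof(rt));

	sin = (struct sockaddr_in *)&rt.rt_dst;
	sin->sin_family = AF_INET;
	inet_pton(AF_INET, "10.10.10.0", &sin->sin_addr);

	sin = (struct sockaddr_in *)&rt.rt_genmask;
	sin->sin_family = AF_INET;
	inet_pton(AF_INET, "255.255.255.0", &sin->sin_addr);

	rt.rt_flags = RTF_UP;
	rt.rt_dev = dev;

	if (ioctl(fd, SIOCADDRT, &rt) < 0)	/* lands in ip_rt_ioctl() */
		perror("SIOCADDRT");
	close(fd);
	return 0;
}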
Now let's discuss two well-known extensions of routing: policy based routing and multipath routing.
Policy based routing
What is it?
As we discussed, only two FIB tables (LOCAL and MAIN) are created at network stack boot time. Policy routing is a feature which allows us to create up to 255 routing tables, which extends the control and flexibility in routing decisions. The admin has to attach every table to a specific 'rule', so that when a packet arrives in the routing framework, a matching rule is searched for and the corresponding table is picked for the route lookup (fib_rules_lookup() is the API that does this). See the example below; with that, any packet that comes in with ToS value 0x02 will hit routing table 190 in the route lookup.
/* add a rule for a table 190 */
# ip rule add tos 0x02 table 190
/* now add a route entry to this table*/
# ip route add 192.168.1.0/24 dev eth0 table 190
An IPv4 FIB rule is represented using 'struct fib4_rule', which embeds the generic 'struct fib_rule':
struct fib4_rule {
	struct fib_rule		common;
	u8			dst_len;
	u8			src_len;
	u8			tos;
	__be32			src;
	__be32			srcmask;
	__be32			dst;
	__be32			dstmask;
	......................
	please find the complete structure definition at net/ipv4/fib_rules.c
	......................
};
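The matching itself is simple; below is a simplified sketch modeled on fib4_rule_match() in net/ipv4/fib_rules.c (the real function takes the generic 'struct fib_rule' and 'struct flowi' and checks a couple of extra flags; fib4_rule_matches() here is an illustrative name):

/* Simplified sketch of the rule-match logic: a rule matches when the
 * masked source/destination addresses and the ToS value all agree
 * with the flowi4 lookup key. */
static int fib4_rule_matches(const struct fib4_rule *r,
			     const struct flowi4 *fl4)
{
	if (((fl4->saddr ^ r->src) & r->srcmask) ||
	    ((fl4->daddr ^ r->dst) & r->dstmask))
		return 0;			/* address mismatch */

	if (r->tos && r->tos != fl4->flowi4_tos)
		return 0;			/* ToS mismatch */

	return 1;
}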
How to enable policy routing?
CONFIG_IP_MULTIPLE_TABLES must be set in 'make menuconfig' to enable policy routing.
Multipath routing
What is it?
Multipath routing allows us to add more than one nexthop to a single route. The admin can assign a weight to each of the nexthops using the 'ip route' command, as below:
/* the network 10.1.1.0/24 could be reached via both 10.10.10.1 and
   172.20.20.1, where the weights are 2 and 4 resp. */
# ip route add 10.1.1.0/24 nexthop via 10.10.10.1 weight 2 nexthop via 172.20.20.1 weight 4
As you might have noticed, 'struct fib_info' (a route entry in the kernel) has two important fields for the nexthop: 'fib_nhs' and the 'struct fib_nh fib_nh[0]' array, where the former carries the number of nexthops of this route and the latter is the array of those nexthops.
struct fib_nh {
	struct net_device	*nh_dev;
	struct hlist_node	nh_hash;
	struct fib_info		*nh_parent;
	unsigned int		nh_flags;
	unsigned char		nh_scope;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
	int			nh_weight;
	int			nh_power;
#endif
#ifdef CONFIG_IP_ROUTE_CLASSID
	__u32			nh_tclassid;
#endif
	int			nh_oif;
	__be32			nh_gw;
	__be32			nh_saddr;
	int			nh_saddr_genid;
	struct rtable __rcu * __percpu *nh_pcpu_rth_output;
	struct rtable __rcu	*nh_rth_input;
	struct fnhe_hash_bucket	*nh_exceptions;
	......................
	please find the complete structure definition at include/net/ip_fib.h
	......................
};
When this feature is enabled in the kernel (CONFIG_IP_ROUTE_MULTIPATH during 'make menuconfig'), fib_select_multipath() is the kernel API used to select a nexthop for the given destination.
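The idea behind the selection can be sketched as a weighted pick over the fib_nh[] array. Note this is a simplification, not the kernel's exact algorithm: the real fib_select_multipath() keeps per-nexthop nh_power credits that are refilled from nh_weight, and pick_nexthop() is an illustrative name:

/* Simplified weighted nexthop selection over fib_info's fib_nh[]
 * array, in the spirit of fib_select_multipath(). 'rnd' is any
 * random number; nexthops with larger weights are picked more often. */
static int pick_nexthop(const struct fib_info *fi, u32 rnd)
{
	u32 total = 0;
	int nhsel;

	for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++)
		total += fi->fib_nh[nhsel].nh_weight;

	rnd %= total;			/* assumes at least one nexthop */
	for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
		if (rnd < (u32)fi->fib_nh[nhsel].nh_weight)
			return nhsel;
		rnd -= fi->fib_nh[nhsel].nh_weight;
	}
	return 0;
}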
Okay! Now what is the role of the routing protocols?
Routing protocols like OSPF, RIP, BGP, EGP, etc. run as user-space daemons and configure the above tables dynamically. They play an inevitable role in Internet nodes that handle thousands of routes, where static configuration is almost impossible.
With this I wind up the discussion. I hope it helps you look into the routing subsystem and dig further.