Details of DiffQ Implementation in Linux 2.6 Kernel
Introduction
This document describes the design and implementation of
DiffQ congestion control protocol. DiffQ has been implemented as a kernel module for the 2.6 series of the Linux kernel (2.6.18 onwards). This document aims to provide insight into implementation of DiffQ and its various features, to make it easier to understand, extend or use the actual code.
DiffQ is a packet level networking protocol which implies that it needs to process every single packet being transmitted or received in the Linux networking stack. DiffQ is primarily intended to be used and tested over IP networks. Hence, we concern ourselves with only IP packets in the IP stack in Linux. In addition, DiffQ also determines the MAC MAC priority of the packet being transmitted. Further, DiffQ dictates the rate at which a source application SHOULD generate traffic for the destination. To this mean DiffQ needs to provide a way for the conforming application to find the current rate and restrict non-conforming applications to the correct rates.
Based on all these requirements DiffQ needs (or provides) the following interfaces from (to) the kernel:
- Mechanism to capture, process and reinject all IP packets in the kernel
- Mechanism to control (i.e. assign) the MAC priority for each packet transmitted by the MAC
- Mechanism to change the PHY parameters (CWmin, CWmax, AIFS, TXOPLimit) for various MAC priorities
- Mechanism to provide information to upper layers
- Mechanism to restrict source rate of applications
In this document we will describe how each of this mechanism is implemented in DiffQ.
Background
This document assumes that the reader is familiar with Linux Kernel Programming,
NetFilter? API, Madwifi driver, etc. The References section includes links to several tutorials on these topics.
Implementation Details
1. Packet capture and processing mechanism
In Linux,
struct sk_buff represents a packet in the network stack. The IP packets traverse the Linux networking stack as a struct sk_buff pointers (struct sk_buff*) to prevent un necessary copying while the packets traverse networking layers. On the transmission side, the packet is allocates (as an sk_buff) when the transport layer (TCP or UDP) reads the data from the socket buffer (the socket API, used at user space, writes the data to the socket buffers). On the receiver side the packet is allocated (as an sk_buff) when it is received and validated (CRC) by the MAC.
NetFilter? API provides various hooks throughout the networking stack where packets can be captured from the stack. Any module interested in packet inspection can tap into
NetFilter? hooks by specifying the type of the hook and the handling function. When a packet arrives at a hook the packet is captured from the stack and passed to the hook handling function. There can be several handling functions at the same hook in which case the packet is passed to each of them sequentially starting with the handling function having the highest priority (the priority is specified while registering the hook). The packets can be modified and returned to the stack (NF_ACCEPT), dropped from the stack (NF_DROP) or queued for later re-injection (NF_QUEUE). The various positions in the stack where the hooks are available are shown in the figure below:
- PREROUTING (NF_IP_PRE_ROUTING): Packets arrive at this hook just after they are received by the MAC layer and before any routing decision is made.
- POSTROUTING (NF_IP_POST_ROUTING): Packets arrive at this hook after the routing decision has been made and before the packet is handed over to the MAC layer.
- LOCALOUT (NF_IP_LOCAL_OUT): Packets arrive at this hook when they are passed to IP layer from transport layer. This means that packets arriving in this hook have been generated by this machine.
- LOCALIN (NF_IP_LOCAL_IN): Packets arrive at this hook just before they are being passed to the transport layer. This means packets arriving in this hook are destined for this machine.
Let's assume that the Soekris nodes (which we will call just nodes) have ath0 and ath1 as two wireless interfaces and eth0 as the Ethernet interface. In the current implementation, the ath0 interface is used to connect to other nodes within the mesh network and ath2 interface is used to provide network access to 802.11 clients (laptops, PDAs, etc). Hence, ath0 works in ad hoc mode and ath1 works in master (AP) mode. All inter mesh traffic is generated either on the mesh nodes themselves or on the clients connected to the mesh nodes via the ath1 AP interface. We do not consider the scenario in which traffic can come on eth0 for the mesh network. This is a performance optimization since we use eth0 interface to connect to and control the mesh nodes using ssh. Hence, clients will send traffic via ath1 to a node (which will be forwarded using ath0) and application on the node will send traffic via ath0.
In DiffQ we have the following scenarios of packet handling:
- Packet is generated on this machine. This packet may be sent out on any interface.
- Packet is destined for this machine. This packet can arrive from any interface.
- Packet is to be forwarded via this machine. This packet can arrive on ath2 or ath0 (why not eth0? read the above paragraph)
Each packet in DiffQ belongs to a flow which is uniquely identified by four values - source IP address, source port, destination IP address, destination port. Flows are uni-directonal. Hence, a TCP connection sends data on one flow and receives ACK/data on another flow.
To handle all this traffic we have the following hooks:
- LOCALOUT (NF_IP_LOCALOUT_hook): This hooks traps packets which are generated on the node. At this hook level we know the destination interface of the packet. The following sequence of action are taken on packets trapped at this hook :
- If the packet is going out on eth0 or ath1 then the packet is returned back to the IP stack (by returning NF_ACCEPT). This is due to the reason that we only handle mesh traffic over DiffQ and eth0 is Ethernet and ath2 is AP interface.
- If the packet is going out of ath0 interface then we attach the packet with a DiffQ header. Then the packet is queued (by returning NF_QUEUE). This makes the stack pass the packet to a queue handler (function queue_handler which has been registered earlier as queue handler for IP packets). In the queue_handler function we enqueue the packet in flow level queues according to the DiffQ algorithm.
- LOCALIN (NF_IP_LOCALIN_hook): This hooks traps packets which are destined for this node. We check if the packet has DiffQ header on it and if the header is present then it is removed. The packet is then returned to the stack (by returning NF_ACCEPT). This makes the stack deliver the packet to the transport layer.
- PREROUTING (NF_IP_PRE_ROUTING_hook): This hook traps packets which are received from some network interface. If the packet is received from eth0 then it is returned to the stack (by returning NF_ACCEPT). If the packet arrives on ath0 then we check if the packet has DiffQ header. If the header is present then this packet is part of a DiffQ flow and hence we enqueue the packet by asking the stack to queue it to our queue handler (by returning NF_QUEUE). If the header is not present then the packet is not part of the DiffQ flow and is returned back to the stack (by returning NF_ACCEPT).
- POSTROUTING (NF_IP_POST_ROUTING): This hooks traps packets after routing decisions. Packets which have DiffQ header are not touched here (by returning NF_ACCEPT). This hook exists to trap packets in special cases which we will discuss later. If a qualifying packet is found then it is treated similar to LOCALOUT packet.
1.1 DiffQ header
Each DiffQ packet is appended with a DiffQ header. The header contains various fields which are used by the DiffQ algorithm. The header can be put either in front of the IP header (complete encapsulation of the IP packet) or after IP header (transport layer encapsulation encapsulation) or at the end of data. If we put the header before IP header then Linux would not recognize the packet as IP packet and it would need considerable more effort to capture and process packets. Since we want to use much of IP functionality like routing, addressing, etc. it makes more sense to let Linux handle the IP stuff. If we add the header after IP header and before TCP header then there are two problems. Firstly, we need to move large amount of data to make way for the new header (since sk_buff data area is allocated contiguously) and secondly, the sniffers like Wireshark which we intend to use for testing of DiffQ would not recognize either the DiffQ header or the transport header.
In the current design (in the next version the header will be moved to an IP Option. See TODO) the header is attached at the end of the packet. The transport header does not require to be modified since we remove the DiffQ header before passing it to transport layer on the receiver side. The IP header is modified to reflect the new size of the packet.
The DiffQ header has the following fields:
- Type (8bits): The type of the flow that this packet belongs to. Either RealTime? (RT) or Best Effort (BE). RT has two sub types - Soft RealTime? (SRT) and Hard RealTime? (HRT),
- Flowid (8 bits): The source assigned id for the current flow (to be removed in next version).
- Qlen (16 bits): The length of the packet queue of the flow that this packet belongs to. This field is updated by each node which transmits this packet.
- TTL (8 bits): The total number of hops that this packet should travel. Decremented at each hop.
- Original_expiry (16bits): For RT flows, this field is set by the source node to reflect the lifetime (in msec) of the packet once the source sends it out. For BE flows this field is set to 0.
- Expiry(16 bits): For RealTime? flows, this field signifies the amount of time left for the packet to expire. It is updated at each intermediate node in the packet route (by subtracting the total time spend on the intermediate node from this value). This field is initialized to original_expiry at the source node. (Why both original_expiry and expiry are required in the header? See FAQ)
- Timestamp (64 bits): The timestamp at the source node when this packet was transmitted. Timestamp is used in expiry time calculation. (See FAQ)
- Sequence number (32 bits): A source initiated sequence number which incrementally numbers each packet of a flow. This is used only for testing purpose to trace packets.
1.2 Packet Enqueue
All packets to be enqueued are added to a flow specific queue. Appropriate locking is done (spinlock and rwlock) for synchronization. For re-injecting queued packets, we need to save some state variables with each packet. Hence, a structure (struct Packet) is used to encapsulate the packet and the state. Packets are enqueue to the flow queue as struct Packet * which has the following fields:
- struct sk_buff *skbuff: This is the sk_buff which contains the packet to be enqueued.
- FlowEntry? *flow: This pointer links to the flow to which this packet belongs. This pointer is stored so that we do not need to search for the packet's flow when the packet is dequeued from queue.
- struct nf_info *nfinfo: This is the state to be maintained for packet re-injection.
There are two global linked lists - one of all RT flows and the other of all BE flows. Whenever a new flow is created it is added to the appropriate linked list.
1.3 Schedular
The Schedular is the component which schedules the queued packets for transmission. The scheduler waits on a wait_queue for packets to be added to the flow queues. The Schedular works as follows:
- Finds the number of slots in the MAC which are free. Though the MAC can buffer hundreds of packets at any instant, we do not want the MAC queue to go above a threshold level (varies from test to test, can be chosen as low as 1 or high as 100). It is possible to modify the threshold dynamically.
- If the MAC already have threshold number of packets enqueued then packets are not scheduled till some packets are transmitted. DiffQ reads the MAC queue size from the structure which represents the device (ath0). This size is decremented each time MAC sends a packet (actually when the packet is sent to the hardware, not sent on the air).
- Given that some packets can be transmitted, the scheduler looks into the global linked lists, first in the RT list and then in the BE flows. RT flows will earliest expiry are given preference. When all RT flows have been serviced and if possible BE flows are scheduled. BE flows with maximum queue differential (queue size of a flow at node - queue size of the same flow at next hop node) are given preference.
- To service a flow, the Schedular dequeues a packet from the flow queue and re-injects the packet back to the stack. DiffQ header update (expiry, ttl, qlen) and flow level statistical update (decrease of queue size by 1, recalculation of queue differential for BE flows, etc) are done before re-injecting the packet
- Once reinjected, the packet traverses the remaining part of the IP stack to the MAC layer and finally over the air.
- The MAC layer priority of the packet (currently 0 - 7, with 7 being the highest) is loaded into the TOS field of IP header to inform MAC of which priority to use while transmitting the packet.
1.4 MAC Priority Determination
In DiffQ we assume that several MAC priority levels are available. In the current MAC (modified madwifi-ng driver) 8 priority levels are implemented. This allows 4 priority levels for RT (PRIO4, PRIO5, PRIO6, PRIO7) and 4 for BE traffic (PRIO0, PRIO1, PRIO2, PRIO3). The priority levels for RT are higher than those of BE. The packets are scheduled as per the mechanism described above. Whenever a packet is scheduled, a MAC priority is calculated for it according to the following logic:
- For RT packets, the expiry time is the measure of the importance of packet; lower the expiry the higher the priority. Hence expiry time is used to assign the packet into one of 4 RT priority levels. There can be
- For BE packets the queue differential (size of queue on the node - size of queue on the next hop node) is taken as the priority of the packet; higher the queue differential the higher the priority.
There are several mechanism to allocate packet to one of the priority levels based on the expiry time or the queue differential:
- Linear: In this case the priority of a packet is determined as Priority = (Expiry Time / RT_PRIO_RESOLUTION) % 4. This gives a priority from 0 to 3 which is directly mapped to PRIO4, PRIO5, PRIO6 or PRIO7. the RT_PRIO_RESOLUTION is a configuration parameter which specifies the resolution of priority change. For example, if RT_PRIO_RESOLUTION is 25, then for every 25ms that the packet spends on the route, its priority increases by 1.
- Exponential: In the exponential case the priority is checking if the expiry time lies in one of the levels. The range of the levels is exponentially distributed (example: 0 - 5ms, 6 - 25, 26 - 125, >126).
The same logic is applied for the BE packets using the queue differential instead of expiry time and using BE_PRIO_RESOLUTION constant.
2. MAC control mechanism
DiffQ uses a modified madwifi-ng driver where in 10 MAC priority levels have been implemented. This is an extension of the 802.11e scheme which implemented 4 priority levels. Since the driver is a separate module, the TOS field of the IP header is used to pass the priority information to the MAC driver. Since there are 8 priority levels, TOS can be set from 0 to 7 to instruct the driver to use the appropriate priority. The driver has code which checks the TOS field and puts the packet in the appropriate hardware queue. Most legacy applications do not use IP TOS and hence TOS field is 0 by default. Hence, the same driver can be used unmodified even when DiffQ is not loaded.
3. PHY control mechanism
The parameters for various queues (CWmin, CWmax, AIFS, TXOPLimit) as well as general PHY parameters (channel, data rate, power) can be modified at any time using the interface provided by the MAC driver (See help on commands wlanconfig, iwpriv, iwlist, iwconfig). Scripts can be written to automate the settings while running any tests.
4. Information export mechanism
To provide parameters to the user space application, DiffQ has a proc based interface to the userspace.
- cat /proc/net/diffq/rt/flows: This provides statistical information on all the RT flows that are present on the node.
- cat /proc/net/diffq/be/flows: This provides statistical information on all the BE flows that are present on the node.
- /proc/net/diffq/ctrl: This implements an ioctl interface to modify some of DiffQ parameters and to read the queue length for any flow. See the section on source code for more details.
In the new version of DiffQ, the information export mechanism will be migrated to a more convenient solution which can be directly used without having to write program to use ioctl call. (Maybe a program like diffqconfig similar to wlanconfig)
5. Source rate control mechanism
The sorce rate control mechanism is designed to restrict the rate of a source to the optimum available end to end rate. The hop by hop congestion control feature of DiffQ creates a constant back-pressure over the hops between destination to source. This is how DiffQ controls congestion. The basic logic is that between two time instants that the queue size is read for a given flow:
- if the queue size has increased, then the flow could not send enough packets during the time interval and hence it should decrease its rate.
- if the queue size has decreased, then the flow could send enough packets during the time interval and hence it should increase its rate.
- if the queue size has not changed, then the flow is on an equilibrium utilization and hence it should slightly increase its rate (this is to probe the network).
An application can read the source rate of a flow from the DiffQ ioctl interface. The mechanism of rate decrease (AIMD, MIMD, linear, etc) are not dictated by DiffQ. We are experimenting with various schemes.
Since the queue length is a per node statistic, congestion control is applied on each node of the flow rather than just the end nodes of a flow. This leads to better adaptation of the flow to the changing network conditions. The scheduling helps in prioritizing the flows on the same node and MAC priority helps in prioritizing flows between separate nodes.
We have a sample rate controlled application rctx (transmission) and rcrx (receiver) which behaves like iperf but demonstrates rate control. See the section on source code for more details.
A Look into the code
Modifications to madwifi :
1. ieee80211_wme_initparams_locked in net80211/ieee80211_proto.c
2. ath/if_ath.c
FAQ
(You can email
pwason@ncsu.edu for any additional question/queries)
TODO
- Header overhaul. Remove redundant fields from the header.
- Move the DiffQ header to IP Options. This would solve checksum and MTU size problems.
- Cleanup the code.
- Implementing diffqconfig
- Caching of recently accessed flows to speed up packet enqueues.
References