Reprinted with permission of Linux Journal, from issue 29, September 1996. Some changes have been made to accommodate the web. This article was originally written for the Kernel Korner column. The Kernel Korner series has included many other articles of interest to Linux kernel hackers, as well.
The Linux operating system implements the industry-standard Berkeley socket API, which has its origins in the BSD Unix developments (4.2/4.3/4.4 BSD). In this article, we will look at the way the memory management and buffering is implemented for network layers and network device drivers under the existing Linux kernel, as well as explain how and why some things have changed over time.
The networking layer tries to be fairly object-oriented in its design, as indeed is much of the Linux kernel. The core structure of the networking code goes back to the initial networking and socket implementations by Ross Biro and Orest Zborowski respectively. The key objects are:
The primary goal of the sk_buff routines is to provide a consistent and efficient buffer handling method for all of the network layers, and by being consistent to make it possible to provide higher level sk_buff and socket handling facilities to all the protocols.
An sk_buff is a control structure with a block of memory attached. There are two primary sets of functions provided in the sk_buff library: first, routines to manipulate doubly linked lists of sk_buffs; second, functions for controlling the attached memory. The buffers are held on linked lists optimised for the common network operations of append to end and remove from start. As so much of the networking functionality occurs during interrupts, these routines are written to be atomic. The small extra overhead this causes is well worth the pain it saves in bug hunting.
We use the list operations to manage groups of packets as they arrive from the network, and as we send them to the physical interfaces. We use the memory manipulation routines for handling the contents of packets in a standardised and efficient manner.
At its most basic level, a list of buffers is managed using functions like this:
void append_frame(char *buf, int len)
{
    struct sk_buff *skb=alloc_skb(len, GFP_ATOMIC);
    if(skb==NULL)
        my_dropped++;                   /* no memory: count the drop */
    else
    {
        skb_put(skb,len);               /* claim len bytes of data area */
        memcpy(skb->data,buf,len);
        skb_queue_tail(&my_list, skb);  /* add to the end of the list */
    }
}
void process_queue(void)
{
    struct sk_buff *skb;
    while((skb=skb_dequeue(&my_list))!=NULL)
    {
        process_data(skb);
        kfree_skb(skb, FREE_READ);
    }
}
These two fairly simplistic pieces of code actually demonstrate the receive packet mechanism quite accurately. The append_frame() function is similar to the code called from an interrupt by a device driver receiving a packet, and process_queue() is similar to the code called to feed data into the protocols. If you go and look in net/core/dev.c at netif_rx() and net_bh(), you will see that they manage buffers similarly. They are far more complex, as they have to feed packets to the right protocol and manage flow control, but the basic operations are the same. This is just as true if you look at buffers going from the protocol code to a user application.
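The "append to end, remove from start" list discipline can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel code: the toy_buf and toy_queue types and all toy_ function names are invented for illustration, and unlike the real routines nothing here protects against interrupts.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace stand-ins; these are NOT the kernel's sk_buff types. */
struct toy_buf {
    struct toy_buf *next, *prev;
    int id;                      /* placeholder for packet contents */
};

struct toy_queue {
    struct toy_buf head;         /* sentinel: never holds data */
    unsigned int qlen;
};

/* An empty queue is a sentinel that points at itself. */
static void toy_queue_init(struct toy_queue *q)
{
    q->head.next = q->head.prev = &q->head;
    q->qlen = 0;
}

/* Append to the end: link the buffer in just before the sentinel. */
static void toy_queue_tail(struct toy_queue *q, struct toy_buf *b)
{
    assert(q != NULL && b != NULL);
    b->next = &q->head;
    b->prev = q->head.prev;
    q->head.prev->next = b;
    q->head.prev = b;
    q->qlen++;
}

/* Remove from the start; returns NULL when the queue is empty. */
static struct toy_buf *toy_dequeue(struct toy_queue *q)
{
    struct toy_buf *b = q->head.next;

    if (b == &q->head)
        return NULL;
    q->head.next = b->next;
    b->next->prev = &q->head;
    b->next = b->prev = NULL;
    q->qlen--;
    return b;
}
```

Because the list is circular with a sentinel, both operations are constant time and need no special case for an empty or single-element list.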
The example also shows the use of one of the data control functions, skb_put(). Here it is used to reserve space in the buffer for the data we wish to pass down.
Let's look at append_frame(). The alloc_skb() function obtains a buffer of len bytes (Figure 1), which consists of:
Immediately after a buffer has been allocated, all the available room is at the end. A further function named skb_reserve() (Figure 2), which can be called before data is added, allows you to specify that some of the room should be at the beginning. Thus, many sending routines start with something like:
skb=alloc_skb(len+headspace, GFP_KERNEL);
skb_reserve(skb, headspace);
skb_put(skb,len);
memcpy_fromfs(skb->data,data,len);
pass_to_m_protocol(skb);
In systems such as BSD Unix you don't need to know in advance how much space you will need, as it uses chains of small buffers (mbufs) for its network buffers. Linux chooses to use linear buffers and save space in advance (often wasting a few bytes to allow for the worst case) because linear buffers make many other things much faster.
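The way reserving, putting, pushing and pulling move the data area around inside one linear block is easier to see with plain pointers. This is a userspace sketch under assumptions: struct toy_skb and the toy_ helpers are hypothetical and only mimic the behaviour described in this article; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of a linear buffer; NOT the kernel's struct sk_buff. */
struct toy_skb {
    unsigned char *head;   /* start of the allocated block */
    unsigned char *data;   /* start of valid data */
    unsigned char *tail;   /* end of valid data */
    unsigned char *end;    /* end of the allocated block */
};

static struct toy_skb *toy_alloc(size_t size)
{
    struct toy_skb *skb = malloc(sizeof *skb);

    if (skb == NULL)
        return NULL;
    skb->head = malloc(size ? size : 1);
    if (skb->head == NULL) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;   /* all room is at the end */
    skb->end = skb->head + size;
    return skb;
}

static void toy_free(struct toy_skb *skb)
{
    if (skb != NULL) {
        free(skb->head);
        free(skb);
    }
}

static size_t toy_headroom(const struct toy_skb *skb) { return (size_t)(skb->data - skb->head); }
static size_t toy_tailroom(const struct toy_skb *skb) { return (size_t)(skb->end - skb->tail); }
static size_t toy_len(const struct toy_skb *skb)      { return (size_t)(skb->tail - skb->data); }

/* Move some of the free room to the front; only valid while empty. */
static void toy_reserve(struct toy_skb *skb, size_t n)
{
    assert(toy_len(skb) == 0 && n <= toy_tailroom(skb));
    skb->data += n;
    skb->tail += n;
}

/* Extend the data area at the end; returns where the new bytes go. */
static unsigned char *toy_put(struct toy_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert(n <= toy_tailroom(skb));
    skb->tail += n;
    return old_tail;
}

/* Extend the data area at the front, for example to add a header. */
static unsigned char *toy_push(struct toy_skb *skb, size_t n)
{
    assert(n <= toy_headroom(skb));
    skb->data -= n;
    return skb->data;
}

/* Drop bytes from the front, for example to strip a header. */
static unsigned char *toy_pull(struct toy_skb *skb, size_t n)
{
    assert(n <= toy_len(skb));
    skb->data += n;
    return skb->data;
}
```

Reserving 14 bytes, putting 20 bytes of payload and then pushing a 14 byte header leaves a 34 byte frame that starts at the very beginning of the block, with no copying of the payload.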
Now to return to the list functions. Linux provides the following operations:
The semantics of allocating and queueing buffers for sockets also involve flow control rules and, for sending, a whole list of interactions with signals and optional settings such as non-blocking. Two routines are designed to make this easy for most protocols.
The sock_queue_rcv_skb() function is used to handle incoming data flow control and is normally used in the form:
sk=my_find_socket(whatever);
if(sock_queue_rcv_skb(sk,skb)==-1)
{
    myproto_stats.dropped++;
    kfree_skb(skb,FREE_READ);
    return;
}
This function uses the socket read queue counters to prevent vast amounts of data being queued to a socket. After a limit is hit, data is discarded. It is up to the application to read fast enough, or as in TCP, for the protocol to do flow control over the network. TCP actually tells the sending machine to shut up when it can no longer queue data.
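The receive-side limit amounts to simple bookkeeping: each queued buffer is charged against a per-socket budget, and a buffer that would exceed the budget is refused. The sketch below is a userspace model with invented names (toy_sock, toy_queue_rcv, toy_user_read) and an assumed comparison rule; the real kernel accounting may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Toy socket: only the receive-side counters are modelled. */
struct toy_sock {
    size_t rmem_alloc;   /* bytes currently queued for the reader */
    size_t rcvbuf;       /* most we are willing to queue */
};

/* Returns 0 if the buffer was queued, -1 if the caller must drop it. */
static int toy_queue_rcv(struct toy_sock *sk, size_t size)
{
    if (sk->rmem_alloc + size > sk->rcvbuf)
        return -1;
    sk->rmem_alloc += size;
    return 0;
}

/* The application reading data gives the room back. */
static void toy_user_read(struct toy_sock *sk, size_t size)
{
    assert(size <= sk->rmem_alloc);
    sk->rmem_alloc -= size;
}
```

With a 1000 byte budget, a 600 byte buffer is accepted, a second one is refused, and it is accepted again once the application has read the first.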
On the sending side, sock_alloc_send_skb() handles signal handling, the non blocking flag, and all the semantics of blocking until there is space in the send queue so you cannot tie up all of memory with data queued for a slow interface. Many protocol send routines have this function doing almost all the work:
skb=sock_alloc_send_skb(sk,....);
if(skb==NULL)
    return -err;
skb->sk=sk;
skb_reserve(skb, headroom);
skb_put(skb,len);
memcpy(skb->data, data, len);
protocol_do_something(skb);
Most of this we have met before. The very important line is skb->sk=sk. The sock_alloc_send_skb() call has charged the memory for the buffer to the socket. By setting skb->sk we tell the kernel that whoever does a kfree_skb() on the buffer should cause the socket to be credited the memory for the buffer. Thus when a device has sent a buffer and frees it, the user will be able to send more.
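The charge-and-credit cycle on the send side can be modelled in a few lines. This is a sketch under assumptions, not the kernel implementation: toy_sock, toy_skb, toy_alloc_send and toy_kfree are hypothetical names, and the allocator returns NULL where the real routine might block.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_sock {
    size_t wmem_alloc;   /* bytes charged to this socket for sending */
    size_t sndbuf;       /* send budget */
};

struct toy_skb {
    struct toy_sock *sk; /* who gets the credit when this is freed */
    size_t size;
};

/* Charge the socket for a new buffer; NULL if over budget. */
static struct toy_skb *toy_alloc_send(struct toy_sock *sk, size_t size)
{
    struct toy_skb *skb;

    if (sk->wmem_alloc + size > sk->sndbuf)
        return NULL;
    skb = malloc(sizeof *skb);
    if (skb == NULL)
        return NULL;
    skb->sk = NULL;      /* the caller sets this, as in the article */
    skb->size = size;
    sk->wmem_alloc += size;
    return skb;
}

/* Free a buffer and credit its owner, if one was recorded. */
static void toy_kfree(struct toy_skb *skb)
{
    if (skb->sk != NULL) {
        assert(skb->sk->wmem_alloc >= skb->size);
        skb->sk->wmem_alloc -= skb->size;
    }
    free(skb);
}
```

In this model, forgetting to set the owner pointer means the charge is never returned, so the socket would eventually be unable to send at all.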
All Linux network devices follow the same interface although many functions available in that interface will not be needed for all devices. An object oriented mentality is used and each device is an object with a series of methods that are filled into a structure. Each method is called with the device itself as the first argument. This is done to get around the lack of the C++ concept of this within the C language.
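The "structure of methods, device as first argument" idiom looks roughly like this in plain C. The struct toy_device type and its members are made up for illustration and are much smaller than the kernel's device structure.

```c
#include <assert.h>
#include <stddef.h>

/* A toy device object: data members plus function-pointer "methods". */
struct toy_device {
    const char *name;
    int up;
    unsigned long tx_packets;
    int (*open)(struct toy_device *dev);
    int (*stop)(struct toy_device *dev);
    int (*xmit)(struct toy_device *dev, const void *frame, size_t len);
};

/* Each method receives the device explicitly, standing in for "this". */
static int toy_open(struct toy_device *dev) { dev->up = 1; return 0; }
static int toy_stop(struct toy_device *dev) { dev->up = 0; return 0; }

static int toy_xmit(struct toy_device *dev, const void *frame, size_t len)
{
    (void)frame;
    (void)len;
    if (!dev->up)
        return -1;
    dev->tx_packets++;
    return 0;
}

static struct toy_device toy_dev0 = {
    "toy0", 0, 0, toy_open, toy_stop, toy_xmit
};

/* Generic code calls through the pointers and skips absent methods. */
static int toy_bring_up(struct toy_device *dev)
{
    assert(dev != NULL);
    return dev->open != NULL ? dev->open(dev) : 0;
}
```

The caller never needs to know which driver it is talking to; a driver that does not need a particular method can leave the pointer NULL, provided the generic code checks for that.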
The file drivers/net/skeleton.c contains the skeleton of a network device driver. View or print a copy from a recent kernel and follow along throughout the rest of the article.
Each network device deals entirely in the transmission of network buffers from the protocols to the physical media, and in receiving and decoding the responses the hardware generates. Incoming frames are turned into network buffers, identified by protocol and delivered to netif_rx(). This function then passes the frames off to the protocol layer for further processing.
Each device provides a set of additional methods for the handling of stopping, starting, control and physical encapsulation of packets. These and all the other control information are collected together in the device structures that are used to manage each device.
All Linux network devices have a unique name. This is not in any way related to the file system names devices may have, and indeed network devices do not normally have a filesystem representation, although you may create a device which is tied to device drivers. Traditionally the name indicates only the type of a device rather than its maker. Multiple devices of the same type are numbered upwards from 0. Thus ethernet devices are known as ``eth0'', ``eth1'', ``eth2'' etc. The naming scheme is important as it allows users to write programs or system configuration in terms of ``an ethernet card'' rather than worrying about the manufacturer of the board and forcing reconfiguration if a board is changed.
The following names are currently used for generic devices:
If possible, a new device should pick a name that reflects existing practice. When you are adding a whole new physical layer type you should look for other people working on such a project and use a common naming scheme.
Certain physical layers present multiple logical interfaces over one media. Both ATM and Frame Relay have this property, as does multi-drop KISS in the amateur radio environment. Under such circumstances a driver needs to exist for each active channel. The Linux networking code is structured in such a way as to make this manageable without excessive additional code, and the name registration scheme allows you to create and remove interfaces almost at will as channels come into and out of existence. The proposed convention for such names is still under some discussion, as the simple scheme of ``sl0a'', ``sl0b'', ``sl0c'' works for basic devices like multidrop KISS, but does not cope with multiple frame relay connections where a virtual channel may be moved across physical boards.
Each device is created by filling in a struct device object and passing it to the register_netdev(struct device *) call. This links your device structure into the kernel network device tables. As the structure you pass in is used by the kernel, you must not free this until you have unloaded the device with void unregister_netdev(struct device *) calls. These calls are normally done at boot time, or module load and unload.
The kernel will not object if you create multiple devices with the same name; it will simply break. Therefore, if your driver is a loadable module you should use the struct device *dev_get(const char *name) call to ensure the name is not already in use. If it is in use, you should fail or pick another name. You may not use unregister_netdev() to unregister the other device with the name if you discover a clash!
A typical code sequence for registration is:
int register_my_device(void)
{
    int i=0;
    for(i=0;i<100;i++)
    {
        sprintf(mydevice.name,"mydev%d",i);
        if(dev_get(mydevice.name)==NULL)
        {
            if(register_netdev(&mydevice)!=0)
                return -EIO;
            return 0;
        }
    }
    printk("100 mydevs loaded. Unable to load more.\n");
    return -ENFILE;
}
All the generic information and methods for each network device are kept in the device structure. To create a device you need to fill most of these in. This section covers how they should be set up.
The next block of parameters are used to maintain the location of a device within the device address spaces of the architecture. The irq field holds the interrupt (IRQ) the device is using. This is normally set at boot, or by the initialization function. If an interrupt is not used, not currently known, or not assigned, the value zero should be used. The interrupt can be set in a variety of fashions. The auto-irq facilities of the kernel may be used to probe for the device interrupt, or the interrupt may be set when loading the network module. Network drivers normally use a global int called irq for this so that users can load the module with insmod mydevice irq=5 style commands. Finally, the IRQ may be set dynamically from the ifconfig command. This causes a call to your device that will be discussed later on.
The base_addr field is the base I/O space address the device resides at. If the device uses no I/O locations or is running on a system with no I/O space concept this field should be zero. When this is user settable, it is normally set by a global variable called io. The interface I/O address may also be set with ifconfig.
Two hardware shared memory ranges are defined for things like ISA bus shared memory ethernet cards. For current purposes, the rmem_start and rmem_end fields are obsolete and should be loaded with 0. The mem_start and mem_end addresses should be loaded with the start and end of the shared memory block used by this device. If no shared memory block is used, then the value 0 should be stored. Those devices that allow the user to specify this parameter use a global variable called mem to set the memory base, and set the mem_end appropriately themselves.
The dma variable holds the DMA channel in use by the device. Linux allows DMA (like interrupts) to be automatically probed. If no DMA channel is used, or the DMA channel is not yet set, the value 0 is used. This may have to change, since the latest PC boards allow ISA bus DMA channel 0 to be used by hardware boards and do not just tie it to memory refresh. If the user can set the DMA channel the global variable dma is used.
It is important to realise that the physical information is provided for control and user viewing (as well as the driver's internal functions), and does not register these areas to prevent them being reused. Thus the device driver must also allocate and register the I/O, DMA and interrupt lines it wishes to use, using the same kernel functions as any other device driver. [See the recent Kernel Korner articles on writing a character device driver in issues 23, 24, 25, 26, and 28 of Linux Journal.]
The if_port field holds the physical media type for multi-media devices such as combo ethernet boards.
In order for the network protocol layers to perform in a sensible manner, the device has to provide a set of capability flags and variables. These are also maintained in the device structure.
The mtu is the largest payload that can be sent over this interface (that is, the largest packet size not including any bottom layer headers that the device itself will provide). This is used by the protocol layers such as IP to select suitable packet sizes to send. There are minimums imposed by each protocol. A device is not usable for IPX without a 576 byte frame size or higher. IP needs at least 72 bytes, and does not perform sensibly below about 200 bytes. It is up to the protocol layers to decide whether to co-operate with your device.
The family is always set to AF_INET and indicates the protocol family the device is using. Linux allows a device to be using multiple protocol families at once, and maintains this information solely to look more like the standard BSD networking API.
The interface hardware type (type) field is taken from a table of physical media types. The values used by the ARP protocol (see RFC1700) are used for those media supporting ARP and additional values are assigned for other physical layers. New values are added when necessary both to the kernel and to net-tools, which is the package containing programs like ifconfig that need to be able to decode this field. The fields defined as of Linux pre2.0.5 are:
From RFC1700:
Those interfaces marked unused are defined types but without any current support on the existing net-tools. The Linux kernel provides additional generic support routines for devices using ethernet and token ring.
The pa_addr field is used to hold the IP address when the interface is up. Interfaces should start down with this variable clear. pa_brdaddr is used to hold the configured broadcast address, pa_dstaddr the target of a point to point link and pa_mask the IP netmask of the interface. All of these can be initialised to zero. The pa_alen field holds the length of an address (in our case an IP address); this should be initialised to 4.
The hard_header_len is the number of bytes the device desires at the start of a network buffer it is passed. It does not have to be the number of bytes of physical header that will be added, although this is normal. A device can use this to provide itself a scratchpad at the start of each buffer.
In the 1.2.x series kernels, the skb->data pointer will point to the buffer start and you must avoid sending your scratchpad yourself. This also means for devices with variable length headers you will need to allocate max_size+1 bytes and keep a length byte at the start so you know where the header really begins (the header should be contiguous with the data). Linux 1.3.x makes life much simpler and ensures you will have at least as much room as you asked free at the start of the buffer. It is up to you to use skb_push() appropriately as was discussed in the section on networking buffers.
The physical media addresses (if any) are maintained in dev_addr and broadcast respectively. These are byte arrays and addresses smaller than the size of the array are stored starting from the left. The addr_len field is used to hold the length of a hardware address. With many media there is no hardware address, and this should be set to zero. For some other interfaces the address must be set by a user program. The ifconfig tool permits the setting of an interface hardware address. In this case it need not be set initially, but the open code should take care not to allow a device to start transmitting without an address being set.
A set of flags are used to maintain the interface properties. Some of these are ``compatibility'' items and as such not directly useful. The flags are:
Packets are queued for an interface by the kernel protocol code. Within each device, buffs[] is an array of packet queues for each kernel priority level. These are maintained entirely by the kernel code, but must be initialised by the device itself on boot up. The initialisation code used is:
int ct=0;
while(ct<DEV_NUMBUFFS)
{
    skb_queue_head_init(&dev->buffs[ct]);
    ct++;
}
All other fields should be initialised to 0.
The device gets to select the queue length it wants by setting the field dev->tx_queue_len to the maximum number of frames the kernel should queue for the device. Typically this is around 100 for ethernet and 10 for serial lines. A device can modify this dynamically, although its effect will lag the change slightly.
Each network device has to provide a set of actual functions (methods) for the basic low level operations. It should also provide a set of support functions that interface the protocol layer to the protocol requirements of the link layer it is providing.
The init method is called when the device is initialised and registered with the system. It should perform any low level verification and checking needed, and return an error code if the device is not present, areas cannot be registered or it is otherwise unable to proceed. If the init method returns an error the register_netdev() call returns the error code and the device is not created.
All devices must provide a transmit function. It is possible for a device to exist that cannot transmit. In this case the device needs a transmit function that simply frees the buffer it is passed. The dummy device has exactly this functionality on transmit.
The dev->hard_start_xmit() function is called and provides the driver with its own device pointer and network buffer (an sk_buff) to transmit. If your device is unable to accept the buffer, it should return 1 and set dev->tbusy to a non-zero value. This will queue the buffer and it may be retried again later, although there is no guarantee that the buffer will be retried. If the protocol layer decides to free the buffer the driver has rejected, then it will not be offered back to the device. If the device knows the buffer cannot be transmitted in the near future, for example due to bad congestion, it can call dev_kfree_skb() to dump the buffer and return 0 indicating the buffer is processed.
If there is room the buffer should be processed. The buffer handed down already contains all the necessary headers, including link layer headers, and need only be actually loaded into the hardware for transmission. In addition, the buffer is locked. This means that the device driver has absolute ownership of the buffer until it chooses to relinquish it. The contents of an sk_buff remain read-only, except that you are guaranteed that the next/previous pointers are free so you can use the sk_buff list primitives to build internal chains of buffers.
When the buffer has been loaded into the hardware, or in the case of some DMA driven devices, when the hardware has indicated transmission complete, the driver must release the buffer. This is done by calling dev_kfree_skb(skb, FREE_WRITE). As soon as this call is made, the sk_buff in question may spontaneously disappear and the device driver thus should not reference it again.
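The transmit contract (return 0 when the frame is taken, return 1 and set a busy flag when it is not) can be illustrated with a small userspace model. Everything here is an assumption made for the example: the toy_dev structure, the two-slot hardware ring and the toy_ function names are not part of any real driver.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TX_SLOTS 2   /* hypothetical size of the hardware ring */

struct toy_dev {
    int tbusy;               /* non-zero: do not offer more frames */
    int in_flight;           /* frames loaded into the "hardware" */
    unsigned long tx_done;   /* frames the "hardware" has finished */
};

/* Returns 0 if the frame was accepted, 1 if the caller must requeue it. */
static int toy_start_xmit(struct toy_dev *dev, const void *frame, size_t len)
{
    (void)frame;
    (void)len;
    assert(dev != NULL);
    if (dev->in_flight >= TOY_TX_SLOTS) {
        dev->tbusy = 1;      /* no room: reject and flag busy */
        return 1;
    }
    dev->in_flight++;        /* frame now belongs to the driver */
    return 0;
}

/* Models the transmit-complete interrupt: a slot frees up. */
static void toy_tx_complete(struct toy_dev *dev)
{
    assert(dev->in_flight > 0);
    dev->in_flight--;
    dev->tx_done++;
    dev->tbusy = 0;          /* ready to be offered frames again */
    /* a real driver would release the finished buffer at this point */
}
```

A rejected frame stays with the caller, which is why the driver must not free or modify it in the busy path.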
It is necessary for the high level protocols to append low level headers to each frame before queueing it for transmission. It is also clearly undesirable that the protocol know in advance how to append low level headers for all possible frame types. Thus the protocol layer calls down to the device with a buffer that has at least dev->hard_header_len bytes free at the start of the buffer. It is then up to the network device to correctly call skb_push() and to put the header on the packet in its dev->hard_header() method. Devices with no link layer header, such as SLIP, may have this method specified as NULL.
The method is invoked giving the buffer concerned, the device's own pointers, its protocol identity, pointers to the source and destination hardware addresses, and the length of the packet to be sent. As the routine may be called before the protocol layers are fully assembled, it is vital that the method use the length parameter, not the buffer length.
The source address may be NULL to mean ``use the default address of this device'', and the destination may be NULL to mean ``unknown''. If as a result of an unknown destination the header may not be completed, the space should be allocated and any bytes that can be filled in should be filled in. This facility is currently only used by IP when ARP processing must take place. The function must then return the negative of the bytes of header added. If the header is completely built it must return the number of bytes of header added.
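The return-value convention for header building (positive length when complete, negative length when the destination is still unknown) is shown below for a 14 byte ethernet header. This is a simplified userspace sketch: toy_eth_header and struct toy_dev are hypothetical, and the function writes into space the caller is assumed to have already pushed at the front of the buffer.

```c
#include <assert.h>
#include <string.h>

#define TOY_ETH_ALEN 6    /* bytes in an ethernet address */
#define TOY_ETH_HLEN 14   /* destination + source + type */

struct toy_dev {
    unsigned char dev_addr[TOY_ETH_ALEN];   /* our own hardware address */
};

/*
 * Fill in the header at hdr. A NULL saddr means "use the device's own
 * address"; a NULL daddr means "unknown", in which case everything else
 * is still filled in and the negative header length is returned.
 */
static int toy_eth_header(unsigned char *hdr, const struct toy_dev *dev,
                          unsigned int type,
                          const unsigned char *daddr,
                          const unsigned char *saddr)
{
    assert(hdr != NULL && dev != NULL);
    hdr[12] = (unsigned char)(type >> 8);    /* type in network byte order */
    hdr[13] = (unsigned char)(type & 0xff);
    memcpy(hdr + TOY_ETH_ALEN, saddr != NULL ? saddr : dev->dev_addr,
           TOY_ETH_ALEN);
    if (daddr != NULL) {
        memcpy(hdr, daddr, TOY_ETH_ALEN);
        return TOY_ETH_HLEN;                 /* header is complete */
    }
    memset(hdr, 0, TOY_ETH_ALEN);            /* to be resolved later */
    return -TOY_ETH_HLEN;
}
```

The sign lets the caller tell at a glance whether address resolution is still needed, while the magnitude always reports how much header space was consumed.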
When a header cannot be completed the protocol layers will attempt to resolve the address necessary. When this occurs, the dev->rebuild_header() method is called with the address at which the header is located, the device in question, the destination IP address, and the network buffer pointer. If the device is able to resolve the address by whatever means available (normally ARP), then it fills in the physical address and returns 1. If the header cannot be resolved, it returns 0 and the buffer will be retried the next time the protocol layer has reason to believe resolution will be possible.
There is no receive method in a network device, because it is the device that invokes processing of such events. With a typical device, an interrupt notifies the handler that a completed packet is ready for reception. The device allocates a buffer of suitable size with dev_alloc_skb() and places the bytes from the hardware into the buffer. Next, the device driver analyses the frame to decide the packet type. The driver sets skb->dev to the device that received the frame. It sets skb->protocol to the protocol the frame represents so that the frame can be given to the correct protocol layer. The link layer header pointer is stored in skb->mac.raw and the link layer header removed with skb_pull() so that the protocols need not be aware of it. Finally, to keep the link and protocol isolated, the device driver must set skb->pkt_type to one of the following:
Finally, the device driver invokes netif_rx() to pass the buffer up to the protocol layer. The buffer is queued for processing by the networking protocols after the interrupt handler returns. Deferring the processing in this fashion dramatically reduces the time interrupts are disabled and improves overall responsiveness. Once netif_rx() is called, the buffer ceases to be property of the device driver and may not be altered or referred to again.
Flow control on received packets is applied at two levels by the protocols. Firstly, a maximum amount of data may be outstanding for netif_rx() to process. Secondly, each socket on the system has a queue which limits the amount of pending data. Thus all flow control is applied by the protocol layers. On the transmit side a per device variable dev->tx_queue_len is used as a queue length limiter. The size of the queue is normally 100 frames, which is enough that the queue will be kept well filled when sending a lot of data over fast links. On a slow link such as a SLIP link, the queue is normally set to about 10 frames, as sending even 10 frames is several seconds of queued data.
One piece of magic that is done for reception with most existing devices, and one you should implement if possible, is to reserve the necessary bytes at the head of the buffer to land the IP header on a long word boundary. The existing ethernet drivers thus do:
skb=dev_alloc_skb(length+2);
if(skb==NULL)
    return;
skb_reserve(skb,2);
/* then 14 bytes of ethernet hardware header */
to align IP headers on a 16 byte boundary, which is also the start of a cache line and helps give performance improvements. On the Sparc or DEC Alpha these improvements are very noticeable.
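The arithmetic behind the two reserved bytes is small enough to check directly. The helpers below are illustrative only (the toy_ names are invented), and they assume the buffer itself starts on a 16 byte boundary, which this sketch does not verify.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ETH_HLEN 14   /* length of an ethernet hardware header */

/* Offset of the IP header from the start of the buffer. */
static size_t toy_ip_offset(size_t reserved)
{
    return reserved + TOY_ETH_HLEN;
}

/* Non-zero if offset falls on a multiple of boundary. */
static int toy_is_aligned(size_t offset, size_t boundary)
{
    assert(boundary != 0);
    return offset % boundary == 0;
}
```

With nothing reserved the IP header starts at offset 14, which is not even a multiple of 4; reserving 2 bytes moves it to offset 16.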
Each device has the option of providing additional functions and facilities to the protocol layers. Not implementing these functions will cause a degradation in service available via the interface but not prevent operation. These operations split into two categories: configuration and activation/shutdown.
When a device is activated (that is, the flag IFF_UP is set) the dev->open() method is invoked if the device has provided one. This permits the device to take any actions, such as enabling the interface, that are needed when the interface is to be used. An error return from this function causes the device to stay down and causes the user request to activate the device to fail with the error returned by dev->open().
The second use of this function is with any device loaded as a module. Here it is necessary to prevent a device being unloaded while it is open. Thus the MOD_INC_USE_COUNT macro must be used within the open method.
The dev->close() method is invoked when the device is configured down and should shut off the hardware in such a way as to minimise machine load (for example by disabling the interface or its ability to generate interrupts). It can also be used to allow a module device to be unloaded now that it is down. The rest of the kernel is structured in such a way that when a device is closed, all references to it by pointer are removed. This ensures that the device may safely be unloaded from a running system. The close method is not permitted to fail.
A set of functions provide the ability to query and to set operating parameters. The first and most basic of these is a get_stats routine which when called returns a struct enet_statistics block for the interface. This allows user programs such as ifconfig to see the loading on the interface and any problem frames logged. Not providing this will lead to no statistics being available.
The dev->set_mac_address() function is called whenever a superuser process issues an ioctl of type SIOCSIFHWADDR to change the physical address of a device. For many devices this is not meaningful and for others not supported. If so, leave this function pointer as NULL. Some devices can only perform a physical address change if the interface is taken down. For these, check IFF_UP and if set then return -EBUSY.
The dev->set_config() function is called by the SIOCSIFMAP function when a user enters a command like ifconfig eth0 irq 11. It passes an ifmap structure containing the desired I/O and other interface parameters. For most interfaces this is not useful and you can return NULL.
Finally, the dev->do_ioctl() call is invoked whenever an ioctl in the range SIOCDEVPRIVATE to SIOCDEVPRIVATE+15 is used on your interface. All these ioctl calls take a struct ifreq. This is copied into kernel space before your handler is called and copied back at the end. For maximum flexibility any user may make these calls and it is up to your code to check for superuser status when appropriate. For example the PLIP driver uses these to set parallel port time out speeds to allow a user to tune the plip device for their machine.
Certain physical media types such as ethernet support multicast frames at the physical layer. A multicast frame is heard by a group of, but not all, hosts on the network, rather than going from one host to another.
The capabilities of ethernet cards are fairly variable. Most fall into one of three categories:
The kernel support code maintains lists of physical addresses your interface should be allowing for multicast. The device driver may return frames matching more than the requested list of multicasts if it is not able to do perfect filtering.
Whenever the list of multicast addresses changes the device driver's dev->set_multicast_list() function is invoked. The driver can then reload its physical tables. Typically this looks something like:
if(dev->flags&IFF_PROMISC)
    SetToHearAllPackets();
else if(dev->flags&IFF_ALLMULTI)
    SetToHearAllMulticasts();
else
{
    if(dev->mc_count<16)
    {
        LoadAddressList(dev->mc_list);
        SetToHearList();
    }
    else
        SetToHearAllMulticasts();
}
There are a small number of cards that can only do unicast or promiscuous mode. In this case the driver, when presented with a request for multicasts, has to go promiscuous. If this is done, the driver must itself also set the IFF_PROMISC flag in dev->flags.
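The mode selection can be pulled out into a pure function that is easy to test away from the hardware. This is a sketch with placeholder names and values: the TOY_ flag constants are not the kernel's, and the filter limit is passed in rather than fixed at 16.

```c
#include <assert.h>

/* Placeholder flag bits for illustration; not the kernel's values. */
#define TOY_IFF_PROMISC  0x1u
#define TOY_IFF_ALLMULTI 0x2u

enum toy_rx_mode {
    TOY_RX_PROMISC,    /* hear every packet */
    TOY_RX_ALLMULTI,   /* hear every multicast */
    TOY_RX_LIST        /* hear only the loaded address list */
};

/*
 * Decide how to program the receiver. limit is the number of list
 * entries below which the card's filter is used, mirroring the
 * "mc_count < 16" test in the text.
 */
static enum toy_rx_mode toy_pick_rx_mode(unsigned int flags,
                                         int mc_count, int limit)
{
    assert(mc_count >= 0 && limit >= 0);
    if (flags & TOY_IFF_PROMISC)
        return TOY_RX_PROMISC;
    if (flags & TOY_IFF_ALLMULTI)
        return TOY_RX_ALLMULTI;
    if (mc_count < limit)
        return TOY_RX_LIST;
    return TOY_RX_ALLMULTI;   /* list too long: over-receive instead */
}
```

Falling back to all-multicast when the list is too long is safe because, as noted above, a driver may deliver more multicast frames than were requested.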
In order to aid the driver writer, the multicast list is kept valid at all times. This simplifies many drivers, as a reset from error condition in a driver often has to reload the multicast address lists.
Ethernet is probably the most common physical interface type that is handled. The kernel provides a set of general purpose ethernet support routines that such drivers can use.
eth_header() is the standard ethernet handler for the dev->hard_header routine, and can be used in any ethernet driver. Combined with eth_rebuild_header() for the rebuild routine it provides all the ARP lookup required to put ethernet headers on IP packets.
The eth_type_trans() routine expects to be fed a raw ethernet packet. It analyses the headers and sets skb->pkt_type and skb->mac itself as well as returning the suggested value for skb->protocol. This routine is normally called from the ethernet driver receive interrupt handler to classify packets.
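The kind of classification described here can be sketched from the frame bytes alone. The code below is a simplified userspace illustration, not the kernel routine: the toy_ names and enum are invented, and it ignores details such as 802.3 length fields that a real implementation has to handle.

```c
#include <assert.h>
#include <string.h>

#define TOY_ETH_ALEN 6

enum toy_pkt_type {
    TOY_PKT_HOST,        /* addressed to this device */
    TOY_PKT_BROADCAST,   /* addressed to everyone */
    TOY_PKT_MULTICAST,   /* addressed to a group */
    TOY_PKT_OTHERHOST    /* addressed to someone else */
};

/* Classify a raw frame by its destination address (first six bytes). */
static enum toy_pkt_type toy_classify(const unsigned char *frame,
                                      const unsigned char *dev_addr)
{
    static const unsigned char bcast[TOY_ETH_ALEN] =
        {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

    assert(frame != NULL && dev_addr != NULL);
    if (memcmp(frame, bcast, TOY_ETH_ALEN) == 0)
        return TOY_PKT_BROADCAST;
    if (frame[0] & 1)                      /* group bit set */
        return TOY_PKT_MULTICAST;
    if (memcmp(frame, dev_addr, TOY_ETH_ALEN) == 0)
        return TOY_PKT_HOST;
    return TOY_PKT_OTHERHOST;
}

/* Extract the 16 bit type field, stored big-endian at offset 12. */
static unsigned int toy_frame_type(const unsigned char *frame)
{
    return ((unsigned int)frame[12] << 8) | frame[13];
}
```

The broadcast test must come before the multicast test, since the broadcast address also has the group bit set.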
eth_copy_and_sum(), the final ethernet support routine, is quite internally complex but offers significant performance improvements for memory mapped cards. It provides the support to copy and checksum data from the card into an sk_buff in a single pass. This single pass through memory almost eliminates the cost of checksum computation when used and can really help IP throughput.
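The idea of copying and checksumming in one pass can be shown with the 16 bit one's-complement sum used by IP. This is a plain, unoptimised userspace sketch; toy_copy_and_sum is a hypothetical name, it has nothing of the kernel routine's internals, and it returns the folded sum rather than its complement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len bytes from src to dst while accumulating the one's-complement
 * sum of the data taken as big-endian 16 bit words. A trailing odd byte
 * is treated as the high byte of a final word. The 32 bit accumulator is
 * only safe for inputs of up to roughly 128KB.
 */
static uint16_t toy_copy_and_sum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

Each byte is read from the source exactly once, which is the whole point: the data is already passing through the processor for the copy, so the additions come almost for free.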
Alan Cox has been working on Linux since version 0.95, when he installed it in order to do further work on the AberMUD game. He now manages the Linux Networking, SMP, and Linux/8086 projects and hasn't done any work on AberMUD since November 1993.