Network Buffers And Memory Management

Reprinted with permission of Linux Journal, from issue 29, September 1996. Some changes have been made to accommodate the web. This article was originally written for the Kernel Korner column. The Kernel Korner series has included many other articles of interest to Linux kernel hackers, as well.

by Alan Cox

The Linux operating system implements the industry-standard Berkeley socket API, which has its origins in the BSD Unix developments (4.2/4.3/4.4 BSD). In this article, we will look at the way the memory management and buffering is implemented for network layers and network device drivers under the existing Linux kernel, as well as explain how and why some things have changed over time.

Core Concepts

The networking layer tries to be fairly object-oriented in its design, as indeed is much of the Linux kernel. The core structure of the networking code goes back to the initial networking and socket implementations by Ross Biro and Orest Zborowski respectively. The key objects are:

Device or Interface:
A network interface represents a thing which sends and receives packets. This is normally interface code for a physical device like an ethernet card. However, some devices are software only, such as the loopback device which is used for sending data to yourself.
Protocol:
Each protocol is effectively a different language of networking. Some protocols exist purely because vendors chose to use proprietary networking schemes, others are designed for special purposes. Within the Linux kernel each protocol is a separate module of code which provides services to the socket layer.
Socket:
So called from the notion of plugs and sockets. A socket is a connection in the networking that provides Unix file I/O and exists to the user program as a file descriptor. In the kernel each socket is a pair of structures that represent the high level socket interface and low level protocol interface.
sk_buff:
All the buffers used by the networking layers are sk_buffs. The control for these is provided by core low-level library routines available to the whole of the networking. sk_buffs provide the general buffering and flow control facilities needed by network protocols.

Implementation of sk_buffs

The primary goal of the sk_buff routines is to provide a consistent and efficient buffer handling method for all of the network layers, and by being consistent to make it possible to provide higher level sk_buff and socket handling facilities to all the protocols.

An sk_buff is a control structure with a block of memory attached. There are two primary sets of functions provided in the sk_buff library: first, routines to manipulate doubly linked lists of sk_buffs; second, functions for controlling the attached memory. The buffers are held on linked lists optimised for the common network operations of append to end and remove from start. As so much of the networking functionality occurs during interrupts, these routines are written to be atomic. The small extra overhead this causes is well worth the pain it saves in bug hunting.

We use the list operations to manage groups of packets as they arrive from the network, and as we send them to the physical interfaces. We use the memory manipulation routines for handling the contents of packets in a standardised and efficient manner.

At its most basic level, a list of buffers is managed using functions like this:

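    static struct sk_buff_head my_list;  /* set up once with skb_queue_head_init() */
    static int my_dropped;               /* frames dropped for lack of memory */

    void append_frame(char *buf, int len)
    {
        struct sk_buff *skb = alloc_skb(len, GFP_ATOMIC);

        if (skb == NULL)
            my_dropped++;
        else
        {
            skb_put(skb, len);              /* reserve len bytes of data space */
            memcpy(skb->data, buf, len);    /* copy the frame into the buffer */
            skb_queue_tail(&my_list, skb);  /* add the buffer to the end of the list */
        }
    }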

    void process_queue(void)
    {
        struct sk_buff *skb;

        /* take buffers off the front of the list until it is empty */
        while ((skb = skb_dequeue(&my_list)) != NULL)
        {
            process_data(skb);
            kfree_skb(skb, FREE_READ);
        }
    }

These two fairly simplistic pieces of code actually demonstrate the receive packet mechanism quite accurately. The append_frame() function is similar to the code called from an interrupt by a device driver receiving a packet, and process_queue() is similar to the code called to feed data into the protocols. If you go and look in net/core/dev.c at netif_rx() and net_bh(), you will see that they manage buffers similarly. They are far more complex, as they have to feed packets to the right protocol and manage flow control, but the basic operations are the same. This is just as true if you look at buffers going from the protocol code to a user application.

The example also shows the use of one of the data control functions, skb_put(). Here it is used to reserve space in the buffer for the data we wish to pass down.

Let's look at append_frame(). The alloc_skb() function obtains a buffer of len bytes (Figure 1), which consists of:

  • 0 bytes of room at the head of the buffer
  • 0 bytes of data, and
  • len bytes of room at the end of the data.
The skb_put() function (Figure 4) grows the data area upwards in memory through the free space at the buffer end and thus reserves space for the memcpy(). Many network operations used in sending add to the start of the frame each time in order to add headers to packets, so the skb_push() function (Figure 5) is provided to allow you to move the start of the data frame down through memory, providing enough space has been reserved to leave room for doing this.

Immediately after a buffer has been allocated, all the available room is at the end. A further function named skb_reserve() (Figure 2), which can be called before data is added, allows you to specify that some of the room should be at the beginning. Thus, many sending routines start with something like:

    skb = alloc_skb(len + headspace, GFP_KERNEL);
    skb_reserve(skb, headspace);
    skb_put(skb, len);
    memcpy_fromfs(skb->data, data, len);
    pass_to_m_protocol(skb);

In systems such as BSD Unix you don't need to know in advance how much space you will need as it uses chains of small buffers (mbufs) for its network buffers. Linux chooses to use linear buffers and save space in advance (often wasting a few bytes to allow for the worst case) because linear buffers make many other things much faster.
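The pointer arithmetic behind reserving, putting and pushing can be tried out in user space. The following is only a toy model under stated assumptions: struct toy_buf and every toy_* function are hypothetical names invented for this sketch and are not the kernel's sk_buff or its routines. It keeps four pointers into one linear block and shows how the headroom, data length and tailroom change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy user-space model of a linear network buffer (hypothetical names). */
struct toy_buf {
    unsigned char *head;  /* start of the allocated block */
    unsigned char *data;  /* start of the valid data */
    unsigned char *tail;  /* end of the valid data */
    unsigned char *end;   /* end of the allocated block */
};

/* Allocate a buffer: all of the room is initially at the end. */
static struct toy_buf *toy_alloc(size_t size)
{
    struct toy_buf *b = malloc(sizeof(*b));

    if (b == NULL)
        return NULL;
    b->head = malloc(size ? size : 1);
    if (b->head == NULL) {
        free(b);
        return NULL;
    }
    b->data = b->tail = b->head;
    b->end = b->head + size;
    return b;
}

static void toy_free(struct toy_buf *b)
{
    free(b->head);
    free(b);
}

static size_t toy_headroom(const struct toy_buf *b) { return (size_t)(b->data - b->head); }
static size_t toy_tailroom(const struct toy_buf *b) { return (size_t)(b->end - b->tail); }
static size_t toy_len(const struct toy_buf *b)      { return (size_t)(b->tail - b->data); }

/* Move some of the free room to the front; only valid while the buffer is empty. */
static void toy_reserve(struct toy_buf *b, size_t n)
{
    assert(toy_len(b) == 0 && n <= toy_tailroom(b));
    b->data += n;
    b->tail += n;
}

/* Grow the data area at the end; returns where the new bytes should be written. */
static unsigned char *toy_put(struct toy_buf *b, size_t n)
{
    unsigned char *old_tail = b->tail;

    assert(n <= toy_tailroom(b));
    b->tail += n;
    return old_tail;
}

/* Grow the data area at the front, for example to prepend a header. */
static unsigned char *toy_push(struct toy_buf *b, size_t n)
{
    assert(n <= toy_headroom(b));
    b->data -= n;
    return b->data;
}
```

Allocating len + headspace bytes, reserving headspace, and putting len bytes leaves exactly headspace bytes free at the front, which a later push can consume without moving the payload. That is the property the linear design relies on.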

Now to return to the list functions. Linux provides the following operations:

  • skb_dequeue() takes the first buffer from a list. If the list is empty a NULL pointer is returned. This is used to pull buffers off queues. The buffers are added with the routines skb_queue_head() and skb_queue_tail().
  • skb_queue_head() places a buffer at the start of a list. As with all the list operations, it is atomic.
  • skb_queue_tail() places a buffer at the end of a list, which is the most commonly used function. Almost all the queues are handled with one set of routines queueing data with this function and another set removing items from the same queues with skb_dequeue().
  • skb_unlink() removes a buffer from whatever list it was on. The buffer is not freed, merely removed from the list. To make some operations easier, you need not know what list the buffer is on, and you can always call skb_unlink() on a buffer which is not in a list. This enables network code to pull a buffer out of use even when the network protocol has no idea who is currently using it. A separate locking mechanism is provided so device drivers do not find someone removing a buffer they are using at that moment.
  • Some more complex protocols like TCP keep frames in order and re-order their input as data is received. Two functions, skb_insert() and skb_append(), exist to allow users to place sk_buffs before or after a specific buffer in a list.
  • alloc_skb() creates a new sk_buff and initialises it. The returned buffer is ready to use but does assume you will fill in a few fields to indicate how the buffer should be freed. Normally this is skb->free=1. A buffer can be told not to be freed when kfree_skb() (see below) is called.
  • kfree_skb() releases a buffer, and if skb->sk is set it lowers the memory use counts of the socket (sk). It is up to the socket and protocol-level routines to have incremented these counts and to avoid freeing a socket with outstanding buffers. The memory counts are very important, as the kernel networking layers need to know how much memory is tied up by each connection in order to prevent remote machines or local processes from using too much memory.
  • skb_clone() makes a copy of an sk_buff but does not copy the data area, which must be considered read only.
  • For some things a copy of the data is needed for editing, and skb_copy() provides the same facilities but also copies the data (and thus has a much higher overhead).
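The queue operations listed above can be illustrated with a small user-space sketch. This is a toy model, not the kernel code: struct toy_node, struct toy_queue and the toy_* functions are hypothetical names, and the sketch makes no attempt at the atomicity the real routines provide. It assumes each node remembers which queue it is on, which is what lets an unlink work without the caller naming the list.

```c
#include <stddef.h>

struct toy_queue;

/* One queued item; list is NULL while the node is not on any queue. */
struct toy_node {
    struct toy_node *next, *prev;
    struct toy_queue *list;
    int id;
};

/* A circular doubly linked list with a sentinel head node. */
struct toy_queue {
    struct toy_node head;
    unsigned int qlen;
};

static void toy_queue_init(struct toy_queue *q)
{
    q->head.next = q->head.prev = &q->head;
    q->head.list = q;
    q->qlen = 0;
}

/* Link n between two neighbouring nodes of q. */
static void toy_link(struct toy_queue *q, struct toy_node *n,
                     struct toy_node *prev, struct toy_node *next)
{
    n->prev = prev;
    n->next = next;
    prev->next = n;
    next->prev = n;
    n->list = q;
    q->qlen++;
}

/* Place a node at the start of the queue. */
static void toy_queue_head(struct toy_queue *q, struct toy_node *n)
{
    toy_link(q, n, &q->head, q->head.next);
}

/* Place a node at the end of the queue (the common case). */
static void toy_queue_tail(struct toy_queue *q, struct toy_node *n)
{
    toy_link(q, n, q->head.prev, &q->head);
}

/* Remove a node from whatever queue it is on; harmless if it is on none. */
static void toy_unlink(struct toy_node *n)
{
    if (n->list == NULL)
        return;
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->list->qlen--;
    n->next = n->prev = NULL;
    n->list = NULL;
}

/* Take the first node off the queue, or return NULL if it is empty. */
static struct toy_node *toy_dequeue(struct toy_queue *q)
{
    struct toy_node *n = q->head.next;

    if (n == &q->head)
        return NULL;
    toy_unlink(n);
    return n;
}
```

Nodes must start with list set to NULL (for example by zero-initialising them). With a sentinel head, appending to the end and removing from the start are both constant-time pointer updates, matching the access pattern the article describes as the common case.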

Figure 1: After alloc_skb


Figure 2: After skb_reserve


Figure 3: An sk_buff containing data


Figure 4: After skb_put has been called on the buffer


Figure 5: After an skb_push has occurred on the previous buffer


Figure 6: Network device data flow

Higher Level Support Routines

The semantics of allocating and queueing buffers for sockets also involve flow control rules and, for sending, a whole list of interactions with signals and optional settings such as non-blocking. Two routines are designed to make this easy for most protocols.

The sock_queue_rcv_skb() function is used to handle incoming data flow control and is normally used in the form:

    sk = my_find_socket(whatever);
    if (sock_queue_rcv_skb(sk, skb) == -1)
    {
        myproto_stats.dropped++;
        kfree_skb(skb, FREE_READ);
        return;
    }

This function uses the socket read queue counters to prevent vast amounts of data being queued to a socket. After a limit is hit, data is discarded. It is up to the application to read fast enough, or as in TCP, for the protocol to do flow control over the network. TCP actually tells the sending machine to shut up when it can no longer queue data.

On the sending side, sock_alloc_send_skb() handles signal handling, the non-blocking flag, and all the semantics of blocking until there is space in the send queue so you cannot tie up all of memory with data queued for a slow interface. Many protocol send routines have this function doing almost all the work:

    skb = sock_alloc_send_skb(sk, ....);
    if (skb == NULL)
        return -err;
    skb->sk = sk;
    skb_reserve(skb, headroom);
    skb_put(skb, len);
    memcpy(skb->data, data, len);
    protocol_do_something(skb);

Most of this we have met before. The very important line is skb->sk=sk. The sock_alloc_send_skb() has charged the memory for the buffer to the socket. By setting skb->sk we tell the kernel that whoever does a kfree_skb() on the buffer should cause the socket to be credited the memory for the buffer. Thus when a device has sent a buffer and frees it the user will be able to send more.
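The charge-and-credit bookkeeping can be sketched in user space. This is a toy model with hypothetical names (struct toy_sock, struct toy_pkt, toy_alloc_send(), toy_free_pkt()) and an assumed fixed byte limit per socket. It only models the non-blocking case, where an allocation over the limit fails instead of sleeping.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy socket: tracks how many bytes of buffers are charged to it. */
struct toy_sock {
    size_t wmem_alloc;  /* bytes currently charged */
    size_t sndbuf;      /* assumed limit on charged bytes */
};

/* Toy packet: sk names the socket to credit when the packet is freed. */
struct toy_pkt {
    struct toy_sock *sk;
    size_t size;
};

/* Charge size bytes to the socket, or fail if that would exceed the limit. */
static struct toy_pkt *toy_alloc_send(struct toy_sock *sk, size_t size)
{
    struct toy_pkt *p;

    if (sk->wmem_alloc + size > sk->sndbuf)
        return NULL;            /* a blocking caller would sleep here */
    p = malloc(sizeof(*p));
    if (p == NULL)
        return NULL;
    p->sk = NULL;               /* the caller sets this, as in the article */
    p->size = size;
    sk->wmem_alloc += size;
    return p;
}

/* Free a packet; if it names an owning socket, credit the memory back. */
static void toy_free_pkt(struct toy_pkt *p)
{
    if (p->sk != NULL)
        p->sk->wmem_alloc -= p->size;
    free(p);
}
```

If the caller forgets to set the owner pointer, the free never credits the socket and the sender eventually stalls at its limit, which is why the article singles that line out.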

Network Devices

All Linux network devices follow the same interface although many functions available in that interface will not be needed for all devices. An object oriented mentality is used and each device is an object with a series of methods that are filled into a structure. Each method is called with the device itself as the first argument. This is done to get around the lack of the C++ concept of this within the C language.

The file drivers/net/skeleton.c contains the skeleton of a network device driver. View or print a copy from a recent kernel and follow along throughout the rest of the article.

Basic Structure

Each network device deals entirely in the transmission of network buffers from the protocols to the physical media, and in receiving and decoding the responses the hardware generates. Incoming frames are turned into network buffers, identified by protocol and delivered to netif_rx(). This function then passes the frames off to the protocol layer for further processing.

Each device provides a set of additional methods for the handling of stopping, starting, control and physical encapsulation of packets. These and all the other control information are collected together in the device structures that are used to manage each device.

Naming

All Linux network devices have a unique name. This is not in any way related to the file system names devices may have, and indeed network devices do not normally have a filesystem representation, although you may create a device which is tied to device drivers. Traditionally the name indicates only the type of a device rather than its maker. Multiple devices of the same type are numbered upwards from 0. Thus ethernet devices are known as ``eth0'', ``eth1'', ``eth2'' etc. The naming scheme is important as it allows users to write programs or system configuration in terms of ``an ethernet card'' rather than worrying about the manufacturer of the board and forcing reconfiguration if a board is changed.

The following names are currently used for generic devices:

ethn
Ethernet controllers, both 10 and 100Mb/second
trn
Token ring devices.
sln
SLIP devices. Also used in AX.25 KISS mode.
pppn
PPP devices both asynchronous and synchronous.
plipn
PLIP units. The number matches the printer port.
tunln
IPIP encapsulated tunnels
nrn
NetROM virtual devices
isdnn
ISDN interfaces handled by isdn4linux. (*)
dummyn
Null devices
lo
The loopback device
(*) At least one ISDN interface is an ethernet impersonator, that is the Sonix PC/Volante driver. Therefore, it uses an ``eth'' device name as it behaves in all aspects as if it was ethernet rather than ISDN.

If possible, a new device should pick a name that reflects existing practice. When you are adding a whole new physical layer type you should look for other people working on such a project and use a common naming scheme.

Certain physical layers present multiple logical interfaces over one media. Both ATM and Frame Relay have this property, as does multi-drop KISS in the amateur radio environment. Under such circumstances a driver needs to exist for each active channel. The Linux networking code is structured in such a way as to make this manageable without excessive additional code, and the name registration scheme allows you to create and remove interfaces almost at will as channels come into and out of existence. The proposed convention for such names is still under some discussion, as the simple scheme of ``sl0a'', ``sl0b'', ``sl0c'' works for basic devices like multidrop KISS, but does not cope with multiple frame relay connections where a virtual channel may be moved across physical boards.

Registering A Device

Each device is created by filling in a struct device object and passing it to the register_netdev(struct device *) call. This links your device structure into the kernel network device tables. As the structure you pass in is used by the kernel, you must not free this until you have unloaded the device with the void unregister_netdev(struct device *) call. These calls are normally done at boot time, or module load and unload.

The kernel will not object if you create multiple devices with the same name; it will simply break. Therefore, if your driver is a loadable module you should use the struct device *dev_get(const char *name) call to ensure the name is not already in use. If it is in use, you should fail or pick another name. You may not use unregister_netdev() to unregister the other device with the name if you discover a clash!

A typical code sequence for registration is:

    int register_my_device(void)
    {
        int i = 0;

        /* try mydev0, mydev1, ... until an unused name is found */
        for (i = 0; i < 100; i++)
        {
            sprintf(mydevice.name, "mydev%d", i);
            if (dev_get(mydevice.name) == NULL)
            {
                if (register_netdev(&mydevice) != 0)
                    return -EIO;
                return 0;
            }
        }
        printk("100 mydevs loaded. Unable to load more.\n");
        return -ENFILE;
    }

The Device Structure

All the generic information and methods for each network device are kept in the device structure. To create a device you need to fill most of these in. This section covers how they should be set up.

Naming

First, the name field holds the device name. This is a string pointer to a name in the formats discussed previously. It may also be "    " (four spaces), in which case the kernel will automatically assign an ethn name to it. This is a special feature that is best not used. After Linux 2.0, we intend to change to a simple support function of the form dev_make_name("eth").

Bus Interface Parameters

The next block of parameters are used to maintain the location of a device within the device address spaces of the architecture. The irq field holds the interrupt (IRQ) the device is using. This is normally set at boot, or by the initialization function. If an interrupt is not used, not currently known, or not assigned, the value zero should be used. The interrupt can be set in a variety of fashions. The auto-irq facilities of the kernel may be used to probe for the device interrupt, or the interrupt may be set when loading the network module. Network drivers normally use a global int called irq for this so that users can load the module with insmod mydevice irq=5 style commands. Finally, the IRQ may be set dynamically from the ifconfig command. This causes a call to your device that will be discussed later on.

The base_addr field is the base I/O space address the device resides at. If the device uses no I/O locations or is running on a system with no I/O space concept this field should be zero. When this is user settable, it is normally set by a global variable called io. The interface I/O address may also be set with ifconfig.

Two hardware shared memory ranges are defined for things like ISA bus shared memory ethernet cards. For current purposes, the rmem_start and rmem_end fields are obsolete and should be loaded with 0. The mem_start and mem_end addresses should be loaded with the start and end of the shared memory block used by this device. If no shared memory block is used, then the value 0 should be stored. Those devices that allow the user to specify this parameter use a global variable called mem to set the memory base, and set the mem_end appropriately themselves.

The dma variable holds the DMA channel in use by the device. Linux allows DMA (like interrupts) to be automatically probed. If no DMA channel is used, or the DMA channel is not yet set, the value 0 is used. This may have to change, since the latest PC boards allow ISA bus DMA channel 0 to be used by hardware boards and do not just tie it to memory refresh. If the user can set the DMA channel the global variable dma is used.

It is important to realise that the physical information is provided for control and user viewing (as well as the driver's internal functions), and does not register these areas to prevent them being reused. Thus the device driver must also allocate and register the I/O, DMA and interrupt lines it wishes to use, using the same kernel functions as any other device driver. [See the recent Kernel Korner articles on writing a character device driver in issues 23, 24, 25, 26, and 28 of Linux Journal.]

The if_port field holds the physical media type for multi-media devices such as combo ethernet boards.

Protocol Layer Variables

In order for the network protocol layers to perform in a sensible manner, the device has to provide a set of capability flags and variables. These are also maintained in the device structure.

The mtu is the largest payload that can be sent over this interface (that is, the largest packet size not including any bottom layer headers that the device itself will provide). This is used by the protocol layers such as IP to select suitable packet sizes to send. There are minimums imposed by each protocol. A device is not usable for IPX without a 576 byte frame size or higher. IP needs at least 72 bytes, and does not perform sensibly below about 200 bytes. It is up to the protocol layers to decide whether to co-operate with your device.

The family is always set to AF_INET and indicates the protocol family the device is using. Linux allows a device to be using multiple protocol families at once, and maintains this information solely to look more like the standard BSD networking API.

The interface hardware type (type) field is taken from a table of physical media types. The values used by the ARP protocol (see RFC1700) are used for those media supporting ARP and additional values are assigned for other physical layers. New values are added when necessary both to the kernel and to net-tools, which is the package containing programs like ifconfig that need to be able to decode this field. The fields defined as of Linux pre2.0.5 are:

From RFC1700:

ARPHRD_NETROM
NET/ROM(tm) devices.
ARPHRD_ETHER
10 and 100Mbit/second ethernet.
ARPHRD_EETHER
Experimental Ethernet (not used).
ARPHRD_AX25
AX.25 level 2 interfaces.
ARPHRD_PRONET
PROnet token ring (not used).
ARPHRD_CHAOS
ChaosNET (not used).
ARPHRD_IEEE802
802.2 networks notably token ring.
ARPHRD_ARCNET
ARCnet interfaces.
ARPHRD_DLCI
Frame Relay DLCI.
Defined by Linux:
ARPHRD_SLIP
Serial Line IP protocol
ARPHRD_CSLIP
SLIP with VJ header compression
ARPHRD_SLIP6
6bit encoded SLIP
ARPHRD_CSLIP6
6bit encoded header compressed SLIP
ARPHRD_ADAPT
SLIP interface in adaptive mode
ARPHRD_PPP
PPP interfaces (async and sync)
ARPHRD_TUNNEL
IPIP tunnels
ARPHRD_TUNNEL6
IPv6 over IP tunnels
ARPHRD_FRAD
Frame Relay Access Device.
ARPHRD_SKIP
SKIP encryption tunnel.
ARPHRD_LOOPBACK
Loopback device.
ARPHRD_LOCALTLK
Localtalk apple networking device.
ARPHRD_METRICOM
Metricom Radio Network.

Those interfaces marked unused are defined types but without any current support on the existing net-tools. The Linux kernel provides additional generic support routines for devices using ethernet and token ring.

The pa_addr field is used to hold the IP address when the interface is up. Interfaces should start down with this variable clear. pa_brdaddr is used to hold the configured broadcast address, pa_dstaddr the target of a point to point link and pa_mask the IP netmask of the interface. All of these can be initialised to zero. The pa_alen field holds the length of an address (in our case an IP address); this should be initialised to 4.

Link Layer Variables

The hard_header_len is the number of bytes the device desires at the start of a network buffer it is passed. It does not have to be the number of bytes of physical header that will be added, although this is normal. A device can use this to provide itself a scratchpad at the start of each buffer.

In the 1.2.x series kernels, the skb->data pointer will point to the buffer start and you must avoid sending your scratchpad yourself. This also means for devices with variable length headers you will need to allocate max_size+1 bytes and keep a length byte at the start so you know where the header really begins (the header should be contiguous with the data). Linux 1.3.x makes life much simpler and ensures you will have at least as much room as you asked free at the start of the buffer. It is up to you to use skb_push() appropriately as was discussed in the section on networking buffers.

The physical media addresses (if any) are maintained in dev_addr and broadcast respectively. These are byte arrays and addresses smaller than the size of the array are stored starting from the left. The addr_len field is used to hold the length of a hardware address. With many media there is no hardware address, and this should be set to zero. For some other interfaces the address must be set by a user program. The ifconfig tool permits the setting of an interface hardware address. In this case it need not be set initially, but the open code should take care not to allow a device to start transmitting without an address being set.

Flags

A set of flags are used to maintain the interface properties. Some of these are ``compatibility'' items and as such not directly useful. The flags are:

IFF_UP
The interface is currently active. In Linux, the IFF_RUNNING and IFF_UP flags are basically handled as a pair. They exist as two items for compatibility reasons. When an interface is not marked as IFF_UP it may be removed. Unlike BSD, an interface that does not have IFF_UP set will never receive packets.
IFF_BROADCAST
The interface has broadcast capability. There will be a valid IP address stored in the device addresses.
IFF_DEBUG
Available to indicate debugging is desired. Not currently used.
IFF_LOOPBACK
The loopback interface (lo) is the only interface that has this flag set. Setting it on other interfaces is neither defined nor a very good idea.
IFF_POINTOPOINT
The interface is a point to point link (such as SLIP or PPP). There is no broadcast capability as such. The remote point to point address in the device structure is valid. A point to point link has no netmask or broadcast normally, but this can be enabled if needed.
IFF_NOTRAILERS
More of a prehistoric than a historic compatibility flag. Not used.
IFF_RUNNING
See IFF_UP
IFF_NOARP
The interface does not perform ARP queries. Such an interface must have either a static table of address conversions or no need to perform mappings. The NetROM interface is a good example of this. Here all entries are hand configured as the NetROM protocol cannot do ARP queries.
IFF_PROMISC
The interface, if it is possible, will hear all packets on the network. This is typically used for network monitoring although it may also be used for bridging. One or two interfaces like the AX.25 interfaces are always in promiscuous mode.
IFF_ALLMULTI
Receive all multicast packets. An interface that cannot perform this operation but can receive all packets will go into promiscuous mode when asked to perform this task.
IFF_MULTICAST
Indicate that the interface supports multicast IP traffic. This is not the same as supporting a physical multicast. AX.25 for example supports IP multicast using physical broadcast. Point to point protocols such as SLIP generally support IP multicast.

The Packet Queue

Packets are queued for an interface by the kernel protocol code. Within each device, buffs[] is an array of packet queues for each kernel priority level. These are maintained entirely by the kernel code, but must be initialised by the device itself on boot up. The initialisation code used is:

    int ct = 0;

    /* one queue per kernel priority level */
    while (ct < DEV_NUMBUFFS)
    {
        skb_queue_head_init(&dev->buffs[ct]);
        ct++;
    }

All other fields should be initialised to 0.

The device gets to select the queue length it wants by setting the field dev->tx_queue_len to the maximum number of frames the kernel should queue for the device. Typically this is around 100 for ethernet and 10 for serial lines. A device can modify this dynamically, although its effect will lag the change slightly.

Network Device Methods

Each network device has to provide a set of actual functions (methods) for the basic low level operations. It should also provide a set of support functions that interface the protocol layer to the protocol requirements of the link layer it is providing.

Setup

The init method is called when the device is initialised and registered with the system. It should perform any low level verification and checking needed, and return an error code if the device is not present, areas cannot be registered or it is otherwise unable to proceed. If the init method returns an error the register_netdev() call returns the error code and the device is not created.

Frame Transmission

All devices must provide a transmit function. It is possible for a device to exist that cannot transmit. In this case the device needs a transmit function that simply frees the buffer it is passed. The dummy device has exactly this functionality on transmit.

The dev->hard_start_xmit() function is called and provides the driver with its own device pointer and network buffer (an sk_buff) to transmit. If your device is unable to accept the buffer, it should return 1 and set dev->tbusy to a non-zero value. This will queue the buffer and it may be retried again later, although there is no guarantee that the buffer will be retried. If the protocol layer decides to free the buffer the driver has rejected, then it will not be offered back to the device. If the device knows the buffer cannot be transmitted in the near future, for example due to bad congestion, it can call dev_kfree_skb() to dump the buffer and return 0 indicating the buffer is processed.

If there is room the buffer should be processed. The buffer handed down already contains all the headers, including link layer headers, necessary and need only be actually loaded into the hardware for transmission. In addition, the buffer is locked. This means that the device driver has absolute ownership of the buffer until it chooses to relinquish it. The contents of an sk_buff remain read-only, except that you are guaranteed that the next/previous pointers are free so you can use the sk_buff list primitives to build internal chains of buffers.

When the buffer has been loaded into the hardware, or in the case of some DMA driven devices, when the hardware has indicated transmission complete, the driver must release the buffer. This is done by calling dev_kfree_skb(skb, FREE_WRITE). As soon as this call is made, the sk_buff in question may spontaneously disappear and the device driver thus should not reference it again.
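The return convention of the transmit method (0 when the frame has been taken, 1 with the busy flag set when it has not) can be modelled in user space. The sketch below is an assumption-laden toy: struct toy_dev, toy_start_xmit() and toy_tx_complete() are hypothetical names, and the "hardware" is just a counter of free transmit slots.

```c
#include <stddef.h>

/* Toy device with a fixed number of hardware transmit slots (hypothetical). */
struct toy_dev {
    int tbusy;                 /* non-zero while the device cannot accept frames */
    int slots_free;            /* hardware transmit slots still available */
    unsigned long tx_packets;  /* frames accepted for transmission */
};

/* Returns 0 if the frame was accepted, 1 if the caller should keep it queued. */
static int toy_start_xmit(struct toy_dev *dev, const unsigned char *frame, size_t len)
{
    (void)frame;               /* a real driver would copy the frame to hardware */
    (void)len;

    if (dev->slots_free == 0) {
        dev->tbusy = 1;        /* tell the upper layer to stop offering frames */
        return 1;
    }
    dev->slots_free--;
    dev->tx_packets++;
    return 0;
}

/* Called when the imaginary hardware reports that a transmission finished. */
static void toy_tx_complete(struct toy_dev *dev)
{
    dev->slots_free++;
    dev->tbusy = 0;            /* frames may be offered again */
}
```

In a real driver the completion path is also where the buffer would be released, after which it must not be touched again.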

Frame Headers

It is necessary for the high level protocols to append low level headers to each frame before queueing it for transmission. It is also clearly undesirable that the protocol know in advance how to append low level headers for all possible frame types. Thus the protocol layer calls down to the device with a buffer that has at least dev->hard_header_len bytes free at the start of the buffer. It is then up to the network device to correctly call skb_push() and to put the header on the packet in its dev->hard_header() method. Devices with no link layer header, such as SLIP, may have this method specified as NULL.

The method is invoked giving the buffer concerned, the device's own pointers, its protocol identity, pointers to the source and destination hardware addresses, and the length of the packet to be sent. As the routine may be called before the protocol layers are fully assembled, it is vital that the method use the length parameter, not the buffer length.

The source address may be NULL to mean ``use the default address of this device'', and the destination may be NULL to mean ``unknown''. If as a result of an unknown destination the header may not be completed, the space should be allocated and any bytes that can be filled in should be filled in. This facility is currently only used by IP when ARP processing must take place. The function must then return the negative of the bytes of header added. If the header is completely built it must return the number of bytes of header added.

When a header cannot be completed the protocol layers will attempt to resolve the address necessary. When this occurs, the dev->rebuild_header() method is called with the address at which the header is located, the device in question, the destination IP address, and the network buffer pointer. If the device is able to resolve the address by whatever means available (normally ARP), then it fills in the physical address and returns 1. If the header cannot be resolved, it returns 0 and the buffer will be retried the next time the protocol layer has reason to believe resolution will be possible.
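The sign convention for header building (positive length when complete, negative length when the destination is still unknown) is easy to get wrong, so here is a minimal user-space sketch. It assumes a 14-byte Ethernet-style header laid out as destination, source, then a big-endian protocol field. struct toy_dev and toy_hard_header() are hypothetical names, and the real method takes a network buffer rather than a raw pointer.

```c
#include <stddef.h>
#include <string.h>

#define TOY_ALEN 6    /* hardware address length */
#define TOY_HLEN 14   /* destination + source + 2-byte protocol */

struct toy_dev {
    unsigned char dev_addr[TOY_ALEN];  /* the device's own hardware address */
};

/*
 * Write a link layer header at hdr (which must have TOY_HLEN bytes of room).
 * saddr == NULL means "use the device's default address".
 * daddr == NULL means "unknown": everything else is filled in and the
 * negative header length is returned so the caller knows to resolve it.
 */
static int toy_hard_header(unsigned char *hdr, const struct toy_dev *dev,
                           unsigned short proto,
                           const unsigned char *daddr,
                           const unsigned char *saddr)
{
    memcpy(hdr + TOY_ALEN, saddr != NULL ? saddr : dev->dev_addr, TOY_ALEN);
    hdr[12] = (unsigned char)(proto >> 8);    /* protocol, high byte first */
    hdr[13] = (unsigned char)(proto & 0xff);

    if (daddr == NULL) {
        memset(hdr, 0, TOY_ALEN);             /* placeholder until resolved */
        return -TOY_HLEN;
    }
    memcpy(hdr, daddr, TOY_ALEN);
    return TOY_HLEN;
}
```

Either way the full header space is consumed, so a later rebuild step only has to overwrite the destination bytes in place.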

Reception

There is no receive method in a network device, because it is the device that invokes processing of such events. With a typical device, an interrupt notifies the handler that a completed packet is ready for reception. The device allocates a buffer of suitable size with dev_alloc_skb() and places the bytes from the hardware into the buffer. Next, the device driver analyses the frame to decide the packet type. The driver sets skb->dev to the device that received the frame. It sets skb->protocol to the protocol the frame represents so that the frame can be given to the correct protocol layer. The link layer header pointer is stored in skb->mac.raw and the link layer header removed with skb_pull() so that the protocols need not be aware of it. Finally, to keep the link and protocol isolated, the device driver must set skb->pkt_type to one of the following:

PACKET_BROADCAST
Link layer broadcast.
PACKET_MULTICAST
Link layer multicast.
PACKET_SELF
Frame to us.
PACKET_OTHERHOST
Frame to another single host.
This last type is normally reported as a result of an interface running in promiscuous mode.
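For an Ethernet-like link, the choice between these four values can be made from the destination address alone. The following is a minimal sketch under that assumption. The enum names mirror the list above, but their numeric values and the function toy_classify() are made up for the example.

```c
#include <string.h>

/* Hypothetical constants named after the list above; values are illustrative. */
enum toy_pkt_type {
    TOY_PACKET_SELF,
    TOY_PACKET_BROADCAST,
    TOY_PACKET_MULTICAST,
    TOY_PACKET_OTHERHOST
};

/* Classify a frame by its 6-byte destination address. */
enum toy_pkt_type toy_classify(const unsigned char dest[6],
                               const unsigned char own[6])
{
    static const unsigned char bcast[6] =
        {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

    if (memcmp(dest, bcast, 6) == 0)
        return TOY_PACKET_BROADCAST;
    if (dest[0] & 1)                    /* group bit of the first octet */
        return TOY_PACKET_MULTICAST;
    if (memcmp(dest, own, 6) == 0)
        return TOY_PACKET_SELF;
    return TOY_PACKET_OTHERHOST;        /* only seen when promiscuous */
}
```

The broadcast test comes first because the broadcast address also has the group bit set.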

Finally, the device driver invokes netif_rx() to pass the buffer up to the protocol layer. The buffer is queued for processing by the networking protocols after the interrupt handler returns. Deferring the processing in this fashion dramatically reduces the time interrupts are disabled and improves overall responsiveness. Once netif_rx() is called, the buffer ceases to be the property of the device driver and may not be altered or referred to again.

Flow control on received packets is applied at two levels by the protocols. Firstly, a maximum amount of data may be outstanding for netif_rx() to process. Secondly, each socket on the system has a queue which limits the amount of pending data. Thus all flow control is applied by the protocol layers. On the transmit side a per-device variable, dev->tx_queue_len, is used as a queue length limiter. The size of the queue is normally 100 frames, which is enough that the queue will be kept well filled when sending a lot of data over fast links. On a slow link such as a SLIP link, the queue is normally set to about 10 frames, as sending even 10 frames is several seconds of queued data.
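The "several seconds" figure is easy to check with a little arithmetic. The sketch below assumes 1006-byte frames on a 9600 bit/s serial line for the SLIP case and 1500-byte frames on a 10 Mbit/s link for the fast case. These numbers are assumptions for illustration, not values from the kernel, and framing overhead is ignored.

```c
/*
 * Milliseconds needed to drain a full transmit queue, ignoring any
 * per-frame framing overhead. All parameters are caller-supplied guesses.
 */
unsigned long long toy_queue_drain_ms(unsigned frames,
                                      unsigned bytes_per_frame,
                                      unsigned long long bits_per_second)
{
    unsigned long long bits =
        (unsigned long long)frames * bytes_per_frame * 8ULL;

    return bits * 1000ULL / bits_per_second;
}
```

Under these assumptions, toy_queue_drain_ms(10, 1006, 9600) gives roughly 8.4 seconds of queued data, while toy_queue_drain_ms(100, 1500, 10000000) gives about 0.12 seconds.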

One piece of magic that is done for reception by most existing device drivers, and one you should implement if possible, is to reserve the necessary bytes at the head of the buffer to land the IP header on a long word boundary. The existing ethernet drivers thus do:

    skb = dev_alloc_skb(length + 2);
    if (skb == NULL)
        return;
    skb_reserve(skb, 2);
    /* then 14 bytes of ethernet hardware header */
to align IP headers on a 16 byte boundary, which is also the start of a cache line and helps give performance improvements. On the Sparc or DEC Alpha these improvements are very noticeable.
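The arithmetic behind the two reserved bytes can be checked with a toy model that tracks only offsets from the start of an aligned buffer. struct toy_skb and the helper functions are hypothetical stand-ins, not the kernel's sk_buff API.

```c
#include <stddef.h>

#define TOY_ETH_HLEN 14  /* length of the ethernet hardware header */

/* Hypothetical stand-in: offsets from the start of an aligned buffer. */
struct toy_skb {
    size_t data;  /* offset of the first byte of frame data */
    size_t tail;  /* offset one past the last byte of frame data */
};

/* Leave len bytes of headroom, as skb_reserve() does on an empty buffer. */
void toy_skb_reserve(struct toy_skb *skb, size_t len)
{
    skb->data += len;
    skb->tail += len;
}

/* Offset at which the IP header lands once the ethernet header is stored. */
size_t toy_ip_header_offset(const struct toy_skb *skb)
{
    return skb->data + TOY_ETH_HLEN;
}
```

Without the reserve the IP header starts at offset 14, which is not a multiple of 4. With two bytes reserved it starts at offset 16.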

Optional Functionality

Each device has the option of providing additional functions and facilities to the protocol layers. Not implementing these functions will cause a degradation in the service available via the interface but will not prevent operation. These operations split into two categories: configuration and activation/shutdown.

Activation And Shutdown

When a device is activated (that is, the flag IFF_UP is set) the dev->open() method is invoked if the device has provided one. This permits the device to take any actions, such as enabling the interface, that are needed when the interface is to be used. An error return from this function causes the device to stay down and causes the user request to activate the device to fail with the error returned by dev->open().

The second use of this function is with any device loaded as a module. Here it is necessary to prevent a device being unloaded while it is open. Thus the MOD_INC_USE_COUNT macro must be used within the open method.

The dev->close() method is invoked when the device is configured down and should shut off the hardware in such a way as to minimise machine load (for example by disabling the interface or its ability to generate interrupts). It can also be used to allow a module device to be unloaded now that it is down. The rest of the kernel is structured in such a way that when a device is closed, all references to it by pointer are removed. This ensures that the device may safely be unloaded from a running system. The close method is not permitted to fail.
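The pairing of open and close can be sketched in user space as follows. Everything here is a stand-in invented for the example: struct toy_netdev, TOY_IFF_UP, and the plain counter toy_use_count, which plays the role that MOD_INC_USE_COUNT and its decrementing counterpart play in a real module.

```c
#include <errno.h>

#define TOY_IFF_UP 0x1   /* hypothetical flag value */

/* Hypothetical device: just enough state to show the control flow. */
struct toy_netdev {
    unsigned flags;
    int hw_fault;     /* non-zero simulates hardware that will not start */
    int hw_enabled;
};

int toy_use_count;    /* stands in for the module use count */

/* Open method: may fail, and pins the module while the device is open. */
int toy_open(struct toy_netdev *dev)
{
    if (dev->hw_fault)
        return -EIO;
    dev->hw_enabled = 1;
    toy_use_count++;
    return 0;
}

/* Close method: quiesces the hardware; it is not allowed to fail. */
int toy_close(struct toy_netdev *dev)
{
    dev->hw_enabled = 0;
    toy_use_count--;
    return 0;
}

/* What the caller does: the device only comes up if open succeeds. */
int toy_dev_up(struct toy_netdev *dev)
{
    int err = toy_open(dev);

    if (err)
        return err;   /* device stays down, error goes back to the user */
    dev->flags |= TOY_IFF_UP;
    return 0;
}

void toy_dev_down(struct toy_netdev *dev)
{
    dev->flags &= ~TOY_IFF_UP;
    toy_close(dev);
}
```

The use count returns to zero only after the close, which is the point at which unloading becomes safe in this model.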

Configuration And Statistics

A set of functions provides the ability to query and to set operating parameters. The first and most basic of these is a get_stats routine which, when called, returns a struct enet_statistics block for the interface. This allows user programs such as ifconfig to see the loading on the interface and any problem frames logged. Not providing this will lead to no statistics being available.

The dev->set_mac_address() function is called whenever a superuser process issues an ioctl of type SIOCSIFHWADDR to change the physical address of a device. For many devices this is not meaningful and for others it is not supported. If so, leave this function pointer as NULL. Some devices can only perform a physical address change if the interface is taken down. For these, check IFF_UP and, if it is set, return -EBUSY.
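For a device that can only change its address while down, the check might look like the sketch below. struct toy_netdev, TOY_IFF_UP and toy_set_mac_address() are hypothetical, and the real method's argument types are not reproduced here.

```c
#include <errno.h>
#include <string.h>

#define TOY_IFF_UP 0x1   /* hypothetical flag value */

/* Hypothetical device with a flags word and a 6-byte physical address. */
struct toy_netdev {
    unsigned flags;
    unsigned char dev_addr[6];
};

/* Refuse the change while the interface is up, otherwise store the address. */
int toy_set_mac_address(struct toy_netdev *dev, const unsigned char *addr)
{
    if (dev->flags & TOY_IFF_UP)
        return -EBUSY;
    memcpy(dev->dev_addr, addr, sizeof(dev->dev_addr));
    return 0;
}
```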

The dev->set_config() function is called when a SIOCSIFMAP ioctl is issued, for example when a user enters a command like ifconfig eth0 irq 11. It passes an ifmap structure containing the desired I/O and other interface parameters. For most interfaces this is not useful and you can return NULL.

Finally, the dev->do_ioctl() call is invoked whenever an ioctl in the range SIOCDEVPRIVATE to SIOCDEVPRIVATE+15 is used on your interface. All these ioctl calls take a struct ifreq. This is copied into kernel space before your handler is called and copied back at the end. For maximum flexibility any user may make these calls, and it is up to your code to check for superuser status when appropriate. For example, the PLIP driver uses these to set parallel port timeouts, allowing a user to tune the plip device for their machine.
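A handler of this kind is usually a switch on the command number, with the privilege check made inside the handler. The sketch below is loosely modelled on the timeout-tuning idea. The two commands, the is_superuser parameter, struct toy_netdev and the numeric value given to TOY_SIOCDEVPRIVATE are all assumptions for the example; real code would use the kernel's own SIOCDEVPRIVATE definition and a struct ifreq.

```c
#include <errno.h>

/* Assumed base value for the private range, for illustration only. */
#define TOY_SIOCDEVPRIVATE 0x89F0

/* Hypothetical device with one tunable parameter. */
struct toy_netdev {
    int timeout;
};

/*
 * Handle the device-private ioctls. Anyone may read the timeout, but
 * only the superuser may change it.
 */
int toy_do_ioctl(struct toy_netdev *dev, int cmd, int *value,
                 int is_superuser)
{
    switch (cmd) {
    case TOY_SIOCDEVPRIVATE:        /* hypothetical "get timeout" */
        *value = dev->timeout;
        return 0;
    case TOY_SIOCDEVPRIVATE + 1:    /* hypothetical "set timeout" */
        if (!is_superuser)
            return -EPERM;
        dev->timeout = *value;
        return 0;
    default:
        return -EOPNOTSUPP;
    }
}
```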

Multicasting

Certain physical media types such as ethernet support multicast frames at the physical layer. A multicast frame is heard by a group of hosts on the network, but not by all of them, rather than going from one host to another.

The capabilities of ethernet cards are fairly variable. Most fall into one of three categories:

  1. No multicast filters. The card either receives all multicasts or none of them. Such cards can be a nuisance on a network with a lot of multicast traffic such as group video conferences.
  2. Hash filters. A table is loaded onto the card giving a mask of entries that we wish to hear multicast for. This filters out some of the unwanted multicasts but not all.
  3. Perfect filters. Most cards that support perfect filters combine this option with 1 or 2 above. This is done because the perfect filter often has a length limit of 8 or 16 entries.
It is especially important that ethernet interfaces are programmed to support multicasting. Several ethernet protocols (notably AppleTalk and IP multicast) rely on ethernet multicasting. Fortunately, most of the work is done by the kernel for you (see net/core/dev_mcast.c).

The kernel support code maintains lists of physical addresses your interface should be allowing for multicast. The device driver may return frames matching more than the requested list of multicasts if it is not able to do perfect filtering.
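A toy hash filter shows why imperfect filtering lets extra frames through. The 64-entry table and the hash function below are made up for the example. Real cards use their own hash, often derived from the frame CRC, so this is a sketch of the idea and not any particular card's behaviour.

```c
#include <stdint.h>

/* Hypothetical 64-entry hash filter: one bit per hash bucket. */
struct toy_hash_filter {
    uint64_t bits;
};

/* Illustrative hash of a 6-byte address into the range 0..63. */
static unsigned toy_hash(const unsigned char addr[6])
{
    unsigned h = 0;
    int i;

    for (i = 0; i < 6; i++)
        h = h * 31u + addr[i];
    return h & 63u;
}

/* Ask to hear an address: set the bit for its bucket. */
void toy_filter_add(struct toy_hash_filter *f, const unsigned char addr[6])
{
    f->bits |= (uint64_t)1 << toy_hash(addr);
}

/* The "card" accepts any frame whose address hashes to a set bit. */
int toy_filter_accepts(const struct toy_hash_filter *f,
                       const unsigned char addr[6])
{
    return (int)((f->bits >> toy_hash(addr)) & 1u);
}
```

Any two addresses that share a bucket are indistinguishable to the filter, so a frame for a group nobody asked for can still reach the driver. The protocol layers above must be prepared to discard such frames.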

Whenever the list of multicast addresses changes, the device driver's dev->set_multicast_list() function is invoked. The driver can then reload its physical tables. Typically this looks something like:

    if (dev->flags & IFF_PROMISC)
        SetToHearAllPackets();
    else if (dev->flags & IFF_ALLMULTI)
        SetToHearAllMulticasts();
    else
    {
        if (dev->mc_count < 16)
        {
            LoadAddressList(dev->mc_list);
            SetToHearList();
        }
        else
            SetToHearAllMulticasts();
    }
There are a small number of cards that can only do unicast or promiscuous mode. In this case the driver, when presented with a request for multicasts, has to go promiscuous. If this is done, the driver must itself also set the IFF_PROMISC flag in dev->flags.

To aid driver writers, the multicast list is kept valid at all times. This simplifies many drivers, as a reset from an error condition in a driver often has to reload the multicast address lists.

Ethernet Support Routines

Ethernet is probably the most common physical interface type that is handled. The kernel provides a set of general purpose ethernet support routines that such drivers can use.

eth_header() is the standard ethernet handler for the dev->hard_header routine, and can be used in any ethernet driver. Combined with eth_rebuild_header() for the rebuild routine it provides all the ARP lookup required to put ethernet headers on IP packets.

The eth_type_trans() routine expects to be fed a raw ethernet packet. It analyses the headers and sets skb->pkt_type and skb->mac itself as well as returning the suggested value for skb->protocol. This routine is normally called from the ethernet driver receive interrupt handler to classify packets.
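The header-handling part of that job can be sketched in user space as follows. This is not the kernel routine: struct toy_frame and toy_type_trans() are hypothetical, the packet type classification is left out, and the sketch simply returns the 16-bit type field in host order, whereas the real routine does more work on the header than this.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_HLEN 14  /* destination + source + 2-byte type */

/* Hypothetical stand-in for the parts of a buffer this routine touches. */
struct toy_frame {
    const unsigned char *data;  /* start of the unprocessed frame data */
    size_t len;                 /* bytes remaining at data */
    const unsigned char *mac;   /* remembered link layer header */
};

/*
 * Remember where the link layer header is, strip it from the data the
 * protocols will see, and return the type field in host byte order.
 * The caller must pass a frame of at least TOY_ETH_HLEN bytes.
 */
uint16_t toy_type_trans(struct toy_frame *f)
{
    uint16_t type = (uint16_t)((f->data[12] << 8) | f->data[13]);

    f->mac = f->data;
    f->data += TOY_ETH_HLEN;
    f->len -= TOY_ETH_HLEN;
    return type;
}
```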

eth_copy_and_sum(), the final ethernet support routine, is quite complex internally but offers significant performance improvements for memory mapped cards. It provides the support to copy and checksum data from the card into an sk_buff in a single pass. This single pass through memory almost eliminates the cost of checksum computation when used and can really help IP throughput.
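The single-pass idea can be shown with a plain, unoptimised sketch. toy_copy_and_sum() is a hypothetical function, not the kernel routine. It copies the data and accumulates the 16-bit one's complement sum used by the Internet checksum in the same loop, so each byte is read from the source only once.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst and return the folded 16-bit one's
 * complement sum of the data (not inverted). Intended for frame-sized
 * buffers; an odd trailing byte is treated as if padded with a zero.
 */
uint16_t toy_copy_and_sum(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}
```

A separate copy followed by a separate checksum would walk the data twice. On a memory mapped card, where reads from the card are slow, halving the number of passes is where the saving comes from.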

Alan Cox has been working on Linux since version 0.95, when he installed it in order to do further work on the AberMUD game. He now manages the Linux Networking, SMP, and Linux/8086 projects and hasn't done any work on AberMUD since November 1993.



