Reading notes: How to add a system call to OpenSolaris
Original author: Eric Schrock
Original post from:
http://blogs.sun.com/roller/page/eschrock
Translator and annotator: Badcoffee
Email: blog.oliver@gmail.com
Blog: http://blog.csdn.net/yayong
July 2005
When I first started in the Solaris group, I was faced with two equally difficult tasks: learning the development model, and understanding the source code. For both these tasks, the recommended method is usually picking a small bug and working through the process. For the curious, the first bug I putback to ON was 4912227 (ptree call returns zero on failure), a simple bug with near zero risk. It was the first step down a very long road.
As another first step, someone suggested adding a very simple system call to the kernel. This turned out to be a whole lot harder than one would expect, and has so many subtle aspects that experienced Solaris engineers (myself included) still miss some of the necessary changes. With that in mind, I thought a reasonable first OpenSolaris blog would be describing exactly how to add a new system call to the kernel.
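Notes:
1. Solaris development poses two challenges: understanding the Solaris development model, that is, the process side of things, and understanding the Solaris source code. A good way to get familiar with the process is to pick a very small Solaris bug and work it through.
2. For understanding the Solaris source code, a good starting point is adding a very simple system call. This is harder than it sounds: there are many subtle details that even experienced Solaris engineers overlook. The author takes this as the starting point and describes how to add a system call to the Solaris kernel.
3. To keep things simple, the author puts the code for the new call in an existing source file, which avoids any Makefile changes. The new system call just prints an arbitrary message to the console when it is invoked.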
Before writing any real code, we first have to pick a number that will represent our system call. The main source of documentation here is syscall.h, which describes all the available system call numbers, as well as which ones are reserved. The maximum number of syscalls is currently 256 (NSYSCALL), which doesn't leave much space for new ones. This could theoretically be extended - I believe the hard limit is in the size of sysset_t, whose 16 integers must be able to represent a complete bitmask of all system calls. This puts our actual limit at 16*32, or 512, system calls. But for the purposes of our tutorial, we'll pick system call number 56, which is currently unused. For my own amusement, we'll name our (my?) system call 'schrock'. So first we add the following line to syscall.h:
#define SYS_uadmin 55
#define SYS_schrock 56
#define SYS_utssys 57
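The 16*32 arithmetic can be made concrete with a small userland sketch. The names here (demo_sysset_t, demo_sysset_add, demo_sysset_contains) are hypothetical stand-ins, not the real sysset_t or its accessor macros, which may number the bits differently. The sketch only assumes a set stored as 16 words of 32 bits each.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for sysset_t: 16 words of 32 bits each. */
#define DEMO_WORDS          16
#define DEMO_BITS_PER_WORD  32
#define DEMO_MAX_SYSCALLS   (DEMO_WORDS * DEMO_BITS_PER_WORD)   /* 512 */

typedef struct {
    uint32_t word[DEMO_WORDS];
} demo_sysset_t;

/* Empty the set. */
static void
demo_sysset_clear(demo_sysset_t *set)
{
    memset(set, 0, sizeof (*set));
}

/* Mark syscall number `num` as a member of the set. */
static void
demo_sysset_add(demo_sysset_t *set, unsigned int num)
{
    assert(num < DEMO_MAX_SYSCALLS);
    set->word[num / DEMO_BITS_PER_WORD] |=
        UINT32_C(1) << (num % DEMO_BITS_PER_WORD);
}

/* Return 1 if syscall number `num` is in the set, 0 otherwise. */
static int
demo_sysset_contains(const demo_sysset_t *set, unsigned int num)
{
    if (num >= DEMO_MAX_SYSCALLS)
        return (0);
    return ((set->word[num / DEMO_BITS_PER_WORD] >>
        (num % DEMO_BITS_PER_WORD)) & 1);
}
```

In this layout, syscall 56 lands in word 1 (56 / 32) at bit 24 (56 % 32), and any number of 512 or above simply has no bit to occupy, which is where the hard limit comes from.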
Next, we have to actually add the function that will get called when we invoke the system call. What we should really do is add a new file schrock.c to usr/src/uts/common/syscall, but I'm trying to avoid Makefiles. Instead, we'll just stick it in getpid.c:
#include <sys/cmn_err.h>

int
schrock(void *arg)
{
    char buf[1024];
    size_t len;

    if (copyinstr(arg, buf, sizeof (buf), &len) != 0)
        return (set_errno(EFAULT));
    cmn_err(CE_WARN, "%s", buf);
    return (0);
}
Note that declaring a buffer of 1024 bytes on the stack is a very bad thing to do in the kernel. We have limited stack space, and a stack overflow will result in a panic. We also don't check that the length of the string was less than our scratch space. But this will suffice for illustrative purposes. The cmn_err() function is the simplest way to display messages from the kernel.
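Notes:
8. Step 2: implement the system call function. To avoid changing any Makefile, the author adds the new call, schrock, to getpid.c. The implementation is simple: it prints a given string to the console.
9. The function declares a 1024-byte buffer, which is allocated on the kernel stack. Kernel stack space is very limited, so a buffer this large is bad practice; a stack overflow causes a system panic. In general, to avoid exhausting the kernel stack, both local variables and nested function calls have to be weighed against the stack space they consume.
10. The OpenSolaris source shows that copyinstr() copies a NUL-terminated string from user space into kernel space. Its prototype is:
int copyinstr(const char *uaddr, char *kaddr, size_t maxlength, size_t *lencopied);
The first and second parameters are the source string in user space and the destination buffer in kernel space; the third is the size of the destination buffer; the fourth returns the number of bytes actually copied.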
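The role of the four copyinstr() parameters can be sketched with a userland model. This is not the kernel routine: demo_copystr is a hypothetical name, there is no user/kernel address-space crossing (so nothing here can produce EFAULT), and two details are assumptions of the sketch, namely that the reported length includes the terminating NUL and that an over-long string is reported as ENAMETOOLONG.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Userland model of a bounded string copy (hypothetical helper).
 * Copies the NUL-terminated string at src into dst, writing at most
 * maxlength bytes. On success returns 0 and stores the number of
 * bytes copied, including the NUL, in *lencopied. If the string does
 * not fit, returns ENAMETOOLONG and dst is NOT NUL-terminated.
 */
static int
demo_copystr(const char *src, char *dst, size_t maxlength, size_t *lencopied)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < maxlength; i++) {
        dst[i] = src[i];
        if (src[i] == '\0') {
            if (lencopied != NULL)
                *lencopied = i + 1;
            return (0);
        }
    }
    if (lencopied != NULL)
        *lencopied = maxlength;
    return (ENAMETOOLONG);
}
```

A caller that distinguishes the two outcomes can return a more precise error than a blanket EFAULT, and can refuse to print a buffer that was never terminated.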
We need to place an entry in the system call table. This table lives in sysent.c, and makes heavy use of macros to simplify the source. Our system call takes a single argument and returns an integer, so we'll need to use the SYSENT_CI macro. We need to add a prototype for our syscall, and add an entry to the sysent and sysent32 tables:
int rename();
void rexit();
int schrock();
int semsys();
int setgid();
/* ... */
/* 54 */ SYSENT_CI("ioctl", ioctl, 3),
/* 55 */ SYSENT_CI("uadmin", uadmin, 3),
/* 56 */ SYSENT_CI("schrock", schrock, 1),
/* 57 */ IF_LP64(
            SYSENT_2CI("utssys", utssys64, 4),
            SYSENT_2CI("utssys", utssys32, 4)),
/* ... */
/* 54 */ SYSENT_CI("ioctl", ioctl, 3),
/* 55 */ SYSENT_CI("uadmin", uadmin, 3),
/* 56 */ SYSENT_CI("schrock", schrock, 1),
/* 57 */ SYSENT_2CI("utssys", utssys32, 4),
Note: as the entries above show, each row of this table lists the name of the system call, a pointer to the function that handles it, and its number of arguments.
/*
 * This table is the switch used to transfer to the appropriate
 * routine for processing a system call. Each row contains the
 * number of arguments expected, a switch that tells systrap()
 * in trap.c whether a setjmp() is not necessary, and a pointer
 * to the routine.
 */
Note: depending on the type and number of values a system call returns, a different macro is used. This example needs SYSENT_CI.
/* returns a 64-bit quantity for both ABIs */
#define SYSENT_C(name, call, narg) \
    { (narg), SE_64RVAL, NULL, NULL, (llfcn_t)(call) }
/* returns one 32-bit value for both ABIs: r_val1 */
#define SYSENT_CI(name, call, narg) \
    { (narg), SE_32RVAL1, NULL, NULL, (llfcn_t)(call) }
/* returns 2 32-bit values: r_val1 & r_val2 */
#define SYSENT_2CI(name, call, narg) \
    { (narg), SE_32RVAL1|SE_32RVAL2, NULL, NULL, (llfcn_t)(call) }
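The idea of a table acting as "the switch" can be illustrated with a tiny userland dispatcher. Everything here is hypothetical (demo_sysent, demo_table, demo_dispatch and the two toy handlers); the real sysent entries carry more fields and the real trap path is considerably more involved. The sketch only shows the number-to-handler lookup and why the argument count is stored next to the function pointer.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*demo_handler_t)(long, long, long);

/* Hypothetical, simplified table entry. */
struct demo_sysent {
    int narg;               /* number of arguments the handler expects */
    demo_handler_t call;    /* handler, or NULL for an unused slot */
};

static long
demo_add(long a, long b, long c)
{
    (void) c;
    return (a + b);
}

static long
demo_negate(long a, long b, long c)
{
    (void) b;
    (void) c;
    return (-a);
}

#define DEMO_NSYSCALL   4

static const struct demo_sysent demo_table[DEMO_NSYSCALL] = {
    /* 0 */ { 0, NULL },
    /* 1 */ { 2, demo_add },
    /* 2 */ { 1, demo_negate },
    /* 3 */ { 0, NULL },
};

/*
 * Look up `num` in the table and call its handler. Only as many
 * arguments as the entry declares are passed through; the rest are
 * zero. Returns 0 and stores the result in *rval, or -1 if the
 * number is out of range or the slot is unused.
 */
static int
demo_dispatch(unsigned int num, const long *args, int nargs, long *rval)
{
    long a[3] = { 0, 0, 0 };
    int i;

    assert(rval != NULL);
    if (num >= DEMO_NSYSCALL || demo_table[num].call == NULL)
        return (-1);
    for (i = 0; i < demo_table[num].narg && i < nargs && i < 3; i++)
        a[i] = args[i];
    *rval = demo_table[num].call(a[0], a[1], a[2]);
    return (0);
}
```

Because the lookup is purely positional, an entry placed in the wrong slot silently becomes a different system call, which is why the numbered comments beside each row matter.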
At this point, we could write a program to invoke our system call, but the point here is to illustrate everything that needs to be done to integrate a system call, so we can't ignore the little things. One of these little things is /etc/name_to_sysnum, which provides a mapping between system call names and numbers, and is used by dtrace(1M), truss(1), and friends. Of course, there is one version for x86 and one for SPARC, so you will have to add the following lines to both the Intel and SPARC versions:
ioctl 54
uadmin 55
schrock 56
utssys 57
fdsync 58
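As a rough illustration of how a tool might consume lines in this "name number" layout, here is a minimal parsing sketch. demo_parse_sysnum_line is a hypothetical helper, not code from dtrace or truss, and it assumes nothing about the file beyond whitespace-separated name and number fields as shown in the excerpt.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_NAME_MAX   32

/*
 * Parse one "name number" line (hypothetical helper).
 * Returns 0 and fills in name/num on success, -1 if the line does
 * not contain both fields or the name does not fit in namesz bytes.
 */
static int
demo_parse_sysnum_line(const char *line, char *name, size_t namesz, int *num)
{
    char tmp[DEMO_NAME_MAX];
    int n;

    assert(line != NULL && name != NULL && num != NULL);
    /* %31s leaves room for the terminating NUL in tmp */
    if (sscanf(line, "%31s %d", tmp, &n) != 2)
        return (-1);
    if (n < 0 || strlen(tmp) >= namesz)
        return (-1);
    strcpy(name, tmp);
    *num = n;
    return (0);
}
```

Passing the line "schrock 56" fills in the name "schrock" and the number 56; a line with only one field is rejected.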
Truss does fancy decoding of system call arguments. In order to do this, we need to maintain a table in truss that describes the type of each argument for every syscall. This table is found in systable.c. Since our syscall takes a single string, we add the following entry:
{"ioctl", 3, DEC, NOV, DEC, IOC, IOA}, /* 54 */
{"uadmin", 3, DEC, NOV, DEC, DEC, DEC}, /* 55 */
{"schrock", 1, DEC, NOV, STG}, /* 56 */
{"utssys", 4, DEC, NOV, HEX, DEC, UTS, HEX}, /* 57 */
{"fdsync", 2, DEC, NOV, DEC, FFG}, /* 58 */
Don't worry too much about the different constants. But be sure to read up on the truss source code if you're adding a complicated system call.
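Notes:
17. Step 5: so that truss(1) can decode the arguments of the new system call, a matching entry has to be added to the systable array in systable.c.
18. systable is a table maintained by truss(1) that describes, for each system call, the number of arguments and the display format of the return values and the arguments. It is defined as follows: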
const struct systable systable[] = {
{ NULL, 8, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX},
{"_exit", 1, DEC, NOV, DEC}, /* 1 */
{"forkall", 0, DEC, NOV}, /* 2 */
{"read", 3, DEC, NOV, DEC, IOB, UNS}, /* 3 */
{"write", 3, DEC, NOV, DEC, IOB, UNS}, /* 4 */
{"open", 3, DEC, NOV, STG, OPN, OCT}, /* 5 */
..............
{"cladm", 3, DEC, NOV, CLC, CLF, HEX}, /* 253 */
{ NULL, 8, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX},
{"umount2", 2, DEC, NOV, STG, MTF}, /* 255 */
{ NULL, -1, DEC, NOV},
};
Each row describes one system call and corresponds to struct systable, which is defined as follows:
struct systable {
const char *name; /* name of system call */
short nargs; /* number of arguments */
char rval[2]; /* return value types */
char arg[8]; /* argument types */
};
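So in each row of systable, the first value is the call name, the second is the number of arguments, the third and fourth describe the return values, and the remaining values (up to eight) describe the arguments. With this description, truss(1) knows the format of each system call's arguments and return values and can print them correctly. The entry for the new system call is: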
{"schrock", 1, DEC, NOV, STG}, /* 56 */
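To show what per-argument type codes buy a tracer, here is a minimal sketch of formatting a value according to its code. The enum and demo_format_arg are hypothetical and cover only three of the codes seen above (DEC, HEX, STG); the output style is an assumption of the sketch and is not claimed to match truss byte for byte. In particular, a real tracer has to read a string argument out of the traced process, whereas here the string is simply passed in.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subset of argument type codes. */
enum demo_argtype { DEMO_DEC, DEMO_HEX, DEMO_STG };

/*
 * Format one argument into out according to its type code.
 * For DEMO_STG the already-fetched string is passed in str; value is
 * ignored. Returns what snprintf() returns, or 0 for an unknown code.
 */
static int
demo_format_arg(char *out, size_t outsz, enum demo_argtype type,
    long value, const char *str)
{
    assert(out != NULL && outsz > 0);
    switch (type) {
    case DEMO_DEC:
        return (snprintf(out, outsz, "%ld", value));
    case DEMO_HEX:
        return (snprintf(out, outsz, "0x%08lX", (unsigned long)value));
    case DEMO_STG:
        return (snprintf(out, outsz, "\"%s\"",
            str != NULL ? str : "(null)"));
    }
    out[0] = '\0';
    return (0);
}
```

With the STG code on its single argument, a call to schrock would be rendered with the message in quotes rather than as a raw pointer value.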
This is the file that gets missed the most often when adding a new syscall. Libproc uses the table in proc_names.c to translate between system call numbers and names. Why it doesn't make use of /etc/name_to_sysnum is anybody's guess, but for now you have to update the systable array in this file:
"ioctl", /* 54 */
"uadmin", /* 55 */
"schrock", /* 56 */
"utssys", /* 57 */
"fdsync", /* 58 */
Finally, everything is in place. We can test our system call with a simple program:
#include <sys/syscall.h>

int
main(int argc, char **argv)
{
    syscall(SYS_schrock, "OpenSolaris Rules!");
    return (0);
}
If we run this on our system, we'll see the following output on the console:
June 14 13:42:21 halcyon genunix: WARNING: OpenSolaris Rules!
Because we did all the extra work, we can actually observe the behavior using truss(1), mdb(1), or dtrace(1M). As you can see, adding a system call is not as easy as it should be. One of the ideas that has been floating around for a while is the Grand Unified Syscall(tm) project, which would centralize all this information as well as provide type information for the DTrace syscall provider. But until that happens, we'll have to deal with this process.
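Notes:
21. Finally, write a small program to test the newly added system call. This skips an important and fairly involved step: rebuilding the OpenSolaris kernel and then installing or updating OpenSolaris so that the new call is available. Because every required change has been made, beyond calling the new system call you can also use all of the OpenSolaris debugging tools on it, such as truss(1), mdb(1), and dtrace(1M).
22. At the end of the post, the author mentions a possible future improvement to OpenSolaris: centralizing all the system call definitions and providing system call type information to the DTrace syscall provider. Until that work is done, adding a new system call means walking through the process described here.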