v1.0.9 / chapter 1 of 3 / 01 oct 04 / greg goebel / public domain
* This chapter provides an overview of Awk and a quick tour of its use.
* The Awk text-processing language is useful for such tasks as:
Awk has two faces: it is a utility for performing simple text-processing tasks, and it is a programming language for performing complex text-processing tasks.
The two faces are really the same, however. Awk uses the same mechanisms for handling any text-processing task, but these mechanisms are flexible enough to allow useful Awk programs to be entered on the command line, or to implement complicated programs containing dozens of lines of Awk statements.
Awk statements comprise a programming language. In fact, Awk is useful for simple, quick-and-dirty computational programming. Anybody who can write a BASIC program can use Awk, although Awk's syntax is different from that of BASIC. Anybody who can write a C program can use Awk with little difficulty, and those who would like to learn C may find Awk a useful stepping stone, with the caution that Awk and C have significant differences beyond their many similarities.
There are, however, things that Awk is not. It is not really well suited for extremely large, complicated tasks. It is also an "interpreted" language -- that is, an Awk program cannot run on its own; it must be executed by the Awk utility itself. That means that it is relatively slow, though it is efficient as interpreted languages go, and that the program can only be used on systems that have Awk. There are translators available that can convert Awk programs into C code for compilation as stand-alone programs, but such translators have to be purchased separately.
One last item before proceeding: What does the name "Awk" mean? Awk actually stands for the names of its authors: "Aho, Weinberger, & Kernighan". Kernighan later noted: "Naming a language after its authors ... shows a certain poverty of imagination." The name is reminiscent of that of an oceanic bird known as an "auk", and so the picture of an auk often shows up on the cover of books on Awk.
* It is easy to use Awk from the command line to perform simple operations on text files. Suppose I have a file named "coins.txt" that describes a coin collection. Each line in the file contains the following information:

   metal  weight in ounces  date minted  country of origin  description

The file has the contents:

   gold     1     1986  USA              American Eagle
   gold     1     1908  Austria-Hungary  Franz Josef 100 Korona
   silver  10     1981  USA              ingot
   gold     1     1984  Switzerland      ingot
   gold     1     1979  RSA              Krugerrand
   gold     0.5   1981  RSA              Krugerrand
   gold     0.1   1986  PRC              Panda
   silver   1     1986  USA              Liberty dollar
   gold     0.25  1986  USA              Liberty 5-dollar piece
   silver   0.5   1986  USA              Liberty 50-cent piece
   silver   1     1987  USA              Constitution dollar
   gold     0.25  1987  USA              Constitution 5-dollar piece
   gold     1     1988  Canada           Maple Leaf

I could then invoke Awk to list all the gold pieces as follows:

   awk '/gold/' coins.txt

This tells Awk to search through the file for lines of text that contain the string "gold", and print them out. The result is:

   gold     1     1986  USA              American Eagle
   gold     1     1908  Austria-Hungary  Franz Josef 100 Korona
   gold     1     1984  Switzerland      ingot
   gold     1     1979  RSA              Krugerrand
   gold     0.5   1981  RSA              Krugerrand
   gold     0.1   1986  PRC              Panda
   gold     0.25  1986  USA              Liberty 5-dollar piece
   gold     0.25  1987  USA              Constitution 5-dollar piece
   gold     1     1988  Canada           Maple Leaf

* This is all very nice, you say, but any "grep" or "find" utility can do the same thing. True, but Awk is capable of doing much more. For example, suppose I only want to print the description field, and leave all the other text out. I could then change my invocation of Awk to:

   awk '/gold/ {print $5,$6,$7,$8}' coins.txt

This yields:

   American Eagle
   Franz Josef 100 Korona
   ingot
   Krugerrand
   Krugerrand
   Panda
   Liberty 5-dollar piece
   Constitution 5-dollar piece
   Maple Leaf

This example demonstrates the simplest general form of an Awk program:

   awk <search pattern> {<program actions>}

Awk searches through the input file for each line that contains the search pattern. For each of these lines found, Awk then performs the specified actions. In this example, the action is specified as:

   {print $5,$6,$7,$8}

The purpose of the "print" statement is obvious. The "$5", "$6", "$7", and "$8" are "fields", or "field variables", which store the words in each line of text by their numeric sequence. "$1", for example, stores the first word in the line, "$2" has the second, and so on. By default, a "word" is defined as any string of printing characters separated by spaces or tabs.
Since "coins.txt" has the structure:

   metal  weight in ounces  date minted  country of origin  description

-- then the field variables are matched to each line of text in the file as follows:

   metal:        $1
   weight:       $2
   date:         $3
   country:      $4
   description:  $5 through $8

The program action in this example prints the fields that contain the description. The description field in the file may actually include from one to four fields, but that's not a problem, since any undefined fields are simply empty and "print" shows nothing for them apart from the separating spaces. The astute reader will notice that the "coins.txt" file is neatly organized so that the only piece of information that contains multiple fields is at the end of the line. This is a little contrived, but that's the way examples are.
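
The field mapping can be checked with a small experiment. This is only a sketch: the scratch file name "coins_excerpt.txt" is my own invention, and it holds just two lines taken from "coins.txt" so that the example runs on its own.

```shell
# Assumption: a two-line excerpt of "coins.txt" in a scratch file.
cat > coins_excerpt.txt <<'EOF'
gold 1 1908 Austria-Hungary Franz Josef 100 Korona
silver 10 1981 USA ingot
EOF

# Label each field variable. On the second line, $6 through $8 are empty.
awk '{ print "metal:", $1
       print "weight:", $2
       print "date:", $3
       print "country:", $4
       print "description:", $5, $6, $7, $8 }' coins_excerpt.txt
```

For the first line the description comes out as "Franz Josef 100 Korona"; for the second it is just "ingot", followed by the separating spaces that belong to the three empty fields.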
* Awk's default program action is to print the entire line, which is what "print" does when invoked without parameters. This means that the first example:

   awk '/gold/'

-- is the same as:

   awk '/gold/ {print}'

Note that Awk recognizes the field variable $0 as representing the entire line, so this could also be written as:

   awk '/gold/ {print $0}'

This is redundant, but it does have the virtue of making the action more obvious.
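
A quick way to convince yourself that the three forms are equivalent is to run them side by side. The sketch below assumes a two-line stand-in file, "coins_demo.txt" (a name made up for this illustration), rather than the full "coins.txt".

```shell
# Assumption: a small stand-in for "coins.txt" so the sketch runs on its own.
printf 'gold 1 1986 USA American Eagle\nsilver 10 1981 USA ingot\n' > coins_demo.txt

awk '/gold/'            coins_demo.txt   # no action given: print the matching line
awk '/gold/ {print}'    coins_demo.txt   # "print" with no parameters
awk '/gold/ {print $0}' coins_demo.txt   # $0 names the entire line
```

Each of the three commands prints the same single line, the one containing "gold".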
* Now suppose I want to list all the coins that were minted before 1980. I invoke Awk as follows:

   awk '{if ($3 < 1980) print $3, " ",$5,$6,$7,$8}' coins.txt

This yields:

   1908   Franz Josef 100 Korona
   1979   Krugerrand

This new example adds a few new concepts:

  * No search pattern is given, so Awk applies the action to every line in the input file.

  * Text of my own (here, a quoted space) can be added to the output by placing it in the "print" parameter list.

  * An "if" statement tests a condition, and the "print" is executed only when the condition is true.

There's a subtle issue involved here, however. In most computer languages, strings are strings, and numbers are numbers. There are operations that are unique to each, and one must be specifically converted to the other with conversion functions. You don't concatenate numbers, and you don't perform arithmetic operations on strings.
Awk, on the other hand, makes no strong distinction between strings and numbers. In computer-science terms, it isn't a "strongly-typed" language. Every field is regarded as a string, but if that string also happens to represent a number, numeric operations can be performed on it. So we can perform an arithmetic comparison on the date field.
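
As a rough illustration of this, the sketch below pipes one line modeled on "coins.txt" into Awk and uses the date field both in a numeric comparison and in an addition. The wording of the output messages is my own.

```shell
# The date field arrives as text, yet it can be compared and added like a number.
echo "gold 1 1908 Austria-Hungary Franz Josef 100 Korona" |
awk '{ if ($3 < 1980) print $3, "was minted before 1980"
       print "a century later:", $3 + 100 }'
```

This prints "1908 was minted before 1980" followed by "a century later: 2008", with no conversion function needed anywhere.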
* The next example prints out how many coins are in the collection:

   awk 'END {print NR,"coins"}' coins.txt

This yields:

   13 coins

The first new item in this example is the END statement. To explain this, I have to extend the general form of an Awk program to:

   awk 'BEGIN              {<initializations>}
        <search pattern 1> {<program actions>}
        <search pattern 2> {<program actions>}
        ...
        END                {<final actions>}'

The BEGIN clause performs any initializations required before Awk starts scanning the input file. The subsequent body of the Awk program consists of a series of search patterns, each with its own program action. Awk scans each line of the input file for each search pattern, and performs the appropriate actions for each string found. Once the file has been scanned, an END clause can be used to perform any final actions required.
So, this example doesn't perform any processing on the input lines themselves. All it does is scan through the file and perform a final action: print the number of lines in the file, which is given by the "NR" variable.
NR stands for "number of records". NR is one of Awk's "pre-defined" variables. There are others; for example, the variable NF gives the number of fields in a line. A detailed explanation will have to wait for later.
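
To see the extended form in action, here is a minimal sketch that uses all three parts at once. It assumes a three-line stand-in file, "coins_small.txt" (a hypothetical name), and the messages it prints are purely illustrative.

```shell
# Assumption: a three-line stand-in for "coins.txt".
printf 'gold 1 1986 USA American Eagle\nsilver 10 1981 USA ingot\ngold 1 1979 RSA Krugerrand\n' > coins_small.txt

awk 'BEGIN    { print "scanning..." }
     /silver/ { print "line", NR, "has", NF, "fields" }
     END      { print NR, "coins" }' coins_small.txt
```

The BEGIN action runs before any input is read, the "/silver/" action runs on the one matching line (where NR is 2 and NF is 5), and the END action reports the final value of NR, which is 3.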
* Suppose the current price of gold is $425 an ounce, and I want to figure out the approximate total value of the gold pieces in the coin collection. I invoke Awk as follows:

   awk '/gold/ {ounces += $2} END {print "value = $" 425*ounces}' coins.txt

This yields:

   value = $2592.5

In this example, "ounces" is a variable I defined myself, or a "user defined" variable. You can use almost any string of characters as a variable name in Awk, as long as the name doesn't conflict with some string that has a specific meaning to Awk, such as "print" or "NR" or "END". There is no need to declare the variable, or to initialize it. A variable handled as a string variable is initialized to the "null string", meaning that if you try to print it, nothing will be there. A variable handled as a numeric variable will be initialized to zero.
So the program action:

   {ounces += $2}

-- sums the weight of the piece on each matched line into the variable "ounces". Those who program in C should be familiar with the "+=" operator. Those who don't can be assured that this is just a shorthand way of saying:

   {ounces = ounces + $2}

The final action is to compute and print the value of the gold:

   END {print "value = $" 425*ounces}

The only thing here of interest is that the two print parameters, the literal '"value = $"' and the expression "425*ounces", are separated by a space, not a comma. This concatenates the two parameters together on output, without any intervening spaces.
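
The difference between a space and a comma in "print", and the way unassigned variables behave, can be seen in a short sketch. The single input line is taken from "coins.txt"; the variable name "never_set" is hypothetical and exists only to show a variable that has never been assigned.

```shell
echo "gold 0.5 1981 RSA Krugerrand" |
awk '{ ounces += $2 }                         # "ounces" starts out as zero
     END { print "value = $" 425*ounces       # space: joined with no gap
           print "value = $", 425*ounces      # comma: separated by a space
           print "[" never_set "]" }'         # never assigned, so a null string
```

The first "print" gives "value = $212.5", the second gives "value = $ 212.5", and the third gives "[]", since there is nothing between the brackets to print.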
* All this is fun, but each of these examples only seems to nibble away at "coins.txt". Why not have Awk figure out everything interesting at one time?
The immediate objection to this idea is that it would be impractical to enter a lot of Awk statements on the command line, but that's easy to fix. The commands can be written into a file, and then Awk can be told to execute the commands from that file as follows:

   awk -f <awk program file name>

Given an ability to write an Awk program in this way, then what should a "master" "coins.txt" analysis program do? Here's one possible output:

   Summary Data for Coin Collection:

    Gold pieces: nn
    Weight of gold pieces: nn.nn
    Value of gold pieces: n,nnn.nn

    Silver pieces: nn
    Weight of silver pieces: nn.nn
    Value of silver pieces: n,nnn.nn

    Total number of pieces: nn
    Value of collection: n,nnn.nn

The following Awk program generates this information:

   # This is an awk program that summarizes a coin collection.
   #
   /gold/    { num_gold++; wt_gold += $2 }           # Get weight of gold.
   /silver/  { num_silver++; wt_silver += $2 }       # Get weight of silver.
   END { val_gold = 485 * wt_gold;                   # Compute value of gold.
         val_silver = 16 * wt_silver;                # Compute value of silver.
         total = val_gold + val_silver;
         print "Summary data for coin collection:";  # Print results.
         printf ("\n");
         printf (" Gold pieces: %2d\n", num_gold);
         printf (" Weight of gold pieces: %5.2f\n", wt_gold);
         printf (" Value of gold pieces: %7.2f\n", val_gold);
         printf ("\n");
         printf (" Silver pieces: %2d\n", num_silver);
         printf (" Weight of silver pieces: %5.2f\n", wt_silver);
         printf (" Value of silver pieces: %7.2f\n", val_silver);
         printf ("\n");
         printf (" Total number of pieces: %2d\n", NR);
         printf (" Value of collection: %7.2f\n", total); }

This program has a few interesting features:
Note the use of the "printf" statement, which has the general syntax:

   printf("<format_code>",<parameters>)

There is one format code for each of the parameters in the list. Each format code determines how its corresponding parameter will be printed. For example, the format code "%2d" tells Awk to print an integer in a field at least two characters wide, and the format code "%7.2f" tells Awk to print a floating-point number in a field at least seven characters wide, counting the decimal point, with two digits to the right of the decimal point.
Note also that, in this example, each string printed by "printf" ends with a "\n", which is a code for a "newline" (ASCII line-feed code). Unlike the "print" statement, which automatically advances the output to the next line when it prints a line, "printf" does not automatically advance the output, and by default the next output statement will append its output to the same line. A newline forces the output to skip to the next line.
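
The effect of these format codes is easy to see in isolation. In the sketch below, the square brackets are my own addition, there only to make the field widths visible; the program needs no input file because everything happens in a BEGIN clause.

```shell
awk 'BEGIN { printf("[%2d]\n", 9)           # padded on the left to two characters
             printf("[%7.2f]\n", 2958.5)    # seven characters, two after the point
             printf("[%5.2f]", 6.1)         # no newline in this format string ...
             printf("\n") }'                # ... so one is supplied separately
```

This prints "[ 9]", "[2958.50]", and "[ 6.10]" on three separate lines. Without the final "printf", the last line would be left unterminated.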
* I stored this program in a file named "summary.awk", and invoked it as follows:

   awk -f summary.awk coins.txt

The output was:

   Summary data for coin collection:

    Gold pieces:  9
    Weight of gold pieces:  6.10
    Value of gold pieces: 2958.50

    Silver pieces:  4
    Weight of silver pieces: 12.50
    Value of silver pieces:  200.00

    Total number of pieces: 13
    Value of collection: 3158.50

* This information should give you enough background to make good use of Awk. The next chapter provides a much more complete description of the language.