* The Awk programming language was designed to be simple but powerful. It allows a user to perform relatively sophisticated text-manipulation operations through Awk programs written on the command line.

For example, suppose I want to turn a document with single-spacing into a document with double-spacing. I could easily do that with the following Awk program:

   awk '{print ; print ""}' infile > outfile

Notice how single-quotes (' ') are used to allow using double-quotes (" ") within the Awk expression. This "hides" special characters from the shell you are using. You could also do this as follows:

   awk "{print ; print \"\"}" infile > outfile

-- but the single-quote method is simpler.
This program does what it is supposed to, but it also doubles every blank line in the input file, which leaves a lot of empty space in the output. That's easy to fix: just tell Awk to print an extra blank line only if the current line is not blank:

   awk '{print ; if (NF != 0) print ""}' infile > outfile

* One of the problems with Awk is that it is ingenious enough to make a user want to tinker with it, and use it for tasks for which it isn't really appropriate. For example, you could use Awk to count the number of lines in a file:

   awk 'END {print NR}' infile

-- but this is dumb, because the "wc (word count)" utility gives the same answer with less bother. "Use the right tool for the job."
Awk is the right tool for slightly more complicated tasks. Once I had a file containing an email distribution list. The email addresses of various different groups were placed on consecutive lines in the file, with the different groups separated by blank lines. If I wanted to quickly and reliably determine how many people were on the distribution list, I couldn't use "wc", since it counts blank lines, but Awk handled it easily:

   awk 'NF != 0 {++count} END {print count}' list

* Another problem I ran into was determining the average size of a number of files. I was creating a set of bitmaps with a scanner and storing them on a floppy disk. The disk started getting full and I was curious to know just how many more bitmaps I could store on the disk.

I could obtain the file sizes in bytes using "wc -c" or the "list" utility ("ls -l" or "ll"). A few tests showed that "ll" was faster. Since "ll" lists the file size in the fifth field, all I had to do was sum up the fifth field and divide by NR. There was one slight problem, however: the first line of the output of "ll" listed the total number of sectors used, and had to be skipped.

No problem. I simply entered:

   ll | awk 'NR!=1 {s+=$5} END {print "Average: " s/(NR-1)}'

This gave me the average as about 40 KB per file.
* Awk is useful for performing simple iterative computations for which a more sophisticated language like C might prove overkill. Consider the Fibonacci sequence:

   1 1 2 3 5 8 13 21 34 ...

Each element in the sequence is constructed by adding the two previous elements together, with the first two elements defined as both "1". It's a discrete formula for exponential growth. It is very easy to use Awk to generate this sequence:

   awk 'BEGIN {a=1;b=1; while(++x<=10){print a; t=a;a=a+b;b=t}; exit}'

This generates the following output data:

   1
   2
   3
   5
   8
   13
   21
   34
   55
   89
* Sometimes an Awk program is so useful that you want to use it over and over again. In that case, it's simple to execute the Awk program from a shell script.

For example, consider an Awk script to print each word in a file on a separate line. This could be done with a script named "words" containing:

   awk '{c=split($0, s); for(n=1; n<=c; ++n) print s[n] }' $1

"Words" could then be made executable (using "chmod +x words") and the resulting shell "program" invoked just like any other command. For example, "words" could be invoked from the "vi" text editor as follows:

   :%!words

This would turn all the text into a list of single words.

For another example, consider the double-spacing program mentioned previously. This could be slightly changed to accept standard input, then copied into a file named "double":

   awk '{print; if (NF != 0) print ""}' -

-- and then could be invoked from "vi" to double-space all the text in the editor.
* The next step would be to also allow "double" to perform the reverse operation: to take a double-spaced file and return it to single-spaced, using the option:

   undouble

The first part of the task is, of course, to design a way of stripping out the extra blank lines, without destroying the spacing of the original single-spaced file by taking out all the blank lines. The simplest approach would be to delete every other blank line in a continuous block of such blank lines. This won't necessarily preserve the original spacing, but it will preserve spacing in some form.

The method for achieving this is also simple, and involves using a variable named "skip". This variable is set to "1" every time a blank line is skipped, to tell the Awk program NOT to skip the next one. The scheme is as follows:

   BEGIN {set skip to 0}
   scan the input:
      if skip == 0    if line is blank
                         skip = 1
                      else
                         print the line
                      get next line of input
      if skip == 1    print the line
                      skip = 0
                      get next line of input

This translates directly into the following Awk program:
   BEGIN      {skip = 0}
   skip == 0  {if (NF == 0)
                  {skip = 1}
               else
                  print;
               next}
   skip == 1  {print;
               skip = 0;
               next}

You could place this in a separate file, named, say, "undouble.awk", and then write the shell script "undouble" as:
   awk -f undouble.awk

-- or you could embed the program directly in the shell script, using single-quotes to enclose the program and backslashes ("\") to allow for multiple lines:

   awk 'BEGIN      {skip = 0} \
        skip == 0  {if (NF == 0) \
                       {skip = 1} \
                    else \
                       print; \
                    next} \
        skip == 1  {print; \
                    skip = 0; \
                    next}'

Remember that when you use "\" to embed an Awk program in a script file, the program appears as one line to Awk. Make sure you always use a semicolon to separate commands.
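As a quick check of the scheme, the sketch below wraps the double-spacing and undoubling programs in two shell functions and runs one after the other. The function names "double_space" and "undouble_text" and the sample text are made up for this illustration; the Awk programs themselves are the ones given above, written with plain newlines inside the single-quotes, which a Bourne-type shell accepts.

```shell
# Hypothetical wrappers around the two Awk programs discussed above.
double_space() {
    # Print every line; add a blank line after each non-blank line.
    awk '{print; if (NF != 0) print ""}' "$@"
}

undouble_text() {
    # Drop every other blank line in a run of blank lines.
    awk 'BEGIN      {skip = 0}
         skip == 0  {if (NF == 0) {skip = 1} else print; next}
         skip == 1  {print; skip = 0; next}' "$@"
}

# Round trip on a three-line sample that contains one blank line.
printf 'a\n\nb\n' | double_space | undouble_text
```

For this sample the round trip gives back the original three lines. As noted above, that won't necessarily be true of every input.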
* This example sets a simple flag variable named "skip" to allow the Awk program to keep track of what it has been doing. Awk, as you should know by now, operates in a cycle: get a line, process it, get the next line, process it, and so on. If you want Awk to remember things between cycles, you can have the Awk program leave a little message for itself in a variable so it remembers things from cycle to cycle.

For example, say you want to match on a line whose first field has the value 1,000 -- but then print the next line. You could do that as follows:

   BEGIN       {flag = 0}
   $1 == 1000  {flag = 1;
                next}
   flag == 1   {print;
                flag = 0;
                next}

This program sets a variable named "flag" when it finds a line starting with 1,000, and then goes and gets the next line of input. The next line of input is printed, and then "flag" is cleared so the line after that won't be printed.
If you wanted to print the next five lines, you could do that in much the same way using a variable named, say, "counter":

   BEGIN        {counter = 0}
   $1 == 1000   {counter = 5;
                 next}
   counter > 0  {print;
                 counter--;
                 next}

This program initializes a variable named "counter" to 5 when it finds a line starting with 1,000; for each of the following 5 lines of input, it prints them and decrements "counter" until it is zero.
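To see the counter technique at work, here is a small sketch that wraps it in a shell function and feeds it a few lines of sample data. The function name "print_after" and the data are invented for the example, and the counter is set to 2 instead of 5 only to keep the sample short.

```shell
# Hypothetical helper: print the 2 lines that follow any line
# whose first field is 1000.
print_after() {
    awk '$1 == 1000   {counter = 2; next}
         counter > 0  {print; counter--; next}' "$@"
}

# The matching line itself is not printed, only the two after it.
printf '999 a\n1000 b\n1001 c\n1002 d\n1003 e\n' | print_after
```

With this input, only the "1001 c" and "1002 d" lines are printed.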
This approach can be taken to as great a level of elaboration as you like. Suppose you have a list of, say, five different actions to be taken after matching a line of input; you can then create a variable named, say, "state", that stores which item in the list to perform next. The scheme is generally as follows:

   BEGIN {set state to 0}
   scan the input:
      if match         set state to 1
                       get next line of input
      if state == 1    do the first thing in the list
                       state = 2
                       get next line of input
      if state == 2    do the second thing in the list
                       state = 3
                       get next line of input
      if state == 3    do the third thing in the list
                       state = 4
                       get next line of input
      if state == 4    do the fourth thing in the list
                       state = 5
                       get next line of input
      if state == 5    do the fifth (and last) thing in the list
                       state = 0
                       get next line of input

This is called a "state machine". In this case, it's performing a simple list of actions, but the same approach could also be used to perform a more complicated branching sequence of actions, such as you might have in a flowchart instead of a simple list.

You could assign state numbers to the blocks in your flowchart and then use if-then tests for the decision-making blocks to set the state variable to indicate which of the alternate actions should be performed next. However, few Awk programs require such complexities, and going into more elaborate examples here would probably be more confusing than it's worth. The essential thing to remember is that an Awk program can leave messages for itself in a variable on one line-scan cycle to tell it what to do on later line-scan cycles.
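A small concrete case may make the scheme easier to picture. The sketch below assumes a made-up input format in which a line reading "START" is followed by a name line and then a value line; the function name "pair_up" is likewise just for illustration. The state machine has two steps after the match: remember the name, then print it together with the value.

```shell
# Hypothetical two-step state machine: after a START line, take the
# next line as a name and the line after that as its value.
pair_up() {
    awk 'state == 0 && $1 == "START"  {state = 1; next}
         state == 1                   {name = $0; state = 2; next}
         state == 2                   {print name "=" $0; state = 0; next}' "$@"
}

# Lines outside a START block ("junk", "noise") are ignored.
printf 'junk\nSTART\ncolor\nred\nnoise\nSTART\nsize\nlarge\n' | pair_up
```

This prints "color=red" and "size=large". The "state" variable is never set in a BEGIN clause here, since an unset Awk variable already compares equal to 0.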
* Awk is an excellent tool for building UN*X shell scripts, but you can run into a few problems. Say you have a scriptfile named "testscript", and it takes two filenames as parameters:

   testscript myfile1 myfile2

If you're executing Awk commands from a file, handling the two filenames isn't very difficult. You can initialize variables on the command line as follows:

   cat $1 | awk -f testscript.awk f1=$1 f2=$2 > tmpfile

The Awk program will use two variables, "f1" and "f2", that are initialized from the script command line variables "$1" and "$2".

Where this measure gets obnoxious is when you are specifying Awk commands directly, which is preferable if possible since it reduces the number of files needed to implement a script. The problem is that "$1" and "$2" have different meanings to the scriptfile and to Awk. To the scriptfile, they are command-line parameters, but to Awk they indicate text fields in the input.

The handling of these variables depends on how the Awk program is quoted -- either enclosed in double-quotes (" ") or in single-quotes (' '). If you invoke Awk as follows:

   awk "{ print \"This is a test: \" $1 }" $1

-- you won't get anything printed for the "$1" variable, because the shell replaces it with the filename before Awk ever sees it. If you instead use single-quotes to ensure that the scriptfile leaves the Awk positional variables alone, you can insert scriptfile variables by initializing them to variables on the command line:

   awk '{ print "This is a test: " $1 " / parm2 = " f }' f=$2 < $1

This provides the first field in "myfile1" as the first parameter and the name of "myfile2" as the second parameter.
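POSIX versions of Awk (including Nawk) also accept a "-v" option that assigns a variable before the program starts running, so the value is available even in a BEGIN clause. The sketch below shows one way to use it for the same job; the function name "show_parm" is hypothetical, and the function's own "$1" and "$2" stand in for the scriptfile parameters.

```shell
# Hypothetical helper: $1 is the data file, $2 is a second parameter
# handed to Awk through the variable "f".
show_parm() {
    # Inside the single-quotes, $1 is Awk's first field; outside them,
    # "$1" and "$2" are the shell function's parameters.
    awk -v f="$2" '{ print "This is a test: " $1 " / parm2 = " f }' "$1"
}
```

Called as "show_parm myfile1 myfile2", this prints one line for each line of "myfile1", with the first field of that line followed by the name "myfile2".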
Remember that Awk is relatively slow and clumsy and should not be regarded as the default tool for all scriptfile jobs. You can use "cat" to append to files, "head" and "tail" to cut off a given number of lines of text from the front or back of a file, "grep" or "fgrep" to find lines in a particular file, and "sed" to do search-replaces on the stream in the file.
* The original version of Awk was developed in 1977. It was optimized for throwing together "one-liners" or short, quick-and-dirty programs. However, some users liked Awk so much that they used it for much more complicated tasks. To quote the language's authors: "Our first reaction to a program that didn't fit on one page was shock and amazement." Some users regarded Awk as their primary programming tool, and many had in fact learned programming using Awk.

After the authors got over their initial consternation, they decided to accept the fact, and enhance Awk to make it a better general-purpose programming tool. The new version of Awk was released in 1985. Since the old Awk implementation is still the standard in UN*X systems, the new version is often, if not always, known as Nawk ("New Awk") to distinguish it from the old one.
* Nawk incorporates several major improvements. The most important improvement is that users can define their own functions. For example, the following Nawk program implements the "signum" function:

   {for (field=1; field<=NF; ++field) {print signum($field)}};

   function signum(n) {
      if (n<0) return -1
      else if (n==0) return 0
      else return 1}

Function declarations can be placed in a program wherever a match-action clause can. All parameters are local to the function. Local variables are obtained by listing extra names in the parameter list; any other variable used inside the function is global.
* A second improvement is a new function, "getline", that allows input from files other than those specified in the command line at invocation (as well as input from pipes). "Getline" can be used in a number of ways:

   getline                   Loads $0 from current input.
   getline myvar             Loads "myvar" from current input.
   getline <"myfile"         Loads $0 from "myfile".
   getline myvar <"myfile"   Loads "myvar" from "myfile".
   command | getline         Loads $0 from output of "command".
   command | getline myvar   Loads "myvar" from output of "command".

* A related function, "close", allows a file to be closed so it can be read from the beginning again:

   close("myfile")

* A new function, "system", allows Awk programs to invoke system commands:

   system("rm myfile")

* Command-line parameters can be interpreted using two new predefined variables, ARGC and ARGV, a mechanism instantly familiar to C programmers. ARGC ("argument count") gives the number of command-line elements, and ARGV ("argument vector") is an array whose entries store the elements individually.
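As a minimal sketch of ARGC and ARGV, the function below lists the command-line elements that follow the program text. The function name "show_args" is made up for the example. ARGV[0] holds the name of the Awk command itself, which varies from system to system, so the loop starts at 1.

```shell
# Hypothetical helper: list the command-line elements Awk received.
show_args() {
    # A BEGIN-only program never tries to open the arguments as files.
    awk 'BEGIN { for (i = 1; i < ARGC; i++) print i ": " ARGV[i] }' "$@"
}

show_args one two
```

This prints "1: one" and "2: two"; ARGC is 3 in this case, since it also counts ARGV[0].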
* There is a new conditional-assignment expression, known as "?:", which is used as follows:

   status = (condition == "green")? "go" : "stop"

This translates to:

   if (condition=="green") {status = "go"} else {status = "stop"}

This construct should also be familiar to C programmers.
* There are new math functions, such as trig and random-number functions:

   sin(x)         Sine, with x in radians.
   cos(x)         Cosine, with x in radians.
   atan2(y,x)     Arctangent of y/x, in range -PI to PI.
   rand()         Random number, with 0 <= number < 1.
   srand()        Seed for random-number generator.

* There are new string functions, such as match and substitution functions:

   match(<target string>,<search string>)
      Search the target string for the search string; return 0 if no match, return starting index of search string if match. Also sets built-in variable RSTART to the starting index, and sets built-in variable RLENGTH to the matched string's length.

   sub(<regular expression>,<replacement string>)
      Search for first match of regular expression in $0 and substitute replacement string. This function returns the number of substitutions made, as do the other substitution functions.

   sub(<regular expression>,<replacement string>,<target string>)
      Search for first match of regular expression in target string and substitute replacement string.

   gsub(<regular expression>,<replacement string>)
      Search for all matches of regular expression in $0 and substitute replacement string.

   gsub(<regular expression>,<replacement string>,<target string>)
      Search for all matches of regular expression in target string and substitute replacement string.
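The following sketch exercises the match and substitution functions on a sample string. The function name "demo_strings" and the sample text are invented for the illustration; the Awk functions and the RSTART and RLENGTH variables are the standard ones described above.

```shell
# Hypothetical demo of match(), sub(), and gsub() on a fixed string.
demo_strings() {
    awk 'BEGIN {
        s = "the quick brown fox"
        pos = match(s, /qu[a-z]+/)      # "quick" starts at character 5
        print pos, RSTART, RLENGTH
        n = sub(/quick/, "slow", s)     # replace first match only
        print n, s
        n = gsub(/o/, "0", s)           # replace every match
        print n, s
    }'
}

demo_strings
```

The three lines printed are "5 5 5", "1 the slow brown fox", and "3 the sl0w br0wn f0x": the first number on the second and third lines is the count of substitutions made.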
* There is a mechanism for handling multidimensional arrays. For example, the following program creates and prints a matrix, and then prints the transposition of the matrix:

   BEGIN {count = 1;
      for (row = 1; row <= 5; ++row) {
         for (col = 1; col <= 3; ++col) {
            printf("%4d",count);
            array[row,col] = count++; }
         printf("\n"); }
      printf("\n");
      for (col = 1; col <= 3; ++col) {
         for (row = 1; row <= 5; ++row) {
            printf("%4d",array[row,col]); }
         printf("\n"); }
      exit; }

This yields:

      1   2   3
      4   5   6
      7   8   9
     10  11  12
     13  14  15

      1   4   7  10  13
      2   5   8  11  14
      3   6   9  12  15

Nawk also includes a new "delete" statement, which deletes array elements:

   delete array[count]

* Characters can be expressed as octal codes. "\033", for example, can be used to define an "escape" character.
* A new built-in variable, FNR, keeps track of the record number within the current file, as opposed to NR, which keeps track of the record number of the current line of input, regardless of how many files have contributed to that input. Its behavior is otherwise identical to that of NR.
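The difference only shows up when Awk is given more than one input file. The sketch below prints both counters for every line; the function name "count_records" is hypothetical, and the file names "one" and "two" in the note that follows are sample names.

```shell
# Hypothetical helper: show NR and FNR side by side for each line.
count_records() {
    awk '{ print "NR=" NR, "FNR=" FNR, $0 }' "$@"
}
```

If "one" holds the lines "a" and "b" and "two" holds the line "c", then "count_records one two" prints "NR=1 FNR=1 a", "NR=2 FNR=2 b", and "NR=3 FNR=1 c": FNR starts over at 1 for the second file while NR keeps counting.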
* While Nawk does have useful refinements, they are generally intended to support the development of complicated programs. My feeling is that Nawk represents overkill for all but the most dedicated Awk users, and in any case would require a substantial document of its own to do its capabilities justice. Those who would like to know more about Nawk are encouraged to read THE AWK PROGRAMMING LANGUAGE by Aho / Weinberger / Kernighan. This short, terse, detailed book outlines the capabilities of Nawk and provides sophisticated examples of its use.
* This final section provides a convenient lookup reference for Awk programming. If you want a more detailed reference and are using a UN*X or Linux system, you might look at the online awk manual pages by invoking:

   man awk

Apparently some systems have an "info" command that is the same as "man" and which is used in the same way.
* Invoking Awk:
   awk [-F<ch>] {pgm} | {-f <pgm file>} [<vars>] [-|<data file>]

-- where:

   ch:          Field-separator character.
   pgm:         Awk command-line program.
   pgm file:    File containing an Awk program.
   vars:        Awk variable initializations.
   data file:   Input data file.

* General form of Awk program:
   BEGIN              {<initializations>}
   <search pattern 1> {<program actions>}
   <search pattern 2> {<program actions>}
   ...
   END                {<final actions>}

* Search patterns:

   /<string>/     Search for string.
   /^<string>/    Search for string at beginning of line.
   /<string>$/    Search for string at end of line.

The search can be constrained to particular fields:

   $<field> ~ /<string>/    Search for string in specified field.
   $<field> !~ /<string>/   Search for string not in specified field.

Strings can be ORed in a search:

   /(<string1>)|(<string2>)/

The search can be for an entire range of lines, bounded by two strings:

   /<string1>/,/<string2>/

The search can be for any condition, such as line number, and can use the following comparison operators:

   == != < > <= >=

Different conditions can be ORed with "||" or ANDed with "&&".

   [<charlist or range>]    Match on any character in list or range.
   [^<charlist or range>]   Match on any character not in list or range.
   .                        Match any single character.
   *                        Match 0 or more occurrences of preceding string.
   ?                        Match 0 or 1 occurrences of preceding string.
   +                        Match 1 or more occurrences of preceding string.

If a metacharacter is part of the search string, it can be "escaped" by preceding it with a "\".
* Special characters:
   \n     Newline (line feed).
   \b     Backspace.
   \r     Carriage return.
   \f     Form feed.

A "\" can be embedded in a string by entering it twice: "\\".
* Built-in variables:
   $0; $1,$2,$3,...  Field variables.
   NR                Number of records (lines).
   NF                Number of fields.
   FILENAME          Current input filename.
   FS                Field separator character (default: " ").
   RS                Record separator character (default: "\n").
   OFS               Output field separator (default: " ").
   ORS               Output record separator (default: "\n").
   OFMT              Output format (default: "%.6g").

* Arithmetic operations:

   +   Addition.
   -   Subtraction.
   *   Multiplication.
   /   Division.
   %   Mod.
   ++  Increment.
   --  Decrement.

Shorthand assignments:

   x += 2  -- is the same as:  x = x + 2
   x -= 2  -- is the same as:  x = x - 2
   x *= 2  -- is the same as:  x = x * 2
   x /= 2  -- is the same as:  x = x / 2
   x %= 2  -- is the same as:  x = x % 2

* The only unique string operation is concatenation, which is performed simply by listing two strings connected by a blank space.
* Arithmetic functions:
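   sqrt()   Square root.
   log()    Base e log.
   exp()    Power of e.
   int()    Integer part of argument.

* String functions:

   length()
      Length of string.

   substr(<string>,<start of substring>,<max length of substring>)
      Get substring.

   split(<string>,<array>,[<field separator>])
      Split string into array, with initial array index being 1.

   index(<target string>,<search string>)
      Find index of search string in target string.

   sprintf()
      Perform formatted print into string.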
* Control structures:
   if (<condition>) <action 1> [else <action 2>]
   while (<condition>) <action>
   for (<initial action>;<condition>;<end-of-loop action>) <action>

Scanning through an associative array with "for":

   for (<variable> in <array>) <action>

Unconditional control statements:

   break       Break out of "while" or "for" loop.
   continue    Perform next iteration of "while" or "for" loop.
   next        Get and scan next line of input.
   exit        Finish reading input and perform END statements.

* Print:
   print <i1>, <i2>, ...   Print items separated by OFS; end with newline.
   print <i1> <i2> ...     Print items concatenated; end with newline.

* Printf():

General format:

   printf(<string with format codes>,[<parameters>])

Newlines must be explicitly specified with a "\n".

General form of format code:

   %[<number>]<format code>

The optional "number" can consist of:

   A leading "-" for left-justified output.
   An integer part that specifies the minimum output width. (A leading "0" causes the output to be padded with zeroes.)
   A fractional part that specifies either the maximum number of characters to be printed (for a string), or the number of digits to be printed to the right of the decimal point (for a floating-point number).

The format codes are:

   d    Prints a number in decimal format.
   o    Prints a number in octal format.
   x    Prints a number in hexadecimal format.
   c    Prints a character, given its numeric code.
   s    Prints a string.
   e    Prints a number in exponential format.
   f    Prints a number in floating-point format.
   g    Prints a number in exponential or floating-point format.

* Awk can perform output redirection (using ">" and ">>") and piping (using "|") from both "print" and "printf".
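Here is a short sketch that combines format codes with redirection and piping. The function names "report" and "sorted_first" are hypothetical, as is the three-column input they expect; the output file name is passed in with the POSIX "-v" option, which is an addition for this example.

```shell
# Hypothetical helper: format three fields per line and redirect the
# result into the file named by the function's first parameter.
report() {
    # %-6s  left-justified string, at least 6 characters wide
    # %5.2f floating-point, 5 wide, 2 digits after the decimal point
    # %03d  decimal, 3 wide, padded with leading zeroes
    awk -v out="$1" '{ printf("%-6s|%5.2f|%03d\n", $1, $2, $3) > out }'
}

# Hypothetical helper: pipe the first field of every line through "sort".
sorted_first() {
    awk '{ print $1 | "sort" }' "$@"
}
```

Feeding the line "ab 3.14159 7" to "report outfile" leaves "ab    | 3.14|007" in "outfile", and feeding the lines "pear" and "apple" to "sorted_first" prints them in alphabetical order.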
* Revision history:
v1.0 / 11 mar 90 / gvg
v1.1 / 29 nov 94 / gvg / Cosmetic rewrite.
v1.2 / 12 oct 95 / gvg / Web rewrite, added stuff on shell scripts.
v1.3 / 15 jan 99 / gvg / Minor cosmetic update.
v1.0.4 / 01 jan 02 / gvg / Minor cosmetic update.
v1.0.5 / 01 jan 04 / gvg / Minor cosmetic update.
v1.0.6 / 01 may 04 / gvg / Added comments on state variables.
v1.0.7 / 01 jun 04 / gvg / Added comments on numeric / string comparisons.
v1.0.8 / 01 jul 04 / gvg / Corrected an obnoxious typo error.
v1.0.9 / 01 oct 04 / gvg / Corrected another tweaky error.