Node: Word Sorting,  Next: History Sorting,  Previous: Labels Program,  Up: Miscellaneous Programs
The following awk program prints the number of occurrences of each word
in its input. It illustrates the associative nature of awk arrays by
using strings as subscripts. It also demonstrates the 'for index in
array' mechanism. Finally, it shows how awk is used in conjunction with
other utility programs to do a useful task of some complexity with a
minimum of effort. Some explanations follow the program listing:
     # Print list of word frequencies
     {
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }

     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }
This program has two rules. The first rule, because it has an empty
pattern, is executed for every input line. It uses awk's
field-accessing mechanism (see Examining Fields) to pick out the
individual words from the line, and the built-in variable NF (see
Built-in Variables) to know how many fields are available. For each
input word, it increments an element of the array freq to reflect that
the word has been seen an additional time.

The second rule, because it has the pattern END, is not executed until
the input has been exhausted. It prints out the contents of the freq
table that has been built up inside the first action.
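To see the two rules at work, the program can be run inline on a
one-line sample (the input text here is invented for illustration; the
'for word in freq' loop scans the array in no particular order, so the
output is piped through sort for a stable display):

```shell
# Sample input invented for illustration; the two-rule program is the one above.
printf 'to be or not to be\n' |
awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
     END { for (word in freq) printf "%s\t%d\n", word, freq[word] }' |
LC_ALL=C sort
# prints (tab-separated): be 2, not 1, or 1, to 2
```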
This program has several problems that would prevent it from being
useful by itself on real text files:

   - Words are detected using the awk convention that fields are
     separated just by whitespace. Other characters in the input
     (except newlines) don't have any special meaning to awk. This
     means that punctuation characters count as part of words.

   - The awk language considers upper- and lowercase characters to be
     distinct. Therefore, "bartender" and "Bartender" are not treated
     as the same word. This is undesirable, since in normal text words
     are capitalized if they begin sentences, and a frequency analyzer
     should not be sensitive to capitalization.
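Both problems are easy to demonstrate with a made-up input line, where
punctuation and capitalization split a single word across three
different array subscripts:

```shell
# One word, three spellings: the naive program counts them separately.
printf 'Bartender, bartender! bartender\n' |
awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
     END { for (word in freq) printf "%s\t%d\n", word, freq[word] }' |
LC_ALL=C sort
# prints three entries, each with count 1: "Bartender,", "bartender", "bartender!"
```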
The way to solve these problems is to use some of awk's more advanced
features. First, we use tolower to remove case distinctions. Next, we
use gsub to remove punctuation characters. Finally, we use the system
sort utility to process the output of the awk script. Here is the new
version of the program:

     # wordfreq.awk --- print list of word frequencies
     {
         $0 = tolower($0)    # remove case distinctions
         # remove punctuation
         gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }

     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }
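Running the same kind of made-up input through the new version (given
inline here rather than from a file) shows the variant spellings
collapsing into a single count; note that modifying $0 with gsub causes
awk to re-split the record into fields before the counting loop runs:

```shell
printf 'Bartender, bartender! bartender\n' |
awk '{ $0 = tolower($0)                          # remove case distinctions
       gsub(/[^[:alnum:]_[:blank:]]/, "", $0)    # remove punctuation; $0 is re-split
       for (i = 1; i <= NF; i++) freq[$i]++ }
     END { for (word in freq) printf "%s\t%d\n", word, freq[word] }'
# prints a single tab-separated line: bartender 3
```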
Assuming we have saved this program in a file named wordfreq.awk, and
that the data is in file1, the following pipeline:

     awk -f wordfreq.awk file1 | sort -k 2nr

produces a table of the words appearing in file1 in order of decreasing
frequency. The awk program suitably massages the data and produces a
word frequency table, which is not ordered.
The awk script's output is then sorted by the sort utility and printed
on the terminal. The options given to sort specify a sort that uses the
second field of each input line (skipping one field), that the sort
keys should be treated as numeric quantities (otherwise 15 would come
before 5), and that the sorting should be done in descending (reverse)
order.
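The effect of those options can be checked on a small hand-made table
in the same tab-separated format the script produces:

```shell
# -k 2nr: sort key starts at field 2, compared numerically (n), in reverse (r).
printf 'baz\t2\nfoo\t5\nbar\t15\n' | sort -k 2nr
# prints: bar 15, foo 5, baz 2 (tab-separated)
# without the n modifier the keys compare as strings, and 15 sorts before 5
```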
The sort could even be done from within the program, by changing the
END action to:

     END {
         sort = "sort -k 2nr"
         for (word in freq)
             printf "%s\t%d\n", word, freq[word] | sort
         close(sort)
     }
This way of sorting must be used on systems that do not have true pipes
at the command-line (or batch-file) level. See the general operating
system documentation for more information on how to use the sort
program.
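A quick way to confirm that the piped END action behaves like the
external pipeline is to run it on a short invented input whose word
counts are all distinct:

```shell
printf 'a b a c b a\n' |
awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
     END {
         sort = "sort -k 2nr"      # command the output is piped into
         for (word in freq)
             printf "%s\t%d\n", word, freq[word] | sort
         close(sort)               # close the pipe so sort runs and flushes
     }'
# prints (tab-separated): a 3, b 2, c 1
```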