words [options] files
-c, --count | Report the number of occurrences of each word.
-s, --sum | Report only the total number of words.
-f, --fold | Convert input to lower case before detecting words.
-p, --pattern=RE | Set the pattern defining the word separators (default: [^[:alpha:]_]).
-V, --version | Print version and exit.
-h | Print short help and exit.
--help | Print full documentation via less and exit.
Word separators are defined by the --pattern option. By default, any character other than an underscore or an alphabetic character (including accented characters) acts as a separator.
Without the --count option, the output is a single column of words, sorted case-insensitively. With the --count option, two tab-separated columns appear, with the counts in column 1 and the words in column 2; the output is sorted in reverse numerical order on column 1, and lines with equal counts are sorted normally on column 2.
The --fold option converts all input to lowercase.
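The output rules described above can be approximated with standard tools. What follows is only a rough sketch, not the actual implementation of words: the function names split_words, words_list and words_count_fold are hypothetical, the sketch handles ASCII letters only (it forces LC_ALL=C, so accented characters are treated as separators), and the relative order of words that differ only in case may not match the real tool.

```shell
#!/bin/sh
# Hypothetical approximation of `words` with standard tools (ASCII only).

# Split stdin into one word per line: every run of characters other than
# letters and underscore becomes a single newline; empty lines are dropped.
split_words() {
    LC_ALL=C tr -cs 'A-Za-z_' '\n' | sed '/^$/d'
}

# Roughly `words`: unique words, sorted case-insensitively.
words_list() {
    split_words | LC_ALL=C sort -u | LC_ALL=C sort -f
}

# Roughly `words --count --fold`: count<TAB>word, highest count first,
# lines with equal counts sorted by word.
words_count_fold() {
    LC_ALL=C tr 'A-Z' 'a-z' | split_words | LC_ALL=C sort | uniq -c |
        awk '{ printf "%s\t%s\n", $1, $2 }' |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

With these definitions, `words_list <file` and `words_count_fold <file` read a file in the same way the examples below read test. Folding before splitting mirrors the description of --fold, which converts the input before words are detected.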
The Prêt-à-porter robe is priced at € 77.50, the shoes (ladies' only) at € 255.
To show the words in the file test, which contains the line above:
words test #=>
à
at
is
ladies
only
porter
priced
Prêt
robe
shoes
the
The
To count the words, after folding upper to lower case:
words --count --fold test #=>
2	at
2	the
1	à
1	is
1	ladies
1	only
1	porter
1	prêt
1	priced
1	robe
1	shoes
To allow - as a possible word character, thus finding words like avant-garde:
words -p '[^[:alpha:]-]' test #=>
at
is
ladies
only
priced
Prêt-à-porter
robe
shoes
the
The
Note that the - must come last in the bracket expression (or directly after the ^), so that it is not interpreted as a range operator.
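The effect of the hyphen's position can be checked with any tool that uses POSIX bracket expressions. The sketch below uses grep -o instead of words, on ASCII-only sample text, so it only illustrates the regular-expression behaviour; the function names literal_hyphen and range_hyphen are hypothetical, and the -o option is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# Hyphen last in the bracket expression: it is taken literally,
# so hyphenated words are matched as a whole.
literal_hyphen() { LC_ALL=C grep -oE '[[:alpha:]-]+'; }

# Hyphen between two characters: it denotes the range a..c,
# so the "-" itself is not matched.
range_hyphen() { LC_ALL=C grep -oE '[a-c]+'; }

printf 'avant-garde\n' | literal_hyphen   # prints: avant-garde
printf 'a-c\n' | range_hyphen             # prints a and c on separate lines
```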
To count the number of backslashes in a TeX file:
words --pattern='[^\\]' -c test #=>
but, of course, this is a lot faster:
tr -dc '\\' <test |wc -c