Surprisingly relevant?

Posted May 20, 2020 5:45 UTC (Wed) by cyphar (subscriber, #110703)
In reply to: Surprisingly relevant? by NYKevin
Parent article: The state of the AWK

Depends on your definition of "simple". While I do make use of all of the tools you've mentioned, awk has carved out its own niche in that pantheon. It allows you to do a handful of things that you would ordinarily need to reach for a "real" programming for such as aggregation or basic programs that make use of maps. Yes, you could implement these things in Python fairly easily, but with two downsides:

You can't just write the script in your shell, you need to open a text editor. Though this is mostly a Python problem, caused by its use of whitespace for code flow. Awk uses brackets and semicolons, so this isn't an issue. I would wager that most awk scripts are written from within a shell.
Most "real" languages require you to do a bunch of boilerplate (such as looping over input, or explicitly do conversions to non-string values). For "real" programming languages, it makes sense to require this kind of boiler plate -- but awk lets you elide all of it because its only purpose is to execute programs over records and fields. For a quick-and-dirty language like awk it's a godsend to not need to have any boilerplate.

Compare the following programs which take the output of sha256sum of a directory tree and find any files which have matching hashes. The one written in awk is verbatim a program I wrote a week ago (note that I actually wrote it in a single line on my command-line, but I put it in a file for an easier comparison).

% cat prog.py
collisions = {}
for line in iter(input, ""):
  hash, *_ = line.split() # really should be re.split but that would be too mean to Python
  if hash not in collisions:
    collisions[hash] = []
  collisions[hash].append(line)

for hash, lines in collisions.items():
  if len(lines) > 1:
    print(hash)
    for line in lines:
      print(line)

% cat prog.awk
{
  files[$1][length(files[$1])] = $0
}

END {
  for (hash in files) {
    if (length(files[hash]) > 1) {
      print hash;
      for (idx in files[hash]) {
        print " " files[hash][idx];
      }
    }
  }
}

What is the first thing you notice? All of the boilerplate in Python about iter(input, "") and splitting the line is already done for you in awk. The actual logic of the program is implemented in a single statement in awk, with the rest of the program just printing out the calculation. And that is one of the reasons why I reach for awk much more often than I reach for Python when I have a relatively-simple source of data to parse -- I just have to type a lot less.

> What does bother me is when people write awk "{ print $1 }" instead of cut -f1. I find the latter more readable.

The problem is that cut splits on the literal " " (U+0020) or whatever other literal you specify, while awk splits fields using a regular expression (which by default is /\s+/). Many standard Unix programs output data such that cut's field handling is simply not usable. You could clean up the data with sed, but now you're working around the fact that cut isn't doing its job correctly. I sometimes feel that cut would be a better program if it were implemented as an awk script.

to post comments