Wednesday, October 10, 2007

Osh one-liner of the day (October 11, 2007)

The problem is to take a list of words as input (one per line) and print the distribution of word lengths, (e.g. 5 words of length 1, 27 words of length 2, ...). This can be done in one osh command as follows (split into multiple lines for readability):


zack$ cat /usr/share/dict/words | osh ^ \
agg -g 'word: len(word)' 0 'count, word: count + 1' ^ \
sort $
(1, 52)
(2, 155)
(3, 1351)
(4, 5110)
(5, 9987)
(6, 17477)
(7, 23734)
(8, 29926)
(9, 32380)
(10, 30867)
(11, 26010)
(12, 20460)
(13, 14937)
(14, 9763)
(15, 5924)
(16, 3377)
(17, 1813)
(18, 842)
(19, 428)
(20, 198)
(21, 82)
(22, 41)
(23, 17)
(24, 5)


How this works:

  • cat /usr/share/dict/words | : Pipes the contents of the input file to the rest of the command.
  • osh ^: Invokes the osh interpreter and pipes stdin to the first osh command. Each line of input turns into a tuple containing a string, because osh pipes tuples from one command to the next. (The piping here is done within the osh process; it's not using a pipe from the host OS.)
  • agg -g 'word: len(word)': agg is the osh command for doing aggregation, somewhat like python's reduce. -g means that multiple groups will be tracked during the aggregation, with a counter maintained for each. Groups are defined by the value of a grouping function, 'word: len(word)', which returns the length of an input word. I.e., we're going to maintain a counter for each word length observed.
  • 0: The initial value of each counter.
  • 'count, word: count + 1': A function for maintaining a group's counter. The arguments are the current count for the group, and the input word. The output from the function is an incremented count (for the word's group). Output from this agg command is a stream of pairs of the form (word length, #words of that length).
  • ^ sort: Pipe output from agg to the sort command which sorts by word length.
  • $: Print to stdout.

1 comment:

Anonymous said...

O(1) -- I get it, it's a binary pun. You so crazy!