Decoded: comm (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of comm command (coreutils)

Summary

comm - compare two sorted files line by line

[Source] [Code Walkthrough]

Lines of code: 500
Principal syscalls: write() via fwrite()
Support syscalls: fadvise()
Options: 11 (4 short, 7 long)

Descended from comm in Version 2 UNIX (1972)
Added to Textutils in November 1992 [First version]
Number of revisions: 134

Helpers:
  • check_order() - Verifies the order of lines from a file
  • compare_files() - Performs the entire comparison and output procedure
  • writeline() - Writes the requested line with fwrite()
External non-standard helpers:
  • die() - Exit with mandatory non-zero error and message to stderr
  • error() - Outputs error message to standard error with possible process termination
  • readlinebuffer_delim() - Read an entire line of text including the delimiter to a linebuffer

Setup

comm declares several global flags that control execution flow and are defined during parsing:

  • both - Flag to print lines found in both files (default behavior)
  • hard_LC_COLLATE - Flag set if the LC_COLLATE locale is "hard" (not plain C/POSIX), in which case lines are compared with xmemcoll() rather than byte by byte
  • issued_disorder_warning[] - Flag array for both input files holding warning status
  • only_file_1 - Flag to print lines found only in file 1
  • only_file_2 - Flag to print lines found only in file 2
  • seen_unpairable - Flag set if we've observed mismatched lines between files
  • total_option - Flag to print a summary (--total option)

Three global variables affect how output is displayed:

  • *col_sep - The string that separates columns (a tab, \t, by default)
  • col_sep_len - The length of the separator
  • delim - The character that separates lines
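How the display flags and the column separator interact can be illustrated with a minimal sketch of a writeline()-style routine. This is an assumption-laden illustration, not the coreutils source: the struct display bundle, the write_column() name, and the explicit FILE * parameter are hypothetical (comm keeps these values in globals and writes to stdout). The idea shown is that a line in column 2 or 3 is preceded by one separator for each enabled column to its left.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bundle of the display settings; comm stores these as globals. */
struct display {
    bool only_file_1;     /* show column 1 (lines only in file 1) */
    bool only_file_2;     /* show column 2 (lines only in file 2) */
    bool both;            /* show column 3 (lines in both files)  */
    const char *col_sep;  /* column separator string              */
    size_t col_sep_len;   /* length of the separator in bytes     */
};

/* Write `line` (len bytes, delimiter included) in column `class` (1..3).
   Returns true if the line was written, false if its column is suppressed. */
static bool write_column(FILE *out, const struct display *d,
                         const char *line, size_t len, int class)
{
    assert(class >= 1 && class <= 3);

    bool shown = (class == 1) ? d->only_file_1
               : (class == 2) ? d->only_file_2
               : d->both;
    if (!shown)
        return false;

    /* One separator for every enabled column to the left of this one. */
    if (class >= 2 && d->only_file_1)
        fwrite(d->col_sep, 1, d->col_sep_len, out);
    if (class == 3 && d->only_file_2)
        fwrite(d->col_sep, 1, d->col_sep_len, out);

    fwrite(line, 1, len, out);
    return true;
}
```

With all three columns enabled and a tab separator, a common line is written with two leading tabs. Suppressing column 1 reduces that to one.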

main() declares one local variable, c, which holds the code of the option currently being processed.


Parsing

Parsing in comm sets the execution flags by answering these questions:

  • Should we verify the input file orders?
  • Should there be an alternate column separator?
  • Should we newline or NUL delimit?
  • Which output columns (lines only in file 1, lines only in file 2, lines in both) should be displayed?

Parsing failures

These failure cases are explicitly checked:

  • Specifying multiple output delimiters
  • Not providing two input files
  • Unknown options used

Failures result in a short error message followed by the usage instructions.
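The parsing questions and failure checks above can be sketched with getopt_long(). This is a simplified illustration rather than the actual comm source: struct settings, parse_args(), the OPT_* constants, and the exact error text are hypothetical, and --help and --version are left out. It assumes a platform that provides getopt_long() (glibc or a BSD libc).

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the parsed settings; comm uses globals. */
struct settings {
    bool only_file_1, only_file_2, both;
    bool total_option;
    int check_order;      /* 0 = default, 1 = --check-order, -1 = --nocheck-order */
    char delim;           /* '\n', or '\0' with -z */
    const char *col_sep;  /* NULL until --output-delimiter is seen */
};

enum { OPT_CHECK_ORDER = 256, OPT_NOCHECK_ORDER, OPT_OUTPUT_DELIMITER, OPT_TOTAL };

/* Returns 0 on success, -1 on a usage error. */
static int parse_args(int argc, char **argv, struct settings *s)
{
    static const struct option long_options[] = {
        {"check-order",      no_argument,       NULL, OPT_CHECK_ORDER},
        {"nocheck-order",    no_argument,       NULL, OPT_NOCHECK_ORDER},
        {"output-delimiter", required_argument, NULL, OPT_OUTPUT_DELIMITER},
        {"total",            no_argument,       NULL, OPT_TOTAL},
        {"zero-terminated",  no_argument,       NULL, 'z'},
        {NULL, 0, NULL, 0}
    };
    int c;

    assert(argv != NULL && s != NULL);
    /* By default all three columns are shown and lines end in a newline. */
    *s = (struct settings){ .only_file_1 = true, .only_file_2 = true,
                            .both = true, .delim = '\n' };
    optind = 1;
    while ((c = getopt_long(argc, argv, "123z", long_options, NULL)) != -1) {
        switch (c) {
        case '1': s->only_file_1 = false; break;
        case '2': s->only_file_2 = false; break;
        case '3': s->both = false; break;
        case 'z': s->delim = '\0'; break;
        case OPT_CHECK_ORDER:   s->check_order = 1;  break;
        case OPT_NOCHECK_ORDER: s->check_order = -1; break;
        case OPT_TOTAL:         s->total_option = true; break;
        case OPT_OUTPUT_DELIMITER:
            /* Repeating the same delimiter is fine; two different ones are not. */
            if (s->col_sep != NULL && strcmp(s->col_sep, optarg) != 0) {
                fprintf(stderr, "multiple output delimiters specified\n");
                return -1;
            }
            s->col_sep = optarg;
            break;
        default:              /* unknown option; getopt_long already complained */
            return -1;
        }
    }
    if (argc - optind != 2) {
        fprintf(stderr, "expected exactly two input files\n");
        return -1;
    }
    if (s->col_sep == NULL)
        s->col_sep = "\t";
    return 0;
}
```

A caller would print the usage text and exit with a non-zero status whenever parse_args() returns -1, which mirrors the failure behavior described above.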


Execution

The comm utility uses linebuffer data structures to read, hold, and compare lines pulled from the file streams. Because each stream is read strictly sequentially and never rewound, the output is only meaningful if both input files are sorted in the same collation order.
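The sortedness assumption can be verified cheaply by comparing each line with the previous line from the same file. The sketch below illustrates that idea under simplifying assumptions: in_order() is a hypothetical name, lines are NUL-terminated strings compared with strcmp() (byte order, as in the C locale), and the once-per-file warning flag stands in for an entry of issued_disorder_warning[].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns true when `cur` may follow `prev` in a
   sorted file. Equal lines are allowed. The first violation for a file
   prints a warning and sets *warned so later ones stay silent. */
static bool in_order(const char *prev, const char *cur,
                     int whatfile, bool *warned)
{
    assert(prev != NULL && cur != NULL && warned != NULL);

    if (strcmp(prev, cur) <= 0)
        return true;

    if (!*warned) {
        fprintf(stderr, "file %d is not in sorted order\n", whatfile);
        *warned = true;
    }
    return false;
}
```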

The execution process looks like this:

  • Initialize the linebuffers (lba) and the associated pointers, *thisline and *all_line
  • Open both file streams
  • Load lines from both files into the associated linebuffers
  • Check if there are any lines in either file to process. If so:
    • If either stream is empty, write the next line of the non-empty stream
    • If both streams have lines, compare and write the lesser
    • Pull the next line from the stream written from
    • Rotate through the used streams
    • Repeat check if there are more lines to process
  • Close the files
  • Print the total results, if requested
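The loop described above is a classic two-way merge. A minimal sketch follows, with several simplifications that are assumptions rather than the real design: the inputs are in-memory arrays of strings instead of streams read through linebuffers, comparison uses strcmp(), the separator is a fixed tab, all three columns are always shown, and merge_compare() is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Merge two sorted arrays of lines and write three tab-separated columns
   to `out`. total[0], total[1], total[2] receive the number of lines only
   in a, only in b, and in both. */
static void merge_compare(const char *const *a, size_t na,
                          const char *const *b, size_t nb,
                          FILE *out, size_t total[3])
{
    size_t i = 0, j = 0;

    assert(out != NULL);
    total[0] = total[1] = total[2] = 0;

    while (i < na || j < nb) {
        int order;

        if (i == na)
            order = 1;                    /* a exhausted: take from b */
        else if (j == nb)
            order = -1;                   /* b exhausted: take from a */
        else
            order = strcmp(a[i], b[j]);   /* both have lines: compare */

        if (order == 0) {                 /* common line: column 3 */
            fprintf(out, "\t\t%s\n", a[i]);
            total[2]++; i++; j++;
        } else if (order < 0) {           /* lesser line is from a: column 1 */
            fprintf(out, "%s\n", a[i]);
            total[0]++; i++;
        } else {                          /* lesser line is from b: column 2 */
            fprintf(out, "\t%s\n", b[j]);
            total[1]++; j++;
        }
    }
}
```

Only the side that was written advances, which corresponds to the "pull the next line from the stream written from" step. The counters are what a --total summary would report.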

Failure cases:

  • Unable to open or close file streams
  • Unable to read from a file stream

Failures at this stage output an error message to STDERR.
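The open-failure case can be illustrated with a small sketch. It assumes the common convention that a file name of "-" means standard input. open_input() and the message format are hypothetical, and the real utility reports errors through error() or die() rather than fprintf().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open an input by name, treating "-" as stdin.
   On failure, report the reason on stderr and return NULL. */
static FILE *open_input(const char *name)
{
    assert(name != NULL);

    if (strcmp(name, "-") == 0)
        return stdin;

    FILE *fp = fopen(name, "r");
    if (fp == NULL)
        fprintf(stderr, "comm: %s: %s\n", name, strerror(errno));
    return fp;
}
```

A caller would treat a NULL return as fatal and exit with a non-zero status.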

