Decoded: uniq (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of uniq command (coreutils)

Summary

uniq - uniquify files (remove duplicate lines from a sorted file)

[Source] [Code Walkthrough]

Lines of code: 676
Principal syscall: write()
Support syscalls: open(), close(), fadvise()
Options: 15 (9 short, 12 long, does not include legacy digits for field skip)

Ancestor included with Version 3 UNIX (1973). Original man dated late 1972.
Added to Textutils in November 1992 [First version]
Number of revisions: 189 [Code Evolution]

Helpers:
  • check_file() - The actual uniq procedure
  • different() - Compares two lines; returns false if they match and true if they differ
  • find_field() - Returns the offset to a line's field to compare
  • size_opt() - Converts input option to a size type
  • strict_posix2() - Checks if the system is POSIX2 compliant (affects valid syntax)
  • writeline() - Outputs a line to standard output
External non-standard helpers:
  • die() - Exit with mandatory non-zero error and message to stderr
  • error() - Outputs error message to standard error with possible process termination

Setup

uniq keeps several flags and variables as globals, including:

  • check_chars - The number of characters to check on a line (-w)
  • hard_LC_COLLATE - Flag set if the LC_COLLATE locale is "hard" (not the plain C/POSIX locale), so comparisons need locale-aware collation
  • ignore_case - Flag if we ignore case when comparing letters
  • output_first_repeated - Flag to output the first line of each repeated group
  • output_later_repeated - Flag to output the later lines of each repeated group
  • output_unique - Flag to output lines that are not repeated
  • skip_chars - The number of characters to skip after skipping any fields (-s)
  • skip_fields - The number of fields to skip when comparing lines (-f)
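How skip_fields, skip_chars, check_chars, and ignore_case narrow the comparison can be sketched in a few lines. This is an illustrative Python sketch, not the coreutils C source. The function names mirror the helpers listed above, but the signatures, the treatment of a field as a run of blanks followed by a run of non-blanks, and the simple lowercase folding are assumptions made for the example.

```python
from typing import Optional


def find_field(line: str, skip_fields: int = 0, skip_chars: int = 0) -> int:
    """Return the offset in `line` where the comparison should start."""
    i, n = 0, len(line)
    for _ in range(skip_fields):
        if i >= n:
            break
        # Assumption: a field is leading blanks plus the non-blank run after them.
        while i < n and line[i] in " \t":
            i += 1
        while i < n and line[i] not in " \t":
            i += 1
    # Skip extra characters after the fields, without running past the end.
    return min(i + skip_chars, n)


def different(old: str, new: str,
              check_chars: Optional[int] = None,
              ignore_case: bool = False) -> bool:
    """Return False if the two strings match, True if they differ."""
    if check_chars is not None:
        # Only the first check_chars characters take part in the comparison.
        old, new = old[:check_chars], new[:check_chars]
    if ignore_case:
        # Simplification: plain lowercase folding, no locale handling.
        old, new = old.lower(), new.lower()
    return old != new
```

In this sketch, two lines would be compared as different(a[find_field(a, f, s):], b[find_field(b, f, s):], w, i), where f, s, w, and i stand for the values of the globals above.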

main() introduces a few local variables:

  • delimiter - The end of line delimiter, \n or \0 (-z)
  • *file[] - The input and output file names
  • nfiles - The number of file names stored so far in file[] (also the next index to fill)
  • optc - The character for the next option to process
  • output_option_used - Flag if the user requested a specific output mode
  • posixly_correct - Flag if the POSIXLY_CORRECT environment variable is set
  • skip_field_option_type - Holds the skip field processing behaviors (unused, legacy, current)

Parsing

Parsing answers the following questions to define the execution parameters:

  • What is the range of comparison between lines?
  • Is the comparison case sensitive?
  • Which lines should be output (first match, subsequent matches)?
  • Should we use the NUL delimiter and how should it apply to groups?

Parsing failures

These failure cases are explicitly checked:

  • Providing too many file names
  • Nonsensical number of fields, characters, or bytes to skip/check
  • Combining a grouping method with an output method
  • Grouping and printing repeat counts
  • Printing all duplicated lines and repeat counts
  • Unknown option used
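The combination checks in this list can be sketched as a single validation pass that runs after all options are read. This is a hypothetical Python sketch: the function name validate_options, its parameters, and the returned message strings are invented for illustration and are not the identifiers or the wording uniq itself uses.

```python
from typing import List, Optional


def validate_options(files: List[str],
                     grouping: Optional[str] = None,
                     output_option_used: bool = False,
                     count: bool = False,
                     all_repeated: bool = False) -> Optional[str]:
    """Return a short description of the first problem found, or None."""
    # At most one input name and one output name are accepted.
    if len(files) > 2:
        return "too many file names"
    # A grouping method cannot be combined with an explicit output mode.
    if grouping is not None and output_option_used:
        return "grouping combined with an output mode"
    # Counts make no sense when every line of a group is printed.
    if grouping is not None and count:
        return "grouping combined with repeat counts"
    if all_repeated and count:
        return "all duplicates combined with repeat counts"
    return None
```

Returning a description rather than exiting keeps the sketch easy to test; a real utility would print the message with a usage hint and exit non-zero.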

Parsing failures caused by user input result in a short error message followed by the usage instructions. Access-related parsing errors die with an error message.


Execution

uniq employs a small optimization: depending on the output behavior the user selects, it can take a simpler path that does less work per line. For brevity, only the more complex path through file checking is outlined here:

  • Open the input and output files
  • Initialize the line buffers
  • While there are still lines of input:
    • Check that there is still more input; otherwise leave the loop
    • Find the next field
    • Compare the lines and if they match, count the match
    • Add group or prepend delimiter
    • Output the lines if they don't match
    • Add end of line delimiter
  • Close the input files
  • Free the line buffers
  • Return successfully
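The core of the loop above, comparing each line with the previous one and counting matches, can be sketched as follows. This is a simplified Python sketch under stated assumptions, not the coreutils implementation: file handling, delimiters, and grouping output are left out, and the names uniq_groups, select, and key are hypothetical. The key parameter stands in for the field-skipping and comparison helpers.

```python
from typing import Callable, Iterable, Iterator, Tuple


def uniq_groups(lines: Iterable[str],
                key: Callable[[str], str] = lambda s: s
                ) -> Iterator[Tuple[int, str]]:
    """Yield (count, first_line) for each run of adjacent matching lines."""
    it = iter(lines)
    try:
        prev = next(it)
    except StopIteration:
        return  # empty input: nothing to output
    prev_key = key(prev)
    match_count = 0
    for line in it:
        this_key = key(line)
        if this_key == prev_key:
            match_count += 1  # same group: count the match, print nothing yet
            continue
        yield match_count + 1, prev  # group ended: emit its first line
        prev, prev_key, match_count = line, this_key, 0
    yield match_count + 1, prev  # flush the final group


def select(groups: Iterable[Tuple[int, str]],
           output_unique: bool = True,
           output_first_repeated: bool = True) -> Iterator[str]:
    """Filter groups the way the two output flags suggest."""
    for n, line in groups:
        if (n == 1 and output_unique) or (n > 1 and output_first_repeated):
            yield line
```

For example, list(uniq_groups(["a", "a", "b", "a"])) gives [(2, "a"), (1, "b"), (1, "a")]: only adjacent duplicates collapse, which is why the input is expected to be sorted.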

Failure cases:

  • Too many repeating lines
  • Unable to open or close I/O files
  • Unable to read from input source

All failures at this stage output an error message to standard error and return without displaying usage help.

