Decoded: join (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [No FreeBSD entry]

Logical flow of join command (coreutils)

Summary

join - join lines on a common field

[Source] [Code Walkthrough]

Lines of code: 1200
Principal syscall: write()
Support syscalls: open(), close(), fadvise()
Options: 17 (10 short, 7 long)

Descended from join introduced in Version 7 UNIX (1979)
Added to Textutils in November 1992 [First version]
Number of revisions: 211

The join utility works much like the JOIN operation in relational databases. This is more involved than the simple text parsing seen in other utilities, so the implementation relies on four custom structures to manage I/O.

Helpers:
  • add_field() - Adds a field to the outlist
  • add_field_list() - Adds to a field list
  • add_file_name() - Adds the file name to the input file list
  • advance_seq() - Adds another line to a sequence
  • check_order() - Verifies the ordering of lines in a file
  • decode_field_spec() - Decodes the -o option arguments
  • delseq() - Deallocates a sequence
  • extract_field() - Retrieves field data from a line
  • free_spareline() - Deallocates all spare lines
  • freeline() - Deallocates line resources
  • getseq() - Builds a sequence from a file line
  • get_line() - Parses a line from a file and returns success
  • init_linep() - Initializes a line structure
  • initseq() - Initializes a sequence to zero entries
  • join() - The top level join procedure
  • keycmp() - Compares two lines and returns a ternary result
  • prfield() - Prints a single field in a line
  • prfields() - Prints all fields in a line
  • prjoin() - Joins two input lines on a key and prints
  • reset_line() - Resets the number of fields in a line to zero
  • set_join_field() - Sets the join field value
  • string_to_join_field() - Converts a decimal string to a field number
  • SWAPLINES() - Function-like macro to perform a line swap
  • xfields() - Creates the fields structure from a line
External non-standard helpers:
  • die() - Exit with mandatory non-zero error and message to stderr
  • error() - Outputs error message to standard error with possible process termination
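
As an illustration of the ternary comparison that keycmp() is described as returning, here is a minimal sketch. It is not the coreutils implementation: struct key, memcasecmp_sketch() and keycmp_sketch() are hypothetical names, only ASCII case folding is handled, and locale-aware collation (see hard_LC_COLLATE) is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified key: a pointer into a line plus a length. */
struct key {
    const char *beg;
    size_t len;
};

/* Compare n bytes while ignoring ASCII letter case. */
static int memcasecmp_sketch(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ca = toupper((unsigned char) a[i]);
        int cb = toupper((unsigned char) b[i]);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    return 0;
}

/* Return -1, 0, or 1 when k1 sorts before, equal to, or after k2. */
static int keycmp_sketch(struct key k1, struct key k2, int ignore_case)
{
    assert(k1.len == 0 || k1.beg != NULL);
    assert(k2.len == 0 || k2.beg != NULL);

    size_t min = k1.len < k2.len ? k1.len : k2.len;
    int diff = 0;

    if (min > 0)
        diff = ignore_case ? memcasecmp_sketch(k1.beg, k2.beg, min)
                           : memcmp(k1.beg, k2.beg, min);
    if (diff != 0)
        return diff < 0 ? -1 : 1;

    /* Common prefix is equal: the shorter key sorts first. */
    return (k1.len > k2.len) - (k1.len < k2.len);
}
```

With ignore_case set, the keys "apple" and "APPLE" compare equal; without it they are ordered by byte value.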

Setup

join defines four custom structures needed to keep track of I/O:

  • struct field - Tracks a field in a line with a start pointer and a length
  • struct line - Holds a line in a buffer and tracks number of fields and pointer to each
  • struct outlist - A linked list of output specifications, each naming a file and a field
  • struct seq - A sequence of lines with the same join field
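
To make the relationship between these structures concrete, here is a simplified sketch of how they might be declared, together with a small field splitter in the spirit of xfields(). The member names, the plain char buffer, and split_fields() are assumptions made for illustration; the real definitions in the coreutils source may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A field: start pointer into the line buffer plus a length. */
struct field {
    const char *beg;
    size_t len;
};

/* A line: its text, the number of fields, and a pointer to each. */
struct line {
    char *buf;                 /* assumption: plain NUL-terminated text */
    size_t nfields;
    size_t nfields_allocated;
    struct field *fields;
};

/* One entry of the output list: which file and which field to print. */
struct outlist {
    int file;
    size_t field;
    struct outlist *next;
};

/* A sequence of lines that share the same join field. */
struct seq {
    size_t count;
    size_t alloc;
    struct line **lines;
};

/* Hypothetical helper: split l->buf on a single delimiter character. */
static void split_fields(struct line *l, char delim)
{
    assert(l != NULL && l->buf != NULL && delim != '\0');
    l->nfields = 0;
    const char *p = l->buf;
    for (;;) {
        const char *end = strchr(p, delim);
        size_t len = end ? (size_t) (end - p) : strlen(p);
        if (l->nfields == l->nfields_allocated) {
            /* Grow the field array geometrically. */
            size_t n = l->nfields_allocated ? 2 * l->nfields_allocated : 4;
            struct field *tmp = realloc(l->fields, n * sizeof *tmp);
            if (tmp == NULL)
                abort();
            l->fields = tmp;
            l->nfields_allocated = n;
        }
        l->fields[l->nfields].beg = p;
        l->fields[l->nfields].len = len;
        l->nfields++;
        if (end == NULL)
            break;
        p = end + 1;
    }
}
```

Because each field only points into the line buffer, splitting a line copies no text; the caller frees the fields array once the line is no longer needed.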

There are also a few important globals that manage execution:

  • autocount_1 - The number of fields for file 1 during autoformatting
  • autocount_2 - The number of fields for file 2 during autoformatting
  • autoformat - Flag to infer the output format from the first line of input files
  • *empty_filler - The string to print in place of empty fields
  • eolchar - The line delimiter, default \n
  • g_names[] - The real names of file1 and file2
  • hard_LC_COLLATE - Flag set if the LC_COLLATE locale is "hard" (not the plain C/POSIX locale)
  • ignore_case - Flag to ignore letter casing on join fields (-i)
  • issue_disorder_warning[] - Per-file flag tracking the out-of-order warning for that file
  • join_field_1 - The field number to join on in file 1
  • join_field_2 - The field number to join on in file 2
  • join_header_lines - Flag to treat the first line of each file as a header
  • line_no[] - The number of lines read from file1 and file2
  • outlist_end - The end of the outlist
  • *outlist_head - The beginning of the outlist
  • *prevline[] - The previous line read from file1 and file2
  • print_pairables - Flag to print lines that are matched (cleared by -v)
  • print_unpairables_1 - Flag to print unpairable lines from file 1
  • print_unpairables_2 - Flag to print unpairable lines from file 2
  • seen_unpairable - Flag set if we've processed a line without a match
  • *spareline[] - An additional buffer for a line from file1 and file2, if needed
  • tab - The character used for the field delimiter
  • uni_blank - A blank line structure that stands in for the missing partner when printing unpaired lines

main() introduces a few local variables:

  • i - Integer iterator for file number
  • joption_count[] - Counts of join field options seen for each file
  • fp1 - The file stream for file1
  • fp2 - The file stream for file2
  • nfiles - The number of file arguments provided
  • operand_status[] - Tracks the type of operand (file, join arg, etc)
  • optc - The character for the next option to process
  • optc_status - The type of the operand we're processing
  • prev_optc_status - The type of the previous operand

Parsing

Parsing sets the execution parameters for join. The options answer the following questions:

  • Should the input ordering be checked?
  • Which field from which file should be the join key?
  • Are the keys case sensitive?
  • Is there a header line?
  • Should the line entries be NUL terminated?

The join field is initialized during parsing (or just after, in one case) via set_join_field().
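
The conversion and assignment steps could look roughly like the sketch below. It assumes that the user supplies 1-based field numbers which are stored as 0-based indexes, and that a field set twice with different values is treated as a conflict; both behaviors, and the names parse_join_field(), set_join_field_sketch() and FIELD_UNSET, are assumptions of this example rather than facts taken from the source.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical marker for a join field that has not been chosen yet. */
#define FIELD_UNSET SIZE_MAX

/* Parse a 1-based decimal field number into a 0-based index.
   Returns 0 on success and -1 on invalid input. */
static int parse_join_field(const char *s, size_t *out)
{
    assert(s != NULL && out != NULL);
    if (*s < '0' || *s > '9')
        return -1;              /* rejects empty strings, signs, spaces */

    char *end;
    errno = 0;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno != 0 || *end != '\0' || v == 0 || v > SIZE_MAX)
        return -1;              /* overflow, trailing junk, or field 0 */

    *out = (size_t) v - 1;
    return 0;
}

/* Store val in *var unless a different value was stored earlier. */
static int set_join_field_sketch(size_t *var, size_t val)
{
    assert(var != NULL);
    if (*var != FIELD_UNSET && *var != val)
        return -1;              /* conflicting join fields */
    *var = val;
    return 0;
}
```

A caller would start each join field at FIELD_UNSET, run every parsed option value through set_join_field_sketch(), and fall back to the first field if nothing was set.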

Parsing failures

These failure cases are explicitly checked:

  • User provides a nonsensical field number
  • User gives an invalid tab
  • Trying to use STDIN for both input files
  • Missing source files
  • User inputs invalid field specifiers
  • Unknown option used

User-specified parsing failures result in a short error message followed by the usage instructions. Access-related parsing errors die with an error message.


Execution

The join utility is fairly simple to understand despite the number and depth of the support functions. The high-level operation goes like this:

  • Open the two input sources, one of which could be STDIN
  • Output the header if requested
  • Initialize sequence buffers to hold matching lines
  • Read lines from files and compare the keys for match
  • If the keys match:
    • Read all matching lines from file 1 and add to the sequence
    • Read all matching lines from file 2 and add to the sequence
    • Print the resulting output sequence
  • Verify the file ordering if requested
  • If any lines were unpairable, output those lines
  • Clean up all data structures
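
The matching steps above amount to a merge join over two sorted inputs. The sketch below shows the core loop on in-memory arrays instead of file streams. It is a simplification under stated assumptions: struct row and merge_join() are hypothetical, keys are compared with strcmp(), output goes to a caller-supplied buffer, and unpairable lines are skipped instead of being collected for later output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory row: a join key and the rest of the line. */
struct row {
    const char *key;
    const char *rest;
};

/* Join two arrays sorted by key, writing "key restA restB" lines into
   out (capacity cap). Returns the number of joined lines written. */
static size_t merge_join(const struct row *a, size_t na,
                         const struct row *b, size_t nb,
                         char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    size_t i = 0, j = 0, pos = 0, emitted = 0;
    out[0] = '\0';

    while (i < na && j < nb) {
        int cmp = strcmp(a[i].key, b[j].key);
        if (cmp < 0) { i++; continue; }   /* unpairable row in a */
        if (cmp > 0) { j++; continue; }   /* unpairable row in b */

        /* Keys match: find the run of equal keys on each side. */
        size_t i_end = i, j_end = j;
        while (i_end < na && strcmp(a[i_end].key, a[i].key) == 0)
            i_end++;
        while (j_end < nb && strcmp(b[j_end].key, b[j].key) == 0)
            j_end++;

        /* Print every pairing of the two runs. */
        for (size_t x = i; x < i_end; x++) {
            for (size_t y = j; y < j_end; y++) {
                int w = snprintf(out + pos, cap - pos, "%s %s %s\n",
                                 a[x].key, a[x].rest, b[y].rest);
                if (w < 0 || (size_t) w >= cap - pos)
                    return emitted;       /* buffer full: stop early */
                pos += (size_t) w;
                emitted++;
            }
        }
        i = i_end;
        j = j_end;
    }
    return emitted;
}
```

Each input is scanned once, which is why the approach depends on both inputs being sorted on the join key. A run of m matching rows on one side and n on the other yields m × n output lines.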

Failure cases:

  • Unable to read from input file
  • Bad source file number provided
  • Invalid join field
  • Files not properly ordered

All failures at this stage output an error message to STDERR and return without displaying usage help.

