Decoded: join (coreutils) – MaiZure's Projects

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [No FreeBSD entry]

Logical flow of join command (coreutils)

Summary

join - join lines on a common field

[Source] [Code Walkthrough]

Lines of code: 1200
Principal syscall: write()
Support syscalls: open(), close(), fadvise()
Options: 17 (10 short, 7 long)

Descended from join introduced in Version 7 UNIX (1979)
Added to Textutils in November 1992 [First version]
Number of revisions: 211

The join utility works much like the JOIN operation in relational databases. This is more involved than the simple text parsing seen in other utilities, so the implementation relies on four custom structures to manage I/O.

Helpers:

add_field() - Adds a field to the outlist
add_field_list() - Adds to a field list
add_file_name() - Adds the file name to the input file list
advance_seq() - Adds another line to a sequence
check_order() - Verifies the ordering of lines in a file
decode_field_spec() - Decodes the -o option arguments
delseq() - Deallocates a sequence
extract_field() - Retrieves field data from a line
free_spareline() - Deallocates all spare lines
freeline() - Deallocates line resources
getseq() - Builds a sequence from a file line
get_line() - Parses a line from a file and returns success
init_linep() - Initializes a line structure
initseq() - Initializes a sequence to zero entries
join() - The top level join procedure
keycmp() - Compares two lines and returns a ternary result
prfield() - Print a single field in a line
prfields() - Print all fields in a line
prjoin() - Joins two input lines on a key and prints
reset_line() - Resets the number of fields in a line to zero
set_join_field() - Sets the join field value
string_to_join_field() - Converts a decimal string to represent a field value
SWAPLINES() - Function-like macro to perform a line swap
xfields() - Creates the fields structure from a line
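
To make the field helpers more concrete, here is a minimal sketch of splitting a line into (start, length) fields on runs of blanks, in the spirit of xfields() and extract_field(). The names split_fields and struct field_ref, and the fixed-size output array, are assumptions for illustration; the real helpers grow their field storage as needed and also honor a user-chosen delimiter.

```c
#include <stddef.h>

/* Hypothetical stand-in for a field: start pointer plus length. */
struct field_ref {
    const char *beg;
    size_t len;
};

/* Split the first n bytes of line on runs of spaces/tabs.
   Stores at most max fields in out and returns how many were stored. */
size_t split_fields(const char *line, size_t n,
                    struct field_ref *out, size_t max)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n && count < max) {
        /* Skip any blanks before the next field. */
        while (i < n && (line[i] == ' ' || line[i] == '\t'))
            i++;
        if (i == n)
            break;

        /* Scan to the end of the field. */
        size_t start = i;
        while (i < n && line[i] != ' ' && line[i] != '\t')
            i++;

        out[count].beg = line + start;
        out[count].len = i - start;
        count++;
    }
    return count;
}
```

Because each field is only a pointer and a length into the original buffer, no field text is copied.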

External non-standard helpers:

die() - Exit with mandatory non-zero error and message to stderr
error() - Outputs error message to standard error with possible process termination

Setup

join defines four custom structures needed to keep track of I/O:

struct field - Tracks a field in a line with a start pointer and a length
struct line - Holds a line in a buffer and tracks number of fields and pointer to each
struct outlist - A list of output fields, each specifying a file and a field
struct seq - A sequence of lines with the same join field
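
As a rough illustration of how these four structures might fit together, here is a simplified sketch. The member names and the plain string buffer are assumptions for illustration and may differ from the definitions in join.c; seq_field_bytes() is a hypothetical helper added only to show the nesting.

```c
#include <stddef.h>

/* A field: a start pointer into the line buffer plus a length. */
struct field {
    const char *beg;
    size_t len;
};

/* A line: the raw text plus the fields located inside it. */
struct line {
    const char *buf;        /* assumption: a plain string, not a line buffer type */
    size_t nfields;
    struct field *fields;
};

/* One entry of the output list: which file and which field to print. */
struct outlist {
    int file;
    size_t field;
    struct outlist *next;
};

/* A sequence: an array of lines that share the same join field. */
struct seq {
    size_t count;           /* lines currently in use */
    size_t alloc;           /* lines allocated */
    struct line **lines;
};

/* Hypothetical helper: total bytes of field data held by a sequence. */
size_t seq_field_bytes(const struct seq *s)
{
    size_t total = 0;
    for (size_t i = 0; i < s->count; i++)
        for (size_t j = 0; j < s->lines[i]->nfields; j++)
            total += s->lines[i]->fields[j].len;
    return total;
}
```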

There are also a few important globals that manage execution:

autocount_1 - The number of fields for file 1 during autoformatting
autocount_2 - The number of fields for file 2 during autoformatting
autoformat - Flag to infer the output format from the first line of input files
*empty_filler - The string to print in place of empty fields
eolchar - The line delimiter, default \n
g_names[] - The real names of file1 and file2
hard_LC_COLLATE - Flag set if the LC_COLLATE locale is "hard" (not the plain C/POSIX locale)
ignore_case - Flag to ignore letter casing on join fields (-i)
issue_disorder_warning[] - Flag for each file that has set a warning
join_field_1 - The field number to join on in file 1
join_field_2 - The field number to join on in file 2
join_header_lines - Flag to use the first line of a file for the header
line_no[] - The number of lines read from file1 and file2
outlist_end - The end of the outlist
*outlist_head - The beginning of the outlist
*prevline[] - The previous line read from file1 and file2
print_pairables - Flag to print lines that are matched (cleared by -v)
print_unpairables_1 - Flag to print unpairable lines from file 1
print_unpairables_2 - Flag to print unpairable lines from file 2
seen_unpairable - Flag set if we've processed a line without a match
*spareline[] - An additional buffer for a line from file1 and file2, if needed
tab - The character used for the field delimiter
uni_blank - A blank line structure used when printing unpairable lines
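
Several of these globals feed directly into key comparison. The sketch below shows how a ternary comparison in the style of keycmp() might take an ignore_case flag into account. It is a simplified assumption: key_compare is a hypothetical name, the flag is passed as a parameter rather than read from a global, and it compares bytes directly instead of using the locale-aware collation that a hard LC_COLLATE locale would call for.

```c
#include <ctype.h>
#include <stddef.h>

/* Compare two join fields given as (pointer, length) pairs.
   Returns -1, 0, or 1. If ignore_case is nonzero, letters are
   folded to upper case before comparing. */
int key_compare(const char *a, size_t alen,
                const char *b, size_t blen, int ignore_case)
{
    size_t n = alen < blen ? alen : blen;

    for (size_t i = 0; i < n; i++) {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];
        if (ignore_case) {
            ca = (unsigned char)toupper(ca);
            cb = (unsigned char)toupper(cb);
        }
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }

    /* Common prefix is equal: the shorter field sorts first. */
    if (alen == blen)
        return 0;
    return alen < blen ? -1 : 1;
}
```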

main() introduces a few local variables:

i - Integer iterator for file number
joption_count[] - Counts of the -j1 and -j2 options seen
fp1 - The file stream for file1
fp2 - The file stream for file2
nfiles - The number of file arguments provided
operand_status[] - Tracks the type of operand (file, join arg, etc)
optc - The character for the next option to process
optc_status - The type of the operand we're processing
prev_optc_status - The type of the previous operand

Parsing

Parsing sets the possible execution parameters for join. The user answers the following questions:

Should the ordering of the files be checked?
Which field from which file should be the join key?
Are the keys case sensitive?
Is there a header line?
Should the line entries be NUL terminated?

The join field is initialized during parsing (or just after, in one case) via set_join_field().
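
A minimal sketch of the conversion that precedes this step, in the spirit of string_to_join_field(). It assumes the user supplies a 1-based decimal field number that is stored as a 0-based index. The name parse_join_field and the return-code convention are hypothetical; the real helper reports bad input through the error path rather than a return value.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Convert a 1-based decimal field number into a 0-based index.
   Returns 0 and stores the index in *out on success, -1 on bad input. */
int parse_join_field(const char *str, size_t *out)
{
    char *end;
    unsigned long val;

    /* Reject empty strings, signs, and leading whitespace up front,
       since strtoul() would otherwise accept them. */
    if (str[0] < '0' || str[0] > '9')
        return -1;

    errno = 0;
    val = strtoul(str, &end, 10);
    if (*end != '\0' || errno == ERANGE || val == 0)
        return -1;

    *out = (size_t)(val - 1);
    return 0;
}
```

Rejecting zero here matches the first parsing failure listed below: a field number that makes no sense.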

Parsing failures

These failure cases are explicitly checked:

User provides a nonsensical field number
User gives an invalid tab
Trying to use STDIN for both input files
Missing source files
User inputs invalid field specifiers
Unknown option used

User-specified parsing failures result in a short error message followed by the usage instructions. Access-related parsing errors die with an error message.

Execution

The join utility is fairly simple to understand despite the number and depth of the support functions. The high-level operation goes like this:

Open the two input sources, one of which could be STDIN
Output the header if requested
Initialize sequence buffers to hold matching lines
Read lines from files and compare the keys for match
If the keys match:
- Read all matching lines from file 1 and add to the sequence
- Read all matching lines from file 2 and add to the sequence
- Print the resulting output sequence
Verify the file ordering if requested
If any lines were unpairable, output those lines.
Clean up all data structures
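
The matching steps above amount to a merge join over two sorted inputs. The sketch below is a simplified illustration under stated assumptions: each input is an in-memory array of already-sorted key strings, keys are compared with strcmp(), and the function only counts the joined lines rather than printing them. The name count_joined_pairs is hypothetical; the real code reads lines lazily and buffers each run of equal keys in a sequence.

```c
#include <stddef.h>
#include <string.h>

/* Count the output lines a merge join would produce for two
   sorted arrays of keys. */
size_t count_joined_pairs(const char *const *a, size_t na,
                          const char *const *b, size_t nb)
{
    size_t i = 0, j = 0, pairs = 0;

    while (i < na && j < nb) {
        int cmp = strcmp(a[i], b[j]);
        if (cmp < 0) {
            i++;                /* a[i] has no partner: unpairable */
        } else if (cmp > 0) {
            j++;                /* b[j] has no partner: unpairable */
        } else {
            /* Find the run of equal keys on each side. */
            size_t i_end = i + 1, j_end = j + 1;
            while (i_end < na && strcmp(a[i_end], a[i]) == 0)
                i_end++;
            while (j_end < nb && strcmp(b[j_end], b[j]) == 0)
                j_end++;

            /* Every line in one run pairs with every line in the other. */
            pairs += (i_end - i) * (j_end - j);
            i = i_end;
            j = j_end;
        }
    }
    return pairs;
}
```

With keys a, b, b, d in the first input and b, b, b, c, d in the second, the two runs of b yield 2 x 3 = 6 lines and d yields one more, for 7 in total.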

Failure cases:

Unable to read from input file
Bad source file number provided
Invalid join field
Files not properly ordered

All failures at this stage output an error message to STDERR and return without displaying usage help.
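
The ordering failure can be detected by comparing each line's key with the key of the line before it, which is roughly the role check_order() plays with prevline[]. A minimal sketch, assuming the keys are already extracted into an array and compared with strcmp() instead of a locale-aware comparison; first_disorder is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Return the 1-based line number of the first key that sorts before
   its predecessor, or 0 if the keys are in non-decreasing order. */
size_t first_disorder(const char *const *keys, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (strcmp(keys[i - 1], keys[i]) > 0)
            return i + 1;
    }
    return 0;
}
```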
