Decoded: sort (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of sort command (coreutils)

Summary

sort - sort, merge, or sequence check text files

[Source] [Code Walkthrough]

Lines of code: 4780
Principal syscalls: None
Support syscall: dup2(), fork()
Options: 54 (24 short, 30 long)

Descended from sort in Version 2 UNIX (1972)
Added to Textutils in August 1998 [First version]
Number of revisions: 456

The sort utility is unique among the coreutils in that some tasks are multi-threaded. It implements merge sort through two custom data structures (binary tree and a priority queue [heap]) along with the gnulib hash table. Since sorting is a fundamental problem in computer science, it's no surprise that this utility has an 'academic' feel.


Setup

main() initializes a few variables used througout the utility:

  • *key - A generic key pointer (usually to key_buf)
  • key_buf - Buffer for a key during processing
  • gkey - A 'global' default key
  • gkey_only - Only use the default key
  • *s - A string used as a key suffix (used several ways)
  • c - The first character of the option to process
  • checkonly - Flag for check sorting only (two modes)
  • mergeonly - Flag for only merging input files
  • random_source - Random source file name
  • need_random - Flag for random source usage
  • nthreads - Number of threads to use for sorting
  • nfiles - Number of input files to process
  • posixly_correct - Flag for if the environment enforces POSIX
  • posix_ver - Holds the posix version
  • traditional_usage - Flag the force non-POSIX key styles
  • **files - The pointer to the list of files
  • *files_from - Name of a file containg the list of input files
  • tok - Holds the token struct for an input file
  • *outfile - The output file for sort
  • locale_ok - Flag set if i18n was properly configured

Then sort goes through a few preliminary steps:

  • The usual i18n setup
  • Initialize character lookup tables
  • Customize signal handlers
  • Initialize global key
  • Prepare input file arguments

Character tables
sort initializes four tables indexed by character value and used for handling decisions: A boolean table identifying blank characters, to include the newline. A boolean table of nonprintable characters, a table of non-dictionary characters, and a translation table to upper case

Signal handlers
Sorting produces temporary files that need to be cleaned if we receive a termination signal. We define a custom sighandler that removes the files and apply it to a subset of signals, caught_signals. During critical section execution, child threads block these signals since temp files are in use.


Parsing

Parsing command line options answers the following questions:

  • Which mode are we using, sorting, merging, or neither? (Neither implies sort check)
  • For sorting, which key fields are we using?
  • What is the sort basis? (numbers, letters, descending, etc)
  • Where is input coming from? (stdin, file arguments, a list file)

The most complex task in parsing is setting the key information from user input. It's further complicated by the historical differences between BSD and POSIX.

Parsing failures

These failure cases are explicitly checked:

  • Invalid key position values
  • Unknown character in field specification
  • Request both check-only variants (quiet and diagnose)
  • Multiple output files
  • Multiple random sources
  • Missing or unusable tab characters
  • Combining input file list and command line files
  • Providing extra operands for check-only mode

Extra Comments


Execution

Sorting executes through three distinct modes: checking a sort, merging input files (assumed to be sorted), or performing a full sort and merge.

The full sort is the most complex path through the utility and involves specific tools including a binary merge trees, child threads, priority queues, and temporary files.

Data structures

Sorting uses a binary merge tree and a priority queue which work together to process individual files then tie them together.

Sorting

The full sort() pass goes through the following tasks

  • Find the maximum number of possible independent threads
  • For each input file...
    • Buffer all file lines
    • Open a temporary file for the upcoming sort
    • Construct a merge tree to divide up the work among threads
      • merge_tree_init()
    • Fork child threads for each tree layer. Leaf threads sort and queue work.
      • sortlines()
    • Merge the results up the tree. Root node outputs to temp file
      • merge_loop() and mergelines_node()
    • Close the temporary file

Now we have temporary files that contain sorted data ready for merger

Merging (after sorting)

Putting the temporary files together in to a single output file can go several ways. We'll look at both the simple and the complicated

Easy merge
The easy case happens when there are few (or one) temporary files.

  • Open the temp files as possible in a single pass
    • open_input_files()
  • Open the output target
  • Merge the files directly to the output
    • mergefps()

Complicated merge
We need to do more work if there are more temporary files than we can reasonably manage in a single pass

  • Prepare a temporary output file
  • Open as many temp files as possible in a single pass
    • mergefiles()
  • Merge the files in to a single temporary file for that pass
    • mergefps()
  • Repeat passes and coalesce files for subsequent passes
  • Attempt to merge temp files to the final output
  • Close file input descriptors and mergefps() until all remaining are processed

Merging Only

The difference with merge-only is that there are no temporary files to process. The user has provided a list of actual files to sort. The merge procedure is effectively the same.

--

There are several ways that file processing could fail:

Failure cases:

  • Unable to open any input or output files
  • Unusable file names
  • Failure to localize
  • Unable to close stdin
  • Any read or write failures from I/O
  • Output file truncation or mode error
  • Unable to delete temporary files
  • Attempting to merge too many files at once
  • Nonsensical number of threads requested

Helpers

There are enough helper functions to separate them by purpose:

Threading helpers:
  • cs_enter() - Blocks signals may that interrupt execution
  • cs_leave() - Restores signal handlers to nornal
  • delete_proc() - Removes a temp process from the process has table
  • reap() - Wait for threads to exit then clean
  • reap_all() - Remove and clean all child threads
  • reap_exited() - Reap all exited child threads
  • reap_some() - Remove at least one child thread
  • register_proc() - Adds a temp process to the proccess hash table
  • wait_proc() - Wait for and reap a child thread
Sorting helpers:
  • check_ordering_compatibility() - Verifies key consistency
  • debug_key() - Outputs a line and highlights the key
  • debug_key_compare() - Verifies key sortability
  • debug_line() - Checks all keys against a line
  • default_sort_size() - Computes the default sorting size
  • insertkey() - Add a new sorting key
  • key_init() - Initializes (zeroes) a keyfield
  • key_numeric() - Returns true if the key is a number
  • key_to_opts() - Converts a key to short options
  • key_warnings() - Verifies and complains about given keys
  • keycompare() - Compares two lines using given keys
  • limfield() - Finds the end of a key field (character after end)
  • mark_key() - Underlines a key in a line
  • parse_field_count() - Parses and bounds keys
  • sequential_sort() - Recursive compare and swap of line array
  • set_ordering() - Applies the given options to a key
  • sort() - Starts the sort procedure
  • sort_buffer_size() - Gets the sort buffer size
  • sort_die() - Exits sorting with SORT_FAILURE
  • sortlines() - Threaded-recursive sorting procedure
  • sortlines_thread() - Sort procedure for child threads
Merging helpers:
  • avoid_trashing_input() - Ensures no clashing I/O files names during merge
  • merge() - Starts the merge procedure
  • merge_loop() - Merges lines from the queue
  • mergefiles() - Verifies files and calls mergefps to merge files
  • mergelines() - Compares and merges two lines appropriately
  • mergelines_node() - Meges nodes from the merge tree
  • mergefps() - Merges input files to another output fd
Binary Tree Helpers:
  • compare_nodes() - Compares two nodes based on priority (level)
  • init_node() - Initialize a merge tree node
  • lock_node() - Lock the node to limit access
  • merge_tree_destroy() - Free a merge tree
  • merge_tree_init() - Create a new merge tree of given size
  • unlock_node() - Unlock the node for general access
Priority Queue Helpers:
  • queue_check_insert() - Check node before queuing
  • queue_check_insert_parent() - Check and insert a node's parent
  • queue_destroy() - Free the queue
  • queue_init() - Allocate space for a new queue of max nodes
  • queue_insert() - Add a new node to the queue and re-heap
  • queue_pop() - Pop the top priority node from the pq
Other Helpers:
  • add_temp_dir() - Adds a directory for temporary files
  • async_safe_die() - Outputs thread errors safely
  • badfieldspec() - Exits from a bad field specification
  • begfield() - Returns a pointer to the start of a field
  • buffer_linelim() - Return a ponter just beyond the line
  • check() - Verify file line ordering
  • check_inputs() - Verify input file availability
  • check_output() - Verify output file availability
  • cleanup() - Remove any temporary files that may remain
  • compare() - Compare two lines
  • compare_random() - Sorts based on key hash (~random order)
  • create_temp() - Wrapper to create a temp file or fail
  • create_temp_file() - Creates a temporary file
  • debug_width() - Computes the printed width of a memory region (compress tabs)
  • exit_cleanup() - High-level exit routine
  • fillbuf() - Fills a buffer from a stream
  • find_unit_order() - Returns the order of magnitude of a number
  • general_numcompare() - Generalized number-as-string comparator
  • getmonth() - Returns an integer month from an input string
  • human_numcompare() - Compares numbers with unit labels
  • incompatible_options() - Exits with incompatibilty error
  • initbuf() - Allocates and aligns buffers for lines
  • inittables() - Initializes character tables
  • maybe_create_temp() - Creates a temporary file if possible
  • move_fd() - Duplicates a file descriptor (dup2 wrapper)
  • nan_compare() - Compares NaN values via memcmp
  • numcompare() - Compares numbers-as-strings and returns result
  • open_input_files() - Opens and prepares multiple input files for use
  • open_temp() - Create a file descriptor for temporary files
  • pipe_fork() - Create a new process and redirect I/O for compression
  • proctab_comparator() - Compare function for the process table
  • proctab_hasher() - Hash function for the process table
  • random_md5_state_init() - Initialize a random vector
  • sighandler() - Generic defualt signal handler
  • specify_nmerge() - Sets the number of concurrent input files
  • specify_nthreads() - Sets the number of concurrent threads
  • specify_sort_size() - Sets the memory limit while sorting
  • stream_open() - Creates a FILE stream for a file. Null is STDOUT
  • struct_month_cmp() - Month comparison function
  • traverse_raw_number() - Walks a number and returns the largest digit
  • write_line() - Writes a line to a temporary file
  • write_unique() - Writes unique lines (-u), otherwise normal write
  • xfclose() - Closes a FILE stream and may fail.
  • xfopen() - Creates a FILE stream. Never null, may fail.
  • xstrxfrm() - Transforms a string while checking erros
  • zaptemp() - Removes temporary file from the list

[Back to Project Main Page]