Decoded: GNU coreutils

October 2018
  updated: September 2019

coreutils brought to you by the GNU project

This is a long-term project to decode all of the GNU coreutils in version 8.3.

This resource is for novice programmers exploring the design of command-line utilities. It is best used as an accompaniment providing useful background while reading the source code of the utility you may be interested in. This is not a user guide -- please see the applicable man pages for instructions on using these utilities.

Status: Complete!

  • Phase 1 [complete] - Each utility has a dedicated page discussing the namespace and execution overview.
  • Phase 2 [complete] - Expanded discussion about important design decisions and algorithms. Tracing utility lineage both from UNIX and early Coreutils. Porting content to something more collaborative. Enhancing source walkthrough to something more useful. Creating a source code evolution visualizer.
  • Phase indefinite - Line by line code walkthrough for each utility will be accomplished over a long period. GitHub repo available to gather line-by-line notes. This segment was deferred due to consistent feedback that readers were more interested in high-level discussion.

The GNU Core Utilities

I'll link the utility pages here at the top. Click the command name for the detailed page decoding that utility. The discussion, source code, and walkthroughs are available on each page. Bolded utilities have been expanded as part of phase 2. Enjoy!

Helpful background for code reading

The GNU coreutils has its foibles. Many of these utilities are approaching 30 years old and include revisions by many people over the years. Here are some things to keep in mind when reading the code:
  • Tiny programs - These utilities are small, (mostly) single-source file programs designed to do one thing and do it well. They are not designed for long life or to scale beyond their role. Consequently, we see designs often considered 'bad practice' such as:
    • Many globals
    • Liberal use of macros
    • goto statements
    • Long functions with nested switches/loops
  • Know POSIX - Start with the Utility Syntax Guidelines. In general, POSIX supports interoperability by defining appropriate inputs and outputs, but leaves the 'work' to the implementation. While the GNU coreutils may not strictly conform to POSIX, many ideas are entrenched: permission bits, uids/gids, environment variables, exit status, and about 3718 pages of more trivia.
  • Outside help - Portability is a complex problem and coreutils relies on extra help from a related project: gnulib. Almost every utility includes functions from gnulib, which provides portable solutions to common problems that recur across many systems - no need to reinvent the wheel.
  • Launched from a shell - The core utilities expect support from a shell such as bash, zsh, ksh, and others. The shell forks/clones into the utility, passes the arguments, sets up the environment, redirects I/O via pipes, and retains exit values.
  • Three families - GNU coreutils were originally three distinct packages for shell, text, and file utilities. Utilities within the same type share many of the same design patterns.

Basic design

Most CLI utilities look something close to this:

[Figure: General CLI procedure]

The key ideas:

  • A setup phase for flags, options, localization, etc
  • An argument parsing phase that reads input to set execution parameters
  • A processing/execution phase that prepares input for one or more syscalls
  • Many opportunities to check constraints and fail out of execution
    • Distinct exit statuses hint at where the problem occurred
    • EXIT_FAILURE is general and commonly used
  • Providing feedback after failed execution
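
The phases above can be sketched in a few lines of C. This is a toy of my own, not code from coreutils: toy_main() and its single-operand rule are hypothetical, and a real utility would do far more in each phase.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy utility: prints the length of exactly one operand.
   Written as toy_main() so that a real main() can simply return its result.  */
int
toy_main (int argc, char **argv)
{
  /* Setup phase: a real utility would set the locale and exit hooks here.  */
  const char *program_name = argc > 0 ? argv[0] : "toy";

  /* Parsing phase: check constraints early and fail out with feedback.  */
  if (argc != 2)
    {
      fprintf (stderr, "%s: expected exactly one operand\n", program_name);
      fprintf (stderr, "Try '%s --help' for more information.\n",
               program_name);
      return EXIT_FAILURE;
    }

  /* Execution phase: do the work and check for failure once more.  */
  if (printf ("%zu\n", strlen (argv[1])) < 0)
    return EXIT_FAILURE;

  return EXIT_SUCCESS;
}
```

A main() that does nothing but return toy_main (argc, argv) turns this into a complete program. Note how every phase has its own way out, and how the general EXIT_FAILURE covers both failure paths.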

This is the framework I'll use to organize the decoding of each utility. We'll see that each has a unique variant of this idea, ranging from a few lines to thousands of lines. I'd categorize the variants in three groups: trivial, wrappers, and full utilities.

Trivial utilities
Trivial utilities have a unique setup phase which defines a macro in a couple of lines. Then it 'includes' the source of another utility in which the macro forces a specific flow control. Examples include: arch, dir, and vdir.
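
Here is a minimal sketch of the macro-plus-include trick. The file name base.c and the function base_status() are hypothetical names of my own; only the pattern is the point.

```c
#include <stdlib.h>

/* base.c: hypothetical stand-in for the fuller utility.
   A wrapper may define EXIT_STATUS before including this file;
   otherwise the default below applies.  */
#ifndef EXIT_STATUS
# define EXIT_STATUS EXIT_SUCCESS
#endif

/* The macro decides the outcome, so the same source serves two programs.  */
int
base_status (void)
{
  return EXIT_STATUS;
}

/* A trivial sibling utility would then be just two lines:

     #define EXIT_STATUS EXIT_FAILURE
     #include "base.c"
*/
```

Compiled on its own, the file behaves one way; compiled through the two-line wrapper, the macro flips the result without touching the shared source.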

Wrapper utilities
Wrappers perform setup and parse command line options which are passed directly as arguments to a syscall. The result of the syscall is the result of the utility. These utilities do little processing on their own. Examples include: link, whoami, hostid, logname, and more.
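
As a rough sketch of the wrapper shape, here is a function that hands two operands straight to the link() syscall wrapper from libc. The name link_sketch() and the message text are my own assumptions, not the actual source of the link utility.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical wrapper: pass the two operands directly to link(2).
   The outcome of the syscall is the outcome of the utility.  */
int
link_sketch (const char *from, const char *to)
{
  if (link (from, to) != 0)
    {
      /* Feedback after failed execution, then a general failure status.  */
      fprintf (stderr, "cannot create link '%s' to '%s': %s\n",
               to, from, strerror (errno));
      return EXIT_FAILURE;
    }
  return EXIT_SUCCESS;
}
```

Almost nothing happens between argument parsing and the syscall, which is what makes wrappers a gentle starting point for code reading.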

Full utilities
The diagram above shows the design for full utilities: a setup phase, an option/argument parsing phase, and execution. Execution means processing input data, and it may invoke many syscalls along the way to handle more data until complete. Most utilities fall into this category.


Digging deeper

Let's go through the most common ideas shared across many of the utilities. Knowing these concepts beforehand should speed up code reading.

Utility Initialization

All utilities have a short initialization procedure near the beginning of main():

  initialize_main (&argc, &argv);
  set_program_name (argv[0]);
  setlocale (LC_ALL, "");
  bindtextdomain (PACKAGE, LOCALEDIR);
  textdomain (PACKAGE);

  atexit (close_stdout);

This preamble solves a few administrative issues, the most important of which are internationalization and assigning the exit action. I'll go through each of these lines below. These lines don't impact the specific action of a utility.
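
For readers without gnulib at hand, the spirit of the preamble can be approximated with plain libc. This is only a sketch: close_stdout_sketch() and init_sketch() are hypothetical names, the gettext calls are left out, and the real close_stdout does more careful error reporting.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Simplified stand-in for close_stdout (an assumption, not gnulib's code):
   flush and close stdout at exit, and fail loudly if buffered output
   could not be written.  */
static void
close_stdout_sketch (void)
{
  if (fclose (stdout) != 0)
    {
      fputs ("write error\n", stderr);
      _exit (EXIT_FAILURE);
    }
}

/* Administrative setup only; nothing here is specific to one utility.
   Returns 0 when the exit hook was registered.  */
int
init_sketch (void)
{
  setlocale (LC_ALL, "");               /* honor the user's locale */
  return atexit (close_stdout_sketch);  /* assign the exit action */
}
```

The exit hook matters because a late write error on a buffered stream would otherwise go unnoticed and the utility would still report success.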

Parsing with Getopt

Ever wonder why command line utilities have had the same look and feel for the past 40 years? You can thank the Getopt toolset. The bare minimum you need to know to follow the coreutils is:

  • Command line options can be 'short' or 'long', prefixed with (-) and (--) respectively. Short options are defined as a string while long options use an array of struct option entries.
  • In the short option string, 1) a letter alone means the option takes no argument, 2) a single colon (:) after the letter means a mandatory argument, and 3) a double colon (::) means an optional argument. For example, the short option string for kill is: Lln:s:t. This says that L, l, and t take no arguments but n and s need an argument.
  • Long options often have a short analogue
  • The getopt_long() function returns the next option and is used in all utilities
  • The optind index is a position within the argv[] array for the next argument.
  • The optarg char pointer points to the value of the option's argument.
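
The points above fit into one small parsing loop. The options here (-n/--count and -v/--verbose), the struct demo_opts, and the function parse_demo_options() are all made up for illustration; only getopt_long(), optind, and optarg come from the Getopt toolset itself.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical execution parameters set by the parsing phase.  */
struct demo_opts
{
  bool verbose;
  long count;
};

/* 'n' takes a mandatory argument (single colon); 'v' takes none.  */
static char const short_options[] = "n:v";

/* Each long option maps to its short analogue in the last field.  */
static struct option const long_options[] =
{
  {"count", required_argument, NULL, 'n'},
  {"verbose", no_argument, NULL, 'v'},
  {NULL, 0, NULL, 0}
};

/* Returns the index of the first operand in argv, or -1 on a bad option.  */
int
parse_demo_options (int argc, char **argv, struct demo_opts *out)
{
  int c;
  out->verbose = false;
  out->count = 1;

  while ((c = getopt_long (argc, argv, short_options, long_options, NULL))
         != -1)
    {
      switch (c)
        {
        case 'n':
          out->count = strtol (optarg, NULL, 10);  /* optarg: the argument */
          break;
        case 'v':
          out->verbose = true;
          break;
        default:
          return -1;  /* getopt_long already printed a diagnostic */
        }
    }
  return optind;  /* position of the next unprocessed argument */
}
```

Most utilities follow this same while/switch shape, just with many more cases. After the loop, everything from argv[optind] onward is an operand such as a file name.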

Traversing the file system with fts

Unix-like systems often support the fts library to easily manage walking through the file system. The basic hand-waved details are:

  • The tree is represented by an FTS structure built by calling fts_open() or xfts_open() on a path.
  • A node (file/directory) from the tree is an FTSENT structure.
  • Calling fts_read() on the FTS generates FTSENTs. This is walking the tree.
  • The FTSENT->fts_info field describes the entries. It is used often to decide how to handle the entry.
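
A small walk shows how these pieces fit together. This sketch assumes a libc that ships <fts.h> (glibc and the BSDs do); count_regular_files() is a hypothetical helper, and it calls plain fts_open() rather than the xfts_open() wrapper.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fts.h>
#include <stddef.h>

/* Hypothetical helper: count the regular files below ROOT.
   Returns -1 if the tree cannot be opened.  */
long
count_regular_files (const char *root)
{
  char *paths[] = { (char *) root, NULL };

  /* Build the FTS handle: do not follow symlinks, do not chdir.  */
  FTS *tree = fts_open (paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
  if (tree == NULL)
    return -1;

  long count = 0;
  FTSENT *ent;

  /* Each fts_read() call yields the next node; NULL ends the walk.  */
  while ((ent = fts_read (tree)) != NULL)
    {
      switch (ent->fts_info)
        {
        case FTS_F:   /* regular file */
          count++;
          break;
        default:      /* directories, symlinks, errors: ignored here */
          break;
        }
    }

  fts_close (tree);
  return count;
}
```

Utilities that walk trees tend to have a switch on fts_info much like this one, with extra cases such as FTS_D, FTS_DP, and FTS_ERR deciding what to do at directories and on errors.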

Syscall wrappers, and helpers

coreutils often invokes syscalls through wrappers and helpers beyond those provided by libc. Many are linked through the Gnulib project.

write

libc provides many text writing functions, such as fwrite() for buffered stream access, and the write() syscall wrapper. Coreutils brings in non-standard functions such as full_write(). The full_write() function continuously retries writes unless there is a hard failure. It relies on safe_write() to retry the write() syscall across interrupts. Other write-related helpers are used only in a single utility, such as iwrite() in dd and cwrite() in split. I'll discuss those within the utilities themselves.
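
The retry idea can be approximated as follows. These are simplified stand-ins with hypothetical names (safe_write_sketch() and full_write_sketch()), not the gnulib sources; treating a zero-byte write as ENOSPC is my assumption about a sensible way to stop.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Retry write() when a signal interrupts it before any data is written.  */
static ssize_t
safe_write_sketch (int fd, const void *buf, size_t count)
{
  for (;;)
    {
      ssize_t n = write (fd, buf, count);
      if (n >= 0 || errno != EINTR)
        return n;
    }
}

/* Keep writing until all COUNT bytes are out or a hard failure occurs.
   Returns the number of bytes written; a short count means failure,
   with errno describing the problem.  */
size_t
full_write_sketch (int fd, const void *buf, size_t count)
{
  size_t total = 0;
  const char *p = buf;

  while (count > 0)
    {
      ssize_t n = safe_write_sketch (fd, p, count);
      if (n < 0)
        break;                /* hard failure */
      if (n == 0)
        {
          errno = ENOSPC;     /* no progress: treat as out of space */
          break;
        }
      total += (size_t) n;
      p += n;
      count -= (size_t) n;
    }
  return total;
}
```

Callers compare the return value against the requested size, so a partial write never passes silently.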

Common functions

All utilities use at least three functions: main(), usage(), and _().

The usage() function displays help for the utility that includes a list of input parameters, their meaning, and appropriate syntax.

The _() function is really a macro defined in system.h that binds simple strings to the Native Language Support capability in GNU gettext.h. If it's a string meant to be shown to the user, it's probably wrapped with this function.

Common code lines

The following code lines occur in most non-trivial utilities:

#include "system.h"
This header defines system-dependent macros, variables, and useful non-standard functions. It provides 'translations' necessary to allow coreutils to build on as many architectures as possible. Overall, this header is a patchwork of corner cases lacking serious organization -- but it works! Many C standard and POSIX headers are included within this header, such as: unistd.h, limits.h, ctype.h, time.h, string.h, errno.h, stdbool.h, stdlib.h, fcntl.h, inttypes.h, and locale.h.

#define PROGRAM_NAME "cat"
Defines the official name for the utility. Used in the 'version' check.

#define AUTHORS proper_name ("Richard M. Stallman")
Defines the authors for the utility. Used in the 'version' check.

emit_try_help ()
Prints a suggestion to run the utility with --help after a usage error. This appears at the beginning of usage(), on the failure path.

emit_ancillary_info (PROGRAM_NAME)
Prints common extra help info after the command-specific output. Includes a link to the online documents. This appears close to the end of usage()

exit (status)
libc function that ends execution with the given status, running any handlers registered with atexit() first. This appears at the end of usage().

initialize_main(&argc, &argv)
Special handler for VMS that forces built-in wildcard expansion. This is defined away for most other operating systems.

set_program_name(argv[0]);
Saves the basic program name using the first input argument. Discards the path component of argv[0].

setlocale(LC_ALL, "");
Sets up internationalization options during execution. Provided by libc in <locale.h>

bindtextdomain (PACKAGE, LOCALEDIR);
Sets the directory that holds the internationalization message catalogs for the package. Provided by the free software GNU gettext.

textdomain (PACKAGE);
Sets the text domain to enable i18n.

atexit(close_stdout);
Registers the close_stdout function to be called when the program ends. This flushes the buffered stream in addition to closing it.

IF_LINT(something);
Includes the code within the parentheses only when building with lint checks enabled, which suppresses spurious GCC warnings. In a normal build this is a no-op.

C idioms

There are a few idioms buried in the coreutils source that may be unfamiliar to beginners.

!!
The double exclamation point is exactly what you see, a double unary NOT operation. The purpose is to coerce a value into a boolean (exactly 0 or 1). It's often used to make a flag from a function return value.
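
A tiny illustration, using a hypothetical helper of my own that tests the owner-write permission bit:

```c
/* Hypothetical helper: is the owner-write bit (octal 0200) set in MODE?
   Without the !!, the expression would yield 0200 (128) rather than 1.  */
int
has_write_bit (unsigned int mode)
{
  return !!(mode & 0200);
}
```

The first NOT turns any nonzero value into 0, and the second turns that into 1, so the result is safe to store in a flag or compare against another flag.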

do { ... } while (0)
The non-loop often encloses a multi-statement macro to ensure proper tokenization after preprocessor substitution. The core use-case is as a consequent:

if (condition)
  MACRO;
else 
  something else
Note the lack of a semicolon after while -- it's manually added after the macro in the C code.
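
To make that concrete, here is a sketch with a made-up SWAP_INT macro and a hypothetical sort_two() function. With a bare { ... } block instead of the non-loop, the semicolon after the macro call would end the if statement and leave the else dangling.

```c
/* Hypothetical multi-statement macro wrapped in the non-loop.
   There is no semicolon after while (0); the caller supplies it.  */
#define SWAP_INT(a, b) \
  do                   \
    {                  \
      int tmp_ = (a);  \
      (a) = (b);       \
      (b) = tmp_;      \
    }                  \
  while (0)

/* Put *A and *B in ascending order.  Returns 1 if a swap happened.  */
int
sort_two (int *a, int *b)
{
  if (*a > *b)
    SWAP_INT (*a, *b);  /* expands to one statement, so else still binds */
  else
    return 0;
  return 1;
}
```

The compiler optimizes the loop away entirely, so the idiom costs nothing at run time.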


Utility Maintenance

An active project like coreutils is always evolving. In general, updates proceed across three arcs:

  • Project-wide changes - These are larger scale changes to underlying architecture and dependencies across all utilities. Some highlights include:
    • 1995: Native language support was added thanks to the GNU gettext project. This incorporated the _() macro around most text output lines. Internationalization support expanded in 1996, adding several initializers to main() as discussed in the previous section
    • 1995: Short descriptions of utility purpose were added to usage output
    • 2003: VMS wildcard support. This is visible via the initialize_main() function
    • 2016: The die() macro replaces most exit() and error() functions on failure paths to avoid compiler warnings
    • Various: Incorporating macro constants such as EXIT_SUCCESS, PROGRAM_NAME, AUTHORS, among others.
  • Utility-specific updates - Many changes apply only to a subset of utilities or a single utility. These cases usually fall into three categories: bug fixes, new features, and optimizations. Examples of each type include:
    • Bug: The join, sort, and uniq commands were susceptible to an overflow attack until patched in 2016
    • Feature: The --output option was added to df in 2013
    • Optimization: The yes utility performance improved with better buffering
  • Annual maintenance - At a minimum, the copyright years of all utilities are updated. Another administrative change is updating the FSF address. These changes have no effect on execution.

For curious readers, I've included an 'evolution' view within each utility page to visualize utility changes over time.

Contributing

People interested in contributing should read everything on the GNU project page. The contribution guidelines and list of rejected features are especially enlightening. Finally, go through the mailing list archives to get an idea of what contributions are most valuable. A very short list of things to consider before writing any code:

  • Can this functionality be reproduced with existing tools?
  • Does your contribution break backwards compatibility?
  • Does the proposed behavior deviate significantly from POSIX?

Not sure? Send your concerns to the community on the mailing list.


Fun stuff

Veteran developers looking for a reason to peek inside these utilities may want to start their journey here.

Trivia

Shortest utility: false (2 lines - tied with arch, dir, and vdir)
Shortest standalone utility: true (80 lines) -- the first version is almost a minimum C program!
Longest utility: ls (5308 lines)

  • Many utilities trace back to Research UNIX in the 1970s. A handful even further back to Multics
  • The oldest spiritual ancestor is the CTSS LISTF command (~1963). Thankfully shortened to ls
  • The distinct syntax of the dd utility is reminiscent of the OS/360 job control language (early 1960s).
  • The sort program is the only utility that takes advantage of multi-threading
  • The fmt utility demonstrates optimization of lines and paragraphs using feature costs
  • The deceptively simple yes utility has high-performance output using page-aligned memory buffers
  • The df utility is faster than du. The former uses device metadata while the latter checks all files
  • cksum includes two entry points, one for normal operation and one to generate the CRC-32 table
  • There is no failure condition for the echo utility
  • The design of the test and expr utilities departs significantly from the typical utility
  • su was originally maintained by coreutils/shellutils
  • My personal least used utilities are tsort and ptx - I tested them once in the late 1990s

Interesting implementations

There are a few standalone code snippets within coreutils worth investigating:


FAQ

Nice project! How can I donate to support this effort?
Thanks for the thoughts; unfortunately I'm not configured to receive personal donations. But feel free to share your time or money with the Free Software Foundation -- That's where all the collaborative efforts happen!