Decoded: cat (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

[Figure: Logical flow of cat command (coreutils)]

Summary

cat - concatenate and print files

[Source] [Code Walkthrough]

Lines of code: 768
Principal syscall: write() -- wrapped by full_write()
Support syscalls: fstat()
Options: 19 (10 short, 9 long)

Descended from cat introduced in Version 1 UNIX (1971)
Added to Textutils in November 1992 [First version]
Number of revisions: 162

Helpers:
  • cat() - Implements all the features for I/O copying
  • next_line_num() - Update the line number buffer
  • simple_cat() - Basic copy from input to output
  • write_pending() - Full-write any pending data
External non-standard helpers:
  • die() - Exit with mandatory non-zero error and message to stderr
  • error() - Outputs error message to standard error with possible process termination
  • full_write() - Wrapper for write() that retries on interrupt
  • getpagesize() - Gets the memory page size for the system
  • io_blksize() - Gets the optimal block size
  • ptr_align() - Ensures that returned pointer is memory aligned
  • safe_read() - Reads with retry on interrupt
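
To make the "retries on interrupt" behavior of full_write() and safe_read() concrete, here is a minimal sketch of a write loop of that kind. It is not the gnulib code: the name retry_write() and the exact handling of zero-length writes are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for full_write(): keep calling write() until
   all COUNT bytes are out or a real error occurs.  Returns the number
   of bytes written; a short count means write() failed.  */
size_t
retry_write (int fd, const void *buf, size_t count)
{
  const char *p = buf;
  size_t total = 0;

  while (count > 0)
    {
      ssize_t n = write (fd, p, count);
      if (n < 0 && errno == EINTR)
        continue;               /* interrupted by a signal: try again */
      if (n <= 0)
        break;                  /* real error, or no progress possible */
      total += (size_t) n;
      p += n;
      count -= (size_t) n;
    }
  return total;
}
```

A caller compares the return value with the requested count; anything shorter is treated as a write failure.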

Setup

At global scope, cat.c does the following:

  • Defines infile to point to the input file name
  • Defines input_desc to hold the file descriptor
  • Defines line_buf[] and several associated pointers to manage line number counts. The 18-digit limit is unlikely to matter in normal operation.
  • Defines newlines to track the number of newlines across multiple inputs
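
The line number buffer holds the counter as text, so incrementing it is a matter of carrying digits rather than converting an integer on every line. The sketch below shows that idea under stated assumptions: the names counter, counter_reset(), counter_increment() and counter_text() are hypothetical, and the layout (18 digit cells, a tab, a terminating NUL) is my reading of the 18-digit limit mentioned above.

```c
#include <string.h>

#define COUNTER_LEN 20          /* 18 digit cells + '\t' + '\0' */

/* Right-aligned decimal counter stored as text.  */
static char counter[COUNTER_LEN];
static int first_digit;         /* index of most significant digit in use */
static int last_digit;          /* index of least significant digit */

void
counter_reset (void)
{
  memset (counter, ' ', COUNTER_LEN - 3);
  counter[COUNTER_LEN - 3] = '0';
  counter[COUNTER_LEN - 2] = '\t';
  counter[COUNTER_LEN - 1] = '\0';
  first_digit = last_digit = COUNTER_LEN - 3;
}

/* Add one to the textual counter, carrying leftwards.  */
void
counter_increment (void)
{
  for (int i = last_digit; i >= first_digit; i--)
    {
      if (counter[i] < '9')
        {
          counter[i]++;         /* no carry needed */
          return;
        }
      counter[i] = '0';         /* 9 -> 0, carry into the next digit */
    }

  if (first_digit > 0)
    counter[--first_digit] = '1';   /* grow by one digit */
  else
    counter[0] = '>';               /* all 18 digits exhausted */
}

/* Text of the current count, followed by a tab.  */
const char *
counter_text (void)
{
  return counter + first_digit;
}
```

Most increments touch a single byte, which is why a text counter is attractive when every output line needs a number.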

main() initializes the following:

  • argind - argv index for the argument to cat
  • c - Holds the next option character for parsing
  • file_open_mode - Bitmap holding the file mode
  • have_read_stdin - Flag set if STDIN was used
  • inbuf - Pointer to the input buffer
  • insize - The optimal number of bytes to read in
  • number - Flag for numbering the output lines
  • number_nonblank - Flag for numbering the non-blank output lines
  • ok - Flag for execution success
  • out_dev - The output device number
  • out_ino - The output inode number
  • out_isreg - Flag if the output is a plain file
  • outbuf - Pointer to the output buffer
  • outsize - The optimal number of bytes to write out
  • page_size - Stores the size of a memory page for the system (4k is common)
  • show_ends - Flag for showing the end of line character ($)
  • show_nonprinting - Flag for showing nonprintable characters
  • show_tabs - Flag for showing tab characters (^I)
  • squeeze_blank - Flag for skipping repeated blank lines
  • stat_buf - Buffer for the result of fstat()

Parsing kicks off with the short options passed as a string literal:
"benstuvAET"


Parsing

During parsing, we're collecting options and arguments to answer the following questions:

  • Do we display line numbers? On all lines or only non-empty?
  • Do we display non-printables or end of lines?
  • Do we collapse consecutive blank lines?

Parsing failures

The only parsing failure is an unknown option. In that case, usage help is displayed.
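
As a rough sketch of how the option string "benstuvAET" can be turned into the flags listed under Setup, the following uses getopt_long(). The struct cat_flags and the function parse_flags() are hypothetical names for illustration, the --help and --version entries are left out, and the real main() keeps these flags as local variables rather than in a struct.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

struct cat_flags                /* hypothetical container for the flags */
{
  bool number, number_nonblank, squeeze_blank;
  bool show_ends, show_nonprinting, show_tabs;
};

static const struct option long_opts[] = {
  {"number-nonblank", no_argument, NULL, 'b'},
  {"number", no_argument, NULL, 'n'},
  {"squeeze-blank", no_argument, NULL, 's'},
  {"show-nonprinting", no_argument, NULL, 'v'},
  {"show-ends", no_argument, NULL, 'E'},
  {"show-tabs", no_argument, NULL, 'T'},
  {"show-all", no_argument, NULL, 'A'},
  {NULL, 0, NULL, 0}
};

/* Fill *F from the command line.  Returns the index of the first
   file operand, or -1 if an unknown option was seen.  */
int
parse_flags (int argc, char **argv, struct cat_flags *f)
{
  int c;

  while ((c = getopt_long (argc, argv, "benstuvAET", long_opts, NULL)) != -1)
    switch (c)
      {
      case 'b': f->number = f->number_nonblank = true; break;
      case 'e': f->show_ends = f->show_nonprinting = true; break;
      case 'n': f->number = true; break;
      case 's': f->squeeze_blank = true; break;
      case 't': f->show_tabs = f->show_nonprinting = true; break;
      case 'u': break;          /* accepted and ignored */
      case 'v': f->show_nonprinting = true; break;
      case 'A':
        f->show_nonprinting = f->show_ends = f->show_tabs = true;
        break;
      case 'E': f->show_ends = true; break;
      case 'T': f->show_tabs = true; break;
      default: return -1;       /* unknown option: caller shows usage */
      }
  return optind;
}
```

Several short options are shorthand for combinations, which is why ten letters map onto only six flags.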


Execution

cat goes through these steps during execution:

  • Open and verify access to output
  • Open and verify the input
  • Choose an output cat method (simple or normal)
  • Copy the data from input to output in buffer-sized increments
  • Close the input and move to the next
  • End with the 'best' possible status

Failure cases:

  • Unable to fstat() input or output file
  • The input and output files are the same
  • Failure to write to the output file
  • Failure to close standard input (if used)

All failures at this stage output an error message to STDERR and return without displaying usage help.
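
The "input and output files are the same" failure comes down to comparing the device and inode numbers that fstat() reports, which is what out_dev and out_ino are saved for. Below is a minimal sketch of that comparison only. The function names same_file() and input_is_output() are hypothetical, and the actual check in cat.c applies further conditions (for example, it depends on out_isreg) that are omitted here.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both descriptors refer to the same file, 0 if they
   differ, and -1 if either fstat() call fails.  */
int
same_file (int in_fd, int out_fd)
{
  struct stat in_st, out_st;

  if (fstat (in_fd, &in_st) != 0 || fstat (out_fd, &out_st) != 0)
    return -1;
  return in_st.st_dev == out_st.st_dev && in_st.st_ino == out_st.st_ino;
}

/* Hypothetical helper: open PATH for reading and compare it with the
   already-open output descriptor.  */
int
input_is_output (const char *path, int out_fd)
{
  int in_fd = open (path, O_RDONLY);
  if (in_fd < 0)
    return -1;

  int result = same_file (in_fd, out_fd);
  close (in_fd);
  return result;
}
```

Comparing names would not be enough, since hard links and redirections can give one file several paths; the (device, inode) pair identifies the file itself.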

Extra comments

Two points to consider: the choice of cat method and the transfer buffer size.

cat() vs simple_cat()

If the input is copied directly to the output without changes, then simple_cat() is the method. But if additional formatting is requested (e.g. line numbers, non-printables), then the full cat() function is called. The latter is necessarily more complicated at ~300 lines vs 40.
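
The choice between the two methods can be sketched as a single test on the formatting flags, followed by a plain read/write loop for the simple case. This is an illustration rather than the coreutils code: needs_formatting() and copy_plain() are hypothetical names, and the loop calls read() and write() directly instead of going through safe_read() and full_write().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* True if any option requires inspecting the data, in which case the
   full formatting path is needed.  */
bool
needs_formatting (bool number, bool show_ends, bool show_nonprinting,
                  bool show_tabs, bool squeeze_blank)
{
  return number || show_ends || show_nonprinting || show_tabs
         || squeeze_blank;
}

/* Copy IN_FD to OUT_FD unchanged, BUFSIZE bytes at a time.
   Returns true on success, false on a read or write error.  */
bool
copy_plain (int in_fd, int out_fd, char *buf, size_t bufsize)
{
  for (;;)
    {
      ssize_t n = read (in_fd, buf, bufsize);
      if (n < 0 && errno == EINTR)
        continue;               /* interrupted: retry the read */
      if (n < 0)
        return false;
      if (n == 0)
        return true;            /* end of input */

      size_t off = 0;
      while (off < (size_t) n)
        {
          ssize_t w = write (out_fd, buf + off, (size_t) n - off);
          if (w < 0 && errno == EINTR)
            continue;           /* interrupted: retry the write */
          if (w <= 0)
            return false;
          off += (size_t) w;
        }
    }
}
```

The plain path never looks at the bytes it moves, which is what keeps it short.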

[Figure: Two cat methods in coreutils]

Buffer sizes

Buffers hold data between read and write calls. Common sense says that the buffer should be at least as large as the largest single I/O move. However, reality is more complicated. Consider the stated output buffer size:

OUTSIZE - 1 + INSIZE * 4 + LINE_COUNTER_BUF_LEN + PAGE_SIZE - 1

Source comments provide some discussion, but I'll derive it differently. There are four factors at work:

  • The buffer writes OUTSIZE sized chunks
    The buffer may not write if there are only OUTSIZE - 1 bytes pending. In the same pass, the buffer must be able to accept the next read of INSIZE bytes. Therefore the output buffer must hold at least OUTSIZE - 1 + INSIZE.
  • Each character might be modified (non-printables)
    Each byte read may be unprintable and expand to as many as four output characters, such as 'M-^?'. Thus INSIZE needs to be multiplied by 4 to hold the worst case.
  • Each line might be modified (added line numbers)
    The line number buffer is 20 bytes long (LINE_COUNTER_BUF_LEN), enough for the 18-digit counter plus its trailing characters. These line numbers are prepended to the line and thus must be part of the output buffer.
  • Buffer access should be page-aligned
    Performance on some architectures depends on alignment. In the worst case, the buffer is allocated starting on the 2nd byte of a page, and to align it we must move forward PAGE_SIZE - 1 bytes to the beginning of the next page.
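
The four factors can be checked with a short sketch. It computes the size from the formula above, rounds a pointer up to an alignment boundary, and encodes one byte in the 'M-^' style to show where the factor of 4 comes from. The names output_buffer_size(), align_up() and visible_form() are hypothetical; align_up() only stands in for ptr_align(), and the encoder is my simplified reading of the -v notation, not the code from cat().

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_COUNTER_BUF_LEN 20

/* Worst-case output buffer size, term by term from the formula.  */
size_t
output_buffer_size (size_t insize, size_t outsize, size_t page_size)
{
  return outsize - 1 + insize * 4 + LINE_COUNTER_BUF_LEN + page_size - 1;
}

/* Round PTR up to the next multiple of ALIGNMENT.  At most
   ALIGNMENT - 1 bytes are skipped, hence the PAGE_SIZE - 1 term.  */
void *
align_up (void *ptr, size_t alignment)
{
  uintptr_t rem = (uintptr_t) ptr % alignment;
  return rem ? (char *) ptr + (alignment - rem) : ptr;
}

/* Write the visible form of byte C into OUT (at least 4 bytes) and
   return its length.  Tab and newline are passed through here.  */
size_t
visible_form (unsigned char c, char *out)
{
  size_t n = 0;

  if (c == '\t' || c == '\n')
    {
      out[0] = (char) c;
      return 1;
    }
  if (c >= 128)
    {
      out[n++] = 'M';           /* high bit set: "M-" prefix */
      out[n++] = '-';
      c -= 128;
    }
  if (c < 32)
    {
      out[n++] = '^';           /* control character: ^@ .. ^_ */
      out[n++] = (char) (c + 64);
    }
  else if (c == 127)
    {
      out[n++] = '^';           /* DEL */
      out[n++] = '?';
    }
  else
    out[n++] = (char) c;
  return n;
}
```

A byte with the high bit set that is also a control character, such as 0xFF, is the longest case at four characters, so no input byte can need more than 4 bytes of output.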
