Decoded: cut (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of cut command (coreutils)

Summary

cut - cut out selected fields of each line of a file

[Source] [Code Walkthrough]

Lines of code: 610
Principal syscall: write() (indirectly through fwrite() and putchar())
Support syscalls: fadvise()
Options: 17 (7 short, 10 long)

Descended from cut, which originated in System III (1982) and appeared internally as early as 1980
Added to Fileutils in November 1992 [First version]
Number of revisions: 192

Helpers:
  • next_item() - Updates current index to the next byte/field
  • print_kth() - True if the current byte/field is printable
  • is_range_start_index() - True if the current position starts a range
  • cut_bytes() - Processes byte mode for an input stream
  • cut_fields() - Processes field mode for an input stream
  • cut_stream() - Directs stream processing to the desired mode
  • cut_file() - Starts processing for an input stream
External non-standard helpers:
  • die() - Exit with mandatory non-zero error and message to stderr
  • error() - Outputs error message to standard error with possible process termination
  • set_fields() - Initializes field_range_pair array
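
To make the first three helpers concrete, here is a minimal sketch of how a cursor over sorted, non-overlapping ranges can answer each question in constant time. It assumes the range array ends with a sentinel pair whose bounds are UINTMAX_MAX. The struct names and the _sketch suffixes are illustrative and are not the identifiers used in cut.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* One selected range, 1-based and inclusive (illustrative layout). */
struct range_pair {
    uintmax_t lo;
    uintmax_t hi;
};

/* Cursor pointing at the range that the current index has not yet passed. */
struct range_cursor {
    const struct range_pair *cur;
};

/* Advance the index; step to the next range once the current one is behind us. */
void next_item_sketch(struct range_cursor *c, uintmax_t *idx)
{
    (*idx)++;
    if (*idx > c->cur->hi)
        c->cur++;
}

/* True if index k falls inside the current range and should be printed. */
bool print_kth_sketch(const struct range_cursor *c, uintmax_t k)
{
    return c->cur->lo <= k;
}

/* True if index k is the first position of the current range. */
bool is_range_start_sketch(const struct range_cursor *c, uintmax_t k)
{
    return k == c->cur->lo;
}
```

Because the cursor only moves forward, a whole line is scanned without ever searching the range array. The cursor has to be reset to the first range at the start of each line.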

Setup

At global scope, cut.c does the following:

  • Declares field_range_pair to hold the -f option range (from set-fields.h)
  • Declares field_1_buffer as a character buffer for the first field
  • Declares field_1_bufsize to hold the size of that buffer
  • Defines operating_mode enum for three operation modes (unknown, byte, and field)
  • Declares operating_mode to hold the mode of the current operation
  • Declares suppress_non_delimited to prevent field mode output for lines without delimiters
  • Declares complement to output the 'inverse' of the selected bytes/fields
  • Declares delim to hold the delimiting character for fields
  • Defines line_delim as the standard newline character
  • Declares output_delimiter_specified, a flag set if an output delimiter was defined
  • Declares output_delimiter_length as the length of the delimiter
  • Declares output_delimiter_string as the delimiter used on output
  • Declares have_read_stdin, a flag set if STDIN is used for processing

main() initializes the following:

  • delim_specified - Flag if the user specifies a delimiter
  • ok - Holds the return status of the utility
  • optc - Holds the option character being parsed

Parsing kicks off with the short options passed as a string literal:
"b:c:d:f:nsz"

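As a rough illustration of what that option string drives, the sketch below runs getopt() over it and records the answers in a struct. The real utility uses getopt_long() so that the long options work too. Here struct cut_config, its fields, and parse_options_sketch() are hypothetical names, and treating -b and -c as the same byte mode is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <getopt.h>   /* getopt(), optarg, optind, opterr */

enum mode_sketch { MODE_UNDEFINED, MODE_BYTE, MODE_FIELD };

/* Hypothetical container for the answers collected while parsing. */
struct cut_config {
    enum mode_sketch mode;
    const char *list;        /* argument of -b, -c or -f */
    const char *delim_arg;   /* argument of -d, NULL if absent */
    bool suppress;           /* -s */
    bool zero_terminated;    /* -z */
    bool mode_conflict;      /* more than one of -b, -c, -f was given */
};

/* Returns 0 on success, -1 on an unknown option or a missing argument. */
int parse_options_sketch(int argc, char **argv, struct cut_config *cfg)
{
    int optc;

    *cfg = (struct cut_config){ .mode = MODE_UNDEFINED };
    optind = 1;   /* start scanning at argv[1] */
    opterr = 0;   /* let the caller report errors */

    while ((optc = getopt(argc, argv, "b:c:d:f:nsz")) != -1) {
        switch (optc) {
        case 'b':
        case 'c':
            if (cfg->mode != MODE_UNDEFINED)
                cfg->mode_conflict = true;
            cfg->mode = MODE_BYTE;
            cfg->list = optarg;
            break;
        case 'f':
            if (cfg->mode != MODE_UNDEFINED)
                cfg->mode_conflict = true;
            cfg->mode = MODE_FIELD;
            cfg->list = optarg;
            break;
        case 'd':
            cfg->delim_arg = optarg;
            break;
        case 'n':
            break;   /* accepted; has no effect in this sketch */
        case 's':
            cfg->suppress = true;
            break;
        case 'z':
            cfg->zero_terminated = true;
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

A colon after a letter in the option string means the option takes an argument, which is why b, c, d and f fill optarg while n, s and z are plain flags.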

Parsing

During parsing, we're collecting options and arguments to answer the following questions:

  • Are we using byte mode or field mode?
  • What are the delimiters?
  • Do lines end with a newline or a NUL character (-z)?
  • If field mode, do we use a custom input delimiter?
  • If field mode, do we suppress lines without delimiters?

Parsing failures

These failure cases are explicitly checked:

  • Specifying more than one mode
  • Using a multicharacter delimiter
  • Using an input delimiter in byte mode
  • Suppressing non-delimited output in byte mode

User-specified parsing failures result in a short error message followed by the usage instructions. Access-related parsing errors die with an error message.
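
The four checks above can be sketched as a single validation pass over the parsed options. This is only an illustration: cut.c may perform some of these checks while it is still looping over options, and struct parsed_options, enum parse_status, and check_options_sketch() are names invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum parse_status {
    PARSE_OK,
    ERR_MULTIPLE_MODES,
    ERR_MULTICHAR_DELIM,
    ERR_DELIM_IN_BYTE_MODE,
    ERR_SUPPRESS_IN_BYTE_MODE
};

/* Hypothetical summary of what option parsing collected. */
struct parsed_options {
    int mode_options_seen;   /* how many of -b, -c, -f appeared */
    bool field_mode;         /* true if -f selected the mode */
    const char *delim_arg;   /* argument of -d, NULL if absent */
    bool suppress;           /* -s */
};

/* Returns the first failure found, or PARSE_OK. */
enum parse_status check_options_sketch(const struct parsed_options *o)
{
    if (o->mode_options_seen > 1)
        return ERR_MULTIPLE_MODES;
    if (o->delim_arg != NULL && strlen(o->delim_arg) > 1)
        return ERR_MULTICHAR_DELIM;
    if (!o->field_mode && o->delim_arg != NULL)
        return ERR_DELIM_IN_BYTE_MODE;
    if (!o->field_mode && o->suppress)
        return ERR_SUPPRESS_IN_BYTE_MODE;
    return PARSE_OK;
}
```

Returning a status code keeps the sketch testable. The utility itself reports the problem and exits instead.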


Execution

cut goes through these steps during execution:

  • Initialize the field range array using parsed configuration
  • Open and verify the input (files or STDIN)
  • Perform byte or field mode cut operation
  • Write all results to STDOUT

Failure cases:

  • Unable to fstat() input or output file
  • The input and output files are the same
  • Failure to write to the output file
  • Failure to close standard input (if used)

All failures at this stage output an error message to STDERR and return without displaying usage help.

Extra comments

Let's touch on the two operating modes.

Byte Mode

Just as you'd expect, we're reading in a byte at a time. The procedure is simple:

  • If the byte is within the print range: print it
  • If not, do nothing
  • If it's the end of a line, print the line delimiter and restart the byte count
  • If it's the end of the file and the last line was unterminated, print the line delimiter
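
The byte-mode procedure above can be sketched as a loop over an in-memory buffer. This is a simplification under stated assumptions: the real code reads from a stream, supports many ranges and the complement option, and can use a custom output delimiter. The sketch handles a single hypothetical range, and cut_bytes_sketch() and struct byte_range are illustrative names.

```c
#include <stddef.h>

/* One selected range, 1-based and inclusive (illustrative). */
struct byte_range {
    size_t lo;
    size_t hi;
};

/* Copies the selected bytes of every line of the NUL-terminated string
   `in` to `out`, which must hold at least strlen(in) + 2 bytes.
   Returns the number of bytes written, not counting the final NUL. */
size_t cut_bytes_sketch(const char *in, struct byte_range r, char *out)
{
    size_t n = 0;       /* bytes written so far */
    size_t idx = 0;     /* 1-based position within the current line */
    char prev = '\n';   /* last byte seen; '\n' means "at start of a line" */

    for (const char *p = in; *p != '\0'; p++) {
        if (*p == '\n') {
            out[n++] = '\n';   /* end of line: always printed */
            idx = 0;           /* restart the byte count */
        } else {
            idx++;
            if (r.lo <= idx && idx <= r.hi)
                out[n++] = *p; /* inside the print range */
        }
        prev = *p;
    }

    /* End of input: terminate a final line that lacked a newline. */
    if (prev != '\n')
        out[n++] = '\n';

    out[n] = '\0';
    return n;
}
```

For example, selecting bytes 2 through 4 from the two lines "hello" and "world" yields "ell" and "orl", each followed by a newline even though the input's last line had none.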

Field Mode

Field mode is more complicated since we may be suppressing lines and using custom delimiters. It's necessary to buffer the first field so that its characters are retained until we know whether the line contains a delimiter. A line without a delimiter has only one field.

  • Test for EOF
  • If the first field is the whole line, output it, including the end-of-line character, unless suppressing
  • If the first field is the first of many, then print if selected
  • Print any subsequent fields based on selection rule
  • Escape the default line delimiter if it is overridden with the -d option
  • Repeat above for all lines until the EOF test fires
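
A minimal sketch of the field-mode flow, again over an in-memory buffer rather than a stream. It assumes a single hypothetical field range, an output delimiter equal to the input delimiter, and a delimiter other than newline, so the escaping step in the list is not modeled. cut_fields_sketch() is an illustrative name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Writes fields lo..hi (1-based, inclusive) of every line of the
   NUL-terminated string `in` to `out`, which must hold at least
   strlen(in) + 2 bytes. Lines without `delim` are copied whole unless
   `suppress` is set. Returns the number of bytes written. */
size_t cut_fields_sketch(const char *in, char delim, size_t lo, size_t hi,
                         bool suppress, char *out)
{
    size_t n = 0;

    while (*in != '\0') {                       /* EOF test */
        size_t len = strcspn(in, "\n");         /* length of this line */
        const char *end = in + len;

        if (memchr(in, delim, len) == NULL) {
            /* The first field is the whole line. */
            if (!suppress) {
                memcpy(out + n, in, len);
                n += len;
                out[n++] = '\n';
            }
        } else {
            size_t field = 1;
            bool printed = false;
            const char *p = in;

            for (;;) {
                /* q marks the end of the current field. */
                const char *q = memchr(p, delim, (size_t)(end - p));
                if (q == NULL)
                    q = end;
                if (lo <= field && field <= hi) {
                    if (printed)
                        out[n++] = delim;       /* separator between fields */
                    memcpy(out + n, p, (size_t)(q - p));
                    n += (size_t)(q - p);
                    printed = true;
                }
                if (q == end)
                    break;
                p = q + 1;
                field++;
            }
            out[n++] = '\n';
        }

        in = end;
        if (*in == '\n')
            in++;                               /* step past the newline */
    }

    out[n] = '\0';
    return n;
}
```

The delimiter search over the whole line plays the role of the buffered first field here: nothing is written until we know whether the line has a delimiter.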
