Decoded: cut (coreutils) – MaiZure's Projects

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of cut command (coreutils)

Summary

cut - cut out selected fields of each line of a file

[Source] [Code Walkthrough]

Lines of code: 610
Principal syscall: write() (indirectly through fwrite() and putchar())
Support syscalls: fadvise()
Options: 17 (7 short, 10 long)

Descended from cut as originated in System III (1982) (appeared internally as early as 1980)
Added to Fileutils in November 1992 [First version]
Number of revisions: 192

Helpers:

next_item() - Updates current index to the next byte/field
print_kth() - True if the current byte/field is printable
is_range_start_index() - True if the current position starts a range
cut_bytes() - Processes byte mode for an input stream
cut_fields() - Processes field mode for an input stream
cut_stream() - Directs stream processing to the desired mode
cut_file() - Starts processing for an input stream

External non-standard helpers:

die() - Exit with mandatory non-zero error and message to stderr
error() - Outputs error message to standard error with possible process termination
set_fields() - Initializes field_range_pair array
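
The three small index helpers are easiest to understand next to the data they walk. The following is a minimal sketch, not the coreutils source: it assumes set_fields() leaves behind a sorted, non-overlapping array of inclusive 1-based ranges ending in a sentinel entry. The names range_pair, cursor and cursor_init are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the pairs built by set_fields().
   Assumption: sorted, non-overlapping, 1-based, inclusive, and
   terminated by a sentinel whose lo and hi are UINTMAX_MAX. */
struct range_pair { uintmax_t lo; uintmax_t hi; };

struct cursor {
    const struct range_pair *current; /* range at or after the position */
    uintmax_t index;                  /* 1-based byte/field position */
};

static void cursor_init(struct cursor *c, const struct range_pair *pairs)
{
    assert(c != NULL && pairs != NULL);
    c->current = pairs;
    c->index = 1;
}

/* Move to the next byte/field, stepping past a range that has ended. */
static void next_item(struct cursor *c)
{
    c->index++;
    if (c->index > c->current->hi)
        c->current++;
}

/* True if the current byte/field falls inside the current range. */
static bool print_kth(const struct cursor *c)
{
    return c->current->lo <= c->index;
}

/* True if the current position is the first item of a range. */
static bool is_range_start_index(const struct cursor *c)
{
    return c->index == c->current->lo;
}
```

Because the ranges are sorted and the cursor only moves forward, each membership test is a single comparison. The cursor has to be reset with cursor_init() at the start of every line.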

Setup

At global scope, cut.c does the following:

Declares field_range_pair to hold the -f option range. From set-fields.h
Declares field_1_buffer as a character buffer for the first field
Declares field_1_bufsize to hold the size of the buffer
Defines operating_mode enum for three operation modes (undefined, byte, and field)
Declares operating_mode, a variable of that enum type, to track the current mode
Declares suppress_non_delimited, which prevents field mode output for lines without delimiters
Declares complement to output the 'inverse' of the selected bytes/fields
Declares delim to hold the delimiting character for fields
Defines line_delim as the standard newline character
Declares output_delimiter_specified, a flag set if an output delimiter was given
Declares output_delimiter_length as the length of the output delimiter
Declares output_delimiter_string as the delimiter used on output
Declares have_read_stdin flag set if STDIN is used for processing

main() initializes the following:

delim_specified - Flag if the user specifies a delimiter
ok - Holds the return status of the utility
optc - Holds the option character being parsed

Parsing kicks off with the short options passed as a string literal:
"b:c:d:f:nsz"

Parsing

During parsing, we're collecting options and arguments to answer the following questions:

Are we using byte mode or field mode?
What are the delimiters?
Do we output the complement of the selected bytes/fields?
If field mode, do we use a custom input delimiter?
If field mode, do we suppress lines without delimiters?

Parsing failures

These failure cases are explicitly checked:

Specifying more than one mode
Using a multicharacter delimiter
Using an input delimiter in byte mode
Suppressing non-delimited output in byte mode

User-specified parsing failures result in a short error message followed by the usage instructions. Access-related parsing errors die with an error message.
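
The four explicit checks listed above can be sketched as a single validation function. This is a hedged illustration only: struct parsed_options and check_options are hypothetical names, and the message strings are illustrative wording rather than the utility's exact text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of what option parsing collected. */
struct parsed_options {
    int lists_given;   /* how many of -b, -c, -f appeared */
    bool field_mode;   /* true for -f, false for -b or -c */
    const char *delim; /* argument of -d, or NULL if absent */
    bool suppress;     /* -s */
};

/* Returns NULL if the options are consistent, otherwise a short
   message describing the first problem found. */
static const char *check_options(const struct parsed_options *o)
{
    assert(o != NULL);

    if (o->lists_given > 1)
        return "only one list may be specified";
    if (o->delim != NULL && strlen(o->delim) > 1)
        return "the delimiter must be a single character";
    if (!o->field_mode && o->delim != NULL)
        return "an input delimiter only makes sense in field mode";
    if (!o->field_mode && o->suppress)
        return "suppressing non-delimited lines only makes sense in field mode";
    return NULL;
}
```

A caller would print the returned message followed by the usage text and exit with a non-zero status.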

Execution

cut goes through these steps during execution:

Initialize the field range array using parsed configuration
Open and verify the input (files or STDIN)
Perform byte or field mode cut operation
Write all results to STDOUT

Failure cases:

Unable to fstat() input or output file
The input and output files are the same
Failure to write to the output file
Failure to close standard input (if used)

All failures at this stage output an error message to STDERR and return without displaying usage help.

Extra comments

Let's touch on the two operating modes.

Byte Mode

Just as you'd expect, we're reading in a byte at a time. The procedure is simple:

If the byte is within the print range: print it
If not, do nothing
If it's the end of a line, print the line delimiter and restart the byte count
If it's the end of the file and the last line was not terminated, print the line delimiter
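
The byte-mode procedure can be sketched in a few lines of C. This is a simplified illustration, not the real cut_bytes(): it assumes a single inclusive 1-based range instead of the full range array, and it takes explicit input and output streams so it can be exercised in isolation. The name cut_bytes_sketch and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Copy bytes lo..hi (1-based, inclusive) of every line from in to out.
   Assumption: one range only; the real utility walks a list of ranges. */
static void cut_bytes_sketch(FILE *in, FILE *out,
                             unsigned long lo, unsigned long hi,
                             int line_delim)
{
    unsigned long byte_idx = 0; /* position within the current line */
    bool pending_line = false;  /* bytes seen since the last line delimiter */
    int c;

    assert(in != NULL && out != NULL);
    assert(lo >= 1 && lo <= hi);

    while ((c = getc(in)) != EOF) {
        if (c == line_delim) {
            /* End of line: always printed, then start counting again. */
            putc(c, out);
            byte_idx = 0;
            pending_line = false;
        } else {
            byte_idx++;
            pending_line = true;
            if (lo <= byte_idx && byte_idx <= hi)
                putc(c, out); /* inside the print range */
            /* otherwise do nothing */
        }
    }

    /* End of file: terminate a final line that had no delimiter. */
    if (pending_line)
        putc(line_delim, out);
}
```

Calling it with stdin, stdout, 2, 4 and '\n' would behave roughly like `cut -b 2-4` for input that fits these assumptions.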

Field Mode

Field mode is more complicated since we may be suppressing lines and using custom delimiters. The first field must be buffered so that its characters are retained until we know whether the line contains a delimiter, which decides whether the line is printed at all. A line without a delimiter has only one field.

Test for EOF
If the first field is the whole line (no delimiter was found), output it, including the end-of-line character, unless suppressing
If the first field is the first of many, then print if selected
Print any subsequent fields based on selection rule
Escape the default line delimiter if overridden with the -d option
Repeat above for all lines until the EOF test fires
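
The field-mode steps above can be sketched as follows. This is a simplified illustration and not the real cut_fields(): it assumes a single inclusive range of fields, an input delimiter that differs from the line delimiter, and an output delimiter equal to the input delimiter. The names cut_fields_sketch and selected are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumption: fields lo..hi (1-based, inclusive) are the selection. */
static bool selected(unsigned long k, unsigned long lo, unsigned long hi)
{
    return lo <= k && k <= hi;
}

/* Returns 0 on success, -1 if memory runs out. */
static int cut_fields_sketch(FILE *in, FILE *out,
                             unsigned long lo, unsigned long hi,
                             int delim, int line_delim, bool suppress)
{
    char *first = NULL; /* buffer holding the first field */
    size_t cap = 0;
    int c;

    assert(in != NULL && out != NULL);
    assert(lo >= 1 && lo <= hi && delim != line_delim);

    while ((c = getc(in)) != EOF) { /* test for EOF */
        size_t len = 0;
        unsigned long field_idx = 1;
        bool printed = false;

        /* Buffer field 1; nothing is printed until we know whether
           the line contains a delimiter. */
        while (c != EOF && c != delim && c != line_delim) {
            if (len == cap) {
                size_t new_cap = cap ? cap * 2 : 64;
                char *p = realloc(first, new_cap);
                if (p == NULL) {
                    free(first);
                    return -1;
                }
                first = p;
                cap = new_cap;
            }
            first[len++] = (char)c;
            c = getc(in);
        }

        if (c != delim) {
            /* Field 1 is the whole line: print it unless suppressing. */
            if (!suppress) {
                fwrite(first, 1, len, out);
                putc(line_delim, out);
            }
            continue;
        }

        /* Field 1 is the first of many: print it if selected. */
        if (selected(1, lo, hi)) {
            fwrite(first, 1, len, out);
            printed = true;
        }

        /* Stream the remaining fields, printing the selected ones. */
        while (c == delim) {
            bool want = selected(++field_idx, lo, hi);
            if (want && printed)
                putc(delim, out); /* separator between printed fields */
            while ((c = getc(in)) != EOF && c != delim && c != line_delim) {
                if (want)
                    putc(c, out);
            }
            if (want)
                printed = true;
        }
        putc(line_delim, out);
    }

    free(first);
    return 0;
}
```

Only the first field needs a buffer. Once a delimiter has been seen the line is known to be printable, so later fields can be streamed straight to the output.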
