Decoded: fmt (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [No POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of fmt command (coreutils)

Summary

fmt - reformat paragraph text

[Source] [Code Walkthrough]

Lines of code: 1030
Principal syscall: fwrite()
Support syscall: fadvise()
Options: 16 (7 short, 9 long)


Added to Textutils in September 1994 [First version]
Number of revisions: 173

Helpers:
  • base_cost() - Set the basic cost of the input word
  • check_punctuation() - Compute word punctuation
  • copy_rest() - Push a non-conformant line to output
  • flush_paragraph() - Output the partial paragraph now due to size contraints
  • fmt() - The main format procedure for get, format, and output
  • fmt_paragraph() - Computes the current paragraph format
  • get_paragraph() - Retrieves an entire paragraph, by line and word
  • get_line() - Gets a line from the input
  • get_prefix() - Gets a prefix from the input
  • get_space() - Gets a blank from the input
  • line_cost() - Computes the line cost
  • put_paragraph() - Outputs all the words in a paragraph by line
  • put_line() - Outputs all words in a line to include spaces
  • put_word() - Outputs a single word and recalculates column position
  • put_space() - Outputs the appropriate space
  • same_para() - Returns true if a line could belong to the current paragraph
  • set_other_indent() - Updates the other indent type
  • set_prefix() - Updates the prefix type
External non-standard helpers:
  • die() - Exit with mandatory non-zero error and message to stderr
  • error() - Outputs error message to standard error with possible process termination
  • full_write() - Wrapper for write() that retries on interrupt
  • getpagesize() - Gets the memory page size for the system
  • io_blksize() - Gets the optimal block size
  • ptr_align() - Ensures that returned pointer is memory aligned
  • safe_read() - Reads with retry on interrupt

Setup

Before looking at main(), fmt.c establishes the basis for computing paragraph formats. These include a goal (target line width) and cost/bonus functions for deviating from the goal. Specifically:

  • Defines SHORT_COST macro to set the cost of missing the goal
  • Defines RAGGED_COST macro to set the cost of not matching neighboring lines
  • Defines LINE_COST macro to set the basic cost of a line
  • Defines WIDOW_COST macro to set a special cost for breaking a line after the first word
  • Defines ORPHAN_COST macro to set a special cost for breaking a line before the last word
  • Defines SENTENCE_BONUS macro to set a bonus for breaking at the end of a sentence
  • Defines NOBREAK_COST macro sets a large penalty for breaking on a period thats not end of line
  • Defines PAREN_BONUS macro sets a cost for breaking on an open paren
  • Defines PUNCT_BONUS macro sets a cost for breaking on any punctuation
  • Defines LINE_CREDIT macro sets a bonus for breaking just after a long paragraph

These costs make sense when applied to a structure that describes a word as part of a paragraph. The Word struct looks like this:

  • *text contains the word text
  • length is the length of the word
  • space counts the number of subsequent spaces
  • paren bit indicates a leading parenthesis
  • period bit indicates ending in a final punctuation
  • punct bit indicates ends with a non-terminal punctuation
  • final bit indicates the end of a sentence
  • line_length the length of the best line starting here
  • best_cost the best case cost of starting a paragraph with this word
  • next_break pointer indicates the Word of the best break

A number of globals support the format optimizer including:

  • The max_width integer for the maximum line width
  • The goal_width integer is the line length target
  • The word[] array holding the Words of the paragraph under analysis
  • The parabuf[] array holding text of the constructed paragraph

Other globals support I/O buffering, including:

  • The in_column integer is the column of the most recently read character
  • The out_column integer is the column of the next character to write
  • The word_limit pointer indicates the first invalid word in the array

The above global lists are not exhaustive.

Finally, main() defines a few variables to kick off formatting:

  • goal_width_option - The user provided line length goal
  • max_width_option - The user provided max width
  • optchar - The option we're currently processing
  • ok - The return value of the utility

Parsing kicks off with the short options passed as a string literal:
"0123456789cstuw:p:g:"


Parsing

Parsing for fmt digs in to the user's ideal paragraph, including:

  • How wide are the lines?
  • Are there special indentation cases?
  • Should spacing be uniform?
  • Should only certain lines be formatted?

Parsing failures

There are a few parsing failures to note:

  • Numeric width must be the first option
  • Width could be too large
  • Unknown options are used

Execution

On the surface, fmt is simple:

  • Open stdin or file stream for input
  • Get a paragraph of data
  • Compute the optimal format at the paragraph, line, and word level
  • Print the paragraph
  • Repeat for all paragraphs in the file

Computing the 'optimal' paragraph deserves more explanation

Optimal Paragraph

Paragraphs are a chain of words with the basic constraint that they must remain in the same order. We must determine the 'best' place to set the line breaks. Parsing provided hints about user expectations for line length, indentations, and spacing. fmt embedded the basic idea of 'cost' associated with line features. Here is the approach for the default format:

  • A valid paragraph of data has been retrieved via get_paragraph()
  • Loop through each word backwards from the end of the paragraph:
    • Assume cost of a break is maximum
    • Scan forward from the word to determine the best cost of a break
    • Update any improved costs and the new breakpoint and length
    • Consider all words ahead of the current until the last word or max_width reached
    • Move back another word and repeat
  • Now go through each breakpoint and output the words between all of them. Including spaces and indentation as necessary

There is only one explicit failure case during execution: If we cannot open the input file


[Back to Project Main Page]