Decoded: dd (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of dd command (coreutils)

Summary

dd - convert and copy a file

[Source] [Code Walkthrough]

Lines of code: 2524
Principal syscalls: read(), write()
Support syscalls: fstat(), fsync(), ftruncate(), freopen(), fadvise()
Options: 0. dd uses OS/360-style operands instead, and there are 13 of them.

Descended from dd introduced in Version 5 UNIX (1974)
Added to Fileutils in October 1992 [First version]
Number of revisions: 332

The complexity naturally leads to a very long list of helper functions:

Helpers:
  • advance_input_after_read_error() - Seeks forward through the input
  • advance_input_offset() - Adds a value to an offset
  • alloc_ibuf() - Allocates the input buffer
  • alloc_obuf() - Allocates the output buffer
  • apply_translations() - Builds the translation table for this invocation
  • cache_round() - Round down the cache length to a buffer size multiple
  • cleanup() - Close I/O as needed
  • copy_simple() - Copy without conversion
  • copy_with_block() - Copy input lines to fixed length output blocks
  • copy_with_unblock() - Copy input lines to variable length output (spaces removed)
  • dd_copy() - Main copy and convert function
  • finish_up() - Prepare to end utility (restore signals, etc)
  • ifd_reopen() - Interruptible file reopen
  • iftruncate() - Interruptible file truncate
  • install_signal_handlers() - Replace default signal handlers
  • interrupt_handler() - Handle a standard signal
  • invalidate_cache() - Clear cache for data already processed
  • iread() - Interruptible read of data from input (signal checks)
  • iread_fullblock() - Wrapper for iread() that reads an entire block
  • iwrite() - Interruptible write of data from buffer (signal checks)
  • maybe_close_stdout() - Decides how to close stdout at exit time
  • multiple_bits_set() - True if input int has multiple asserted bits
  • operand_is() - Verifies that operand is known
  • operand_matches() - Confirms that inputs match operand format
  • nl_error() - Error with newline
  • parse_integer() - Converts a string to a number, applying any multiplier suffix
  • parse_symbols() - Verifies that input symbol is known
  • print_stats() - Prints output statistics
  • print_xfer_stats() - Outputs time and data statistics
  • process_signals() - Handles pending signals
  • quit() - Cleanly exits the utility
  • scanargs() - Parses the operands and sets flags
  • set_fd_flags() - Alter the target file flags
  • siginfo_handler() - Custom INFO signal handler (print stats)
  • skip() - Skip data blocks for the target buffer (read and discard)
  • skip_via_lseek() - Wrapper for lseek to support tape drives
  • swab_buffer() - Byte swap the indicated buffer
  • translate_buffer() - Applies the active conversion to the target buffer
  • translate_charset() - Builds the active conversion table
  • write_output() - Pushes the output buffer to target output

Setup

The dd utility begins by defining a few global arrays and enums. Some key players are:

Global Arrays:

  • conversions[] - Conversion types for arguments
  • flags[] - Flag values for I/O buffers
  • statuses[] - Status argument values
  • ascii_to_ebcdic[] - An indexed conversion table (in octal)
  • ascii_to_ibm[] - An indexed conversion table (in octal)
  • ebcdic_to_ascii[] - An indexed conversion table (in octal)

Enums follow these arrays, including conversion flag values, status values, I/O flag values, and human-readable output values. These are implemented as plain values or bitfields depending on usage.

After main() is called, the first action is to install custom signal handlers. dd can be a long-running process and benefits from an extra feature: printing I/O statistics via a repurposed signal (SIGUSR1 or SIGINFO, depending on the platform).


Parsing

Parsing dd is different from most coreutils in that there are technically no options. The utility still calls getopt trivially, but the real work is done in scanargs(), which parses the arguments in <symbol>=<value> format. Again, this borrows from the OS/360 syntax.

Some questions answered by parsing the operands:

  • What are the input and output files?
  • What is the conversion type?
  • Where do we begin in the I/O files?
  • How much do we need to convert?
  • Are there special considerations, such as casing, padding, or truncation?
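To make the <symbol>=<value> parsing concrete, here is a rough Python sketch of a scanargs()-style loop. It is an illustration under assumptions: the suffix table is only a subset of the multipliers dd accepts, and scan_operands and parse_size are hypothetical names, not the functions in dd.c.

```python
import re

# Assumption: a subset of dd's multiplier suffixes, for illustration only.
MULTIPLIERS = {
    "": 1, "c": 1, "w": 2, "b": 512,
    "kB": 1000, "K": 1024,
    "MB": 1000 ** 2, "M": 1024 ** 2,
    "GB": 1000 ** 3, "G": 1024 ** 3,
}

NUMERIC_OPERANDS = {"bs", "ibs", "obs", "cbs", "count", "skip", "seek"}
STRING_OPERANDS = {"if", "of", "status"}
LIST_OPERANDS = {"conv", "iflag", "oflag"}


def parse_size(text):
    """Convert a string such as '4K' into a byte or block count."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", text)
    if match is None or match.group(2) not in MULTIPLIERS:
        raise ValueError(f"invalid number: {text!r}")
    return int(match.group(1)) * MULTIPLIERS[match.group(2)]


def scan_operands(argv):
    """Split each argument at the first '=' and store it by operand type."""
    settings = {}
    for arg in argv:
        name, sep, value = arg.partition("=")
        if not sep:
            raise ValueError(f"unrecognized operand {arg!r}")
        if name in NUMERIC_OPERANDS:
            settings[name] = parse_size(value)
        elif name in STRING_OPERANDS:
            settings[name] = value
        elif name in LIST_OPERANDS:
            settings[name] = value.split(",")
        else:
            raise ValueError(f"unrecognized operand {arg!r}")
    return settings
```

For example, scan_operands(["if=in.bin", "bs=4K", "conv=sync,noerror"]) yields the input path, a block size of 4096, and a two-item conversion list.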

Parsing failures

These failure cases are explicitly checked:

  • Unknown operands or values
  • Excessive values
  • Nonsensical flags (e.g. seek on input, skip or count on output)
  • Mutually exclusive flags (block/unblock, lower and upper case, etc)
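The mutually exclusive check pairs naturally with bitfield flags: mask the requested flags down to one exclusive group, then test whether more than one bit survives. The sketch below shows that idea in Python. The flag values and the Conv and check_conversions names are hypothetical; only the one-bit-per-flag layout and the bit trick matter.

```python
from enum import IntFlag


class Conv(IntFlag):
    # Hypothetical values: each conversion owns exactly one bit.
    ASCII = 1 << 0
    EBCDIC = 1 << 1
    IBM = 1 << 2
    BLOCK = 1 << 3
    UNBLOCK = 1 << 4
    LCASE = 1 << 5
    UCASE = 1 << 6


def multiple_bits_set(value):
    """True if more than one bit is set.

    value & (value - 1) clears the lowest set bit, so anything left
    over means at least two bits were set.
    """
    return (value & (value - 1)) != 0


EXCLUSIVE_GROUPS = [
    Conv.ASCII | Conv.EBCDIC | Conv.IBM,
    Conv.BLOCK | Conv.UNBLOCK,
    Conv.LCASE | Conv.UCASE,
]


def check_conversions(requested):
    """Reject any request that picks two flags from the same group."""
    for group in EXCLUSIVE_GROUPS:
        if multiple_bits_set(int(requested & group)):
            raise ValueError("mutually exclusive conversions requested")
```

With this layout, lcase together with block passes the check, while lcase together with ucase is rejected.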

Extra Comments

Here is an example of the OS/360 DD syntax as described on an N74167 punch card in the JCL user guide from 1971:

OS/360 DD syntax

Note that this is just a syntax example: the OS/360 DD statement is completely unrelated to POSIX dd.


Execution

The parser grabbed the conversion details, so now we can set up the translation. This means copying the global conversion tables to the active trans_table[]. In the 21st century, this doesn't get much use (especially the IBM and EBCDIC formats). The trans_table[] was already initialized to the ASCII identity mapping, and it likely stays that way.
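A 256-entry table makes the translation step very cheap: every byte is simply an index into the table. The Python sketch below illustrates the identity table and how a second table can be composed onto it. It uses case conversion as the example table rather than the real EBCDIC data, and build_table is a hypothetical name.

```python
def identity_table():
    """Table in which every byte maps to itself."""
    return bytearray(range(256))


def translate_charset(trans_table, new_table):
    """Compose new_table onto the active table, in place."""
    for i in range(256):
        trans_table[i] = new_table[trans_table[i]]


def build_table(lcase=False, ucase=False):
    """Start from identity and layer on the requested conversions."""
    table = identity_table()
    if ucase:
        translate_charset(table, bytes(range(256)).upper())
    elif lcase:
        translate_charset(table, bytes(range(256)).lower())
    return bytes(table)


def translate_buffer(buf, table):
    """Run every byte of buf through the table."""
    return buf.translate(table)
```

When no conversion is requested, the table stays at identity and translate_buffer() returns the data unchanged, which matches the common case described above.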

Next we open both input and output files, skipping and seeking as necessary. These targets may be standard input and standard output.
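Skipping input has two flavors, as the skip() and skip_via_lseek() helpers suggest: seek when the file supports it, otherwise read and discard. The Python sketch below shows that fallback only. It is a simplification that does not detect seeking past the end of a file, and skip_input is a hypothetical name.

```python
import io


def skip_input(stream, blocks, block_size):
    """Advance past `blocks` blocks; return how many could not be skipped."""
    try:
        # Fast path: a seekable file just moves its offset.
        stream.seek(blocks * block_size, io.SEEK_CUR)
        return 0
    except OSError:
        # Pipes and similar streams cannot seek (io.UnsupportedOperation
        # is an OSError subclass), so fall through to read and discard.
        pass
    remaining = blocks
    while remaining > 0:
        chunk = stream.read(block_size)
        if not chunk:
            break  # hit end of input before skipping everything
        remaining -= 1
    return remaining
```

A nonzero return value tells the caller that the input ran out before the requested skip was complete.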

We're now ready for the core work in the dd_copy() function:

  • Verify that the input skip is usable
  • Verify that the output seek is usable
  • Allocate the input and output buffers
  • Start the main read/convert loop
    • Advance the progress timer
    • Clear the input buffer to avoid stale data
    • Read from input to the input buffer
      • Verify read success and handle errors
      • Handle partial block reads
    • Translate the input buffer (run all bytes through table)
    • Swap byte order, if requested (swap each pair of adjacent bytes)
    • Push block to the output buffer and write to output target
    • Repeat until all data has been read
  • Clean up remaining data
    • Handle dangling odd byte, if any
    • Add padding to fill out any remaining block size
    • End with new line
    • Write the last block
    • Truncate if needed
  • Restore signal handlers
  • Print result status
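The core of the loop above can be sketched as follows. This is a simplified Python illustration, not the dd_copy() implementation: it uses a single block size for input and output, pads short blocks with NULs only when sync is requested, and leaves a trailing odd byte in place instead of carrying it into the next block. The parameter names are assumptions.

```python
import io


def swab(buf):
    """Swap each pair of adjacent bytes; a trailing odd byte stays put."""
    out = bytearray(buf)
    even = len(out) - (len(out) % 2)
    out[0:even:2], out[1:even:2] = buf[1:even:2], buf[0:even:2]
    return bytes(out)


def dd_copy(src, dst, block_size, table=None, swap_bytes=False, sync=False):
    """Copy src to dst block by block; return (full, partial) block counts."""
    full = partial = 0
    while True:
        block = src.read(block_size)
        if not block:
            break  # end of input
        if len(block) == block_size:
            full += 1
        else:
            partial += 1
            if sync:
                # Pad a short block out to the full block size.
                block = block.ljust(block_size, b"\0")
        if table is not None:
            block = block.translate(table)  # run all bytes through the table
        if swap_bytes:
            block = swab(block)
        dst.write(block)
    return full, partial
```

Copying the ten bytes b"abcdefghij" with a block size of 4 returns (2, 1): two full blocks and one partial block, the same pair of counts that dd reports as full and partial records.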

Execution could fail in several ways:

Failure cases:

  • Unable to open input or output targets
  • Unable to truncate output
  • Unable to fstat output target
  • Unable to drop the cache for input or output
  • Unable to sync file caches
  • Unable to set requested file flags
  • Seeking beyond the end of a file
  • Unable to modify target file flags
