Decoded: du (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of du command (coreutils)

Summary

du - estimate file space usage

[Source] [Code Walkthrough]

Lines of code: 1140
Principal syscall: stat()
Support syscalls: None
Options: 44 (19 short, 25 long)

Descended from du introduced in Version 1 UNIX (1971)
Added to Shellutils in October 1992 [First version]
Number of revisions: 327

Two general disk-usage strategies are to ask the filesystem for accounting metadata or check every file and estimate. The du utility uses the latter strategy via FTS, which calls stat() for every entry. Check out the df utility for the other strategy

Helpers:
  • du_files() - Traverses input files via FTS
  • duinfo_add() - Combines size, node, and time data
  • duinfo_init() - Sets a duinfo struct to start conditions
  • duinfo_set() - Sets the duinfo data to the input sizes/times
  • fill_mount_table() - Adds the dev/inode data to the di_mnt table
  • hash_ins() - Adds a new dev/inode pair if possible. Indicates if pair was already there
  • mount_point_in_fts_cycle() - Checks if a mount point is involved in a cycle
  • print_only_size() - Prints a size in the requested format
  • print_size() - Prints a size and time
  • process_file() - Collects size information for all file entries based on level.
  • show_date() - Displays a date with the requested format
External non-standard helpers:
  • argv_iter() - Increments an iterator (next argv)
  • argv_iter_init_stream() - Returns an argv iterator
  • di_set_alloc() - Allocates a di_set
  • di_set_lookup() - Searches a di_set for an entry and returns true if it exists
  • die() - Exit with mandatory non-zero error and message to stderr
  • error() - Outputs error message to standard error with possible process termination

Setup

The du utility defines and includes a few global helpers and variables before code execution begins. One noteworthy include from gnulib is a hash table for tracking these device/inode pairs (di_set).

du declares two global structs:

  • struct duinfo - Holds directory information
  • struct dulevel - One layer in the directory hierarchy

du uses a unique style in that global variables set by parsing options use a prefix (opt_). The globals include:

  • apparent_size -
  • *di_files - The dev/inode values for input files already processed
  • *di_mnt - The list of dev/inode values for mount points
  • *exclude - The list of files to exclude
  • hash_all - Flag to hash all files for hard link usage
  • human_output_opts - Flag if the sizes are human readable sizes (-h)
  • localtz - The local system timezone
  • max_depth - The number of levels from root to search
  • opt_all - Flag if the user wants all files (-a)
  • opt_count_all - Flag if the user wants link counts (-l)
  • opt_inodes - Flag if inodes should be counted, not blocks (--inodes)
  • opt_nul_terminate_output - Flag if output is NUL separated (-0)
  • opt_separate_dirs - Flag to not summarize subdirectories (-S)
  • opt_threshold - Flag to set a size threshold (-t)
  • opt_time - Flag if the most recent modified date is used (--time)
  • output_block_size - Sets the unit size of blocks
  • prev_level - The last level processed in a file hierarchy
  • print_grand_total - Flag set if the user requests a grand total (-c)
  • *time_style - The style provided directly by the user
  • *time_format - The format for time (--time-style)
  • time_type - The user provided argument for time style
  • tot_dui - The duinfo struct holding the grand total

main() adds a few more locals before starting parsing:

  • ai - The argv iterator for walking the command line file list
  • bit_flags - Sets FTS search rules
  • *cwd_only[] - Array for the current working directly symbol
  • *files_from - The user provided reference file
  • max_depth_specified - Flag is the user provided a max depth (-d)
  • opt_summarize_only - Flag to only use summaries (-s)
  • symlink_deref_bits - Selects the FTS link following rule

Parsing kicks off with the short options passed as a string literal:
"0abd:chHklmst:xB:DLPSX:"


Parsing

Parsing du asks many questions of the user:

  • Should we count all files and directories?
  • Are we interested in the grand total?
  • Should files we processed from the command line or an input file
  • Should the output delimiter be '\0' or '\n'?
  • What is the block size scalar? Should it be human-readable?
  • Should we consider entry size?
  • What is the maximum depth from root we should count?
  • What is the timestamp output format?

Parsing failures

Parsing may fail in several ways:

  • Providing a nonsensical depth (too large)
  • User requests a 0 threshold size
  • User provides an invalid exclude file
  • User requests both a summary and all entry details
  • Providing files on both the command line and in a reference file

User specified parsing failures result in a short error message followed by the usage instructions. Access related parsing errors die with an error message.


Execution

Not all of the work is done with du. The most important tasks pushed off to gnulib involve directory traversal with FTS, which collects file accounting via stat(). The important tasks within du include 'rolling up' the totals following the post-order hierarchy walk.

Invoking stat() for each target file is the key reason why du is much slower than df

. The data sources used from stat() include:
  • st_size - The total size
  • st_mtime - The modified time of the file
  • st_atime - The last access time of the file
  • st_ctime - The creation time of the file
  • st_ino - The inode number of the file
  • st_dev - The device ID of the source
  • st_mode - The access permissions of the file
  • st_nlink - The number of hard links to the file

The process to find and account for this information goes like this:

  • Organize the input files by creating an iterator over argv or file list
  • Create a hash map of device/inode data for input files (useful for cycle detection)
  • For each input file:
    • Test the file for a valid name
    • Open the target file with FTS (invokes stat())
    • Set the baseline file information (size, time, level/depth)
    • If this file is a new level of the hierarchy, either start fresh (lower level) or summarize previously accumulated data (higher level)
    • Print sizes as requested
  • Print a size grand total

Failure cases:

  • Unable to open or close file targets (via FTS)
  • Failure to read input
  • Invalid input files (zero length names)
  • Argument iterator fails
  • Bad STDIN file names
  • FTS throws an error while walking the tree
  • Unable to access directory
  • Bad timestamp

All failures at this stage output an error message to STDERR and return without displaying usage help


[Back to Project Main Page]