Decoded: split (coreutils)

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of split command (coreutils)

Summary

split - split a file into pieces

[Source] [Code Walkthrough]

Lines of code: 1668
Principal syscall: write()
Support syscalls: open(), close(), stat(), execl(), fork()
Options: 26 (10 short, 16 long)

Descended from split introduced in Version 3 UNIX (1973)
Added to Textutils in November 1992 [First version]
Number of revisions: 214

The split utility is the older sibling of csplit with fewer ways to split input

Helpers:
  • bytes_chunk_extract() - The split procedure on byte chunks of input (-b n, k/n)
  • bytes_split() - The split procedure on bytes of input (-b l/)
  • closeout() - Closes the current output file
  • create() - Creates a new output file
  • cwrite() - Write to the output file, possibly creating a new one
  • ignorable() - Tests if an error can be safely ignored (EPIPE when using filters)
  • input_file_size() - Tests input read size and buffers data
  • line_bytes_split() - The split procedure on byte-limited lines (-C)
  • lines_chunk_split() - The split procedure on chunks of input (-0..9, -l)
  • lines_rr() - The split procedure on round-robin lines of input (-b r/)
  • lines_split() - The split procedure on lines of input (-b)
  • next_file_name() - Gets the next output file name, permuting suffix
  • ofile_open() - Opens a list of files, rotating descriptors as needed
  • parse_chunk() - Parses the chunk options passed through -n
  • set_suffix_length() - Constructs the suffix after parsing user input
External non-standard helpers:
  • die() - Exit with mandatory non-zero error and message to stderr
  • error() - Outputs error message to standard error with possible process termination

Setup

The split utility has several modes of operation, which determine the specified split procedure executed. The utility defines types for each mode:

  • type_bytes - Limits output files to a number of bytes of input (-b)
  • type_byteslines - Limits output file to a number of bytes of complete lines of input (-C)
  • type_chunk_bytes - Generates output file count limited to byte chunks(-n n, k/n)
  • type_chunk_lines - Generates a number of files disregarding line separation(-n l/)
  • type_digits - Ultimately becomes type_lines
  • type_lines - Limits output to a number of lines of input (-l)
  • type_rr - Generates a number of files using round-robin ordering (-n r/)
  • type_undef - No user options discovered, defaults to type_lines

split uses several global flags and variables, including:

  • additional_suffix - The suffix to add beyond the default suffix (--additional-suffix)
  • elide_empty_files - Flag to avoid creating empty files (-e)
  • eolchar - The end of line character (-t)
  • filter_command - The command to pipe output to (--filter)
  • filter_pid - The PID of the filtering process
  • in_stat_buf - The stat structure of the input file
  • infile - The name of the input file
  • n_open_pipes - The actual number of open pipes (index to open_pipes)
  • newblocked - The new set of blocked signals
  • oldblocked - The original set of blocked signals
  • open_pipes - The list of open pipes (child processes)
  • open_pipes_alloc - The number of possible pipe allocations
  • outbase - The base name of output files (no suffix)
  • outfile - The name of the output file
  • outfile_mid - The end of the base name in the output file name
  • output_desc - The output file descriptor
  • suffix_alphabet - The index of characters for the suffix
  • suffix_auto - Flag to generate a new suffix, if needed
  • suffix_length - The length of the output file's suffix
  • unbuffered - Flag to immediately move input to output (-u)
  • verbose - Flag to enable extra feedback (--verbose)

main() adds a few more locals before starting parsing:

  • prefix_len - The length of the output file name prefix
  • optc - the next option character to process
  • max_digit_string_len - The maximum possible size of a file name (includes suffix and name)

Parsing

Parsing breaks down the user-provided options to answer these questions about handling input data:

  • How should the input be split across output files?
  • How should we mangle names to separate files (suffix, etc)?
  • Should we use a special separator?
  • Should output be buffered?
  • Should we provide extra feedback to the user?

Parsing failures

These failure cases are explicitly checked:

  • User doesn't provide a useful line/byte value
  • Nonsensical block size provided (undocumented feature)
  • User specifies multiple or empty separating characters
  • Invalid suffix length provided
  • User provides unusable chunk sizes with the -n option
  • Unknown option used

User specified parsing failures result in a short error message followed by the usage instructions. Access related parsing errors die with an error message.


Execution

Although split has a myriad of split modes and options, they all fall in to two strategies: 1) Control the input read size and generate as many output files as needed; or 2) Control the number of output files and read chunks divided among the files. Also, when the output is to be sent to another command for filtering, then both signal handlers and I/O streams need to change to consider the command to run prior to output.

Execution involves six distinct procedures, each of which deserves its own discussion. I'm going to save that for the line-by-line walkthrough and discuss only common operations. Here's the general idea:

  • Set the suffix length for the output files using the user-provided options
  • Open the input file as binary read-only
  • stat() the input file for size estimated
  • Test reading from the file for chunk reading cases
  • Reset the signal handlers if we're using a filter to handle incident SIGPIPEs
  • Invoke the desired procedure based on user requirements:
    • lines_split()
    • bytes_split()
    • line_bytes_split()
    • lines_chunk_split()
    • bytes_chunk_extract()
    • lines_rr()
  • Each procedure is unique, but generally do the following:
    • Opens an output file, possibly redirecting to another filter process via fork()/execl()
    • Writes from input to output pipe
    • Checks when a new file is required and opens with the same properties and a new suffix
  • Close the input
  • Close all output files

Failure cases:

  • Any errors reading or writing to file descriptors/streams
  • Unable to open or close I/O files
  • Unable to read from input source
  • Unable to set environment for filter
  • Unable to invoke filter command
  • Failure in filter child process
  • Unable to create/truncate a new file
  • Unable to generate a new suffix

All failures at this stage output an error message to STDERR and return without displaying usage help


[Back to Project Main Page]