Decoded: csplit (coreutils)


Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of csplit command (coreutils)

Summary

csplit - split a file into context-determined pieces

[Source] [Code Walkthrough]

Lines of code: 1527
Principal syscall: write()
Support syscalls: open(), close()
Options: 17 (7 short, 10 long)

Descended from csplit as implemented in System III (1982) (appeared internally as early as 1980)
Added to Fileutils in November 1992 [First version]
Number of revisions: 212

The csplit utility is conceptually simple, but it includes many boilerplate functions for reading and buffering input.

Helpers:
  • check_for_offset() - Checks for an offset following the regular expression
  • check_format_conv_type() - Verifies format flags
  • cleanup() - Regular cleanup after execution (closes output, restores signals)
  • cleanup_fatal() - Cleans up after a fatal error and exits with failure status
  • clear_line_control() - Initializes a line
  • close_output_file() - Closes an output file and prints the number of bytes written
  • create_new_buffer() - Allocates a new buffer
  • create_output_file() - Creates a new output file stream
  • delete_all_files() - Deletes all the files that have been created
  • dump_rest_of_file() - Outputs each input line to the current output stream
  • extract_regexp() - Finds and verifies regular expression
  • find_line() - Finds a specific line in a buffer
  • free_buffer() - Deallocates a memory buffer
  • get_first_line_in_buffer() - Finds the first unread buffered line
  • get_format_flags() - Counts the printf format flags
  • get_new_buffer() - Returns a new buffer, either created or reused
  • handle_line_error() - Handles EOF cases while reading files
  • interrupt_handler() - Custom interrupt handler to delete files
  • keep_new_line() - Buffers a new line
  • max_out() - Calculates the maximum size of a format (in bytes)
  • new_line_control() - Creates and initializes a line
  • no_more_lines() - Confirms that no more lines are available (bad comment in source?)
  • load_buffer() - Fills a buffer with input
  • make_filename() - Generates a new file name string
  • new_control_record() - Allocates a new control struct
  • parse_patterns() - Processes user provided patterns
  • parse_repeat_count() - Handles repeating command line input
  • process_line_count() - The split procedure using line number
  • process_regexp() - The split procedure using regular expression matches
  • read_input() - Reads a chosen number of bytes from input
  • record_line_starts() - Scans a buffer for the number of lines
  • regexp_error() - Handles regular expression errors
  • remove_line() - Returns the first unread line in a buffer and advances past it
  • save_buffer() - Add the given buffer to the list of buffers
  • save_line_to_file() - Outputs a single line to the current output stream
  • save_to_hold_area() - Saves a partial line to a buffer
  • set_input_file() - Opens the given file
  • split_file() - The primary split procedure for the input file
  • write_to_file() - Writes all buffered lines to the current output stream
  • xalloc_die() - Error handler for out of memory
External non-standard helpers:
  • die() - Exit with mandatory non-zero error and message to stderr
  • error() - Outputs error message to standard error with possible process termination

Setup

The csplit utility defines a few global structures to manage input data. These include:

  • struct buffer_record - A buffer for holding lines and linking to the next buffer
  • struct control - A single split pattern (a line number or a compiled regular expression for matching)
  • struct cstring - String information (start pointer and length)
  • struct line - Buffered line information
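As a rough illustration of the "start pointer and length" idea behind struct cstring, the sketch below indexes the complete lines in a buffer and reports how many trailing bytes belong to a partial line. The names struct line_index, MAX_LINES, and record_lines() are hypothetical and chosen for this example; the field names and the fixed capacity are assumptions, not the layout used in the csplit source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A string described by a start pointer and a length (not NUL-terminated). */
struct cstring {
    const char *str;
    size_t len;
};

/* Hypothetical fixed-capacity index, used here only to keep the sketch short. */
#define MAX_LINES 64

struct line_index {
    struct cstring lines[MAX_LINES];
    size_t used;                    /* number of complete lines recorded */
};

/* Scan buf[0..size) and record every newline-terminated line.
   Returns the number of trailing bytes that were not indexed, either
   because they do not yet form a complete line or because the index
   is full. A caller would keep those bytes until more input arrives. */
static size_t record_lines(struct line_index *idx, const char *buf, size_t size)
{
    const char *start = buf;
    const char *end = buf + size;

    assert(idx != NULL && buf != NULL);
    idx->used = 0;
    while (start < end && idx->used < MAX_LINES) {
        const char *nl = memchr(start, '\n', (size_t)(end - start));
        if (nl == NULL)
            break;                  /* partial line: leave it for the caller */
        idx->lines[idx->used].str = start;
        idx->lines[idx->used].len = (size_t)(nl - start) + 1;  /* keep '\n' */
        idx->used++;
        start = nl + 1;
    }
    return (size_t)(end - start);
}
```

Storing pointers into the buffer rather than copies means a line can be written out later without duplicating its bytes, at the cost of keeping the buffer alive for as long as any line in it is unread.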

csplit uses several global flags and variables, including:

  • bytes_written - The number of bytes written for the current output file
  • caught_signals - The caught signal set
  • control_used - The number of controls (regex patterns)
  • *controls - The pointer to the first control
  • current_line - The index of the current line
  • digits - The number of digits used in the output file name
  • elide_empty_files - Flag if empty files should be removed
  • *filename_space - The buffer for output file names (fits largest possible)
  • files_created - The number of output files created
  • **global_argv - The list of argv components
  • have_read_eof - Flag if we've seen EOF
  • *head - The beginning of the list of buffers
  • *hold_area - Pointer to the current partially read line
  • hold_count - The number of bytes in the current partially read line
  • last_line_number - The last line number in the buffer
  • *output_stream - The current output file stream
  • *output_filename - The current output file name
  • *prefix - The output file name prefix string
  • remove_files - Flag if files should be removed on error
  • *suffix - The output file name suffix string
  • suppress_count - Flag if we do not output the file size in bytes
  • suppress_matched - Flag if lines matching a pattern should be suppressed in the output

main() adds a few more locals before starting parsing:

  • prefix_len - The length of the output file name prefix
  • optc - The next option character to process
  • max_digit_string_len - The maximum possible size of a file name (includes suffix and name)
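The role of prefix, digits, and the file name buffer can be sketched as follows. This is a minimal illustration assuming the default naming scheme (prefix "xx" and two digits, giving xx00, xx01, and so on); make_name() is a hypothetical helper, not the make_filename() function from the source, and it ignores the custom suffix format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<prefix><zero-padded number>" into out.
   Returns 0 on success, or -1 if the name does not fit in out_size bytes. */
static int make_name(char *out, size_t out_size,
                     const char *prefix, int digits, unsigned int num)
{
    int n;

    assert(out != NULL && prefix != NULL && digits >= 0);
    /* "%0*u" takes the field width (digits) from the argument list. */
    n = snprintf(out, out_size, "%s%0*u", prefix, digits, num);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For example, make_name(buf, sizeof buf, "xx", 2, 7) produces "xx07". A number with more digits than the requested width is printed in full, which is why the buffer has to be sized for the largest possible name.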

Parsing

Parsing breaks down the user-provided options to answer these questions about handling input data:

  • How many digits should the output file name counter use?
  • Do the output file names need a prefix or suffix?
  • Should output files be removed when an error occurs?
  • Should size counts be printed for output files?

Parsing failures

These failure cases are explicitly checked:

  • User provides an invalid count number
  • Not enough operands (need at least an input file and a matching pattern)
  • Unknown option used

User-specified parsing failures result in a short error message followed by the usage instructions. Access-related parsing errors die with an error message.
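The "invalid count number" check can be illustrated with the repeat-count operand, which csplit accepts as {N} or {*}. The sketch below is an assumption about how such validation might look; struct repeat and parse_repeat() are hypothetical names and do not reproduce parse_repeat_count() from the source.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct repeat {
    unsigned long count;   /* number of extra repetitions */
    bool forever;          /* true for "{*}" */
};

/* Parse "{N}" or "{*}". Returns 0 on success, -1 if the operand is malformed. */
static int parse_repeat(const char *arg, struct repeat *out)
{
    size_t len;
    char *end;
    unsigned long v;

    assert(arg != NULL && out != NULL);
    len = strlen(arg);
    if (len < 3 || arg[0] != '{' || arg[len - 1] != '}')
        return -1;
    if (len == 3 && arg[1] == '*') {
        out->count = 0;
        out->forever = true;
        return 0;
    }
    if (arg[1] < '0' || arg[1] > '9')
        return -1;                      /* reject signs and whitespace */
    errno = 0;
    v = strtoul(arg + 1, &end, 10);
    if (errno == ERANGE || end != arg + len - 1)
        return -1;                      /* overflow or trailing junk */
    out->count = v;
    out->forever = false;
    return 0;
}
```

Checking that the conversion stops exactly at the closing brace is what rejects inputs such as {5x}; strtoul() alone would accept the leading digits and silently ignore the rest.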


Execution

csplit is one of several utilities that define a custom signal handler. Here the handler deletes any output files the utility has created and is responsible for removing if it is interrupted. The handler is installed after the input stream is set up and just before the split operation begins.

The utility has two primary 'working' functions, process_regexp() and process_line_count(), which extract the file sections based on regular expression matches or line numbers, respectively.
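The regular expression case can be sketched as "find the first line at or after the current position that matches, then apply the optional offset". The example below uses POSIX regcomp() and regexec() as a stand-in for whatever regex interface the source uses, and assumes the input is already an array of NUL-terminated lines; find_split_line() is a hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return the 0-based index of the line where the next piece should start:
   the first line at or after 'start' matching 'pattern', adjusted by 'offset'.
   Returns -1 if the pattern is invalid, nothing matches, or the adjusted
   position falls before 'start' or past the end of the input. */
static long find_split_line(const char *const *lines, size_t n_lines,
                            size_t start, const char *pattern, long offset)
{
    regex_t re;
    long result = -1;
    size_t i;

    assert(lines != NULL && pattern != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)   /* basic regex syntax */
        return -1;
    for (i = start; i < n_lines; i++) {
        if (regexec(&re, lines[i], 0, NULL, 0) == 0) {
            long target = (long)i + offset;
            if (target >= (long)start && target <= (long)n_lines)
                result = target;
            break;                               /* only the first match counts */
        }
    }
    regfree(&re);
    return result;
}
```

The line-number case needs no search at all: the target is simply the requested line, which is why the two paths are handled by separate functions.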

The overall process for csplit looks like this:

  • Measure the output file names (length)
  • Check and set the input file name
  • Gather the section patterns, which may be regex or line numbers
  • Add a custom signal handler to the usual interrupting signals to delete any created files (and call into the default handler)
  • Check the next pattern and pass to the appropriate processor (regex or line)
  • Create the output file for that pattern
  • Process all matches (write to output file)
  • Discard the remaining input
  • Close the output file
  • Move to the next pattern and repeat the matching sequence
  • Create a generic output file for unmatched input
  • Write unmatched input to the final file
  • Close final file
  • Restore default signal handlers
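The pattern loop above can be sketched for the line-number case as a planning step that computes how many lines each output piece receives. This is a simplified model, assuming 1-based line numbers where a pattern N means "start a new piece at line N" and treating any non-increasing or out-of-range number as an inconsistent pattern; plan_pieces() is a hypothetical function and the exact edge-case rules of the real utility may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the number of lines in each output piece.
   splits[] holds n_splits line-number patterns; piece_lines[] must have
   room for n_splits + 1 entries (the last one is the unmatched remainder).
   Returns the number of pieces, or -1 if the patterns are inconsistent. */
static int plan_pieces(size_t total_lines, const size_t *splits,
                       size_t n_splits, size_t *piece_lines)
{
    size_t current = 1;   /* first line not yet assigned to a piece */
    size_t prev = 0;      /* previous split line, for the ordering check */
    size_t i;

    assert(splits != NULL && piece_lines != NULL);
    for (i = 0; i < n_splits; i++) {
        if (splits[i] <= prev || splits[i] > total_lines)
            return -1;                    /* not increasing, or out of range */
        piece_lines[i] = splits[i] - current;
        current = splits[i];
        prev = splits[i];
    }
    piece_lines[n_splits] = total_lines + 1 - current;   /* final piece */
    return (int)(n_splits + 1);
}
```

With 25 input lines and the patterns 10 and 20, the plan is three pieces of 9, 10, and 6 lines, which corresponds to lines 1-9, 10-19, and 20-25.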

Failure cases:

  • Inconsistent patterns provided by the user
  • Unable to open or close I/O files
  • Unable to read from input source

All failures at this stage output an error message to STDERR and return without displaying usage help.

