Decoded: csplit (coreutils) – MaiZure's Projects

[Back to Project Main Page]

Note: This page explores the design of command-line utilities. It is not a user guide.
[GNU Manual] [POSIX requirement] [Linux man] [FreeBSD man]

Logical flow of csplit command (coreutils)

Summary

csplit - split a file into context-determined pieces

[Source] [Code Walkthrough]

Lines of code: 1527
Principal syscall: write()
Support syscalls: open(), close()
Options: 17 (7 short, 10 long)

Descended from csplit as implemented in System III (1982) (appeared internally as early as 1980)
Added to Fileutils in November 1992 [First version]
Number of revisions: 212

The csplit utility is conceptually simple, but includes many boilerplate functions for reading and buffering input

Helpers:

check_for_offset() - Checks for an offset following the regular expression
check_format_conv_type() - Verifies format flags
cleanup() - Regular cleanup after execution (closes output, restores signals)
cleanup_fatal() - Cleanup after a fatal error and exits failure
clear_line_control() - Initializes a line
close_output_file() - Closes an output file and prints the number of bytes written
create_new_buffer() - Allocates a new buffer
create_output_file() - Creates a new output file stream
delete_all_files() - Deletes all the files that have been created
dump_rest_of_file() - Outputs each input line to the current output stream
extract_regexp() - Finds and verifies regular expression
find_line() - Finds a specific line in a buffer
free_buffer() - Deallocates a memory buffer
get_first_line_in_buffer() - Finds the first unread buffered line
get_format_flags() - Counts the printf format flags
get_new_buffer() - Returns a new buffer, either created or reused
handle_line_error() - Handles EOF cases while reading files
interrupt_handler() - Custom interrupt handler to delete files
keep_new_line() - Buffers a new line
max_out() - Calculates the maximum size of a format (in bytes)
new_line_control() - Create and initialize a line
no_more_lines() - Confirms that no more lines are available (bad comment in source?)
load_buffer() - Fills a buffer with input
make_filename() - Generates a new file name string
new_control_record() - Allocates a new control struct
parse_patterns() - Processes user provided patterns
parse_repeat_count() - Handles repeating command line input
process_line_count() - The split procedure using line number
process_regexp() - The split procedure using regular expression matches
read_input() - Reads a chosen number of bytes from input
record_line_starts() - Scans a buffer for the number of lines
regexp_error() - Handles regular expression errors
remove_line() - Returns the first unread line in a buffer and advances past it
save_buffer() - Add the given buffer to the list of buffers
save_line_to_file() - Outputs a single line to the current output stream
save_to_hold_area() - Saves a partial line to a buffer
set_input_file() - Opens the given file
split_file() - The primary split procedure for the input file
write_to_file() - Writes all buffered lines to the current output stream
xalloc_die() - Error handler for out of memory
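
The buffering helpers above cooperate roughly as follows: input arrives in fixed-size reads, complete lines are recorded, and a trailing partial line is parked in a hold area until the next read completes it. The following is a minimal Python sketch of that idea, not a translation of the C source. The names iter_lines and chunk_size are illustrative assumptions; only hold_area echoes an identifier from the utility.

```python
import io

def iter_lines(stream, chunk_size=8):
    """Yield complete lines from a binary stream read in fixed-size chunks."""
    hold_area = b""                    # partial line carried between reads
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:                  # EOF: flush an unterminated last line
            if hold_area:
                yield hold_area
            return
        data = hold_area + chunk
        lines = data.split(b"\n")
        hold_area = lines.pop()        # text after the last newline (may be empty)
        for line in lines:
            yield line + b"\n"
```

For example, list(iter_lines(io.BytesIO(b"alpha\nbeta\ngam"), 4)) yields the two terminated lines followed by the unterminated tail, even though no single read contains a whole line.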

External non-standard helpers:

die() - Exit with mandatory non-zero error and message to stderr
error() - Outputs error message to standard error with possible process termination

Setup

The csplit utility defines a few global structures to manage input data. These include:

struct buffer_record - A buffer for holding lines and linking to the next buffer
struct control - A compiled regular expression for matching
struct cstring - String information (start pointer and length)
struct line - Buffered line information

csplit uses several global flags and variables, including:

bytes_written - The number of bytes written for the current output file
caught_signals - The caught signal set
control_used - The number of controls (regex patterns)
*controls - The pointer to the first control
current_line - The index of the current line
digits - The number of digits used in the output file name
elide_empty_files - Flag if empty files should be removed
*filename_space - The buffer for output file names (fits largest possible)
files_created - The number of output files created
**global_argv - The list of argv components
have_read_eof - Flag if we've seen EOF
*head - The beginning of the list of buffers
*hold_area - Pointer to the current partially read line
hold_count - The number of bytes in the current partially read line
last_line_number - The last line number in the buffer
*output_stream - The current output file stream
*output_filename - The current output file name
*prefix - The output file name prefix string
remove_files - Flag if files should be removed on error
*suffix - The output file name suffix string
suppress_count - Flag if we do not output the file size in bytes
suppress_matched - Flag if lines matching a pattern should be suppressed from the output

main() adds a few more locals before starting parsing:

prefix_len - The length of the output file name prefix
optc - The next option character to process
max_digit_string_len - The maximum possible size of a file name (includes suffix and name)
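
The file name variables fit together in a simple way: each output name is the prefix followed by a formatted sequence number. A hedged Python sketch of what make_filename() produces is shown here. The parameter names are assumptions chosen for illustration; the defaults ("xx" and two digits) follow the documented csplit defaults.

```python
def make_filename(num, prefix="xx", digits=2, suffix_format=None):
    """Build an output file name: prefix plus a formatted sequence number."""
    if suffix_format is not None:
        # User-supplied printf-style format, e.g. "%03d.txt"
        return prefix + (suffix_format % num)
    # Default: zero-padded decimal of the requested width
    return prefix + "%0*d" % (digits, num)
```

In C the name is written into a preallocated buffer (filename_space), which is why main() must first work out the longest possible name. Python strings grow on demand, so the sketch needs no such measurement.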

Parsing

Parsing breaks down the user-provided options to answer these questions about handling input data:

Should we process all input, or only early lines or digits?
Do the output file names need a prefix or suffix?
Should files with errors be removed?
Should size counts be printed for output files?

Parsing failures

These failure cases are explicitly checked:

User provides an invalid count number
Not enough operands (need at least an input file and a matching pattern)
Unknown option used

User-specified parsing failures result in a short error message followed by the usage instructions. Access-related parsing errors die with an error message.
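
To make the pattern operands concrete, here is a simplified Python model of what parse_patterns() has to recognize: /REGEXP/[OFFSET], %REGEXP%[OFFSET], a line number, and a {N} or {*} repeat count attached to the preceding pattern. The dictionary keys and error messages are assumptions made for this sketch, and Python's re syntax stands in for the POSIX regular expressions the utility compiles.

```python
import re

def parse_patterns(operands):
    """Turn pattern operands into control records (a simplified model)."""
    controls = []
    for arg in operands:
        if arg.startswith(("/", "%")):
            delim = arg[0]
            end = arg.rfind(delim)             # position of the closing delimiter
            if end == 0:
                raise ValueError(f"{arg}: closing delimiter '{delim}' missing")
            offset = int(arg[end + 1:] or 0)   # optional +N / -N after the regexp
            controls.append({"regexp": re.compile(arg[1:end]), "offset": offset,
                             "ignore": delim == "%", "repeat": 0})
        elif arg.startswith("{") and arg.endswith("}"):
            if not controls:
                raise ValueError(f"{arg}: repeat count needs a preceding pattern")
            count = arg[1:-1]
            # None stands for "{*}": repeat until the input runs out
            controls[-1]["repeat"] = None if count == "*" else int(count)
        else:
            lineno = int(arg)                  # ValueError if not a number
            if lineno <= 0:
                raise ValueError(f"{arg}: line number must be greater than zero")
            controls.append({"lines_required": lineno, "repeat": 0})
    return controls
```

An invalid count or a malformed pattern surfaces as ValueError here; the utility reports the same class of problem as a parsing failure.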

Execution

csplit is one of several utilities that define a custom signal handler. Here the handler deletes any output files already created if the utility is interrupted. The handler is installed after the input stream is set up and just before the split operation begins.
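
The handler pattern described above can be sketched in Python as follows. This is an approximation under stated assumptions: created_files, install_handlers and restore_handlers are illustrative names, the signal set is reduced to SIGINT and SIGTERM, and signal.raise_signal() requires Python 3.8 or later.

```python
import os
import signal

created_files = []     # names of output files created so far
remove_files = True    # cleared when the user asks to keep files

def delete_all_files():
    """Remove every output file created so far, if removal is enabled."""
    if not remove_files:
        return
    for name in created_files:
        try:
            os.unlink(name)
        except FileNotFoundError:
            pass
    created_files.clear()

def interrupt_handler(signum, frame):
    """Clean up, then let the signal take its default action."""
    delete_all_files()
    signal.signal(signum, signal.SIG_DFL)   # restore the default handler
    signal.raise_signal(signum)             # re-deliver so the process ends normally

def install_handlers(signals=(signal.SIGINT, signal.SIGTERM)):
    """Install the handler; return the previous handlers for later restoration."""
    return {s: signal.signal(s, interrupt_handler) for s in signals}

def restore_handlers(previous):
    """Put back the handlers that were active before install_handlers()."""
    for s, handler in previous.items():
        signal.signal(s, handler)
```

Re-raising the signal after cleanup, rather than exiting directly, lets the parent process see that the utility was terminated by a signal.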

The utility has two primary 'working' functions, process_regexp() and process_line_count(), which extract the file sections based on regular expression matches or line number respectively.
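
The two working functions differ only in how they locate the cut point. The Python sketch below models that difference over an in-memory list of lines. It borrows the function names for readability, but the signatures, the 0-based start index and the skip_first flag are assumptions of this model rather than details taken from the source.

```python
import re

def process_line_count(lines, start, target_line):
    """Cut just before the 1-based line number target_line."""
    cut = target_line - 1
    if cut < start:
        raise ValueError("line number is smaller than the preceding one")
    if cut > len(lines):
        raise ValueError("line number out of range")
    return lines[start:cut], cut

def process_regexp(lines, start, regexp, offset=0, skip_first=False):
    """Cut just before the first matching line, shifted by offset lines."""
    # skip_first models a repeated pattern: the line that opened the current
    # section is not searched again, so the same line is not matched twice.
    first = start + 1 if skip_first else start
    for i in range(first, len(lines)):
        if regexp.search(lines[i]):
            cut = i + offset
            if not start <= cut <= len(lines):
                raise ValueError("offset puts the cut out of range")
            return lines[start:cut], cut
    raise LookupError("match not found")
```

Both functions return the section to write plus the index of the first unwritten line, which becomes the start of the next section.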

The overall process for csplit looks like this:

Measure the output file names (length)
Check and set the input file name
Gather the section patterns, which may be regex or line numbers
Add a custom signal handler to the usual interrupting signals to delete any created files (and call into the default handler)
Check the next pattern and pass to the appropriate processor (regex or line)
Create the output file for that pattern.
Process all matches (write to output file)
Discard the remaining input
Close the output file
Move to the next pattern and repeat the matching sequence
Create a generic output file for unmatched input
Write unmatched input to the final file
Close final file
Restore default signal handlers
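
The steps above can be condensed into a short Python driver. This is a sketch under several simplifying assumptions: the input is already a list of text lines, patterns are either 1-based line numbers or compiled regular expressions (no offsets or repeat counts), signal handling is omitted, and byte counts are returned rather than printed. The parameter names are illustrative.

```python
import os
import re

def split_file(lines, patterns, outdir, prefix="xx", digits=2, keep_files=False):
    """Write one file per pattern plus a final file for the remaining input.

    Returns a list of (file name, bytes written) pairs.
    """
    created, report, start = [], [], 0

    def write_section(section):
        name = os.path.join(outdir, "%s%0*d" % (prefix, digits, len(created)))
        created.append(name)
        with open(name, "wb") as out:
            written = sum(out.write(line.encode("utf-8")) for line in section)
        report.append((os.path.basename(name), written))

    try:
        for pattern in patterns:
            if isinstance(pattern, int):
                cut = pattern - 1                      # cut before this line number
            else:
                # After the first section, do not re-test the line that
                # opened the current section (a modeling assumption).
                first = start + 1 if created else start
                cut = next((i for i in range(first, len(lines))
                            if pattern.search(lines[i])), None)
                if cut is None:
                    raise LookupError("match not found")
            if not start <= cut <= len(lines):
                raise ValueError("line number out of range")
            write_section(lines[start:cut])
            start = cut
        write_section(lines[start:])                   # unmatched remainder
    except Exception:
        if not keep_files:                             # remove partial results
            for name in created:
                os.unlink(name)
        raise
    return report
```

Unlike this sketch, the utility streams its input through buffers instead of holding every line in memory, and it opens each output file before searching for the next match.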

Failure cases:

Inconsistent patterns provided by the user
Unable to open or close I/O files
Unable to read from input source

All failures at this stage output an error message to STDERR and return without displaying usage help.
