Decoded: The top utility (procps-ng)

March 2019

Logical layout of the top utility (procps-ng)

The table of processes (top) utility is a task manager that's been around in some form since the 1980s. My introduction to it in that era came from a system administrator who described it as "The hi-tech 'ps' alternative". The utility has been rewritten over the years as various *nix systems have come and gone. Today, most Linux distros pull top from the procps-ng package. This is the version I'll focus on in this project.

My goal is to make this resource useful for seasoned developers. I'll highlight important OS interactions since these are applicable to many current (and future!) utilities. This resource cannot substitute for reading the code, but it can save you a lot of time on your first attempt.

The main topics include:

Data Sources
Signals
Data Structures
Execution
Optimization
Namespace (Globals, Functions, Callback macros)

Snapshot

[Source] [Code Walkthrough]

Version: procps-ng 3.3.15+ (NEXT - commit: dbe12b54 [Jan 2019])
Lines of code: 6522
Functions: 115
Callback macros: 54

Full Disclosure: I am not a procps contributor -- all credit goes to James C. Warner
Defer to the official Documentation or the development community for any conflicting information

Part 0 - Introductory Presentation

I've put together a ~10 minute presentation that cuts across the utility by example. We trace the load averages from kernel, through /proc, into top, and to the screen.

Part 1 - Data Sources

top pulls data (almost) entirely from the /proc file system. Indirectly, /proc pulls from many data structures and global variables around the kernel. In my opinion, understanding the OS interactions of 'top' is possibly the most important idea that extends to other projects.

I've itemized the rougly 75 data sources used in top. At a minimum, I've included the /proc data source and associated 3.10 kernel declarations. Most summary kernel statistics are declared as globals within respective subsystems and exported (symbol made visible) to other subsystems.

For completeness, this strace of top definitively identifies all interactions with the OS on my system for the initial update frame. I ran top and pressed 'q' after about 1 second. Picking apart an strace is always very enlightening and worthly of an article by itself. I'll save that for another day.

Summary Information

Data	/proc	Kernel
CPU idle time	/proc/stat	Scheduler (dec, def)
CPU hw-int time	/proc/stat	Scheduler (dec, def)
CPU sw-int time	/proc/stat	Scheduler (dec, def)
CPU kernel time	/proc/stat	Scheduler (dec, def)
CPU nice time	/proc/stat	Scheduler (dec, def)
CPU steal time	/proc/stat	Scheduler (dec, def)
CPU user time	/proc/stat	Scheduler (dec, def)
Load averages	/proc/loadavg	Scheduler (dec/def)
Mem buffers	/proc/meminfo	File System (derived)
Mem free	/proc/meminfo	Memory Manager (dec/def)
Mem total	/proc/meminfo	Memory Manager (dec/def)
Mem used	N/A (derived)	N/A
Swap free	/proc/meminfo	Memory Manager (dec)
Swap total	/proc/meminfo	Memory Manager (dec)
Swap used	N/A (derived)	N/A
Task Counts	N/A (derived)	N/A
Uptime	/proc/uptime	System Timer (derived)
User count	N/A (utmp file)	N/A

Task Information

Most per-task data are available through the stat, statm, and status entries under /proc/[pid#]. You can use the retrieval function mappings to trace back most process fields. The kernel display functions for the big three are:

stat - do_task_stat()
statm - proc_pid_statm()
status - proc_pid_status()

For process data within the kernel, most roads lead back to the task_struct.

Data	/proc	Kernel
%CPU	/proc/#/stat (derived)	task_struct (dec)
%MEM	/proc/#/statm	task_struct (dec) -> mm.rss_stat (dec)
CGNAME	/proc/#/cgroup (derived)	Control Group (dec)
CGROUPS	/proc/#/cgroup	Control Group subsys (dec)
CODE	/proc/#/statm	task_struct (dec) -> mm (dec, def)
COMMAND	/proc/#/stat /proc/#/cmdline	task_struct (dec) -> mm (dec)
DATA	/proc/#/statm	task_struct (dec) -> mm (dec, def)
ENVIRON	/proc/#/environ	task_struct (dec) -> mm (dec)
Flags	/proc/#/stat	task_struct (dec)
GID	/proc/#/status	task_struct (dec) -> cred (dec)
GROUP	/proc/#/status (derived)	N/A (GID lookup, see above)
LXC	N/A (derived)	N/A
nDRT	/proc/#/statm	N/A (Forced to zero in 3.10)
NI	/proc/#/stat	task_struct (dec)
nMaj	/proc/#/stat	task_struct (dec) -> signal (dec)
nMin	/proc/#/stat	task_struct (dec) -> signal (dec)
nTh	/proc/#/status	task_struct (dec) -> signal (dec)
nsIPC	/proc/#/ns/ipc	task_struct (dec) -> namespace (dec)
nsMNT	/proc/#/ns/mnt	task_struct (dec) -> namespace (dec)
nsNET	/proc/#/ns/net	task_struct (dec) -> namespace (dec)
nsPID	/proc/#/ns/pid	task_struct (dec) -> namespace (dec)
nsUSER	/proc/#/ns/user	task_struct (dec) -> namespace (dec)
nsUTS	/proc/#/ns/uts	task_struct (dec) -> namespace (dec)
NU	/proc/#/stat (derived)	task_struct (dec), x86 thread_info (dec)
OOMa	/proc/#/oom_adj	Derived from MM
OOMs	/proc/#/oom_score	task_struct (dec) -> signal (dec)
P	/proc/#/stat (derived)	task_struct (dec), x86 thread_info (dec)
PGRP	/proc/#/stat	task_struct of group_leader(dec)
PID	/proc	task_struct (dec)
PPID	/proc/#/stat /proc/#/status	task_struct of parent (dec)
PR	/proc/#/stat	task_struct (dec)
RES	/proc/#/statm	task_struct (dec) -> mm.rss_stat (dec)
RSan	/proc/#/status	Not in 3.10
RSfd	/proc/#/status	Not in 3.10
RSlk	/proc/#/status	task_struct (dec) -> mm (dec)
RSsh	/proc/#/status	Not in 3.10
RUID	/proc/#/status	task_struct (dec) -> cred (dec)
RUSER	/proc/#/status (derived)	N/A (RUID passwd lookup)
S	/proc/#/stat /proc/#/status	task_struct (dec)
SHR	/proc/#/statm	task_struct (dec) -> mm.rss_stat (dec)
SID	/proc/#/stat	task_struct (dec)
SUID	/proc/#/status	task_struct (dec) -> cred (dec)
SUPGIDS	/proc/#/status	task_struct (dec) -> cred (dec)
SUPGRPS	/proc/#/status (derived)	N/A (SUPGID passwd lookup)
SUSER	/proc/#/status (derived)	N/A (SUID passwd lookup)
SWAP	/proc/#/status	task_struct (dec) -> mm.rss_stat (dec)
TGID	/proc/status	task_struct (dec)
TIME	/proc/#/stat	task_struct (dec)
TIME+	/proc/#/stat	task_struct (dec)
TPGID	/proc/#/stat	task_struct (dec) -> signal (dec) -> tty (dec)
TTY	/proc/#/stat	task_struct (dec) -> signal (dec) -> tty (dec)
UID	/proc/#/status	task_struct (dec) -> cred (dec)
USED	/proc/#/status (derived)	N/A
USER	/proc/#/status (derived)	N/A
VIRT	/proc/#/statm	task_struct (dec) -> mm (dec)
vMj	/proc/#/stat	task_struct (dec) -> signal (dec)
vMn	/proc/#/stat	task_struct (dec) -> signal (dec)
WCHAN	/proc/#/wchan	Tracing fp & ip (def)

Part 2 - Signals

Signal handling in top (procps)

top rewires default signals to several custom handlers for two purposes:

The window may have been resized either directly or while paused. We need a custom signal handler to set the global flag to force a full window and task field recalculation on the next update. Signals and handlers affected:
- sig_resize() - SIGWINCH, SIGCONT
Since top overrides terminal settings at start up, the original settings need to be restored if execution is paused or terminated. Otherwise, the terminal may be in an unusable state after returning to the shell. The map of signals to handlers is:
- sig_endpgm() - SIGALRM, SIGHUP, SIGINT, SIGPIPE, SIGQUIT, SIGTERM, SIGUSR1, SIGUSR2
- sig_paused() - SIGSTP, SIGTTIN, SIGTTOU
- sig_abexit() - All others

Here's an annotated look at the code for all four signal handlers:

static void sig_endpgm (int dont_care_sig) {
   sigset_t ss;

   sigfillset(&ss);              /* Blocking all signals */
   sigprocmask(SIG_BLOCK, &ss, NULL);
   Frames_signal = BREAK_sig;    /* Flag to force full terminal recompute */
   bye_bye(NULL);                /* Full graceful exit (See Part 6) */
   (void)dont_care_sig;          /* Silence compiler warnings */
}

static void sig_abexit (int sig) {
   sigset_t ss;

   sigfillset(&ss);          /* Blocking all signals */
   sigprocmask(SIG_BLOCK, &ss, NULL);
   at_eoj();                 /* Restore TTY and show debug log */
   fprintf(stderr, N_fmt(EXIT_signals_fmt)
      , sig, signal_number_to_name(sig), Myname);
   signal(sig, SIG_DFL);     /* Redirect to default */
   raise(sig);               /* Execute default */
   _exit(sig | 0x80);        /* Error exit */
}

static void sig_resize (int dont_care_sig) {
   tcdrain(STDOUT_FILENO);      /* Clear STDOUT buffer*/
   Frames_signal = BREAK_sig;   /* Flag to force full terminal recompute */
   (void)dont_care_sig;         /* Silence compiler warnings */
}

static void sig_paused (int dont_care_sig) {

   /* Restore Original TTY */
   if (-1 == tcsetattr(STDIN_FILENO, TCSAFLUSH, &Tty_original))
      error_exit(fmtmk(N_fmt(FAIL_tty_set_fmt), strerror(errno)));
   if (keypad_local) putp(keypad_local);
   putp(tg2(0, Screen_rows));  /* Move to bottom left */
   putp(Cap_curs_norm);        /* Normalize cursor format */
#ifndef RMAN_IGNORED
   putp(Cap_smam);             /* Enable right margin */
#endif
   fflush(stdout);             /* Push STDOUT buffer */
   raise(SIGSTOP);             /* Block here until SIGCONT is received */
   
   /* Restore top's expected TTY configuration */
   if (-1 == tcsetattr(STDIN_FILENO, TCSAFLUSH, &Tty_raw))
      error_exit(fmtmk(N_fmt(FAIL_tty_set_fmt), strerror(errno)));
#ifndef RMAN_IGNORED
   putp(Cap_rmam);             /* Disable right margin */
#endif
   if (keypad_xmit) putp(keypad_xmit);
   putp(Cursor_state);         /* Restore cursor settings */
   Frames_signal = BREAK_sig;  /* Flag to force full terminal recompute */
   (void)dont_care_sig;        /* Silence compiler warnings */
}

Part 3 - Data Structures

top relies on several structures to organize and cache data needed during computation. I'll separate the data structure discussion in to two groups: Those defined exclusively within top, and those included from external sources (i.e. C/POSIX standard structures). I'll focus more on the top-specific structures since they are unique, but the external standard structures are generally more useful to understand.

top-specific structures

These are the data structures defined and used exclusively within top

Fields [FLD_t]

The FLD_t structure to track task list fields

A field structure represents a single column of information in the task list. All possible fields are defined as a global array and accessed by the task window.

Three of five members deal directly with data display:

width - The character count of the field. -1 is variable, usually text. 0 is computed at runtime.
scale - Represents the magnitude of the number (Kb, Mb, Gb, etc)
align - Visual alignment, left or right

The other two members have specialized usage:

sort - The sort function used to order by this fields. (See part 6)
lflag - The /proc source (usually stat, status, statm, or specialized)

Task Window [WIN_t]

The WIN_t structure to track window data

The task windows are the most complex custom structure in top, performing many roles from holding configurations, referencing accessing many globals, tracking screen dimensions, and caching process data. top defines four Windows as globals.

Important members to know:

**ppt - The proc_t array used by the window
pflgsall[] - Tracks fields being displayed in the window
procflgs[] - Holds behavior flags for the active fields
rc - Window configuration specifics (see RCW_t below)

Window view management variables:

begnext - The offset to apply to task list on next update
begpflg - The first visible point in pflgsall[]
begtask - The first task visible in ppt[]
endpflg - The visible ending point in pflgsall[]
usrselflg - Flag for include or exclude (e)uid matches
usrseltyp - The type of user to match (uid or euid)
usrseluid - The chosen user id to match
varcolsz - The cap on variable width columns
winlines - The number of lines for the window

Others members:

cap* - Terminal capabilities for this window
grpname - A string representing window number:name
*next - Pointer to the next window
*prev - Pointer to the last window

Window Configuration [RCW_t]

The RCW_t structure to track window configuration

Each window has an associated configuration structure (rc) built from both user settings and an external file. This structure is critical for tracking 'live' fields in the current window.

The three most important members to know are:

fieldscur[] - The ordered array of active fields in the window
sortinx - The field enum within Fieldstab[] to sort on
winflags - The window modes including view/sort details

Other members include:

graph_cpus - The alternate cpu display values for 't'
graph_mems - The alternate memory display values for 'm'
headclr - Header color
maxtasks - The user-provided maximum tasks to
msgsclr - Message color
summclr - Summary color
taskclr - Task color
winname[] - The user-provided window name

General Configuration [RCF_t]

The RCF_t structure to track configuration file data

This structure supports the external configuration file and data not used directly by a window.

The main members include:

id - The rc file version
win_index - The index to the global Winstk for the current window
win - Local references to all window config data

The remaining members track settings:

delay_time - Desired delay time between updates, 'd'
fixed_widest - Extra space for fields, 'X'
mode_altscr - Alternate display mode, 'A'
mode_irixps - CPU display mode, 'I'
summ_mscale - Summary memory scaling 'I'
task_mscale - Process memory scaling 'e'
zero_suppress - Suppress zeroes for scaled values '0'

History [HST_t]

The HST_t structure to track task history

top tracks a useful varaibles between frames, notably CPU ticks and fault counts. The members are:

lnk - Integer index to the next HST in the chain
maj - Major fault count
min - Minor fault count
pid - The associated PID
tics - The tick count for this entry

CPU History [CT_t, CPU_t]

The CT_t and CPU_t structures to track CPU usage

We need to keep track of some CPU data, especially the changes between frames. These two structures work together manage updates:

The CPU_t structure, one per CPU

cur - The current frame's tick data (CT_t)
id - The CPU number
node - The numa node of the CPU
sav - The previous frame's tick data (CT_t)

The CT_t (CPU tick tracker), two per CPU (this update and the last update). These data are provided by /proc/stat and correspond to tick counts in the following modes:

i - Idle time
n - Nice-invoked process time
s - Kernel time
u - Userland time
w - I/O blocked time
x - HW interrupt time
y - SW interrupt time
z - Virtual steal time (hypervisor block time)

Other structures included in top

These structures are used in top but defined elsewhere. These are useful to know since they're used throughout the OS/Kernel.

pid_t - Required by POSIX, thus typedef'd in the kernel as int
proc_t - Defined within procps to hold all data about a process
PROCTAB - Defined by procps to track persistent /proc information
sigset_t - Required by POSIX, and defined in the kernel as an architecture-specific type array
termios - The POSIX definition of the terminal, specifically the I/O and control modes.

Part 4 - Execution

High level execution in in top (procps)

Execution appears simple at a high level. top begins with basic initialization, including reading configurations from files and command line flags, then setting up the terminal. The main loop continuously pulls input from /proc, formatting it, and sending results to the terminal. The program will only exit if the user pressed 'q', the user configured a fixed number of updates, or the operating system sends a disruptive signal. Let's dig into initialization, the main loop, and the exit procedure.

Let's break down each of these steps.

Initialization

int main (int dont_care_argc, char **argv) {
   (void)dont_care_argc;
   before(*argv);
                  
   wins_stage_1();
   configs_reads();
   parse_args(&argv[1]);
   whack_terminal();
   wins_stage_2();

...to main loop...

(void)dont_care_argc;
Suppresses two compiler warnings: Unused argument (by using it) and statement with no effect (by void cast).

before(*argv);
Performs important preparatory work such as verifying /proc mount, counting CPUs, and rewiring signal handlers.

wins_stage_1();
Basic window preparation. Sets up the capabilities table, sets up a logical window ring, and sets the current window to the first.

configs_reads();
Processes external configuration files (rc files)

parse_args(&argv[1]);
Handles command line arguments

whack_terminal();
Clobbers the terminal in preparation for displaying the top utility. Saves old settings, adjusts flags, and sets buffering

wins_stage_2();
Finishes window configuration by defining terminal capabilities (global and window) and initially defining fields

The Big Loop

The top utility main loop

top will churn through the main loop until either the user or the operating system forces an exit. I'll focus on the three operations pictured above and highlighted in the lines below.

...in main(), after initialization...

  for (;;) {
     struct timespec ts;

     frame_make();                 /* Builds and displays the next update */

     if (0 < Loops) --Loops;       /* Controls the user-specified */
     if (!Loops) bye_bye(NULL);    /* automatic exit counter (-n) */

     ts.tv_sec = Rc.delay_time;    /* Set up delay time between frames */
     ts.tv_nsec = (Rc.delay_time - (int)Rc.delay_time) * 1000000000;

     if (Batch)                    /* Batch mode operation (-b) */
        pselect(0, NULL, NULL, NULL, &ts, NULL);
     else {
        if (ioa(&ts))              /* Wait - pselect() within ioa() */
           do_key(iokey(1))        /* Handles keyboard input */;
     }
  }

Let's dig in to the highlighted operations: Building a single output frame, waiting for the next update, and processing input.

Building a Frame

Building a single frame in top

An output frame requires preperatory computations. In the interest of optimization, the actual computation sequence depends on if we need a full update (such as recomputing screen dimensions), or only a limited update if most elements are unchanged.

At a minimum, preparing a frame will always:

Read system information via sysinfo_refresh(). This includes the CPU counts from /proc/stat and memory information from /proc/meminfo. These data changed on each frame and are always referenced.
Refresh process information via procs_refresh(). Updates relevant pointers in windows.

Sometimes we need full updates such as when this is the first frame displayed after running top or when the user resizes the terminal window. Full updates include the previous operations, plus:

Recalculating the field tables for each window in zap_fieldstab().
Rechecking the CPU counts via cpus_refresh().

Displaying data on the terminal
We draw the screen one line at a time, from top to bottom. Functionally, there are three distinct sections:

Summary - The summary area occupies the first several lines and the procedure is summary_show()
Message - A single line after the summary, which is simply cleared using a termcap. Actually messages are added interactively.
Task Window(s) - The list of tasks and associated fields. There may be up to four windows, but the default is one. The procedure is window_show()

Let's trace the retrieval, construction, formatting, and output of a single line of data through top. I'll try to distill the bottom line and show a single example that can extend to all screen elements. The anti-climactic punchline is:

At the lowest level, we read() from a /proc source, format data to a buffer, and putp() the buffer to the terminal with help from curses.

If you blink, you make miss this simple process so let's look at how this happens with load average in the summary display.

The load average appears at the top of the screen, so we expect the action to be near the beginning of the procedure. Here are the first 15 lines of summary_show():

static void summary_show (void) {
 #define isROOM(f,n) (CHKw(w, f) && Msg_row + (n) < Screen_rows - 1)
 #define anyFLG 0xffffff
   WIN_t *w = Curwin;
   char tmp[MEDBUFSIZ];
   int i;

   /* Display Uptime and Loadavg */
   if (isROOM(View_LOADAV, 1)) {
      if (!Rc.mode_altscr)
         show_special(0, fmtmk(LOADAV_line, Myname, sprint_uptime(0)));
      else
         show_special(0, fmtmk(CHKw(w, Show_TASKON)? LOADAV_line_alt : LOADAV_line
            , w->grpname, sprint_uptime(0)));
      Msg_row += 1;
   } 
...remaining summary_show() procedure...

The critical line includes the show_special() function that uses two layers of argument evaluations. This line encapsulates the entire process of reading, formatting, and writing. The call stack (including macros) for the I/O process looks like this:

Reading - Load average is pulled from /proc/loadavg by sprint_uptime() ultimately using the read() syscall.
- sprint_uptime() - top.c
- loadavg() - sysinfo.c
- FILE_TO_BUF() - sysinfo.c (macro)
- read() - Standard library syscall wrapper
Formatting - The raw result is formatted within fmtmk() to include special terminal characters and returned in a buffer.
- fmtmk() - top.c
- vnsprintf() - Standard library
Writing - show_special() pushes the formatted result to the terminal using a macro and ultimately putp()
- show_special() - top.c
- PUFF() - top.h (macro)
- putp() - curses

This is the call stack (plus function-like macros) for reading and writing the load average

Waiting (Blocking)

After we've built a single frame, the next task in the main loop is to pause operation until it's time for a new frame update OR the user presses a key. The pselect() syscall blocks until either a time delay passes or user input occurs. Let's look at the prototype, and then the usage:

int pselect(int nfds,                     /* Descriptor count to check */
            fd_set *restrict readfds,     /* Reading check descriptor set */
            fd_set *restrict writefds,    /* Writing check descriptor set */
            fd_set *restrict errorfds,    /* Err condition descriptor set */
            timespec *restrict timeout,   /* Delay timeout  */
            sigset_t *restrict sigmask);  /* Signal mask to apply */

Under normal (non-batch) operation, the pselect() syscall is invoked within the ioa() function as shown here:

static inline int ioa (struct timespec *ts) {
   fd_set fs;
   int rc;

   FD_ZERO(&fs);              /* Clears target file descriptors */
   FD_SET(STDIN_FILENO, &fs); /* Adds STDIN to the target set */

   rc = pselect(STDIN_FILENO + 1, &fs, NULL, NULL, ts, &Sigwinch_set);

   if (rc < 0) rc = 0;
   return rc;
} // end: ioa

The first argument to pselect() is the number of file descriptors to monitor, since STDIN_FILENO is always 0 (by POSIX), this number amounts to 1. We pass STDIN as the descriptor to read, and we also pass the delay time as the computed frame delay from the main loop.

We ignore SIGWINCH since that case is handled automatically by the signal handler and processed during the next scheduled frame update

Input Handling

Any keyboard input unblocks pselect() and do_key() processes the keystroke. This handler is really the first of several possible layers of switched handlers which are organized to group common functions.

static void do_key (int ch) {
   static struct {
      void (*func)(int ch);    /* Handler to execute */
      char keys[SMLBUFSIZ];    /* Keys to associate to the handler */
   } key_tab[] = {
      { keys_global, { '?', 'B', 'd', ... }},   /* Actual key handler map */
      { keys_summary,{ '1', '2', '3', ... }},
      { keys_task, { '#', '<', '>', ... }},
      { keys_window, { '+', '-', '=', ... }}, 
      { keys_xtra, { 'M', 'N', 'P', 'T', '\0' }}
   };
   int i;

   switch (ch) {
      case 'q':             /* Jump to exit routine on 'q' */
         bye_bye(NULL);
      default:              /* Check all keys/handlers */
         for (i = 0; i < MAXTBL(key_tab); ++i) /* Loop through handlers */
            if (strchr(key_tab[i].keys, ch)) { /* Check keys for a match */
               key_tab[i].func(ch);       /* Execute next key handler */
               Frames_signal = BREAK_kbd; /* Force update on next frame */
               goto all_done;     /* Key matched, jump out of loop */
            }
   };
   ...content cut...
}

The next layer of keyboard handlers switch on the handful of matching keys, which eventually execute the proper procedure. Handlers typically call a function, update a global variable, or alter one of the data structures.

Exiting

Exit procedure for top (procps)

All graceful exits call the bye_bye() procedure, which is annotated below. The most important idea is that we restore the original shell terminal before exiting.

static void bye_bye (const char *str) {
   at_eoj();                        /* End of Job -- restore shell TTY */
#ifdef ATEOJ_RPTSTD
  ...debug reports cut..
#endif
   numa_uninit();                   /* (libproc) */
   if (str) {                       /* Error handling */
      fputs(str, stderr);           /* Error reporting */
      exit(EXIT_FAILURE);           /* Error exit */
   }
   if (Batch) fputs("\n", stdout);  /* Batch mode output */
   exit(EXIT_SUCCESS);              /* Standard exit */
}

Part 5 - Optimizations

Reading and writing is easy. Reading and writing only what you need, when you need it, with few instructions and little memory is not. top works hard to keep information flowing under adverse conditions (think a million processes, filtered, and sorted across four windows ever second).

top applies many common sense strategies, for instance:

Caching - Data structures allow us to retain data and computations between updates. For instance, top computes and retains window/field dimenions potentially indefinitely.
Separate/Limit Operations - Identify data that must be read every frame and grab it all at once, such as the process table. Don't read any data that won't be displayed, such as tasks data beyond the window size.
Pointers, not data - Confine updates to pointers and avoid directly manipulating data. Task windows do this with the process tables and field metadata
Small buffers - Large buffers lead to more latency. top reduces the default STDOUT buffer to 2kb at initialization

I'm not going to dig in to those well-known optimizations. Instead, let's focus on top's "Pseudo screen" and companion macros. The idea is related to back-buffering in a graphics pipeline, but this implementation seems unique to top and contributes to smooth operation.

top uses a global buffer called Pseudo_screen to hold the logical contents of the current frame that will be sent to the terminal. The benefits may not be obvious, but consider this:

We can analyze and scribble across a buffer at any time. This means we can make smart decisions about how to handle these data.
Terminals are write-only. Data sent is now on display and making changes means sending more data. It's not cheap (or possible in some cases) to know the state of the terminal.

In short, we can optimize data sent to the terminal via a Pseudo_screen. However, we retain the ability to write directly to the terminal, such as for direct user interactions.

Most interaction with the terminal and Pseudo_screen is done through three macros, depending on the use case:

POOF - Writes directly to the terminal and advances pseudo screen state
PUFF - Writes text data to the pseudo screen which may be optimized before writing to the terminal
PUTT - Writes text or capability data directly to the terminal, ignoring pseudo screen. Used for interactives (ex. message row)

Let's look at each:

#define POOF(str,cap) do { \
   putp(str); putp(cap); \                           /* Direct to tty */
   Pseudo_screen[Pseudo_row * ROWMAXSIZ] = '\0'; \   /* Null-term row */
   if (Pseudo_row + 1 < Screen_rows) ++Pseudo_row; \ /* Advance position */
} while (0)

#define PUFF(fmt,arg...) do { \
   char _str[ROWMAXSIZ]; \
   const int _len = snprintf(_str, sizeof(_str), fmt, ## arg); \
   if (Batch) { \
      char *_eol = _str + (_len < 0 ? 0 : (size_t)_len >= sizeof(_str) ? sizeof(_str)-1 : (size_t)_len); \
      while (_eol > _str && _eol[-1] == ' ') _eol--; *_eol = '\0'; putp(_str); } \
   else if (Pseudo_row >= 0 && Pseudo_row < Screen_rows) { \ /* Visible? */
      char *_ptr = &Pseudo_screen[Pseudo_row++ * ROWMAXSIZ]; \
      if (!strcmp(_ptr, _str)) putp("\n"); \  /* No work to do? */
      else { \
         strcpy(_ptr, _str); \   /* Write to frame */
         putp(_ptr); } } \       /* Write to tty */
} while (0)

#define PUTT(fmt,arg...) do { \
   char _str[ROWMAXSIZ]; \
   snprintf(_str, sizeof(_str), fmt, ## arg); \  /* Format to buffer */
   putp(_str); \                                 /* Write buffer to tty */
} while (0)

Part 6 - Namespace

Getting comfortable with the namespace is an early hurdle when diving in to any new utility. To reduce that chore, I'll break down the various symbols you'll see throughout top.[ch]. This section is really a reference rather than a detailed explanation.

Global variables

Roughly the first 150 lines of code declare and define globals used in many of the functions we'll see below. The most important globals to know are:

Batch - Flag for batch mode when redirecting output to a file. I'm assuming this is off
Cap_* - Terminal capabilities for controlling screen and cursor
Curwin - Pointer to the window currently being processed
Fieldstab[] - The global array describing the possible task list fields
Frames_signal - The global refresh state flag used to alter frame updates (set in signal handlers)
Loops - Limits the number of top updates to this number, assumed -1 for forever
Msg_row - The row number of the message row (row count of summary section + 1)
Pseudo_* - A buffer and descriptors for optimized screen control
Stdout_buf - The actual buffer space for the terminal. Not directly used, but know that it's smaller than default
Tty_original - The terminal configuration prior to top execution
Ttychanged - Flag indicating that top has altered terminal flags
Winstk[] - All of the possible windows to display (default number is 4)

Functions

Here are all the functions defined by top and grouped by purpose with groups ordered by importance. More comprehensive groups are at the top and the boilerplate work is at the bottom:

Screen/Frame

frame_hlp() - Computes the sizes of multiple task windows on screen
frame_make() - Procedure to collate and construct a single frame update
help_view() - Display help procedure
summary_hlp() - Prints various CPU% usages in the summary
summary_show() - Procedure to collate and construct the current summary area
task_show() - Constructs a single line of the task list
window_hlp() - Verifies that the window being processed is visible
window_show() - Displays the entire contents of a single window/task list

Initialization

before() - top initialization procedure (signals, memory, etc)
config_cvt() - Configuration file converter
config_insp() - Validates configuration file inspection entries
config_osel() - Validates configuration file other filters
configs_file() - Top-level configuration file validation procedure
configs_path() - Verifies a configuration file path
configs_reads() - Reads in up to three configuration files
parse_args() - Processes user-provided command-line flags
whack_terminal() - Reconfigurations terminal output for top
wins_stage_1() - Early-stage window configuration settings
wins_stage_2() - Late-stage window configuration settings (after config file processings)

Exit / Interrupt:

at_eoj() - Restores the original terminal state prior to top
bye_bye() - Application exit procedure. Restores TTY and returns exit status
error_exit() - Formats exit error message
sig_abexit() - Signal handler for abnormal exit
sig_endpgm() - Signal handler for standard exit
sig_paused() - Signal handler for pausing execution
sig_resize() - Signal handler force terminal resize
xalloc_our_handler() - Custom error handler for xalloc (part of libproc)

Library Interfaces

cpus_refresh() - Refresh CPU timing information
hstbsrch() - History binary search
hstget() - Gets a history record
hstput() - Adds a history record
procs_hlp() - Calculates proc deltas between frames
procs_refresh() - Walks the kernel proccess table and associates to windows
sysinfo_refresh() - Updates memory and CPU info

Fields Management

adj_geometry() - Recalculates screen dimension and associates
build_headers() - Recalculate column headers
calibrate_fields() - Recalculate column/field headers/flags
display_fields() - Show the viewable fields including active markers
fields_utility() - Perform the field management procedure ('[fF]')
widths_resize() - Revisit columns with auto width calculations
zap_fieldstab() - Main procedure to recompute task list fields

Color and Display

capsmk() - Generates a color capability string
show_msg() - Displays a message on the interactive message row
show_pmt() - Displays a prompt for user input
show_scroll() - Displays scroll coordinates
show_special() - Displays a text buffer that includes special capabilities

Input

do_key() - Primary keyboard input handler for main()
find_ofs() - Locates a sub-string within a buffer
find_string() - Collects user-specfied input for string search
keys_global() - Secondary keyboard input handler for global keys
keys_summary() - Secondary keyboard input handler for summary area keys
keys_window() - Secondary keyboard input handler for task list keys
keys_xtra() - Secondary keyboard input handler for miscellaneous keys
other_filters() - Collects a user-specified filter ('[oO]')
write_rcfile() - Procedure to export the current top configuration ('W')

Multi-byte character support

utf8_cols() - Computes the count of displayed characters
utf8_delta() - Returns difference between actual and visible character count
utf8_embody() - Finds the byte-level separation point for a visible line
utf8_justify() - Multi-byte justification function
utf8_proper_col() - Finds the resulting column number given an input and type

Formatting

justify_pad() - Add padding after a colum
make_chr() - Justifies a single character
make_num() - Justifies an integer
make_str() - Justify a string
make_str_utf8() - Justify a string with multi-byte characters
scale_mem() - Changes memory size basis and re-format
scale_num() - Changes a number basis and re-format
scale_pcnt() - Justifies a percent value
scale_tics() - Justifies a time-tick value

Windows

wins_clrhlp() - Save/restore terminal color settings
wins_colors() - Changes temrinal colors
win_names() - Sets the window names from configuration
wins_reflag() - Changes window flags
wins_reset() - Resets window to basic display (no monitoring)
wins_select() - Changes the active window
wins_usrselect() - Matches a task to
wins_warn() - Warns the user of an invalid command

Forest View

forest_adds() - Adds a node and adjusts the tree accordingly
forest_based() - Sorts the tree by start time
forest_create() - Procedure to construct the tree
forest_display() - Adjusts the task cmdline for tree view

Inspection

insp_cnt_nl() - Counts elements in the inspection buffer
insp_do_demo() - Executes the inspection and fills the buffer
insp_do_file() - Accesses the inspect entry for file types
insp_do_pipe() - Accesses the inspect entry for pipe types
insp_find_ofs() - Searches inspection entry for a string offset
insp_find_str() - Implements string search for an inspection entry
insp_mkrow_raw() - Builds an output row for an inspection result (highlights unprintables)
insp_mkrow_utf8() - Builds an output row for an inspection result
insp_show_pgs() - Displays a single page of inspection resutls
insp_view_choice() - Displays inspection results and handles user controls
inspection_utility() - The main inspection procedure

Filtering

osel_add() - Processes a new 'other' filter (task list filter)
osel_clear() - Removes an 'other' filter
osel_matched() - Checks if the filter matches any task in the window

Miscellaneous:

alloc_c() - Custom calloc() function with safe error handler
alloc_r() - Custom realloc() function with safe error handler
alloc_s() - Custom string allocator
fmtmk() - Generic format helper function invoking vsnprintf()
get_float() - Prompt user to input a float
get_int() - Prompt user to input an integer
hex_make() - Convert an input value to hex
ioa() - Blocks execution loop for either keyboard input or elapsed time
ioch() - Reads user input directly from STDIN
iokey() - Processes user input and escaped values
ioline() - Gets user input from an interactive line
mkfloat() - Converts an input number value to a float representation
readfile() - Reads a file of unknown size
scat() - Cheap and fast strcat()
tg2() - Moves the cursor to specified x,y position
user_certify() - Verifies that the input string is a user name or ID

Sorting Callbacks

top uses ~56 callback function-like macros to sort on the possible task list fields. I won't bother to list them here since they only appear once in the code, and never in expanded form. However, I will unroll the macro that defines the functions and demonstrate with one field. Then I'll show another macro that integrates sort functions to the global fieldstab[]. Let's consider sorting on %CPU.

Defining sort callbacks
Consider this line from early in the top.c source:

SCB_NUM1(CPU, pcpu)

Bounce this statement off these pre-processor macros from top.h:

#define SCB_NUM1(f,n) \
   static int SCB_NAME(f) (const proc_t **P, const proc_t **Q) { \
      if ( (*P)->n < (*Q)->n ) return SORT_lt; \
      if ( (*P)->n > (*Q)->n ) return SORT_gt; \
      return SORT_eq; }
	  
#define SCB_NAME(f) sort_EU_ ## f
#define SORT_lt  ( Frame_srtflg > 0 ?  1 : -1 )
#define SORT_gt  ( Frame_srtflg > 0 ? -1 :  1 )
#define SORT_eq  0

The pre-processor transforms the original single line statement:

static int sort_EU_CPU (const proc_t **P, const proc_t **Q) {
  if ( (*P)->pcpu < (*Q)->pcpu ) 
    return ( Frame_srtflg > 0 ?  1 : -1 );
	
  if ( (*P)->pcpu > (*Q)->pcpu )
    return ( Frame_srtflg > 0 ? -1 :  1 );
	
  return 0; 
}

Since these functions are ultimately passed to qsort(), we expect the typical return values of -1, 0, or 1. Frame_srtflg is our global flag that specifies ascending or descending order.

Connecting callbacks to fields
Now we associate the callback to the fields via the global Fieldstab[]. Consider this definition line from top.c:

  .width  .scale  .align    .sort     .lflg
  ------  ------  --------  --------  --------  */
...
{     5,     -1,  A_right,  SF(CPU),  L_stat    },
...

This is the FLD_t entry for CPU. Note the sort member function macro. Run this through these macros:

#define SF(f) (QFP_t)SCB_NAME(f)
#define SCB_NAME(f) sort_EU_ ## f

The preprocessor revises the entry to:

  .width  .scale  .align    .sort     .lflg
  ------  ------  --------  --------  --------  */
...
{     5,     -1,  A_right,  (QFP_t)sort_EU_CPU,  L_stat    },
...

Finally, the QFP_t type is a function signature typedef'd to int in top.h. Compare the prototype to the generated callback signature:

/* The QFP_t type */
typedef int (*QFP_t)(const void *, const void *);

/* Our new matching callback */
static int sort_EU_CPU (const proc_t **P, const proc_t **Q)

When we're ready to order the task list, we pass this function to qsort():

qsort(q->ppt, Frame_maxtask, sizeof(proc_t *), Fieldstab[q->rc.sortindx].sort);

Terminal Capabilities

On my CentOS system, I've mapped my terminal info (xterm) from infocmp to the POSIX-defined terminal capabilities as used within top. If you're feeling adventurous, compare these xterm sequences to an actual VT100/VT102 terminal (here)

top symbol	POSIX Capname	xterm	Purpose
Cap_clr_eol	el	\E[K	Clear remaining line
Cap_clr_eos	ed	\E[J	Clear remaining display
Cap_nl_clreos	(composite)	\n\E[J	Clear and new line
Cap_clr_scr	clear	\E[H\E[2J	Clear screen and home cursor
Cap_curs_norm	cnorm	\E[?12l\E[?25h	Normal cursor
Cap_curs_huge	cvvis	\E[?12;25h	Prominent cursor
Cap_curs_hide	civis	\E[?25l	Hide cursor
Cap_home	home	\E[H	Home cursor (upper left)
Cap_norm	sgr0	\E(B\E[m	Turf off attributes
Cap_reverse	rev	\E[7m	Reverse video cursor
Caps_off	op	\E[39;49m	Set default color
Caps_endline	(composite)	\E[39;49m\E[K	Default color & new line
Cap_rmam	rmam	\E[?7l	No auto margins
Cap_smam	smam	\E[?7h	Auto margins on

Note that \E in xterm sequences represent the 'escape' character (^[), commonly shown in octal as \33

FAQ

What the most important topics not covered?
There are a few things worth further research and discussion:

The interaction with the curses library
The readproc support subsystem of procps
Some features, such as custom configurations, filters, and inspect entries

What insights are there in the strace demonstration in part 1?
Quite a few things, especially if you're interested in the process at load time. Following an strace is beyond the scope of top, but nevertheless, I'll throw out a few things to follow.

The whole process ran for only the initial update and I pressed 'q' to exit, yet over 4439 syscalls were executed, producing that many lines of output. The first syscall was top launching with execvp, presumably just after the shell clones in to this new process. Beyond that:

The first ~200 lines show the CRT linked libraries loading
main() begins executing around line 342 and we see the signal handlers rewiring
The next 4000 lines show top opening /proc entries to build the initial frame.

The most fun appears in the last ~15 lines:

write(1, "\33[H\33[2J\33(B\33[mtop"..., 2048) = 2048 /* Summary write */
write(1, "K\n\33(B\33[m    4 root"..., 1041) = 1041  /* Task window write */
pselect6(1, [0], ...)                                /* Block */
pselect6(1, [0], ...)                                
read(0, "q", 127)                       = 1          /* I press 'q' */
ioctl(0, TCFLSH, TCIFLUSH)              = 0          /* Flush input */
ioctl(0, SNDCTL_TMR_CONTINUE ...}) = 0
write(1, "\33[?1l\33>"..., 30) = 30 /* Move cursor to end and reset */
dup2(3, 2)                              = 2          
close(3)                                = 0
munmap(0x7fd56592d000, 2144488)         = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?         /* Exit success */
+++ exited with 0 +++

Note that writing the summary section pushes the whole 2k buffer, despite using only 5 lines. The message line and task window is more reasonable.

Beyond that, advanced users may be interested in pouring over the /proc access. For example, is it really necessary to open and read the full contents of /etc/passwd a dozen times in a single update?