Decoded: The 'top' utility (procps)

March 2019

Logical layout of the top utility (procps-ng)

The table of processes (top) utility is a task manager that's been around in some form since the 1980s. My introduction to it in that era came from a system administrator who described it as "The hi-tech 'ps' alternative". The utility has been rewritten over the years as various *nix systems have come and gone. Today, most Linux distros pull top from the procps-ng package. This is the version I'll focus on in this project.

My goal is to make this resource useful for seasoned developers. I'll highlight important OS interactions since these are applicable to many current (and future!) utilities. This resource cannot substitute for reading the code, but it can save you a lot of time on your first attempt.

The main topics include:

  1. Data Sources
  2. Signals
  3. Data Structures
  4. Execution
  5. Optimization
  6. Namespace (Globals, Functions, Callback macros)

Snapshot

[Source] [Code Walkthrough]

Version: procps-ng 3.3.15+ (NEXT - commit: dbe12b54 [Jan 2019])
Lines of code: 6522
Functions: 115
Callback macros: 54

Full Disclosure: I am not a procps contributor -- all credit goes to James C. Warner
Defer to the official Documentation or the development community for any conflicting information


Part 0 - Introductory Presentation

I've put together a ~10 minute presentation that cuts across the utility by example. We trace the load averages from kernel, through /proc, into top, and to the screen.


Part 1 - Data Sources

top pulls data (almost) entirely from the /proc file system. Indirectly, /proc pulls from many data structures and global variables around the kernel. In my opinion, understanding the OS interactions of 'top' is possibly the most important idea that extends to other projects.

I've itemized the rougly 75 data sources used in top. At a minimum, I've included the /proc data source and associated 3.10 kernel declarations. Most summary kernel statistics are declared as globals within respective subsystems and exported (symbol made visible) to other subsystems.

For completeness, this strace of top definitively identifies all interactions with the OS on my system for the initial update frame. I ran top and pressed 'q' after about 1 second. Picking apart an strace is always very enlightening and worthly of an article by itself. I'll save that for another day.

Summary Information

Data /proc Kernel
CPU idle time/proc/statScheduler (dec, def)
CPU hw-int time/proc/statScheduler (dec, def)
CPU sw-int time/proc/statScheduler (dec, def)
CPU kernel time/proc/statScheduler (dec, def)
CPU nice time/proc/statScheduler (dec, def)
CPU steal time/proc/statScheduler (dec, def)
CPU user time/proc/statScheduler (dec, def)
Load averages/proc/loadavgScheduler (dec/def)
Mem buffers/proc/meminfoFile System (derived)
Mem free/proc/meminfoMemory Manager (dec/def)
Mem total/proc/meminfoMemory Manager (dec/def)
Mem usedN/A (derived)N/A
Swap free/proc/meminfoMemory Manager (dec)
Swap total/proc/meminfoMemory Manager (dec)
Swap usedN/A (derived)N/A
Task CountsN/A (derived)N/A
Uptime/proc/uptimeSystem Timer (derived)
User countN/A (utmp file)N/A

Task Information

Most per-task data are available through the stat, statm, and status entries under /proc/[pid#]. You can use the retrieval function mappings to trace back most process fields. The kernel display functions for the big three are:

For process data within the kernel, most roads lead back to the task_struct.

Data /proc Kernel
%CPU/proc/#/stat (derived)task_struct (dec)
%MEM/proc/#/statmtask_struct (dec) -> mm.rss_stat (dec)
CGNAME/proc/#/cgroup (derived)Control Group (dec)
CGROUPS/proc/#/cgroupControl Group subsys (dec)
CODE/proc/#/statmtask_struct (dec) -> mm (dec, def)
COMMAND/proc/#/stat
/proc/#/cmdline
task_struct (dec) -> mm (dec)
DATA/proc/#/statmtask_struct (dec) -> mm (dec, def)
ENVIRON/proc/#/environtask_struct (dec) -> mm (dec)
Flags/proc/#/stattask_struct (dec)
GID/proc/#/statustask_struct (dec) -> cred (dec)
GROUP/proc/#/status (derived)N/A (GID lookup, see above)
LXCN/A (derived)N/A
nDRT/proc/#/statmN/A (Forced to zero in 3.10)
NI/proc/#/stattask_struct (dec)
nMaj/proc/#/stattask_struct (dec) -> signal (dec)
nMin/proc/#/stattask_struct (dec) -> signal (dec)
nTh/proc/#/statustask_struct (dec) -> signal (dec)
nsIPC/proc/#/ns/ipctask_struct (dec) -> namespace (dec)
nsMNT/proc/#/ns/mnttask_struct (dec) -> namespace (dec)
nsNET/proc/#/ns/nettask_struct (dec) -> namespace (dec)
nsPID/proc/#/ns/pidtask_struct (dec) -> namespace (dec)
nsUSER/proc/#/ns/usertask_struct (dec) -> namespace (dec)
nsUTS/proc/#/ns/utstask_struct (dec) -> namespace (dec)
NU/proc/#/stat (derived)task_struct (dec), x86 thread_info (dec)
OOMa/proc/#/oom_adjDerived from MM
OOMs/proc/#/oom_scoretask_struct (dec) -> signal (dec)
P/proc/#/stat (derived)task_struct (dec), x86 thread_info (dec)
PGRP/proc/#/stattask_struct of group_leader(dec)
PID/proctask_struct (dec)
PPID/proc/#/stat
/proc/#/status
task_struct of parent (dec)
PR/proc/#/stattask_struct (dec)
RES/proc/#/statmtask_struct (dec) -> mm.rss_stat (dec)
RSan/proc/#/statusNot in 3.10
RSfd/proc/#/statusNot in 3.10
RSlk/proc/#/statustask_struct (dec) -> mm (dec)
RSsh/proc/#/statusNot in 3.10
RUID/proc/#/statustask_struct (dec) -> cred (dec)
RUSER/proc/#/status (derived)N/A (RUID passwd lookup)
S/proc/#/stat
/proc/#/status
task_struct (dec)
SHR/proc/#/statmtask_struct (dec) -> mm.rss_stat (dec)
SID/proc/#/stattask_struct (dec)
SUID/proc/#/statustask_struct (dec) -> cred (dec)
SUPGIDS/proc/#/statustask_struct (dec) -> cred (dec)
SUPGRPS/proc/#/status (derived)N/A (SUPGID passwd lookup)
SUSER/proc/#/status (derived)N/A (SUID passwd lookup)
SWAP/proc/#/statustask_struct (dec) -> mm.rss_stat (dec)
TGID/proc/statustask_struct (dec)
TIME/proc/#/stattask_struct (dec)
TIME+/proc/#/stattask_struct (dec)
TPGID/proc/#/stattask_struct (dec) -> signal (dec) -> tty (dec)
TTY/proc/#/stattask_struct (dec) -> signal (dec) -> tty (dec)
UID/proc/#/statustask_struct (dec) -> cred (dec)
USED/proc/#/status (derived)N/A
USER/proc/#/status (derived)N/A
VIRT/proc/#/statmtask_struct (dec) -> mm (dec)
vMj/proc/#/stattask_struct (dec) -> signal (dec)
vMn/proc/#/stattask_struct (dec) -> signal (dec)
WCHAN/proc/#/wchanTracing fp & ip (def)

Part 2 - Signals

Signal handling in top (procps)

top rewires default signals to several custom handlers for two purposes:

  1. The window may have been resized either directly or while paused. We need a custom signal handler to set the global flag to force a full window and task field recalculation on the next update. Signals and handlers affected:
    • sig_resize() - SIGWINCH, SIGCONT
  2. Since top overrides terminal settings at start up, the original settings need to be restored if execution is paused or terminated. Otherwise, the terminal may be in an unusable state after returning to the shell. The map of signals to handlers is:
    • sig_endpgm() - SIGALRM, SIGHUP, SIGINT, SIGPIPE, SIGQUIT, SIGTERM, SIGUSR1, SIGUSR2
    • sig_paused() - SIGSTP, SIGTTIN, SIGTTOU
    • sig_abexit() - All others

Here's an annotated look at the code for all four signal handlers:

static void sig_endpgm (int dont_care_sig) {
   sigset_t ss;

   sigfillset(&ss);              /* Blocking all signals */
   sigprocmask(SIG_BLOCK, &ss, NULL);
   Frames_signal = BREAK_sig;    /* Flag to force full terminal recompute */
   bye_bye(NULL);                /* Full graceful exit (See Part 6) */
   (void)dont_care_sig;          /* Silence compiler warnings */
}
static void sig_abexit (int sig) {
   sigset_t ss;

   sigfillset(&ss);          /* Blocking all signals */
   sigprocmask(SIG_BLOCK, &ss, NULL);
   at_eoj();                 /* Restore TTY and show debug log */
   fprintf(stderr, N_fmt(EXIT_signals_fmt)
      , sig, signal_number_to_name(sig), Myname);
   signal(sig, SIG_DFL);     /* Redirect to default */
   raise(sig);               /* Execute default */
   _exit(sig | 0x80);        /* Error exit */
}
static void sig_resize (int dont_care_sig) {
   tcdrain(STDOUT_FILENO);      /* Clear STDOUT buffer*/
   Frames_signal = BREAK_sig;   /* Flag to force full terminal recompute */
   (void)dont_care_sig;         /* Silence compiler warnings */
} 
static void sig_paused (int dont_care_sig) {

   /* Restore Original TTY */
   if (-1 == tcsetattr(STDIN_FILENO, TCSAFLUSH, &Tty_original))
      error_exit(fmtmk(N_fmt(FAIL_tty_set_fmt), strerror(errno)));
   if (keypad_local) putp(keypad_local);
   putp(tg2(0, Screen_rows));  /* Move to bottom left */
   putp(Cap_curs_norm);        /* Normalize cursor format */
#ifndef RMAN_IGNORED
   putp(Cap_smam);             /* Enable right margin */
#endif
   fflush(stdout);             /* Push STDOUT buffer */
   raise(SIGSTOP);             /* Block here until SIGCONT is received */
   
   /* Restore top's expected TTY configuration */
   if (-1 == tcsetattr(STDIN_FILENO, TCSAFLUSH, &Tty_raw))
      error_exit(fmtmk(N_fmt(FAIL_tty_set_fmt), strerror(errno)));
#ifndef RMAN_IGNORED
   putp(Cap_rmam);             /* Disable right margin */
#endif
   if (keypad_xmit) putp(keypad_xmit);
   putp(Cursor_state);         /* Restore cursor settings */
   Frames_signal = BREAK_sig;  /* Flag to force full terminal recompute */
   (void)dont_care_sig;        /* Silence compiler warnings */
}

Part 3 - Data Structures

top relies on several structures to organize and cache data needed during computation. I'll separate the data structure discussion in to two groups: Those defined exclusively within top, and those included from external sources (i.e. C/POSIX standard structures). I'll focus more on the top-specific structures since they are unique, but the external standard structures are generally more useful to understand.

top-specific structures

These are the data structures defined and used exclusively within top

Fields [FLD_t]

The FLD_t structure to track task list fields

A field structure represents a single column of information in the task list. All possible fields are defined as a global array and accessed by the task window.

Three of five members deal directly with data display:

  • width - The character count of the field. -1 is variable, usually text. 0 is computed at runtime.
  • scale - Represents the magnitude of the number (Kb, Mb, Gb, etc)
  • align - Visual alignment, left or right

The other two members have specialized usage:

  • sort - The sort function used to order by this fields. (See part 6)
  • lflag - The /proc source (usually stat, status, statm, or specialized)

Task Window [WIN_t]

The WIN_t structure to track window data

The task windows are the most complex custom structure in top, performing many roles from holding configurations, referencing accessing many globals, tracking screen dimensions, and caching process data. top defines four Windows as globals.

Important members to know:

  • **ppt - The proc_t array used by the window
  • pflgsall[] - Tracks fields being displayed in the window
  • procflgs[] - Holds behavior flags for the active fields
  • rc - Window configuration specifics (see RCW_t below)

Window view management variables:

  • begnext - The offset to apply to task list on next update
  • begpflg - The first visible point in pflgsall[]
  • begtask - The first task visible in ppt[]
  • endpflg - The visible ending point in pflgsall[]
  • usrselflg - Flag for include or exclude (e)uid matches
  • usrseltyp - The type of user to match (uid or euid)
  • usrseluid - The chosen user id to match
  • varcolsz - The cap on variable width columns
  • winlines - The number of lines for the window

Others members:

  • cap* - Terminal capabilities for this window
  • grpname - A string representing window number:name
  • *next - Pointer to the next window
  • *prev - Pointer to the last window

Window Configuration [RCW_t]

The RCW_t structure to track window configuration

Each window has an associated configuration structure (rc) built from both user settings and an external file. This structure is critical for tracking 'live' fields in the current window.

The three most important members to know are:

  • fieldscur[] - The ordered array of active fields in the window
  • sortinx - The field enum within Fieldstab[] to sort on
  • winflags - The window modes including view/sort details

Other members include:

  • graph_cpus - The alternate cpu display values for 't'
  • graph_mems - The alternate memory display values for 'm'
  • headclr - Header color
  • maxtasks - The user-provided maximum tasks to
  • msgsclr - Message color
  • summclr - Summary color
  • taskclr - Task color
  • winname[] - The user-provided window name

General Configuration [RCF_t]

The RCF_t structure to track configuration file data

This structure supports the external configuration file and data not used directly by a window.

The main members include:

  • id - The rc file version
  • win_index - The index to the global Winstk for the current window
  • win - Local references to all window config data

The remaining members track settings:

  • delay_time - Desired delay time between updates, 'd'
  • fixed_widest - Extra space for fields, 'X'
  • mode_altscr - Alternate display mode, 'A'
  • mode_irixps - CPU display mode, 'I'
  • summ_mscale - Summary memory scaling 'I'
  • task_mscale - Process memory scaling 'e'
  • zero_suppress - Suppress zeroes for scaled values '0'

History [HST_t]

The HST_t structure to track task history

top tracks a useful varaibles between frames, notably CPU ticks and fault counts. The members are:

  • lnk - Integer index to the next HST in the chain
  • maj - Major fault count
  • min - Minor fault count
  • pid - The associated PID
  • tics - The tick count for this entry

CPU History [CT_t, CPU_t]

The CT_t and CPU_t structures to track CPU usage

We need to keep track of some CPU data, especially the changes between frames. These two structures work together manage updates:

The CPU_t structure, one per CPU

  • cur - The current frame's tick data (CT_t)
  • id - The CPU number
  • node - The numa node of the CPU
  • sav - The previous frame's tick data (CT_t)

The CT_t (CPU tick tracker), two per CPU (this update and the last update). These data are provided by /proc/stat and correspond to tick counts in the following modes:

  • i - Idle time
  • n - Nice-invoked process time
  • s - Kernel time
  • u - Userland time
  • w - I/O blocked time
  • x - HW interrupt time
  • y - SW interrupt time
  • z - Virtual steal time (hypervisor block time)

Other structures included in top

These structures are used in top but defined elsewhere. These are useful to know since they're used throughout the OS/Kernel.

  • pid_t - Required by POSIX, thus typedef'd in the kernel as int
  • proc_t - Defined within procps to hold all data about a process
  • PROCTAB - Defined by procps to track persistent /proc information
  • sigset_t - Required by POSIX, and defined in the kernel as an architecture-specific type array
  • termios - The POSIX definition of the terminal, specifically the I/O and control modes.

Part 4 - Execution

High level execution in in top (procps)

Execution appears simple at a high level. top begins with basic initialization, including reading configurations from files and command line flags, then setting up the terminal. The main loop continuously pulls input from /proc, formatting it, and sending results to the terminal. The program will only exit if the user pressed 'q', the user configured a fixed number of updates, or the operating system sends a disruptive signal. Let's dig into initialization, the main loop, and the exit procedure.

Let's break down each of these steps.

Initialization

int main (int dont_care_argc, char **argv) {
   (void)dont_care_argc;
   before(*argv);
                  
   wins_stage_1();
   configs_reads();
   parse_args(&argv[1]);
   whack_terminal();
   wins_stage_2();

...to main loop...   

(void)dont_care_argc;
Suppresses two compiler warnings: Unused argument (by using it) and statement with no effect (by void cast).

before(*argv);
Performs important preparatory work such as verifying /proc mount, counting CPUs, and rewiring signal handlers.

wins_stage_1();
Basic window preparation. Sets up the capabilities table, sets up a logical window ring, and sets the current window to the first.

configs_reads();
Processes external configuration files (rc files)

parse_args(&argv[1]);
Handles command line arguments

whack_terminal();
Clobbers the terminal in preparation for displaying the top utility. Saves old settings, adjusts flags, and sets buffering

wins_stage_2();
Finishes window configuration by defining terminal capabilities (global and window) and initially defining fields

The Big Loop

The top utility main loop

top will churn through the main loop until either the user or the operating system forces an exit. I'll focus on the three operations pictured above and highlighted in the lines below.

...in main(), after initialization...

  for (;;) {
     struct timespec ts;

     frame_make();                 /* Builds and displays the next update */

     if (0 < Loops) --Loops;       /* Controls the user-specified */
     if (!Loops) bye_bye(NULL);    /* automatic exit counter (-n) */

     ts.tv_sec = Rc.delay_time;    /* Set up delay time between frames */
     ts.tv_nsec = (Rc.delay_time - (int)Rc.delay_time) * 1000000000;

     if (Batch)                    /* Batch mode operation (-b) */
        pselect(0, NULL, NULL, NULL, &ts, NULL);
     else {
        if (ioa(&ts))              /* Wait - pselect() within ioa() */
           do_key(iokey(1))        /* Handles keyboard input */;
     }
  }

Let's dig in to the highlighted operations: Building a single output frame, waiting for the next update, and processing input.

Building a Frame

Building a single frame in top

An output frame requires preperatory computations. In the interest of optimization, the actual computation sequence depends on if we need a full update (such as recomputing screen dimensions), or only a limited update if most elements are unchanged.

At a minimum, preparing a frame will always:

  • Read system information via sysinfo_refresh(). This includes the CPU counts from /proc/stat and memory information from /proc/meminfo. These data changed on each frame and are always referenced.
  • Refresh process information via procs_refresh(). Updates relevant pointers in windows.

Sometimes we need full updates such as when this is the first frame displayed after running top or when the user resizes the terminal window. Full updates include the previous operations, plus:

  • Recalculating the field tables for each window in zap_fieldstab().
  • Rechecking the CPU counts via cpus_refresh().

Displaying data on the terminal
We draw the screen one line at a time, from top to bottom. Functionally, there are three distinct sections:

  • Summary - The summary area occupies the first several lines and the procedure is summary_show()
  • Message - A single line after the summary, which is simply cleared using a termcap. Actually messages are added interactively.
  • Task Window(s) - The list of tasks and associated fields. There may be up to four windows, but the default is one. The procedure is window_show()

Let's trace the retrieval, construction, formatting, and output of a single line of data through top. I'll try to distill the bottom line and show a single example that can extend to all screen elements. The anti-climactic punchline is:

At the lowest level, we read() from a /proc source, format data to a buffer, and putp() the buffer to the terminal with help from curses.

If you blink, you make miss this simple process so let's look at how this happens with load average in the summary display.

The load average appears at the top of the screen, so we expect the action to be near the beginning of the procedure. Here are the first 15 lines of summary_show():

static void summary_show (void) {
 #define isROOM(f,n) (CHKw(w, f) && Msg_row + (n) < Screen_rows - 1)
 #define anyFLG 0xffffff
   WIN_t *w = Curwin;
   char tmp[MEDBUFSIZ];
   int i;

   /* Display Uptime and Loadavg */
   if (isROOM(View_LOADAV, 1)) {
      if (!Rc.mode_altscr)
         show_special(0, fmtmk(LOADAV_line, Myname, sprint_uptime(0)));
      else
         show_special(0, fmtmk(CHKw(w, Show_TASKON)? LOADAV_line_alt : LOADAV_line
            , w->grpname, sprint_uptime(0)));
      Msg_row += 1;
   } 
...remaining summary_show() procedure...

The critical line includes the show_special() function that uses two layers of argument evaluations. This line encapsulates the entire process of reading, formatting, and writing. The call stack (including macros) for the I/O process looks like this:

  • Reading - Load average is pulled from /proc/loadavg by sprint_uptime() ultimately using the read() syscall.
    • sprint_uptime() - top.c
    • loadavg() - sysinfo.c
    • FILE_TO_BUF() - sysinfo.c (macro)
    • read() - Standard library syscall wrapper
  • Formatting - The raw result is formatted within fmtmk() to include special terminal characters and returned in a buffer.
    • fmtmk() - top.c
    • vnsprintf() - Standard library
  • Writing - show_special() pushes the formatted result to the terminal using a macro and ultimately putp()
    • show_special() - top.c
    • PUFF() - top.h (macro)
    • putp() - curses

This is the call stack (plus function-like macros) for reading and writing the load average

Waiting (Blocking)

After we've built a single frame, the next task in the main loop is to pause operation until it's time for a new frame update OR the user presses a key. The pselect() syscall blocks until either a time delay passes or user input occurs. Let's look at the prototype, and then the usage:

int pselect(int nfds,                     /* Descriptor count to check */
            fd_set *restrict readfds,     /* Reading check descriptor set */
            fd_set *restrict writefds,    /* Writing check descriptor set */
            fd_set *restrict errorfds,    /* Err condition descriptor set */
            timespec *restrict timeout,   /* Delay timeout  */
            sigset_t *restrict sigmask);  /* Signal mask to apply */

Under normal (non-batch) operation, the pselect() syscall is invoked within the ioa() function as shown here:

static inline int ioa (struct timespec *ts) {
   fd_set fs;
   int rc;

   FD_ZERO(&fs);              /* Clears target file descriptors */
   FD_SET(STDIN_FILENO, &fs); /* Adds STDIN to the target set */

   rc = pselect(STDIN_FILENO + 1, &fs, NULL, NULL, ts, &Sigwinch_set);

   if (rc < 0) rc = 0;
   return rc;
} // end: ioa

The first argument to pselect() is the number of file descriptors to monitor, since STDIN_FILENO is always 0 (by POSIX), this number amounts to 1. We pass STDIN as the descriptor to read, and we also pass the delay time as the computed frame delay from the main loop.

We ignore SIGWINCH since that case is handled automatically by the signal handler and processed during the next scheduled frame update

Input Handling

Any keyboard input unblocks pselect() and do_key() processes the keystroke. This handler is really the first of several possible layers of switched handlers which are organized to group common functions.

static void do_key (int ch) {
   static struct {
      void (*func)(int ch);    /* Handler to execute */
      char keys[SMLBUFSIZ];    /* Keys to associate to the handler */
   } key_tab[] = {
      { keys_global, { '?', 'B', 'd', ... }},   /* Actual key handler map */
      { keys_summary,{ '1', '2', '3', ... }},
      { keys_task, { '#', '<', '>', ... }},
      { keys_window, { '+', '-', '=', ... }}, 
      { keys_xtra, { 'M', 'N', 'P', 'T', '\0' }}
   };
   int i;

   switch (ch) {
      case 'q':             /* Jump to exit routine on 'q' */
         bye_bye(NULL);
      default:              /* Check all keys/handlers */
         for (i = 0; i < MAXTBL(key_tab); ++i) /* Loop through handlers */
            if (strchr(key_tab[i].keys, ch)) { /* Check keys for a match */
               key_tab[i].func(ch);       /* Execute next key handler */
               Frames_signal = BREAK_kbd; /* Force update on next frame */
               goto all_done;     /* Key matched, jump out of loop */
            }
   };
   ...content cut...
}

The next layer of keyboard handlers switch on the handful of matching keys, which eventually execute the proper procedure. Handlers typically call a function, update a global variable, or alter one of the data structures.

Exiting

Exit procedure for top (procps)

All graceful exits call the bye_bye() procedure, which is annotated below. The most important idea is that we restore the original shell terminal before exiting.

static void bye_bye (const char *str) {
   at_eoj();                        /* End of Job -- restore shell TTY */
#ifdef ATEOJ_RPTSTD
  ...debug reports cut..
#endif
   numa_uninit();                   /* (libproc) */
   if (str) {                       /* Error handling */
      fputs(str, stderr);           /* Error reporting */
      exit(EXIT_FAILURE);           /* Error exit */
   }
   if (Batch) fputs("\n", stdout);  /* Batch mode output */
   exit(EXIT_SUCCESS);              /* Standard exit */
}

Part 5 - Optimizations

Reading and writing is easy. Reading and writing only what you need, when you need it, with few instructions and little memory is not. top works hard to keep information flowing under adverse conditions (think a million processes, filtered, and sorted across four windows ever second).

top applies many common sense strategies, for instance:

  • Caching - Data structures allow us to retain data and computations between updates. For instance, top computes and retains window/field dimenions potentially indefinitely.
  • Separate/Limit Operations - Identify data that must be read every frame and grab it all at once, such as the process table. Don't read any data that won't be displayed, such as tasks data beyond the window size.
  • Pointers, not data - Confine updates to pointers and avoid directly manipulating data. Task windows do this with the process tables and field metadata
  • Small buffers - Large buffers lead to more latency. top reduces the default STDOUT buffer to 2kb at initialization

I'm not going to dig in to those well-known optimizations. Instead, let's focus on top's "Pseudo screen" and companion macros. The idea is related to back-buffering in a graphics pipeline, but this implementation seems unique to top and contributes to smooth operation.

top uses a global buffer called Pseudo_screen to hold the logical contents of the current frame that will be sent to the terminal. The benefits may not be obvious, but consider this:

  • We can analyze and scribble across a buffer at any time. This means we can make smart decisions about how to handle these data.
  • Terminals are write-only. Data sent is now on display and making changes means sending more data. It's not cheap (or possible in some cases) to know the state of the terminal.

In short, we can optimize data sent to the terminal via a Pseudo_screen. However, we retain the ability to write directly to the terminal, such as for direct user interactions.

Most interaction with the terminal and Pseudo_screen is done through three macros, depending on the use case:

  • POOF - Writes directly to the terminal and advances pseudo screen state
  • PUFF - Writes text data to the pseudo screen which may be optimized before writing to the terminal
  • PUTT - Writes text or capability data directly to the terminal, ignoring pseudo screen. Used for interactives (ex. message row)

Let's look at each:

#define POOF(str,cap) do { \
   putp(str); putp(cap); \                           /* Direct to tty */
   Pseudo_screen[Pseudo_row * ROWMAXSIZ] = '\0'; \   /* Null-term row */
   if (Pseudo_row + 1 < Screen_rows) ++Pseudo_row; \ /* Advance position */
} while (0)
#define PUFF(fmt,arg...) do { \
   char _str[ROWMAXSIZ]; \
   const int _len = snprintf(_str, sizeof(_str), fmt, ## arg); \
   if (Batch) { \
      char *_eol = _str + (_len < 0 ? 0 : (size_t)_len >= sizeof(_str) ? sizeof(_str)-1 : (size_t)_len); \
      while (_eol > _str && _eol[-1] == ' ') _eol--; *_eol = '\0'; putp(_str); } \
   else if (Pseudo_row >= 0 && Pseudo_row < Screen_rows) { \ /* Visible? */
      char *_ptr = &Pseudo_screen[Pseudo_row++ * ROWMAXSIZ]; \
      if (!strcmp(_ptr, _str)) putp("\n"); \  /* No work to do? */
      else { \
         strcpy(_ptr, _str); \   /* Write to frame */
         putp(_ptr); } } \       /* Write to tty */
} while (0)
#define PUTT(fmt,arg...) do { \
   char _str[ROWMAXSIZ]; \
   snprintf(_str, sizeof(_str), fmt, ## arg); \  /* Format to buffer */
   putp(_str); \                                 /* Write buffer to tty */
} while (0)

Part 6 - Namespace

Getting comfortable with the namespace is an early hurdle when diving in to any new utility. To reduce that chore, I'll break down the various symbols you'll see throughout top.[ch]. This section is really a reference rather than a detailed explanation.

Global variables

Roughly the first 150 lines of code declare and define globals used in many of the functions we'll see below. The most important globals to know are:

  • Batch - Flag for batch mode when redirecting output to a file. I'm assuming this is off
  • Cap_* - Terminal capabilities for controlling screen and cursor
  • Curwin - Pointer to the window currently being processed
  • Fieldstab[] - The global array describing the possible task list fields
  • Frames_signal - The global refresh state flag used to alter frame updates (set in signal handlers)
  • Loops - Limits the number of top updates to this number, assumed -1 for forever
  • Msg_row - The row number of the message row (row count of summary section + 1)
  • Pseudo_* - A buffer and descriptors for optimized screen control
  • Stdout_buf - The actual buffer space for the terminal. Not directly used, but know that it's smaller than default
  • Tty_original - The terminal configuration prior to top execution
  • Ttychanged - Flag indicating that top has altered terminal flags
  • Winstk[] - All of the possible windows to display (default number is 4)

Functions

Here are all the functions defined by top and grouped by purpose with groups ordered by importance. More comprehensive groups are at the top and the boilerplate work is at the bottom:

Screen/Frame

  • frame_hlp() - Computes the sizes of multiple task windows on screen
  • frame_make() - Procedure to collate and construct a single frame update
  • help_view() - Display help procedure
  • summary_hlp() - Prints various CPU% usages in the summary
  • summary_show() - Procedure to collate and construct the current summary area
  • task_show() - Constructs a single line of the task list
  • window_hlp() - Verifies that the window being processed is visible
  • window_show() - Displays the entire contents of a single window/task list

Initialization

  • before() - top initialization procedure (signals, memory, etc)
  • config_cvt() - Configuration file converter
  • config_insp() - Validates configuration file inspection entries
  • config_osel() - Validates configuration file other filters
  • configs_file() - Top-level configuration file validation procedure
  • configs_path() - Verifies a configuration file path
  • configs_reads() - Reads in up to three configuration files
  • parse_args() - Processes user-provided command-line flags
  • whack_terminal() - Reconfigurations terminal output for top
  • wins_stage_1() - Early-stage window configuration settings
  • wins_stage_2() - Late-stage window configuration settings (after config file processings)

Exit / Interrupt:

  • at_eoj() - Restores the original terminal state prior to top
  • bye_bye() - Application exit procedure. Restores TTY and returns exit status
  • error_exit() - Formats exit error message
  • sig_abexit() - Signal handler for abnormal exit
  • sig_endpgm() - Signal handler for standard exit
  • sig_paused() - Signal handler for pausing execution
  • sig_resize() - Signal handler force terminal resize
  • xalloc_our_handler() - Custom error handler for xalloc (part of libproc)

Library Interfaces

  • cpus_refresh() - Refresh CPU timing information
  • hstbsrch() - History binary search
  • hstget() - Gets a history record
  • hstput() - Adds a history record
  • procs_hlp() - Calculates proc deltas between frames
  • procs_refresh() - Walks the kernel proccess table and associates to windows
  • sysinfo_refresh() - Updates memory and CPU info

Fields Management

  • adj_geometry() - Recalculates screen dimension and associates
  • build_headers() - Recalculate column headers
  • calibrate_fields() - Recalculate column/field headers/flags
  • display_fields() - Show the viewable fields including active markers
  • fields_utility() - Perform the field management procedure ('[fF]')
  • widths_resize() - Revisit columns with auto width calculations
  • zap_fieldstab() - Main procedure to recompute task list fields

Color and Display

  • capsmk() - Generates a color capability string
  • show_msg() - Displays a message on the interactive message row
  • show_pmt() - Displays a prompt for user input
  • show_scroll() - Displays scroll coordinates
  • show_special() - Displays a text buffer that includes special capabilities

Input

  • do_key() - Primary keyboard input handler for main()
  • find_ofs() - Locates a sub-string within a buffer
  • find_string() - Collects user-specfied input for string search
  • keys_global() - Secondary keyboard input handler for global keys
  • keys_summary() - Secondary keyboard input handler for summary area keys
  • keys_window() - Secondary keyboard input handler for task list keys
  • keys_xtra() - Secondary keyboard input handler for miscellaneous keys
  • other_filters() - Collects a user-specified filter ('[oO]')
  • write_rcfile() - Procedure to export the current top configuration ('W')

Multi-byte character support

  • utf8_cols() - Computes the count of displayed characters
  • utf8_delta() - Returns difference between actual and visible character count
  • utf8_embody() - Finds the byte-level separation point for a visible line
  • utf8_justify() - Multi-byte justification function
  • utf8_proper_col() - Finds the resulting column number given an input and type

Formatting

  • justify_pad() - Add padding after a colum
  • make_chr() - Justifies a single character
  • make_num() - Justifies an integer
  • make_str() - Justify a string
  • make_str_utf8() - Justify a string with multi-byte characters
  • scale_mem() - Changes memory size basis and re-format
  • scale_num() - Changes a number basis and re-format
  • scale_pcnt() - Justifies a percent value
  • scale_tics() - Justifies a time-tick value

Windows

  • wins_clrhlp() - Save/restore terminal color settings
  • wins_colors() - Changes temrinal colors
  • win_names() - Sets the window names from configuration
  • wins_reflag() - Changes window flags
  • wins_reset() - Resets window to basic display (no monitoring)
  • wins_select() - Changes the active window
  • wins_usrselect() - Matches a task to
  • wins_warn() - Warns the user of an invalid command

Forest View

  • forest_adds() - Adds a node and adjusts the tree accordingly
  • forest_based() - Sorts the tree by start time
  • forest_create() - Procedure to construct the tree
  • forest_display() - Adjusts the task cmdline for tree view

Inspection

  • insp_cnt_nl() - Counts elements in the inspection buffer
  • insp_do_demo() - Executes the inspection and fills the buffer
  • insp_do_file() - Accesses the inspect entry for file types
  • insp_do_pipe() - Accesses the inspect entry for pipe types
  • insp_find_ofs() - Searches inspection entry for a string offset
  • insp_find_str() - Implements string search for an inspection entry
  • insp_mkrow_raw() - Builds an output row for an inspection result (highlights unprintables)
  • insp_mkrow_utf8() - Builds an output row for an inspection result
  • insp_show_pgs() - Displays a single page of inspection resutls
  • insp_view_choice() - Displays inspection results and handles user controls
  • inspection_utility() - The main inspection procedure

Filtering

  • osel_add() - Processes a new 'other' filter (task list filter)
  • osel_clear() - Removes an 'other' filter
  • osel_matched() - Checks if the filter matches any task in the window

Miscellaneous:

  • alloc_c() - Custom calloc() function with safe error handler
  • alloc_r() - Custom realloc() function with safe error handler
  • alloc_s() - Custom string allocator
  • fmtmk() - Generic format helper function invoking vsnprintf()
  • get_float() - Prompt user to input a float
  • get_int() - Prompt user to input an integer
  • hex_make() - Convert an input value to hex
  • ioa() - Blocks execution loop for either keyboard input or elapsed time
  • ioch() - Reads user input directly from STDIN
  • iokey() - Processes user input and escaped values
  • ioline() - Gets user input from an interactive line
  • mkfloat() - Converts an input number value to a float representation
  • readfile() - Reads a file of unknown size
  • scat() - Cheap and fast strcat()
  • tg2() - Moves the cursor to specified x,y position
  • user_certify() - Verifies that the input string is a user name or ID

Sorting Callbacks

top uses ~56 callback function-like macros to sort on the possible task list fields. I won't bother to list them here since they only appear once in the code, and never in expanded form. However, I will unroll the macro that defines the functions and demonstrate with one field. Then I'll show another macro that integrates sort functions to the global fieldstab[]. Let's consider sorting on %CPU.

Defining sort callbacks
Consider this line from early in the top.c source:

SCB_NUM1(CPU, pcpu)

Bounce this statement off these pre-processor macros from top.h:

#define SCB_NUM1(f,n) \
   static int SCB_NAME(f) (const proc_t **P, const proc_t **Q) { \
      if ( (*P)->n < (*Q)->n ) return SORT_lt; \
      if ( (*P)->n > (*Q)->n ) return SORT_gt; \
      return SORT_eq; }
	  
#define SCB_NAME(f) sort_EU_ ## f
#define SORT_lt  ( Frame_srtflg > 0 ?  1 : -1 )
#define SORT_gt  ( Frame_srtflg > 0 ? -1 :  1 )
#define SORT_eq  0

The pre-processor transforms the original single line statement:

static int sort_EU_CPU (const proc_t **P, const proc_t **Q) {
  if ( (*P)->pcpu < (*Q)->pcpu ) 
    return ( Frame_srtflg > 0 ?  1 : -1 );
	
  if ( (*P)->pcpu > (*Q)->pcpu )
    return ( Frame_srtflg > 0 ? -1 :  1 );
	
  return 0; 
}

Since these functions are ultimately passed to qsort(), we expect the typical return values of -1, 0, or 1. Frame_srtflg is our global flag that specifies ascending or descending order.

Connecting callbacks to fields
Now we associate the callback to the fields via the global Fieldstab[]. Consider this definition line from top.c:

  .width  .scale  .align    .sort     .lflg
  ------  ------  --------  --------  --------  */
...
{     5,     -1,  A_right,  SF(CPU),  L_stat    },
...

This is the FLD_t entry for CPU. Note the sort member function macro. Run this through these macros:

#define SF(f) (QFP_t)SCB_NAME(f)
#define SCB_NAME(f) sort_EU_ ## f

The preprocessor revises the entry to:

  .width  .scale  .align    .sort     .lflg
  ------  ------  --------  --------  --------  */
...
{     5,     -1,  A_right,  (QFP_t)sort_EU_CPU,  L_stat    },
...

Finally, the QFP_t type is a function signature typedef'd to int in top.h. Compare the prototype to the generated callback signature:

/* The QFP_t type */
typedef int (*QFP_t)(const void *, const void *);

/* Our new matching callback */
static int sort_EU_CPU (const proc_t **P, const proc_t **Q)

When we're ready to order the task list, we pass this function to qsort():

qsort(q->ppt, Frame_maxtask, sizeof(proc_t *), Fieldstab[q->rc.sortindx].sort);

Terminal Capabilities

On my CentOS system, I've mapped my terminal info (xterm) from infocmp to the POSIX-defined terminal capabilities as used within top. If you're feeling adventurous, compare these xterm sequences to an actual VT100/VT102 terminal (here)

top symbol POSIX Capname xterm Purpose
Cap_clr_eolel\E[KClear remaining line
Cap_clr_eosed\E[JClear remaining display
Cap_nl_clreos(composite)\n\E[JClear and new line
Cap_clr_scrclear\E[H\E[2JClear screen and home cursor
Cap_curs_normcnorm\E[?12l\E[?25hNormal cursor
Cap_curs_hugecvvis\E[?12;25hProminent cursor
Cap_curs_hidecivis\E[?25lHide cursor
Cap_homehome\E[HHome cursor (upper left)
Cap_normsgr0\E(B\E[mTurf off attributes
Cap_reverserev\E[7mReverse video cursor
Caps_offop\E[39;49mSet default color
Caps_endline(composite)\E[39;49m\E[KDefault color & new line
Cap_rmamrmam\E[?7lNo auto margins
Cap_smamsmam\E[?7hAuto margins on

Note that \E in xterm sequences represent the 'escape' character (^[), commonly shown in octal as \33


FAQ

What the most important topics not covered?
There are a few things worth further research and discussion:

  • The interaction with the curses library
  • The readproc support subsystem of procps
  • Some features, such as custom configurations, filters, and inspect entries

What insights are there in the strace demonstration in part 1?
Quite a few things, especially if you're interested in the process at load time. Following an strace is beyond the scope of top, but nevertheless, I'll throw out a few things to follow.

The whole process ran for only the initial update and I pressed 'q' to exit, yet over 4439 syscalls were executed, producing that many lines of output. The first syscall was top launching with execvp, presumably just after the shell clones in to this new process. Beyond that:

  • The first ~200 lines show the CRT linked libraries loading
  • main() begins executing around line 342 and we see the signal handlers rewiring
  • The next 4000 lines show top opening /proc entries to build the initial frame.

The most fun appears in the last ~15 lines:

write(1, "\33[H\33[2J\33(B\33[mtop"..., 2048) = 2048 /* Summary write */
write(1, "K\n\33(B\33[m    4 root"..., 1041) = 1041  /* Task window write */
pselect6(1, [0], ...)                                /* Block */
pselect6(1, [0], ...)                                
read(0, "q", 127)                       = 1          /* I press 'q' */
ioctl(0, TCFLSH, TCIFLUSH)              = 0          /* Flush input */
ioctl(0, SNDCTL_TMR_CONTINUE ...}) = 0
write(1, "\33[?1l\33>"..., 30) = 30 /* Move cursor to end and reset */
dup2(3, 2)                              = 2          
close(3)                                = 0
munmap(0x7fd56592d000, 2144488)         = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?         /* Exit success */
+++ exited with 0 +++

Note that writing the summary section pushes the whole 2k buffer, despite using only 5 lines. The message line and task window is more reasonable.

Beyond that, advanced users may be interested in pouring over the /proc access. For example, is it really necessary to open and read the full contents of /etc/passwd a dozen times in a single update?