The Text Editor sam

Rob Pike

rob@plan9.bell-labs.com

ABSTRACT

Sam is an interactive multi-file text editor intended for bitmap displays. A textual command language supplements the mouse-driven, cut-and-paste interface to make complex or repetitive editing tasks easy to specify. The language is characterized by the composition of regular expressions to describe the structure of the text being modified. The treatment of files as a database, with changes logged as atomic transactions, guides the implementation and makes a general ‘undo’ mechanism straightforward.

Sam is implemented as two processes connected by a low-bandwidth stream, one process handling the display and the other the editing algorithms. Therefore it can run with the display process in a bitmap terminal and the editor on a local host, with both processes on a bitmap-equipped host, or with the display process in the terminal and the editor in a remote host. By suppressing the display process, it can even run without a bitmap terminal.

This paper is reprinted from Software—Practice and Experience, Vol 17, number 11, pp. 813-845, November 1987. The paper has not been updated for the Plan 9 manuals. Although Sam has not changed much since the paper was written, the system around it certainly has. Nonetheless, the description here still stands as the best introduction to the editor.

Introduction

Sam is an interactive text editor that combines cut-and-paste interactive editing with an unusual command language based on the composition of regular expressions. It is written as two programs: one, the ‘host part,’ runs on a UNIX system and implements the command language and provides file access; the other, the ‘terminal part,’ runs asynchronously on a machine with a mouse and bitmap display and supports the display and interactive editing. The host part may be even run in isolation on an ordinary terminal to edit text using the command language, much like a traditional line editor, without assistance from a mouse or display. Most often, the terminal part runs on a Blit1 terminal (actually on a Teletype DMD 5620, the production version of the Blit), whose host connection is an ordinary 9600 bps RS232 link; on the SUN computer the host and display processes run on a single machine, connected by a pipe.

Sam edits uninterpreted ASCII text. It has no facilities for multiple fonts, graphics or tables, unlike MacWrite,2 Bravo,3 Tioga4 or Lara.5 Also unlike them, it has a rich command language. (Throughout this paper, the phrase command language refers to textual commands; commands activated from the mouse form the mouse language.) Sam developed as an editor for use by programmers, and tries to join the styles of the UNIX text editor ed6,7 with that of interactive cut-and-paste editors by providing a comfortable mouse-driven interface to a program with a solid command language driven by regular expressions. The command language developed more than the mouse language, and acquired a notation for describing the structure of files more richly than as a sequence of lines, using a dataflow-like syntax for specifying changes.

The interactive style was influenced by jim,1 an early cut-and-paste editor for the Blit, and by mux,8 the Blit window system. Mux merges the original Blit window system, mpx,1 with cut-and-paste editing, forming something like a multiplexed version of jim that edits the output of (and input to) command sessions rather than files.

The first part of this paper describes the command language, then the mouse language, and explains how they interact. That is followed by a description of the implementation, first of the host part, then of the terminal part. A principle that influenced the design of sam is that it should have no explicit limits, such as upper limits on file size or line length. A secondary consideration is that it be efficient. To honor these two goals together requires a method for efficiently manipulating huge strings (files) without breaking them into lines, perhaps while making thousands of changes under control of the command language. Sam’s method is to treat the file as a transaction database, implementing changes as atomic updates. These updates may be unwound easily to ‘undo’ changes. Efficiency is achieved through a collection of caches that minimizes disc traffic and data motion, both within the two parts of the program and between them.

The terminal part of sam is fairly straightforward. More interesting is how the two halves of the editor stay synchronized when either half may initiate a change. This is achieved through a data structure that organizes the communications and is maintained in parallel by both halves.

The last part of the paper chronicles the writing of sam and discusses the lessons that were learned through its development and use.

The paper is long, but is composed largely of two papers of reasonable length: a description of the user interface of sam and a discussion of its implementation. They are combined because the implementation is strongly influenced by the user interface, and vice versa.

The Interface

Sam is a text editor for multiple files. File names may be provided when it is invoked:

sam file1 file2 ...

and there are commands to add new files and discard unneeded ones. Files are not read until necessary to complete some command. Editing operations apply to an internal copy made when the file is read; the UNIX file associated with the copy is changed only by an explicit command. To simplify the discussion, the internal copy is here called a file, while the disc-resident original is called a disc file.

Sam is usually connected to a bitmap display that presents a cut-and-paste editor driven by the mouse. In this mode, the command language is still available: text typed in a special window, called the sam window, is interpreted as commands to be executed in the current file. Cut-and-paste editing may be used in any window — even in the sam window to construct commands. The other mode of operation, invoked by starting sam with the option -d (for ‘no download’), does not use the mouse or bitmap display, but still permits editing using the textual command language, even on an ordinary terminal, interactively or from a script.

The following sections describe first the command language (under sam\fP-d and in the sam window), and then the mouse interface. These two languages are nearly independent, but connect through the current text, described below.

The Command Language

A file consists of its contents, which are an array of characters (that is, a string); the name of the associated disc file; the modified bit that states whether the contents match those of the disc file; and a substring of the contents, called the current text or dot (see Figures 1 and 2). If the current text is a null string, dot falls between characters. The value of dot is the location of the current text; the contents of dot are the characters it contains. Sam imparts to the text no two-dimensional interpretation such as columns or fields; text is always one-dimensional. Even the idea of a ‘line’ of text as understood by most UNIX programs — a sequence of characters terminated by a newline character — is only weakly supported.

The current file is the file to which editing commands refer. The current text is therefore dot in the current file. If a command doesn’t explicitly name a particular file or piece of text, the command is assumed to apply to the current text. For the moment, ignore the presence of multiple files and consider editing a single file.

Figure 1. A typical sam screen, with the editing menu presented. The sam (command language) window is in the middle, with file windows above and below. (The user interface makes it easy to create these abutting windows.) The partially obscured window is a third file window. The uppermost window is that to which typing and mouse operations apply, as indicated by its heavy border. Each window has its current text highlighted in reverse video. The sam window’s current text is the null string on the last visible line, indicated by a vertical bar. See also Figure 2.

Commands have one-letter names. Except for non-editing commands such as writing the file to disc, most commands make some change to the text in dot and leave dot set to the text resulting from the change. For example, the delete command, d, deletes the text in dot, replacing it by the null string and setting dot to the result. The change command, c, replaces dot by text delimited by an arbitrary punctuation character, conventionally a slash. Thus,

c/Peter/

replaces the text in dot by the string Peter. Similarly,

a/Peter/

(append) adds the string after dot, and

i/Peter/

(insert) inserts before dot. All three leave dot set to the new text, Peter.

Newlines are part of the syntax of commands: the newline character lexically terminates a command. Within the inserted text, however, newlines are never implicit. But since it is often convenient to insert multiple lines of text, sam has a special syntax for that case:

a

some lines of text

to be inserted in the file,

terminated by a period

on a line by itself

.

In the one-line syntax, a newline character may be specified by a C-like escape, so

c/\n/

replaces dot by a single newline character.

Sam also has a substitute command, s:

s/expression/replacement/

substitutes the replacement text for the first match, in dot, of the regular expression. Thus, if dot is the string Peter, the command

s/t/st/

changes it to Pester. In general, s is unnecessary, but it was inherited from ed and it has some convenient variations. For instance, the replacement text may include the matched text, specified by &:

s/Peter/Oh, &, &, &, &!/

There are also three commands that apply programs to text:

UNIX program

replaces dot by the output of the UNIX program. Similarly, the > command runs the program with dot as its standard input, and | does both. For example,

| sort

replaces dot by the result of applying the standard sorting utility to it. Again, newlines have no special significance for these sam commands. The text acted upon and resulting from these commands is not necessarily bounded by newlines, although for connection with UNIX programs, newlines may be necessary to obey conventions.

One more command: p prints the contents of dot. Table I summarizes sam’s commands.

The value of dot may be changed by specifying an address for the command. The simplest address is a line number:

3

refers to the third line of the file, so

3d

deletes the third line of the file, and implicitly renumbers the lines so the old line 4 is now numbered 3. (This is one of the few places where sam deals with lines directly.) Line 0 is the null string at the beginning of the file. If a command consists of only an address, a p command is assumed, so typing an unadorned 3 prints line 3 on the terminal. There are a couple of other basic addresses: a period addresses dot itself; and a dollar sign ($) addresses the null string at the end of the file.

An address is always a single substring of the file. Thus, the address 3 addresses the characters after the second newline of the file through the third newline of the file. A compound address is constructed by the comma operator

address1,address2

and addresses the substring of the file from the beginning of address1 to the end of address2. For example, the command 3,5p prints the third through fifth lines of the file and .,$d deletes the text from the beginning of dot to the end of the file.

These addresses are all absolute positions in the file, but sam also has relative addresses, indicated by + or -. For example,

$-3

is the third line before the end of the file and

.+1

is the line after dot. If no address appears to the left of the + or -, dot is assumed; if nothing appears to the right, 1 is assumed. Therefore, .+1 may be abbreviated to just a plus sign.

The + operator acts relative to the end of its first argument, while the - operator acts relative to the beginning. Thus .+1 addresses the first line after dot, .- addresses the first line before dot, and +- refers to the line containing the end of dot. (Dot may span multiple lines, and + selects the line after the end of dot, then - backs up one line.)

The final type of address is a regular expression, which addresses the text matched by the expression. The expression is enclosed in slashes, as in

/expression/

The expressions are the same as those in the UNIX program egrep,6,7 and include closures, alternations, and so on. They find the leftmost longest string that matches the expression, that is, the first match after the point where the search is started, and if more than one match begins at the same spot, the longest such match. (I assume familiarity with the syntax for regular expressions in UNIX programs.9) For example,

/x/

matches the next x character in the file,

/xx*/

matches the next run of one or more x’s, and

/x|Peter/

matches the next x or Peter. For compatibility with other UNIX programs, the ‘any character’ operator, a period, does not match a newline, so

/.*/

matches the text from dot to the end of the line, but excludes the newline and so will not match across the line boundary.

Regular expressions are always relative addresses. The direction is forwards by default, so /Peter/ is really an abbreviation for +/Peter/. The search can be reversed with a minus sign, so

-/Peter/

finds the first Peter before dot. Regular expressions may be used with other address forms, so 0+/Peter/ finds the first Peter in the file and $-/Peter/ finds the last. Table II summarizes sam’s addresses.

The language discussed so far will not seem novel to people who use UNIX text editors such as ed or vi.9 Moreover, the kinds of editing operations these commands allow, with the exception of regular expressions and line numbers, are clearly more conveniently handled by a mouse-based interface. Indeed, sam’s mouse language (discussed at length below) is the means by which simple changes are usually made. For large or repetitive changes, however, a textual language outperforms a manual interface.

Imagine that, instead of deleting just one occurrence of the string Peter, we wanted to eliminate every Peter. What’s needed is an iterator that runs a command for each occurrence of some text. Sam’s iterator is called x, for extract:

x/expressioncommand

finds all matches in dot of the specified expression, and for each such match, sets dot to the text matched and runs the command. So to delete all the Peters:

0,$ x/Peter/ d

(Blanks in these examples are to improve readability; sam neither requires nor interprets them.) This searches the entire file (0,$) for occurrences of the string Peter, and runs the d command with dot set to each such occurrence. (By contrast, the comparable ed command would delete all lines containing Peter; sam deletes only the Peters.) The address 0,$ is commonly used, and may be abbreviated to just a comma. As another example,

, x/Peter/ p

prints a list of Peters, one for each appearance in the file, with no intervening text (not even newlines to separate the instances).

Of course, the text extracted by x may be selected by a regular expression, which complicates deciding what set of matches is chosen — matches may overlap. This is resolved by generating the matches starting from the beginning of dot using the leftmost-longest rule, and searching for each match starting from the end of the previous one. Regular expressions may also match null strings, but a null match adjacent to a non-null match is never selected; at least one character must intervene. For example,

, c/AAA/

x/B*/ c/-/

, p

produces as output

-A-A-A-

because the pattern B* matches the null strings separating the A’s.

The x command has a complement, y, with similar syntax, that executes the command with dot set to the text between the matches of the expression. For example,

, c/AAA/

y/A/ c/-/

, p

produces the same result as the example above.

The x and y commands are looping constructs, and sam has a pair of conditional commands to go with them. They have similar syntax:

g/expressioncommand

(guard) runs the command exactly once if dot contains a match of the expression. This is different from x, which runs the command for each match: x loops; g merely tests, without changing the value of dot. Thus,

, x/Peter/ d

deletes all occurrences of Peter, but

, g/Peter/ d

deletes the whole file (reduces it to a null string) if Peter occurs anywhere in the text. The complementary conditional is v, which runs the command if there is no match of the expression.

These control-structure-like commands may be composed to construct more involved operations. For example, to print those lines of text that contain the string Peter:

, x/.*\n/ g/Peter/ p

The x breaks the file into lines, the g selects those lines containing Peter, and the p prints them. This command gives an address for the x command (the whole file), but because g does not have an explicit address, it applies to the value of dot produced by the x command, that is, to each line. All commands in sam except for the command to write a file to disc use dot for the default address.

Composition may be continued indefinitely.

, x/.*\n/ g/Peter/ v/SaltPeter/ p

prints those lines containing Peter but not those containing SaltPeter.

Structural Regular Expressions

Unlike other UNIX text editors, including the non-interactive ones such as sed and awk,7 sam is good for manipulating files with multi-line ‘records.’ An example is an on-line phone book composed of records, separated by blank lines, of the form

Herbert Tic

44 Turnip Ave., Endive, NJ

201-5555642

Norbert Twinge

16 Potato St., Cabbagetown, NJ

201-5553145

...

The format may be encoded as a regular expression:

(.+\n)+

that is, a sequence of one or more non-blank lines. The command to print Mr. Tic’s entire record is then

, x/(.+\n)+/ g/^Herbert Tic$/ p

and that to extract just the phone number is

, x/(.+\n)+/ g/^Herbert Tic$/ x/^[0-9]*-[0-9]*\n/ p

The latter command breaks the file into records, chooses Mr. Tic’s record, extracts the phone number from the record, and finally prints the number.

A more involved problem is that of renaming a particular variable, say n, to num in a C program. The obvious first attempt,

, x/n/ c/num/

is badly flawed: it changes not only the variable n but any letter n that appears. We need to extract all the variables, and select those that match n and only n:

, x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/

The pattern [A-Za-z_][A-Za-z_0-9]* matches C identifiers. Next g/n/ selects those containing an n. Then v/../ rejects those containing two (or more) characters, and finally c/num/ changes the remainder (identifiers n) to num. This version clearly works much better, but there may still be problems. For example, in C character and string constants, the sequence \n is interpreted as a newline character, and we don’t want to change it to \num. This problem can be forestalled with a y command:

, y/\\n/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/

(the second \ is necessary because of lexical conventions in regular expressions), or we could even reject character constants and strings outright:

,y/’[^’]*’/ y/"[^"]*"/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/

The y commands in this version exclude from consideration all character constants and strings. The only remaining problem is to deal with the possible occurrence of \’ or \" within these sequences, but it’s easy to see how to resolve this difficulty.

The point of these composed commands is successive refinement. A simple version of the command is tried, and if it’s not good enough, it can be honed by adding a clause or two. (Mistakes can be undone; see below. Also, the mouse language makes it unnecessary to retype the command each time.) The resulting chains of commands are somewhat reminiscent of shell pipelines.7 Unlike pipelines, though, which pass along modified data, sam commands pass a view of the data. The text at each step of the command is the same, but which pieces are selected is refined step by step until the correct piece is available to the final step of the command line, which ultimately makes the change.

In other UNIX programs, regular expressions are used only for selection, as in the sam g command, never for extraction as in the x or y command. For example, patterns in awk7 are used to select lines to be operated on, but cannot be used to describe the format of the input text, or to handle newline-free text. The use of regular expressions to describe the structure of a piece of text rather than its contents, as in the x command, has been given a name: structural regular expressions. When they are composed, as in the above example, they are pleasantly expressive. Their use is discussed at greater length elsewhere.10

Multiple files

Sam has a few other commands, mostly relating to input and output.

e discfilename

replaces the contents and name of the current file with those of the named disc file;

w discfilename

writes the contents to the named disc file; and

r discfilename

replaces dot with the contents of the named disc file. All these commands use the current file’s name if none is specified. Finally,

f discfilename

changes the name associated with the file and displays the result:

’-. discfilename

This output is called the file’s menu line, because it is the contents of the file’s line in the button 3 menu (described in the next section). The first three characters are a concise notation for the state of the file. The apostrophe signifies that the file is modified. The minus sign indicates the number of windows open on the file (see the next section): - means none, + means one, and * means more than one. Finally, the period indicates that this is the current file. These characters are useful for controlling the X command, described shortly.

Sam may be started with a set of disc files (such as all the source for a program) by invoking it with a list of file names as arguments, and more may be added or deleted on demand.

B discfile1 discfile2 ...

adds the named files to sam’s list, and

D discfile1 discfile2 ...

removes them from sam’s memory (without effect on associated disc files). Both these commands have a syntax for using the shell7 (the UNIX command interpreter) to generate the lists:

B <echo *.c

will add all C source files, and

B <grep -l variable *.c

will add all C source files referencing a particular variable (the UNIX command grep\fP-l lists all files in its arguments that contain matches of the specified regular expression). Finally, D without arguments deletes the current file.

There are two ways to change which file is current:

b filename

makes the named file current. The B command does the same, but also adds any new files to sam’s list. (In practice, of course, the current file is usually chosen by mouse actions, not by textual commands.) The other way is to use a form of address that refers to files:

"expressionaddress

refers to the address evaluated in the file whose menu line matches the expression (there must be exactly one match). For example,

"peter.c" 3

refers to the third line of the file whose name matches peter.c. This is most useful in the move (m) and copy (t) commands:

0,$ t "peter.c" 0

makes a copy of the current file at the beginning of peter.c.

The X command is a looping construct, like x, that refers to files instead of strings:

X/expressioncommand

runs the command in all files whose menu lines match the expression. The best example is

X/’/ w

which writes to disc all modified files. Y is the complement of X: it runs the command on all files whose menu lines don’t match the expression:

Y/\.c/ D

deletes all files that don’t have .c in their names, that is, it keeps all C source files and deletes the rest.

Braces allow commands to be grouped, so

{

    command1

    command2

}

is syntactically a single command that runs two commands. Thus,

X/\.c/ ,g/variable/ {

    f

    , x/.*\n/ g/variable/ p

}

finds all occurrences of variable in C source files, and prints out the file names and lines of each match. The precise semantics of compound operations is discussed in the implementation sections below.

Finally, the undo command, u, undoes the last command, no matter how many files were affected. Multiple undo operations move further back in time, so

u

u

(which may be abbreviated u2) undoes the last two commands. An undo may not be undone, however, nor may any command that adds or deletes files. Everything else is undoable, though, including for example e commands:

e filename

u

restores the state of the file completely, including its name, dot, and modified bit. Because of the undo, potentially dangerous commands are not guarded by confirmations. Only D, which destroys the information necessary to restore itself, is protected. It will not delete a modified file, but a second D of the same file will succeed regardless. The q command, which exits sam, is similarly guarded.

Mouse Interface

Sam is most commonly run connected to a bitmap display and mouse for interactive editing. The only difference in the command language between regular, mouse-driven sam and sam\fP-d is that if an address is provided without a command, sam\fP-d will print the text referenced by the address, but regular sam will highlight it on the screen — in fact, dot is always highlighted (see Figure 2).

Figure 2. A sam window. The scroll bar down the left represents the file, with the bubble showing the fraction visible in the window. The scroll bar may be manipulated by the mouse for convenient browsing. The current text, which is highlighted, need not fit on a line. Here it consists of one partial line, one complete line, and final partial line.

Each file may have zero or more windows open on the display. At any time, only one window in all of sam is the current window, that is, the window to which typing and mouse actions refer; this may be the sam window (that in which commands may be typed) or one of the file windows. When a file has multiple windows, the image of the file in each window is always kept up to date. The current file is the last file affected by a command, so if the sam window is current, the current window is not a window on the current file. However, each window on a file has its own value of dot, and when switching between windows on a single file, the file’s value of dot is changed to that of the window. Thus, flipping between windows behaves in the obvious, convenient way.

The mouse on the Blit has three buttons, numbered left to right. Button 3 has a list of commands to manipulate windows, followed by a list of ‘menu lines’ exactly as printed by the f command, one per file (not one per window). These menu lines are sorted by file name. If the list is long, the Blit menu software will make it more manageable by generating a scrolling menu instead of an unwieldy long list. Using the menu to select a file from the list makes that file the current file, and the most recently current window in that file the current window. But if that file is already current, selecting it in the menu cycles through the windows on the file; this simple trick avoids a special menu to choose windows on a file. If there is no window open on the file, sam changes the mouse cursor to prompt the user to create one.

The commands on the button 3 menu are straightforward (see Figure 3), and are like the commands to manipulate windows in mux,8 the Blit’s window system. New makes a new file, and gives it one empty window, whose size is determined by a rectangle swept by the mouse. Zerox prompts for a window to be selected, and makes a clone of that window; this is how multiple windows are created on one file. Reshape changes the size of the indicated window, and close deletes it. If that is the last window open on the file, close first does a D command on the file. Write is identical to a w command on the file; it is in the menu purely for convenience. Finally, ~~sam~~ is a menu item that appears between the commands and the file names. Selecting it makes the sam window the current window, causing subsequent typing to be interpreted as commands.

Figure 3. The menu on button 3. The black rectangle on the left is a scroll bar; the menu is limited to the length shown to prevent its becoming unwieldy. Above the ~~sam~~ line is a list of commands; beneath it is a list of files, presented exactly as with the f command.

When sam requests that a window be swept, in response to new, zerox or reshape, it changes the mouse cursor from the usual arrow to a box with a small arrow. In this state, the mouse may be used to indicate an arbitrary rectangle by pressing button 3 at one corner and releasing it at the opposite corner. More conveniently, button 3 may simply be clicked, whereupon sam creates the maximal rectangle that contains the cursor and abuts the sam window. By placing the sam window in the middle of the screen, the user can define two regions (one above, one below) in which stacked fully-overlapping windows can be created with minimal fuss (see Figure 1). This simple user interface trick makes window creation noticeably easier.

The cut-and-paste editor is essentially the same as that in Smalltalk-80.11 The text in dot is always highlighted on the screen. When a character is typed it replaces dot, and sets dot to the null string after the character. Thus, ordinary typing inserts text. Button 1 is used for selection: pressing the button, moving the mouse, and lifting the button selects (sets dot to) the text between the points where the button was pressed and released. Pressing and releasing at the same point selects a null string; this is called clicking. Clicking twice quickly, or double clicking, selects larger objects; for example, double clicking in a word selects the word, double clicking just inside an opening bracket selects the text contained in the brackets (handling nested brackets correctly), and similarly for parentheses, quotes, and so on. The double-clicking rules reflect a bias toward programmers. If sam were intended more for word processing, double-clicks would probably select linguistic structures such as sentences.

If button 1 is pressed outside the current window, it makes the indicated window current. This is the easiest way to switch between windows and files.

Pressing button 2 brings up a menu of editing functions (see Figure 4). These mostly apply to the selected text: cut deletes the selected text, and remembers it in a hidden buffer called the snarf buffer, paste replaces the selected text by the contents of the snarf buffer, snarf just copies the selected text to the snarf buffer, look searches forward for the next literal occurrence of the selected text, and <mux> exchanges snarf buffers with the window system in which sam is running. Finally, the last regular expression used appears as a menu entry to search forward for the next occurrence of a match for the expression.

Figure 4. The menu on button 2. The bottom entry tracks the most recently used regular expression, which may be literal text.

The relationship between the command language and the mouse language is entirely due to the equality of dot and the selected text chosen with button 1 on the mouse. For example, to make a set of changes in a C subroutine, dot can be set by double clicking on the left brace that begins the subroutine, which sets dot for the command language. An address-free command then typed in the sam window will apply only to the text between the opening and closing braces of the function. The idea is to select what you want, and then say what you want to do with it, whether invoked by a menu selection or by a typed command. And of course, the value of dot is highlighted on the display after the command completes. This relationship between mouse interface and command language is clumsy to explain, but comfortable, even natural, in practice.

The Implementation

The next few sections describe how sam is put together, first the host part, then the inter-component communication, then the terminal part. After explaining how the command language is implemented, the discussion follows (roughly) the path of a character from the temporary file on disc to the screen. The presentation centers on the data structures, because that is how the program was designed and because the algorithms are easy to provide, given the right data structures.

Parsing and execution

The command language is interpreted by parsing each command with a table-driven recursive descent parser, and when a complete command is assembled, invoking a top-down executor. Most editors instead employ a simple character-at-a-time lexical scanner. Use of a parser makes it easy and unambiguous to detect when a command is complete, which has two advantages. First, escape conventions such as backslashes to quote multiple-line commands are unnecessary; if the command isn’t finished, the parser keeps reading. For example, a multiple-line append driven by an x command is straightforward:

x/.*\n/ g/Peter/ a

one line about Peter

another line about Peter

.

Other UNIX editors would require a backslash after all but the last line.

The other advantage is specific to the two-process structure of sam. The host process must decide when a command is completed so the command interpreter can be called. This problem is easily resolved by having the lexical analyzer read the single stream of events from the terminal, directly executing all typing and mouse commands, but passing to the parser characters typed to the sam command window. This scheme is slightly complicated by the availability of cut-and-paste editing in the sam window, but that difficulty is resolved by applying the rules used in mux: when a newline is typed to the sam window, all text between the newline and the previously typed newline is made available to the parser. This permits arbitrary editing to be done to a command before typing newline and thereby requesting execution.

The parser is driven by a table because the syntax of addresses and commands is regular enough to be encoded compactly. There are few special cases, such as the replacement text in a substitution, so the syntax of almost all commands can be encoded with a few flags. These include whether the command allows an address (for example, e does not), whether it takes a regular expression (as in x and s), whether it takes replacement text (as in c or i), which may be multi-line, and so on. The internal syntax of regular expressions is handled by a separate parser; a regular expression is a leaf of the command parse tree. Regular expressions are discussed fully in the next section.

The parser table also has information about defaults, so the interpreter is always called with a complete tree. For example, the parser fills in the implicit 0 and $ in the abbreviated address , (comma), inserts a + to the left of an unadorned regular expression in an address, and provides the usual default address . (dot) for commands that expect an address but are not given one.

Once a complete command is parsed, the evaluation is easy. The address is evaluated left-to-right starting from the value of dot, with a mostly ordinary expression evaluator. Addresses, like many of the data structures in sam, are held in a C structure and passed around by value:

typedef long Posn;    /* Position in a file */

typedef struct Range{

        Posn    p1, p2;

}Range;

typedef struct Address{

        Range   r;

        File    *f;

}Address;

An address is encoded as a substring (character positions p1 to p2) in a file f. (The data type File is described in detail below.)

The address interpreter is an Address-valued function that traverses the parse tree describing an address (the parse tree for the address has type Addrtree):

Address