Dis Virtual Machine Specification

Lucent Technologies Inc

30 September 1999

Extensively revised by Vita Nuova Limited

5 June 2000, 9 January 2003

1. Introduction

The Dis virtual machine provides the execution environment for programs running under the Inferno operating system. The virtual machine models a CISC-like, three-operand, memory-to-memory architecture. Code can either be interpreted by a C library or compiled on-the-fly into machine code for the target architecture.

This paper defines the virtual machine informally. A separate paper by Winterbottom and Pike[2] discusses its design. The Dis object file format is also defined here. Literals and keywords are in typewriter typeface.

2. Addressing Modes

Operand Size

Operand sizes are defined as follows: a byte is 8 bits, a word or pointer is 32 bits, a float is 64 bits, a big integer is 64 bits. The operand size of each instruction is encoded explicitly by the operand code. The operand size and type are specified by the last character of the instruction mnemonic:

Two more operand types are defined to provide ‘short’ types for use by languages other than Limbo: signed 16-bit integers, called ‘short word’ here, and 32-bit IEEE format floating-point numbers, called ‘short float’ or ‘short real’ here. Support for them is limited to conversion to and from words or floats respectively; the instructions are marked below with a dagger (†).

Memory Organization

Memory for a thread is divided into several separate regions. The code segment stores either a decoded virtual machine instruction stream suitable for execution by the interpreter or flash-compiled native machine code for the host CPU. Neither type of code segment is addressable from the instruction set. At the object code level, PC values are offsets, counted in instructions, from the beginning of the code space.

Data memory is a linear array of bytes, addressed using 32-bit pointers. Words are stored in the native representation of the host CPU. Data types larger than a byte must be stored at addresses aligned to a multiple of the data size. A thread executing a module has access to two regions of addressable data memory. A module pointer (mp register) defines a region of global storage for a particular module; a frame pointer (fp register) defines the current activation record or frame for the thread. Frames are allocated dynamically from a stack by function call and return instructions. The stack is extended automatically from the heap.

The mp and fp registers cannot be addressed directly and can therefore be modified only by call and return instructions.

Effective Addresses

Each instruction can potentially address three operands. The source and destination operands are general, but the middle operand can use any address mode except double indirect. If the middle operand of a three address instruction is omitted, it is assumed to be the same as the destination operand.

The general operands generate an effective address from three basic modes: immediate, indirect and double indirect. The assembler syntax for each mode is:

Garbage Collection

The Dis machine performs both reference-counted and real-time mark-and-sweep garbage collection. This hybrid approach allows code to be generated in several styles: pure reference counted, mark and sweep, or a hybrid of the two approaches. Compiler writers have the freedom to choose how specific types are handled by the machine to optimize code for performance or language implementation. Instruction selection determines which algorithm will be applied to specific types.

When using reference counting, pointers are a special operand type and should only be manipulated using the pointer instructions in order to ensure the correct functioning of the garbage collector. Every memory location that stores a pointer must be known to the interpreter so that it can be initialized and deallocated correctly. The information is transmitted in the form of type descriptors in the object module. Each type descriptor contains a bit vector for a particular type where each bit corresponds to a word in memory. Type descriptors are generated automatically by the Limbo compiler. The assembler syntax for a type descriptor is:

desc    $10, 132, "001F"

The first parameter is the descriptor number, the second is the size in bytes, and the third a pointer map. The map contains a list of hex bytes where each byte maps eight 32-bit words. The most significant bit represents the lowest memory address. A one bit indicates a pointer in memory. The map need not have an entry for every byte; unspecified bytes are assumed to be zero.
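As an illustration of how such a map can be read, the following C sketch tests whether the word at a given byte offset within an object is marked as a pointer. The names is_pointer_slot and WORD_SIZE are hypothetical, chosen for this example only, and the map is assumed to have been converted already from its hex text into bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { WORD_SIZE = 4 };	/* a word or pointer is 32 bits */

/*
 * Report whether the word at byte offset off holds a pointer, given
 * the decoded map bytes. Map bytes beyond maplen are taken as zero,
 * since the map need not cover the whole object.
 */
int
is_pointer_slot(const uint8_t *map, size_t maplen, size_t off)
{
	size_t word = off / WORD_SIZE;		/* index of the 32-bit word */
	size_t byte = word / 8;			/* each map byte covers eight words */
	unsigned bit = 0x80u >> (word % 8);	/* MSB is the lowest address */

	assert(map != NULL || maplen == 0);
	if (byte >= maplen)
		return 0;
	return (map[byte] & bit) != 0;
}
```

For the descriptor shown above, the map bytes are 0x00 and 0x1F, so under this reading the five words at byte offsets 44 through 60 are marked as pointers.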

Throughout this description, the symbolic constant H refers to a nil pointer.

3. Instruction Set

addx - Add

Syntax:     addb    src1, src2, dst

        addf    src1, src2, dst

        addw    src1, src2, dst

        addl    src1, src2, dst

Function:   dst = src1 + src2

The add instructions compute the sum of the operands addressed by src1 and src2 and store the result in the dst operand. For addb the result is truncated to eight bits.

addc - Add strings

Syntax:     addc    src1, src2, dst

Function:   dst = src1 + src2

The addc instruction concatenates the two UTF strings pointed to by src1 and src2; the result is placed in the pointer addressed by dst. If both pointers are H the result will be a zero length string rather than H.

alt - Alternate between communications

Syntax:     alt src, dst

The alt instruction selects between a set of channels ready to communicate. The src argument is the address of a structure of the following form:

struct Alt {

    int nsend;      /* Number of senders */

    int nrecv;      /* Number of receivers */

    struct {

        Channel* c;     /* Channel */

        void*   val;    /* Address of lval/rval */

    } entry[];

};

The vector is divided into two sections; the first lists the channels ready to send values, the second lists channels either ready to receive or an array of channels each of which may be ready to receive. The counts of the sender and receiver channels are stored as the first and second words addressed by src. An alt instruction proceeds by testing each channel for readiness to communicate. A ready channel is added to a list. If the list is empty after each channel has been considered, the thread blocks at the alt instruction waiting for a channel to become ready; otherwise, a channel is picked at random from the ready set.

The alt instruction then uses the selected channel to perform the communication using the val address as either a source for send or a destination for receive. The numeric index of the selected vector element is placed in dst.
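The selection step can be sketched in C as follows. This is only an illustration under stated assumptions: the ready array is a hypothetical stand-in for testing the state of each channel in the vector, alt_pick is an invented name, and rand() is used for the random choice without regard to its quality. A return value of -1 marks the case in which the thread would block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Pick one ready entry at random. ready[i] is nonzero when entry i
 * can communicate. Returns the index of the chosen entry, or -1 when
 * no entry is ready and the thread would have to block.
 */
int
alt_pick(const int *ready, int n)
{
	int i, k, nready = 0;

	assert(ready != NULL && n >= 0);
	for (i = 0; i < n; i++)
		if (ready[i])
			nready++;
	if (nready == 0)
		return -1;
	k = rand() % nready;		/* choose the k-th ready entry */
	for (i = 0; i < n; i++)
		if (ready[i] && k-- == 0)
			return i;
	return -1;			/* not reached */
}
```

The returned index corresponds to the value that alt places in dst.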

andx - Logical AND

Syntax:     andb    src1, src2, dst

        andw    src1, src2, dst

        andl    src1, src2, dst

Function:   dst = src1 & src2

The instructions compute the bitwise AND of the two operands addressed by src1 and src2 and store the result in the dst operand.

beqx - Branch equal

Syntax:     beqb    src1, src2, dst

        beqc    src1, src2, dst

        beqf    src1, src2, dst

        beqw    src1, src2, dst

        beql    src1, src2, dst

Function:   if src1 == src2 then pc = dst

If the src1 operand is equal to the src2 operand, then control is transferred to the program counter specified by the dst operand.

bgex - Branch greater or equal

Syntax:     bgeb    src1, src2, dst

        bgec    src1, src2, dst

        bgef    src1, src2, dst

        bgew    src1, src2, dst

        bgel    src1, src2, dst

Function:   if src1 >= src2 then pc = dst

If the src1 operand is greater than or equal to the src2 operand, then control is transferred to the program counter specified by the dst operand. This instruction performs a signed comparison.

bgtx - Branch greater

Syntax:     bgtb    src1, src2, dst

        bgtc    src1, src2, dst

        bgtf    src1, src2, dst

        bgtw    src1, src2, dst

        bgtl    src1, src2, dst

Function:   if src1 > src2 then pc = dst

If the src1 operand is greater than the src2 operand, then control is transferred to the program counter specified by the dst operand. This instruction performs a signed comparison.

blex - Branch less than or equal

Syntax:     bleb    src1, src2, dst

        blec    src1, src2, dst

        blef    src1, src2, dst

        blew    src1, src2, dst

        blel    src1, src2, dst

Function:   if src1 <= src2 then pc = dst

If the src1 operand is less than or equal to the src2 operand, then control is transferred to the program counter specified by the dst operand. This instruction performs a signed comparison.

bltx - Branch less than

Syntax:     bltb    src1, src2, dst

        bltc    src1, src2, dst

        bltf    src1, src2, dst

        bltw    src1, src2, dst

        bltl    src1, src2, dst

Function:   if src1 < src2 then pc = dst

If the src1 operand is less than the src2 operand, then control is transferred to the program counter specified by the dst operand.

bnex - Branch not equal

Syntax:     bneb    src1, src2, dst

        bnec    src1, src2, dst

        bnef    src1, src2, dst

        bnew    src1, src2, dst

        bnel    src1, src2, dst

Function:   if src1 != src2 then pc = dst

If the src1 operand is not equal to the src2 operand, then control is transferred to the program counter specified by the dst operand.

call - Call local function

Syntax:     call    src, dst

Function:   link(src) = pc

        frame(src) = fp

        mod(src) = 0

        fp = src

        pc = dst

The call instruction performs a function call to a routine in the same module. The src argument specifies a frame created by the frame instruction. The current value of pc is stored in link(src), the current value of fp is stored in frame(src) and the module link register is set to 0. The value of fp is then set to src and control is transferred to the program counter specified by dst.
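The register and frame updates listed under Function can be written out as a short C sketch. The Frame and Regs types and the name call_sketch are assumptions made for this illustration; they do not describe the interpreter's actual data structures, and the sketch simply copies pc rather than modelling how the return address is computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Frame Frame;
struct Frame {
	int32_t	link;	/* saved pc */
	Frame	*frame;	/* saved fp */
	void	*mod;	/* module link, 0 for a local call */
};

typedef struct {
	int32_t	pc;
	Frame	*fp;
} Regs;

/* Apply the effects of "call src, dst" to the register set r. */
void
call_sketch(Regs *r, Frame *src, int32_t dst)
{
	assert(r != NULL && src != NULL);
	src->link = r->pc;	/* link(src) = pc */
	src->frame = r->fp;	/* frame(src) = fp */
	src->mod = NULL;	/* mod(src) = 0 */
	r->fp = src;		/* fp = src */
	r->pc = dst;		/* pc = dst */
}
```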

case - Case compare integer and branch

Syntax:     case    src, dst

Function:   pc = 0..i: dst[i].pc where

          dst[i].lo <= src && src < dst[i].hi

The case instruction jumps to a new location selected by matching a value against a table of ranges. The dst operand points to a table in memory. The first word of the table gives the number of entries. Each entry is three words long: the first word specifies a low value, the second word specifies a high value, and the third word specifies a program counter. The case instruction searches the table for the first entry for which the src operand is greater than or equal to the low word and less than the high word. Control is transferred to the program counter stored in the third word of the matching entry.
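A minimal C sketch of the table search follows. It uses a plain linear scan, and the name case_lookup and the -1 result for a value that matches no entry are assumptions of the sketch; the text above does not say what happens when nothing matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Search a case table: tab[0] is the entry count, followed by
 * (low, high, pc) triples. Returns the pc of the first entry with
 * low <= v < high, or -1 when no entry matches.
 */
int32_t
case_lookup(const int32_t *tab, int32_t v)
{
	int32_t i, n;

	assert(tab != NULL);
	n = tab[0];
	for (i = 0; i < n; i++) {
		const int32_t *e = &tab[1 + 3*i];

		if (e[0] <= v && v < e[1])
			return e[2];
	}
	return -1;
}
```

With a table holding the ranges [0, 10) and [10, 20), the value 10 selects the second entry because the high bound of each range is exclusive.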

casec - Case compare string and branch

Syntax:     casec   src, dst

Function:   pc = 0..i: dst[i].pc where

           dst[i].lo <= src && src < dst[i].hi

The casec instruction jumps to a new location specified by a range of string constants. The table is the same as described for the case instruction.

consx - Allocate new list element

Syntax:     consb   src, dst

        consc   src, dst

        consf   src, dst

        consl   src, dst

        consm   src, dst

        consmp  src, dst

        consp   src, dst

        consw   src, dst

Function:   p = new(src, dst)

        dst = p

The cons instructions add a new element to the head of a list. A new list element is composed from the src operand and a pointer to the head of an extant list specified by dst. The resulting element is stored back into dst.

cvtac - Convert byte array to string

Syntax:     cvtac   src, dst

Function:   dst = string(src)

The src operand must be an array of bytes, which is converted into a character string and stored in dst. The new string is a copy of the bytes in src.

cvtbw - Convert byte to word

Syntax:     cvtbw   src, dst

Function:   dst = src & 0xff

A byte is fetched from the src operand, zero-extended to the size of a word, and then stored into dst.

cvtca - Convert string to byte array

Syntax:     cvtca   src, dst

Function:   dst = array(src)

The src operand must be a string which is converted into an array of bytes and stored in dst. The new array is a copy of the characters in src.

cvtcf - Convert string to real

Syntax:     cvtcf   src, dst

Function:   dst = (float)src

The string addressed by the src operand is converted to a floating point value and stored in the dst operand. Initial white space is ignored; conversion ceases at the first character in the string that is not part of the representation of the floating point value.

cvtcl - Convert string to big

Syntax:     cvtcl   src, dst

Function:   dst = (big)src

The string addressed by the src operand is converted to a big integer and stored in the dst operand. Initial white space is ignored; conversion ceases at the first non-digit in the string.

cvtcw - Convert string to word

Syntax:     cvtcw   src, dst

Function:   dst = (int)src

The string addressed by the src operand is converted to a word and stored in the dst operand. Initial white space is ignored; after a possible sign, conversion ceases at the first non-digit in the string.
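The parsing rule resembles that of the C library function strtol in base 10, which also skips initial white space, accepts an optional sign, and stops at the first non-digit. The sketch below relies on that resemblance; the name cvtcw_sketch is hypothetical, the result of 0 for a string with no digits is strtol's behaviour rather than something stated here, and values outside the 32-bit range are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert the leading decimal integer in s to a 32-bit word. */
int32_t
cvtcw_sketch(const char *s)
{
	assert(s != NULL);
	/* strtol skips white space, reads a sign, stops at a non-digit */
	return (int32_t)strtol(s, NULL, 10);
}
```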

cvtfc - Convert real to string

Syntax:     cvtfc   src, dst

Function:   dst = string(src)

The floating point value addressed by the src operand is converted to a string and stored in the dst operand. The string is a floating point representation of the value.

cvtfw - Convert real to word

Syntax:     cvtfw   src, dst

Function:   dst = (int)src

The floating point value addressed by src is converted into a word and stored into dst. The floating point value is rounded to the nearest integer.

cvtfl - Convert real to big

Syntax:     cvtfl   src, dst

Function:   dst = (big)src

The floating point value addressed by src is converted into a big integer and stored into dst. The floating point value is rounded to the nearest integer.

cvtfr - Convert real to short real†

Syntax:     cvtfr   src, dst

Function:   dst = (short float)src

The floating point value addressed by src is converted to a short (32-bit) floating point value and stored into dst.

cvtlc - Convert big to string

Syntax:     cvtlc   src, dst

Function:   dst = string(src)

The big integer addressed by the src operand is converted to a string and stored in the dst operand. The string is the decimal representation of the big integer.

cvtlw - Convert big to word

Syntax:     cvtlw   src, dst

Function:   dst = (int)src

The big integer addressed by the src operand is converted to a word and stored in the dst operand.

cvtsw - Convert short word to word†

Syntax:     cvtsw   src, dst

Function:   dst = (int)src

The short word addressed by the src operand is converted to a word and stored in the dst operand.

cvtwb - Convert word to byte

Syntax:     cvtwb   src, dst

Function:   dst = (byte)src;

The src operand is converted to a byte and stored in the dst operand.

cvtwc - Convert word to string

Syntax:     cvtwc   src, dst

Function:   dst = string(src)

The word addressed by the src operand is converted to a string and stored in the dst operand. The string is the decimal representation of the word.

cvtwl - Convert word to big

Syntax:     cvtwl   src, dst

Function:   dst = (big)src;

The word addressed by the src operand is converted to a big integer and stored in the dst operand.

cvtwf - Convert word to real

Syntax:     cvtwf   src, dst

Function:   dst = (float)src;

The word addressed by the src operand is converted to a floating point value and stored in the dst operand.

cvtws - Convert word to short word†

Syntax:     cvtws   src, dst

Function:   dst = (short)src;

The word addressed by the src operand is converted to a short word and stored in the dst operand.

cvtlf - Convert big to real

Syntax:     cvtlf   src, dst

Function:   dst = (float)src;

The big integer addressed by the src operand is converted to a floating point value and stored in the dst operand.

cvtrf - Convert short real to real†

Syntax:     cvtrf   src, dst

Function:   dst = (float)src;

The short (32-bit) floating point value addressed by the src operand is converted to a 64-bit floating point value and stored in the dst operand.

divx - Divide

Syntax:     divb    src1, src2, dst

        divf    src1, src2, dst

        divw    src1, src2, dst

        divl    src1, src2, dst

Function:   dst = src2/src1

The src2 operand is divided by the src1 operand and the quotient is stored in the dst operand. Division by zero causes the thread to terminate.

exit - Terminate thread

Syntax:     exit

Function:   exit()

The executing thread terminates. All resources held in the stack are deallocated.

frame - Allocate frame for local call

Syntax:     frame   src1, src2

Function:   src2 = fp + src1->size

        initmem(src2, src1);

The frame instruction creates a new stack frame for a call to a function in the same module. The frame is initialized according to the type descriptor supplied as the src1 operand. A pointer to the newly created frame is stored in the src2 operand.

goto - Computed goto

Syntax:     goto    src, dst

Function:   pc = dst[src]

The goto instruction performs a computed goto. The src operand must be an integer index into a table of PC values specified by the dst operand.

headx - Head of list

Syntax:     headb   src, dst

        headf   src, dst

        headm   src, dst

        headmp  src, dst

        headp   src, dst

        headw   src, dst

        headl   src, dst

Function:   dst = hd src

The head instructions make a copy of the first data item stored in a list. The src operand must be a list of the correct type. The first item is copied into the dst operand. The list is not modified.

indc - Index by character

Syntax:     indc    src1, src2, dst 

Function:   dst = src1[src2]

The indc instruction indexes Unicode strings. The src1 operand must be a string. The src2 operand must be an integer specifying the origin-0 index in src1 of the (Unicode) character to store in the dst operand.

indx - Array index

Syntax:     indx    src1, dst, src2

Function:   dst = &src1[src2]

The indx instruction computes the effective address of an array element. The src1 operand must be an array created by the newa instruction. The src2 operand must be an integer. The effective address of the src2 element of the array is stored in the dst operand.

indx - Index by type

Syntax:     indb    src1, dst, src2

        indw    src1, dst, src2

        indf    src1, dst, src2

        indl    src1, dst, src2

Function:   dst = src1[src2]

The indb, indw, indf and indl instructions index arrays of the basic types. The src1 operand must be an array created by the newa instruction. The src2 operand must be a non-negative integer index less than the array size. The value of the element at the index is loaded into the dst operand.

insc - Insert character into string

Syntax:     insc    src1, src2, dst

Function:   src1[src2] = dst

The insc instruction inserts a character into an existing string. The index in src2 must be a non-negative integer less than the length of the string plus one. (The character will be appended to the string if the index is equal to the string’s length.) The src1 operand must be a string (or nil). The character to insert must be a valid 16-bit Unicode value represented as a word.

jmp - Branch always

Syntax:     jmp dst

Function:   pc = dst

Control is transferred to the location specified by the dst operand.

lea - Load effective address

Syntax:     lea src, dst

Function:   dst = &src

The lea instruction computes the effective address of the src operand and stores it in the dst operand.

lena - Length of array

Syntax:     lena    src, dst

Function:   dst = nelem(src)

The lena instruction computes the length of the array specified by the src operand and stores it in the dst operand.

lenc - Length of string

Syntax:     lenc    src, dst

Function:   dst = utflen(src)

The lenc instruction computes the number of characters in the UTF string addressed by the src operand and stores it in the dst operand.
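The difference between characters and bytes can be shown with a small C sketch that counts the characters in a zero-terminated UTF-8 byte string. The name utf_count is hypothetical, and the input is assumed to be well-formed UTF-8; how the virtual machine represents strings internally is not assumed here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count characters in a zero-terminated UTF-8 string by counting
 * every byte that is not a continuation byte (binary 10xxxxxx).
 */
size_t
utf_count(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```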

lenl - Length of list

Syntax:     lenl    src, dst

Function:   dst = 0;

        for(l = src; l; l = tl l)

            dst++;

The lenl instruction computes the number of elements in the list addressed by the src operand and stores the result in the dst operand.

load - Load module

Syntax:     load    src1, src2, dst

Function:   dst = load src2 src1

The load instruction loads a new module into the heap. The module might optionally be compiled into machine code depending on the module header. The src1 operand is a pathname to the file containing the object code for the module. The src2 operand specifies the address of a linkage descriptor for the module (see below). A reference to the newly loaded module is stored in the dst operand. If the module could not be loaded for any reason, then dst will be set to H.

The linkage descriptor referenced by the src2 operand is a table in data space that lists the functions imported by the current module from the module to be loaded. It has the following layout:

int nentries;

struct {    /* word aligned */

    int sig;

    byte    name[]; /* UTF encoded name, 0-terminated */

} entry[];

The nentries value gives the number of entries in the table and can be zero. It is followed by that many linkage entries. Each entry is aligned on a word boundary; there can therefore be padding before each structure. The entry names the imported function in the UTF-encoded string in name, which is terminated by a byte containing zero. The MD5 hash of the function’s type signature is given in the value sig. For each entry, the load instruction checks that a function with the same name and the same signature exists in the newly loaded module. Otherwise the load will fail and dst will be set to H.
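To make the layout concrete, the following C sketch walks a linkage descriptor held in a byte buffer and looks up an entry by name. It is an illustration only: the names word_align and ldt_find are hypothetical, the table is assumed to start on a word boundary so that alignment can be computed from offsets within the buffer, and words are read in the host's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round an offset up to the next multiple of the 4-byte word size. */
static size_t
word_align(size_t off)
{
	return (off + 3) & ~(size_t)3;
}

/*
 * Find name in the linkage descriptor stored in buf[0..len).
 * Returns the index of the entry and stores its signature in *sig,
 * or returns -1 when the name is absent or the table is truncated.
 */
int
ldt_find(const uint8_t *buf, size_t len, const char *name, int32_t *sig)
{
	int32_t n, s;
	size_t off = 4;
	int i;

	assert(buf != NULL && name != NULL && sig != NULL);
	if (len < 4)
		return -1;
	memcpy(&n, buf, 4);			/* nentries */
	for (i = 0; i < n; i++) {
		const uint8_t *end;

		off = word_align(off);		/* skip padding before the entry */
		if (off + 4 > len)
			return -1;
		memcpy(&s, buf + off, 4);	/* sig */
		off += 4;
		end = memchr(buf + off, 0, len - off);
		if (end == NULL)
			return -1;		/* name is not terminated */
		if (strcmp((const char *)(buf + off), name) == 0) {
			*sig = s;
			return i;
		}
		off = (size_t)(end - buf) + 1;	/* step past the terminator */
	}
	return -1;
}
```

The index returned corresponds to the position of the entry in the table, which is the kind of index the text associates with the linkage records for a loaded module.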

The entries in the linkage descriptor form an array of linkage records (internal to the virtual machine) associated with the module pointer returned in dst, that is indexed by operators mframe, mcall and mspawn to refer to functions in that module. The linkage scheme provides a level of indirection that allows a module to be loaded using any module declaration that is a valid subset of the implementation module’s declaration, and allows entry points to be added to modules without invalidating calling modules.

lsrx - Logical shift right

Syntax:     lsrw    src1, src2, dst

        lsrl    src1, src2, dst

Function:   dst = (unsigned)src2 >> src1

The lsr instructions shift the src2 operand right by the number of bits specified by the src1 operand, replacing the vacated bits by 0, and store the result in the dst operand. Shift counts less than 0 or greater than the number of bits in the object have undefined results. This instruction is included for support of languages other than Limbo, and is not used by the Limbo compiler.
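In C the same effect for a 32-bit word can be sketched by converting the operand to an unsigned type before shifting, as the Function line suggests. The name lsrw_sketch is hypothetical. The assertion is stricter than the rule above: it also rejects a count equal to 32, because C leaves that shift undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Shift v right by count bits, filling the vacated bits with 0. */
uint32_t
lsrw_sketch(int32_t v, int32_t count)
{
	assert(count >= 0 && count < 32);
	return (uint32_t)v >> count;
}
```

For example, shifting the word -8 right by one bit yields 0x7FFFFFFC, whereas an arithmetic shift would keep the sign bit set.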

mcall - Inter-module call

Syntax:     mcall   src1, src2, src3

Function:   link(src1) = pc

        frame(src1) = fp

        mod(src1) = current_moduleptr

        current_moduleptr = src3->moduleptr

        fp = src1

        pc = src3->links[src2]->pc

The mcall instruction calls a function in another module. The first argument specifies a new frame for the called procedure and must have been built using the mframe instruction. The src3 operand is a module reference generated by a successful load instruction. The src2 operand specifies the index for the called function in the array of linkage records associated with that module reference (see the load instruction).

mframe - Allocate inter-module frame

Syntax:     mframe  src1, src2, dst

Function:   dst = fp + src1->links[src2]->t->size

        initmem(dst, src1->links[src2])

The mframe instruction allocates a new frame for a procedure call into another module. The src1 operand specifies the location of a module pointer created as the result of a successful load instruction. The src2 operand specifies the index for the called function in the array of linkage records associated with that module pointer (see the load instruction). A pointer to the initialized frame is stored in dst.

mnewz - Allocate object given type from another module

Syntax:     mnewz   src1, src2, dst

Function:   dst = malloc(src1->types[src2]->size)

        initmem(dst, src1->types[src2]->map)

The mnewz instruction allocates and initializes a new area of memory. The src1 operand specifies the location of a module pointer created as the result of a successful load instruction. The size of the new memory area and the location of pointers within it are specified by the src2 operand, which gives a type descriptor number within that module. Space not occupied by pointers is initialized to zero. A pointer to the initialized object is stored in dst. This instruction is not used by Limbo; it was added to implement other languages.

modx - Modulus

Syntax:     modb    src1, src2, dst

        modw    src1, src2, dst

        modl    src1, src2, dst

Function:   dst = src2 % src1

The modulus instructions compute the remainder of the src2 operand divided by the src1 operand and store the result in dst. The operator preserves the condition that the absolute value of a%b is less than the absolute value of b; (a/b)*b + a%b is always equal to a.
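The two conditions can be checked with a short C sketch, since the / and % operators of C99 and later truncate toward zero and satisfy the same identity. The names divw_sketch and modw_sketch are hypothetical, the operand order follows the Function lines (src2 is divided by src1), and the assertions exclude the cases that C leaves undefined.

```c
#include <assert.h>
#include <stdint.h>

/* dst = src2 / src1 for 32-bit words */
int32_t
divw_sketch(int32_t src1, int32_t src2)
{
	assert(src1 != 0);
	assert(!(src2 == INT32_MIN && src1 == -1));
	return src2 / src1;
}

/* dst = src2 % src1 for 32-bit words */
int32_t
modw_sketch(int32_t src1, int32_t src2)
{
	assert(src1 != 0);
	assert(!(src2 == INT32_MIN && src1 == -1));
	return src2 % src1;
}
```

With a = -7 and b = 3, the quotient is -2 and the remainder is -1, so (a/b)*b + a%b gives -6 + -1, which is -7 again.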

movx - Move scalar

Syntax:     movb    src, dst