


An Overview of The Synthesis Operating System

Calton Pu and Henry Massalin

Department of Computer Science
Columbia University
New York, NY 10027
Technical Report No. CUCS-470-89

Abstract

An operating system (OS) maps a model of computation, defined by its kernel interface, onto the underlying hardware. A simple and intuitive model of computation makes it easy for programmers to write applications. An efficient implementation of the mapping makes the applications run fast. Typical OS interfaces for von Neumann hardware include processes (CPU), address spaces (memory), and I/O devices (I/O). An OS for distributed and multiprocessor systems must support parallel processing and inter-process communications. Designed for the distributed and multiprocessor systems of today and tomorrow, Synthesis is an example of the new generation of OS's. The Synthesis model of computation based on macro dataflow makes parallel programming easy. The Synthesis implementation uses innovative ideas from areas as different as compilers and control systems. Kernel Code Synthesis, Reduced Synchronization, Fine-Grain Scheduling, and Valued Redundancy provide high performance. Emulation of guest OS's in Synthesis allows existing software written for other OS's to run on Synthesis with little or no modification. Using hardware and software emulating a SUN-3/160 running SUNOS, the current version of the Synthesis kernel achieves several times to several dozen times speedup for UNIX kernel calls.

*This work is partially funded by the New York State Center for Advanced Technology on Computer and Information Systems under the grant NYSSTF CU-0112580, by the AT&T Foundation under a Special Purpose Grant, and by the National Science Foundation under the grant CDA-88-20754. We gladly acknowledge the hardware parts contributed by AT&T, Hitachi, IBM, and Motorola.

Contents

1 Models of Computation and Operating Systems
  1.1 Models of Computation
  1.2 The 1-1 Model
  1.3 The N-1 Model
  1.4 The N-M Model
2 Multiprocessor and Distributed OS
  2.1 Next-Generation Architectures
  2.2 Next-Generation Computer Systems
  2.3 The Synthesis Model of Computation
3 The Ideas in Synthesis
  3.1 Kernel Code Synthesis
  3.2 Reduced Synchronization
  3.3 Fine-Grain Scheduling
  3.4 Valued Redundancy
  3.5 Old (but Important) Ideas
4 The Implementation of Synthesis
  4.1 Kernel Structure
  4.2 Threads
  4.3 Scheduling
  4.4 Synchronization
  4.5 Input/Output
  4.6 Performance Figures
  4.7 Applications and Demos
  4.8 Fitting Together
5 Conclusion

Figure 2: 1-1 Mapping (DOS)

Judging from their kernel interfaces, most prevailing OS's have virtual machines that are instances of the von Neumann model. There are many reasons for this. The most important external reason is that OS users, especially the programmers, are used to the von Neumann model embodied in procedural programming languages. The most important internal reason is the ease of implementing an efficient OS. It is usually easier to find a simple mapping between two similar models (when both are von Neumann) than between two very different models. In summary, the OS maps its own model of computation (the kernel interface) into the hardware model (the instruction set). We classify the OS implementation into three important classes: from one virtual machine to one CPU (1-1 mapping), from N virtual machines to one CPU (N-1 mapping), and from N virtual machines to M central processing units (N-M mapping). We will consider each in turn.

1.2 The 1-1 Model

MS-DOS [3] is the OS with the most copies sold in history. Many thousands of application programs have been written to run on DOS, including word processors, electronic spreadsheets, graphics display packages, and database managers. It is curious that in many aspects DOS resembles the primitive systems of the 50's, before the IBM/360 Operating System made the term OS popular. Historically, the pioneering computers of the 50's either ran on the bare machine without any systems software or had very simple support systems called I/O executives. A typical executive would be a deck of punched cards that could assemble the program following it, load the resulting assembled code into memory, and execute the program. Sometimes the executive also provided limited I/O support, such as easy access to either punched or magnetic tapes. In comparison, DOS has sophisticated I/O support, including terminal I/O and the disk file system (hence its name: Disk Operating System). However, in terms of running programs, DOS can run only one program at a time, just like the early I/O executives. We call this kind of OS, exemplified by DOS, the 1-1 mapping OS. The first 1 means that the model of computation presented by the OS contains only one single process. The second 1 means that the OS is intended to run on a single-processor machine.


Figure 3: N-1 Mapping (UNIX)

A 1-1 mapping is simple. Figure 2 shows the similarity between the DOS model and the von Neumann model. Since there is only one process that occupies all the memory and runs on the physical CPU all the time, once the program is loaded the OS is only concerned with servicing I/O requests; all the I/O operations, external and internal, are handled by subroutines that encapsulate the hardware.

1.3 The N-1 Model

The next class of OS in complexity, the N-1 mapping OS, is characterized by multiprogramming support, as in UNIX [9]. The N says that we are mapping N processes to a single processor. Figure 3 shows the UNIX N-1 model of computation. Even though the difference between the two figures (2 and 3) appears slight, i.e., the addition of one process, the impact on the system and the users is great. Having N processes sharing the same physical machine introduces new opportunities for better hardware utilization. For example, the 1-1 mapping makes the CPU stop whenever there is an I/O activity. An N-1 mapping could run another process while the original process is waiting for the I/O operation to complete. On the other hand, the sharing also introduces several implementation difficulties for the N-1 mapping. First, we have to isolate the memory allocated to each process from the other processes. Otherwise, their execution may interfere with each other. Second, the OS has to switch the CPU between the processes when appropriate. Deciding when the CPU should run which process is called scheduling, a problem absent in the 1-1 mapping. Like memory and CPU, access to other shared resources such as disk space also needs protection and synchronization. Solving these problems and providing other services make a big difference between typical 1-1 mappings and N-1 mappings. For example, the size of a typical UNIX kernel executable is measured in hundreds of kilobytes, compared to a typical DOS kernel measured in tens of kilobytes. The UNIX virtual machine is called a process, which has exactly one thread of control and one complete address space. From the model of computation point of view, the UNIX process is a uniprocessor. UNIX I/O is based on the idea of streams. Data are moved …


… and carry significant system overhead. Finally, another variety, Mach [1] from CMU, introduced the light-weight threads that share the same address space within a process and have relatively low overhead. Some of the commercially available multiprocessors (such as the Sequent Symmetry and Encore Multimax) run UNIX variants that follow these solutions. Each attempted extension of UNIX represents an incremental improvement. For example, light-weight threads make parallel processing within a process affordable, since the communication between threads through shared memory is fast. However, light-weight threads do not solve the IPC problem at the process level. Also, the introduction of a two-level process hierarchy makes the composition of parallel programs more difficult. Ultimately, it is the original uniprocessor character of the UNIX process that makes parallel processing with UNIX extensions difficult. Before we discuss the details of OS's designed for multiprocessing and distributed processing, we first introduce their hardware base. Then we describe the desirable properties of such OS's. After that we introduce the Synthesis model of computation (an N-M mapping).

2 Multiprocessor and Distributed OS

2.1 Next-Generation Architectures

Although the research on multiprocessor and distributed OS's has been going on for many years, the proliferation of actual parallel and distributed systems started only recently. During the first half of the 1980's, we saw some specialized systems such as Tandem systems for transaction processing and Teradata database machines. During the second half, general-purpose multiprocessor computers with around a dozen powerful CPUs, such as the Encore Multimax and Sequent Symmetry, became available. The evolution of wide-area networking happened at about the same time. ARPANET was developed and deployed towards the end of the 1970's. During the 1980's, we saw the development of NSFnet and others. Network bandwidth grew slowly at the beginning. It took a decade for a one-and-a-half order of magnitude increase (from the 56 Kilobit/second leased lines in ARPANET to 1.5 Megabit/second T-1 channels in NSFnet). Now we expect a three order of magnitude upgrade (Gigabit/second range) within a few years. The next-generation architecture we expect to see in a few years will consist of many machines connected through a very fast network. The machines range from PC's to super-workstations to supercomputers. A typical workstation would have a fast CPU at 25 MIPS (million instructions per second) or more, plenty of memory at 1 to 4 MB (Megabyte) per MIPS of processing power, and both diversity and abundance in I/O devices, including disk space (20 to 100 times the memory size), color graphics displays, sound, and video. A multiprocessor may be a collection of these machines with many sets of CPU and memory on a fast bus.


2.2 Next-Generation Computer Systems

We would want to run a next-generation operating system on top of the next-generation architecture described above. Many desirable properties are missing from our current systems: high performance and availability at low cost, graceful degradation when components break, interoperability with a large set of heterogeneous resources, etc. We will consider each in turn. The very reason for building parallel computers is performance. We always need more net processing power, and parallel processing is an effective way of harnessing computing power at low cost. With additional machines on the network, the next-generation system should provide plenty of computing power in an easy-to-use form. In other words, the system should help the programmers use the many CPUs, solving the communications and synchronization problems with low overhead. With lots of hardware redundancy, a distributed system has the potential for high availability. Even though there may be no need for replicating everything in such a network, we really want graceful degradation when components break down. Ideally, we want to substitute the broken machines with idle machines, maintaining the system performance and availability. After we have reached full utilization, a node crash would decrease only its contribution to the whole system performance and availability. An important attribute of a large system is the ability to encapsulate resources at different levels of abstraction. At the high levels, application programs can use the resources easily, regardless of their location, degree of redundancy, and other technical details. We call the high-level abstraction transparency. At the low levels, system programs can use the resource with the maximum economy. In Lampson's system design hints [4], this is called "do not hide power". Abstractions at several levels usually are specified as a layered interface. Finally, connectivity in a large distributed system inevitably results in heterogeneity. Even though systems with different components may not work together as efficiently as a homogeneous system, we still want to communicate with the other systems. This ability is called interoperability. The main reason for interoperability is that some resources are not available elsewhere, so we have to go to a particular place to get them.

2.3 The Synthesis Model of Computation

Synthesis is a multiprocessor and distributed OS, designed to achieve the goals enumerated in the previous section for an N-M mapping. For system programmers and compiler writers, we want high performance and predictability (for real-time applications). For application programmers, we want an intuitive model of computation that is easy to use for writing parallel and distributed programs. Finally, for the general user we want the support to run existing software. The Synthesis model of computation contains threads of execution, memory protection …

Synthesis I/O devices are byte streams. All the devices support the basic read and write operations on byte streams. The byte stream model of I/O is implemented as a pipeline of device servers. Each physical I/O device has a raw device server, which may be connected to higher-level device servers upstream in the pipeline. For example, the TTY device has a raw device server connected to the cooked server, which filters the byte stream according to the editing control characters such as backspace.
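To make the pipeline-of-servers idea concrete, here is a minimal sketch in C. It is an illustration only, not Synthesis code (the real kernel synthesizes specialized machine code for its servers): the names server, raw_read, and cooked_read are invented, and stdin stands in for the raw keyboard device.

    #include <stdio.h>

    /* Every device server exposes the same byte-stream interface. */
    typedef struct server {
        int (*read)(struct server *self);   /* next byte, or EOF */
        struct server *upstream;            /* where this server gets input */
    } server;

    /* Raw server: reads directly from the "device" (stdin here). */
    static int raw_read(server *self) {
        (void)self;
        return getchar();
    }

    /* Cooked filter: buffers one line and applies the erase (backspace)
     * editing character before releasing bytes downstream. A single
     * static buffer keeps the sketch short (one instance only). */
    static int cooked_read(server *self) {
        static char buf[256];
        static int n, i;
        if (i < n)
            return buf[i++];                 /* drain the edited line */
        n = i = 0;
        for (;;) {
            int c = self->upstream->read(self->upstream);
            if (c == EOF || c == '\n') {
                if (c == '\n') buf[n++] = '\n';
                return n ? buf[i++] : EOF;
            }
            if (c == '\b') { if (n > 0) n--; }       /* erase previous byte */
            else if (n < (int)sizeof buf - 1) buf[n++] = (char)c;
        }
    }

    int main(void) {
        server raw    = { raw_read, NULL };
        server cooked = { cooked_read, &raw };   /* pipeline: raw -> cooked */
        int c;
        while ((c = cooked.read(&cooked)) != EOF)
            putchar(c);
        return 0;
    }

The point of the sketch is the uniform interface: because every server reads from its upstream through the same byte-stream operations, filters compose freely into pipelines.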

3 The Ideas in Synthesis

In the previous section, we described the concept of a model of computation in OS's. Now we proceed to present the innovative ideas in the design and implementation of Synthesis. We refer the interested reader to technical discussions of these ideas in other papers: kernel code synthesis [8], reduced synchronization [6], fine-grain scheduling [5], valued redundancy [7]. All these ideas improve performance, although some have other good properties as well. Before introducing the ideas, we make the fundamental observation in optimization: all the components of a system must be optimized for the whole system to be optimized. Otherwise, the system performance (or availability) is limited by its weakest link. It is in light of this fundamental observation that our relentless pursuit of performance becomes reasonable.

3.1 Kernel Code Synthesis

Traditional OS kernels maintain the system state in data structures such as linked lists. A typical kernel call starts by traversing the appropriate data structures to reach the starting system state, then executes the few machine instructions that perform the actual function of the kernel call. For example, Stonebraker [10] pointed out that the UNIX kernel cost to fetch one byte from the buffer pool is about 1800 instructions on the PDP-11/70. The idea of kernel code synthesis is to capture frequently visited system states in small chunks of code. Instead of traversing the data structures, we branch to the synthesized code directly. The term "code synthesis" refers to the process of creating new code at run-time to capture a system state. In this section, we describe three methods to synthesize code: Factoring Invariants, Collapsing Layers, and Executable Data Structures. Following the PDP-11/70, successive generations of UNIX systems have been optimized in several ways. However, as shown by our measurements (Section 4), the Synthesis kernel using code synthesis outperforms SUNOS by a factor of several times to several dozen times. Now we describe the three code synthesis methods in turn. Borrowing from mathematics, the Factoring Invariants method is based on the observation that a functional restriction is usually easier to calculate than the original function. When a function will be called repeatedly with a constant parameter, we can apply a process called


currying, which simplifies the calculation by substituting the constant and carrying out the calculation then. If the simplified function is called many times, we compensate for the cost of currying and win. The Factoring Invariants method is analogous to the constant folding optimization in compiler code generation. But the difference is also significant. Constant folding eliminates static code at compile time. In contrast, Factoring Invariants skips dynamic data structure traversals in addition to bypassing code at execution time. Factoring Invariants can be applied whenever we synthesize code to shorten the resulting execution path. The Collapsing Layers method is based on the observation that in a layered design, separation between layers is a part of specification, not implementation. In other words, procedure calls and context switches between functional layers can be bypassed at execution time. For example, in a system implementing the layered OSI interface, a naive implementation of the presentation layer would call the session layer, which calls the data link layer, and so on. For execution, we can run a flat function by eliminating these procedure calls. We call this vertical layer collapsing, which is analogous to in-line code substitution for procedure calls in compiler code generation. After a successful layer collapsing, typically we can apply Factoring Invariants to reduce the code further. These opportunities are due to the normal redundancy across layers of code. Examples include data copying and parameter marshaling. An example of the horizontal (context switch) layer collapsing is the pipelined organization of I/O processing. Each stage is a filter in the pipeline, conceptually a thread communicating with its neighbors using queues. If two consecutive filters are finite-state machines, we can collapse the two finite-state machines and the queue into one finite-state machine, eliminating the communication and synchronization overhead. We optimize dedicated I/O devices such as TTY drivers this way. The Executable Data Structures method is based on the observation that many data structures are traversed in a preferred order. Therefore, we insert the traversal code locally into the data structures, making them self-traversing. The hardwired, localized code reduces the traversal overhead to a minimum. Let us consider the simplified example of the active job queue managed by a round-robin scheduler. Each element in the queue contains two short sequences of code: context-switch-out and context-switch-in. The context-switch-out saves the registers and jumps into the next job's context-switch-in routine (in the next element in the queue). The context-switch-in restores the registers, installs the address of its own context-switch-out in the timer interrupt vector table, and resumes processing. An interrupt causing a context switch will trigger the current program's context-switch-out, which saves the current state and branches directly into the next job's context-switch-in. Note that the scheduler has been taken out of the loop. It is the queue itself that does the context switch, with a critical path on the order of ten machine instructions. The scheduler …
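Stepping back to the first of the three methods, the sketch below suggests the effect of Factoring Invariants in C. Real Synthesis emits machine code at run time; here a pre-bound function pointer and captured state stand in for the synthesized code, and all names (file, generic_read, synthesize_reader) are invented for this illustration.

    #include <string.h>

    typedef struct file {
        const char *data;    /* in-memory file contents */
        long size, offset;
    } file;

    static file open_files[16];              /* the kernel's descriptor table */

    /* Generic path: every call repeats the descriptor-table traversal. */
    static long generic_read(int fd, char *dst, long n) {
        file *f = &open_files[fd];           /* lookup paid on every call */
        if (n > f->size - f->offset) n = f->size - f->offset;
        memcpy(dst, f->data + f->offset, n); /* the actual useful work */
        f->offset += n;
        return n;
    }

    /* Factored path: the fd -> file binding is invariant for an open file,
     * so we "curry" it once at open time. Later calls skip the traversal.
     * Usage: reader r = synthesize_reader(fd); r.read(&r, buf, n); */
    typedef struct reader {
        file *f;                             /* invariant, bound at creation */
        long (*read)(struct reader *r, char *dst, long n);
    } reader;

    static long fast_read(reader *r, char *dst, long n) {
        file *f = r->f;                      /* no table traversal here */
        if (n > f->size - f->offset) n = f->size - f->offset;
        memcpy(dst, f->data + f->offset, n);
        f->offset += n;
        return n;
    }

    static reader synthesize_reader(int fd) {
        reader r = { &open_files[fd], fast_read };
        return r;
    }

Only the shape of the optimization matters here: the fd-to-file traversal is paid once, at reader creation, instead of on every call. In real kernel code synthesis the saving is larger still, because the emitted machine code also folds constants and removes branches.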


3.3 Fine-Grain Scheduling

We call scheduling policies fine-grain if they take into account local information in addition to global properties. An example of interesting local information for scheduling is the size of the job's input queue: if it is empty, dispatching the job will merely block for lack of input. Fine-grain scheduling policies are more sensitive to system state changes by definition.

In particular, fine-grain scheduling may smooth the data flow in a pipeline of processes,

eliminating the bottlenecks. We call scheduling mechanisms fine-grain if their scheduling/dispatching costs approach zero. Traditional scheduling mechanisms have high scheduling and dispatching overhead that discourages frequent scheduler decision making. Consequently, most scheduling algorithms tend to minimize their actions. We observe that high scheduling and dispatching overhead is a result of implementation, not an inherent property of all scheduling mechanisms. Fine-grain scheduling mechanisms turn out to differ significantly from traditional mechanisms. Fine-grain scheduling policies and mechanisms together are called "fine-grain scheduling", implemented in the Synthesis operating system. Our approach to fine-grain scheduling policies is similar to feedback mechanisms in control systems. We take a job to be scheduled and measure its progress, making scheduling decisions based on the measurements. For example, if the job is "too slow", say its input queue is getting full, we schedule it more often and let it run longer. The key idea in our fine-grain scheduling policy is based on feedback control, in particular the phase locked loop (PLL). A hardware PLL outputs a frequency synchronized with a reference input frequency. Our software analogs of the PLL track a stream of interrupts to generate a new stable source of interrupts locked in step. The reference stream comes from a variety of sources, say an I/O device (e.g., disk index interrupts that occur once every disk revolution) or the interval timer (e.g., at the end of a CPU quantum). When we use software to implement the PLL idea, we find more flexibility in measurement and control. Unlike hardware PLLs, where we always measure phase differences, in software we can measure either the frequency of the input (events/second) or the time interval between inputs (seconds/event). Analogously, we can adjust either the frequency of generated interrupts or the intervals between them. Combining the two kinds of measurements with the two kinds of adjustments, we get four kinds of software locked loops (SLL). An example of an SLL that measures and adjusts frequency is a digital oversampling filter program for a CD player, which adjusts the filter I/O rate to match the CD player output. An example of an SLL that measures and adjusts time intervals is the disk sector interrupt generator, which decreases the rotational delay in disk access.

As a feedback system, the SLL generates interrupts as a function of its input. As the input interrupt stream changes its frequency or interval, the SLL adjusts its output. For example, an SLL that measures intervals will have as its natural behavior to maintain the interval between


consecutive output interrupts equal to the interval between inputs. Fine-grain scheduling would be impractical without fast interrupt processing, fast context switch, and low dispatching overhead. The Synthesis fine-grain scheduling policy means adjustments every few hundred microseconds based on local information, such as the number of characters waiting in an input queue. Very low overhead scheduling (a few tens of microseconds) and context switch for dispatching (less than ten microseconds) form the foundation of the Synthesis implementation of the fine-grain scheduling mechanism. In addition, we have very low overhead interrupt processing to allow frequent checks on the job progress and quick adjustments to the scheduling policy. Like reduced synchronization, fine-grain scheduling applies only to the N-1 and N-M classes of OS's. With more processors and resources to schedule, fine-grain scheduling is more important for N-M mappings.
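The following sketch suggests how an interval-measuring, interval-adjusting SLL might look in C. The structure, field names, and first-order gain are our own illustrative assumptions, not the Synthesis implementation.

    /* Interval-tracking software locked loop: each reference interrupt
     * nudges the generated-interrupt interval toward the measured input
     * interval. A first-order filter stands in for the real loop filter. */
    typedef struct sll {
        double out_interval;   /* seconds between generated interrupts */
        double last_input;     /* timestamp of previous reference interrupt */
        double gain;           /* 0 < gain <= 1; smaller = more damping */
    } sll;

    /* Called from the reference interrupt handler with the current time. */
    static void sll_on_input(sll *s, double now) {
        double measured = now - s->last_input;   /* seconds/event */
        s->last_input = now;
        /* move a fraction of the error each step (feedback control) */
        s->out_interval += s->gain * (measured - s->out_interval);
    }

    /* The timer driving the output interrupt stream is then reprogrammed
     * to fire every s->out_interval seconds, e.g. after each output tick. */

The other three kinds of SLL follow the same pattern, measuring frequency instead of interval, or adjusting frequency instead of interval.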

3.4 Valued Redundancy

The idea of valued redundancy increases system performance and availability at low cost by replicating the most valuable objects of the system. Since the redundancy management system explicitly maintains a replicated object's cost/performance ratio as its value, it is clear that valued redundancy maximizes system performance and availability objectives while minimizing maintenance cost. In the calculation of an object's value, we take into account the object's costs, performance contributions, and performance goals. Important cost parameters include the object creation costs (in terms of resources consumed) and maintenance costs, such as storage, consistent update across copies, and garbage collection. The performance contributions include the access patterns such as read and write frequency. Finally, the main performance goals are object access time (averaged over the copies) and the apparent object availability (calculated over the attempted accesses). These value calculation parameters are closely related. Between the performance goal of an object and its performance contributions, we have a positive correlation. For example, to maximize system throughput we would set high performance goals for frequently accessed objects. Between the object cost and performance goals, we also have a positive correlation. For instance, we would carry more copies of an object that is read by many nodes. Between the object cost and the performance contributions, we have a trade-off. For example, replica update cost will be high for objects that are modified often, while queries do not carry additional object maintenance cost. With valued redundancy, in principle we can handle specific performance goals for each object. In this paper, we will focus on system-wide performance goals, set in terms of the average behavior. More concretely, we want a system with high throughput, fast response time, and high availability. Valued redundancy will let us concentrate on the objects that have high performance contributions, replicating them to the extent we can afford. As we will …


First, although we have chosen the macro dataflow model of computation for Synthesis, we understand that future research on parallel and distributed processing may very well discover new and better models of computation. Since our choice of the macro dataflow model was made based on what we know now, we want to be able to include new and improved models when they are discovered. We will rely on emulation to maintain application program upward compatibility. Second and more importantly, we use emulation as a scientific way to compare Synthesis performance with other OS's. Since we will run the same code for the same kernel interface on the same hardware, the comparison will be fair. The Synthesis emulation of SUNOS for performance comparison is described below. The main reason for the efficiency of Synthesis emulation is kernel code synthesis (in particular Collapsing Layers and Executable Data Structures).

4 The Implementation of Synthesis

Synthesis is an example of OS research "in practice". We have developed the ideas, applied them in an actual implementation, and refined them with the lessons learned from the implementation. All the performance improvement techniques we used can be summarized in two software engineering principles. We call one of them the principle of independence, which separates the implementation from the specification. Among other examples, kernel code synthesis is made possible by a sufficiently abstract interface. We call the other the principle of frugality, which says that we should use the least powerful solution to a given problem. Optimistic synchronization is an example of frugality. The principle of frugality leads us in a different direction from the usual optimization. Normally people measure systems to find out what is being used the most, say procedure calls and IPC. Their optimization consists of making procedure calls and IPC fast. Instead of this line of work, the principle of frugality says that we should observe what is useful work in a system, e.g., calculations and algorithms, and what is overhead, e.g., procedure calls and IPC. Then we proceed to eliminate the overhead code, instead of trying to make it fast. The idea is: 1 millisecond is faster than 2 milliseconds, but you cannot beat 0 milliseconds.

4.1 Kernel Structure

The Synthesis kernel can be divided into a number of collections of procedures and data. We call these collections of procedures quajects; they encapsulate hardware resources, like Hydra objects [11]. Important quajects in this paper are the I/O device servers and threads. I/O device servers are abstractions of the respective I/O devices, and a thread is an abstraction of the CPU. Since these quajects consist only of procedures and data, they are passive. Events


such as interrupts start the threads that animate the quajects and do work. The quajects do not support inheritance or any other language features. Most quajects are implemented by combining a small number of building blocks. Some of the building blocks are well known, such as monitors, queues, and schedulers. The others are simple but somewhat unusual: switches, pumps, and gauges. The unusual building blocks require some explanation. A switch is equivalent to the C switch statement. For example, switches direct interrupts to the appropriate service routines. A pump contains a thread that actively copies its input into its output. Pumps connect passive producers with passive consumers. A gauge counts events (e.g., procedure calls, data arrival, interrupts). Schedulers use gauges to collect data for scheduling decisions. All of Synthesis I/O is implemented with these building blocks. Applying the principle of independence, each building block may have several implementations. Applying the principle of frugality, we use the most economical implementation depending on the usage. An example is the several kinds of queues in the Synthesis kernel. In general, code synthesis techniques are used to create the most efficient version of a quaject for each particular situation.
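A rough C rendering of two of the unusual building blocks follows, using POSIX threads for the pump's active copying. The types and names are illustrative assumptions, not Synthesis definitions.

    #include <pthread.h>
    #include <stdatomic.h>

    /* Gauge: counts events; schedulers read the count to make decisions. */
    typedef struct gauge { atomic_long count; } gauge;
    static void gauge_tick(gauge *g) { atomic_fetch_add(&g->count, 1); }

    /* Pump: a thread that actively copies from a passive producer to a
     * passive consumer, metering the flow through a gauge. */
    typedef struct pump {
        int  (*get)(void *src);          /* passive producer; < 0 = end */
        void (*put)(void *dst, int c);   /* passive consumer */
        void *src, *dst;
        gauge moved;                     /* bytes moved so far */
    } pump;

    static void *pump_main(void *arg) {
        pump *p = arg;
        int c;
        while ((c = p->get(p->src)) >= 0) {
            p->put(p->dst, c);
            gauge_tick(&p->moved);
        }
        return NULL;
    }

    /* Start a pump: pthread_create(&tid, NULL, pump_main, &p); a scheduler
     * can then poll p.moved.count to see how fast data is flowing. */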

4.2 Threads

Synthesis threads are light-weight processes. Each Synthesis thread (called simply "thread" from now on) executes in a context defined by the computer architecture, including: the register save area, the vector table, the address map tables, and the context-switch-in and context-switch-out procedures synthesized by the kernel. Frequently occurring operations on a thread are context switching, blocking, and unblocking. We now show how we speed up context switching. Context switches are expensive in traditional systems like UNIX because they always do the work of a complete switch: save the registers in a system area, set up the C run-time stack, find the current proc-table entry and copy the registers into it, start the next process,

among other complications (summarized from SUNOS source code [2]).

A Synthesis context-switch is shorter for two reasons. First, we use Executable Data Structures to minimize the critical path. Second, we switch only the part of the context being used, not all of it. The key data structure in context switching is the ready queue, where we find the threads ready to run (waiting for CPU) chained in an executable circular queue. Instead of using data pointers that link the elements of the queue, we have a jump instruction at the end of each context-switch-out procedure of the preceding thread that points to the context-switch-in procedure of the following thread. When the context switch happens, we simply jump to the thread's context-switch-out procedure, which consists of a few instructions to save the registers and ends up jumping into the next thread's context-switch-in procedure. This in turn restores the registers and starts the new thread. Since the context switch code is custom created for each thread, we can further optimize it by moving data only when needed. Two optimization opportunities are in the handling of …
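The executable ready queue just described can be approximated in portable C, as sketched below. The real mechanism is a few synthesized MC68020 instructions ending in a jump; here setjmp/longjmp plays the role of the register save/restore, and the names are invented.

    #include <setjmp.h>

    /* Each ready-queue element carries its own context-switch path. In
     * Synthesis the element *is* code; here a jmp_buf plus a next pointer
     * approximates it. Every thread's ctx must be initialized (a setjmp
     * taken in that thread) before the first switch. */
    typedef struct thread {
        jmp_buf ctx;             /* register save area */
        struct thread *next;     /* next element of the circular queue */
    } thread;

    static thread *current;

    /* context-switch-out: save our registers, then jump straight into the
     * next thread's context-switch-in. No scheduler runs on this path;
     * the queue itself performs the switch. */
    static void context_switch(void) {
        if (setjmp(current->ctx) == 0) {   /* save current thread */
            current = current->next;
            longjmp(current->ctx, 1);      /* restore next thread */
        }
        /* longjmp returns here when this thread is resumed later */
    }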


The basic idea of an optimistic queue is to minimize synchronization overhead between the producer and the consumer: when the queue buffer is neither full nor empty, the consumer and the producer operate on different parts of the buffer. Therefore, synchronization is necessary only when the buffer becomes empty or full. The synchronization primitives are the usual primitives, say busy wait or blocking wait. To avoid losing updates in the queue for a single producer and a single consumer, we use a variant of Code Isolation. The queue manipulation routines update two variables, the queue_head and the queue_tail. If only the producer writes queue_head and only the consumer writes queue_tail, they need not synchronize while queue_head and queue_tail are pointing to different parts of the queue. With some additional care, the queue insert has a critical path of about a dozen MC68020 instructions protected only by a compare-and-swap instruction.
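The single-producer/single-consumer discipline can be sketched with C11 atomics, which stand in for the MC68020's ordering guarantees. Names and sizes are illustrative, and the full/empty cases simply return failure where Synthesis would busy-wait or block.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define QSIZE 256                      /* power of two */

    typedef struct {
        int buf[QSIZE];
        atomic_uint head;                  /* written by producer only */
        atomic_uint tail;                  /* written by consumer only */
    } oqueue;

    static bool oq_put(oqueue *q, int v) { /* producer side */
        unsigned h = atomic_load(&q->head);
        if (h - atomic_load(&q->tail) == QSIZE)
            return false;                  /* full: caller blocks or retries */
        q->buf[h % QSIZE] = v;
        atomic_store(&q->head, h + 1);     /* publish after the write */
        return true;
    }

    static bool oq_get(oqueue *q, int *v) { /* consumer side */
        unsigned t = atomic_load(&q->tail);
        if (atomic_load(&q->head) == t)
            return false;                  /* empty: caller blocks or retries */
        *v = q->buf[t % QSIZE];
        atomic_store(&q->tail, t + 1);
        return true;
    }

Because each side writes only its own counter, the common (neither-full-nor-empty) path needs no lock at all, which is exactly the optimistic property the text describes.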

4.5 Input/Output

In Synthesis, I/O means more than device drivers. I/O includes all data flow among hardware devices and quaspaces. Data move along logical channels we call streams, which connect the source to the destination of data flow. Physical I/O devices are encapsulated in quajects called device servers. Typically, the device server interface supports the usual I/O operations such as read and write. In general, write denotes data flow in the same direction as control flow (from caller to callee), and read denotes data flow in the opposite direction of control flow (from callee back to caller). High-level servers may be composed from more basic servers. At boot time, the kernel creates the servers for the raw physical devices. A simple example of composition is to pipeline the output of a raw server into a filter. Concretely, the Synthesis equivalent of the UNIX cooked tty driver is a filter that processes the output from the raw tty server and interprets the erase and kill control characters. This filter reads characters from the raw keyboard server. To send characters to the screen, however, the filter writes to an optimistic queue, since output can come from both a user program and the echoing of input characters. The default file system server is composed of several filter stages. Connected to the disk hardware we have a raw disk device server. The next stage in the pipeline is the disk scheduler, which contains the disk request queue, followed by the default file system cache manager, which contains the queue of data transfer buffers. Directly connected to the cache manager we have the synthesized code to read the currently open files. The other file systems that share the same physical disk unit would connect to the disk scheduler through a monitor and switch. The disk scheduler then will redirect the data flow to the appropriate stream. Another implementation of queues, called buffered queues, uses kernel code synthesis to generate several specialized queue insert operations (a couple of instructions); each moves a chunk of data into a different area of the same queue element. This way, the overhead of a queue insert is amortized by the blocking factor. For example, the A/D device server handles


44,100 (single word) interrupts per second by packing eight 32-bit words per queue element. This is orders of magnitude faster than current general-purpose OS's.
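The blocking-factor amortization can be pictured with a short C sketch. All names are invented, and queue_insert is a stub standing in for the full (expensive) insert that Synthesis would synthesize.

    #include <stdint.h>

    #define BLOCK 8                       /* blocking factor: 8 words/element */

    typedef struct qelem { uint32_t words[BLOCK]; } qelem;

    /* Stand-in for the full queue insert, paid once per BLOCK samples. */
    static qelem ring[64];
    static int ring_head;
    static void queue_insert(const qelem *e) {
        ring[ring_head++ % 64] = *e;      /* placeholder for the real insert */
    }

    static qelem current;                 /* element being filled */
    static int fill;                      /* words already in it */

    /* Called from the A/D interrupt handler with one sample. The common
     * path is just a store and an increment; the full insert cost is
     * amortized over BLOCK interrupts. */
    static void buffered_put(uint32_t sample) {
        current.words[fill++] = sample;
        if (fill == BLOCK) {
            queue_insert(&current);
            fill = 0;
        }
    }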

4.6 Performance Figures

The current implementation of Synthesis runs on an experimental machine (called the Quamachine), which is similar to a SUN-3: a Motorola 68020 CPU, 2.5 MB no-wait-state main memory, a 390 MB hard disk, and a 3.5 inch floppy drive. In addition, it has some unusual I/O devices: two-channel 16-bit analog output, two-channel 16-bit analog input, a compact disc player interface, and a 2Kx2Kx8-bit framebuffer with graphics co-processor. The Quamachine is designed and instrumented to aid systems research. Measurement facilities include an instruction counter, a memory reference counter, hardware program tracing, and a microsecond-resolution interval timer. The CPU can operate at any clock speed from 1 MHz up to 50 MHz. Normally we run the Quamachine at 50 MHz. By setting the CPU speed to 16 MHz and introducing 1 wait-state into the memory access, the Quamachine can closely emulate the performance of a SUN-3/160. The current implementation of the Synthesis kernel is fully operational, supporting threads, memory, and I/O devices. A number of application and demonstration programs use the Synthesis kernel. The Synthesis kernel is written in 680x0 assembly language. We believe that the Synthesis kernel is quite portable since it contains only about 3000 lines of assembly code. This is less than the assembly code in many C compiler run-time libraries. For detailed measurement numbers we refer the reader to our SOSP paper [6]. Here, we only highlight the performance of the Synthesis kernel. Measurements are made on the UNIX emulator running on top of the Synthesis kernel, which is capable of servicing SUNOS kernel calls. In the simplest case, the emulator translates the UNIX kernel call into an equivalent Synthesis kernel call. Otherwise, multiple Synthesis primitives are combined to emulate a UNIX call. All benchmark programs were compiled on the SUN 3/160, using cc -O under SUNOS release 3.5. The executable a.out was timed on the SUN, then brought over to the Quamachine and executed under the UNIX emulator. To validate our emulation, the first benchmark program is a compute-bound test of similarity between the two machines. This test program implements a function producing a chaotic sequence. It touches a large array at non-contiguous points, which ensures that we are not just measuring the "in-the-cache" performance. With both hardware and software emulation, we run the same object code on equivalent hardware to achieve a fair comparison between Synthesis and SUNOS. In Table 1 we summarize and compare the results of the measurements. The columns under "Raw SUN data" were obtained with the time command and also with a stopwatch. The SUN was unloaded during these measurements, as time reported more than 99% CPU available for them. The Synthesis emulator data were obtained by using the microsecond-resolution real-time clock on the Quamachine, rounded to hundredths of a second. These …
