Benoit A. Gennart, Emin Gabrielyan, Roger D. Hersch, EPFL-DI, Peripheral Systems Lab.
The current trend in the area of high performance computing is to take advantage of the constant and rapid increase in the processing power of desktop workstations, and to create supercomputers by stacking workstations and PCs and connecting them with a high-speed network. Such an arrangement ensures minimal hardware cost by taking advantage of the development effort behind commodity components (microprocessors, network cards, disks). It also raises new issues: the programming interface, input/output, and user and job scheduling.
This paper presents the design issues surrounding input/output to disk on high-performance machines. It presents briefly the basic hardware architecture, and the reasons behind the choice of a distributed memory programming model. It also explains the reasons for offering an interface to distributed scalable data storage. It then describes in detail the design and implementation of the NAFS striped file package, based on the CAP parallel programming extension to C++.
The current approach to high-performance computing is to use the computing power available in existing workstations and PCs and to provide users with tools to extract the best performance out of multiple connected desktop computers. Low-end solutions simply consist of dividing a large task into multiple independent jobs and submitting them to remote computers, as in the SETI@home project (http://www.setileague.org/general/setihome.htm). Higher-end solutions, for problems with more data dependencies, consist of improving the connectivity of an existing computer network, for example by using an Ethernet crossbar switch. Top-end solutions pile up dedicated computers, connect them through a high-speed custom network, and add a high-speed archive to handle large data sets.
While it is fairly easy to harness the computing power of multiple desktop computers connected through crossbar switches, achieving a high I/O throughput remains elusive. Solutions based on multiple RAID servers are possible, but incur additional costs. It is therefore tempting to use the internal desktop-workstation disk drives to mimic a single high-performance archive. The principle is to divide user files into multiple subfiles allocated on different disks in each desktop workstation. A parallel-striped-file software package ensures that the multiple files appear to the user as a single conventional file.
The main design goals of a parallel-striped-file package are portability and scalability. Portability is required to ensure that as many available computers as possible can be used regardless of their operating system and network; it leads to the selection of the most common network protocol, namely sockets. Scalability guarantees that the performance of the parallel-striped-file package increases as computers are added to the architecture. Achieving high performance in an I/O system is notoriously difficult: the latency of individual drives is high, and program I/O operations tend to have small granularity. To improve the performance of the package, it is necessary to design an interface which allows multiple computers to coordinate their I/O activities in what are called collective operations.
Other design goals of a parallel-striped-file package are reliability, transparent job scheduling, and a client-server design. In terms of reliability, it is important for striped files to survive the crash of a single computer or disk in the parallel architecture. Parallel I/O performance is also very dependent on the location of the files accessed by a program; a smart job scheduler should be capable of scheduling jobs and moving striped files automatically so as to ensure maximum overall performance. In distributed systems, multiple user programs may access the same file. While this capability is required in the case of, for example, databases, it is often unnecessary in high-performance computing, where a single user completely controls the machine (or part of it) and the required files for the duration of the computation. At this point in the project, the reliability, transparency and client-server goals are secondary, and the main focus is on portability and scalability.
This paper describes NAFS, a parallel-striped-file package running under UNIX and WindowsNT.
Fig. 8 of the article "Parallel computer architectures for commodity computing and the Swiss-T1 machine" (see page 3) describes the parts of the Swiss-T1 architecture relevant to the I/O subsystem design. This machine consists of 8 processing nodes, each with 4 dual-processor boxes, altogether 64 production processors, and of a four-processor front-end subsystem. The front-end subsystem takes care of resource management and of all external interactions. Two RAID servers connected to the front-end subsystem store the user files; they provide performance through striping and reliability through data redundancy. Each dual-processor box incorporates two 9GB disks, for system and scratch files. The boxes are connected through Ethernet for startup and system messages, and through T-Net for high-speed transfers between user processes.
The issues to be addressed in the design of the Swiss-Tx architecture are performance, scalability, reliability, and portability. The performance of an N-processor architecture must be very close to N times the performance of a single processor; otherwise it is very difficult to justify the increase in size of the architecture. The architecture must be able to survive the failure of one or more nodes. And as a bonus, the software designed to improve performance should work under various UNIX flavors (Solaris, Digital Unix, Linux) as well as under WindowsNT.
The Swiss-Tx machine is a distributed memory architecture. Current implementations of shared-memory models over distributed hardware are expensive and do not deliver high performance for medium-grain and small-grain parallel applications [8, 18]. In fact, to achieve scalable performance, the programmer must handle data decomposition, allocate data subsets to the various nodes in the architecture, and specify explicitly the data transfers between the nodes. Hence the selection of a distributed memory programming model and the use of a message passing interface for parallel programming. The standard message passing API (Application Programmer Interface) is MPI [4,7].
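As a minimal sketch of this programming model (not taken from the Swiss-Tx software), the C++/MPI fragment below shows the programmer explicitly decomposing the data, keeping a node-local subset, and exchanging a boundary value by message passing; the problem size and the ring-exchange pattern are assumptions made only for the example.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // explicit data decomposition: each node owns its own subset of the data
        const int globalSize = 1 << 20;                     // assumed problem size
        std::vector<double> local(globalSize / size, (double)rank);

        // explicit data transfer: pass one boundary value around a ring of nodes
        double sendVal = local.back(), recvVal = 0.0;
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;
        MPI_Sendrecv(&sendVal, 1, MPI_DOUBLE, next, 0,
                     &recvVal, 1, MPI_DOUBLE, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }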
I/O design and library interface
I/O is a major bottleneck in many parallel applications. The main reason for poor application-level I/O performance is that I/O systems are optimized for large accesses, whereas parallel applications typically make many small I/O requests [1,2,9,12,16]. Leaving each parallel program thread to fend for itself results in poor performance, as each thread performs comparatively small I/O requests, each incurring a high latency (see table 1). Much research has demonstrated the benefits of data organization and collective I/O [4,10,14]. In the MPI-IO interface, datatypes take care of data organization, and the API supports non-collective, collective, blocking and non-blocking operations. The Swiss-Tx project has selected the MPI-IO API as the interface to the striped file package.
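To illustrate the collective MPI-IO style mentioned above (a hedged sketch, not code from the NAFS package), the fragment below lets every process of a communicator write its own buffer to a shared file with a single collective call, so that the library can merge the many small requests into large accesses; the file name and the block size are illustrative only.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int count = 50 * 1024;                 // ~50KB per process (assumed)
        std::vector<char> buf(count, (char)rank);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "userfile.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        // collective write: all processes of the communicator participate,
        // each writing its own contiguous region of the shared file
        MPI_Offset offset = (MPI_Offset)rank * count;
        MPI_File_write_at_all(fh, offset, buf.data(), count, MPI_CHAR,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }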
To extract maximal performance from commodity component based architectures, and overcome the high latency of their communication network, it is necessary to overlap processing, communication, and data accesses. Writing such asynchronous parallel programs is tedious and error-prone. To facilitate the development of such programs, we have developed a computer-aided parallelization tool, CAP, and parallel file system components. CAP lets us generate parallel server applications automatically from a high-level description of threads, operations and the macro-dataflow between operations. Because of CAP's macro-dataflow nature, the generated parallel applications are completely asynchronous, without the need for callback functions. Each thread incorporates an input token queue, ensuring that communication occurs in parallel with computation. In addition, the CAP runtime environment executes disk-access operations asynchronously, also without the need for explicit callback functions [6].
We used the CAP tool in several applications, such as the Visible Human Slice Server, which offers access to slices and surfaces within the 3D Visible Human data set (13GBytes), and the RadioControl radio-rating project, which correlates the content of radio programs with the content of wrist-held audio-data recorders (http://www-imt.unine.ch/Radiocontrol). Thanks to CAP the generated applications are flexible: the parallel programs are easy to maintain and modify. Evaluation of the access times for these applications shows that their performance is close to the best the underlying hardware can sustain [6]. The CAP tool is also used in a commercialization effort by A2I (http://www.axsnow.com/), aiming at providing components for manipulating large raster images.
While the development focus of the laboratory is on WindowsNT, the CAP tool is available on both WindowsNT and UNIX (Solaris, Digital). Among future developments is a wizard for specifying graphically the macro-dataflow between operations.
As part of the Swiss-Tx effort, the authors are implementing a portable striped file package called NAFS (Not A File System). To the user, a NAFS striped file is a linearly organized set of bytes. The operations available to manipulate the files are the traditional file operations: create, open, close, delete, read from, and write to a NAFS file at specific offsets. The NAFS files are accessible both through an MPI API and a NAFS API. The aim of NAFS is to make the use of striped files as transparent as possible. To achieve this aim, the NAFS project will provide utilities to move, copy, and display (UNIX cat command) files. Both the NAFS and the MPI APIs hide the striping from the programmer. The striping information can nevertheless be made available to programmers who wish to exploit it to improve performance.
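The NAFS API itself is not listed in this article. Purely to fix ideas, the following header-style sketch shows what a client interface with the operations named above (create, open, close, delete, and read/write at 64-bit offsets) could look like; all names and signatures are hypothetical and are not the actual NAFS declarations.

    #include <cstdint>
    #include <cstddef>
    #include <string>

    struct NafsFile;   // opaque handle to an open striped file (hypothetical)

    // create a striped file with 'subfileCount' subfiles and a given extent size
    NafsFile* nafsCreate(const std::string& metafileName,
                         int subfileCount, std::size_t extentSize);
    NafsFile* nafsOpen(const std::string& metafileName);

    // transfer 'size' bytes at a 64-bit byte offset in the striped file
    std::size_t nafsRead (NafsFile* f, std::uint64_t offset,
                          void* buf, std::size_t size);
    std::size_t nafsWrite(NafsFile* f, std::uint64_t offset,
                          const void* buf, std::size_t size);

    void nafsClose (NafsFile* f);
    void nafsDelete(const std::string& metafileName);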
The following paragraphs address the expected performance of the I/O subsystem, the design of the NAFS package, the issues raised by its implementation, and the current status of the project. The CAP language extension is described in a separate box.
|                          | Ethernet (port to port) | T-Net/MPI (port to port) | local disk   |
|--------------------------|-------------------------|--------------------------|--------------|
| latency                  | 500ms                   | 12µs                     | 10ms         |
| throughput (nominal)     | 12.5MB/s                | 100MB/s                  | 8MB/s        |
| throughput (aggregate)   | 100MB/s                 | 1GB/s                    | 512MB/s      |
| throughput (2KB block)   | 5MB/s                   | 62.5MB/s                 | 0.195MB/s    |
| throughput (50KB block)  | 7.5MB/s                 | 97MB/s                   | 2.5 to 5MB/s |

Table 1 - Performance figures for the Swiss-T1 networks and disks
Expected performance

Table 1 presents the relevant performance figures for the Swiss-T1 architecture. We consider that each box in the architecture contains two processors and two disks. The measured Ethernet bandwidth is 5 to 8MB/s per box. The nominal T-Net throughput is 200MB/s (100MB/s each way), shared between 4 boxes, or 50MB/s per box. We assume that both the Ethernet crossbar and the T-Net crossbar offer sufficient bandwidth to sustain the nominal throughputs at the box level. The next two paragraphs evaluate the distributed I/O design and the centralized-server design.
In the distributed I/O architecture that we have chosen, each dual-processor box is both a producer and a consumer of data in an I/O operation: the processors produce data that is consumed by the disks. In a typical balanced I/O transfer, each box spends half the time sending data and half the time receiving data; hence half the network bandwidth is available for distributed I/O operations. Assuming enough disk bandwidth, the I/O operations can be performed at a rate of 2.5 to 4MB/s per box through Ethernet, and 25MB/s per box through T-Net. In the case of the 32-box T1 architecture, the peak network throughput for I/O is 80 to 128MB/s through Ethernet and 1GB/s through T-Net. The nominal disk throughput is between 2.5 and 5MB/s for 50KB blocks, depending on the locality of the data on disk. In the case of a 32-box architecture with 2 disks per box, the peak disk throughput is 160 to 320MB/s. This back-of-the-envelope analysis suggests that the Ethernet bandwidth is below the disk bandwidth, and that it is therefore necessary to use the T-Net for distributed I/O operations in order to achieve the maximum throughput.
These considerations suggest that the distributed approach offers high performance at low cost, provided the T-Net is used; the Ethernet-based distributed design offers only acceptable performance. The back-of-the-envelope calculations in this section must of course be validated through experiments, and the overhead of the various protocols (NFS, TCP sockets) taken into account.
NAFS design
In this discussion we assume that a parallel program consists of threads. Whether there are multiple user threads per process, as in CAP, or a single user thread per process, as in many MPI implementations, is not important at this point of the design discussion. Each NAFS striped file consists of a metafile and one or more subfiles. The striped file is divided into extents (i.e. contiguous data sets of sufficient size to make a disk access worthwhile, typically 50KB), which are stored in the subfiles in round-robin fashion. A set of extents located at the same position in each subfile is called a stripe. Consider (Fig. 2) a program consisting of a single thread which dumps the content of its memory (a single 1MB block) to the striped file userfile.nafs, with two subfiles /scratch/p0/userfile.sub0 (short name: SF0) and /scratch/p1/userfile.sub1 (short name: SF1) and an extent size of 50KB. Striped-file bytes [0-50K[ are stored in SF0[0-50K[, bytes [50K-100K[ in SF1[0-50K[, bytes [100K-150K[ in SF0[50K-100K[, bytes [150K-200K[ in SF1[50K-100K[, etc.
Fig. 2 - File striping
The metafile contains the number of subfiles, the extent size, the total file size and the list of subfile names (absolute OS paths, accessed through NFS on UNIX machines or through UNCs under WindowsNT). The metafile and subfiles are native OS files, for portability reasons. Each byte in the striped file is described by its 64-bit offset and its value.
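To make the round-robin rule concrete, here is a small self-contained C++ illustration (our own sketch, not NAFS source code) mapping a global byte offset to a subfile index and an offset within that subfile; running it reproduces the mapping of the userfile.nafs example above.

    #include <cstdint>
    #include <cstdio>

    struct StripeLocation {
        int           subfile;        // index of the subfile holding the byte
        std::uint64_t subfileOffset;  // byte offset within that subfile
    };

    StripeLocation locate(std::uint64_t globalOffset,
                          std::uint64_t extentSize, int subfileCount) {
        std::uint64_t extentIndex  = globalOffset / extentSize;   // which extent
        std::uint64_t withinExtent = globalOffset % extentSize;   // offset inside it
        StripeLocation loc;
        loc.subfile       = (int)(extentIndex % subfileCount);    // round robin
        loc.subfileOffset = (extentIndex / subfileCount) * extentSize + withinExtent;
        return loc;
    }

    int main() {
        // reproduces the example of Fig. 2: 50KB extents, 2 subfiles
        const std::uint64_t KB = 1024, extent = 50 * KB;
        std::uint64_t offsets[] = { 0, 50 * KB, 100 * KB, 150 * KB };
        for (std::uint64_t o : offsets) {
            StripeLocation l = locate(o, extent, 2);
            std::printf("byte %8llu -> SF%d[%llu]\n",
                        (unsigned long long)o, l.subfile,
                        (unsigned long long)l.subfileOffset);
        }
        return 0;
    }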
When multiple threads are involved in writing to a striped file, each of them writes its own blocks to the file. The number of subfiles in the striped file need not be equal to the number of threads in the parallel program. Multiple requests to the same subfile are serialized. At the NAFS level, the read and write operations transfer a block-list from or to the striped file, each block being characterized by a size, an offset in the file, and a pointer to the data in memory to be transferred from/to disk. An arbitrary number of threads can take part in collective read/write operations. MPI-IO uses communicators to indicate the threads involved in collective operations, and datatypes to indicate the layout of data both in the file and in memory. A software layer transforms MPI communicators and datatypes into block-lists and sets of threads.
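The block-list abstraction can be pictured as follows; the types and the collective entry point are hypothetical illustrations of the description above, not the actual NAFS declarations.

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    struct Block {
        std::uint64_t fileOffset;   // offset of the block in the striped file
        std::size_t   size;         // number of bytes to transfer
        void*         memory;       // where the data lives in the caller's memory
    };

    using BlockList = std::vector<Block>;

    // Hypothetical collective write: every thread listed in 'participants'
    // calls this with its own block-list; requests targeting the same subfile
    // are serialized by the implementation.
    struct NafsFile;
    int nafsWriteBlocksCollective(NafsFile* file,
                                  const std::vector<int>& participants,
                                  const BlockList& myBlocks);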
NAFS issues
Several issues must be addressed in the design of the NAFS striped-file package: the number of threads per dual-processor box, the support of the high-speed network, the pipelining of data transfers and disk accesses, the protection of files against multiple simultaneous accesses, and redundancy.
In the distributed I/O architecture, each user thread can request I/O operations from any dual-processor box in the architecture; in effect, it acts as a client requesting I/O services. It is therefore necessary that each dual-processor box runs a thread dedicated to serving I/O requests from user threads. In this paper, we refer to user threads as compute-threads and to I/O-request-serving threads as disk-threads.
The analysis in the paragraph Expected performance suggests that it is necessary to use the T-Net to sustain the disk throughput of the distributed I/O design, and the comments of the previous paragraph indicate that a dedicated I/O thread is needed in addition to the usual processing thread(s) in each box. However, many high-performance versions of MPI are not multi-threaded, so the I/O thread must be allocated in a separate process. This is often made difficult by the fact that the requirements of the high-performance network preclude the use of multiple processes. As a result, and also for portability reasons, NAFS and user threads communicate through sockets. In the second phase of the Swiss-Tx project, it is planned to adapt the NAFS parallel-striped-file package to use the high-speed T-Net network.
Fig. 3 - The need for dedicated I/O threads
To achieve peak I/O throughput, it is necessary to ensure that data transfers between dual-processor boxes and I/O transfers to disks are overlapped. The usual solution is the two-phase approach [14], where the next chunk of data is transferred over the network while the current chunk is transferred to disk. The CAP language semantics implies the pipelining of operations, as explained in the paragraph NAFS implementation using CAP; this simplifies the programming of parallel I/O operations on striped files.
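The overlap can be pictured with a plain double-buffering loop, independent of CAP: while chunk k-1 is written to disk by a helper thread, chunk k is brought in over the network. The sketch below is our own illustration under that assumption; receiveChunk and writeChunk are stand-ins for the network transfer and the disk access.

    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <vector>

    static void receiveChunk(std::vector<char>& buf, int k) {
        buf.assign(50 * 1024, (char)k);            // stand-in for a network transfer
    }

    static void writeChunk(const std::vector<char>& buf, std::FILE* fp) {
        std::fwrite(buf.data(), 1, buf.size(), fp); // stand-in for the disk access
    }

    int main() {
        std::FILE* fp = std::fopen("scratch.out", "wb");
        std::vector<char> buf[2];                  // two buffers: receive into one
        std::thread writer;                        // while the other is written out
        const int chunks = 8;
        for (int k = 0; k < chunks; ++k) {
            receiveChunk(buf[k % 2], k);           // overlaps with write of chunk k-1
            if (writer.joinable()) writer.join();  // wait for the previous write
            writer = std::thread(writeChunk, std::cref(buf[k % 2]), fp);
        }
        if (writer.joinable()) writer.join();
        std::fclose(fp);
        return 0;
    }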
NAFS implementation using CAP
This section discusses the implementation of the non-collective and collective read and write operations in terms of a graphical representation of the behavior of the program. The parallel program consists of 5 compute-threads (PC0 to PC4) and 5 disk-threads (IO0 to IO4), and it reads or writes data covering exactly 4 extents (E0 to E3). As explained in section NAFS design, the data handled by a NAFS operation is represented as a single block-list linking all data blocks being written to or read from disk. The striped file written to consists of 3 subfiles SF0, SF1, SF2, handled by threads IO0, IO1, IO2 respectively. Extent E0 (resp. E1, E2, E3) is written to subfile SF0 (resp. SF1, SF2, SF0), according to the round-robin rule.
In this example, one compute-thread (namely PC2) reads a single block covering 4 extents in a 4-subfile striped file (Fig. 4). The compute-thread divides its memory into four blocks covering the extents and sends an extent request for each block (ER0 to ER3) to the appropriate disk-thread. Each disk-thread returns the data to the compute-thread, which copies it into memory. All extents are read in parallel. However, to limit the memory requirements in the case of an operation involving many extents, a flow-control modifier limits the number of extents simultaneously processed by a single disk-thread to a small number (4). The non-collective write operation is similar to the non-collective read operation.
Fig. 4 - Non collective read operation
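For illustration, the splitting performed by the compute-thread can be sketched as follows (a hypothetical helper, not NAFS source code): a contiguous request is cut at extent boundaries, and each piece is tagged with the disk-thread that owns the corresponding subfile under the round-robin rule.

    #include <algorithm>
    #include <cstdint>
    #include <cstddef>
    #include <vector>

    struct ExtentRequest {
        int           diskThread;   // disk-thread (subfile) serving the extent
        std::uint64_t fileOffset;   // start of the piece in the striped file
        std::size_t   size;         // bytes of this piece inside the extent
        char*         memory;       // corresponding region of the caller's buffer
    };

    std::vector<ExtentRequest> splitRequest(std::uint64_t offset, std::size_t size,
                                            char* memory, std::uint64_t extentSize,
                                            int subfileCount) {
        std::vector<ExtentRequest> parts;
        std::uint64_t end = offset + size;
        while (offset < end) {
            std::uint64_t extentIndex = offset / extentSize;
            std::uint64_t extentEnd   = (extentIndex + 1) * extentSize;
            std::size_t   pieceSize   =
                (std::size_t)(std::min(end, extentEnd) - offset);
            parts.push_back({ (int)(extentIndex % subfileCount),
                              offset, pieceSize, memory });
            memory += pieceSize;
            offset += pieceSize;
        }
        return parts;
    }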
In this example, the 5 compute-threads PC0 to PC4 collectively write the data covering the 4 extents E0 to E3 (Fig. 5)¹. Before starting the collective operation, all compute-threads synchronize, and compute-thread PC0 initiates the collective operation by creating an extent-writing request for each of the extents covered by the operation (ER0 to ER3)².
For each extent request, the disk-thread controlling the subfile where the extent is stored sends requests to all compute-threads for the parts of their memory covering the extent (GetBlocks operation). After receiving all blocks from the compute-threads, the disk-thread merges the blocks and writes the extent to disk. All extents are processed in parallel. However, to limit the memory requirements of the operation, a flow-control modifier limits the number of extents simultaneously processed by a single disk-thread to a small number.
Fig. 5 - Collective-write operation
In the collective-read operation, the 5 compute-threads collectively read 4 blocks from 3 subfiles (Fig. 6). The 5 compute-threads first synchronize, and PC0 initiates the collective read operation by asking all compute-threads to divide their respective block-lists into separate block-sublists for each extent. Each compute-thread then sends the block-request sublists to the appropriate disk-threads. Each disk-thread waits for block-sublists. When it has received from all compute-threads the block-sublists corresponding to a given extent, it reads the extent from disk and fills the blocks in the corresponding block-sublists. The disk-threads then send the blocks back to the compute-threads, where they are copied into memory.
Current status
The CAP environment has undergone extensive testing. The experimental performance results published in [6] show that the pipelining strategies of CAP are effective, and that the achieved performance is close to the bound set by the largest of computation time, communication time and disk-access time. The CAP environment runs on WindowsNT, Solaris and Digital Unix. An installation wizard is available under WindowsNT.
The NAFS parallel-striped-file package is implemented and partially tested. It supports blocks of arbitrary size written at arbitrary positions in the striped file (support for MPI datatypes), an arbitrary number of stripes in the striped file, an arbitrary number of processes taking part in collective operations (support for MPI communicators), and non-blocking function interfaces. Current performance on the Swiss-T0 machine is limited by the absence of a FastEthernet crossbar switch. Further testing will be conducted under WindowsNT and UNIX, depending on the available configurations.
The MPI interface to the NAFS parallel-striped-file package is currently under development, reusing as much as possible the ROMIO implementation [17].
Fig. 6 - The collective-read operation
The CAP language extension

This box presents a simple example describing the basic features of the CAP language extension to C++. CAP's main design goal is to implement asynchronous parallel behavior implicitly. To achieve this goal, it provides language constructs (1) to specify and group threads, (2) to specify the data structures exchanged between threads, (3) to specify the operations the threads and thread groups can perform, and (4) to specify the macro-dataflow (pipeline/parallel scheduling) between thread operations.
We first illustrate the asynchronous semantics of CAP. Consider a CAP program which converts data on one processor and writes it to one disk, chunk by chunk. For the sake of simplicity, we assume that the conversion operation does not change the size of a data chunk, nor does it require data from other chunks. Fig. A1 shows the syntax, graphical representation and timing diagram of such a program. In Fig. A1, program lines 1 to 6 define the data structures describing the data chunks moved between the program's two threads. Lines 8 to 17 specify the two threads (DiskThreadT, ComputeThreadT) and the operations they can perform (WriteChunk and ConvertChunk respectively). Lines 19 to 26 logically group the two threads under the name GlobalProcessT, and specify which operation(s) they can perform as a group (ConvertAndWriteData). Line 28 instantiates GlobalProcessT, which automatically creates the two threads under the GlobalProcessT umbrella. In CAP, threads are instantiated upon program initialization; no threads are created during program execution, thereby reducing execution overhead.
Lines 29 to 44 specify the behavior of the program. The input of the program is a DataT token initialized with an open file descriptor FileP, an OffsetInFile, and a Data pointer to the memory data. The asynchronous parallel semantics of the CAP language is entirely handled by the parallel while expression (lines 39 to 44). The SplitChunk C++ function (passed as a parameter to the parallel while expression) incrementally divides the DataT input token inP into several DataT tokens referencing consecutive 50KB blocks in memory. The ComputeThread converts the data chunks one after the other, and forwards each converted token to the DiskThread as soon as it is converted. The DiskThread writes the P[i] tokens to disk one after the other, and generates the M[i] void tokens. As soon as each of the M[i] tokens is available, it is merged by the MergeChunk function (not shown) into the outP void token. When all tokens have been merged, the ConvertAndWriteData operation is complete.
The graphical representation of the ConvertAndWriteData operation (bottom of Fig. A1) matches the textual specification. It indicates that the input token is divided by the split function into several data chunks that are fed to the ComputeThread's ConvertChunk operation. The output of the conversion is fed to the DiskThread's WriteChunk operation. The outputs of the WriteChunk operations are used for synchronization purposes: when all outputs have been received, the ConvertAndWriteData operation is complete.
The SplitChunk and MergeChunk functions and the WriteChunk and ConvertChunk operations are all executed asynchronously, i.e. provided enough processors and disks they could all be executed simultaneously (albeit on different data chunks). It is possible to allocate the DiskThread and the ComputeThread not only on different processors in the same box, but also on different boxes. In that case, the token transfer over the network is automatic and asynchronous, that is, computation, communication and disk accesses are overlapped.
In the case where the split function generates many chunks of data, there is a risk of memory overflow, since the split function is typically much faster than the conversion operation. To work around this, CAP provides flow-control modifiers for its parallel constructs, limiting the number of tokens simultaneously active inside the parallel construct. Fig. A2 shows a modified ConvertAndWriteData where the number of simultaneously active data chunks is limited to 4.
To achieve asynchronous parallel behavior in CAP, the programmer replaces the disk- and compute-threads (lines 21 and 22) by arrays of disk- and compute-threads (e.g. DiskThreadT DiskThread[MAX] and ComputeThreadT ComputeThread[MAX]). The pipelining behavior explained in the previous paragraph is still available, but up to MAX data chunks can be converted or written simultaneously, depending on the available hardware resources. The work of the CAP programmer is then to define the tokens and the operations required to achieve a given algorithm. The CAP language extension supports 8 predefined expressions: 3 for asynchronous parallel behavior (parallel, indexed parallel, parallel while), 3 for iterative behavior (sequence, for, while), and 2 for branching (if, ifelse). CAP programs based on predefined CAP expressions are deadlock-free by construction. CAP programs are reconfigurable without recompilation: the same executable can run on a 1-processor 1-disk low-end PC, on a 4-processor 4-disk shared-memory machine, or on a 10-processor distributed-memory architecture with 60 disks.
 1  token DataT {
 2    FILE* FileP ;
 3    int OffsetInFile ;
 4    int Size ;
 5    char* Data ;
 6  } ;
 7
 8  process DiskThreadT {
 9  operations:
10    WriteChunk in DataT* inP out void* outP;
11  } ;
12
13  process ComputeThreadT {
14  operations:
15    ConvertChunk
16      in DataT* inP out DataT* outP;
17  } ;
18
19  process GlobalProcessT {
20  subprocesses:
21    DiskThreadT DiskThread ;
22    ComputeThreadT ComputeThread ;
23  operations:
24    ConvertAndWriteData
25      in DataT* inP out void* outP;
26  } ;
27
28  GlobalProcessT GlobalProcess ;
29  leaf operation DiskThreadT::WriteChunk
30    in DataT* inP out void* outP
31  { // C++ code to write data chunk to file }
32
33  leaf operation ComputeThreadT::ConvertChunk
34    in DataT* inP out DataT* outP
35  { // C++ code to convert data chunk }
36
37  operation GlobalProcessT::ConvertAndWriteData
38    in DataT* inP out void* outP
39  { parallel while (SplitChunk, MergeChunk,
40      ComputeThread, DataT result)
41    ( ComputeThread.ConvertChunk >->
42      DiskThread.WriteChunk
43    ) ;
44  }
Fig. A1 - CAP specification and pipelining semantics
operation GlobalProcessT::ConvertAndWriteData
  in DataT* inP out void* outP
{ flow_control (4)
  parallel while (SplitChunk, MergeChunk,
    ComputeThread, DataT result)
  ( ComputeThread.ConvertChunk >->
    DiskThread.WriteChunk
  ) ;
}

Fig. A2 - The ConvertAndWriteData operation with a flow-control modifier (at most 4 simultaneously active tokens)
Conclusion

This paper presented the issues involved in the design and implementation of a parallel-striped-file package. It shows that the parallel-striped-file approach is viable, but requires careful implementation: a collective-operation interface, and pipelining between data transfers and disk accesses. Future work will address the implementation of the MPI-IO interface, performance measurements, and possible optimizations of the existing package.
References

[1] S. Baylor and C. Wu. Parallel I/O workload characteristics using Vesta. In R. Jain, J. Werth, and J. Browne, editors, Input/Output in Parallel and Distributed Computer Systems, chapter 7, p. 167-185. Kluwer, 1996.
[2] P. Crandall, R. Aydt, A. Chien, and D. Reed. Input/output characteristics of scalable parallel applications. In Proc. Supercomputing '95. ACM Press, December 1995.
[3] S. Garg. Architecture and design of a highly efficient parallel file system. In Proc. SC'98 High Speed Networking and Computing. Orlando, Florida, USA, November 7 - 13, 1998 (http://www.supercomp.org/sc98/papers/index.html).
[4] W. Gropp, E. Lusk, A. Skjellum, Using MPI, The MIT Press, 1994.
[5] D. Kotz. Disk-directed I/O for MIMD multiprocessors. ACM Transactions on Computer Systems, 15(1):41-74, February 1997.
[6] V. Messerli, O. Figueiredo, B. A. Gennart, and R. D. Hersch. Parallelizing I/O-intensive image access and processing applications. IEEE Concurrency, 7(2):28-37. April-June 1999.
[7] Message Passing Interface Forum, MPI-2: Extensions to the Message-Passing Interface, Technical Report, July 1997, http://www.mpi-forum.org.
[8] C. M. Mobarry, T. Sterling, J. Crawford, and D. Ridge. A comparative analysis of hardware and software support for parallel programming within a global name space. In Proc. SC'96 High Speed Networking and Computing (http://www.supercomp.org/sc96/papers/index.html), 1996.
[9] N. Nieuwejaar, D. Kotz, A Purakayastha, C.Ellis and M. Best. File-access characteristics of parallel scientific workloads. IEEE transactions on Parallel and Distributed Systems, 7(10):1075-1089, October 1996.
[10] J. del Rosario, R. Bordawekar, A. Choudhary. Improved parallel I/O via a two-phase run-time access strategy. In Proc. Workshop on I/O in Parallel Computer Systems at IPPS'93, pages 56-70, April 1993. Also published in Computer Architecture News, 21(5):31-38, December 1993.
[11] K. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-directed collective I/O in Panda. In Proc. Supercomputing '95. ACM Press, December 1995.
[12] E. Smirni, R. Aydt, A. Chien, and D. Reed. I/O requirements of scientific applications. In Proc. 5th IEEE Int. Symp. on High Performance Distributed Computing, p. 49-59. IEEE Computer Society Press, 1996.
[13] J. Sturtevant, M. Christon, P. D. Heerman. PDS/PIO: lightweight libraries for collective parallel I/O. In Proc. SC'98 High Speed Networking and Computing. Orlando, Florida, USA, November 7 - 13, 1998 (http://www.supercomp.org/sc98/papers/index.html ).
[14] R. Thakur and A. Choudhary. An extended two-phase method for accessing sections of out-of-core arrays. Scientific Programming, 5(4):301-317, Winter 1996.
[15] R. Thakur, W. Gropp, and E. Lusk. A case for using MPI's derived datatypes to improve I/O performance. In Proc. SC'98 High Speed Networking and Computing. Orlando, Florida, USA, November 7 - 13, 1998 ( http://www.supercomp.org/sc98/papers/index.html) .
[16] R. Thakur, W. Gropp, and E. Lusk. An experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon using a production application. In Proc. 3rd Int. Conf. of the Austrian Center for Parallel Computation (ACPC) with special emphasis on parallel databases and parallel I/O. LNCS 1127. Springer-Verlag, September 1996.
[17] R. Thakur, W. Gropp, and E. Lusk. Users guide for ROMIO, A high-performance, portable MPI-IO implementation. TR ANL/MCS-TM-234. Mathematics and Computer Science Division, Argonne National Laboratory, revised July 1998.
[18] G. Bell, C. van Ingen. DSM Perspective: another point of view. In Proc. of the IEEE, 87(3):412-417, March 1999.
¹ Each compute-thread reads data from each extent.
² The requests are forwarded to the disk-threads and specify for each disk-thread the extent for which it must gather data from the compute-threads.