EPFL-SCR No 12
Emin Gabrielyan, EPFL, Computer Science Dept. Peripheral Systems Lab., Emin.Gabrielyan@epfl.ch
This article presents the architecture of a distributed file system (SFIO) for managing parallel input/output in an MPI environment. Various techniques for optimizing communications and disk accesses are presented. Using MPI derived datatypes, fragmented data to be written to disk can be transmitted over the network with a single MPI command. We present the I/O performance of the distributed file system on the Swiss-Tx supercomputer, built from Compaq Alpha compute and I/O nodes.
This paper presents the design and evaluation of a Striped File I/O (SFIO) library for parallel I/O in an MPI environment. We present techniques for optimizing communications and disk accesses for small striping factors. Using MPI derived datatype capabilities, we transmit fragmented data over the network by single MPI transfers. We present first results regarding the I/O performance of the SFIO library on Compaq Alpha clusters, both for the Fast Ethernet and for the TNet Communication networks.
For I/O bound parallel applications, parallel file striping is an alternative
to Storage Area Networks (SAN). In particular, parallel file striping offers
high throughput I/O capabilities at a much cheaper price, since it does not
require a special network for accessing the mass storage sub-system [6].
Important aspects of parallel I/O systems are highly concurrent access to common data files by all parallel application processes and a linear increase in performance as the number of I/O nodes and processors grows. Parallelism for input/output operations can be achieved by striping the data across multiple disks so that read and write operations occur in parallel (see Fig. 1).
Fig. 1 - File striping
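To make the striping scheme concrete, the following sketch (our own illustration, not code from any of the cited systems) maps a global byte offset to a sub-file index and to a local offset inside that sub-file, given the stripe unit size and the number of sub-files.

#include <stdio.h>

/* Illustrative striping arithmetic (our own sketch): locate a global
   byte offset inside a file striped over nsubfiles with a fixed
   stripe unit size. */
static void locate(long offset, int stripe, int nsubfiles,
                   int *subfile, long *local)
{
    long unit = offset / stripe;              /* index of the stripe unit        */
    *subfile  = (int)(unit % nsubfiles);      /* sub-file holding this unit      */
    *local    = (unit / nsubfiles) * stripe   /* units already in that sub-file  */
              + offset % stripe;              /* position within the unit        */
}

int main(void)
{
    int sub; long loc;
    locate(10, 5, 2, &sub, &loc);  /* 5-byte stripe unit, two sub-files */
    printf("global offset 10 -> sub-file %d, local offset %ld\n", sub, loc);
    return 0;
}

With the 5-byte stripe unit and two sub-files used in the examples of the following sections, global offset 10 maps to local offset 5 of the first sub-file, in agreement with the distribution shown later in Fig. 3.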
A number of parallel file systems have been designed ([1], [2], [3], [5]) that make use of the parallel file striping paradigm.
MPI is currently the most used standard framework for creating parallel applications running on various types of parallel computers. A well-known implementation of MPI [9], called MPICH, has been developed by Argonne National Laboratory. MPICH is used on different platforms and incorporates MPI-1.2 operations [10] as well as the MPI-I/O subset of MPI-2 ([11], [12], [13]). MPICH is most popular for cluster architecture supercomputers based on Fast or Gigabit Ethernet networks. The I/O implementation underlying MPICH's MPI-I/O is completely sequential and is based on NFS ([4], [14]).
Due to the locking mechanisms needed to avoid simultaneous multiple accesses to the shared NFS file, MPICH MPI-I/O write operations can be carried out only at a very low throughput¹.
Another factor reducing peak performance is the read-modify-write operation needed for writing fragmented data to the target file. Read-modify-write requires sending over the network the full data block covering the written data fragment, modifying it and transmitting it back. In the case of high data fragmentation, i.e. small chunks of data spread over a large data space in the file, the network access overhead may become dominant.
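As an illustration only (plain POSIX calls; the file name and sizes are assumptions, not MPICH code), the read-modify-write pattern looks as follows: the whole block covering the fragment travels twice even though only a few bytes change.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Illustration of read-modify-write on a shared (e.g. NFS-mounted) file:
   to update a 4-byte fragment, a full 4 KB block is read and rewritten. */
int main(void)
{
    char block[4096];
    int fd = open("/tmp/shared.dat", O_RDWR);   /* assumed shared file */
    if (fd < 0)
        return 1;

    lseek(fd, 0, SEEK_SET);
    read(fd, block, sizeof block);              /* 1. fetch the covering block  */
    memcpy(block + 100, "tiny", 4);             /* 2. modify the small fragment */
    lseek(fd, 0, SEEK_SET);
    write(fd, block, sizeof block);             /* 3. write the block back      */

    close(fd);
    return 0;
}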
To be able to provide the highest level of parallelization of access requests as well as a good load balance, small striping units are required. However, a small stripe unit size increases the communication and disk access cost. Our SFIO parallel file striping implementation integrates the relevant optimizations by merging sets of network messages and disk accesses into single messages and single disk access requests. The merging operation makes use of MPI derived datatypes.
The SFIO library interface does not provide nonblocking
operations, but internally, accesses to the network and disks are made
asynchronously.
Section 2 presents the overall architecture of the SFIO implementation as well as the software layers needed to provide an MPI-I/O interface on top of SFIO. Section 3 presents the SFIO interface description, small examples, and details of the system design, caching techniques and other optimizations. Section 4 gives first performance results for various configurations of the Swiss-Tx supercomputer [7]. Section 5 presents the conclusions and future work.
Fig. 2 - Integration of SFIO
The SFIO library is implemented using MPI-1.2 message passing calls and is therefore as portable as MPI-1.2. The local disk access calls, which depend on the underlying operating system, are not portable; however, they are integrated separately into the source for the Unix and Windows NT versions.
The SFIO parallel file striping library offers a simple Unix-like interface. We also intend to provide an MPI-I/O interface on top of SFIO. The intermediate level of MPICH's MPI-I/O implementation is ADIO [14]; we successfully modified the ADIO layer of MPICH to route calls to the SFIO interface.
On the Swiss-T1 machine, SFIO can run on top of MPICH as well as on top of FCI-MPI using the low latency and high throughput network TNet [8].
The following source gives a simple SFIO example. The striped file with a stripe unit size of 5 bytes consists of two sub-files. A single compute node accesses the striped file. It is assumed that the program is launched with one compute node MPI process.
#include <mpi.h>
#include "mio.h"

int _main(int argc, char *argv[])
{
  MFILE *f;
  f = mopen("t0-p1,/tmp/a1.dat;"
            "t0-p2,/tmp/a2.dat;", 5);
  mwritec(f, 0, "Hello World", 11);
  mclose(f);
}
Below is an example of multiple compute nodes accessing a striped file. Again, the striped file with a stripe unit size of 5 bytes consists of two sub-files. It is accessed by three compute nodes, each of which simultaneously writes at a different position.
#include <mpi.h>
#include "../mpi/sfio/mio.h"

int _main(int argc, char *argv[])
{
  MFILE *f;
  f = mopen("t0-p1,/tmp/a1.dat;"
            "t0-p2,/tmp/a2.dat;", 5);
  if (rank() == 0) {
    mwritec(f, 0, "Hello*World,*", 13);
  } else if (rank() == 1) {
    mwritec(f, 13, "I*am*a*program*", 15);
  } else if (rank() == 2) {
    mwritec(f, 28, "written*with*SFIO.", 18);
  }
  mclose(f);
}

We assume that the program is launched with three compute and two I/O MPI processes. At the end, the global file contains the text combined from the fragments written by the first, second and third compute nodes, i.e. "Hello*World,*I*am*a*program*written*with*SFIO.". The text is distributed across the two sub-files: the first sub-file contains "Hellod,*I*progritten*SFIO" and the second "*Worlam*a*am*wr*with." (Fig. 3).
Fig. 3 - Distribution of the striped file across sub-files
File management operations are mopen, mclose, mchsize, mdelete and mcreate.
MFILE* mopen(char *name, int chunk);
void mclose(MFILE *f);
void mchsize(MFILE *f, long size);
void mdelete(char *name);
void mcreate(char *name);
All the presented file management operations are collective. The mopen operation returns to the compute node a pointer to the logical striped file descriptor. The striped file name required by the mopen, mdelete and mcreate commands is a string containing the full specification of the number, sequence, locations and paths of the sub-files representing the global striped file. The name format is a sequence of sub-files separated by ";": "<host>,<path>;<host>,<path>;<host>,<path>", for example "t0-p1,/tmp/a1.dat;t0-p2,/tmp/a2.dat;".
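Based only on the signatures above, a minimal usage sketch of the file management calls could look as follows (the sub-file paths are assumptions; whether mopen creates missing sub-files itself is not specified here, so mcreate is called explicitly):

#include <mpi.h>
#include "mio.h"

/* Assumed usage sketch of the collective file management calls. */
int _main(int argc, char *argv[])
{
    char name[] = "t0-p1,/tmp/b1.dat;"
                  "t0-p2,/tmp/b2.dat;";
    MFILE *f;

    mcreate(name);          /* create the two sub-files                */
    f = mopen(name, 5);     /* open with a stripe unit size of 5 bytes */
    mchsize(f, 1000);       /* set the striped file size to 1000 bytes */
    mclose(f);
    mdelete(name);          /* remove the striped file again           */
    return 0;
}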
There are single block and multi-block data access requests.
void mread(MFILE *f, long offset, char *buffer, unsigned size);
void mwrite(MFILE *f, long offset, char *buffer, unsigned size);
void mreadc(MFILE *f, long offset, char *buffer, unsigned size);
void mwritec(MFILE *f, long offset, char *buffer, unsigned size);
void mreadb(MFILE *f, unsigned blknum, long offsets[], char *buffers[], unsigned sizes[]);
void mwriteb(MFILE *f, unsigned blknum, long offsets[], char *buffers[], unsigned sizes[]);
The data access requests are blocking and non-collective. The mreadc and mwritec functions are optimized versions of mread and mwrite.
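As an illustration of the multi-block interface (a sketch based only on the signatures above; offsets, buffers and sizes are arbitrary), a single mwriteb call submits several fragments located at different file positions:

#include <mpi.h>
#include "mio.h"

/* Assumed usage sketch of a multi-block write: three fragments of user
   memory are written at three different file offsets by one request. */
int _main(int argc, char *argv[])
{
    MFILE *f = mopen("t0-p1,/tmp/a1.dat;"
                     "t0-p2,/tmp/a2.dat;", 5);

    long     offsets[3] = { 0, 20, 40 };                   /* target positions */
    char    *buffers[3] = { "first", "second", "third" };  /* source fragments */
    unsigned sizes[3]   = { 5, 6, 5 };                     /* fragment lengths */

    mwriteb(f, 3, offsets, buffers, sizes);   /* one blocking multi-block call */
    mclose(f);
    return 0;
}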
Error management functions are given by merror and its collective counterpart merrora.
void merrora(unsigned long *ioerr);
void merror(unsigned long *ioerr);
void prioerrora();
merror and merrora return an array of error statistics accumulated on all the I/O nodes and, at the same time, reset the error counters on all the I/O nodes. Statistics are accumulated for operating system I/O calls and listed according to the open, close, creat, unlink, ftruncate, lseek, write and read functions. prioerrora is a collective operation which prints the error statistics to the standard output of the application.
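A small sketch of error reporting (assumed usage; the length of the statistics array returned by merror and merrora is not given here, so only the collective printing call is shown):

#include <mpi.h>
#include "mio.h"

/* Assumed usage sketch: after some I/O activity, collectively print the
   error statistics accumulated on the I/O nodes. */
int _main(int argc, char *argv[])
{
    MFILE *f = mopen("t0-p1,/tmp/a1.dat;"
                     "t0-p2,/tmp/a2.dat;", 5);
    mwritec(f, 0, "Hello World", 11);
    mclose(f);

    prioerrora();   /* collective: prints per-call error counts
                       (open, close, write, ...) to standard output */
    return 0;
}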
When a compute node invokes an I/O operation, the SFIO library takes control of that compute node. The library routes the requests to the corresponding I/O listener proxy on the compute node, caches the routed requests and optimizes the requests queued for each I/O node in order to minimize the cost of disk accesses and network communications. After the actual transmission of the messages, each I/O listener prepares a reply which is sent back to the compute node.
Fig. 4 - Disk Optimization
Queued I/O node access requests cached on the compute node are launched either at the end of the function call or when the buffer space reserved on the remote I/O listener for data reception is about to become full. Memory is not a problem on the compute node, since data always stays in user memory and is not buffered. When launching I/O requests, the SFIO library performs a single data transmission to each of the I/O nodes: it dynamically creates a derived datatype which points to the set of pieces in user-space memory related to the given I/O node and transmits the data in a single stream without additional copy. The I/O listener, for its part, receives the data as one contiguous block.
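The following minimal MPI sketch (our own illustration, not the SFIO source; the buffer contents and fragment layout are assumptions) shows the principle: a derived datatype built with MPI_Type_hindexed (the MPI-1.2 call, named MPI_Type_create_hindexed in later MPI versions) describes several non-contiguous fragments of user memory, which are then sent in a single message and received as one contiguous block.

#include <mpi.h>
#include <stdio.h>

/* Sketch of merging non-contiguous user-space fragments into a single
   network transfer with an MPI derived datatype (not the SFIO source). */
int main(int argc, char *argv[])
{
    int rank;
    char userbuf[32] = "AAAAAxxxxxBBBBBxxxxxCCCCC";
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Three 5-byte fragments of userbuf destined for the same I/O node. */
    int      blocklens[3] = { 5, 5, 5 };
    MPI_Aint displs[3]    = { 0, 10, 20 };   /* byte offsets within userbuf */
    MPI_Datatype merged;
    MPI_Type_hindexed(3, blocklens, displs, MPI_CHAR, &merged);
    MPI_Type_commit(&merged);

    if (rank == 0) {
        /* Compute node side: one send, no intermediate copy of the fragments. */
        MPI_Send(userbuf, 1, merged, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* I/O listener side: the data arrives as one contiguous 15-byte block. */
        char contiguous[16] = { 0 };
        MPI_Recv(contiguous, 15, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        printf("received: %s\n", contiguous);   /* AAAAABBBBBCCCCC */
    }

    MPI_Type_free(&merged);
    MPI_Finalize();
    return 0;
}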
Let us explore the scalability of our parallel I/O implementation (SFIO) as a function of the number of contributing I/O nodes. Performance results have been measured on the Swiss-T1 machine [7], which consists of 64 Alpha processors grouped in 32 nodes. Two networks are used, TNet and Fast Ethernet. To get an idea of the network capabilities, the throughput as a function of the number of nodes is measured for both networks with a simple MPI program: the nodes are equally divided into transmitting and receiving nodes and maximal all-to-all traffic is created.
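The benchmark source is not reproduced here; the following simplified sketch (fixed sender/receiver pairs rather than full all-to-all traffic, with assumed message size and repetition count) gives the flavour of the measurement:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSGSIZE (1 << 20)   /* 1 MB per message (assumption) */
#define ROUNDS  100

/* Simplified network throughput sketch: the lower half of the ranks
   transmit to the upper half, and the aggregate throughput is computed
   from the elapsed time. */
int main(int argc, char *argv[])
{
    int rank, size;
    char *buf = malloc(MSGSIZE);
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int half = size / 2;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    if (rank < 2 * half) {                     /* ignore an odd last rank */
        for (int r = 0; r < ROUNDS; r++) {
            if (rank < half)                   /* transmitting nodes */
                MPI_Send(buf, MSGSIZE, MPI_CHAR, rank + half, 0, MPI_COMM_WORLD);
            else                               /* receiving nodes    */
                MPI_Recv(buf, MSGSIZE, MPI_CHAR, rank - half, 0, MPI_COMM_WORLD, &status);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("aggregate throughput: %.1f MB/s\n",
               (double)half * ROUNDS * MSGSIZE / (1e6 * (t1 - t0)));

    free(buf);
    MPI_Finalize();
    return 0;
}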
Figure 5 demonstrates cluster throughput scalability with a Fast Ethernet Network and Fig. 6 with TNet.
Fig. 5 - Ethernet scalability (blue = peak; yellow = average)
Fig. 6 - TNet scalability
With Fast Ethernet, each node is connected to a Fast Ethernet crossbar switch. The underlying topology of TNet consists of eight 12-port full crossbar switches. The blue graphs show the peak performances and the yellow graphs the average performances.
Let us now analyze the performances of the SFIO library on the Swiss-T1 machine for MPICH on Fast Ethernet and FCI-MPI on TNet. Let us assign the first processor of each compute node to a compute process and the second processor to an I/O listener (Fig. 7).
Fig. 7 - SFIO Architecture on Swiss-T1
SFIO performance is measured for concurrent write access from all compute nodes to all I/O nodes, the striped file being distributed over all I/O nodes. The number of I/O nodes is equal to the number of compute nodes.
The size of the striped file is 2 GBytes and the stripe unit size is only 200 bytes. The application's I/O performance as a function of the number of compute and I/O nodes is measured on both Fast Ethernet and TNet and presented in Fig. 8 and Fig. 9. The blue graphs show the peak performances and the yellow graphs the average performances. We are very surprised by the performance results of SFIO on top of MPICH; this result needs further investigation.
Fig. 8 - SFIO all-to-all I/O performance on Fast Ethernet
Fig. 9 - SFIO all-to-all I/O performance on TNet
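The benchmark itself is not listed; a hedged sketch of the kind of concurrent write test described above might look as follows (the sub-file list, the block size and the number of blocks per node are assumptions; the real measurement uses a 2 GByte file striped over all I/O nodes with a 200-byte stripe unit):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include "mio.h"

#define BLOCK   (1 << 20)   /* 1 MB written per call (assumption)           */
#define NBLOCKS 64          /* blocks written per compute node (assumption) */

/* Sketch of the concurrent write measurement: every compute node writes
   its own disjoint region of the striped file (stripe unit 200 bytes). */
int _main(int argc, char *argv[])
{
    /* The real sub-file list enumerates all I/O nodes; two are shown here. */
    MFILE *f = mopen("t0-p1,/tmp/s1.dat;"
                     "t0-p2,/tmp/s2.dat;", 200);
    char *buf  = malloc(BLOCK);
    long  base = (long)rank() * NBLOCKS * (long)BLOCK;

    double t0 = MPI_Wtime();
    for (long i = 0; i < NBLOCKS; i++)
        mwritec(f, base + i * BLOCK, buf, BLOCK);   /* disjoint file regions */
    double t1 = MPI_Wtime();

    if (rank() == 0)
        printf("per-node write throughput: %.1f MB/s\n",
               (double)NBLOCKS * BLOCK / (1e6 * (t1 - t0)));

    mclose(f);
    free(buf);
    return 0;
}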
With FCI-MPI the situation is much better: I/O throughput is highly scalable. When more than 23 nodes participate in the I/O operations, performance may decrease due to TNet's particular communication topology. The effect of the topology on I/O performance will be studied further.
SFIO is a cheap alternative to Storage Area Networks. It is a light-weight portable parallel I/O system available for MPI programmers. Integrated into standard MPI-I/O, SFIO may become a high performance portable MPI-I/O solution for the MPI community.
We plan to carry out SFIO benchmarking and to check scalability for larger numbers of processors on large supercomputers, e.g. at Sandia National Laboratory.
We intend to implement nonblocking parallel I/O function calls. Disk access optimizations may also be further improved.
Finally, we plan to implement collective operations as follows: collective operations assume that all compute nodes issue an I/O request at the same logical step in the program. The compute nodes, under the control of the SFIO library, consult each other to arrive at a common I/O strategy. The I/O nodes are informed about the strategy by the compute nodes and SFIO creates the optimized data flow.
1 When 7 T1 compute nodes access one shared NFS file in an interleaved manner, the write throughput with MPICH MPI-I/O is 35 KB/s per node.