Published in “Large Scale Storage in the Web”, 18th IEEE Symposium on Mass Storage Systems and Technologies, April 17-20, 2001, pp. 135-144 (ISBN 0-7695-0849-9)
SFIO, a striped file I/O library for MPI
École Polytechnique Fédérale de Lausanne, Switzerland
{Emin.Gabrielyan,RD.Hersch}@epfl.ch
Abstract
This paper presents the design and evaluation
of a Striped File I/O (SFIO) library for parallel I/O within an MPI
environment. We present techniques for optimizing communications and disk
accesses for small striping factors. Using MPI derived datatype capabilities,
we transmit fragmented data over the network by single MPI transfers. We
present the I/O performance of the SFIO library on DEC Alpha clusters, both for
the Fast Ethernet and for the TNET communication networks.
For I/O bound parallel applications, parallel
file striping may represent an alternative to Storage Area Networks (SAN). In
particular, parallel file striping offers high-throughput I/O capabilities at a much lower cost, since it does not require a dedicated network for accessing the mass storage sub-system [1].
Figure 1. File striping
Parallel I/O systems should offer highly
concurrent access capabilities to the common data files by all parallel
application processes. They should exhibit a linear increase in performance when increasing both the number of I/O nodes and the number of the application’s processing nodes. Parallelism for input/output operations can be achieved by
striping the data across multiple disks so that read and write operations occur
in parallel (see Figure 1). A number of parallel file systems were
designed ([2], [3], [4], [5], [6], [7], [8]), which rely on the parallel file
striping paradigm.
MPI is a widely used standard framework for
creating parallel applications running on various types of parallel computers
[9]. A well-known implementation of MPI, called MPICH, has been developed by Argonne National Laboratory [10]. MPICH is used on different platforms and incorporates the MPI-1.2 operations [11] as well as the MPI-I/O subset of MPI-II ([12], [13], [14]). MPICH is most popular on cluster-architecture supercomputers based on Fast or Gigabit Ethernet networks. The I/O implementation underlying MPICH’s MPI-I/O is sequential and based on NFS [10], [15].
Due to the locking mechanisms needed to avoid simultaneous multiple accesses to the shared NFS file, MPICH MPI-I/O write operations can be carried out only at a very low throughput.
Another factor reducing peak performance is the read-modify-write operation used for writing fragmented data to the target file. Read-modify-write requires reading the full contiguous extent of data covering the fragment to be written, sending it over the network, modifying it and transmitting it back. In the case of high data fragmentation, i.e. small chunks of data spread over a large dataspace in the file, the network access overhead may become dominant.
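For illustration, the following sketch shows what a read-modify-write cycle looks like at the file access level; the function name and the size of the covering extent are hypothetical and serve only to make the cost of the two full-extent transfers visible for a fragment of a few hundred bytes.

#include <stdio.h>
#include <string.h>

#define COVER_SIZE 8192   /* hypothetical size of the contiguous extent covering the fragment */

/* Write a small fragment by reading the covering extent, patching it in
   memory and writing the whole extent back.  The fragment is assumed to
   fit entirely inside one covering extent. */
static int rmw_write(FILE *fp, long offset, const char *frag, size_t len)
{
   char cover[COVER_SIZE];
   long start = (offset / COVER_SIZE) * COVER_SIZE;   /* align to the covering extent */
   size_t have;

   if (fseek(fp, start, SEEK_SET) != 0) return -1;
   have = fread(cover, 1, COVER_SIZE, fp);            /* read the full extent */
   if (have < (size_t)(offset - start) + len)
      have = (size_t)(offset - start) + len;          /* the file may be shorter than the extent */
   memcpy(cover + (offset - start), frag, len);       /* modify the fragment in place */
   if (fseek(fp, start, SEEK_SET) != 0) return -1;
   return fwrite(cover, 1, have, fp) == have ? 0 : -1;/* write the full extent back */
}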
Our project aims at offering scalable I/O
throughput. To be able to provide the highest level of parallelization of
access requests as well as a good load balance, small striping units are
required. However, a small stripe unit size increases the communication and disk access cost. Our SFIO parallel file striping implementation integrates the
relevant optimizations by merging sets of network messages and disk accesses
into single messages and single disk access requests. The merging operation
makes use of MPI derived datatypes. At the present time, the SFIO library
interface does not provide non-blocking operations, but internally, accesses
to the network and disks are made asynchronously.
Section 2 presents the overall architecture of the SFIO implementation as well as the software layers needed to provide an MPI-I/O interface on top of SFIO. The SFIO interface description, small examples as well as the details of the system design, caching techniques and other optimizations are presented in Section 3. First performance results are given for various configurations of the Swiss-Tx supercomputer [16]. The performance test of SFIO on top of MPICH is given in Section 4. In the same section we present the topology of the Swiss-T1 machine and the SFIO benchmarks on top of the native FCI communication system. Section 5 presents the conclusions and future work.
The SFIO library is implemented using MPI-1.2
message passing calls. It is therefore as portable as MPI-1.2. The local disk access calls, which depend on the underlying operating system, are not portable. However, they are separately integrated into the source for the Unix and the Windows NT versions.
The SFIO parallel file striping library
offers a simple Unix-like interface. We intend to provide an MPI-I/O interface on top of SFIO in the future. The intermediate level of MPICH’s MPI-I/O
implementation is ADIO [15]. We successfully modified the ADIO layer of MPICH
to route calls to the SFIO interface (Figure
2).
Figure 2. SFIO integration into MPI-I/O
On the Swiss-T1 machine, SFIO can run on top
of MPICH as well as on top of MPI/FCI, an MPI implementation making use of the
low latency and high throughput TNET network [17].
Unlike the majority of file access sub-systems, SFIO is not a block-oriented library [18], [5], [19], [20], [21]. Independence from block orientation provides a number of advantages. There is no need to send entire blocks over the network or to access them on the disk. The stripe units do not form blocks; neither network transfers nor disk accesses are rounded to the stripe unit size. The amount of data accessed on the disk and transferred over the network is the size specified at the application level.
The functional architecture of the SFIO
library is shown in Figure 3. Only the structure of the access functions
is described. At the top of the graph is the application’s interface to the data access operations; at the bottom are the I/O node operations.
Figure 3. SFIO functional architecture
The mread and mwrite operations are the non-optimized single block access functions; the mreadc and mwritec operations are their optimized counterparts. The mreadb and mwriteb operations are multi-block access functions. All of the mread, mwrite, mreadc, mwritec, mreadb and mwriteb interface functions operate at the level of the logical file. For example, the SFIO write access operation mwritec(f,0,buffer,size) writes data to the beginning of the logical file f. The access interface functions are unaware of the fact that the logical file is striped across subfiles. In the SFIO library, all the interface access functions are routed to the mrw cyclic distribution module. This module is responsible for data striping. Contiguous requests (or a set of contiguous requests for mwriteb and mreadb) are split into small fragments according to the striping factor. The small requests generated by the mrw module specify the selected subfile and the node on which the subfile is located. Global pointers are translated to subfile pointers. Subfile access requests contain enough information to execute and complete the I/O operation. For the non-optimized mread and mwrite operations, the library routes the requests to the sfp_read and sfp_write modules, which are responsible for sending the appropriate single sub-requests to the I/O nodes using MPI as the transport layer.
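As an illustration of the cyclic distribution carried out by the mrw module, the following sketch shows how a global file offset can be mapped to a subfile index and a subfile offset for a given stripe unit size and number of subfiles; the names are hypothetical and do not come from the SFIO sources. A contiguous request spanning several stripe units is split at the unit boundaries and each fragment is forwarded to the corresponding subfile.

/* Hypothetical illustration of cyclic striping arithmetic: map a global
   offset to (subfile index, offset within that subfile). */
typedef struct {
   int  subfile;        /* which subfile, and hence which I/O node */
   long local_offset;   /* offset inside that subfile */
} StripeLocation;

static StripeLocation map_global_offset(long global_offset,
                                        int stripe_unit, int n_subfiles)
{
   StripeLocation loc;
   long unit  = global_offset / stripe_unit;      /* global stripe unit index */
   long inner = global_offset % stripe_unit;      /* position inside the unit */
   loc.subfile      = (int)(unit % n_subfiles);   /* units are dealt out cyclically */
   loc.local_offset = (unit / n_subfiles) * stripe_unit + inner;
   return loc;
}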
The network communication and disk access optimization is illustrated by the remaining part of the graph. For the optimized interface functions mreadc, mwritec, mreadb and mwriteb, the mrw module routes the requests to the sfp_readc and sfp_writec functions. These functions access the sfp_rdwrc module, which stores the sub-requests in a two-dimensional cache. One axis of the 2D cache structure corresponds to the I/O nodes and the other to the set of subfiles each I/O node is dealing with. In the general case, on each I/O node there may be one subfile per global file.
Each entry of the cache can be flushed.
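A minimal sketch of such a two-dimensional cache structure is given below, assuming that only request descriptors are stored per (I/O node, subfile) pair; the type and field names are hypothetical and do not come from the SFIO sources.

/* Hypothetical sketch of the two-dimensional sub-request cache: one
   request queue per (I/O node, subfile) pair.  Only request descriptors
   are stored; the data itself stays in user memory. */
typedef struct {
   long     subfile_offset;   /* offset in the remote subfile */
   char    *user_ptr;         /* location of the data in user space */
   unsigned size;             /* number of bytes */
} SubRequest;

typedef struct {
   SubRequest *reqs;          /* queued sub-requests for this entry */
   int count;
   int capacity;
} CacheEntry;

typedef struct {
   CacheEntry **entry;        /* entry[ionode][subfile] */
   int n_ionodes;
   int n_subfiles;            /* subfiles handled per I/O node */
} RequestCache;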
As soon as a large list of sub-requests needs to be processed, the library can carry out effective optimizations in order to save network communications and disk accesses. Note that the data itself is never cached and always stays in user space. Three optimization procedures are carried out before an actual transmission takes place. The requests are sorted by their offsets in the remote subfiles. This operation is carried out by the sortcache module. Overlapping and consecutive requests are merged whenever possible into single requests by the bkmerge module. This merging operation reduces the number of disk access calls on the remote I/O nodes.
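The following sketch illustrates the offset arithmetic behind these two steps, reusing the hypothetical SubRequest descriptor from the previous sketch; the handling of the associated user-memory pieces, which the library keeps track of, is omitted for brevity.

#include <stdlib.h>

/* Comparison by subfile offset, used to sort one cache entry
   (the role of the sortcache step). */
static int cmp_offset(const void *a, const void *b)
{
   const SubRequest *ra = a, *rb = b;
   return (ra->subfile_offset > rb->subfile_offset) -
          (ra->subfile_offset < rb->subfile_offset);
}

/* Merge consecutive or overlapping extents into single disk accesses
   (the role of the bkmerge step).  Returns the number of merged accesses. */
static int merge_extents(SubRequest *r, int n)
{
   int out = 0, i;
   if (n == 0) return 0;
   qsort(r, n, sizeof(SubRequest), cmp_offset);
   for (i = 1; i < n; i++) {
      long prev_end = r[out].subfile_offset + (long)r[out].size;
      long cur_end  = r[i].subfile_offset + (long)r[i].size;
      if (r[i].subfile_offset <= prev_end) {        /* overlapping or consecutive */
         if (cur_end > prev_end)
            r[out].size = (unsigned)(cur_end - r[out].subfile_offset);
      } else {
         r[++out] = r[i];                           /* start a new disk access */
      }
   }
   return out + 1;
}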
The mkbset module creates a derived MPI datatype pointing to the fragmented pieces of user data in the user’s memory. This allows the data associated with many requests to be transmitted efficiently over the network as one contiguous stream. The data can be transmitted or received without any memory copy at the application or library level.
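Conceptually, this corresponds to building an hindexed datatype over the scattered user-memory pieces, as in the following sketch. The function is illustrative, not the SFIO source; it uses the MPI-2 names MPI_Get_address and MPI_Type_create_hindexed, whereas the MPI-1.2 based library would rely on their older equivalents.

#include <mpi.h>
#include <stdlib.h>

/* Build a derived datatype describing the scattered user-memory pieces
   destined to one I/O node, so that they can be sent in a single MPI
   message without copying them into a contiguous buffer. */
static MPI_Datatype build_block_set(char *ptrs[], int lengths[], int count)
{
   MPI_Aint *displs = malloc(count * sizeof(MPI_Aint));
   MPI_Datatype dtype;
   int i;

   for (i = 0; i < count; i++)
      MPI_Get_address(ptrs[i], &displs[i]);   /* absolute addresses in user space */

   MPI_Type_create_hindexed(count, lengths, displs, MPI_CHAR, &dtype);
   MPI_Type_commit(&dtype);
   free(displs);
   return dtype;
}

/* The caller would then transmit all pieces in one message, e.g.
   MPI_Send(MPI_BOTTOM, 1, dtype, ionode_rank, tag, comm);
   and release the type with MPI_Type_free(&dtype). */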
The actual data transmission to the I/O nodes
is carried out by the sfp_readb and sfp_writeb functions.
Two functions, mopen and mclose are
provided to open and close a striped file. Note that a file should be opened by all compute nodes, irrespective of whether a given node uses the file or not. This restriction is placed in order to ensure the correct behaviour of future
collective parallel I/O functions. Additionally, the operation of opening as
well as of closing a file implies a global synchronization point in the
program. The generic functions to read and write to a file are respectively mreadc and mwritec.
The multiple I/O request specification
interface allows an application program to specify multiple I/O requests within
one call. This permits optimizations which otherwise would not be possible. The
multiple I/O request operations are mreadb and
mwriteb.
The following C source code shows a simple SFIO example. The striped file, with a stripe unit size of 5 bytes, consists of
two subfiles. A single compute node accesses the striped file. It is assumed
that the program is launched with one compute node MPI process.
#include <mpi.h>
#include "/usr/local/sfio/mio.h"

int _main(int argc, char *argv[])
{
   MFILE *f;
   f=mopen               //Collective Operation
   (
      "t0-p1,/tmp/a1.dat;"
      "t0-p2,/tmp/a2.dat;"
      ,5
   );
   if(rank()==0)
   {
      //writes at location 0 in the global file 11 characters
      mwritec(f,0,"Hello World",11);
   }
   mclose(f);            //Collective Operation
}
Below is an example of multiple compute nodes
accessing a striped file. Again the striped file with a stripe unit size of 5
bytes consists of two subfiles. It is accessed by three compute nodes. Each of
them writes at different positions simultaneously.
#include <mpi.h>
#include "/usr/local/sfio/mio.h"

int _main(int argc, char *argv[])
{
   MFILE *f;
   char bu[]="Hello*World!*";
   int r=rank();
   f=mopen
   (
      "t0-p1,/tmp/a.dat;"
      "t0-p2,/tmp/a.dat;"
      ,5
   );
   //each process writes at its own position 13 characters
   mwritec(f,13*r,bu,13);
   mclose(f);
}
We assume that the program is launched with three compute and two I/O MPI processes. After the parallel writing operation, the global file contains the text combined from the fragments written by the first, second and third compute nodes, i.e. “Hello*World!*Hello*World!*Hello*World!*”. The text is distributed across the two subfiles. The first subfile contains “Hellod!*Heorld!o*Wor” and the second “*Worlllo*W*Hellld!*” (Figure 4).
Figure 4. Distribution of striped file across subfiles
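The distribution of Figure 4 can be reproduced by the cyclic mapping alone, as in the following small stand-alone program (illustrative, not part of SFIO), which deals the 39-byte global file into two subfile strings with a stripe unit of 5 bytes and prints them.

#include <stdio.h>
#include <string.h>

/* Reproduce the distribution of Figure 4: deal the global file into two
   subfiles with a stripe unit of 5 bytes. */
int main(void)
{
   const char *global = "Hello*World!*Hello*World!*Hello*World!*";
   char sub[2][64] = {{0}};
   size_t len = strlen(global), i;

   for (i = 0; i < len; i++) {
      size_t unit = i / 5;                  /* stripe unit index */
      size_t file = unit % 2;               /* subfile chosen cyclically */
      size_t pos  = (unit / 2) * 5 + i % 5; /* offset inside the subfile */
      sub[file][pos] = global[i];
   }
   /* prints Hellod!*Heorld!o*Wor and *Worlllo*W*Hellld!* */
   printf("%s\n%s\n", sub[0], sub[1]);
   return 0;
}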
In this sub-section we present the application programming interface of the SFIO library.
File management operations are mopen, mclose, mchsize, mdelete and mcreate.
MFILE* mopen(char *name, int stripeUnitSz);
void mclose(MFILE *f);
void mchsize(MFILE *f, long size);
void mdelete(char *name);
void mcreate(char *name);
All the presented file management operations
are collective. Operation mopen
returns to the compute node a pointer to the logical striped file descriptor.
The striped file name, required for the mopen,
mdelete, mcreate commands is a string containing the specification of the
I/O nodes together with the paths of subfiles representing the global striped
file. The format of the name is a sequence of subfile specifications, separated by semicolons:
“<host>,<path>;<host>,<path>...”. For
example “tonep0,/tmp/a.dat;tonep1,/tmp/a.dat;”.
The mchsize operation changes the size of the logical file. If the specified size is smaller than the current size, the operation truncates the logical file to the new size.
There are single block and multi-block data
access requests.
void mread(MFILE *f, long offset, char *buffer, unsigned size);
void mwrite(MFILE *f, long offset, char *buffer, unsigned size);
void mreadc(MFILE *f, long offset, char *buffer, unsigned size);
void mwritec(MFILE *f, long offset, char *buffer, unsigned size);
void mreadb(MFILE *f, unsigned numberOfBlocks, long offsets[], char *buffers[], unsigned sizes[]);
void mwriteb(MFILE *f, unsigned numberOfBlocks, long offsets[], char *buffers[], unsigned sizes[]);
The data access requests are blocking and non-collective. The mreadc and mwritec functions are the optimized versions of the mread and mwrite functions. The multiple block data access operations mreadb and mwriteb are optimized. The numberOfBlocks argument of the mreadb and mwriteb operations specifies the number of blocks to be accessed in the logical file by the single operation. The information about each block has to be provided by three arrays, offsets, buffers and sizes, each having a number of elements given by numberOfBlocks. The offsets array contains the position of each block in the logical file, the buffers array contains the address of each block in user memory and the sizes array stores the size of each block in bytes.
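A hedged usage sketch of mwriteb is given below; it assumes that f has been obtained with mopen and that the SFIO header is included as in the earlier examples, and the offsets, block contents and sizes are purely illustrative.

/* Sketch: write two blocks of the logical file in a single mwriteb call. */
void write_two_blocks(MFILE *f)
{
   char b0[] = "first";
   char b1[] = "second";
   long     offsets[2] = { 0, 100 };      /* positions in the logical file */
   char    *buffers[2] = { b0, b1 };      /* addresses of the blocks in user memory */
   unsigned sizes[2]   = { 5, 6 };        /* block sizes in bytes */

   mwriteb(f, 2, offsets, buffers, sizes);
}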
Error management functions are given by merror and its collective counterpart merrora.
void merrora(unsigned long *ioerr);
void merror(unsigned long *ioerr);
void prioerrora();
merror and merrora
return an array of error statistics accumulated on all the I/O nodes. At
the same time, they reset the error counters on all the I/O nodes. Statistics are accumulated for
operating system I/O calls and listed according to open, close, creat, unlink, ftruncate, lseek, write and read functions.
prioerrora is a collective operation
which prints the error statistics to the standard output of the application.
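A possible usage sketch is given below; the assumption that the statistics array holds one counter per system call listed above is made for illustration only.

/* Sketch: collect and print the accumulated I/O error statistics.
   The 8-entry layout (open, close, creat, unlink, ftruncate, lseek,
   write, read) is an assumption made for illustration. */
void check_io_errors(void)
{
   unsigned long ioerr[8];
   merrora(ioerr);     /* collective: gathers and resets the counters on the I/O nodes */
   prioerrora();       /* collective: prints the statistics to standard output */
}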
In our programming model, we assume a set of
compute nodes and an I/O subsystem. The I/O subsystem comprises a set of I/O
nodes running I/O listener processes. Both compute processes and I/O listeners
are MPI processes within a single MPI program. This allows the I/O subsystem
to optimize the data transfers between compute nodes and I/O nodes using MPI
derived datatypes. The user is allowed to directly use MPI operations only across
the compute nodes for computation purposes. The I/O nodes are available to the
user only through the SFIO interface.
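The following stand-alone sketch illustrates the idea of such a single MPI program. It is a conceptual illustration only, not the SFIO start-up code, and the way the ranks are split between compute processes and I/O listeners is an assumption.

#include <mpi.h>

/* Conceptual sketch: one MPI program whose ranks are split into compute
   processes and I/O listeners.  Here the second half of the ranks is
   assumed to run the listeners. */
int main(int argc, char *argv[])
{
   int rank, size;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

   if (rank < size / 2) {
      /* compute process: runs the application and calls the SFIO
         interface (mopen, mwritec, ...), which sends sub-requests to
         the listeners as MPI point-to-point messages */
   } else {
      /* I/O listener: loops on MPI_Recv, executes the requested local
         disk accesses and sends a reply back to the compute process */
   }

   MPI_Finalize();
   return 0;
}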
When a compute node invokes an I/O operation,
the SFIO library takes control of that compute node. The library routes the
requests to the corresponding I/O listener proxy on the compute node, caches
the routed requests and optimizes the requests queued for each I/O
node in order to minimize the cost of disk accesses and network communications.
After actual transmission of the messages, each I/O listener prepares a reply,
which is sent back to the compute node.
In order to optimize the disk accesses on the
remote I/O node, the algorithm implemented on the compute node tries to combine
all overlapping or consecutive I/O requests collected in the cache (Figure
5). Requests queued for each I/O node are
sorted according to their offsets on the remote disk subfile.
Figure 5. Disk access optimisation
Queued I/O node access requests cached on the
compute node are launched either at the end of the function call or when the buffer reserved on the remote I/O listener for data reception would become full. Memory is not a problem on the compute node, since data always stays in
user memory and is not buffered. When launching I/O requests, the SFIO library
performs a single data transmission to each of the I/O nodes. It creates
dynamically a derived datatype which points to the set of pieces in user space
memory related to the given I/O node and transmits the data in a single stream
without additional copy. The I/O listener at the same time receives the data as
a contiguous chunk. Upon reception of a data write or read request, the I/O
node immediately launches the corresponding disk access request.
Let us explore the scalability of our
parallel I/O implementation (SFIO) as a function of the number of contributing
I/O nodes. Performance results have been measured on the Swiss-T1 machine [16].
The Swiss-T1 supercomputer is based on Compaq AlphaServer DS20 machines and
consists of 64 Alpha processors grouped in 32 nodes. Two networks interconnect the processors: TNET and Fast Ethernet.
To gauge the network capabilities, the throughput as a function of the number of nodes is measured by a simple MPI program for both networks. The nodes are equally divided into
transmitting and receiving nodes and all-to-all traffic is generated. Figure 6 demonstrates the cluster’s communication
throughput scalability over Fast Ethernet. The Fast Ethernet network of T1
consists of a full crossbar switch.
Figure 6. Aggregate throughput of Fast Ethernet as a function of the number of contributing nodes
Let us now analyze the performance of the
SFIO library on the Swiss-T1 machine on top of MPICH using Fast Ethernet. We
assign the first processor of each compute node to a compute process and the
second processor to an I/O listener (Figure
7).
Figure 7. SFIO architecture on Swiss-T1
SFIO performance is measured for concurrent
write access from all compute nodes to all I/O nodes, the striped file being
distributed over all I/O nodes. The number of I/O nodes is equal to the number
of compute nodes.
The size of the striped file is 2 GB and the stripe unit size is only 200 bytes. The MPICH application’s I/O performance as a function of the number of compute and I/O nodes is measured for the Fast Ethernet network. It is presented in Figure 8. The white graph represents the average throughput and the gray graph the peak performance. These results are surprising and need further investigation. We suspect that the drop in performance may be due to an inefficient implementation of data-intensive collective operations in the current version of MPICH.
Figure 8. SFIO/MPICH all-to-all I/O performance
Let us analyze the capacities of the TNET network of the Swiss-T1 machine. TNET is a high-throughput and low-latency network (less than 20 µs MPI latency and more than 50 MB/s bandwidth) [17]. A high performance MPI implementation called MPI/FCI is available for communication through TNET [17].
The TNET throughput as a function of the
number of nodes is measured by a simple MPI program. The contributing nodes
are equally divided into transmitting and receiving nodes (Figure
9). Due to TNET’s specific network topology (Figure
10), communication throughput does not increase
smoothly. A significant increase in throughput occurs when the number of nodes
increases from 8 to 10, 16 to 18 and 24 to 26 nodes.
Figure 9. Aggregate throughput of TNET as a function of the number of the contributing nodes
The Swiss-T1’s TNET network [22] consists of
eight 12-port full crossbar switches (Figure
10). Routing between switches that do not have
direct connectivity is static [16]. The topology together with the routing
information defines the network’s peak throughput over the subset of processors
assigned to a given application.
Figure 10. The Swiss-T1 network interconnection topology
Let us now analyze the performance of the
SFIO library on the Swiss-T1 machine on top of MPI/FCI using the proprietary
TNET network. As before, the first processor of each compute node is assigned
to a compute process and the second processor to an I/O listener process. Therefore,
each node acts both as a compute node and an I/O node.
As in SFIO/MPICH, the performance of SFIO
over MPI/FCI is measured for concurrent write accesses from all compute nodes
to all I/O nodes, the striped file being distributed over all I/O nodes.
In order to limit operating system caching
effects, the total size of the striped file linearly increases with the number
of I/O nodes. With a global file size proportional to the number of
contributing I/O nodes, we keep the size of subfiles per I/O node fixed at
1GB/subfile.
The stripe unit size is 200 bytes. The
MPI/FCI application’s I/O performance is measured as a function of the number
of compute and I/O nodes (Figure 11). For each configuration, 53 measurements
are carried out. At job launch time, pairs of I/O and compute processes are
assigned randomly to processing nodes.
Figure 11. SFIO all-to-all I/O performance on TNET
The I/O throughput on MPI/FCI scales well
when increasing the number of nodes. The speed-up may slightly vary due to
TNET’s particular communication topology (Figure
10). The effect of topology on the I/O performance
needs to be further studied.
SFIO is a low-cost alternative to Storage Area Networks. It is a lightweight, portable parallel I/O system for MPI
programmers. Integrated into standard MPI-I/O, SFIO may become a high performance
portable MPI-I/O solution for the MPI community.
We plan to check the scalability of SFIO for
a larger number of processors on a large supercomputer at Sandia National
Laboratory.
We intend to implement non-blocking parallel
I/O function calls. Disk access optimizations may also be further improved.
The library has to be further developed in order to support global files larger
than 4GB.
Finally, we are planning to implement collective operations: such operations assume that all compute nodes issue an I/O request at the same logical step in the program. The compute
nodes, under control of the SFIO library, consult each other to arrive at a
common I/O strategy. The I/O nodes are informed about the strategy of the
compute nodes and an optimized data flow schedule is created.
[1] Martha Bancroft, Nick Bear, Jim Finlayson, Robert Hill, Richard Isicoff and Hoot Thompson, Functionality and Performance Evaluation of File Systems for Storage Area Networks (SAN), 17th IEEE Symposium on Mass Storage Systems.
[2] Sachin More, Alok Choudhary, Ian Foster, Ming Q. Xu. MTIO a
multi-threaded parallel I/O system, Proceedings of the 11th International
Parallel Processing Symposium (IPPS '97), pages 368-373
[3] Ron Oldfield and David Kotz, The Armada Parallel File System, Dartmouth College, Dept. of Computer Science, November 22, 1998, pages 1-14, http://www.cs.dartmouth.edu/~dfk/armada/design.html
[4] V. Messerli, O. Figueiredo, B. Gennart, R.D. Hersch,
Parallelizing I/O intensive Image Access and Processing Applications, IEEE
Concurrency, Vol. 7, No. 2, April-June 1999, pp. 28-37
[5] Chandramohan A. Thekkath, Timothy Mann, Edward K. Lee, Frangipani: A Scalable Distributed File System. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 224-237. ACM Press, October 1997, ftp://ftp.digital.com/pub/DEC/SRC/publications/thekkath/frangipani-sosp.ps
[6] Peter F. Corbett and Dror G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225-264, August 1996.
[7] Jay Huber, Christopher L. Elford, Daniel A. Reed, Andrew A.
Chien, and David S. Blumenthal. PPFS: A High Performance Portable Parallel File
System. In Proceedings of the 9th ACM International Conference on
Supercomputing, pages 385-394. ACM Press, July 1995.
[8] David Kotz. Disk-directed I/O for MIMD Multiprocessors. ACM
Transactions on Computer Systems, 15(1):41-74, February 1997.
[9] Peter S. Pacheco, Parallel Programming with MPI, by Morgan
Kaufmann Publishers, pages 137-178, 1997
[10] Rajeev Thakur, William Gropp, Ewing Lusk, On Implementing MPI-IO Portably and with High Performance, Sixth Workshop on I/O in Parallel and Distributed Systems, ACM, May 1999, pp. 23-32.
[11] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, Jack
Dongarra, MPI - The Complete Reference, Volume 1, The MPI Core, MIT Press,
pages 123-189, 1996
[12] William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing
Lusk, Bill Nitzberg, William Saphir, Marc Snir, MPI - The Complete Reference,
Volume 2, The MPI Extensions, MIT Press, pages 185-274, 1998
[13] William Gropp, Ewing Lusk, Rajeev Thakur, Using MPI-2 Advanced
Features of the Message-Passing Interface, MIT Press, pages 51-118, 1999
[14] Message Passing Interface Forum, MPI-2 Extensions to the
Message-Passing Interface,
[15] Rajeev Thakur, William Gropp, Ewing Lusk, A Case for Using MPI’s Derived Datatypes to Improve I/O Performance, http://www.supercomp.org/sc98/TechPapers/sc98_FullAbstracts/Thakur447/, pages 1-9, 1998
[16] Pierre Kuonen, Ralf Gruber, Parallel computer architectures for
commodity computing and the Swiss-T1 machine. EPFL Supercomputing Review, Nov
99, pp. 3-11,
http://sawww.epfl.ch/SIC/SA/publications/SCR99/scr11-page3.html
[17] Stephan Brauss, Communication Libraries for the Swiss-Tx
Machines. EPFL Supercomputing Review, Nov 99, pp. 12-15. http://sawww.epfl.ch/SIC/SA/publications/SCR99/scr11-page12.html
[18] Benoit A. Gennart,
[19] Edward K. Lee, Highly-Available, Scalable Network Storage, In
Digest of Papers COMPCON 1995, pages 397-402. IEEE Computer Society Press,
March 1995.
[20] Edward K. Lee and Chandramohan A. Thekkath, Petal: Distributed
Virtual Disks, In Proceedings of the Seventh International Conference on
Architectural Support for Programming Languages and Operating Systems,
ASPLOS-VII, pages 84-92. ACM, October 1996. ftp://ftp.digital.com/pub/DEC/SRC/publications/eklee/petal-paper.pdf
[21] Edward K. Lee, Chandramohan A. Thekkath, Chris Whitaker, Jim
Hogg, A Comparison of Two Distributed Disk Systems, SRC Research Report 155,
April 30, 1998, http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-155.html
[22] Kuonen P. The K-Ring: a versatile model for the design of MIMD
computer topology, Proceedings of the High-Performance Computing Conference
(HPC'99), San Diego, USA, pp. 381-385; April (1999).