I/O Tuning

The TAMUCC parallel I/O system, a Lustre file system, is a collection of I/O servers and a large number of disks that act as if they were one very large disk. One of the I/O servers, the Meta-Data Server (MDS), tracks where each file is located on the disks of the different I/O servers.

Because Lustre spreads data over a large collection of disks, large files can be read and written across many disks quickly. However, every file open, close, and lookup must go through the single Meta-Data Server, which often becomes a bottleneck in I/O performance. With hundreds of jobs running at the same time and sharing the Lustre file system, there can be heavy contention for access to the Meta-Data Server.

When considering I/O performance, use the "avoid too often and too many" rule. Writing many small files, opening and closing a file frequently, and writing a separate file for each task in a large parallel job all stress the MDS. It is best to aggregate I/O operations whenever possible. For best I/O performance, consider using a library such as parallel HDF5 to write a single file in parallel efficiently.
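
The difference between one file per task and a single shared file can be sketched as follows. This is only an illustration under stated assumptions: a serial loop stands in for the parallel tasks, plain byte offsets stand in for what a library such as parallel HDF5 would manage, and the names `RECORD_BYTES`, `write_file_per_task`, and `write_shared_file` are hypothetical.

```python
import os
import tempfile

RECORD_BYTES = 4096  # bytes contributed by each task (hypothetical size)


def write_file_per_task(directory, rank, payload):
    """Pattern to avoid at scale: every task creates its own file."""
    path = os.path.join(directory, f"out.{rank:05d}")
    with open(path, "wb") as f:
        f.write(payload)


def write_shared_file(path, rank, payload):
    """Each task writes its block at offset rank * len(payload) in one file."""
    with open(path, "r+b") as f:
        f.seek(rank * len(payload))
        f.write(payload)


ntasks = 16
workdir = tempfile.mkdtemp()
shared_path = os.path.join(workdir, "out.shared")
per_task_dir = os.path.join(workdir, "per_task")
os.mkdir(per_task_dir)

# Create and size the shared file once, before any task writes to it.
with open(shared_path, "wb") as f:
    f.truncate(ntasks * RECORD_BYTES)

for rank in range(ntasks):  # serial stand-in for the parallel tasks
    payload = bytes([rank]) * RECORD_BYTES
    write_file_per_task(per_task_dir, rank, payload)
    write_shared_file(shared_path, rank, payload)

files_created_per_task = len(os.listdir(per_task_dir))
shared_size = os.path.getsize(shared_path)
```

The per-task pattern creates as many files as there are tasks, while the shared pattern creates one file no matter how many tasks write to it.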

One common-sense approach is to use what the vendor provides, that is, to take advantage of the hardware. On Linux-based clusters, for example, this could mean using the Parallel Virtual File System (PVFS). On IBM systems, it would mean using the fast General Parallel File System (GPFS) provided by IBM.

Another sensible approach to optimizing I/O is to be aware of which file systems exist and where they are located, that is, whether a file system is mounted locally or accessed remotely. A locally mounted file system is much faster than a remote one, which is limited by network bandwidth, disk speed, and the overhead of accessing the file system over the network. Using local storage should always be the goal at the design level.
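
One way to check whether a path lives on a local or a remote file system is to look it up in the mount table. The sketch below assumes a Linux-style mount table (the format of /proc/mounts); the sample entries, the helper names, and the set of "remote" types are illustrative assumptions, not a description of any particular cluster.

```python
import posixpath

# File system types that are normally served over the network.
# This set is an assumption for illustration and is not exhaustive.
REMOTE_FS_TYPES = {"nfs", "nfs4", "lustre", "gpfs", "cifs"}

# Made-up mount table in /proc/mounts format (device, mount point, type, ...).
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw 0 0
tmpfs /tmp tmpfs rw 0 0
10.0.0.1@tcp:/work /work lustre rw 0 0
filer:/export/home /home nfs4 rw 0 0
"""


def parse_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def fs_type_for(path, mounts):
    """Return the type of the file system holding path (longest mount match)."""
    path = posixpath.normpath(path) + "/"
    best_point, best_type = "", None
    for mount_point, fs_type in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if path.startswith(prefix) and len(mount_point) > len(best_point):
            best_point, best_type = mount_point, fs_type
    return best_type


def is_remote(path, mounts):
    """True if path is on a file system type listed in REMOTE_FS_TYPES."""
    return fs_type_for(path, mounts) in REMOTE_FS_TYPES


mounts = parse_mounts(SAMPLE_MOUNTS)
```

On a Linux node, the text read from /proc/mounts could be passed to `parse_mounts` in place of `SAMPLE_MOUNTS`.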

Other approaches involve choosing the best software options available. Some of them are listed below:

  • Read or write as much data as possible with a single READ/WRITE/PRINT. Avoid performing multiple writes of small records.
  • Use binary instead of ASCII format because of the overhead incurred in converting the internal representation of real numbers to a character string. In addition, an ASCII file is larger than the corresponding binary file.
  • In Fortran, prefer direct access to sequential access. Direct or random access files do not have record length indicators at the beginning and end of each record.
  • If available, use asynchronous I/O to overlap reads/writes with computation.
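
The first two items in the list above can be illustrated with a short sketch. Python is used here as a stand-in for the Fortran READ/WRITE statements the list refers to, and the helper names are hypothetical. Note that Python buffers text writes internally, so the per-record loop is cheaper here than unbuffered small writes would be; the size difference between the two files is unaffected by buffering.

```python
import os
import tempfile
from array import array


def write_ascii(path, values):
    """Format and write one value per line as text, one record at a time."""
    with open(path, "w") as f:
        for v in values:
            f.write(f"{v!r}\n")


def write_binary(path, values):
    """Write all values as raw doubles with a single bulk write."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def read_binary(path):
    """Read the whole binary file back with a single bulk read."""
    count = os.path.getsize(path) // array("d").itemsize
    data = array("d")
    with open(path, "rb") as f:
        data.fromfile(f, count)
    return data.tolist()


values = [i / 7.0 for i in range(10000)]
workdir = tempfile.mkdtemp()
ascii_path = os.path.join(workdir, "data.txt")
binary_path = os.path.join(workdir, "data.bin")

write_ascii(ascii_path, values)
write_binary(binary_path, values)

ascii_size = os.path.getsize(ascii_path)
binary_size = os.path.getsize(binary_path)
```

The binary file holds one fixed-size double per value and needs no text conversion on read or write, whereas the ASCII file stores each value as a string of up to about twenty characters.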

General I/O Tips

  1. Don't open and close files with every I/O operation

    Open the files that are needed at the beginning of execution and close them at the end. Each open/close operation has an overhead cost that adds up, especially if multiple tasks are opening and closing files at the same time.

  2. Use the /work filesystem

    /work has more I/O servers than /home. If you need to keep your input and output data on /home, you can add commands to your batch script that copy the input data to /work at the beginning of a run and copy the results back to /home at the end.

  3. Limit the number of files in one directory

    Directories that contain hundreds or thousands of files will be slow to respond to I/O operations because of the overhead of indexing the files. Break up directories with a large number of files into subdirectories with fewer files.

  4. Aggregate I/O operations as much as possible

    Fewer large read/write operations are much more efficient than many small operations. This may be accomplished by reducing the number of writers from one per task to one per node or fewer, which balances the bandwidth of a node with the bandwidth of the I/O servers.
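
Tips 1 and 4 can be combined in one sketch: a single writer per node keeps its file open for the whole run and turns many small records into a few large writes. The code below is a serial illustration, not an MPI program. It assumes consecutive ranks share a node (block placement), and `writer_rank`, `AggregatingWriter`, and the record and buffer sizes are hypothetical choices. In a real parallel code the records would reach the writer through communication, which is not shown.

```python
import os
import tempfile


def writer_rank(rank, tasks_per_node):
    """Rank of the designated writer on the node that hosts rank.

    Assumes consecutive ranks are placed on the same node.
    """
    return rank - rank % tasks_per_node


class AggregatingWriter:
    """Keep one file open and turn many small records into few large writes."""

    def __init__(self, path, flush_bytes=1 << 20):
        self.path = path
        self.write_calls = 0
        self._file = open(path, "wb")  # opened once for the whole run
        self._pending = []
        self._pending_bytes = 0
        self._flush_bytes = flush_bytes

    def add(self, record):
        """Queue a small record; flush once enough bytes have accumulated."""
        self._pending.append(record)
        self._pending_bytes += len(record)
        if self._pending_bytes >= self._flush_bytes:
            self.flush()

    def flush(self):
        """Write all queued records with a single write call."""
        if self._pending:
            self._file.write(b"".join(self._pending))
            self.write_calls += 1
            self._pending = []
            self._pending_bytes = 0

    def close(self):
        """Flush what is left and close the file once, at the end."""
        self.flush()
        self._file.close()


tasks_per_node = 4
nranks = 8
workdir = tempfile.mkdtemp()
writers = {}

for rank in range(nranks):  # serial stand-in for the parallel tasks
    w = writer_rank(rank, tasks_per_node)
    if w not in writers:
        path = os.path.join(workdir, f"node_{w:04d}.bin")
        writers[w] = AggregatingWriter(path, flush_bytes=64 * 1024)
    for _ in range(1000):
        writers[w].add(bytes([rank]) * 64)  # 64-byte record

for writer in writers.values():
    writer.close()
```

With these sizes, 8000 small records end up in two files, each written with four large write calls instead of 4000 small ones.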