-
Notifications
You must be signed in to change notification settings - Fork 12
icl-utk-edu/hpcc
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
% -*- LaTeX -*- % % This is the master README. Translation to text and HTML files is done % with HeVeA. % % Postprocessing - replace </style> with: % tt {color:navy} % h2, h3, h4 {color: #527bbd;} % h2 {border-bottom: 2px solid silver;} %.verbatim {background: #ffffee; border: 1px solid silver; padding: 0.5em;} % </style> \documentclass[twocolumn]{article} \usepackage{mathptmx} \usepackage{url} \usepackage[margin=2cm]{geometry} \begin{document} \title{DARPA/DOE HPC~Challenge Benchmark version 1.5.0beta} \author{Piotr Luszczek\footnote{University of Tennessee Knoxville, Innovative Computing Laboratory}} \date{October 12, 2012} \maketitle \section{Introduction} This is a suite of benchmarks that measure performance of processor, memory subsytem, and the interconnect. For details refer to the HPC~Challenge web site (\url{http://icl.cs.utk.edu/hpcc/}.) In essence, HPC~Challenge consists of a number of tests each of which measures performance of a different aspect of the system. If you are familiar with the High Performance Linpack~(HPL) benchmark code (see the HPL web site: \texttt{http://www.netlib.org/benchmark/hpl/}) then you can reuse the build script file~(input for \texttt{make(1)} command) and the input file that you already have for HPL. The HPC~Challenge benchmark includes HPL and uses its build script and input files with only slight modifications. The most important change must be done to the line that sets the \texttt{TOPdir} variable. For HPC~Challenge, the variable's value should always be \texttt{../../..} regardless of what it was in the HPL build script file. \section{Compiling} The first step is to create a build script file that reflects characteristics of your machine. This file is reused by all the components of the HPC~Challenge suite. The build script file should be created in the \texttt{hpl} directory. This directory contains instructions (the files \texttt{README} and \texttt{INSTALL}) on how to create the build script file for your system. The \texttt{hpl/setup} directory contains many examples of build script files. A recommended approach is to copy one of them to the \texttt{hpl} directory and if it doesn't work then change it. The build script file has a name that starts with \texttt{Make.} prefix and usally ends with a suffix that identifies the target system. For example, if the suffix chosen for the system is \texttt{Unix}, the file should be named \texttt{Make.Unix}. To build the benchmark executable (for the system named \texttt{Unix}) type: \texttt{make arch=Unix}. This command should be run in the top directory~(not in the \texttt{hpl} directory). It will look in the \texttt{hpl} directory for the build script file and use it to build the benchmark executable. The runtime behavior of the HPC~Challenge source code may be configured at compiled time by defining a few C preprocessor symbols. They can be defined by adding appropriate options to \texttt{CCNOOPT} and \texttt{CCFLAGS} make variables. The former controls options for source code files that need to be compiled without aggressive optimizations to ensure accurate generation of system-specific parameters. The latter applies to the rest of the files that need good compiler optimization for best performance. To define a symbol \texttt{S}, the majority of compilers requires option \texttt{-DS} to be used. Currently, the following options are available in the HPC~Challenge source code: \begin{itemize} \item \texttt{HPCC\_FFT\_235}: if this symbol is defined the FFTE code (an FFT implementation) will use vector sizes and processor counts that are not limited to powers of 2. Instead, the vector sizes and processor counts to be used will be a product of powers of 2, 3, and 5. \item \texttt{HPCC\_FFTW\_ESTIMATE}: if this symbol is defined it will affect the way external FFTW library is called~(it does not have any effect if the FFTW library is not used). When defined, this symbol will call the FFTW planning routine with \texttt{FFTW\_ESTIMATE} flag~(instead of \texttt{FFTW\_MEASURE}). This might result with worse performance results but shorter execution time of the benchmark. Defining this symbol may also positively affect the memory fragmentation caused by the FFTW's planning routine. \item \texttt{HPCC\_MEMALLCTR}: if this symbol is defined a custom memory allocator will be used to alleviate effects of memory fragmentation and allow for larger data sets to be used which may result in obtaining better performance. \item \texttt{HPL\_USE\_GETPROCESSTIMES}: if this symbol is defined then Windows-specific \texttt{GetProcessTimes()} function will be used to measure the elapsed CPU time. \item \texttt{USE\_MULTIPLE\_RECV}: if this symbol is defined then multiple non-blocking receives will be posted simultaneously. By default only one non-blocking receive is posted. \item \texttt{RA\_SANDIA\_NOPT}: if this symbol is defined the HPC~Challenge standard algorithm for Global RandomAccess will not be used. Instead, an alternative implementation from Sandia National Laboratory will be used. It routes messages in software across virtual hyper-cube topology formed from MPI processes. \item \texttt{RA\_SANDIA\_OPT2}: if this symbol is defined the HPC~Challenge standard algorithm for Global RandomAccess will not be used. Instead, instead an alternative implementation from Sandia National Laboratory will be used. This implementation is optimized for number of processors being powers of two. The optimizations are sorting of data before sending and unrolling the data update loop. If the number of process is not a power two then the code is the same as the one performed with the \texttt{RA\_SANDIA\_NOPT} setting. \item \texttt{RA\_TIME\_BOUND\_DISABLE}: if this symbol is defined then the standard Global RandomAccess code will be used without time limits. This is discouraged for most runs because the standard algorithm tends to be slow for large array sizes due to a large overhead for short MPI messages. \item \texttt{USING\_FFTW}: if this symbol is defined the standard HPC~Challenge FFT implemenation~(called FFTE) will not be used. Instead, FFTW library will be called. Defining the \texttt{USING\_FFTW} symbol is not sufficient: appropriate flags have to be added in the make script so that FFTW headers files can be found at compile time and the FFTW libraries at link time. \end{itemize} \section{Runtime Configuration} The HPC~Challenge is driven by a short input file named \texttt{hpccinf.txt} that is almost the same as the input file for HPL~(customarily called \texttt{HPL.dat}). Refer to the directory \texttt{hpl/www/tuning.html} for details about the input file for HPL. A sample input file is included with the HPC~Challenge distribution. The differences between HPL's input file and HPC~Challenge's input file can be summarized as follows: \begin{itemize} \item Lines 3 and 4 are ignored. The output is always appended to the file named \texttt{hpccoutf.txt}. \item There are additional lines~(starting with line 33) that may~(but do not have to) be used to customize the HPC~Challenge benchmark. They are described below. \end{itemize} The additional lines in the HPC~Challenge input file~(compared to the HPL input file) are: \begin{itemize} \item Lines 33 and 34 describe additional matrix sizes to be used for running the PTRANS benchmark~(one of the components of the HPC~Challenge benchmark). \item Lines 35 and 36 describe additional blocking factors to be used for running the PTRANS test. \end{itemize} Just for completeness, here is the list of lines of the HPC Challenge's input file and brief description of their meaning: \begin{itemize} \item Line 1: ignored \item Line 2: ignored \item Line 3: ignored \item Line 4: ignored \item Line 5: number of matrix sizes for HPL (and PTRANS) \item Line 6: matrix sizes for HPL (and PTRANS) \item Line 7: number of blocking factors for HPL (and PTRANS) \item Line 8: blocking factors for HPL (and PTRANS) \item Line 9: type of process ordering for HPL \item Line 10: number of process grids for HPL (and PTRANS) \item Line 11: numbers of process rows of each process grid for HPL (and PTRANS) \item Line 12: numbers of process columns of each process grid for HPL (and PTRANS) \item Line 13: threshold value not to be exceeded by scaled residual for HPL (and PTRANS) \item Line 14: number of panel factorization methods for HPL \item Line 15: panel factorization methods for HPL \item Line 16: number of recursive stopping criteria for HPL \item Line 17: recursive stopping criteria for HPL \item Line 18: number of recursion panel counts for HPL \item Line 19: recursion panel counts for HPL \item Line 20: number of recursive panel factorization methods for HPL \item Line 21: recursive panel factorization methods for HPL \item Line 22: number of broadcast methods for HPL \item Line 23: broadcast methods for HPL \item Line 24: number of look-ahead depths for HPL \item Line 25: look-ahead depths for HPL \item Line 26: swap methods for HPL \item Line 27: swapping threshold for HPL \item Line 28: form of L1 for HPL \item Line 29: form of U for HPL \item Line 30: value that specifies whether equilibration should be used by HPL \item Line 31: memory alignment for HPL \item Line 32: ignored \item Line 33: number of additional problem sizes for PTRANS \item Line 34: additional problem sizes for PTRANS \item Line 35: number of additional blocking factors for PTRANS \item Line 36: additional blocking factors for PTRANS \end{itemize} \section{Running} The exact way to run the HPC~Challenge benchmark depends on the MPI implementation and system details. An example command to run the benchmark could like like this: \texttt{mpirun -np 4 hpcc}. The meaning of the command's components is as follows: \begin{itemize} \item \texttt{mpirun} is the command that starts execution of an MPI code. Depending on the system, it might also be \texttt{aprun}, \texttt{mpiexec}, \texttt{mprun}, \texttt{poe}, or something appropriate for your computer. \item \texttt{-np 4} is the argument that specifies that 4 MPI processes should be started. The number of MPI processes should be large enough to accomodate all the process grids specified in the \texttt{hpccinf.txt} file. \item \texttt{hpcc} is the name of the HPC~Challenge executable to run. \end{itemize} After the run, a file called \texttt{hpccoutf.txt} is created. It contains results of the benchmark. This file should be uploaded through the web form at the HPC~Challenge website. \section{Source Code Changes across Versions (ChangeLog)} \subsection{Version 1.5.0 (2016-03-18)} \begin{enumerate} \item Fixed memory leak in STREAM code. \item Fixed bug in STREAM that resulted in minimum results reported as 0. \item Removed some of the compilation warnings. \end{enumerate} \subsection{Version 1.5.0beta (2015-07-23)} \begin{enumerate} \item Added new targets to the main make(1) file. \item Fixed bug introduced while updating to MPI STREAM 1.7 with spurious global communicator (reported by NEC). \item Added make(1) file for OpenMPI from MacPorts. \item Fixed bug introduced while updating to MPI STREAM 1.7 that caused some ranks to use NULL communicator. \item Fixed bug introduced while updating to MPI STREAM 1.7 that caused syntax errors. \end{enumerate} \subsection{Version 1.5.0alpha (2015-05-22)} \begin{enumerate} \item Added global error accounting in STREAM. \item Updated checking to report from multiple MPI processes contributing to overall error. \item Added barrier to make sure all processes enter STREAM kernel tests at the same time. \item Updated naming conventions to match the original benchmark in STREAM. \item Changed scaling constant to prevent verification from overflowing in STREAM. \item Simplified MPI communicator code in STREAM. \item Substituted large constants for more descriptive compile time arithmetic in STREAM. \item Added the ``restrict'' keyword to the STREAM vector pointers for faster generated code. \item Updated STREAM code to the official STREAM MPI version 1.7. \item Removed infinite loop due to default compiler optimization in DLAMCH and SLAMCH. \item Added compiler flags to allow compiling with a C++ compiler. \end{enumerate} \subsection{Version 1.4.3 (2013-08-26)} \begin{enumerate} \item Increased the size of scratch vector for local FFT tests that was missed in the previous version (reported by SGI). \item Added Makefile for Blue Gene/P contributed by Vasil Tsanov. \end{enumerate} \subsection{Version 1.4.2 (2012-10-12)} \begin{enumerate} \item Increased sizes of scratch vectors for local FFT tests to account for runs on systems with large main memory (reported by IBM, SGI and Intel). \item Reduced vector size for local FFT tests due to larger scratch space needed. \item Added a type cast to prevent overflow of a 32-bit integer vector size in FFT data generation routine (reported by IBM). \item Fixed variable types to handle array sizes that overflow 32-bit integers in RandomAccess (reported by IBM and SGI). \item Changed time-bound code to be used by default in Global RandomAccess and allowed for it to be switched off with a compile time flag if necessary. \item Code cleanup to allow compilation without warnings of RandomAccess test. \item Changed communication code in PTRANS to avoid large message sizes that caused problems in some MPI implementations. \item Updated documentation in README.txt and README.html files. \end{enumerate} \subsection{Version 1.4.1 (2010-06-01)} \begin{enumerate} \item Added optimized variants of RandomAccess that use Linear Congruential Generator for random number generation. \item Made corrections to comments that provide definition of the RandomAccess test. \item Removed initialization of the main array from the timed section of optimized versions of RandomAccess. \item Fixed the length of the vector used to compute error when using MPI implementation from FFTW. \item Added global reduction to error calculation in MPI FFT to achieve more accurate error estimate. \item Updated documentation in README. \end{enumerate} \subsection{Version 1.4.0 (2010-03-26)} \begin{enumerate} \item Added new variant of RandomAccess that uses Linear Congruential Generator for random number generation. \item Rearranged the order of benchmarks so that HPL component runs last and may be aborted if the performance of other components was not satisfactory. RandomAccess is now first to assist in tuning the code. \item Added global initialization and finalization routine that allows to properly initialize and finalize external software and hardware components without changing the rest of the HPCC testing harness. \item Lack of \texttt{hpccinf.txt} is no longer reported as error but as a warning. \end{enumerate} \subsection{Version 1.3.2 (2009-03-24)} \begin{enumerate} \item Fixed memory leaks in G-RandomAccess driver routine. \item Made the check for 32-bit vector sizes in G-FFT optional. MKL allows for 64-bit vector sizes in its FFTW wrapper. \item Fixed memory bug in single-process FFT. \item Update documentation (README). \end{enumerate} \subsection{Version 1.3.1 (2008-12-09)} \begin{enumerate} \item Fixed a dead-lock problem in FFT component due to use of wrong communicator. \item Fixed the 32-bit random number generator in PTRANS that was using 64-bit routines from HPL. \end{enumerate} \subsection{Version 1.3.0 (2008-11-13)} \begin{enumerate} \item Updated HPL component to use HPL 2.0 source code \begin{enumerate} \item Replaced 32-bit Pseudo Random Number Generator (PRNG) with a 64-bit one. \item Removed 3 numerical checks of the solution residual with a single one. \item Added support for 64-bit systems with large memory sizes (before they would overflow during index calculations 32-bit integers.) \end{enumerate} \item Introduced a limit on FFT vector size so they fit in a 32-bit integer (only applicable when using FFTW version 2.) \end{enumerate} \subsection{Version 1.2.0 (2007-06-25)} \begin{enumerate} \item Changes in the FFT component: \begin{enumerate} \item Added flexibility in choosing vector sizes and processor counts: now the code can do powers of 2, 3, and 5 both sequentially and in parallel tests. \item FFTW can now run with ESTIMATE (not just MEASURE) flag: it might produce worse performance results but often reduces time to run the test and cuases less memory fragmentation. \end{enumerate} \item Changes in the DGEMM component: \begin{enumerate} \item Added more comprehensive checking of the numerical properties of the test's results. \end{enumerate} \item Changes in the RandomAccess component: \begin{enumerate} \item Removed time-bound functionality: only runs that perform complete computation are now possible. \item Made the timing more accurate: main array initialization is not counted towards performance timing. \item Cleaned up the code: some non-portable C language constructs have been removed. \item Added new algorithms: new algorithms from Sandia based on hypercube network topology can now be chosen at compile time which results on much better performance results on many types of parallel systems. \item Fixed potential resource leaks by adding function calls rquired by the MPI standard. \end{enumerate} \item Changes in the HPL component: \begin{enumerate} \item Cleaned up reporting of numerics: more accurate printing of scaled residual formula. \end{enumerate} \item Changes in the PTRANS component: \begin{enumerate} \item Added randomization of virtual process grids to measure bandwidth of the network more accurately. \end{enumerate} \item Miscellaneous changes: \begin{enumerate} \item Added better support for Windows-based clusters by taking advantage of Win32 API. \item Added custom memory allocator to deal with memory fragmentation on some systems. \item Added better reporting of configuration options in the output file. \end{enumerate} \end{enumerate} \subsection{Version 1.0.0 (2005-06-11)} \subsection{Version 0.8beta (2004-10-19)} \subsection{Version 0.8alpha (2004-10-15)} \subsection{Version 0.6beta (2004-08-21)} \subsection{Version 0.6alpha (2004-05-31)} \subsection{Version 0.5beta (2003-12-01)} \subsection{Version 0.4alpha (2003-11-13)} \subsection{Version 0.3alpha (2004-11-05)} \end{document}