ReleaseNotes


             MAGMA Release Notes

-----------------------------------------------------

MAGMA is intended for CUDA enabled NVIDIA GPUs and HIP enabled AMD GPUs.
It supports NVIDIA's Kepler, Maxwell, Pascal, Volta, Turing, Ampere, and Hopper
GPUs, and AMD's HIP GPUs.

Included are routines for the following algorithms:

    * LU, QR, and Cholesky factorizations
    * Hessenberg, bidiagonal, and tridiagonal reductions
    * Linear solvers based on LU, QR, and Cholesky
    * Eigenvalue and singular value (SVD) problem solvers
    * Generalized Hermitian-definite eigenproblem solver
    * Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky
    * MAGMA BLAS including gemm, gemv, symv, and trsm
    * Batched MAGMA BLAS including gemm, gemv, herk, and trsm
    * Batched MAGMA LAPACK including LU, inverse (getri), QR, and Cholesky factorizations
    * MAGMA Sparse including CG, GMRES, BiCGSTAB, LOBPCG, iterative refinement,
      preconditioners, sparse kernels (SpMV, SpMM), and support for CSR, ELL, and
      SELL-P data formats

Most routines have all four precisions:
single (s), double (d), single-complex (c), double-complex (z).

2.8.0 - Mar 25, 2024
    * New functionality: band LU factorization and solve
      - magma_{s,d,c,z}gbtrf_native computes the LU factorization of a band matrix
        using partial pivoting with row interchanges. This is equivalent to the
        LAPACK GBTRF routine.
      - magma_{s,d,c,z}gbsv_native computes the solution to a system of linear
        equations A * X = B, where A is a band matrix and X and B are general dense
        matrices. This is equivalent to the LAPACK GBSV routine.
      - magma_{s,d,c,z}gbtrf_batched and magma_{s,d,c,z}gbtrf_batched_strided are
        the batched and the stride-batched versions of GBTRF, respectively.
      - magma_{s,d,c,z}gbsv_batched and magma_{s,d,c,z}gbsv_batched_strided are
        the batched and the stride-batched versions of GBSV, respectively.
    * Native Cholesky factorization, magma_{s,d,c,z}potrf_native, now supports uplo = MagmaUpper
    * Bug fixes:
      - Batch QR factorization: fix numerical behavior for some corner cases
      - Variable-size batch GEMM: fix numerical behavior when k = 0 and beta != 1
      - GESV: fix failures for very large matrices (beyond 46k)
      - Batch GESV: fix failure when the number of right hand sides is larger than 1024
      - Fix compilation for rocm-6
      - Multi-GPU syevd: fix failures on very large matrices
      - Multi-GPU potrf: fix failures on 4 or more GPUs

2.7.2 - Aug 25, 2023
    * Add expert interfaces for LU, QR, and Cholesky factorizations
    * Add tuning specifications for LU, QR, and Cholesky factorizations
    * Tuning for Ampere and later GPUs
    * Fused LU panel for AMD GPUs
    * Bug fixes for batched LU on singular matrices

2.7.1 - Feb 23, 2023
    * Add support for CUDA 12
    * Add a new interface for batch GEMV that accepts a pointer + stride
    * Add sparse test matrices to the release tarball
    * Performance improvement for batch GEMV targeting square sizes up to 32
    * Update CMakeLists compiler flags for Windows

2.7.0 - Nov 9, 2022
    * Add support for builds targeting NVIDIA's Hopper architecture
    * New routine: magma_dshposv_gpu and magma_dshposv_native solve Ax = b, for a
      symmetric positive definite matrix 'A', using FP16 during the Cholesky
      factorization. GMRES-based iterative refinement is used to recover the solution
      up to double precision accuracy. The '_gpu' suffix denotes a hybrid CPU-GPU
      factorization, while '_native' denotes a GPU-only factorization
    * Performance improvement for the batch QR factorization routine
    * Performance improvement for the variable size batch LU factorization routine
    * Bug fixes, performance optimizations, benchmark additions, and maintenance
      updates to support current and new MAGMA routines, latest NVIDIA and AMD
      math libraries and GPU hardware

 2.6.2 - Mar 15, 2022
    * New routine: magma_{s,d,c,z}getrf_vbatched provides a variable-size batched LU
      factorization with partial pivoting. This is a reference implementation, with more
      performance optimizations planned for future releases
    * New routine: magmablas_{s,d,c,z}trsm_vbatched now provides a variable-size batched
      TRSM that does not invert the diagonal blocks of the input triangular matrix. The
      routine can be tested by passing "--version 3" to testing_{s,d,c,z}trsm_vbatched
    * Caling more hipBLAS functions
    * Bug fixes (n==0 in Cholesky factorization; synchronization in LQ; installation)
    * Remove gfx803 target for AMD GPUs
    * Add uplo argument in inertia compuattion routines (only upper was supported before)
    * Fix memory leak in magma_queue for hip functions
    * Add FP16 and FP16-FP32 GEMM benchmark for HIP

 2.6.1 - July 12, 2021
    * Bug fix for installing MAGMA with spack on CUDA 9 and older
    * Expert interface for Cholesky factorization to improve performance
      for small problems
    * Define some magma_blas routines to call AMD BLAS for HIP installation
      (these routines were previously either not present or were underperforming
       underperforming in AMD BLAS, and were therefore defined through magmablas)

 2.6.0 - June 26, 2021
    * Added HIP support for AMD GPUs (former hipMAGMA) as part of MAGMA
    * Added inertia computational routines for GPUs
    * Performance improvements for AMD GPUs
    * Performance improvement for magma_Xgesv_batched for small sizes
    * Added Bunch-Kaufman GPU-only sover using BLAS calls (magma_zhetrs_gpu)
    * Added include/magma_config.h file storing the configuration for a particular
      magma installation (CUDA vs. HIP, etc.)
    * Added expert interfaces for magma_Xgetrf_gpu and magma_Xpotrf_gpu. These
      interfaces allow the user to specify the factorization mode; hybrid (CPU+GPU)
      vs. native (GPU only), as well as the blocking size (nb)
    * Added tuning for small size LU, QR, and Cholesky factorizations.

 2.5.4 - Oct 8, 2020
    * Support for CUDA 11
    * Support for Ampere GPUs
    * New routine: add trmm in all precisions
    * New routine: add sidi routine in real precisions to compute inertia for
      symmetric indefinite matrices
    * New routine: GPU interfaces to hetrf in all precisions
    * New routine: magmablas_Xdiinertia to compute the inertia of a diagonal
      of a matrix in the GPU memory
    * Bug fixes in herk and sytrd
    * Bug fixes in ranged eignesolver testers and fallback calls for small matrices
    * Performance improvement for Symmetric/Hermitian eigensolvers

 2.5.3 - Mar 28, 2020
    * Small modifications to enable hipMAGMA generation from MAGMA to support AMD GPUs
    * New routine: add syrk in all precisions
    * New routine: add hemm/symm in all precisions
    * New routine: add GEMM-based herk and her2k in all precision
    * Bug fix in cmake when USE_FORTRAN is OFF
    * Bug fix in example_sparse.c
    * Fix support for half computation in magmablas_hgemm_batched tester for CUDA < 9.2

 2.5.2 - Nov 24, 2019
    * New routine: magmablas_hgemm_batched for fixed size batched matrix multiplication
      in FP16 using the Tensor Cores. The routine does not currently support pre-Volta GPUs.
      The routine outperforms cuBLAS for sizes less than 100, as well as for general sizes that
      are not multiple of 8. The kernel is tuned for the notrans-notrans case only.
      Comprehensive tuning is planned in future releases
    * Fix magmablas_?gemm_vbatched routines to correctly handle batch sizes over
      65535. The same fix is applied to vbatched syrk, herk, syr2k, her2k, symm, hemm, and trmm
    * Fix a bug in the FP32 <-> FP16 conversion routines (magmablas_hlag2s and
      magmablas_slag2h). The bug used to cause a launch failure for very large
      matrices
    * Fix a bug in batched LU factorization to avoind NaNs when singularity is ancountered
    * Fix a bug in batched LU factorization to ensure that the first pivot is always returned
      even when multilpe pivots with the same absolute value are found
    * Add Frobenius norm for general matrices
      (supported as option to magmablas_Xlange for X = 's', 'd', 'c', or 'z')

 2.5.1 - Aug 2, 2019
    * New routine: magmablas_Xherk_small_reduce (X = 's', 'd', 'c', or 'z')
      is a special HERK routine that assumes that the output matrix is very
      small (up to 32), that that the input matrix is very tall-and-skinny

 2.5.1-alpha1 - May 9, 2019
    * Updates and improvements in CMakeLists.txt for improved/friendlier CMake
      and spack installations
    * Fixes related to MAGMA installation on GPUs and CUDA versions that do not
      support FP16 arithmetic
    * Support for Turing GPUs added
    * Remove some C++ features from MAGMA Sparse for friedlier compilation
      (using nvcc and various CPU compilers)

 2.5.0 - Nov 16, 2018
    * New routines: Magma is releasing the Nvidia Tensor Cores version
      of its linear mixed-precision solver that is able to provide an
      FP64 solution with up to 4X speedup. The release includes:
      magma_dhgesv_iteref_gpu (FP64-FP16 solver with FP64 input and solution)
      magma_dsgesv_iteref_gpu (FP64-FP32 solver with FP64 input and solution)
      magma_hgetrf_gpu        (mixed precision FP32-FP16 LU factorization)
      magma_htgetrf_gpu       (mixed precision FP32-FP16 LU factorization using Tensor Cores)

      Further details for the function names and the testing routines are given in file
      README_FP16_Iterative_Refinement.txt

    * New routine: magmablas_Xgemm_batched_strided (X = {s, d, c, z}) is
      the stride-based variant of magmablas_Xgemm_batched
    * New routine: magma_Xgetrf_native (X = {s, d, c, z}) performs the
      LU factorization with partial pivoting using the GPU only. It has
      the same interface as the hybrid (CPU+GPU) implementation provided
      by magma_Xgetrf_gpu. Testing the performance of this routine is
      possible through running testing_Xgetrf_gpu with the option
      (--version 3)
    * New routine: magma_Xpotrf_native (X = {s, d, c, z}) performs the
      Cholesky factorization using the GPU only. It has the same interface
      as the hybrid (CPU+GPU) implementation provided by magma_Xpotrf_gpu.
      Testing the performance of this routine is possible through running
      testing_Xpotrf_gpu with the option (--version 2)
    * Added benchmark for GEMM in FP16 arithmetic (HGEMM) as well as
      auxiliary functions to cast matrices from FP32 to FP16 storage
      (magmablas_slag2h) and from FP16 to FP32 (magmablas_hlag2s).
    * Added Fortran wrappers to allocate memory, manage queues and devices,
      and for BLAS routines with queues.

 2.4.0 - Jun 25, 2018
    * Added constrained least squares routines (magma_[sdcz]gglse)
      and dependencies:
      magma_zggrqf - generalized RQ factorization
      magma_zunmrq - multiply by orthogonal Q as returned by zgerqf
    * Performance improvements across many batch routines, including
      batched TRSM, batched LU, batched LU-nopiv, and batched Cholesky
    * Fixed some compilation issues with inf, nan, and nullptr.

    MAGMA-sparse
    * Changed the way how data from an external application is handled:
      There is now a clear distinction between memory allocated/used/freed from
      MAGMA and the user application.
      We added a functions magma_zvcopy and magma_zvpass that do not allocate
      memory, instead they copy values from/to application-allocated memory.
    * The examples ( in example/example_sparse.c ) give a demonstration on how
      these routines should be used.

 2.3.0 - Nov 15, 2017
    * Moved MAGMA's repository to Bitbucket: https://bitbucket.org/icl/magma
    * Added support for Volta GPUs
    * Improved performance for batched LU and QR factorizations
      on small square sizes up to 32
    * Added test matrix generator to many testers

    MAGMA-sparse
    * Added support for CUDA 9.0
    * Improved the ParILUT algorithm w.r.t. stability and scalability
    * Added ParICT, a symmetry-exploiting version of the ParILUT algorithm

 2.2.0 - Nov 20, 2016
    * Added variable size batched Cholesky factorization
      magma_[sdcz]potrf_vbatched
    * Added new fixed size batched BLAS routines
      magmablas_[cz]{hemm, hemv, trmm}_batched
      magmablas_[sd]{symm, symv, trmm}_batched
    * Added new variable size batched BLAS routines
      magmablas_[cz]{hemm, hemv, trmm, trsm}_vbatched
      magmablas_[sd]{symm, symv, trmm, trsm}_vbatched
    * Fixed memory leaks in {sy,he}evdx_2stage and getri_outofplace_batched.
    * Fixed bug for small matrices in {symm, hemm}_mgpu and updated tester.
    * Fixed libraries in make.inc examples for MKL with gcc.
    * More robust error checking for Batched BLAS routines.

    MAGMA-sparse
    * Added Incomplete Sparse Approximate Inverse (ISAI) Preconditioner
      for sparse triangular solves, including batched generation.
    * Added Block-Jacobi triangular solves, including variable blocksize
      (based on supervariable amalgamation).
    * Added ParILUT, a parallel threshold ILU based on OpenMP.
    * Added CSR5 format and CSR5 SpMV kernel, a sparse matrix vector product
      often outperforming the cuSPARSE SpMV CSR and HYB.

 2.1.0 - Aug 30, 2016
    * Added variable size batched routines:
      magmablas_[sdcz]{gemm, gemv, syrk, herk, syr2k, her2k}_vbatched
    * Improved performance of SVD routines, and fixed workspace size bugs.
    * More robust error checking for BLAS routines.
    * Expanded and reorganized documentation.
    * Improved install (added DESTDIR, LIB_SUFFIX to Makefile; added install to CMake).

    MAGMA-sparse
    * Added a preconditioned QMR iterative solver (PQMR) including a kernel-merged version.
    * Updated the preconditioner structure to allow for a specific ILU triangular solver.

 2.0.1 - Feb 26, 2016
    * Fixed bug with 'make install'

 2.0.0 - final:  Feb  8, 2016
       - beta 3: Jan 22, 2016
       - beta 2: Jan  6, 2016
    * See "README-v2.txt" for details about updating code.
    * Removed support for CUDA arch 1.x, which NVIDIA no longer supports since CUDA 6.
    * Changed to non-recursive Makefile.
    * Changed definition of magma_queue_t to opaque structure.
    * Changed header from magma.h to magma_v2.h
    * Changed magma_get_{getrf, geqp3, geqrf, geqlf, gelqf, gebrd, gesvd}_nb to take both m, n.
    * Added queue argument to magmablas routines, and deprecated magmablas{Set,Get}KernelStream.
      This resolves a thread safety issue with using global magmablas{Set,Get}KernelStream.
    * Fixed bugs related to relying on CUDA NULL stream implicit synchronization.
    * Fixed memory leaks (zunmqr_m, zheevdx_2stage, etc.). Add -DDEBUG_MEMORY option to catch leaks.
    * Fixed geqrf*_gpu bugs for m == nb, n >> m (-N 64,10000); and m >> n, n == nb+i (-N 10000,129)
    * Fixed zunmql2_gpu for rectangular sizes.
    * Fixed zhegvdx_m itype 3.
    * Added zunglq, zungbr, zgeadd2 (which takes both alpha and beta).
    * Merged single & multi-GPU CPU interface testers (e.g., merged testing_dgeev_m into testing_dgeev).
    * Deprecated magma_device_sync; use magma_queue_sync instead.

    MAGMA-sparse
    * Added QMR, TFQMR, preconditioned TFQMR
    * Added CGS, preconditioned CGS
    * Added kernel-fused versions for CGS/PCGS QMR, TFQMR/PTFQMR
    * Changed relative stopping criterion to be relative to RHS
    * Fixed bug in complex version of CG
    * Accelerated version of Jacobi-CG
    * Added very efficient IDR
    * Performance tuning for SELLP SpMV

 1.7.0 - final:  Sep 11, 2015
       - beta 1: Aug 25, 2015
    * Added results archive to compare historical performance.
    * Added Fortran code to example directory.
    * Added magmaf_wtime for consistency with other Fortran interfaces; deprecated magma_wtime_f.
    * Added and template batched MAGMA BLAS routine gemm, gemv, herk, trsv, and trsm
    * Tuned batched MAGMA BLAS routines, in particular gemm, gemv, herk, and trsm
    * Tuned batched MAGMA LAPACK routines, in particular Cholesky factorizations
    * Tuned two stage symmetric eigenvalue code, {sy|he}heevdx_2stage, to improve performance.
    * Tuned symmetric eigenvalue code, {sy|he}evd, to improve performance for N < 2000.
    * Fixed NaN result with {sy|he}mv and {sy|he}mv_mgpu if GPU shared memory had NaN.
    * Fixed Fortran constants (MagmaTrans, MagmaUpper, etc.).
    * Fixed workspace requirements for the two stage symmetric eigenvalue problem
      {sy|he}heevdx_2stage and multi-GPU {sy|he}heevdx_2stage_m.
    * Fixed workspace requirements for Hessenberg (gehrd and gehrd_m) and multi-GPU geev_m.
    * Fixed trtri for unit diagonal, and added tester.
    * Fixed testing check for inverse (getri).
    * Fixed multi-GPU {or|un}gqr_m for some k < n. (Currently only used in geev_m with m = n = k.)
    * Fixed bug for batched routines
    * Rename lapack_const to lapack_const_str, to avoid name conflict with PLASMA.
    * Allow CMake build without Fortran (already existed for make).

    MAGMA-sparse
    * Added Induced Dimension Reduction Iterative solver (IDR).
    * Added iterative sparse triangular solves for
      incomplete factorization preconditioners.

 1.6.2 - May 4, 2015
    * Added magma_{s,d,c,z}sqrt for real and complex scalar square root.
    * Added magma_ceildiv and magma_roundup.
    * Fixed magmablas_zlaset and magmablas_zlacpy for large M or N > 4M.
    * Fixed testers for geqrf_batched and trsm_batched to compile with CUDA 5.x.

    MAGMA-sparse
    * All allocation failures and other errors now return error codes.
    * cuSPARSE error codes mapped to MAGMA error codes.
    * LOBPCG sparse eigensolver enabled for preconditioning using Jacobi and
      incomplete LU factorizations.
    * Some name changes in MAGMA-sparse for consistency with dense MAGMA.
      All functions working on matrices now start with the prefix magma_zm***
      instead some of them starting with magma_z_m***.
    * magma_zmvisu for printing a matrix is now called magma_zprint_matrix.
    * Added a tester for the sparse level 1 BLAS.
    * Rename magma_z_sparse_matrix into magma_z_matrix.
    * Redefine all vectors as dense matrices.
    * Replace the vector functions with matrix functions.
    * Bug fix in complex FGMRES.
    * Added iterative incomplete factorization routines (iterative ILU/iterative IC).
    * Enhance the ILU/IC with fill-in (level-ILU).

 1.6.1 - January 30, 2015
    * Building as both shared and static library is default now.
      Comment out FPIC in make.inc to build only static library.
    * Added max norm and one norm to [zcsd]lange.
    * Extended {sy|he}mv and {sy|he}mv_mgpu implementation to upper triangular.
    * Fixed memory access bug in {sy|he}mv_mgpu, used in {sy|he}trd_mgpu.
    * Fixed errant argument check in laswp, affecting getrf_mgpu.
    * Fixed tau in [cz]gelqf, which needed to be conjugated.
    * Fixed workspace size in symmetric/Hermitian eigenvalue solvers.
    * Made fast magmablas_zhemv default in symmetric/Hermitian eigenvalue solvers
      (previously needed to define -DFAST_HEMV option).
    * Added FGMRES for non-constant preconditioner operator.
    * Added backward communication interfaces for SpMV and
      preconditioner passing the vectors on the GPU.
    * Added function to generate cuSPARSE ILU level-scheduling information
      for a given matrix.
    * Added the batched QR routine.
    * Performance improvments of all batched routines.
    * Fixed "nan" output for batched factorizations.

 1.6.0 - November 16, 2014
    * Added MAGMA batched linear algebra routines:
        * Batched MAGMA BLAS including gemm, gemv, herk, and trsm
        * Batched LU, GETRI, and Cholesky factorizations
    * Added Bunch-Kaufman factorization and solver for symmetric
      indefinite matrices: [zcsd]{he|sy}trf
    * Added non-pivoted LDLt
    * Added a Random Butterfly Transformation (RBT) and a new solver based
      on RBT + LU without pivoting + iterative refinement
    * Comprehensive release of sparse routines:
        * All sparse routines equipped with a queue.
        * Enhanced debugging routines.
        * Interface to cuSPARSE functions.
        * Added interface to pass data structures located in main/device memory.
        * Added generic interface to call any solver/eigensolver.
        * Added testscript checking correctness of routines.
        * Added capability to iterate in block-wise fashion.
        * Checks for memory leaks.

 1.5.0 - final:  Aug   30, 2014
       - beta 3: July  18, 2014
       - beta 2: May   30, 2014
       - beta 1: April 25, 2014
    * Added pre-release of sparse routines.
    * Replaced character constants with symbolic constants (enums),
      e.g., 'N' with MagmaNoTrans.
    * Added SVD with Divide & Conquer, gesdd.
    * Added unmbr/ormbr, unmlq/ormlq, used in gesdd.
    * Improved performance of geev when computing eigenvectors by using
      multi-threaded trevc.
    * Added testing/run_tests.py script for more extensive testing.
    * Changed laset interface to match LAPACK.
    * Fixed memory access bug in transpose, and changed interface to match LAPACK.
    * Fixed memory access bugs in lanhe/lansy, zlag2c, clag2z, dlag2s, slag2d,
      zlat2c, dlat2s, trsm (trtri_diag).
    * Added clat2z, slat2d.
    * Added upper & lower cases in lacpy.
    * Fixed unmql/ormql for rectangular matrices.
    * Allow compiling without Fortran, but then testers have reduced functionality.
    * Added wrappers for CPU BLAS asum, nrm2, dotu, dotc, dot. This isolates
      the dependence on CBLAS to src/cblas*.cpp.
    * Added queue/stream interfaces for many MAGMABLAS routines, using _q suffix.
      These take magma_queue_t, which is a wrapper around CUDA stream.
    * Updated documentation to doxygen format.

 1.4.1 - final:  December 17, 2013
       - beta 2: December  9, 2013
       - beta 1: November 23, 2013
    * Improved performance of geev when computing eigenvectors by using blocked trevc.
    * Added right-looking multi-GPU Cholesky factorization.
    * Added new CMake installation for compiling on Windows.
    * Updated magmablas to call appropriate version based on CUDA architecture
      at runtime. GPU_TARGET now accepts multiple architectures together.

 1.4.0 - final:  Aug  14, 2013
       - beta 2: June 28, 2013
       - beta 1: June 19, 2013
    * Use magma_init() and magma_finalize() to initialize and cleanup MAGMA.
    * Merge libmagmablas into libmagma to eliminate circular dependencies.
      Link with just -lmagma now.
    * User can now #include <cublas_v2.h> before #include <magma.h>.
      See testing_z_cublas_v2.cpp for an example.
    * Can compile as shared library; see make.inc.mkl-shared and 'make shared'.
    * Fix required workspace size in gels_gpu, gels3_gpu, geqrs_gpu, geqrs3_gpu.
    * Fix required workspace size in [zcsd]geqrf.
    * Fix required workspace size in [he|sy]evd*, [he|sy]gvd*.
    * [zc|ds]geqrsv no longer segfaults when M > N.
    * Fix gesv and posv in some situations when GPU memory is close to full.
    * Fix synchronization in multi-GPU getrf_m and getrf2_mgpu.
    * Fix multi-GPU geqrf_mgpu for M < N.
    * Add MAGMA_ILP64 to compile with int being 64-bit. See make.inc.mkl-ilp64.
    * Add panel factorizations for LU, QR, and Cholesky entirely on the GPU,
      correspondingly in [zcsd]getf2_gpu, [zcsd]geqr2_gpu, and [zcsd]potf2_gpu.
    * Add QR with pivoting in GPU interface (functions [zcsd]geqp3_gpu);
      improve performance for both CPU and GPU interface QR with pivoting.
    * Add multi-GPU Hessenberg and non-symmetric eigenvalue routines:
      geev_m, gehrd_m, unghr_m, ungqr_m.
    * Add multi-GPU symmetric eigenvalue routines (one-stage)
      ([zhe|che|ssy|dsy]trd_mgpu,
       [zhe|che|ssy|dsy]evd_m, [zhe|che|ssy|dsy]evdx_m,
       [zhe|che|ssy|dsy]gvd_m, [zhe|che|ssy|dsy]gvdx_m ).
    * Add single and multi-GPU symmetric eigenvalue routines (two-stage)
      ([zhe|che|ssy|dsy]evdx_2stage,   [zhe|che|ssy|dsy]gvdx_2stage,
       [zhe|che|ssy|dsy]evdx_2stage_m, [zhe|che|ssy|dsy]gvdx_2stage_m ).
    * Add magma_strerror to get error message.
    * Revised most testers to use common framework and options.
    * Use CUBLAS gemm in src files, since it has been optimized for Kepler.
    * Determine block sizes at runtime based on current card's architecture.
    * In-place transpose now works for arbitrary n-by-n square matrix.
      This also reduces required memory in zgetrf_gpu.
    * Update Fortran wrappers with automated script.
    * Fix Makefile for Kepler (3.0 and 3.5).

 1.3.0 - November 12, 2012
    * Add MAGMA_VERSION constants and magma_version() in magma.h.
    * Fix printing complex matrices.
    * Fix documentation and query for heevd/syevd workspace sizes.
    * Fix singularity check in trtri and trtri_gpu.
    * Fixes for compiling on Windows (small, __attribute__, magma_free_cpu, etc.)
    * Implement all 4 cases for zunmqr (QC, Q'C, CQ, CQ') and fix workspace size.
    * Fix permuting rows for M > 32K.
    * Check residual ||Ax-b||; faster and uses less memory than ||PA-LU|| check.

 1.2.1 - June 29, 2012
    * Fix bug in [zcsd]getrf_gpu.cpp
    * Fix workspace requirement for SVD in [zcsd]gesvd.cpp
    * Fix a bug in freeing pinned memory (in interface_cuda/alloc.cpp)
    * Fix a bug in [zcsd]geqrf_mgpu.cpp
    * Fix zdotc to use cblas for portability
    * Fix uppercase entries in blas/lapack headers
    * Use magma_int_t in blas/lapack headers, and fix sources accordingly
    * Fix magma_is_devptr error handling
    * Add magma_malloc_cpu to allocate CPU memory aligned to 32-byte boundary
      for performance and reproducibility
    * Fix memory leaks in latrd* and zcgeqrsv_gpu
    * Remove dependency on CUDA device driver
    * Add QR with pivoting in CPU interface (functions [zcsd]geqp3)
    * Add hegst/sygst Fortran interface
    * Improve performance of gesv CPU interface by 30%
    * Improve performance of ungqr/orgqr CPU and GPU interfaces by 30%;
      more for small matrices

 1.2.0 - May 10, 2012
    * Fix bugs in [zcsd]hegst[_gpu].cpp
    * Fix a bug in [zcsd]latrd.cpp
    * Fix a bug in [zcsd]gelqf_gpu.cpp
    * Added application of a block reflector H or its transpose from the Right.
      Routines changed -- [zcsd]larfb_gpu.cpp, [zc]unmqr2_gpu.cpp, and
      [ds]ormqr2_gpu.cpp
    * Fix *larfb_gpu for reflector vectors stored row-wise.
    * Fix memory allocation bugs in [zc]unmqr2_gpu.cpp, [ds]ormqr2_gpu.cpp,
      [zc]unmqr.cpp, and [ds]ormqr.cpp (thanks to Azzam Haidar).
    * Fix bug in *lacpy that overwrote memory.
    * Fix residual formula in testing_*gesv* and testing_*posv*.
    * Fix sizeptr.cpp compile warning that caused make to fail.
    * Fix warning in *getrf.cpp when nb0 is zero.
    * Add reduction to band-diagonal for symmetric/Hermitian definite matrices
      in [zc]hebbd.cpp and [ds]sybbd.cpp
    * Updated eigensolvers for standard and generalized eigenproblems for
      symmetric/Hermitian definite matrices
    * Add wrappers around CUDA and CUBLAS functions,
      for portability and error checking.
    * Add tracing functions.
    * Add two-stage reduction to tridiabonal form
    * Add matrix print functions.
    * Make info and return codes consistent.
    * Change GPU_TARGET in make.inc to descriptive name (e.g., Fermi).
    * Move magma_stream to -lmagmablas to eliminate dependency on -lmagma.

 1.1.0 - 11-11-11
    * Fix a bug in [zcsd]geqrf_gpu.cpp and [zcsd]geqrf3_gpu.cpp for n>m
    * Fix a bug in [zcsd]laset - to call the kernel only when m!=0 && n!=0
    * Fix a bug in [zcsd]gehrd for ilo > 1 or ihi < n.
    * Added missing Fortran interfaces
    * Add general matrix inverse, [zcds]getri GPU interface.
    * Add [zcds]potri in CPU and GPU interfaces
       [Hatem Ltaief et al.]
    * Add [zcds]trtri in CPU and GPU interfaces
       [Hatem Ltaief et al.]
    * Add [zcds]lauum in CPU and GPU interfaces
       [Hatem Ltaief et al.]
    * Add zgemm for Fermi obtained using autotuning
    * Add non-GPU-resident versions of [zcds]geqrf, [zcds]potrf, and [zcds]getrf
    * Add multi-GPU LU, QR, and Cholesky factorizations
    * Add tile algorithms for multicore and multi-GPUs using the StarPU
      runtime system (in directory 'multi-gpu-dynamic')
    * Add [zcds]gesv and [zcds]posv in CPU interface. GPU interface was already in 1.0
    * Add LAPACK linear equation testing code (in 'testing/lin')
    * Add experimental directory ('exp') with algorithms for:
      (1) Multi-core QR, LU, Cholskey
      (2) Single GPU, all available CPU cores QR
    * Add eigenvalue solver driver routines for the standard and generalized
      symmetric/Hermitian eigenvalue problems [Raffaele Solca et al.].

 1.0.0 - August 25th, 2011
    * Fix make.inc.mkl (Thanks to ar1309)
    * Add gpu interfaces to [zcsd]hetrd, [zcsd]heevd
    * Add all cases for [zcds]unmtr_gpu
       [Raffaele Solca et al.]
    * Add generalized Hermitian-definite eigenproblem solver ([zcds]hegvd)
       [Raffaele Solca et al.]

 1.0.0RC5 - April 6th, 2011
    * Add fortran interface for lapack functions
    * Add new QR version on GPU ([zcsd]geqrf3_gpu) and corresponding
      LS solver ([zcds]geqrs3_gpu)
    * Add [cz]unmtr, [sd]ormtr functions
    * Add two functions in fortran to compute the offset on device pointers
          magmaf_[sdcz]off1d( NewPtr, OldPtr, inc, i)
          magmaf_[sdcz]off2d( NewPtr, OldPtr, lda, i, j)
        indices are given in Fortran (1 to N)
    * WARNING: add FOPTS variable to the make.inc to use preprocessing
      in compilation of Fortran files
    * WARNING: fix bug with fortran compilers which don;t change the name
      now fortran prefix is magmaf instead of magma
    * Small documentation fixes
    * Fix timing under windows, thanks to Evan Lazar
    * Fix problem when __func__ is not present, thanks to Evan Lazar
    * Fix bug with m==n==0 in LU, thanks to Evan Lazar
    * Fix bug on [cz]unmqr, [sd]ormqr functions
    * Fix bug in [zcsd]gebrd; fixes bug in SVD for n>m
    * Fix bug in [zcsd]geqrs_gpu for multiple RHS
    * Added functionality - zcgesv_gpu and dsgesv_gpu can now solve also
      A' X = B using mixed-precision iterative refinement
    * Fix error code in testings.h to compile with cuda 4.0

 1.0.0RC4 - March 8th, 2011

    * Add control directory to group all non computational functions
    * Integration of the eigenvalues solvers
    * Clean some f2c code in eigenvalues solvers
    * Arithmetic consistency: cuDoubleComplex and cuFloatComplex are
      the  only types used for complex now.
    * Consistency of the interface of some functions.
    * Clean most of the return values in lapack functions
    * Fix multiple definition of min, max,
    * Fix headers problem under windows, thanks to Willem Burger