[go: up one dir, main page]

Sørensen, 2013 - Google Patents

Auto‐tuning of level 1 and level 2 BLAS for GPUs

Sørensen, 2013

View PDF
Document ID
3033340626605093330
Author
Sørensen H
Publication year
Publication venue
Concurrency and Computation: Practice and Experience

External Links

Snippet

The use of high‐performance libraries for dense linear algebra operations is of great importance in many numerical scientific applications. The most common operations form the backbone of the Basic Linear Algebra Subroutines (BLAS) library. In this paper, we consider …
Continue reading at gpulab.compute.dtu.dk (PDF) (other versions)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformations of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation
    • G06F8/4441Reducing the execution time required by the program code
    • G06F8/4442Reducing the number of cache misses; Data prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformations of program code
    • G06F8/41Compilation
    • G06F8/45Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/451Code distribution
    • G06F8/452Loops
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformations of program code
    • G06F8/41Compilation
    • G06F8/45Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/456Parallelism detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for programme control, e.g. control unit
    • G06F9/06Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
    • G06F9/30Arrangements for executing machine-instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for programme control, e.g. control unit
    • G06F9/06Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformations of program code
    • G06F8/41Compilation
    • G06F8/43Checking; Contextual analysis
    • G06F8/433Dependency analysis; Data or control flow analysis
    • G06F8/434Pointers; Aliasing
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/11Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a programme unit and a register, e.g. for a simultaneous processing of several programmes
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/50Computer-aided design
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F19/00Digital computing or data processing equipment or methods, specially adapted for specific applications
    • G06F19/10Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06NCOMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00Subject matter not provided for in other groups of this subclass

Similar Documents

Publication Publication Date Title
Vázquez et al. A new approach for sparse matrix vector product on NVIDIA GPUs
Jang et al. Exploiting memory access patterns to improve memory performance in data-parallel architectures
US20150324707A1 (en) System and method for selecting useful smart kernels for general-purpose gpu computing
US8583898B2 (en) System and method for managing processor-in-memory (PIM) operations
US20150324441A1 (en) System and method for high performance k-means clustering on gpu with smart kernels
Araujo et al. NAS Parallel Benchmarks with CUDA and beyond
Elafrou et al. Sparsex: A library for high-performance sparse matrix-vector multiplication on multicore platforms
Ploskas et al. Efficient GPU-based implementations of simplex type algorithms
Andreetta et al. Finpar: A parallel financial benchmark
Izquierdo-Carrasco et al. A generic vectorization scheme and a GPU kernel for the phylogenetic likelihood library
Phillips et al. A cuda implementation of the high performance conjugate gradient benchmark
Falch et al. Register caching for stencil computations on GPUs
KR20240090423A (en) System and method for auto-parallelization of processing codes for multi-processor systems with optimized latency
Oyarzun et al. Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers
Morozov et al. Massively parallel computation using graphics processors with application to optimal experimentation in dynamic control
Fortuna et al. A limit study of JavaScript parallelism
Ding et al. Instruction roofline: An insightful visual performance model for gpus
He et al. A novel CSR‐based sparse matrix‐vector multiplication on GPUs
Mehta et al. Evaluating performance portability of openmp for snap on nvidia, intel, and amd gpus using the roofline methodology
Fang et al. Aristotle: A performance impact indicator for the OpenCL kernels using local memory
Kuzma et al. Fast matrix multiplication via compiler‐only layered data reorganization and intrinsic lowering
Carrijo Nasciutti et al. Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs
Szustak et al. Performance portable parallel programming of heterogeneous stencils across shared-memory platforms with modern Intel processors
Sørensen Auto-tuning dense vector and matrix-vector operations for Fermi GPUs
Sørensen Auto‐tuning of level 1 and level 2 BLAS for GPUs