Sørensen, 2013 - Google Patents
Auto‐tuning of level 1 and level 2 BLAS for GPUs (Sørensen, 2013)
- Document ID
- 3033340626605093330
- Author
- Sørensen H
- Publication year
- 2013
- Publication venue
- Concurrency and Computation: Practice and Experience
Snippet
The use of high‐performance libraries for dense linear algebra operations is of great importance in many numerical scientific applications. The most common operations form the backbone of the Basic Linear Algebra Subroutines (BLAS) library. In this paper, we consider …
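The snippet refers to level 1 (vector-vector) and level 2 (matrix-vector) BLAS routines, the operation classes the paper auto-tunes for GPUs. As a point of reference only, the minimal sketch below shows one routine of each class, SAXPY (level 1) and SGEMV (level 2), invoked through NVIDIA's cuBLAS host API. It is not the paper's auto-tuned implementation; cuBLAS, the problem size, and the input values are all assumptions made purely for illustration.

```cpp
// Minimal illustration of a level 1 BLAS call (SAXPY: y = alpha*x + y) and a
// level 2 BLAS call (SGEMV: y = alpha*A*x + beta*y) on a GPU via cuBLAS.
// NOT the paper's auto-tuned kernels; sizes and values are arbitrary.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;  // vector length and matrix dimension (assumed)
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f), hA(n * n, 0.5f);

    // Allocate device buffers and copy the host data over.
    float *dx, *dy, *dA;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 3.0f, beta = 1.0f;

    // Level 1 BLAS: y <- alpha*x + y
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);

    // Level 2 BLAS: y <- alpha*A*x + beta*y, A is n x n, column-major
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);

    // Copy the result back and print one element as a sanity check.
    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);

    cublasDestroy(handle);
    cudaFree(dx); cudaFree(dy); cudaFree(dA);
    return 0;
}
```

Built with `nvcc` and linked against `-lcublas`, this merely exercises the vendor library; the paper's contribution is auto-tuned GPU kernels for these classes of routine.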
Classifications
- G06F8/4442—Reducing the number of cache misses; Data prefetching
- G06F8/452—Loops
- G06F8/456—Parallelism detection
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/5061—Partitioning or combining of resources
- G06F8/434—Pointers; Aliasing
- G06F11/3409—Recording or statistical evaluation of computer activity for performance assessment
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
- G06F12/02—Addressing or allocation; Relocation
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a programme unit and a register, e.g. for a simultaneous processing of several programmes
- G06F17/50—Computer-aided design
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F19/10—Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06N99/00—Subject matter not provided for in other groups of this subclass
Similar Documents
| Publication | Title |
|---|---|
| Vázquez et al. | A new approach for sparse matrix vector product on NVIDIA GPUs |
| Jang et al. | Exploiting memory access patterns to improve memory performance in data-parallel architectures |
| US20150324707A1 (en) | System and method for selecting useful smart kernels for general-purpose GPU computing |
| US8583898B2 (en) | System and method for managing processor-in-memory (PIM) operations |
| US20150324441A1 (en) | System and method for high performance k-means clustering on GPU with smart kernels |
| Araujo et al. | NAS Parallel Benchmarks with CUDA and beyond |
| Elafrou et al. | SparseX: A library for high-performance sparse matrix-vector multiplication on multicore platforms |
| Ploskas et al. | Efficient GPU-based implementations of simplex type algorithms |
| Andreetta et al. | FinPar: A parallel financial benchmark |
| Izquierdo-Carrasco et al. | A generic vectorization scheme and a GPU kernel for the phylogenetic likelihood library |
| Phillips et al. | A CUDA implementation of the high performance conjugate gradient benchmark |
| Falch et al. | Register caching for stencil computations on GPUs |
| KR20240090423A (en) | System and method for auto-parallelization of processing codes for multi-processor systems with optimized latency |
| Oyarzun et al. | Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers |
| Morozov et al. | Massively parallel computation using graphics processors with application to optimal experimentation in dynamic control |
| Fortuna et al. | A limit study of JavaScript parallelism |
| Ding et al. | Instruction roofline: An insightful visual performance model for GPUs |
| He et al. | A novel CSR‐based sparse matrix‐vector multiplication on GPUs |
| Mehta et al. | Evaluating performance portability of OpenMP for SNAP on NVIDIA, Intel, and AMD GPUs using the roofline methodology |
| Fang et al. | Aristotle: A performance impact indicator for the OpenCL kernels using local memory |
| Kuzma et al. | Fast matrix multiplication via compiler‐only layered data reorganization and intrinsic lowering |
| Carrijo Nasciutti et al. | Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs |
| Szustak et al. | Performance portable parallel programming of heterogeneous stencils across shared-memory platforms with modern Intel processors |
| Sørensen | Auto-tuning dense vector and matrix-vector operations for Fermi GPUs |
| Sørensen | Auto‐tuning of level 1 and level 2 BLAS for GPUs |