Sørensen, 2013 - Google Patents
Auto‐tuning of level 1 and level 2 BLAS for GPUs (Sørensen, 2013)
- Document ID
- 3033340626605093330
- Author
- Sørensen H
- Publication year
- 2013
- Publication venue
- Concurrency and Computation: Practice and Experience
Snippet
The use of high‐performance libraries for dense linear algebra operations is of great importance in many numerical scientific applications. The most common operations form the backbone of the Basic Linear Algebra Subroutines (BLAS) library. In this paper, we consider …
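The snippet refers to level 1 (vector-vector) and level 2 (matrix-vector) BLAS routines, the operation classes the paper auto-tunes for GPUs. As a point of reference only, the minimal sketch below shows one routine of each class, SAXPY (level 1) and SGEMV (level 2), invoked through NVIDIA's cuBLAS host API. It is not the paper's auto-tuned implementation; cuBLAS, the problem size, and the input values are all assumptions made purely for illustration.

```cpp
// Minimal illustration of a level 1 BLAS call (SAXPY: y = alpha*x + y) and a
// level 2 BLAS call (SGEMV: y = alpha*A*x + beta*y) on a GPU via cuBLAS.
// NOT the paper's auto-tuned kernels; sizes and values are arbitrary.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;  // vector length and matrix dimension (assumed)
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f), hA(n * n, 0.5f);

    // Allocate device buffers and copy the host data over.
    float *dx, *dy, *dA;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 3.0f, beta = 1.0f;

    // Level 1 BLAS: y <- alpha*x + y
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);

    // Level 2 BLAS: y <- alpha*A*x + beta*y, A is n x n, column-major
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);

    // Copy the result back and print one element as a sanity check.
    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);

    cublasDestroy(handle);
    cudaFree(dx); cudaFree(dy); cudaFree(dA);
    return 0;
}
```

Built with `nvcc` and linked against `-lcublas`, this merely exercises the vendor library; the paper's contribution is auto-tuned GPU kernels for these classes of routine.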
Classifications
- G06F8/4442—Reducing the number of cache misses; Data prefetching
- G06F8/452—Loops
- G06F8/456—Parallelism detection
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/5061—Partitioning or combining of resources
- G06F8/434—Pointers; Aliasing
- G06F11/3409—Recording or statistical evaluation of computer activity for performance assessment
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
- G06F12/02—Addressing or allocation; Relocation
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a programme unit and a register, e.g. for a simultaneous processing of several programmes
- G06F17/50—Computer-aided design
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F19/10—Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06N99/00—Subject matter not provided for in other groups of this subclass
Similar Documents
| Publication | Title |
|---|---|
| Vázquez et al. | A new approach for sparse matrix vector product on NVIDIA GPUs |
| Jang et al. | Exploiting memory access patterns to improve memory performance in data-parallel architectures |
| US20150324707A1 (en) | System and method for selecting useful smart kernels for general-purpose GPU computing |
| US8583898B2 (en) | System and method for managing processor-in-memory (PIM) operations |
| US20150324441A1 (en) | System and method for high performance k-means clustering on GPU with smart kernels |
| Araujo et al. | NAS Parallel Benchmarks with CUDA and beyond |
| Elafrou et al. | SparseX: A library for high-performance sparse matrix-vector multiplication on multicore platforms |
| Ploskas et al. | Efficient GPU-based implementations of simplex type algorithms |
| Andreetta et al. | FinPar: A parallel financial benchmark |
| Izquierdo-Carrasco et al. | A generic vectorization scheme and a GPU kernel for the phylogenetic likelihood library |
| Phillips et al. | A CUDA implementation of the high performance conjugate gradient benchmark |
| Falch et al. | Register caching for stencil computations on GPUs |
| KR20240090423A (en) | System and method for auto-parallelization of processing codes for multi-processor systems with optimized latency |
| Oyarzun et al. | Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers |
| Morozov et al. | Massively parallel computation using graphics processors with application to optimal experimentation in dynamic control |
| Fortuna et al. | A limit study of JavaScript parallelism |
| Ding et al. | Instruction roofline: An insightful visual performance model for GPUs |
| He et al. | A novel CSR‐based sparse matrix‐vector multiplication on GPUs |
| Mehta et al. | Evaluating performance portability of OpenMP for SNAP on NVIDIA, Intel, and AMD GPUs using the roofline methodology |
| Fang et al. | Aristotle: A performance impact indicator for the OpenCL kernels using local memory |
| Kuzma et al. | Fast matrix multiplication via compiler‐only layered data reorganization and intrinsic lowering |
| Carrijo Nasciutti et al. | Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs |
| Szustak et al. | Performance portable parallel programming of heterogeneous stencils across shared-memory platforms with modern Intel processors |
| Sørensen | Auto-tuning dense vector and matrix-vector operations for Fermi GPUs |
| Sørensen | Auto‐tuning of level 1 and level 2 BLAS for GPUs |