Draft: Add typed scalar comparison functions.

Revives !576 (closed).

Original author: @derekjchow

Original description:

Follow up for MR#568.

I've copied the CwiseTernaryOp flags into select and added "typed scalar compare" operators. This should enable vectorization of logical comparisons and select with just a little syntactic sugar. I've also added a benchmark for the compare/select flow.
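
To make the idea concrete, here is a minimal standalone sketch of a "typed" comparison feeding a select. It does not use Eigen and is not the code in this MR; the names `typed_less`, `typed_select`, and `select_min` are hypothetical and exist only for illustration. The point it tries to show is that the comparison returns a value of the same scalar type as its inputs (1 or 0) instead of `bool`, so the mask and the data have the same width, which is what a packet-based implementation would presumably need.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an Eigen API): a "typed" less-than that returns
// Scalar(1) or Scalar(0) instead of bool, so the mask has the operand's type.
template <typename Scalar>
inline Scalar typed_less(Scalar a, Scalar b) {
  return a < b ? Scalar(1) : Scalar(0);
}

// Hypothetical helper (not an Eigen API): picks then_value where the typed
// mask is nonzero, else_value otherwise.
template <typename Scalar>
inline Scalar typed_select(Scalar mask, Scalar then_value, Scalar else_value) {
  return mask != Scalar(0) ? then_value : else_value;
}

// Compare/select flow: elementwise minimum built from the two helpers above.
// Every intermediate value is a Scalar, so the loop works on same-width lanes.
template <typename Scalar, std::size_t N>
std::array<Scalar, N> select_min(const std::array<Scalar, N>& a,
                                 const std::array<Scalar, N>& b) {
  std::array<Scalar, N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    const Scalar mask = typed_less(a[i], b[i]);
    out[i] = typed_select(mask, a[i], b[i]);
  }
  return out;
}
```

Whether a compiler or a hand-written packet path actually vectorizes such a loop depends on the scalar type and target, which is consistent with the mixed per-type results below.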

Original x86:

int8_t: Ran 10000 in 2.31417 seconds.
int16_t: Ran 10000 in 1.54581 seconds.
int32_t: Ran 10000 in 1.667 seconds.
int64_t: Ran 10000 in 2.458 seconds.
uint8_t: Ran 10000 in 2.30792 seconds.
uint16_t: Ran 10000 in 1.61437 seconds.
uint32_t: Ran 10000 in 2.07421 seconds.
uint64_t: Ran 10000 in 3.95125 seconds.
float: Ran 10000 in 2.33216 seconds.
double: Ran 10000 in 4.04993 seconds.

After vectorization (x86):

int8_t: Ran 10000 in 1.60796 seconds.
int16_t: Ran 10000 in 1.52178 seconds.
int32_t: Ran 10000 in 2.22586 seconds.
int64_t: Ran 10000 in 2.43813 seconds.
uint8_t: Ran 10000 in 1.60132 seconds.
uint16_t: Ran 10000 in 1.57519 seconds.
uint32_t: Ran 10000 in 2.22631 seconds.
uint64_t: Ran 10000 in 2.94516 seconds.
float: Ran 10000 in 1.65587 seconds.
double: Ran 10000 in 3.23203 seconds.

Original ARM64:

int8_t: Ran 10000 in 16.2288 seconds.
int16_t: Ran 10000 in 15.1875 seconds.
int32_t: Ran 10000 in 17.4648 seconds.
int64_t: Ran 10000 in 22.7227 seconds.
uint8_t: Ran 10000 in 24.6178 seconds.
uint16_t: Ran 10000 in 26.3035 seconds.
uint32_t: Ran 10000 in 30.5126 seconds.
uint64_t: Ran 10000 in 40.6052 seconds.
float: Ran 10000 in 37.195 seconds.
double: Ran 10000 in 45.929 seconds.

After vectorization (ARM64):

int8_t: Ran 10000 in 3.97981 seconds.
int16_t: Ran 10000 in 15.1806 seconds.
int32_t: Ran 10000 in 17.4845 seconds.
int64_t: Ran 10000 in 30.6549 seconds.
uint8_t: Ran 10000 in 8.73699 seconds.
uint16_t: Ran 10000 in 26.6233 seconds.
uint32_t: Ran 10000 in 30.9642 seconds.
uint64_t: Ran 10000 in 65.5785 seconds.
float: Ran 10000 in 36.8863 seconds.
double: Ran 10000 in 65.7165 seconds.

Generally speaking, I think we only enabled vectorization for the (u)int8 types, but I can turn on more if desired.

Edited by Antonio Sánchez