Improve Gebp kernel for Arm Neon float.

Submitted by Renjie Liu

Assigned to Nobody

Link to original bugzilla bug (#1624)
Operating system: Android

Description

Created attachment 892
proposed optimization patch

Currently, the loadRhs operation loads a single value and duplicates it across all lanes, and madd then performs
a vector-by-vector multiply-add. This is unnecessary on Neon, since vfmaq_lane can multiply-add a vector by
a single value (one lane of another vector).
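
To make the difference concrete, here is a minimal sketch of the two strategies on a 4x4 accumulator update. It is not the attached patch and does not use Eigen's actual `loadRhs`/`madd` signatures. The names `update_broadcast`, `update_by_lane`, `update_scalar`, `check_variants_agree`, and `SKETCH_HAS_NEON` are hypothetical, and the AArch64 path assumes the `vfmaq_lane_f32` form of the intrinsic. A scalar fallback is included so the sketch also builds on non-Arm hosts.

```cpp
#include <cassert>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1
#else
#define SKETCH_HAS_NEON 0
#endif

// Scalar reference for one micro-kernel step: acc[j][i] += lhs[i] * rhs[j].
inline void update_scalar(const float* lhs, const float* rhs, float acc[4][4]) {
  for (int j = 0; j < 4; ++j)
    for (int i = 0; i < 4; ++i) acc[j][i] += lhs[i] * rhs[j];
}

// Variant A (the behaviour described above): broadcast each rhs scalar to
// all four lanes, then do a vector-by-vector fused multiply-add.
inline void update_broadcast(const float* lhs, const float* rhs, float acc[4][4]) {
#if SKETCH_HAS_NEON
  const float32x4_t a = vld1q_f32(lhs);
  for (int j = 0; j < 4; ++j) {
    const float32x4_t b = vdupq_n_f32(rhs[j]);  // one value duplicated into 4 lanes
    vst1q_f32(acc[j], vfmaq_f32(vld1q_f32(acc[j]), a, b));
  }
#else
  update_scalar(lhs, rhs, acc);  // fallback on hosts without AArch64 Neon
#endif
}

// Variant B (the proposed idea): load four rhs values with one vector load,
// then multiply-add by a selected lane, with no per-value broadcast.
inline void update_by_lane(const float* lhs, const float* rhs, float acc[4][4]) {
#if SKETCH_HAS_NEON
  const float32x4_t a = vld1q_f32(lhs);
  const float32x4_t b = vld1q_f32(rhs);
  const float32x2_t lo = vget_low_f32(b);
  const float32x2_t hi = vget_high_f32(b);
  // The lane index must be a compile-time constant, so the loop is unrolled.
  vst1q_f32(acc[0], vfmaq_lane_f32(vld1q_f32(acc[0]), a, lo, 0));
  vst1q_f32(acc[1], vfmaq_lane_f32(vld1q_f32(acc[1]), a, lo, 1));
  vst1q_f32(acc[2], vfmaq_lane_f32(vld1q_f32(acc[2]), a, hi, 0));
  vst1q_f32(acc[3], vfmaq_lane_f32(vld1q_f32(acc[3]), a, hi, 1));
#else
  update_scalar(lhs, rhs, acc);  // fallback on hosts without AArch64 Neon
#endif
}

// Usage sketch: both variants should fill the accumulators identically.
inline void check_variants_agree() {
  const float lhs[4] = {1.f, 2.f, 3.f, 4.f};
  const float rhs[4] = {10.f, 20.f, 30.f, 40.f};
  float a[4][4] = {};
  float b[4][4] = {};
  update_broadcast(lhs, rhs, a);
  update_by_lane(lhs, rhs, b);
  for (int j = 0; j < 4; ++j)
    for (int i = 0; i < 4; ++i) assert(a[j][i] == b[j][i]);
}
```

In a real kernel the accumulators would stay in registers across the inner loop rather than being loaded and stored on every step; the sketch keeps them in memory only so the two variants are easy to compare.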

On an Arm Neon device (SDM 845, Pixel 3), this gives a performance boost of about 20%.

~~Attachment 892~~, "proposed optimization patch":
patch.diff

Edited Dec 05, 2019 by Eigen Bugzilla