ppc64le matrix multiplication instructions enablement for float32.
This is a WIP patch that enables support for IBM's new matrix multiplication instructions. The attached document describes these possible future instructions; support for the corresponding intrinsics will be available in GCC 10. The patch adds a template specialization of gebp_kernel, for float32 specifically, as an example of how to implement this new feature. We also add a packing specialization: float32 itself does not need it, but other data types require a different packing scheme. For float32/64 the instructions look like generic outer products, but for lower precisions they are more akin to small matrix multiplications, hence the need for new packing procedures.
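To make the packing point concrete, here is a minimal scalar sketch of the two accumulation styles. It uses no intrinsics and is not the code in this patch. The tile size `kTile`, the function names, and both packed layouts are assumptions chosen for illustration, and `float` stands in for the lower-precision types so the sketch stays portable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile size, chosen only for illustration.
constexpr std::size_t kTile = 4;
using Tile = std::array<std::array<float, kTile>, kTile>;

// Outer-product style: each depth step k adds the rank-1 update a_k * b_k^T.
// Assumed layout: lhs[k * kTile + i] = A(i, k), rhs[k * kTile + j] = B(k, j).
inline Tile tile_gemm_rank1(const std::vector<float>& lhs,
                            const std::vector<float>& rhs,
                            std::size_t depth) {
  assert(lhs.size() == depth * kTile && rhs.size() == depth * kTile);
  Tile acc{};  // zero-initialized accumulator tile
  for (std::size_t k = 0; k < depth; ++k) {
    const float* a = lhs.data() + k * kTile;
    const float* b = rhs.data() + k * kTile;
    for (std::size_t i = 0; i < kTile; ++i)
      for (std::size_t j = 0; j < kTile; ++j)
        acc[i][j] += a[i] * b[j];
  }
  return acc;
}

// Repack a rank-1 panel so that two consecutive depth values sit next to
// each other for every row: out[kk * kTile * 2 + i * 2 + d] = p(i, 2*kk + d).
inline std::vector<float> repack_rank2(const std::vector<float>& p,
                                       std::size_t depth) {
  assert(depth % 2 == 0 && p.size() == depth * kTile);
  std::vector<float> out(p.size());
  for (std::size_t kk = 0; kk < depth / 2; ++kk)
    for (std::size_t i = 0; i < kTile; ++i)
      for (std::size_t d = 0; d < 2; ++d)
        out[kk * kTile * 2 + i * 2 + d] = p[(2 * kk + d) * kTile + i];
  return out;
}

// Small-matrix-multiplication style: each step consumes a kTile x 2 block
// of A and a 2 x kTile block of B, i.e. a rank-2 update per step.
inline Tile tile_gemm_rank2(const std::vector<float>& lhs2,
                            const std::vector<float>& rhs2,
                            std::size_t depth) {
  assert(depth % 2 == 0);
  assert(lhs2.size() == depth * kTile && rhs2.size() == depth * kTile);
  Tile acc{};
  for (std::size_t kk = 0; kk < depth / 2; ++kk) {
    const float* a = lhs2.data() + kk * kTile * 2;
    const float* b = rhs2.data() + kk * kTile * 2;
    for (std::size_t i = 0; i < kTile; ++i)
      for (std::size_t j = 0; j < kTile; ++j)
        acc[i][j] += a[i * 2 + 0] * b[j * 2 + 0] + a[i * 2 + 1] * b[j * 2 + 1];
  }
  return acc;
}
```

Both functions compute the same tile, but the second one only works if the operands were packed with `repack_rank2` first. That is the reason a kernel built on rank-2 (or wider) steps needs its own packing routines.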
Using template specialization keeps current Eigen performance on non-IBM architectures untouched and gives other vendors a reasonable path to pursue if they want to improve matrix multiplication performance for specific data types as well. We are aware that VNNI might require similar changes, so we see this patch as a step toward improving Eigen's matrix multiplication performance for everyone.
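The specialization approach can be sketched roughly as follows. `toy_kernel` is a hypothetical stand-in and does not reproduce the real gebp_kernel template parameters. In an actual patch the specialization would presumably sit behind an architecture check; the sketch always enables it so that it compiles everywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a GEBP-style kernel (not Eigen's real signature).
// Computes C += A * B with row-major A (m x depth), B (depth x n), C (m x n).
template <typename Scalar>
struct toy_kernel {
  static const char* path() { return "generic"; }
  static void run(const Scalar* a, const Scalar* b, Scalar* c,
                  std::size_t m, std::size_t n, std::size_t depth) {
    // Generic path: one dot product per output element.
    for (std::size_t i = 0; i < m; ++i)
      for (std::size_t j = 0; j < n; ++j) {
        Scalar sum = Scalar(0);
        for (std::size_t k = 0; k < depth; ++k)
          sum += a[i * depth + k] * b[k * n + j];
        c[i * n + j] += sum;
      }
  }
};

// Full specialization for float only. Other scalar types keep using the
// primary template above, so their code path is unchanged.
template <>
struct toy_kernel<float> {
  static const char* path() { return "specialized"; }
  static void run(const float* a, const float* b, float* c,
                  std::size_t m, std::size_t n, std::size_t depth) {
    // Outer-product ordering: one rank-1 update of C per depth step.
    for (std::size_t k = 0; k < depth; ++k)
      for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
          c[i * n + j] += a[i * depth + k] * b[k * n + j];
  }
};
```

Callers write `toy_kernel<T>::run(...)` in both cases. Only `T = float` picks up the new path, which is why an opt-in of this kind can be added per vendor and per data type without touching the existing kernels.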