Author: | Wojciech Muła |
---|---|
Added on: | 2025-01-19 |
I recently got back to working on simdutf and noticed that the library has gained support for LoongArch64. This is a custom design and custom ISA by Loongson from China. They provide documentation for the scalar ISA, but not for the vector extensions. Despite that, GCC, binutils, QEMU and other tools already support the ISA. Luckily for us, Jiajie Chen did impressive work reverse engineering the vector stuff and published the results online as The Unofficial LoongArch Intrinsics Guide.
LoongArch comes with two vector extensions: LSX, which operates on 128-bit registers, and LASX, which operates on 256-bit registers.
These extensions are similar; in particular, most instructions present in LSX also exist in LASX. According to the Wikipedia entry, the ISA is a mixture of RISC-V and MIPS.
The ISA supports both integer and floating point instructions. There's support for 8-bit, 16-bit, 32-bit, 64-bit and also 128-bit integers. Floating point instructions cover single precision, double precision and half precision numbers.
Comparisons yield byte-masks, similarly to SSE.
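As a scalar reference for what a byte-mask is, the sketch below models an equality comparison on 16 byte lanes: each output lane is 0xFF when the comparison holds and 0x00 otherwise. The function name `cmpeq_u8x16` and the fixed 16-lane width are illustrative assumptions, not LSX intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16 }; /* assumed: one 128-bit register viewed as 16 bytes */

/* Scalar model of a lane-wise equality comparison (hypothetical helper).
   Each result lane is all ones (0xFF) on a match and all zeros otherwise. */
void cmpeq_u8x16(uint8_t dst[LANES], const uint8_t a[LANES], const uint8_t b[LANES]) {
    for (size_t i = 0; i < LANES; i++) {
        dst[i] = (a[i] == b[i]) ? 0xFF : 0x00;
    }
}
```

A mask in this form can be combined directly with bitwise operations, for example to keep only the matching lanes of another vector.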
Integer instructions are defined for most integer types, which makes the ISA regular.
My impression is that the ISA is well designed, but I have not vectorized any code for that architecture. Below is the list of features I found interesting while browsing the intrinsics guide.
It's possible to control the program flow based on two predicates:
A similar solution exists in SSE (PTEST), but it tests the whole vector rather than individual vector elements.
There are vector instructions that allow altering individual bits of a word:
That's just cool (easily doable in SIMD ISAs having variable shifts).
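A minimal scalar sketch of how such per-element bit operations can be built from a variable shift. It assumes three operations (set, clear, flip), 32-bit elements, and a bit index reduced modulo the element width; all function names are hypothetical.

```c
#include <stdint.h>

/* Build a one-bit mask. The index is reduced modulo 32 (an assumption),
   so the shift amount is always valid. */
static uint32_t bit_mask(uint32_t index) {
    return UINT32_C(1) << (index & 31);
}

/* Set, clear, or flip a single bit of a 32-bit word. */
uint32_t bit_set(uint32_t word, uint32_t index)   { return word |  bit_mask(index); }
uint32_t bit_clear(uint32_t word, uint32_t index) { return word & ~bit_mask(index); }
uint32_t bit_flip(uint32_t word, uint32_t index)  { return word ^  bit_mask(index); }
```

In a vector setting the same expression would be applied to every lane, with the index coming from the matching lane of a second vector.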
The following counting ops are defined:
On x86, counting leading zeros and population count are available only in AVX-512.
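For reference, here is a portable scalar sketch of two counting operations, leading-zero count and population count, on a single 32-bit element. The function names are hypothetical, and the result of 32 for a zero input is an assumption of this model rather than a documented property of the hardware.

```c
#include <stdint.h>

/* Number of zero bits above the most significant set bit; 32 when x == 0. */
unsigned count_leading_zeros32(uint32_t x) {
    unsigned n = 0;
    for (uint32_t probe = UINT32_C(1) << 31; probe != 0 && (x & probe) == 0; probe >>= 1) {
        n++;
    }
    return n;
}

/* Number of set bits in x. */
unsigned population_count32(uint32_t x) {
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}
```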
There are two instructions related to absolute values.
calculate absolute value of difference:
dst[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
the sum of absolute values:
dst[i] = abs(a[i]) + abs(b[i]);
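The two formulas above can be turned into a runnable scalar reference. This sketch assumes 16-bit signed inputs and stores results in 32 bits so that nothing overflows; how the real instructions treat overflow is not modeled, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] = |a[i] - b[i]|, computed in 32 bits so it cannot overflow. */
void abs_diff_i16(int32_t *dst, const int16_t *a, const int16_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t x = a[i], y = b[i];
        dst[i] = (x > y) ? x - y : y - x;
    }
}

/* dst[i] = |a[i]| + |b[i]|, again computed in 32 bits. */
void sum_abs_i16(int32_t *dst, const int16_t *a, const int16_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t x = a[i], y = b[i];
        dst[i] = (x < 0 ? -x : x) + (y < 0 ? -y : y);
    }
}
```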
These instructions add the even (or odd) elements of a given width and produce a vector of wider elements. For instance, we may add two vectors of 32-bit elements and get a vector of 64-bit sums.
There are integer division and modulo instructions for 8-, 16-, 32- and 64-bit integers.
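A scalar sketch of the 32-bit to 64-bit case may make the element selection clearer. It assumes signed elements and that output element i is built from input element 2i (even variant) or 2i + 1 (odd variant); the function name and the `first` parameter are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Widening add: take every second 32-bit element of a and b, starting at
   `first` (0 = even elements, 1 = odd elements), and store 64-bit sums.
   `pairs` is the number of outputs; each input holds 2 * pairs elements. */
void add_widen_i32(int64_t *dst, const int32_t *a, const int32_t *b,
                   size_t pairs, size_t first) {
    for (size_t i = 0; i < pairs; i++) {
        dst[i] = (int64_t)a[2 * i + first] + (int64_t)b[2 * i + first];
    }
}
```

Because the inputs are widened before the addition, the sum of two large 32-bit values is kept exactly instead of wrapping.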
There is no ternary logic, but the set of binary operations is rich:
There is also a bit-select operation: (a and c) or (b and not c).
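The bit-select formula translates directly to C. This sketch follows the operand roles exactly as written in the formula above; the operand order of the actual instruction may differ, and the function name is hypothetical.

```c
#include <stdint.h>

/* Bitwise select: take bits of a where c is 1 and bits of b where c is 0. */
uint64_t bit_select(uint64_t a, uint64_t b, uint64_t c) {
    return (a & c) | (b & ~c);
}
```

With a comparison mask as `c`, this acts as a lane-wise blend of two vectors.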
The following operations related to elements of vectors are possible:
It's possible to clamp integer values, both signed and unsigned, to a given range. The range is given as a power of two: for unsigned numbers the range is [0 … 2^n − 1], and for signed numbers it is [−2^n … 2^n − 1], where the parameter n is given as an immediate.
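A scalar sketch of the clamping, using the ranges as stated above for 32-bit elements. How the immediate is encoded in the real instructions is not modeled here, and the function names are hypothetical.

```c
#include <stdint.h>

/* Clamp an unsigned value to [0, 2^n - 1]; requires n < 32. */
uint32_t clamp_unsigned(uint32_t x, unsigned n) {
    uint32_t hi = (UINT32_C(1) << n) - 1;
    return x > hi ? hi : x;
}

/* Clamp a signed value to [-2^n, 2^n - 1]; requires n < 31. */
int32_t clamp_signed(int32_t x, unsigned n) {
    int32_t hi = (int32_t)((UINT32_C(1) << n) - 1);
    int32_t lo = -hi - 1;
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}
```

For example, with n = 7 the signed variant saturates to the int8 range, which is the usual step before narrowing 32-bit values to bytes.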
In-register shuffling reads data from two registers; it's something we know from the AMD XOP extension.
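To illustrate what a two-source shuffle does, here is a scalar sketch for 16-byte vectors. It assumes that indices 0..15 select from one source and 16..31 from the other, and that indices are reduced modulo 32; both the index layout and the function name are assumptions for illustration, not a description of the exact instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Two-source byte shuffle: idx[i] in 0..15 selects from lo, 16..31 from hi.
   Indices are reduced modulo 32 (an assumption of this model).
   dst must not alias lo, hi, or idx. */
void shuffle2_u8x16(uint8_t dst[16], const uint8_t lo[16], const uint8_t hi[16],
                    const uint8_t idx[16]) {
    for (size_t i = 0; i < 16; i++) {
        uint8_t k = idx[i] & 31;
        dst[i] = (k < 16) ? lo[k] : hi[k - 16];
    }
}
```

The practical benefit is that a 32-entry byte lookup table fits in a single shuffle instead of two shuffles plus a blend.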
There's an instruction that applies an arbitrary permutation of bytes within 4-byte lanes. The permutation is given as an immediate.
The instruction finds the first negative number and stores its index in the selected lane of the destination.
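A scalar sketch of such a permutation on a 16-byte vector. The encoding used here is an assumption: the 8-bit immediate holds four 2-bit fields, and field k selects the source byte for position k of every 4-byte lane. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Permute bytes inside every 4-byte lane of a 16-byte vector.
   Assumed encoding: bits [2k+1:2k] of imm pick the source byte for
   position k of each lane. dst must not alias src. */
void permute_bytes_in_lanes(uint8_t dst[16], const uint8_t src[16], uint8_t imm) {
    for (size_t i = 0; i < 16; i++) {
        size_t lane = i & ~(size_t)3;        /* index of the lane's first byte */
        size_t pos = i & 3;                  /* position inside the lane       */
        size_t sel = (imm >> (2 * pos)) & 3; /* selected source position       */
        dst[i] = src[lane + sel];
    }
}
```

Under this encoding the immediate 0x1B reverses the bytes of each lane (a 32-bit byte swap), and 0xE4 is the identity permutation.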
That is a pretty complex one.
The instruction performs the following operation:
if a[i] == 0 { dst[i] = a[i]; } else if a[i] > 0 { dst[i] = b[i]; } else { dst[i] = -b[i]; }
If arguments a and b are the same vector, this instruction calculates the absolute value of the vector elements.
There is a powerful instruction that allows initializing a vector from an immediate:
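The pseudocode above can be written as a runnable scalar reference. This sketch assumes 32-bit signed elements and leaves the INT32_MIN case out of scope, since negating it overflows in C; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] = 0 if a[i] == 0, b[i] if a[i] > 0, -b[i] if a[i] < 0.
   Precondition (model limitation): b[i] != INT32_MIN when a[i] < 0. */
void sign_apply_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (a[i] == 0) {
            dst[i] = 0;
        } else if (a[i] > 0) {
            dst[i] = b[i];
        } else {
            dst[i] = -b[i];
        }
    }
}
```

Calling `sign_apply_i32(dst, v, v, n)` gives the absolute value of each element: positive elements are copied, negative ones are negated, and zeros stay zero.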