LoongArch64 subjective highlights

Author:	Wojciech Muła
Added on:	2025-01-19

Introduction

I recently got back to working on simdutf and noticed that the library had gained support for LoongArch64. This is a custom design and custom ISA by Loongson from China. They provide documentation for the scalar ISA, but not for the vector extensions. Despite that, GCC, binutils, QEMU and other tools already support the ISA. Luckily for us, Jiajie Chen did an impressive job of reverse engineering the vector instructions and published the results online as The Unofficial LoongArch Intrinsics Guide.

LoongArch comes with two vector extensions:

LSX, having 128-bit vector registers,
LASX, having 256-bit vector registers.

These extensions are similar; in particular, most instructions present in LSX also exist in LASX. According to the Wikipedia entry, the ISA is a mixture of RISC-V and MIPS.

The ISA supports both integer and floating point instructions. There's support for 8-bit, 16-bit, 32-bit, 64-bit and also 128-bit integers. Floating point instructions cover single precision, double precision and half precision numbers.

Comparisons yield byte-masks, similarly to SSE.

Integer instructions are defined for most integer types, which makes the ISA regular.

My impression is that the ISA is well designed, but I have not vectorized any code for this architecture. Below is a list of features I found interesting while browsing the intrinsics guide.

Personal highlights

Conditional branches

It's possible to control the program flow based on two predicates:

if any element of a vector is zero,
if all elements of a vector are non-zero.

A similar solution exists in SSE (PTEST), but it tests the whole vector rather than individual elements.
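The two predicates can be modelled in scalar C. This is only a sketch of the semantics for a 128-bit vector viewed as 16 bytes; the helper names `any_zero_u8` and `all_nonzero_u8` are made up for illustration and are not LoongArch intrinsics.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* a 128-bit vector viewed as 16 bytes */

/* True if at least one element equals zero. */
bool any_zero_u8(const uint8_t v[LANES]) {
    for (size_t i = 0; i < LANES; i++) {
        if (v[i] == 0) {
            return true;
        }
    }
    return false;
}

/* True if every element differs from zero. */
bool all_nonzero_u8(const uint8_t v[LANES]) {
    return !any_zero_u8(v);
}
```

The element width matters here: a vector filled with the 32-bit value 0x00000100 contains zero bytes, but no zero 32-bit element.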

Bit operations

There are vector instructions that alter individual bits of a word:

set a bit,
reset a bit,
toggle a bit.

That's just cool, although it is easy to emulate in SIMD ISAs that have variable shifts.
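A per-element scalar sketch of the three operations, built from a variable shift. The function names are hypothetical, and taking the bit index modulo the element width is my assumption about how out-of-range indices behave.

```c
#include <stdint.h>

/* Each helper models one 8-bit element; `idx` selects the bit.
   Assumption: the index is taken modulo the element width (8). */

uint8_t bit_set_u8(uint8_t x, uint8_t idx) {
    return (uint8_t)(x | (1u << (idx % 8)));
}

uint8_t bit_clear_u8(uint8_t x, uint8_t idx) {
    return (uint8_t)(x & ~(1u << (idx % 8)));
}

uint8_t bit_toggle_u8(uint8_t x, uint8_t idx) {
    return (uint8_t)(x ^ (1u << (idx % 8)));
}
```

In an ISA with variable shifts only, each of these costs a shift of a vector of ones plus one logical operation.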

Counting bits

The following counting ops are defined:

counting leading zeros,
counting leading ones,
population count (counting ones).

On x86, vector versions of counting leading zeros and population count are available only in AVX-512.
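For reference, here is what the three counting operations compute for a single 8-bit element. It is a plain scalar sketch with made-up function names, not the intrinsics themselves.

```c
#include <stdint.h>

/* Number of zero bits above the most significant set bit; 8 for x == 0. */
unsigned clz_u8(uint8_t x) {
    unsigned n = 0;
    for (uint8_t mask = 0x80; mask != 0 && (x & mask) == 0; mask >>= 1) {
        n++;
    }
    return n;
}

/* Leading ones are the leading zeros of the complement. */
unsigned clo_u8(uint8_t x) {
    return clz_u8((uint8_t)~x);
}

/* Population count: clear the lowest set bit until nothing is left. */
unsigned popcount_u8(uint8_t x) {
    unsigned n = 0;
    while (x != 0) {
        x &= (uint8_t)(x - 1);
        n++;
    }
    return n;
}
```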

128-bit arithmetic

There are addition and subtraction instructions for 128-bit numbers. Very nice!
There is also an instruction that selects any 64-bit subword of a 128-bit word.
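To show what a single 128-bit addition or subtraction saves, below is a sketch of the same arithmetic done with two 64-bit halves and explicit carry/borrow handling. The `u128` struct and both function names are illustrative only.

```c
#include <stdint.h>

/* A 128-bit unsigned number split into two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128;

u128 add_u128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    /* The low half wrapped around iff the result is smaller than an input. */
    r.hi = a.hi + b.hi + (r.lo < a.lo);
    return r;
}

u128 sub_u128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo - b.lo;
    /* Borrow from the high half iff the low subtraction underflowed. */
    r.hi = a.hi - b.hi - (a.lo < b.lo);
    return r;
}
```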

Integer absolute values

There are two instructions related to absolute values.

the absolute value of the difference:

dst[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];

the sum of absolute values:
```
dst[i] = abs(a[i]) + abs(b[i]);
```

Widening addition

Add the even (or odd) elements of a given width and produce a vector of wider elements. For instance, we may add the even 32-bit elements of two vectors and get a vector of 64-bit sums.
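A scalar sketch of the 32-bit to 64-bit case for a 128-bit vector. The function name and the `offset` parameter (0 selects even elements, 1 selects odd ones) are my own; only signed inputs are modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* a, b: four signed 32-bit elements; dst: two signed 64-bit sums.
   offset == 0 adds the even elements, offset == 1 the odd ones. */
void widening_add_s32(const int32_t a[4], const int32_t b[4],
                      size_t offset, int64_t dst[2]) {
    for (size_t i = 0; i < 2; i++) {
        /* Widen before adding, so the sum cannot overflow. */
        dst[i] = (int64_t)a[2 * i + offset] + (int64_t)b[2 * i + offset];
    }
}
```

Because the inputs are widened first, even the sum of two INT32_MAX values is represented exactly.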

Integer division

There are integer division and modulo instructions for 8-, 16-, 32- and 64-bit integers.

Logical operations

There is no ternary logic, but the set of binary operations is rich:

and
or
xor
and-not (a and not b)
not-or (not (a or b))
or-not (a or not b)

There is also a bit-select operation: (a and c) or (b and not c).
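The less common operations from the list, written out for a single byte. The formulas follow the descriptions above; the function names are hypothetical, and the operand order of the real instructions may differ from this sketch.

```c
#include <stdint.h>

/* and-not: a and not b */
uint8_t and_not_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a & ~b);
}

/* not-or: not (a or b) */
uint8_t nor_u8(uint8_t a, uint8_t b) {
    return (uint8_t)~(a | b);
}

/* or-not: a or not b */
uint8_t or_not_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a | ~b);
}

/* bit-select: take bits of a where c is 1, bits of b where c is 0 */
uint8_t bit_select_u8(uint8_t a, uint8_t b, uint8_t c) {
    return (uint8_t)((a & c) | (b & ~c));
}
```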

Accessing individual elements of vector

The following operations related to elements of vectors are possible:

Store into memory the selected element; element index is given as an immediate.
Copy i-th element from source vector to j-th element of destination; again, indices i & j are encoded as an immediate.
Broadcast the selected element; in this case the index is read from a general-purpose register.

Integer clamping

It's possible to clamp integer values, both signed and unsigned, to a given range. The range is given as a power of two: for unsigned numbers the range is [0…2ⁿ − 1], for signed numbers it is [ − 2ⁿ…2ⁿ − 1], where the parameter n is given as an immediate.
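A scalar sketch of both variants for 32-bit elements, using exactly the ranges given above. The function names are made up, and how the hardware maps its immediate onto n is not modelled here.

```c
#include <stdint.h>

/* Clamp an unsigned value to [0, 2^n - 1]; assumes 0 <= n <= 31. */
uint32_t clamp_u32(uint32_t x, unsigned n) {
    const uint32_t max = (UINT32_C(1) << n) - 1;
    return x > max ? max : x;
}

/* Clamp a signed value to [-2^n, 2^n - 1]; assumes 0 <= n <= 31. */
int32_t clamp_s32(int32_t x, unsigned n) {
    const int32_t max = (int32_t)((UINT32_C(1) << n) - 1);
    const int32_t min = -max - 1;
    if (x > max) {
        return max;
    }
    if (x < min) {
        return min;
    }
    return x;
}
```

With n = 7 the signed variant saturates 32-bit values to the range of a signed byte, which is handy before narrowing.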

Shuffling

In-register shuffling reads data from two registers, something we know from the AMD XOP extension.

4-byte-lanes shuffling

There's an instruction that applies an arbitrary permutation to the bytes within each 4-byte lane. The permutation is given as an immediate.
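A scalar sketch of such a shuffle on a 16-byte vector. I assume here that the 8-bit immediate packs four 2-bit source indices, lowest bits first, the same way x86 shuffle immediates do; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Permute bytes inside every 4-byte lane of a 16-byte vector.
   Assumption: bits [2k+1:2k] of imm select the source byte for
   position k of each lane. */
void shuffle4_u8(const uint8_t src[16], uint8_t imm, uint8_t dst[16]) {
    for (size_t i = 0; i < 16; i++) {
        const size_t lane = i & ~(size_t)3;               /* first byte of the lane */
        const size_t idx = (imm >> (2 * (i & 3))) & 3;    /* byte within the lane */
        dst[i] = src[lane + idx];
    }
}
```

Under this encoding the immediate 0x1b reverses the bytes of every 32-bit word, and 0xe4 is the identity permutation.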

Horizontal integer scanning

The instruction finds the first negative number and stores its index in the selected lane of the destination.

That is a pretty complex one.
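My reading of the operation, as a scalar sketch for 16 signed bytes. The function name is made up, and two details are assumptions on my side: the lane selector is taken modulo 16, and the lane count (16) is stored when no element is negative.

```c
#include <stddef.h>
#include <stdint.h>

/* Find the index of the first negative element in src and write it
   into dst[lane]; other elements of dst are left untouched.
   Assumptions: lane is taken modulo 16, and 16 is stored when
   src has no negative element. */
void find_first_negative_s8(const int8_t src[16], size_t lane, uint8_t dst[16]) {
    size_t i = 0;
    while (i < 16 && src[i] >= 0) {
        i++;
    }
    dst[lane % 16] = (uint8_t)i;
}
```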

Integer copy sign

The instruction performs the following operation:

if a[i] == 0 {
    dst[i] = a[i];
} else if a[i] > 0 {
    dst[i] = b[i];
} else {
    dst[i] = -b[i];
}

If arguments a and b are the same vector, this instruction calculates the absolute value of the vector elements.
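The same logic for one 32-bit element in C, which makes it easy to check the absolute value trick. The function name is hypothetical; negation is done on the unsigned representation so that the sketch wraps around for INT32_MIN instead of invoking undefined behaviour (whether the hardware wraps the same way is an assumption).

```c
#include <stdint.h>

/* Copy the sign of a onto b: 0 if a == 0, b if a > 0, -b if a < 0. */
int32_t copy_sign_s32(int32_t a, int32_t b) {
    if (a == 0) {
        return 0;
    }
    if (a > 0) {
        return b;
    }
    /* Wrap-around negation, avoiding signed overflow for INT32_MIN. */
    return (int32_t)(UINT32_C(0) - (uint32_t)b);
}
```

Calling `copy_sign_s32(x, x)` yields x for positive x, -x for negative x and 0 for zero, which is the absolute value.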

Broadcast immediate to vector

There is a powerful instruction that initializes a vector from an immediate. It can:

broadcast an 8-bit immediate to all bytes;
broadcast a sign-extended 10-bit immediate as 16-, 32- or 64-bit elements;
broadcast an 8-bit immediate to 16-bit elements;
broadcast an 8-bit immediate shifted left by 8 bits to 16-bit elements;
broadcast an 8-bit immediate to 32-bit elements;
broadcast an 8-bit immediate shifted left by 8 bits to 32-bit elements;
broadcast an 8-bit immediate shifted left by 16 bits to 32-bit elements;
broadcast an 8-bit immediate shifted left by 24 bits to 32-bit elements;
broadcast an 8-bit immediate shifted left by 8 bits and or-ed with 0xff to 32-bit elements ((imm << 8) | 0xff);
broadcast an 8-bit immediate shifted left by 16 bits and or-ed with 0xffff to 32-bit elements ((imm << 16) | 0xffff);
convert an 8-bit value into a byte mask (64-bit value) and broadcast that value;
convert an 8-bit immediate into a floating-point number (single or double precision) in range ± 0.125…7.75.
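Two of the less obvious modes, sketched for a single element. The function names are made up; for the byte mask I assume that bit i of the immediate controls byte i of the 64-bit value.

```c
#include <stdint.h>

/* Expand each bit of imm into a full byte: bit i set -> byte i = 0xff.
   Assumption: bit 0 maps to the least significant byte. */
uint64_t byte_mask_from_bits(uint8_t imm) {
    uint64_t mask = 0;
    for (unsigned i = 0; i < 8; i++) {
        if (imm & (1u << i)) {
            mask |= UINT64_C(0xff) << (8 * i);
        }
    }
    return mask;
}

/* The "shifted and or-ed with ones" mode for 32-bit elements:
   (imm << 8) | 0xff */
uint32_t shifted_with_ones_8(uint8_t imm) {
    return ((uint32_t)imm << 8) | 0xffu;
}
```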