Author: Wojciech Muła
With a simple program, the following times were measured:
The function sched_getcpu, declared in sched.h, returns the number of the CPU on which the calling thread or process is running.
There is no SSE2 instruction that converts unsigned 32-bit integers to floats: CVTDQ2PS converts signed 32-bit ints. Solution: first clear the MSB; such a number is never negative in U2 (two's complement), so the mentioned instruction can be used. Then add 2^31 if the MSB was set.
float    CONST[4]     SIMD_ALIGN = packed_float((float)((uint32_t)1 << 31)); /* 2^31 */
uint32_t MASK_0_30[4] SIMD_ALIGN = packed_dword(0x7fffffff);
uint32_t MASK_31[4]   SIMD_ALIGN = packed_dword(0x80000000);

void convert_uint32_float(uint32_t in[4], float out[4]) {
    __asm__ volatile (
        "movdqu   (%%eax), %%xmm0    \n"
        "movdqa   %%xmm0, %%xmm1     \n"
        "pand     MASK_0_30, %%xmm0  \n" // xmm0 - clear the MSB - never less than zero in U2
        "cvtdq2ps %%xmm0, %%xmm0     \n" // convert this value to float
        "psrad    $32, %%xmm1        \n" // fill each dword with its MSB (all ones or all zeros)
        "pand     CONST, %%xmm1      \n" // xmm1 = MSB set ? float(2^31) : float(0)
        "addps    %%xmm1, %%xmm0     \n" // add 2^31 if MSB set
        "movdqu   %%xmm0, (%%ebx)    \n"
        : /* no output */
        : "a" (in), "b" (out)
        : "xmm0", "xmm1", "memory"       // registers and memory modified by the block
    );
}
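The steps of the routine above can be followed one lane at a time in plain C. This is only a scalar model of the technique, not the author's code; the function name `uint32_to_float_model` is made up for illustration. Note that two roundings take place (the conversion and the addition), so the result is not guaranteed to match a directly rounded conversion for every input.

```c
#include <stdint.h>

/* Scalar model of one SIMD lane (hypothetical helper name). */
float uint32_to_float_model(uint32_t x) {
    /* pand MASK_0_30: clear the MSB, the value is now non-negative as int32 */
    int32_t low31 = (int32_t)(x & 0x7fffffffu);

    /* cvtdq2ps: signed 32-bit integer to float */
    float result = (float)low31;

    /* psrad: all ones if the MSB of x was set, all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (x >> 31);

    /* pand CONST: keep the bit pattern of 2^31 only if the MSB was set */
    union { uint32_t u; float f; } correction;
    correction.f = 2147483648.0f;   /* 2^31 */
    correction.u &= mask;           /* becomes +0.0f when mask == 0 */

    /* addps: add 2^31 back if it was removed in the first step */
    return result + correction.f;
}
```

For example, `uint32_to_float_model(0x80000000u)` converts 0 to 0.0f and then adds 2^31.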
Branch-less x86 code:
movl %eax, %ebx
sarl $31, %ebx   ; fill ebx with sign bit (x86 masks scalar shift counts to 5 bits, so $32 would shift by 0)
xorl %ebx, %eax  ; invert all bits of eax (if negative)
subl %ebx, %eax  ; increment eax by 1 (if negative)
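The same sign-mask trick can be written in C. The sketch below is an illustration rather than the author's code; `abs_branchless` is a hypothetical name, and unsigned arithmetic is used so that the most negative input does not trigger signed overflow.

```c
#include <stdint.h>

/* Branch-less absolute value of a 32-bit integer (hypothetical helper). */
uint32_t abs_branchless(int32_t x) {
    uint32_t ux = (uint32_t)x;

    /* Equivalent of "sarl $31": 0xffffffff if x is negative, 0 otherwise. */
    uint32_t mask = (uint32_t)0 - (ux >> 31);

    /* xor inverts the bits when negative; subtracting the mask (-1) adds 1. */
    return (ux ^ mask) - mask;
}
```

The result is returned as unsigned because the absolute value of INT32_MIN (2^31) does not fit in int32_t.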
SSE2:
pshufd $0b11110101, %xmm0, %xmm1  ; populate dwords 3 and 1
psrad  $32, %xmm1                 ; fill quad words with sign bit
pxor   %xmm1, %xmm0               ; negate (if negative)
psubq  %xmm1, %xmm0               ; increment (if negative)
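For readers who prefer intrinsics, the sequence above maps onto `<emmintrin.h>` roughly as follows. This is a sketch under assumptions: the function name `abs_int64x2` and the array-based interface are invented for illustration, and a scalar fallback is included so the code still builds where SSE2 is not available.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Absolute value of two packed signed 64-bit integers (hypothetical helper). */
void abs_int64x2(const int64_t in[2], int64_t out[2]) {
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* pshufd 0b11110101: copy the upper dword of each qword into both halves */
    __m128i sign = _mm_shuffle_epi32(v, 0xF5);

    /* psrad: every dword becomes all ones (negative) or all zeros */
    sign = _mm_srai_epi32(sign, 31);

    v = _mm_xor_si128(v, sign);   /* invert bits if negative */
    v = _mm_sub_epi64(v, sign);   /* subtract -1, i.e. add 1, if negative */

    _mm_storeu_si128((__m128i *)out, v);
#else
    /* Scalar fallback using the same mask trick. */
    for (int i = 0; i < 2; i++) {
        uint64_t u = (uint64_t)in[i];
        uint64_t mask = (uint64_t)0 - (u >> 63);
        out[i] = (int64_t)((u ^ mask) - mask);
    }
#endif
}
```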
The counter read by RDTSC is incremented on bus-clock cycles and then multiplied by the core-clock/bus-clock ratio. From the programmer's point of view, the counter therefore advances in steps greater than 1; for example, on a C2D E8200 the step is 8.
The latency of RDTSC on Pentium 4 is about 60-120 cycles; on AMD CPUs it is around 6 cycles.
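One way to observe the step size is to take the greatest common divisor of many differences between consecutive counter reads. The sketch below assumes GCC or Clang on x86, where `__rdtsc()` is provided by `<x86intrin.h>`; the helper names `gcd_u64` and `estimate_tsc_step` are hypothetical. The value is only an estimate: it is a multiple of the true step and approaches it as more samples are taken. On other architectures the function simply returns 0.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Greatest common divisor; gcd_u64(0, d) == d. */
uint64_t gcd_u64(uint64_t a, uint64_t b) {
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD of the differences between back-to-back RDTSC reads (hypothetical helper). */
uint64_t estimate_tsc_step(int samples) {
    uint64_t step = 0;
    for (int i = 0; i < samples; i++) {
#if defined(__x86_64__) || defined(__i386__)
        uint64_t t0 = __rdtsc();
        uint64_t t1 = __rdtsc();
        step = gcd_u64(step, t1 - t0);
#endif
    }
    return step;
}
```

The smallest observed difference between two reads also gives a rough upper bound on the latency of the instruction itself.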
asm( "..." : "+a" (var) );
Listing an input register in the clobber list won't work, GCC complains:
asm( "..." : /* no output */ : "a" (var) : "eax" );
We can declare a temporary variable, and treat it as read-write:
int tmp_var = var;
asm( "..." : "+a" (tmp_var) );
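A concrete instance of the temporary-variable approach might look like the sketch below. The instruction and the function name `doubled_copy` are chosen only for illustration, and the inline assembly assumes GCC or Clang targeting x86; on other targets a plain C fallback is used.

```c
#include <stdint.h>

/* Returns 2 * var; the asm block overwrites eax, which is also its input. */
uint32_t doubled_copy(uint32_t var) {
    uint32_t tmp_var = var;   /* copy, so var itself is never modified */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ (
        "addl %%eax, %%eax \n"   /* eax = eax + eax */
        : "+a" (tmp_var)         /* read-write operand placed in eax */
        :                        /* no separate inputs */
        : "cc"                   /* the addition changes the flags */
    );
#else
    tmp_var += tmp_var;          /* fallback for non-x86 targets */
#endif
    return tmp_var;
}
```

Because eax is declared as a read-write operand, the compiler knows its value changes and no clobber entry for the register is needed.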
If there are more registers, or var shouldn't be changed, then we can declare a common dummy variable:
int dummy __attribute__((unused));
asm(
    "..."
    : "=a" (dummy), "=b" (dummy), "=c" (dummy)
    : "a" (var_or_value1), "b" (var_or_value2), "c" (var_or_value3)
);