Notes

Author: Wojciech Muła

2008

Latency of PUSHA and POPA on Core2 [25.06.2008]

With a simple program, the following times were measured:

  • pusha — 9-10 cycles
  • sequence pusha, popa — 14 cycles

Linux scheduler: int sched_getcpu() [21.06.2008]

This function, declared in sched.h, returns the number of the CPU on which the calling thread or process is running.
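A minimal usage sketch, assuming Linux with glibc, where the declaration is only visible when _GNU_SOURCE is defined before the first include. The wrapper name current_cpu is hypothetical and exists only for illustration.

```c
#define _GNU_SOURCE   /* glibc exposes sched_getcpu() only with this macro */
#include <sched.h>

/* Hypothetical wrapper: returns the index of the CPU that is executing
 * the caller right now, or -1 if the call fails. */
int current_cpu(void) {
    return sched_getcpu();
}
```

The value is only a snapshot: the scheduler may move the thread to another CPU immediately after the call returns.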

SSE: conversion uint32 to float [18.06.2008]

There is no such instruction: CVTDQ2PS converts only signed 32-bit integers. Solution: first clear the MSB; the resulting number is never negative in two's complement, so CVTDQ2PS can be applied to it. Then add 2^31 if the MSB was set.

float    CONST[4]     SIMD_ALIGN = packed_float((float)(1u << 31)); /* 2^31; unsigned shift avoids signed overflow */
uint32_t MASK_0_30[4] SIMD_ALIGN = packed_dword(0x7fffffff);
uint32_t MASK_31[4]   SIMD_ALIGN = packed_dword(0x80000000);

void convert_uint32_float(uint32_t in[4], float out[4]) {
    __asm__ volatile (
    "movdqu   (%%eax), %%xmm0  \n"
    "movdqa    %%xmm0, %%xmm1  \n"

    "pand   MASK_0_30, %%xmm0  \n" // xmm0 - clear the MSB - never less than zero in two's complement
    "cvtdq2ps  %%xmm0, %%xmm0  \n" // convert this value to float

    "psrad        $31, %%xmm1  \n" // fill each dword with its MSB (all ones or all zeros)
    "pand       CONST, %%xmm1  \n" // xmm1 = MSB set ? float(2^31) : float(0)

    "addps     %%xmm1, %%xmm0  \n" // add 2^31 if MSB set

    "movdqu    %%xmm0, (%%ebx) \n"

    : /* no output */
    : "a" (in),
      "b" (out)
    : "xmm0", "xmm1", "memory" // registers and memory modified by the code
    );
}

See a sample implementation.
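The same steps can also be sketched with SSE2 intrinsics, which avoids fixed registers and global symbols in inline assembly. This is an illustrative variant, not the original sample implementation; the function name convert_uint32_float_sse2 is hypothetical. Because the value is rounded once by the conversion and possibly again by the addition, the result may not always be the correctly rounded float.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical name: converts four uint32_t values to float. */
void convert_uint32_float_sse2(const uint32_t in[4], float out[4]) {
    const __m128i mask_0_30 = _mm_set1_epi32(0x7fffffff);
    const __m128  two_31    = _mm_set1_ps(2147483648.0f);   /* 2^31 */

    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* Clear the MSB, so the signed conversion sees a non-negative value. */
    __m128 low = _mm_cvtepi32_ps(_mm_and_si128(v, mask_0_30));

    /* All ones where the MSB was set, all zeros elsewhere. */
    __m128 msb = _mm_castsi128_ps(_mm_srai_epi32(v, 31));

    /* Select 2^31 for lanes with the MSB set, 0.0f for the others. */
    __m128 fix = _mm_and_ps(msb, two_31);

    _mm_storeu_ps(out, _mm_add_ps(low, fix));
}
```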

PABSQ: absolute value of two signed 64-bit numbers [8.06.2008]

Branch-less x86 code:

movl  %eax, %ebx
sarl   $31, %ebx        ; fill ebx with sign bit (scalar shift counts are masked to 5 bits, so $32 would shift by 0)
xorl  %ebx, %eax        ; invert all bits of eax (if negative)
subl  %ebx, %eax        ; increment eax by 1 (if negative)
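The same sign-mask trick can be sketched in C. This assumes that right-shifting a negative signed integer is an arithmetic shift and that out-of-range unsigned-to-signed conversion wraps, as GCC and Clang do on x86; the function name abs_branchless is hypothetical.

```c
#include <stdint.h>

/* Hypothetical name: branch-less absolute value of a 32-bit integer. */
int32_t abs_branchless(int32_t x) {
    /* 0xffffffff if x is negative, 0 otherwise (assumes arithmetic shift). */
    uint32_t sign = (uint32_t)(x >> 31);

    /* (x ^ sign) inverts the bits of a negative x; subtracting sign (-1)
     * then adds 1, completing the two's complement negation. Unsigned
     * arithmetic keeps the INT32_MIN case free of signed overflow. */
    return (int32_t)(((uint32_t)x ^ sign) - sign);
}
```

As with the assembly version, the most negative value has no positive counterpart and maps to itself.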

SSE2:

pshufd $0b11110101, %xmm0, %xmm1        ; populate dwords 3 and 1
psrad   $32, %xmm1      ; fill quad words with sign bit
pxor  %xmm1, %xmm0      ; negate (if negative)
psubq %xmm1, %xmm0      ; increment (if negative)
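For comparison, the SSE2 sequence above might look like this with intrinsics. It is a sketch under the assumption that the two 64-bit values are packed in one __m128i; the function name abs_epi64_sse2 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical name: absolute value of two packed signed 64-bit integers. */
__m128i abs_epi64_sse2(__m128i v) {
    /* Copy the upper dword of each quad word into both of its halves
     * (dwords 3,3,1,1; the same immediate as 0b11110101). */
    __m128i sign = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 1, 1));

    /* Each quad word becomes all ones if negative, all zeros otherwise. */
    sign = _mm_srai_epi32(sign, 31);

    /* Invert the bits and add 1 where negative; no-op elsewhere. */
    return _mm_sub_epi64(_mm_xor_si128(v, sign), sign);
}
```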

RDTSC on Core2 [8.06.2008]

The counter read by RDTSC is incremented on bus-clock cycles, and the result is then multiplied by the core-clock/bus-clock ratio. From the programmer's point of view, the counter advances in steps greater than 1; for example, on a C2D E8200 the step is 8.

The latency of RDTSC on a Pentium 4 is about 60-120 cycles; on AMD CPUs it is around 6 cycles.
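One way to observe the step size is to take the greatest common divisor of the differences between successive counter reads. The sketch below assumes GCC or Clang on x86, where __rdtsc() is provided by x86intrin.h; the names gcd_u64 and estimate_tsc_step are hypothetical, and the result depends on the CPU.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() on GCC and Clang */

/* Euclid's algorithm; gcd_u64(0, d) returns d. */
static uint64_t gcd_u64(uint64_t a, uint64_t b) {
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical name: estimates the granularity of the time-stamp counter
 * as the GCD of the differences between consecutive reads. */
uint64_t estimate_tsc_step(int samples) {
    uint64_t step = 0;
    uint64_t prev = __rdtsc();
    for (int i = 0; i < samples; i++) {
        uint64_t now = __rdtsc();
        step = gcd_u64(step, now - prev);
        prev = now;
    }
    return step;
}
```

With enough samples, the estimate should settle on the real increment, if the counter really advances in fixed steps.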

GCC asm constraints [7.06.2008]

Read-write variables

asm(
        "..."
        : "+a" (var)
);
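A small runnable illustration of the read-write constraint, assuming GCC or Clang on x86; the function name add_one is hypothetical. The value is passed in eax, modified there, and read back from the same register.

```c
/* Hypothetical name: increments the argument inside inline assembly. */
int add_one(int var) {
    __asm__ (
        "incl %0"       /* %0 is eax, both input and output */
        : "+a" (var)    /* read-write operand bound to eax */
        :               /* no separate inputs */
        : "cc"          /* incl modifies the flags */
    );
    return var;
}
```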

Read-only variables, registers are clobbered

This won't work; GCC complains, because a register listed as clobbered may not also be used as an input or output operand:

asm(
        "..."
        : /* no output */
        : "a" (var)
        : "eax"
);

We can declare a temporary variable, and treat it as read-write:

int tmp_var = var;
asm(
        "..."
        : "+a" (tmp_var)
);

If there are more registers, or var shouldn't be changed, then we can declare a common dummy variable:

int dummy __attribute__((unused));
asm(
        "..."
        : "=a" (dummy),
          "=b" (dummy),
          "=c" (dummy)
        : "a" (var_or_value1),
          "b" (var_or_value2),
          "c" (var_or_value3)
);
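A concrete sketch of the dummy-output approach, assuming GCC or Clang on x86. The assembly body and the function name sum3_destroying_inputs are hypothetical; the point is that ebx and ecx are overwritten, so they are declared as outputs to a throwaway variable instead of being listed as clobbers.

```c
/* Hypothetical name: adds three values while destroying its input registers. */
int sum3_destroying_inputs(int x, int y, int z) {
    int result;
    int dummy __attribute__((unused));  /* receives the destroyed registers */
    __asm__ (
        "addl %%ebx, %%eax \n\t"   /* eax = x + y */
        "addl %%ecx, %%eax \n\t"   /* eax = x + y + z */
        "xorl %%ebx, %%ebx \n\t"   /* overwrite ebx */
        "xorl %%ecx, %%ecx \n\t"   /* overwrite ecx */
        : "=a" (result),
          "=b" (dummy),
          "=c" (dummy)
        : "a" (x),
          "b" (y),
          "c" (z)
        : "cc"                     /* arithmetic modifies the flags */
    );
    return result;
}
```

The caller's variables x, y and z keep their values; only the registers that carried them into the assembly are overwritten.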