[Freeciv-Dev] Request for optimization

To: "Ross W. Wetmore" <rwetmore@xxxxxxxxxxxx>, Rafal Bursig <bursig@xxxxxxxxx>
Cc: freeciv development list <freeciv-dev@xxxxxxxxxxx>
Subject: [Freeciv-Dev] Request for optimization
From: Raimar Falke <i-freeciv-lists@xxxxxxxxxxxxx>
Date: Fri, 5 Mar 2004 20:10:09 +0100

I have a 32-bit image in RAM: 24-bit RGB plus an 8-bit mask. The memory of
the whole image is contiguous. Now I want to blit depending on the mask.

The basic function looks like this:

struct image {
  int width, height;
  int pitch;
  unsigned char *data;
};

#define P(image, x, y) ((image)->data + (image)->pitch * (y) + 4 * (x))

static void image_blit_masked(const struct ct_size *size,
                              const struct image *src,
                              const struct ct_point *src_pos,
                              struct image *dest,
                              const struct ct_point *dest_pos)
{
  int y;
  int width = size->width;
  unsigned long long start, end;
  static unsigned long long total_clocks = 0, total_pixels = 0;
  static int total_blits = 0;

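  /* Read the CPU time-stamp counter. The "=A" constraint yields the
   * edx:eax register pair only on 32-bit x86, not on x86-64. */
#define rdtscll(val) __asm__ __volatile__ ("rdtsc" : "=A" (val))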

  rdtscll(start);

  for (y = 0; y < size->height; y++) {
    int src_y = y + src_pos->y;
    unsigned char *psrc = P(src, src_pos->x, src_y);

    int dest_y = y + dest_pos->y;
    unsigned char *pdest = P(dest, dest_pos->x, dest_y);

    {
      int x;

      for (x = 0; x < width; x++) {
        if (psrc[3] != 0) {
          memcpy(pdest, psrc, 4);
        }
        psrc += 4;
        pdest += 4;
      }
    }
  }
  rdtscll(end);
  total_clocks += (end - start);
  total_pixels += (size->width * size->height);
  total_blits++;
  if ((total_blits % 1000) == 0) {
    printf("%f clocks per pixel\n", (float) total_clocks / total_pixels);
  }
}

The function prints out 20 clocks per pixel.
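
For readers who want to try this outside Freeciv, here is a minimal sketch of the same blit without the timing code. The definitions of struct ct_size and struct ct_point are not shown in this mail, so the two structs below are hypothetical stand-ins, and the name image_blit_masked_plain and the bounds assertions are likewise additions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the Freeciv types used in the post. */
struct ct_size  { int width, height; };
struct ct_point { int x, y; };

struct image {
  int width, height;
  int pitch;            /* bytes per row */
  unsigned char *data;  /* 4 bytes per pixel; byte 3 is the mask */
};

#define P(image, x, y) ((image)->data + (image)->pitch * (y) + 4 * (x))

/* Copy every source pixel whose mask byte is non-zero. */
static void image_blit_masked_plain(const struct ct_size *size,
                                    const struct image *src,
                                    const struct ct_point *src_pos,
                                    struct image *dest,
                                    const struct ct_point *dest_pos)
{
  int x, y;

  /* No clipping is done; the caller must pass in-bounds rectangles. */
  assert(src_pos->x >= 0 && src_pos->x + size->width <= src->width);
  assert(src_pos->y >= 0 && src_pos->y + size->height <= src->height);
  assert(dest_pos->x >= 0 && dest_pos->x + size->width <= dest->width);
  assert(dest_pos->y >= 0 && dest_pos->y + size->height <= dest->height);

  for (y = 0; y < size->height; y++) {
    const unsigned char *psrc = P(src, src_pos->x, y + src_pos->y);
    unsigned char *pdest = P(dest, dest_pos->x, y + dest_pos->y);

    for (x = 0; x < size->width; x++) {
      if (psrc[3] != 0) {
        memcpy(pdest, psrc, 4);   /* copies RGB and the mask byte */
      }
      psrc += 4;
      pdest += 4;
    }
  }
}
```

Destination pixels under a zero mask byte are left untouched, which is the behaviour being timed above.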

In profiling, this function ends up eating the most CPU by a wide
margin:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 42.68      2.36     2.36    71518     0.03     0.03  image_blit_masked
 17.54      3.33     0.97    62280     0.02     0.02  be_draw_region
 13.02      4.05     0.72      937     0.77     0.77  image_set_mask
  8.32      4.51     0.46    69541     0.01     0.01  set_mask_masked
  6.15      4.85     0.34      104     3.27     3.27  fill_ximage_from_image_565

So I looked at the asm output: the inner loop is

.L37:
        cmpb    $0, 3(%ecx)
        je      .L28
        movl    (%ecx), %eax
        movl    %eax, (%ebx)
.L28:
        addl    $4, %ecx
        addl    $4, %ebx
        decl    %edx
        jne     .L37

which seems correct.

I tried various things:
 - replace the inner loop body with a single 32-bit read:
        {
          unsigned int t = *((unsigned int *) psrc);
          unsigned int *tp = (unsigned int *) pdest;

          if ((t & 0xff000000) != 0) {
            *tp = t;
          }
        }
    result: 20 clocks/pixel

 - rewrite the inner loop body so that the compiler can emit a cmov instruction:
        {
          unsigned int t = *((unsigned int *) psrc);
          unsigned int *tp = (unsigned int *) pdest;
          unsigned int s = *tp;

          if ((t & 0xff000000) != 0) {
            s = t;
          }
          *tp = s;
        }
    result: 24 clocks/pixel

 - unroll the original inner loop 8 times:
    result: 20 clocks/pixel
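
The single-read variant from the list above can be written as a standalone row function, sketched below. The helper name blit_row_word is hypothetical. It loads through memcpy rather than a pointer cast, which sidesteps alignment and aliasing questions and which compilers typically turn into a plain 32-bit move. Note that the 0xff000000 test matches byte 3 of the pixel only on a little-endian CPU such as x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: blit one row of `width` pixels using one
 * 32-bit load per pixel instead of a separate byte compare. */
static void blit_row_word(const unsigned char *psrc,
                          unsigned char *pdest, int width)
{
  int x;

  assert(width >= 0);

  for (x = 0; x < width; x++) {
    uint32_t t;

    memcpy(&t, psrc, 4);            /* alignment- and aliasing-safe load */

    /* On little-endian hardware the top byte of t is psrc[3], the mask. */
    if ((t & 0xff000000u) != 0) {
      memcpy(pdest, &t, 4);
    }
    psrc += 4;
    pdest += 4;
  }
}
```

Calling it once per row with the psrc and pdest pointers from the outer loop would reproduce the first experiment.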

I'm a bit lost here. Could memory bandwidth be the bottleneck?
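
One rough way to frame that question is to turn the measurement into bytes per clock. Each pixel causes at most a 4-byte read and a 4-byte write, so 8 bytes per 20 clocks is about 0.4 bytes per clock. The sketch below does that arithmetic. The helper names are hypothetical, and any clock rate passed in is an assumption, since the machine's speed is not stated in this mail.

```c
#include <assert.h>

/* Bytes transferred per clock, given bytes moved per pixel and the
 * measured clocks per pixel. */
static double bytes_per_clock(double bytes_per_pixel, double clocks_per_pixel)
{
  assert(clocks_per_pixel > 0.0);
  return bytes_per_pixel / clocks_per_pixel;
}

/* Convert bytes per clock to MB/s (10^6 bytes/s) for an ASSUMED
 * clock rate in MHz; the real clock rate is not given in the post. */
static double mbytes_per_second(double bytes_per_clk, double clock_mhz)
{
  return bytes_per_clk * clock_mhz;
}
```

With 8 bytes per pixel and 20 clocks per pixel this gives 0.4 bytes per clock, which would be 400 MB/s on a purely illustrative 1000 MHz CPU. Whether that is anywhere near the memory limit depends on the hardware in question.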

        Raimar

-- 
 email: rf13@xxxxxxxxxxxxxxxxx
 "Reality? That's where the pizza delivery guy comes from!"

