[Freeciv-Dev] Request for optimization
I have a 32-bit image in RAM: 24-bit RGB plus an 8-bit mask. The memory of
the whole image is contiguous. Now I want to blit depending on the mask.
The basic function looks like this:
struct image {
  int width, height;
  int pitch;
  unsigned char *data;
};

#define P(image, x, y) ((image)->data + (image)->pitch * (y) + 4 * (x))

static void image_blit_masked(const struct ct_size *size,
                              const struct image *src,
                              const struct ct_point *src_pos,
                              struct image *dest,
                              const struct ct_point *dest_pos)
{
  int y;
  int width = size->width;
  unsigned long long start, end;
  static unsigned long long total_clocks = 0, total_pixels = 0;
  static int total_blits = 0;

#define rdtscll(val) __asm__ __volatile__ ("rdtsc" : "=A" (val))

  rdtscll(start);

  for (y = 0; y < size->height; y++) {
    int src_y = y + src_pos->y;
    unsigned char *psrc = P(src, src_pos->x, src_y);
    int dest_y = y + dest_pos->y;
    unsigned char *pdest = P(dest, dest_pos->x, dest_y);

    {
      int x;

      for (x = 0; x < width; x++) {
        if (psrc[3] != 0) {
          memcpy(pdest, psrc, 4);
        }
        psrc += 4;
        pdest += 4;
      }
    }
  }

  rdtscll(end);
  total_clocks += (end - start);
  total_pixels += (size->width * size->height);
  total_blits++;
  if ((total_blits % 1000) == 0) {
    printf("%f clocks per pixel\n", (float) total_clocks / total_pixels);
  }
}
The function prints out 20 clocks per pixel.
In profiling, this function ends up eating the most CPU time by a wide
margin:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 42.68      2.36     2.36    71518     0.03     0.03  image_blit_masked
 17.54      3.33     0.97    62280     0.02     0.02  be_draw_region
 13.02      4.05     0.72      937     0.77     0.77  image_set_mask
  8.32      4.51     0.46    69541     0.01     0.01  set_mask_masked
  6.15      4.85     0.34      104     3.27     3.27  fill_ximage_from_image_565
So I looked at the asm output: the inner loop is
.L37:
        cmpb    $0, 3(%ecx)
        je      .L28
        movl    (%ecx), %eax
        movl    %eax, (%ebx)
.L28:
        addl    $4, %ecx
        addl    $4, %ebx
        decl    %edx
        jne     .L37
which seems correct.
I tried various things:
 - replace the body of the inner loop with a single 32-bit read:
     {
       unsigned int t = *((unsigned int *) psrc);
       unsigned int *tp = (unsigned int *) pdest;

       if ((t & 0xff000000) != 0) {
         *tp = t;
       }
     }
   result: 20 clocks/pixel
 - rewrite the body of the inner loop so that the compiler emits a cmov
   instruction:
     {
       unsigned int t = *((unsigned int *) psrc);
       unsigned int *tp = (unsigned int *) pdest;
       unsigned int s = *tp;

       if ((t & 0xff000000) != 0) {
         s = t;
       }
       *tp = s;
     }
   result: 24 clocks/pixel
- 8 times loop unrolling of the original inner loop:
result: 20 clocks/pixel
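
One further variant in the same spirit as the cmov attempt would be to select
between source and destination pixel with bitwise operations instead of a
conditional. This is only a sketch and has not been measured. The name
blit_row_select is made up for illustration, the mask is assumed to be byte 3
of each pixel as in the code above, and pixels are loaded and stored through
memcpy so that no alignment or aliasing assumptions are needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the original code): copy `width`
 * 4-byte pixels from psrc to pdest, keeping the destination pixel
 * wherever the source mask byte (byte 3 of the pixel) is zero. */
static void blit_row_select(unsigned char *pdest, const unsigned char *psrc,
                            int width)
{
  int x;

  assert(pdest != NULL && psrc != NULL && width >= 0);

  for (x = 0; x < width; x++) {
    uint32_t s, d, keep;

    memcpy(&s, psrc, 4);
    memcpy(&d, pdest, 4);
    /* keep is 0xffffffff when the mask byte is non-zero, else 0. */
    keep = (uint32_t) 0 - (uint32_t) (psrc[3] != 0);
    d = (s & keep) | (d & ~keep);
    memcpy(pdest, &d, 4);
    psrc += 4;
    pdest += 4;
  }
}
```

Like the cmov version, this reads and writes every destination pixel, so it
may well show the same slowdown rather than an improvement.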
I'm a bit lost here. Could it be that the memory bandwidth is the
bottleneck here?
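
One way to check the bandwidth theory would be to time an unconditional copy
of the same rectangle with the same rdtsc bracketing: if a plain row-by-row
memcpy also costs close to 20 clocks per pixel, the mask test is not the
problem. A minimal sketch, with struct image and P() repeated from above so
that it stands alone. The name image_blit_plain and its plain integer
parameters are made up for illustration and stand in for the ct_size and
ct_point arguments.

```c
#include <assert.h>
#include <string.h>

struct image {
  int width, height;
  int pitch;                    /* bytes per row */
  unsigned char *data;
};

#define P(image, x, y) ((image)->data + (image)->pitch * (y) + 4 * (x))

/* Hypothetical baseline (not part of the original code): copy a
 * width x height rectangle of 4-byte pixels without looking at the
 * mask byte.  Every pixel of the rectangle is read and written, with
 * no per-pixel branch. */
static void image_blit_plain(int width, int height,
                             const struct image *src, int src_x, int src_y,
                             struct image *dest, int dest_x, int dest_y)
{
  int y;

  assert(width >= 0 && height >= 0);

  for (y = 0; y < height; y++) {
    memcpy(P(dest, dest_x, dest_y + y), P(src, src_x, src_y + y),
           (size_t) width * 4);
  }
}
```

Calling this in place of image_blit_masked for the same blits and comparing
the clocks-per-pixel figures would show how much of the cost is plain memory
traffic.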
Raimar
--
email: rf13@xxxxxxxxxxxxxxxxx
"Reality? That's where the pizza delivery guy comes from!"