Complete.Org: Mailing Lists: Archives: freeciv-dev: June 2002:
[Freeciv-Dev] Re: [RFC] Path finding implementation.
Home

[Freeciv-Dev] Re: [RFC] Path finding implementation.

[Top] [All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index] [Thread Index]
To: Raimar Falke <rf13@xxxxxxxxxxxxxxxxx>
Cc: Gregory Berkolaiko <Gregory.Berkolaiko@xxxxxxxxxxxx>, Freeciv Development List <freeciv-dev@xxxxxxxxxxx>, "Ross W. Wetmore" <rwetmore@xxxxxxxxxxxx>
Subject: [Freeciv-Dev] Re: [RFC] Path finding implementation.
From: "Ross W. Wetmore" <rwetmore@xxxxxxxxxxxx>
Date: Sat, 29 Jun 2002 09:56:31 -0400

This sort of (unnecessary) attempt to duplicate compiler optimizations
has been tried and shown to be less efficient. Compiler writers and
chip makers have worked really hard over the years to make these things
as efficient as possible as part of the "benchmark" wars. The assumption
that one can hack code to improve on their efforts is alwasy interesting
to watch.

The best thing to do is stay out of the way of the compiler by not
inserting special algorithmic hacks, or optimization killers like
(inline) function calls.


When using functions or inline functions (which are the worst
of all worlds) rather than macros to do the address computations,
one blocks the compiler from generating the sort of efficient
LEA instructions that do (index * base +offset) calculations in a 
single instruction cycle.

In general the compiler will sort out common expressions and cache
the results in registers unless one does things to defeat its pattern
recognition, or allowed optimization constraints.


The basic problem with your lattice system is you are not allocating
large blocks of cleared memory, either a square array, or a linear
vector. If you do try to implement (foolish) sparse systems, the
preset time will eventually kill you, but the solution is not to go
to individual allocations, but a block allocation of working cells
that are "allocated" by setting an indirection index. Free and active
queues of cells can be used to manage this if needed, but I think you
will find that it is pretty much a oneway list of allocates followed
by a full reset that is the usual pattern.

Some of a previous email trail below ...

Cheers,
RossW
=====

At 12:35 PM 02/06/29 +0200, Raimar Falke wrote:
>On Fri, Jun 28, 2002 at 10:28:49PM +0100, Gregory Berkolaiko wrote:
>> I put in priority queue instead of bucket list and got a very significant 
>> improvement, from around 11sec to 7.9sec.  I think it is mostly because of 
>> huge memory allocations that were necessary to do bucket list.
>
>I have also restarted working on this. I also introduced a heap
>now. I'm currently fighting with a nice effect: the algo finds a
>shorter path to a tile. But this tile is in the heap and so the heap
>may become invalid. It has to be rebuilt. Took me quite some time to
>catch this.
>
>> Also, much to my surprise, when I put the lattice sizes to be the exact 
>> map sizes, as opposed to some power of 2, I also get good speed 
>> improvement.  Strange, I thought it is much faster to do x % 128 than to 
>> do x % 60...  Maybe my compiler does not optimize x % 128 to x | 127 ??
>
>gcc -S is your friend. But I'm sure that it does.
>
>> Or the memory allocation is that expensive??
>
>Could be. Maybe also cache trashing.
>
>> > You can avoid the multiplication by using a 2D array.
>
>I implemented this (and removed the hash). And it is quite painless:
>
>struct pf_map {
>  struct cell **cells;
>...
>};
>
>build up:
>
>  pf_map->cells = fc_malloc(sizeof(*pf_map->cells)*map.xsize);
>  for (x = 0; x < map.xsize; x++) {
>    pf_map->cells[x] = fc_malloc(sizeof(*pf_map->cells[x]) * map.ysize);
>    memset(pf_map->cells[x],0,sizeof(*pf_map->cells[x]) * map.ysize);
>  }
>
>accessing:
>
>static struct cell *get_cell(pf_map_t pf_map, int x, int y)
>{
>  struct cell *result=&pf_map->cells[x][y];
>...
>
>       Raimar
>
>-- 
> email: rf13@xxxxxxxxxxxxxxxxx
>  The trick is to keep breathing.
=====


Facts can really ruin a good logical argument :-).

Your numbers are a case in point, so I had to poke around with things
to see what was going on. Below is a slightly modified version of your 
program that uses clock() to get the user+system CPU times as accurately
as possible.

I compiled `cc -O3 -o dynarr_00 dynarr_00.c". The assembly for this was
obtained by `cc -O3 -s dynarr_00.c". You will note that it inlined the
functions in main.

The sample times contradict your data and reinforce my argument. But 
the differences are incredibly subtle and thus my guess is that in a
real life programming example, there is lots of room for things to
change. These were run on a PIII-866. It is also interesting that we
both show that static is the worst which is really counter intuitive 
:-)

The inner loops in main run from L.76, L.95, L.116 and end at the 
next "jns" instruction in each case. One needs to look at the three 
branch targets above the loop and the two sections below the jns to 
see the code differences. The inner loops are all pretty much the 
same, i.e. optimized to register accesses.

Note that no one ever really does a multiply which sort of puts a hole
in your argument. Array index computations really are heavily optimized.

The subtle difference between static and indexed is that static does
a memory lookup at the outer loop entry, whereas the indexed version
uses the edi register value. There is an inc and compare that uses
memory at the bottom of the indexed loop, but it too is accessing the 
stack (ebp) which is virtually always cached and thus runs at 
register speeds. Intel chips all use the stack as a big register
cache which in the RISC-CISC wars largely nullified the RISC claims
of performance benefit from loads of on chip registers. Intel used
the chip resources to do other optimizations which won that war for
them :-). Although the compare nominally "reads" memory. if you think 
about it, the value read and inc'd is probably still in the chip 
pipeline and thus won't really be refetched. You could probably add a 
"volatile" attribute somewhere to force things to always refetch real 
memory values if you want to experiment. 

I'm actually at a loss to understand just what is different between a
stack access in the outer loop jump code at top or bottom unless it 
is the fact that you need the %ebx register two instructions down from 
the top and its value may not be available to the pipeline if it was a 
memory vs register access causing a pipeline stall for a cycle or so.

The dynamic case is reading from memory in these cases. This is not
stack accesses, but it is linear access to real memory, though in the 
simple case here will likely result in a fault on only a few accesses
(depending on cache line size), and register speed for the next few
lookups. This can change in a real programming example where there
are other memory lookups competing for the same cache space and thus
the tile[] is dropped from cache before the next iteration. It is 
also the case that the index loop computation cannot be optimized as 
much as shown by the fact that it takes a few more instructions to 
handle both loop indices and array access indices. It is still
somewhat intriguing that it manages to beat the static case but that
seems to be all due to the thinness of the loop setup which is also
where the indexed case seems to have managed its win :-).

[Aside] Note that map_inx() loops over x before y which is the reverse
of tile[x][y] lookups. There may be subtle things happening in places
like GUI tiling if you change the order of things. It is also better
for optimization to keep the inner loop bigger, and do fewer outer
loop iterations.

Cheers,
RossW
=====

root@wetrec[318] ./dynarr_00
declared =  15220000
indexed =   10680000
dynamic =   15010000


#include<stdlib.h>

/********************************/
void declared()
{
  int i, j;
  unsigned int k;
  int tiles[80][50];

  for(k = 0; k < 1000000; k++) {

    for(i = 0; i < 80; i++) {
      for(j = 0; j < 50; j++) {
        tiles[i][j] = i + j;
      }
    }
  }
}

/********************************/
void dynamic_but_indexed()
{
  int i, j;
  unsigned int k;
  int *tiles;

  tiles = malloc(80 * 50 * sizeof(int));

  for(k = 0; k < 1000000; k++) {

    for(i = 0; i < 80; i++) {
      for(j = 0; j < 50; j++) {
        tiles[j + i * 50] = i + j;
      }
    }
  }
  free(tiles);
}

/********************************/
int dynamic()
{
  int i, j;
  unsigned int k;
  int *tiles_storage;
  int **tiles;

  tiles_storage = malloc(80 * 50 * sizeof(int));
  tiles = malloc(80 * sizeof(int *));

  for(i = 0; i < 80; i++) {
    tiles[i] = tiles_storage + (i * 50);
  }

  for(k = 0; k < 1000000; k++) {
 
    for(i = 0; i < 80; i++) {
      for(j = 0; j < 50; j++) {
        tiles[i][j] = i + j;
      }
    }
  }
  free(tiles);
  free(tiles_storage);
}

/********************************/
int main()
{
  time_t clock0, clocks;
  int ret;

  clock0 = clock();
  declared();
  clocks = clock();
  printf("declared = %9d\n", clocks-clock0);

  clock0 = clock();
  dynamic_but_indexed();
  clocks = clock();
  printf("indexed =  %9d\n", clocks-clock0);

  clock0 = clock();
  ret = dynamic();
  clocks = clock();
  printf("dynamic =  %9d\n", clocks-clock0);

  return 0;
}

        .file   "dynarr_00.c"
        .version        "01.01"
gcc2_compiled.:
                .section        .rodata
.LC0:
        .string "declared = %9d\n"
.LC1:
        .string "indexed =  %9d\n"
.LC2:
        .string "dynamic =  %9d\n"
.text
        .align 4
.globl main
        .type    main,@function
main:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        subl    $16028, %esp
        call    clock
        movl    %eax, -16028(%ebp)
        leal    -16024(%ebp), %eax
        xorl    %edi, %edi
        movl    %eax, -16032(%ebp)
        .p2align 2
.L73:
        xorl    %esi, %esi
        movl    -16032(%ebp), %ebx
        .p2align 2
.L76:
        movl    $49, %ecx
        leal    196(%ebx), %edx
        leal    49(%esi), %eax
        .p2align 2
.L79:
        movl    %eax, (%edx)
        subl    $4, %edx
        decl    %eax
        decl    %ecx
        jns     .L79
        incl    %esi
        addl    $200, %ebx
        cmpl    $79, %esi
        jle     .L76
        incl    %edi
        cmpl    $999999, %edi
        jbe     .L73
        call    clock
        subl    $8, %esp
        subl    -16028(%ebp), %eax
        pushl   %eax
        pushl   $.LC0
        call    printf
        call    clock
        movl    $16000, (%esp)
        movl    %eax, -16028(%ebp)
        call    malloc
        movl    %eax, %edi
        movl    $0, -16036(%ebp)
        addl    $16, %esp
        .p2align 2
.L89:
        xorl    %esi, %esi
        movl    %edi, %ebx
        .p2align 2
.L92:
        movl    $49, %eax
        leal    196(%ebx), %ecx
        leal    49(%esi), %edx
        .p2align 2
.L95:
        movl    %edx, (%ecx)
        subl    $4, %ecx
        decl    %edx
        decl    %eax
        jns     .L95
        incl    %esi
        addl    $200, %ebx
        cmpl    $79, %esi
        jle     .L92
        incl    -16036(%ebp)
        cmpl    $999999, -16036(%ebp)
        jbe     .L89
        subl    $12, %esp
        pushl   %edi
        call    free
        addl    $16, %esp
        call    clock
        subl    $8, %esp
        subl    -16028(%ebp), %eax
        pushl   %eax
        pushl   $.LC1
        call    printf
        call    clock
        movl    $16000, (%esp)
        movl    %eax, -16028(%ebp)
        call    malloc
        movl    $320, (%esp)
        movl    %eax, -16040(%ebp)
        call    malloc
        movl    %eax, %esi
        movl    -16040(%ebp), %edx
        addl    $16, %esp
        movl    $79, %ebx
        leal    316(%esi), %ecx
        addl    $15800, %edx
        .p2align 2
.L105:
        movl    %edx, (%ecx)
        subl    $4, %ecx
        subl    $200, %edx
        decl    %ebx
        jns     .L105
        xorl    %edi, %edi
        .p2align 2
.L110:
        xorl    %ebx, %ebx
        .p2align 2
.L113:
        movl    (%esi,%ebx,4), %ecx
        movl    $49, %eax
        addl    $196, %ecx
        leal    49(%ebx), %edx
        .p2align 2
.L116:
        movl    %edx, (%ecx)
        subl    $4, %ecx
        decl    %edx
        decl    %eax
        jns     .L116
        incl    %ebx
        cmpl    $79, %ebx
        jle     .L113
        incl    %edi
        cmpl    $999999, %edi
        jbe     .L110
        subl    $12, %esp
        pushl   %esi
        call    free
        popl    %eax
        pushl   -16040(%ebp)
        call    free
        addl    $16, %esp
        call    clock
        subl    $8, %esp
        subl    -16028(%ebp), %eax
        pushl   %eax
        pushl   $.LC2
        call    printf
        addl    $16, %esp
        leal    -12(%ebp), %esp
        popl    %ebx
        popl    %esi
        xorl    %eax, %eax
        popl    %edi
        popl    %ebp
        ret
.Lfe1:
        .size    main,.Lfe1-main
        .align 4
.globl declared
        .type    declared,@function
declared:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        subl    $16012, %esp
        xorl    %edi, %edi
        .p2align 2
.L21:
        leal    -24(%ebp), %eax
        xorl    %esi, %esi
        movl    %eax, %ebx
        .p2align 2
.L25:
        movl    $49, %ecx
        leal    -15804(%ebx), %edx
        leal    49(%esi), %eax
        .p2align 2
.L29:
        movl    %eax, (%edx)
        subl    $4, %edx
        decl    %eax
        decl    %ecx
        jns     .L29
        incl    %esi
        addl    $200, %ebx
        cmpl    $79, %esi
        jle     .L25
        incl    %edi
        cmpl    $999999, %edi
        jbe     .L21
        addl    $16012, %esp
        popl    %ebx
        popl    %esi
        popl    %edi
        popl    %ebp
        ret
.Lfe2:
        .size    declared,.Lfe2-declared
        .align 4
.globl dynamic_but_indexed
        .type    dynamic_but_indexed,@function
dynamic_but_indexed:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        subl    $24, %esp
        pushl   $16000
        call    malloc
        movl    %eax, %edi
        movl    $0, -16(%ebp)
        addl    $16, %esp
        .p2align 2
.L37:
        xorl    %esi, %esi
        movl    %edi, %ebx
        .p2align 2
.L41:
        movl    $49, %eax
        leal    196(%ebx), %ecx
        leal    49(%esi), %edx
        .p2align 2
.L45:
        movl    %edx, (%ecx)
        subl    $4, %ecx
        decl    %edx
        decl    %eax
        jns     .L45
        incl    %esi
        addl    $200, %ebx
        cmpl    $79, %esi
        jle     .L41
        incl    -16(%ebp)
        cmpl    $999999, -16(%ebp)
        jbe     .L37
        subl    $12, %esp
        pushl   %edi
        call    free
        addl    $16, %esp
        leal    -12(%ebp), %esp
        popl    %ebx
        popl    %esi
        popl    %edi
        popl    %ebp
        ret
.Lfe3:
        .size    dynamic_but_indexed,.Lfe3-dynamic_but_indexed
        .align 4
.globl dynamic
        .type    dynamic,@function
dynamic:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        subl    $24, %esp
        pushl   $16000
        call    malloc
        movl    $320, (%esp)
        movl    %eax, -16(%ebp)
        call    malloc
        movl    %eax, %esi
        movl    -16(%ebp), %edx
        addl    $16, %esp
        movl    $79, %ebx
        leal    316(%esi), %ecx
        addl    $15800, %edx
        .p2align 2
.L53:
        movl    %edx, (%ecx)
        subl    $4, %ecx
        subl    $200, %edx
        decl    %ebx
        jns     .L53
        xorl    %edi, %edi
        .p2align 2
.L58:
        xorl    %ebx, %ebx
        .p2align 2
.L62:
        movl    (%esi,%ebx,4), %ecx
        movl    $49, %eax
        addl    $196, %ecx
        leal    49(%ebx), %edx
        .p2align 2
.L66:
        movl    %edx, (%ecx)
        subl    $4, %ecx
        decl    %edx
        decl    %eax
        jns     .L66
        incl    %ebx
        cmpl    $79, %ebx
        jle     .L62
        incl    %edi
        cmpl    $999999, %edi
        jbe     .L58
        subl    $12, %esp
        pushl   %esi
        call    free
        popl    %edx
        pushl   -16(%ebp)
        call    free
        addl    $16, %esp
        leal    -12(%ebp), %esp
        popl    %ebx
        popl    %esi
        popl    %edi
        popl    %ebp
        ret
.Lfe4:
        .size    dynamic,.Lfe4-dynamic
        .ident  "GCC: (GNU) 2.96 20000731 (Red Hat Linux 7.0)"




[Prev in Thread] Current Thread [Next in Thread]