[Freeciv-Dev] Re: [RFC] Path finding implementation.

To: "Ross W. Wetmore" <rwetmore@xxxxxxxxxxxx>
Cc: Gregory Berkolaiko <Gregory.Berkolaiko@xxxxxxxxxxxx>, Freeciv Development List <freeciv-dev@xxxxxxxxxxx>
Subject: [Freeciv-Dev] Re: [RFC] Path finding implementation.
From: Raimar Falke <rf13@xxxxxxxxxxxxxxxxx>
Date: Sat, 29 Jun 2002 17:09:44 +0200

On Sat, Jun 29, 2002 at 09:56:31AM -0400, Ross W. Wetmore wrote:
> 
> This sort of (unnecessary) attempt to duplicate compiler optimizations
> has been tried and shown to be less efficient. Compiler writers and
> chip makers have worked really hard over the years to make these things
> as efficient as possible as part of the "benchmark" wars. The assumption
> that one can hack code to improve on their efforts is always interesting
> to watch.
> 
> The best thing to do is stay out of the way of the compiler by not
> inserting special algorithmic hacks, or optimization killers like
> (inline) function calls.
> 
> 
> When using functions or inline functions (which are the worst
> of all worlds) rather than macros to do the address computations,
> one blocks the compiler from generating the sort of efficient
> LEA instructions that do (index * base +offset) calculations in a 
> single instruction cycle.
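
For reference, the two styles being compared look roughly like this (a
sketch; MAP_INDEX, map_index and MAP_YSIZE are illustrative names, not
the actual freeciv macros):

  #define MAP_YSIZE 50   /* example value */

  /* macro form: the compiler sees the raw expression and can fold it
     into a single lea (base + index*scale + offset) */
  #define MAP_INDEX(x, y) ((x) * MAP_YSIZE + (y))

  /* function form: the same computation behind a call boundary */
  static int map_index(int x, int y)
  {
    return x * MAP_YSIZE + y;
  }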

> In general the compiler will sort out common expressions and cache
> the results in registers, unless one does things to defeat its pattern
> recognition or violate its allowed optimization constraints.

Ack.

> The basic problem with your lattice system is you are not allocating
> large blocks of cleared memory, either a square array, or a linear
> vector. If you do try to implement (foolish) sparse systems, the
> preset time will eventually kill you, but the solution is not to go
> to individual allocations, but a block allocation of working cells
> that are "allocated" by setting an indirection index. Free and active
> queues of cells can be used to manage this if needed, but I think you
> will find the usual pattern is pretty much a one-way run of allocates
> followed by a full reset.
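
The block-allocation pattern described here might look like this (a
minimal sketch; struct cell's contents and the pool names are
illustrative, not from any actual patch):

  #include <string.h>

  struct cell { int cost; /* ... whatever the search stores ... */ };

  struct cell_pool {
    struct cell *cells;   /* one large block, allocated once */
    int used;             /* next free slot: the "indirection index" */
    int size;
  };

  /* "allocate" a cell by handing out the next index */
  static struct cell *pool_get(struct cell_pool *p)
  {
    return (p->used < p->size) ? &p->cells[p->used++] : NULL;
  }

  /* the usual pattern: a one-way run of allocates, then a full reset */
  static void pool_reset(struct cell_pool *p)
  {
    memset(p->cells, 0, p->used * sizeof(*p->cells));
    p->used = 0;
  }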
> 
> Some of a previous email trail below ...
> 
> Cheers,
> RossW
> =====
> 
> At 12:35 PM 02/06/29 +0200, Raimar Falke wrote:
> >On Fri, Jun 28, 2002 at 10:28:49PM +0100, Gregory Berkolaiko wrote:
> >> I put in a priority queue instead of the bucket list and got a very
> >> significant improvement, from around 11sec to 7.9sec.  I think it is
> >> mostly because of the huge memory allocations that were necessary for
> >> the bucket list.
> >
> >I have also restarted working on this. I also introduced a heap
> >now. I'm currently fighting with a nice effect: the algo finds a
> >shorter path to a tile. But this tile is in the heap and so the heap
> >may become invalid. It has to be rebuilt. Took me quite some time to
> >catch this.
> >
> >> Also, much to my surprise, when I set the lattice sizes to the exact
> >> map sizes, as opposed to some power of 2, I also got a good speed
> >> improvement.  Strange, I thought it was much faster to do x % 128 than
> >> to do x % 60...  Maybe my compiler does not optimize x % 128 to x & 127 ??
> >
> >gcc -S is your friend. But I'm sure that it does.
> >
> >> Or the memory allocation is that expensive??
> >
> >Could be. Maybe also cache thrashing.
> >
> >> > You can avoid the multiplication by using a 2D array.
> >
> >I implemented this (and removed the hash). And it is quite painless:
> >
> >struct pf_map {
> >  struct cell **cells;
> >...
> >};
> >
> >build up:
> >
> >  pf_map->cells = fc_malloc(sizeof(*pf_map->cells)*map.xsize);
> >  for (x = 0; x < map.xsize; x++) {
> >    pf_map->cells[x] = fc_malloc(sizeof(*pf_map->cells[x]) * map.ysize);
> >    memset(pf_map->cells[x],0,sizeof(*pf_map->cells[x]) * map.ysize);
> >  }
> >
> >accessing:
> >
> >static struct cell *get_cell(pf_map_t pf_map, int x, int y)
> >{
> >  struct cell *result=&pf_map->cells[x][y];
> >...
> >
> >     Raimar
> >
> >-- 
> > email: rf13@xxxxxxxxxxxxxxxxx
> >  The trick is to keep breathing.
> =====
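
(On the heap problem mentioned in the trail above: rather than
rebuilding the whole heap when a shorter path is found, a common fix is
a decrease-key style sift-up. A sketch, assuming an array-backed binary
min-heap where each cell carries its own heap position; the field names
are illustrative:)

  struct cell { int cost; int heap_pos; /* ... */ };

  struct heap {
    struct cell **nodes;  /* invariant: nodes[i]->heap_pos == i */
    int size;
  };

  static void sift_up(struct heap *h, int i)
  {
    while (i > 0) {
      int parent = (i - 1) / 2;
      struct cell *tmp;

      if (h->nodes[parent]->cost <= h->nodes[i]->cost)
        break;
      /* swap with the parent and keep the back-pointers in sync */
      tmp = h->nodes[parent];
      h->nodes[parent] = h->nodes[i];
      h->nodes[i] = tmp;
      h->nodes[parent]->heap_pos = parent;
      h->nodes[i]->heap_pos = i;
      i = parent;
    }
  }

  /* on finding a shorter path to a tile already in the heap:
       cell->cost = new_cost;
       sift_up(heap, cell->heap_pos);  */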
> 
> 
> Facts can really ruin a good logical argument :-).
> 
> Your numbers are a case in point, so I had to poke around with things
> to see what was going on. Below is a slightly modified version of your 
> program that uses clock() to get the user+system CPU times as accurately
> as possible.
>
> I compiled with `cc -O3 -o dynarr_00 dynarr_00.c'. The assembly for
> this was obtained with `cc -O3 -S dynarr_00.c'.

> You will note that it inlined the functions in main.

That is what -O3 does with gcc.
 
> The sample times contradict your data and reinforce my argument. But
> the differences are incredibly subtle, and thus my guess is that in a
> real-life programming example there is lots of room for things to
> change. These were run on a PIII-866. It is also interesting that we
> both show that static is the worst, which is really counterintuitive
> :-)
> 
> The inner loops in main run from L.76, L.95, L.116 and end at the 
> next "jns" instruction in each case. One needs to look at the three 
> branch targets above the loop and the two sections below the jns to 
> see the code differences. The inner loops are all pretty much the 
> same, i.e. optimized to register accesses.
> 
> Note that no one ever really does a multiply, which sort of puts a hole
> in your argument. Array index computations really are heavily optimized.
> 
> The subtle difference between static and indexed is that static does
> a memory lookup at the outer loop entry, whereas the indexed version
> uses the edi register value. There is an inc and compare that uses
> memory at the bottom of the indexed loop, but it too is accessing the 
> stack (ebp) which is virtually always cached and thus runs at 
> register speeds. Intel chips all use the stack as a big register
> cache, which in the RISC-CISC wars largely nullified the RISC claims
> of performance benefit from loads of on-chip registers. Intel used
> the chip resources to do other optimizations, which won that war for
> them :-). Although the compare nominally "reads" memory, if you think
> about it, the value read and inc'd is probably still in the chip
> pipeline and thus won't really be refetched. You could probably add a 
> "volatile" attribute somewhere to force things to always refetch real 
> memory values if you want to experiment. 
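
For that experiment, declaring the array as e.g.

  volatile int tiles[80 * 50];  /* every access must hit real memory */

would forbid the compiler from caching the values in registers.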
> 
> I'm actually at a loss to understand just what is different between a
> stack access in the outer loop jump code at top or bottom, unless it
> is the fact that %ebx is needed two instructions down from the top
> and its value may not be available to the pipeline after a memory vs
> register access, causing a pipeline stall for a cycle or so.
> 
> The dynamic case is reading from memory in these cases. These are not
> stack accesses, but linear accesses to real memory, though in the
> simple case here they will likely result in a cache miss on only a few
> accesses (depending on cache line size), and register speed for the
> next few lookups. This can change in a real programming example where
> other memory lookups compete for the same cache space and thus
> tiles[] is dropped from cache before the next iteration. It is
> also the case that the indexed loop computation cannot be optimized as
> much, as shown by the fact that it takes a few more instructions to
> handle both loop indices and array access indices. It is still
> somewhat intriguing that it manages to beat the static case, but that
> seems to be all due to the thinness of the loop setup, which is also
> where the indexed case seems to have managed its win :-).

Long explanation. But misses the point.

> /********************************/
> void dynamic_but_indexed()
> {
> ...
>     for(i = 0; i < 80; i++) {
>       for(j = 0; j < 50; j++) {
>         tiles[j + i * 50] = i + j;
>       }
>     }

is compiled to

> .L41:
>       movl    $49, %eax
>       leal    196(%ebx), %ecx
>       leal    49(%esi), %edx
> .L45:
>       movl    %edx, (%ecx)
>       subl    $4, %ecx
>       decl    %edx
>       decl    %eax
>       jns     .L45
>       incl    %esi
>       addl    $200, %ebx
>       cmpl    $79, %esi
>       jle     .L41

which is nice work from the compiler. It is essentially:

  int *p=tiles;
  int x,y;
  for(x=0;x<80;x++) {
    for(y=49;y>=0;y--) {
      p[y]=x+y;
    }
    p+=50;
  }

but this optimization is only possible because the compiler knows the
array sizes.

If the 80 and 50 are replaced with global variables (unsigned int, to
get simpler assembly), you get something like this:

(stack offsets: -24(%ebp) == ysize, -20(%ebp) == tiles, -36(%ebp) == xsize)

.L43:
        movl    -24(%ebp), %ebx
        xorl    %ecx, %ecx
        cmpl    %ebx, %ecx
        jae     .L53
.L47:
        movl    %esi, %eax
        movl    -20(%ebp), %edi
        imull   %ebx, %eax
        leal    (%ecx,%esi), %edx
        leal    (%eax,%ecx), %eax
        incl    %ecx
        movl    %edx, (%edi,%eax,4)
        cmpl    %ebx, %ecx
        jb      .L47
        movl    -36(%ebp), %edx
.L53:
        incl    %esi
        cmpl    %edx, %esi
        jb      .L43
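
The C that produces this looks roughly as follows (a sketch; the sizes
are globals as described, tiles is a local as in the original program):

  #include <stdlib.h>

  unsigned int xsize = 80, ysize = 50;

  void indexed_global(void)
  {
    unsigned int i, j;
    int *tiles = malloc(xsize * ysize * sizeof(*tiles));

    for (i = 0; i < xsize; i++) {
      for (j = 0; j < ysize; j++) {
        tiles[j + i * ysize] = i + j;  /* the imull shows up here */
      }
    }
    free(tiles);
  }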

And now the dynamic case:

        movl    ysize, %eax
.L67:
        xorl    %edx, %edx
        cmpl    %eax, %edx
        jae     .L78
        movl    (%esi,%ebx,4), %ecx
.L71:
        leal    (%edx,%ebx), %eax
        movl    %eax, (%ecx,%edx,4)
        incl    %edx
        movl    ysize, %eax
        cmpl    %eax, %edx
        jb      .L71
        movl    xsize, %ecx
.L78:
        incl    %ebx
        cmpl    %ecx, %ebx
        jb      .L67

So in the dynamic case we have no imull but an extra memory access. The
other instructions are cheap.
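
The dynamic variant that generates this looks roughly like (again a
sketch; each row is assumed to be malloc'd separately, as shown
earlier):

  unsigned int xsize = 80, ysize = 50;

  void dynamic(int **tiles)
  {
    unsigned int i, j;
    for (i = 0; i < xsize; i++) {
      for (j = 0; j < ysize; j++) {
        tiles[i][j] = i + j;  /* row-pointer load, then indexed store */
      }
    }
  }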

> [Aside] Note that map_inx() loops over x before y, which is the reverse
> of tile[x][y] lookups. There may be subtle things happening in places
> like GUI tiling if you change the order of things. It is also better
> for optimization to keep the inner loop bigger and do fewer outer
> loop iterations; see the sketch below.
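
Illustrative only: with a row-major tiles[x][y] layout, y belongs in the
inner loop so that successive stores touch adjacent memory:

  void clear_tiles(int **tiles, int xsize, int ysize)
  {
    int x, y;
    for (x = 0; x < xsize; x++) {
      for (y = 0; y < ysize; y++) {
        tiles[x][y] = 0;   /* inner loop walks adjacent memory */
      }
    }
  }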
> 
> Cheers,
> RossW
> =====
> 
> root@wetrec[318] ./dynarr_00
> declared =  15220000
> indexed =   10680000
> dynamic =   15010000

$ ./a.out
declared =   4000000
indexed =    6430000
dynamic =    4060000

        Raimar

-- 
 email: rf13@xxxxxxxxxxxxxxxxx
  "brand memory are for windows users that think their stability
   problems come from the memory"
    -- bomek in #freeciv

Attachment: a.c
Description: Text document

