To: "Ross W. Wetmore" <rwetmore@xxxxxxxxxxxx>
Cc: Gregory Berkolaiko <Gregory.Berkolaiko@xxxxxxxxxxxx>, Freeciv Development List <freeciv-dev@xxxxxxxxxxx>
Subject: [Freeciv-Dev] Re: [RFC] Path finding implementation.
From: Raimar Falke <rf13@xxxxxxxxxxxxxxxxx>
Date: Sun, 30 Jun 2002 16:31:30 +0200

On Sun, Jun 30, 2002 at 06:22:45AM -0400, Ross W. Wetmore wrote:
> At 05:09 PM 02/06/29 +0200, Raimar Falke wrote:
> >On Sat, Jun 29, 2002 at 09:56:31AM -0400, Ross W. Wetmore wrote:
> >> 
> >> This sort of (unnecessary) attempt to duplicate compiler optimizations
> >> has been tried and shown to be less efficient. Compiler writers and
> >> chip makers have worked really hard over the years to make these things
> >> as efficient as possible as part of the "benchmark" wars. The assumption
> >> that one can hack code to improve on their efforts is always interesting
> >> to watch.
> >> 
> >> The best thing to do is stay out of the way of the compiler by not
> >> inserting special algorithmic hacks, or optimization killers like
> >> (inline) function calls.
> >> 
> >> 
> >> When using functions or inline functions (which are the worst
> >> of all worlds) rather than macros to do the address computations,
> >> one blocks the compiler from generating the sort of efficient
> >> LEA instructions that do (index * base +offset) calculations in a 
> >> single instruction cycle.
> >
> >> In general the compiler will sort out common expressions and cache
> >> the results in registers unless one does things to defeat its pattern
> >> recognition, or allowed optimization constraints.
> >
> >Ack.
> 
> Glad you showed you recognized this before using it to write some poor
> code to try and make a worthless point :-).
> 
> [...]
> >Long explanation. But misses the point.
> 
> Actually, it hit the point. You are missing the point about letting
> the compiler handle things when you use a bad code example to try and
> reverse the timings.
> 
> >> /********************************/
> >> void dynamic_but_indexed()
> >> {
> >...
> >>     for(i = 0; i < 80; i++) {
> >>       for(j = 0; j < 50; j++) {
> >>         tiles[j + i * 50] = i + j;
> >>       }
> >>     }
> >
> >is compiled to
> >
> >> .L41:
> >>       movl    $49, %eax
> >>       leal    196(%ebx), %ecx
> >>       leal    49(%esi), %edx
> >> .L45:
> >>       movl    %edx, (%ecx)
> >>       subl    $4, %ecx
> >>       decl    %edx
> >>       decl    %eax
> >>       jns     .L45
> >>       incl    %esi
> >>       addl    $200, %ebx
> >>       cmpl    $79, %esi
> >>       jle     .L41
> >
> >which is nice work from the compiler. It is essentially:
> >
> >  int *p = tiles;
> >  int x, y;
> >  for (x = 0; x < 80; x++) {
> >    for (y = 49; y >= 0; y--) {
> >      p[y] = x + y;
> >    }
> >    p += 50;
> >  }
> >
> >but this optimization is only possible because the compiler knows
> >the array sizes.
> 
> It doesn't know anything about array sizes. This was a malloc'd block.
> It does know about loop values though.
> 
> However, you correctly state the point: if you let the compiler do
> the optimizations, it will choose an appropriate one for this case
> from many different possibilities.
> 
> >If the 80 and 50 are replaced with global vars (unsigned int to get an
> >easier assembly) you get something like this:
> 
> Which is a very stupid thing to do, as globals block the compiler from
> optimizing by forcing a load from memory. It is the same performance 
> killer as using a function call to do a one liner. One presumes most
> people design code to do the efficient thing and not the inefficient, 
> and will choose the former over the latter when given a choice.
> 
> Globals are evil. Repeat it to yourself 10 times ...
> 
> Store the loop values in a local first, and you will find they get placed 
> in registers with simpler precomputed form in the outer loop, rather than 
> loaded from memory every time and the computation redone explicitly in
> the inner loop. 
> 
> This code is essentially the same as the original case except the hard
> constants in the outer loop are replaced by stack memory loads (which 
> are almost as fast as registers). The extra memory access as well as the 
> more expensive imull instructions are now gone.
> 
> >-24 == ysize
> >-20 == tiles
> >-36 == xsize
> >
> >.L43:
> >        movl    -24(%ebp), %ebx
> >        xorl    %ecx, %ecx
> >        cmpl    %ebx, %ecx
> >        jae     .L53
> >.L47:
> >        movl    %esi, %eax
> >        movl    -20(%ebp), %edi
> >        imull   %ebx, %eax
> >        leal    (%ecx,%esi), %edx
> >        leal    (%eax,%ecx), %eax
> >        incl    %ecx
> >        movl    %edx, (%edi,%eax,4)
> >        cmpl    %ebx, %ecx
> >        jb      .L47
> >        movl    -36(%ebp), %edx
> >.L53:
> >        incl    %esi
> >        cmpl    %edx, %esi
> >        jb      .L43
> >
> >And now the dynamic case:
> >
> >        movl    ysize, %eax
> >.L67:
> >        xorl    %edx, %edx
> >        cmpl    %eax, %edx
> >        jae     .L78
> >        movl    (%esi,%ebx,4), %ecx
> >.L71:
> >        leal    (%edx,%ebx), %eax
> >        movl    %eax, (%ecx,%edx,4)
> >        incl    %edx
> >        movl    ysize, %eax
> >        cmpl    %eax, %edx
> >        jb      .L71
> >        movl    xsize, %ecx
> >.L78:
> >        incl    %ebx
> >        cmpl    %ecx, %ebx
> >        jb      .L67
> >
> >So in the dynamic case we have no imull but an extra memory access.
> >The other instructions are cheap.
> 
> Which is basically the same code as before because it still does the
> index memory lookup, and doesn't have to precompute the value, i.e. no
> compiler optimization was broken here because this is the step that
> you took over from the compiler.

It looks like I did not make my point clear in the previous post. The
dynamic_but_indexed code has to do a multiplication. However, if the
compiler knows the position and/or the sizes, it can optimize this
away. In the path_finding case the compiler knows nothing. To model
this I made the array sizes globals. While I agree with you that
globals should in general be avoided, here they solve the problem:
they prevent the compiler from assuming anything about the sizes. You
can also pass the sizes as parameters...you get the same result.
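
To make this concrete, here is roughly the kind of code I mean once the
sizes come in as parameters. This is only a sketch with made-up names
(fill_indexed and so on), not the code I actually timed:

/* Fill a flat malloc'd array whose sizes are only known at run
 * time. */
void fill_indexed(int *tiles, int xsize, int ysize)
{
  int i, j;

  for (i = 0; i < xsize; i++) {
    for (j = 0; j < ysize; j++) {
      /* j + i * ysize: with ysize not a compile-time constant the
       * compiler has to emit a run-time multiply here (or keep an
       * equivalent running offset; in the assembly above it chose
       * the multiply). */
      tiles[j + i * ysize] = i + j;
    }
  }
}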

So in the general case dynamic_but_indexed has to do a multiplication
with unknown factors, and that is the problem spot, because such imull
instructions are expensive: more expensive than a memory access. I
know that this last statement only holds for certain architectures and
that in the future the imull may become cheaper.
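
For comparison, the access-based alternative goes through a table of
row pointers, so reaching a tile costs an extra load but no multiply.
Again just a sketch with made-up names:

/* The same fill through a table of row pointers.  Loading tiles[i]
 * is the extra memory access, but no multiplication is needed to
 * reach tiles[i][j]. */
void fill_dynamic(int **tiles, int xsize, int ysize)
{
  int i, j;

  for (i = 0; i < xsize; i++) {
    for (j = 0; j < ysize; j++) {
      tiles[i][j] = i + j;
    }
  }
}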

> The other bad thing about the dynamic reference system you propose is 
> that it is very easy to generate memory leaks, and very clumsy to set up 
> and tear down. It mistakenly gives people the idea they are dealing with
> a single double indexed array, rather than a subtly complex set of
> references. Coding bugs and errors are nasty.

I don't see a problem here.
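
For what it is worth the whole setup can be a single malloc and the
teardown a single free, so there is not much room for leaks. Another
untested sketch with made-up names:

#include <stdlib.h>

/* Allocate the row-pointer table and the tile data in one block, so
 * that free(tiles) releases everything at once. */
int **alloc_tiles(int xsize, int ysize)
{
  int **tiles = malloc(xsize * sizeof(*tiles)
                       + xsize * ysize * sizeof(**tiles));
  int *data;
  int i;

  if (!tiles) {
    return NULL;
  }
  /* The tile data starts right behind the pointer table. */
  data = (int *) (tiles + xsize);
  for (i = 0; i < xsize; i++) {
    tiles[i] = data + i * ysize;
  }
  return tiles;
}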

> >> root@wetrec[318] ./dynarr_00
> >> declared =  15220000
> >> indexed =   10680000
> >> dynamic =   15010000
> >
> >$ ./a.out
> >declared =   4000000
> >indexed =    6430000
> >dynamic =    4060000

If I run the attached program I get (P2-400):

mull   = 3.860000s
access = 3.800000s

On a sparc ultra-2 (RISC so no mull opcode) the numbers are much
worse:

mull   = 31.469999s
access = 7.880000s
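
In case the attachment gets stripped along the way: the comparison is
of the kind sketched below, an index that needs a run-time multiply
against an access through a row pointer. This is only an illustration
of the idea, reusing the sketches above (fill_indexed, fill_dynamic,
alloc_tiles); it is not the attached b.c:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define REPEAT 10000

int main(void)
{
  int xsize = 80, ysize = 50;
  int *flat = malloc(xsize * ysize * sizeof(*flat));
  int **rows = alloc_tiles(xsize, ysize);
  clock_t t;
  int n;

  if (!flat || !rows) {
    return 1;
  }

  t = clock();
  for (n = 0; n < REPEAT; n++) {
    fill_indexed(flat, xsize, ysize);
  }
  printf("mull   = %fs\n", (double) (clock() - t) / CLOCKS_PER_SEC);

  t = clock();
  for (n = 0; n < REPEAT; n++) {
    fill_dynamic(rows, xsize, ysize);
  }
  printf("access = %fs\n", (double) (clock() - t) / CLOCKS_PER_SEC);

  free(flat);
  free(rows);
  return 0;
}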

        Raimar

-- 
 email: rf13@xxxxxxxxxxxxxxxxx
 "Sit, disk, sit. Good boy. Now spin up. Very good. Here's a netscape
  cookie for you. Fetch me some data. Come on, you can do it. No, not that
  data. Bad disk. Bad." 
    -- Calle Dybedahl, alt.sysadmin.recovery

Attachment: b.c
Description: Text document

