[Freeciv-Dev] Re: [RFC] Path finding implementation.
[Top] [All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index] [Thread Index]
This sort of (unnecessary) attempt to duplicate compiler optimizations
has been tried and shown to be less efficient. Compiler writers and
chip makers have worked really hard over the years to make these things
as efficient as possible as part of the "benchmark" wars. The assumption
that one can hack code to improve on their efforts is alwasy interesting
to watch.
The best thing to do is stay out of the way of the compiler by not
inserting special algorithmic hacks, or optimization killers like
(inline) function calls.
When using functions or inline functions (which are the worst
of all worlds) rather than macros to do the address computations,
one blocks the compiler from generating the sort of efficient
LEA instructions that do (index * base +offset) calculations in a
single instruction cycle.
In general the compiler will sort out common expressions and cache
the results in registers unless one does things to defeat its pattern
recognition, or allowed optimization constraints.
The basic problem with your lattice system is you are not allocating
large blocks of cleared memory, either a square array, or a linear
vector. If you do try to implement (foolish) sparse systems, the
preset time will eventually kill you, but the solution is not to go
to individual allocations, but a block allocation of working cells
that are "allocated" by setting an indirection index. Free and active
queues of cells can be used to manage this if needed, but I think you
will find that it is pretty much a oneway list of allocates followed
by a full reset that is the usual pattern.
Some of a previous email trail below ...
Cheers,
RossW
=====
At 12:35 PM 02/06/29 +0200, Raimar Falke wrote:
>On Fri, Jun 28, 2002 at 10:28:49PM +0100, Gregory Berkolaiko wrote:
>> I put in priority queue instead of bucket list and got a very significant
>> improvement, from around 11sec to 7.9sec. I think it is mostly because of
>> huge memory allocations that were necessary to do bucket list.
>
>I have also restarted working on this. I also introduced a heap
>now. I'm currently fighting with a nice effect: the algo finds a
>shorter path to a tile. But this tile is in the heap and so the heap
>may become invalid. It has to be rebuilt. Took me quite some time to
>catch this.
>
>> Also, much to my surprise, when I put the lattice sizes to be the exact
>> map sizes, as opposed to some power of 2, I also get good speed
>> improvement. Strange, I thought it is much faster to do x % 128 than to
>> do x % 60... Maybe my compiler does not optimize x % 128 to x | 127 ??
>
>gcc -S is your friend. But I'm sure that it does.
>
>> Or the memory allocation is that expensive??
>
>Could be. Maybe also cache trashing.
>
>> > You can avoid the multiplication by using a 2D array.
>
>I implemented this (and removed the hash). And it is quite painless:
>
>struct pf_map {
> struct cell **cells;
>...
>};
>
>build up:
>
> pf_map->cells = fc_malloc(sizeof(*pf_map->cells)*map.xsize);
> for (x = 0; x < map.xsize; x++) {
> pf_map->cells[x] = fc_malloc(sizeof(*pf_map->cells[x]) * map.ysize);
> memset(pf_map->cells[x],0,sizeof(*pf_map->cells[x]) * map.ysize);
> }
>
>accessing:
>
>static struct cell *get_cell(pf_map_t pf_map, int x, int y)
>{
> struct cell *result=&pf_map->cells[x][y];
>...
>
> Raimar
>
>--
> email: rf13@xxxxxxxxxxxxxxxxx
> The trick is to keep breathing.
=====
Facts can really ruin a good logical argument :-).
Your numbers are a case in point, so I had to poke around with things
to see what was going on. Below is a slightly modified version of your
program that uses clock() to get the user+system CPU times as accurately
as possible.
I compiled `cc -O3 -o dynarr_00 dynarr_00.c". The assembly for this was
obtained by `cc -O3 -s dynarr_00.c". You will note that it inlined the
functions in main.
The sample times contradict your data and reinforce my argument. But
the differences are incredibly subtle and thus my guess is that in a
real life programming example, there is lots of room for things to
change. These were run on a PIII-866. It is also interesting that we
both show that static is the worst which is really counter intuitive
:-)
The inner loops in main run from L.76, L.95, L.116 and end at the
next "jns" instruction in each case. One needs to look at the three
branch targets above the loop and the two sections below the jns to
see the code differences. The inner loops are all pretty much the
same, i.e. optimized to register accesses.
Note that no one ever really does a multiply which sort of puts a hole
in your argument. Array index computations really are heavily optimized.
The subtle difference between static and indexed is that static does
a memory lookup at the outer loop entry, whereas the indexed version
uses the edi register value. There is an inc and compare that uses
memory at the bottom of the indexed loop, but it too is accessing the
stack (ebp) which is virtually always cached and thus runs at
register speeds. Intel chips all use the stack as a big register
cache which in the RISC-CISC wars largely nullified the RISC claims
of performance benefit from loads of on chip registers. Intel used
the chip resources to do other optimizations which won that war for
them :-). Although the compare nominally "reads" memory. if you think
about it, the value read and inc'd is probably still in the chip
pipeline and thus won't really be refetched. You could probably add a
"volatile" attribute somewhere to force things to always refetch real
memory values if you want to experiment.
I'm actually at a loss to understand just what is different between a
stack access in the outer loop jump code at top or bottom unless it
is the fact that you need the %ebx register two instructions down from
the top and its value may not be available to the pipeline if it was a
memory vs register access causing a pipeline stall for a cycle or so.
The dynamic case is reading from memory in these cases. This is not
stack accesses, but it is linear access to real memory, though in the
simple case here will likely result in a fault on only a few accesses
(depending on cache line size), and register speed for the next few
lookups. This can change in a real programming example where there
are other memory lookups competing for the same cache space and thus
the tile[] is dropped from cache before the next iteration. It is
also the case that the index loop computation cannot be optimized as
much as shown by the fact that it takes a few more instructions to
handle both loop indices and array access indices. It is still
somewhat intriguing that it manages to beat the static case but that
seems to be all due to the thinness of the loop setup which is also
where the indexed case seems to have managed its win :-).
[Aside] Note that map_inx() loops over x before y which is the reverse
of tile[x][y] lookups. There may be subtle things happening in places
like GUI tiling if you change the order of things. It is also better
for optimization to keep the inner loop bigger, and do fewer outer
loop iterations.
Cheers,
RossW
=====
root@wetrec[318] ./dynarr_00
declared = 15220000
indexed = 10680000
dynamic = 15010000
#include<stdlib.h>
/********************************/
void declared()
{
int i, j;
unsigned int k;
int tiles[80][50];
for(k = 0; k < 1000000; k++) {
for(i = 0; i < 80; i++) {
for(j = 0; j < 50; j++) {
tiles[i][j] = i + j;
}
}
}
}
/********************************/
void dynamic_but_indexed()
{
int i, j;
unsigned int k;
int *tiles;
tiles = malloc(80 * 50 * sizeof(int));
for(k = 0; k < 1000000; k++) {
for(i = 0; i < 80; i++) {
for(j = 0; j < 50; j++) {
tiles[j + i * 50] = i + j;
}
}
}
free(tiles);
}
/********************************/
int dynamic()
{
int i, j;
unsigned int k;
int *tiles_storage;
int **tiles;
tiles_storage = malloc(80 * 50 * sizeof(int));
tiles = malloc(80 * sizeof(int *));
for(i = 0; i < 80; i++) {
tiles[i] = tiles_storage + (i * 50);
}
for(k = 0; k < 1000000; k++) {
for(i = 0; i < 80; i++) {
for(j = 0; j < 50; j++) {
tiles[i][j] = i + j;
}
}
}
free(tiles);
free(tiles_storage);
}
/********************************/
int main()
{
time_t clock0, clocks;
int ret;
clock0 = clock();
declared();
clocks = clock();
printf("declared = %9d\n", clocks-clock0);
clock0 = clock();
dynamic_but_indexed();
clocks = clock();
printf("indexed = %9d\n", clocks-clock0);
clock0 = clock();
ret = dynamic();
clocks = clock();
printf("dynamic = %9d\n", clocks-clock0);
return 0;
}
.file "dynarr_00.c"
.version "01.01"
gcc2_compiled.:
.section .rodata
.LC0:
.string "declared = %9d\n"
.LC1:
.string "indexed = %9d\n"
.LC2:
.string "dynamic = %9d\n"
.text
.align 4
.globl main
.type main,@function
main:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $16028, %esp
call clock
movl %eax, -16028(%ebp)
leal -16024(%ebp), %eax
xorl %edi, %edi
movl %eax, -16032(%ebp)
.p2align 2
.L73:
xorl %esi, %esi
movl -16032(%ebp), %ebx
.p2align 2
.L76:
movl $49, %ecx
leal 196(%ebx), %edx
leal 49(%esi), %eax
.p2align 2
.L79:
movl %eax, (%edx)
subl $4, %edx
decl %eax
decl %ecx
jns .L79
incl %esi
addl $200, %ebx
cmpl $79, %esi
jle .L76
incl %edi
cmpl $999999, %edi
jbe .L73
call clock
subl $8, %esp
subl -16028(%ebp), %eax
pushl %eax
pushl $.LC0
call printf
call clock
movl $16000, (%esp)
movl %eax, -16028(%ebp)
call malloc
movl %eax, %edi
movl $0, -16036(%ebp)
addl $16, %esp
.p2align 2
.L89:
xorl %esi, %esi
movl %edi, %ebx
.p2align 2
.L92:
movl $49, %eax
leal 196(%ebx), %ecx
leal 49(%esi), %edx
.p2align 2
.L95:
movl %edx, (%ecx)
subl $4, %ecx
decl %edx
decl %eax
jns .L95
incl %esi
addl $200, %ebx
cmpl $79, %esi
jle .L92
incl -16036(%ebp)
cmpl $999999, -16036(%ebp)
jbe .L89
subl $12, %esp
pushl %edi
call free
addl $16, %esp
call clock
subl $8, %esp
subl -16028(%ebp), %eax
pushl %eax
pushl $.LC1
call printf
call clock
movl $16000, (%esp)
movl %eax, -16028(%ebp)
call malloc
movl $320, (%esp)
movl %eax, -16040(%ebp)
call malloc
movl %eax, %esi
movl -16040(%ebp), %edx
addl $16, %esp
movl $79, %ebx
leal 316(%esi), %ecx
addl $15800, %edx
.p2align 2
.L105:
movl %edx, (%ecx)
subl $4, %ecx
subl $200, %edx
decl %ebx
jns .L105
xorl %edi, %edi
.p2align 2
.L110:
xorl %ebx, %ebx
.p2align 2
.L113:
movl (%esi,%ebx,4), %ecx
movl $49, %eax
addl $196, %ecx
leal 49(%ebx), %edx
.p2align 2
.L116:
movl %edx, (%ecx)
subl $4, %ecx
decl %edx
decl %eax
jns .L116
incl %ebx
cmpl $79, %ebx
jle .L113
incl %edi
cmpl $999999, %edi
jbe .L110
subl $12, %esp
pushl %esi
call free
popl %eax
pushl -16040(%ebp)
call free
addl $16, %esp
call clock
subl $8, %esp
subl -16028(%ebp), %eax
pushl %eax
pushl $.LC2
call printf
addl $16, %esp
leal -12(%ebp), %esp
popl %ebx
popl %esi
xorl %eax, %eax
popl %edi
popl %ebp
ret
.Lfe1:
.size main,.Lfe1-main
.align 4
.globl declared
.type declared,@function
declared:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $16012, %esp
xorl %edi, %edi
.p2align 2
.L21:
leal -24(%ebp), %eax
xorl %esi, %esi
movl %eax, %ebx
.p2align 2
.L25:
movl $49, %ecx
leal -15804(%ebx), %edx
leal 49(%esi), %eax
.p2align 2
.L29:
movl %eax, (%edx)
subl $4, %edx
decl %eax
decl %ecx
jns .L29
incl %esi
addl $200, %ebx
cmpl $79, %esi
jle .L25
incl %edi
cmpl $999999, %edi
jbe .L21
addl $16012, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.Lfe2:
.size declared,.Lfe2-declared
.align 4
.globl dynamic_but_indexed
.type dynamic_but_indexed,@function
dynamic_but_indexed:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $24, %esp
pushl $16000
call malloc
movl %eax, %edi
movl $0, -16(%ebp)
addl $16, %esp
.p2align 2
.L37:
xorl %esi, %esi
movl %edi, %ebx
.p2align 2
.L41:
movl $49, %eax
leal 196(%ebx), %ecx
leal 49(%esi), %edx
.p2align 2
.L45:
movl %edx, (%ecx)
subl $4, %ecx
decl %edx
decl %eax
jns .L45
incl %esi
addl $200, %ebx
cmpl $79, %esi
jle .L41
incl -16(%ebp)
cmpl $999999, -16(%ebp)
jbe .L37
subl $12, %esp
pushl %edi
call free
addl $16, %esp
leal -12(%ebp), %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.Lfe3:
.size dynamic_but_indexed,.Lfe3-dynamic_but_indexed
.align 4
.globl dynamic
.type dynamic,@function
dynamic:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $24, %esp
pushl $16000
call malloc
movl $320, (%esp)
movl %eax, -16(%ebp)
call malloc
movl %eax, %esi
movl -16(%ebp), %edx
addl $16, %esp
movl $79, %ebx
leal 316(%esi), %ecx
addl $15800, %edx
.p2align 2
.L53:
movl %edx, (%ecx)
subl $4, %ecx
subl $200, %edx
decl %ebx
jns .L53
xorl %edi, %edi
.p2align 2
.L58:
xorl %ebx, %ebx
.p2align 2
.L62:
movl (%esi,%ebx,4), %ecx
movl $49, %eax
addl $196, %ecx
leal 49(%ebx), %edx
.p2align 2
.L66:
movl %edx, (%ecx)
subl $4, %ecx
decl %edx
decl %eax
jns .L66
incl %ebx
cmpl $79, %ebx
jle .L62
incl %edi
cmpl $999999, %edi
jbe .L58
subl $12, %esp
pushl %esi
call free
popl %edx
pushl -16(%ebp)
call free
addl $16, %esp
leal -12(%ebp), %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.Lfe4:
.size dynamic,.Lfe4-dynamic
.ident "GCC: (GNU) 2.96 20000731 (Red Hat Linux 7.0)"
- [Freeciv-Dev] Re: [RFC] Path finding implementation., (continued)
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Raimar Falke, 2002/06/24
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Gregory Berkolaiko, 2002/06/25
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Raimar Falke, 2002/06/25
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Gregory Berkolaiko, 2002/06/26
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Raimar Falke, 2002/06/26
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Gregory Berkolaiko, 2002/06/26
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Raimar Falke, 2002/06/26
[Freeciv-Dev] Re: [RFC] Path finding implementation., Gregory Berkolaiko, 2002/06/28
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Raimar Falke, 2002/06/29
- [Freeciv-Dev] Re: [RFC] Path finding implementation., G. Dyke, 2002/06/29
- [Freeciv-Dev] Re: [RFC] Path finding implementation.,
Ross W. Wetmore <=
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Raimar Falke, 2002/06/29
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Gregory Berkolaiko, 2002/06/29
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Raimar Falke, 2002/06/29
- [Freeciv-Dev] Re: [RFC] Path finding implementation., Ross W. Wetmore, 2002/06/30
[Freeciv-Dev] Re: [RFC] Path finding implementation., Ross W. Wetmore, 2002/06/30
[Freeciv-Dev] Re: [RFC] Path finding implementation., Raimar Falke, 2002/06/30
[Freeciv-Dev] Re: [RFC] Path finding implementation., Ross W. Wetmore, 2002/06/30
[Freeciv-Dev] Re: [RFC] Path finding implementation., Gregory Berkolaiko, 2002/06/29
[Freeciv-Dev] Re: [RFC] Path finding implementation., Raimar Falke, 2002/06/29
[Freeciv-Dev] Re: [RFC] Path finding implementation., Gregory Berkolaiko, 2002/06/29
|
|