To: gopher@xxxxxxxxxxxx
Subject: [gopher] Re: parallelizing Veronica-2
From: chris <chris@xxxxxxxxxx>
Date: Wed, 23 Jul 2008 12:07:06 -0500
Reply-to: gopher@xxxxxxxxxxxx

I currently use 6 threads on my spider/crawler with no one complaining. They
run independently of each other against different sites until those sites are
exhausted; when it comes down to the last few sites, the spiders work the same
site together or in groups. My Veronica indexes in a single thread as well.
I don't think you're going to be a burden on any servers.
Oh, and my selector rest time is 3 seconds between hits.
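
For illustration, here is a minimal sketch of that kind of setup in Python
(the host names are placeholders and the code is illustrative only, not the
spider's actual source): independent worker threads pull whole sites off a
shared list and rest a fixed interval between hits.

    import socket
    import threading
    import time

    REST_SECONDS = 3  # rest time between selector hits

    def fetch(host, selector, port=70):
        """One Gopher transaction: send the selector, read until EOF."""
        with socket.create_connection((host, port), timeout=30) as s:
            s.sendall(selector.encode("ascii") + b"\r\n")
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

    def spider(sites):
        """Each thread takes whole sites until the shared list is empty."""
        while True:
            try:
                host, selectors = sites.pop()  # one site per thread at a time
            except IndexError:
                return  # all sites exhausted; thread exits
            for selector in selectors:
                fetch(host, selector)
                time.sleep(REST_SECONDS)  # fixed rest between hits

    # Six independent workers sharing one list of (host, selectors) pairs.
    sites = [("gopher.example.org", ["/"]), ("gopher.example.net", ["/"])]
    threads = [threading.Thread(target=spider, args=(sites,)) for _ in range(6)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()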
Chris

On Tue, 22 Jul 2008 22:09:32 -0700 (PDT)
Cameron Kaiser <spectre@xxxxxxxxxxxx> wrote:

> One big problem with Veronica-2 is that the crawling process is presently
> quite slow (on top of the manual data massaging I have to do every so often:
> reviewing the dead hosts list, pruning pathologic selectors, and so on), and
> this hurts the accuracy of the search database, because the crawl is nowhere
> near as sprightly as Google's.
> 
> V-2 will never quite be the Google of Gopherspace but there are some
> optimizations I have in mind for increasing its coverage and therefore
> relevance. Some of these I'm implementing now.
> 
> However, the biggest change I am considering is parallelizing it (rather
> than paralysing it ;-). Right now there is a single thread doing the
> crawling, which is somewhat inefficient, but it was done this way to make
> debugging easier. As it stands, there have been no major changes to the
> crawling core for almost two years -- I have been making various changes
> to the search client end, but not to the actual crawler.
> 
> For this reason, I'd like to increase the number of crawl threads from one
> to three as a test case to see how well this operates. This won't improve
> throughput 3x, because my profiling shows a fair bit of the load is database
> writes, but it will improve it by a non-trivial factor. However, there is
> also the possibility that people will see parallel hits to their server if
> the crawlers have a small set of servers to iterate over at any given time.
> To reduce the possibility of a loop causing the crawl threads to hammer
> individual hosts, each individual thread can hit at most one selector every
> five seconds, even if it is switching to a different host, just in case the
> interthread communication glitches. This should keep load down while
> crawling is in progress, as there will always be a hard rate limit.
> 
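A minimal sketch of such a hard per-thread limit, in Python (illustrative
only, not V-2's actual code): each thread keeps its own last-hit timestamp
and sleeps out the remainder before any request, whatever host comes next.

    import time

    class ThreadRateLimit:
        """Hard per-thread cap: at most one selector hit every
        MIN_INTERVAL seconds, regardless of which host is next."""

        MIN_INTERVAL = 5.0

        def __init__(self):
            self.last_hit = 0.0  # monotonic time of this thread's last hit

        def wait(self):
            remaining = self.last_hit + self.MIN_INTERVAL - time.monotonic()
            if remaining > 0:
                time.sleep(remaining)
            self.last_hit = time.monotonic()

    # Each crawl thread owns one limiter and calls wait() before every hit,
    # so even glitched interthread communication cannot exceed the cap.
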
> People who have observed the crawler in operation will also note that it
> does not request every single resource anyway, since it doesn't index them;
> it looks primarily at menus, and fetches individual resources only to verify
> their existence when they are linked in from somewhere else. This goes a
> long way toward making V-2 a better neighbour, I think, and I did this by
> design.
> 
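That menu-first policy also fits in a short sketch (illustrative; the queue
names here are made up): a Gopher menu line is an item-type character
followed by tab-separated display string, selector, host, and port, and only
type 1 (submenu) items go back onto the crawl queue, while everything else
is merely checked for existence.

    menus_to_crawl = []    # type-1 items: fetch and index these menus
    leaves_to_verify = []  # other items: only confirm they exist

    def parse_menu(menu_bytes):
        """Split a Gopher menu into (type, selector, host, port) tuples."""
        items = []
        for line in menu_bytes.decode("latin-1").splitlines():
            if not line or line == ".":
                continue  # blank line or end-of-menu terminator
            fields = line[1:].split("\t")
            if len(fields) < 4:
                continue  # malformed menu line
            display, selector, host, port = fields[:4]
            items.append((line[0], selector, host, port))
        return items

    def sort_menu_items(menu_bytes):
        """Menus get crawled; anything else is only verified later."""
        for item_type, selector, host, port in parse_menu(menu_bytes):
            if item_type == "1":
                menus_to_crawl.append((host, port, selector))
            else:
                leaves_to_verify.append((host, port, selector))
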
> Please let me know if there is any strenuous opposition to increasing the
> crawl rate. This probably will not go into effect for a few weeks, while I
> debug the synchronization code internally, but it may be in operation by
> this fall.
> 
> -- 
> ------------------------------------ personal: http://www.cameronkaiser.com/ --
>   Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckaiser@xxxxxxxxxxxx
> -- BOND THEME NOW PLAYING: "Diamonds Are Forever" -----------------------------


-- 
FreeBSD it's Da Bomb


