Complete.Org: Mailing Lists: Archives: gopher: July 2008:
[gopher] Re: parallelizing Veronica-2

To: gopher@xxxxxxxxxxxx
Subject: [gopher] Re: parallelizing Veronica-2
From: JumpJet Mailbox <jumpjetinfo@xxxxxxxxx>
Date: Wed, 23 Jul 2008 12:41:03 -0700 (PDT)
Reply-to: gopher@xxxxxxxxxxxx

JumpJet has NO objection to any spider running multiple threads on the server.  
Also, PLEASE index as much as you can (EVERY resource if possible), and do it 
on a regular basis (things change often).
--- On Wed, 7/23/08, chris <chris@xxxxxxxxxx> wrote:

From: chris <chris@xxxxxxxxxx>
Subject: [gopher] Re: parallelizing Veronica-2
To: gopher@xxxxxxxxxxxx
Date: Wednesday, July 23, 2008, 1:07 PM

I currently use 6 threads on my spider/crawler with no one complaining, although
they run independently of each other against different sites until the sites are
exhausted; when it comes down to the last few sites, the spiders work the same
site together or in groups. My Veronica indexes in a single thread. I don't
think you're going to be a burden on any servers.
Oh, and my selector rest time is 3 seconds between hits.
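The scheme described above — worker threads that each take whole sites from a shared queue and rest between hits — can be sketched roughly like this (a minimal illustration only; the host names, queue structure, and shortened rest interval are all assumptions, not chris's actual spider code):

```python
import queue
import threading
import time

REST_SECONDS = 0.01  # stand-in for chris's 3-second rest between hits

def worker(hosts, results, lock):
    """Each thread takes whole sites from the queue until it is exhausted."""
    while True:
        try:
            host, selectors = hosts.get_nowait()
        except queue.Empty:
            return  # no sites left for this thread
        for selector in selectors:
            # A real spider would fetch gopher://host/selector here;
            # this sketch just records the visit.
            with lock:
                results.append((host, selector))
            time.sleep(REST_SECONDS)  # rest between hits to the same site

# Hypothetical site list standing in for discovered Gopher menus.
sites = {"gopher.example%d.org" % i: ["/menu%d" % j for j in range(3)]
         for i in range(6)}

hosts = queue.Queue()
for item in sites.items():
    hosts.put(item)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(hosts, results, lock))
           for _ in range(6)]  # six crawl threads, as in chris's spider
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 6 sites x 3 selectors = 18 recorded hits
```

This sketch omits the end-game behaviour chris mentions, where the remaining threads gang up on the last few sites together or in groups.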

On Tue, 22 Jul 2008 22:09:32 -0700 (PDT)
Cameron Kaiser <spectre@xxxxxxxxxxxx> wrote:

> One big problem with Veronica-2 is that the crawling process is presently
> quite slow (besides the manual data massaging that I have to do every so
> often, review the dead hosts list, prune pathologic selectors, etc.) and
> impacts the accuracy of the search database because it is nowhere near as
> sprightly as Google.
> V-2 will never quite be the Google of Gopherspace but there are some
> optimizations I have in mind for increasing its coverage and therefore
> relevance. Some of these I'm implementing now.
> However, the biggest change I am considering is parallelizing it (rather
> than paralysing it ;-). Right now there is a single thread running doing
> the crawling, which is somewhat inefficient, but done this way to make
> debugging easier. As it stands, there have been no major changes to the
> crawling core for almost two years -- I have been making various changes
> to the search client end, but not to the actual crawler.
> For this reason, I'd like to increase the number of crawl threads from one
> to three as a test case to see how well this operates. This won't improve
> throughput 3x, because my profiling shows a fair bit of the load is
> writes, but it will improve it by a non-trivial factor. However, there is
> also the possibility that people will see parallel hits to their server if
> the crawlers have a small set of servers to iterate over at any given time.
> To reduce the possibility of a loop causing the crawl threads to hammer
> individual hosts, each individual thread can at most hit a selector every
> five seconds even if it is switching to a different host just in case the
> interthread communication glitches. This should keep load down while crawling
> is in progress, as there will always be a hard rate limit.
> People who have observed the crawler in operation will also note that it
> does not request every single resource anyway, since it doesn't index
> everything: it looks at menus primarily, and only individual resources if
> they were linked in from somewhere else, to verify their existence. This
> goes a long way to making V-2 a better neighbour, I think, and I did this
> by design.
> Please let me know if there is any strenuous opposition to increasing the
> crawl rate. This will probably not go into effect for a few weeks while I
> internally debug the synchronization code, but it may be in operation by
> this fall.
> -- 
> ------------------------------------ personal: --
>   Cameron Kaiser * Floodgap Systems * *
> -- BOND THEME NOW PLAYING: "Diamonds Are Forever"

FreeBSD it's Da Bomb
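The hard per-thread rate limit Cameron describes — at most one selector hit every five seconds per thread, even when the thread switches to a different host — might look something like this (an illustrative sketch, not the Veronica-2 code; the interval is shortened for the demo):

```python
import time

class HardRateLimit:
    """Enforce a per-thread floor between selector hits, even across
    host switches, so a glitch in interthread communication cannot
    cause one thread to hammer servers."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_hit = 0.0  # monotonic timestamp of this thread's last hit

    def wait(self):
        """Block until at least min_interval has passed since the last hit."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self.last_hit)
        if remaining > 0:
            time.sleep(remaining)
        self.last_hit = time.monotonic()

# Demo with a 0.1s interval standing in for V-2's five seconds.
limiter = HardRateLimit(0.1)
stamps = []
for host in ["a.example", "b.example", "a.example"]:
    limiter.wait()  # switching hosts does NOT reset the limit
    stamps.append(time.monotonic())

gaps = [b - a for a, b in zip(stamps, stamps[1:])]
print(all(g >= 0.1 - 1e-3 for g in gaps))  # hard floor held between hits
```

One limiter instance per crawl thread gives the guarantee Cameron mentions: total load stays bounded by (threads / interval) hits per second regardless of how the work is distributed.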

