To: gopher@xxxxxxxxxxxx
Subject: [gopher] parallelizing Veronica-2
From: Cameron Kaiser <spectre@xxxxxxxxxxxx>
Date: Tue, 22 Jul 2008 22:09:32 -0700 (PDT)
Reply-to: gopher@xxxxxxxxxxxx

One big problem with Veronica-2 is that the crawling process is presently
quite slow (besides the manual data massaging I have to do every so often:
reviewing the dead hosts list, pruning pathological selectors, and so on), and
this impacts the accuracy of the search database because it is nowhere near as
sprightly as Google.

V-2 will never quite be the Google of Gopherspace but there are some
optimizations I have in mind for increasing its coverage and therefore
relevance. Some of these I'm implementing now.

However, the biggest change I am considering is parallelizing it (rather
than paralysing it ;-). Right now there is a single thread doing the
crawling, which is somewhat inefficient, but it was done this way to make
debugging easier. As it stands, there have been no major changes to the
crawling core for almost two years -- I have been making various changes
to the search client end, but not to the actual crawler.

For this reason, I'd like to increase the number of crawl threads from one
to three as a test case to see how well this operates. This won't improve
throughput 3x, because my profiling shows a fair bit of the load is database
writes, but it will improve it by a non-trivial factor. However, there is
also the possibility that people will see parallel hits to their server if
the crawlers have a small set of servers to iterate over at any given time.
To reduce the possibility of a loop causing the crawl threads to hammer
individual hosts, each individual thread can hit at most one selector every
five seconds, even if it is switching to a different host, just in case the
interthread communication glitches. This should keep load down while crawling
is in progress, as there will always be a hard rate limit.
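
To make the rate limit concrete, here is a minimal Python sketch (V-2 is not
actually built this way; fetch_selector, the queue layout, and the thread
count are illustrative assumptions only): each worker pulls work from a shared
queue but never issues more than one request every five seconds, no matter
which host it happens to be talking to.

import queue
import threading
import time

MIN_INTERVAL = 5.0    # hard per-thread limit: one selector every five seconds
NUM_THREADS  = 3      # proposed crawl thread count

work = queue.Queue()  # (host, port, selector) tuples filled by the scheduler

def fetch_selector(host, port, selector):
    pass              # stand-in for the real fetch-and-parse step

def crawl_worker():
    last_request = 0.0
    while True:
        host, port, selector = work.get()
        # Enforce the hard limit even if the interthread scheduling glitches
        # and the same host keeps coming back to this thread.
        wait = MIN_INTERVAL - (time.monotonic() - last_request)
        if wait > 0:
            time.sleep(wait)
        last_request = time.monotonic()
        fetch_selector(host, port, selector)
        work.task_done()

for _ in range(NUM_THREADS):
    threading.Thread(target=crawl_worker, daemon=True).start()

As for the expected gain, purely for illustration: if half of each crawl cycle
were spent in serialized database writes, three fetch threads would yield
roughly 1/(0.5 + 0.5/3), or about 1.7x, rather than 3x; the real fraction is
whatever the profiling above says it is.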

People who have observed the crawler in operation will also note that it
does not request every single resource anyway, since it doesn't index them;
it looks primarily at menus, and only fetches individual resources to verify
their existence if they were linked from somewhere else. This goes a long way
toward making V-2 a better neighbour, I think, and I did this by design.
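
For anyone curious what "looks primarily at menus" amounts to on the wire,
here is a rough illustration (again my own sketch, not the V-2 source; the
host is only an example): fetch a menu, parse the standard tab-separated
item lines, queue sub-menus (type 1) for a full crawl, and give anything else
linked from the menu only a truncated read to confirm it exists.

import socket

def gopher_fetch(host, port, selector, limit=None):
    # Send a selector and return the raw reply, optionally truncated.
    with socket.create_connection((host, port), timeout=30) as s:
        s.sendall(selector.encode("latin-1") + b"\r\n")
        chunks, got = [], 0
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
            got += len(data)
            if limit and got >= limit:
                break    # enough to know the item exists
    return b"".join(chunks)

def parse_menu(raw):
    # Yield (itemtype, selector, host, port) for each menu line.
    for line in raw.decode("latin-1", "replace").splitlines():
        if line in (".", ""):
            continue
        itemtype, fields = line[0], line[1:].split("\t")
        if len(fields) >= 4:
            yield itemtype, fields[1], fields[2], fields[3]

raw = gopher_fetch("gopher.floodgap.com", 70, "")
for itemtype, sel, host, port in parse_menu(raw):
    if itemtype == "1":
        pass    # queue (host, port, sel) for a full menu crawl
    elif itemtype not in ("i", "3"):
        gopher_fetch(host, int(port), sel, limit=1024)  # existence check only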

Please let me know if there is any strenuous opposition to increasing the
crawl rate. This probably will not go into effect for a few weeks while I
internally debug the synchronization code, but it may be in operation by
this fall.

-- 
------------------------------------ personal: http://www.cameronkaiser.com/ --
  Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckaiser@xxxxxxxxxxxx
-- BOND THEME NOW PLAYING: "Diamonds Are Forever" -----------------------------


