[gopher] Re: Whats all this talk about?

To: gopher@xxxxxxxxxxxx
Subject: [gopher] Re: Whats all this talk about?
From: Cameron Kaiser <spectre@xxxxxxxxxxxx>
Date: Tue, 20 Dec 2005 16:30:50 -0800 (PST)
Reply-to: gopher@xxxxxxxxxxxx

> The box Veronica is on is a p200mmx
> Jughead is on another p200mmx
> Freebsd for both.
> The list of sites is included with the 
> About_Veronica_Search text and
> About_Multi_Search talks of Jughead

Ah, thanks (*reads it*).

> I am having problems with the .tree script in 
> that there is not any decent fall backs for things
> like high latency or lost connection, there is 
> an "Alarm" sent in text and that ends the "tree-ing"
> for that site. This may be why the results are so far
> differing at times with yours Cameron.

Is the setup actually a crawler? It's not clear to me if you're using a
predigested index the outside sites provide, or if you're crawling it
yourself. I'm assuming based on

> I have shown which sites "Alarmed" and therefore
> are incomplete.
> For instance:
> gopher.semo.edu #alarm long way in
> that is to say after a long time and quite far
> in the tree I received an alarm which indicates
> one of several things, timeout, loss of connection,
> exceeded "depth" etc.

that you are crawling it yourself.
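
For illustration only, here is a minimal sketch of the kind of fallback the
quoted text says the .tree script lacks: a per-connection timeout plus a few
retries with backoff, so that one slow or dropped connection does not end the
crawl of a whole site. This is not the .tree script or my crawler library; the
names gopher_fetch and fetch_with_retry, the 30-second timeout and the retry
counts are all assumptions made for the example.

```python
import socket
import time


def gopher_fetch(host, selector="", port=70, timeout=30.0):
    """Fetch one selector from a gopher server (hypothetical helper).

    Raises OSError (which includes socket.timeout) on any network failure.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # A gopher request is just the selector followed by CRLF.
        sock.sendall(selector.encode("latin-1") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: end of reply
                break
            chunks.append(data)
    return b"".join(chunks)


def fetch_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() up to `attempts` times.

    Returns (data, None) on success, or (None, last_error) if every attempt
    failed, so the caller can log the failure and move on to the next
    selector instead of abandoning the site.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(), None
        except OSError as exc:  # timeouts, resets, refused connections
            last_error = exc
            if attempt + 1 < attempts:
                time.sleep(delay * (2 ** attempt))  # exponential backoff
    return None, last_error
```

A crawler loop could then call something like
fetch_with_retry(lambda: gopher_fetch("gopher.example.org", "/some/menu"))
and record the selector as failed, rather than the whole site as incomplete,
when the second element of the result is not None.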

> Cameron I think you are indexing more than I atm as 
> well, with my raw data being about 20M and the data 
> file being 10M with a 1M offset file and a 5M "other" 
> file .

How many selectors does that translate to? For the record,

gopher% ls -sk # in kilobytes
total 956399
146408 history.MYD          3664 prospects.MYI          12 stats.frm
105496 history.MYI            12 prospects.frm      304968 textil.MYD
    12 history.frm             6 stats.MYD          391104 textil.MYI
  4696 prospects.MYD           9 stats.MYI              12 textil.frm

so not quite a gig so far. Note it is not full-text.

textil is the keyword and relevancy table, history is the selector/display
string database, prospects is the workspace table and stats is cached
precomputed statistics used for /world. This is with 1.1 million selectors,
give or take a couple thousand, using my regular "stupid" crawler library.
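
As a rough cross-check of those figures, the sketch below sums the sizes from
the listing above and divides by the selector count. It assumes that ls -sk
reports 1024-byte blocks and takes "1.1 million" as exactly 1,100,000, so the
per-selector figure is only approximate.

```python
# Sizes in kilobytes, copied from the ls -sk listing.
sizes_kb = {
    "history.MYD": 146408, "history.MYI": 105496, "history.frm": 12,
    "prospects.MYD": 4696, "prospects.MYI": 3664, "prospects.frm": 12,
    "stats.MYD": 6, "stats.MYI": 9, "stats.frm": 12,
    "textil.MYD": 304968, "textil.MYI": 391104, "textil.frm": 12,
}

total_kb = sum(sizes_kb.values())          # matches the "total" line
selectors = 1_100_000                      # approximate, per the text
bytes_per_selector = total_kb * 1024 / selectors

print(total_kb, round(bytes_per_selector))
```

That works out to a bit under 900 bytes of table and index space per selector,
most of it in the textil keyword table and its index.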

Mind you, this is not a competition :) I'm just curious about how you're
getting things up and running. So far you seem to be getting pretty good
results for an early effort, so you are to be congratulated.

-- 
--------------------------------- personal: http://www.armory.com/~spectre/ ---
  Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckaiser@xxxxxxxxxxxx
-- Hi! I am a .signature virus.  Copy me into your .signature to join in! -----