[gopher] Re: Whats all this talk about?

To: gopher@xxxxxxxxxxxx
Subject: [gopher] Re: Whats all this talk about?
From: Chris <chris@xxxxxxxxxx>
Date: Wed, 21 Dec 2005 12:40:58 -0600
Reply-to: gopher@xxxxxxxxxxxx

Ok my Data is set up differently than yours...


 128.112.128.152  .. .. .. .. ..text/plain    39 Kb    Dec 16 20:13
 128.118.88.200 . .. .. .. .. ..text/plain   154 Kb    Dec 16 20:13
 128.138.77.16 .. .. .. .. .. ..text/plain    15 Kb    Dec 16 20:13
 128.143.22.55 .. .. .. .. .. ..text/plain    92 Kb    Dec 16 20:13
 128.228.1.2 . .. .. .. .. .. ..text/plain    74 Kb    Dec 16 20:13
 128.32.112.200 . .. .. .. .. ..text/plain   641 Kb    Dec 16 20:13
 129.79.225.200 . .. .. .. .. ..text/plain  4165 Kb    Dec 16 20:13
 130.149.17.12 .. .. .. .. .. ..text/plain  8412 Kb    Dec 16 20:13
 132.239.50.108 . .. .. .. .. ..text/plain  4841 bytes Dec 16 20:13
 132.248.10.7  .. .. .. .. .. ..text/plain  3024 Kb    Dec 16 20:13
 134.124.15.133 . .. .. .. .. ..text/plain   818 Kb    Dec 16 20:13
 138.247.32.12 .. .. .. .. .. ..text/plain  4603 bytes Dec 16 20:13
 140.113.209.66 . .. .. .. .. ..   unknown  1375 Kb    Dec 16 20:13
 140.198.26.4  .. .. .. .. .. ..text/plain   478 bytes Dec 16 20:13
 142.32.161.61 .. .. .. .. .. ..text/plain   442 bytes Dec 16 20:13
 150.201.32.1  .. .. .. .. .. ..text/plain  1504 bytes Dec 16 20:13
 158.182.4.1 . .. .. .. .. .. ..text/plain    53 Kb    Dec 16 20:13
 160.94.23.22  .. .. .. .. .. ..text/plain   546 bytes Dec 16 20:13
 169.226.140.28 . .. .. .. .. ..text/plain  3202 bytes Dec 16 20:13
 192.58.246.4  .. .. .. .. .. ..text/plain  1085 Kb    Dec 15 03:28
 192.98.80.1 . .. .. .. .. .. ..text/plain  4643 bytes Dec 16 20:13
 193.225.12.73 .. .. .. .. .. ..text/plain  1148 Kb    Dec 16 20:13
 193.225.12.74 .. .. .. .. .. ..text/plain   209 bytes Dec 16 20:13
 198.108.1.48  .. .. .. .. .. ..text/plain    78 Kb    Dec 16 20:13
 198.135.224.143  .. .. .. .. ..text/plain    14 Kb    Dec 16 20:13
 198.151.172.33 . .. .. .. .. ..text/plain  1413 bytes Dec 16 20:13
 198.161.91.194 . .. .. .. .. ..text/plain  2074 bytes Dec 15 03:28
 198.30.120.11 .. .. .. .. .. ..text/plain  9656 bytes Dec 16 20:13
 199.125.85.11 .. .. .. .. .. ..text/plain  7418 bytes Dec 16 20:13
 206.80.4.10 . .. .. .. .. .. ..text/plain   200 Kb    Dec 16 20:13
 209.113.213.86 . .. .. .. .. ..text/plain    92 Kb    Dec 16 20:13
 209.216.94.5  .. .. .. .. .. ..text/plain   107 Kb    Dec 15 03:28
 212.68.221.103 . .. .. .. .. ..text/plain     0 bytes Dec 15 03:28
 213.237.16.246 . .. .. .. .. ..text/plain    84 Kb    Dec 15 03:28
 216.138.233.67 . .. .. .. .. ..text/plain  8158 bytes Dec 15 03:28
 216.143.130.27 . .. .. .. .. ..text/plain   404 bytes Dec 16 20:13
 216.99.211.113 . .. .. .. .. ..text/plain    38 Kb    Dec 15 03:28
 217.215.6.225 .. .. .. .. .. ..text/plain  1250 Kb    Dec 15 03:28
 24.185.18.37  .. .. .. .. .. ..text/plain     0 bytes Dec 15 03:28
 66.159.214.138 . .. .. .. .. ..text/plain   247 Kb    Dec 15 03:28
 66.18.231.71  .. .. .. .. .. ..text/plain   330 Kb    Dec 15 03:28
 67.18.92.178  .. .. .. .. .. ..text/plain  3319 bytes Dec 15 03:28
 69.21.205.10  .. .. .. .. .. ..text/plain  1967 Kb    Dec 15 03:29
 69.217.43.23  .. .. .. .. .. ..text/plain    50 Kb    Dec 15 03:29
 72.192.21.54  .. .. .. .. .. ..text/plain  6045 bytes Dec 15 03:29
 80.68.194.26  .. .. .. .. .. ..text/plain    64 Kb    Dec 15 03:29
 80.89.239.61  .. .. .. .. .. ..text/plain  8569 bytes Dec 15 03:29
 84.139.112.5  .. .. .. .. .. ..text/plain  4064 bytes Dec 15 03:29
 Data .. .. .. .. .. .. .. .. ..text/plain    11 Mb    Dec 16 20:13
 Data.offset.0.5  .. .. .. .. ..text/plain  1225 Kb    Dec 16 20:13
 Other . .. .. .. .. .. .. .. ..text/plain  5269 Kb    Dec 16 20:13
Or with ll -h:
total 43486
 38K Dec 16 20:13 128.112.128.152
 153K Dec 16 20:13 128.118.88.200
 14K Dec 16 20:13 128.138.77.16
 92K Dec 16 20:13 128.143.22.55
 74K Dec 16 20:13 128.228.1.2
 640K Dec 16 20:13 128.32.112.200
 4M Dec 16 20:13 129.79.225.200
 8M Dec 16 20:13 130.149.17.12
 4K Dec 16 20:13 132.239.50.108
 2M Dec 16 20:13 132.248.10.7
 817K Dec 16 20:13 134.124.15.133
 4K Dec 16 20:13 138.247.32.12
 1M Dec 16 20:13 140.113.209.66
 478B Dec 16 20:13 140.198.26.4
 442B Dec 16 20:13 142.32.161.61
 1K Dec 16 20:13 150.201.32.1
 53K Dec 16 20:13 158.182.4.1
 546B Dec 16 20:13 160.94.23.22
 3K Dec 16 20:13 169.226.140.28
 1M Dec 15 03:28 192.58.246.4
 4K Dec 16 20:13 192.98.80.1
 1M Dec 16 20:13 193.225.12.73
 209B Dec 16 20:13 193.225.12.74
 77K Dec 16 20:13 198.108.1.48
 14K Dec 16 20:13 198.135.224.143
 1K Dec 16 20:13 198.151.172.33
 2K Dec 15 03:28 198.161.91.194
 9K Dec 16 20:13 198.30.120.11
 7K Dec 16 20:13 199.125.85.11
 200K Dec 16 20:13 206.80.4.10
 91K Dec 16 20:13 209.113.213.86
 106K Dec 15 03:28 209.216.94.5
 0B Dec 15 03:28 212.68.221.103
 83K Dec 15 03:28 213.237.16.246
 7K Dec 15 03:28 216.138.233.67
There are:
92219 lines of individual words in the Data file, each attributed to the
gopher it came from. These words are derived from file names as well,
_not_ from full text.
62434 lines of text in the "Other" file, each attributed to the gopher it
came from.
95171 words in the offset file, each given only a number. I believe this is
how the partial words are assigned numeric values.

The first few times, I indexed each gopher manually, one by one, based on
the spiderings of my gspider (author Tim Fraser, from the list).
I will be making a script to do the indexing, as I do on Jughead. Each site
is crawled to a set depth of directories, but not off the site (yet).
So Veronica-local crawls the site but not the web. Gspider crawls link to
link finding sites; I use a simple script to hit the sites, and darned near
use the lists in the About files as they are.
For now, I am using Jughead and Veronica in the same way as far as crawling
goes. Eventually I want Veronica to crawl the web on her own, and Jughead
will be pointed by a script derived from Gspider, or another spider, or
my whim I guess ;).
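
To make the "set depth, but not off the site" idea concrete, here is a rough
sketch in Python. It is not the gspider or .tree code; the function names
(fetch_menu, parse_menu, crawl_site) and the max_depth parameter are made up
for illustration. The only things taken as given are the standard gopher
conventions: send the selector followed by CRLF, read until the server
closes, and treat each menu line as tab-separated type+display, selector,
host, port.

```python
import socket
from collections import deque

def fetch_menu(host, selector, port=70, timeout=30.0):
    """Fetch one gopher menu: send selector + CRLF, read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(selector.encode("latin-1") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1")

def parse_menu(text):
    """Yield (item_type, display, selector, host, port) for each menu line."""
    for raw in text.split("\n"):
        line = raw.rstrip("\r")
        if line == ".":              # a lone dot ends the menu
            break
        fields = line.split("\t")
        if len(fields) < 4 or not fields[0]:
            continue                 # skip blank or malformed lines
        try:
            port = int(fields[3])
        except ValueError:
            continue
        yield fields[0][0], fields[0][1:], fields[1], fields[2], port

def crawl_site(host, max_depth, fetch=fetch_menu, port=70):
    """Breadth-first crawl of one host; links to other hosts are ignored.

    max_depth is how many directory levels below the root menu to descend
    (a hypothetical parameter, chosen for this sketch).
    """
    seen = {""}
    queue = deque([("", 0)])
    found = []
    while queue:
        selector, depth = queue.popleft()
        for item_type, display, sel, item_host, item_port in parse_menu(
                fetch(host, selector, port)):
            if item_host != host or item_port != port:
                continue             # stay on this site
            if item_type == "i":
                continue             # informational text, not a real item
            found.append((item_type, display, sel))
            if item_type == "1" and depth < max_depth and sel not in seen:
                seen.add(sel)
                queue.append((sel, depth + 1))
    return found
```

Passing fetch as an argument keeps the crawl logic separate from the network
code, so it can be tried against canned menus before it is pointed at a real
server.
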
By the way, my last Jughead data file ended up just a tiny bit smaller
than Veronica's. This is because I have pulled some sites from Jughead, put
them on Veronica, and balanced them, so that together they should
eventually cover a very large portion of gopherspace. I am right around
200,000 selectors (selectors being words attributed to a gopher or gophers
that are shown as results when searched) with both. Jughead on my system
can handle that alone, but since it builds the database in RAM I can't go
too much farther, maybe 425,000-440,000 selectors. Veronica doesn't have
that issue, so it looks very promising. I am unsure whether my terminology
and yours are exactly the same re: selectors.
Just to clarify: in my mind, a word indexed on a gopher then becomes a
selector. In Veronica there is a word list nearly 100,000 lines long, and
each line records where the word came from; in Jughead there are about the
same number of lines, but each word is from just that one site.
Jughead:
ISuburban4    /Audio_Visual/Images/Suburban/Suburban4.jpg     hal3000.cx
Veronica:
suburban:4 207 68200|11 125 694148|31 328 124692|
(Note one word at more than one gopher in Veronica, vs. Jughead, which is
one word and one gopher site.)
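
For anyone who wants to poke at the two sample lines above, here is a small
Python sketch that splits them apart. It only assumes what is visible in the
samples: the Jughead line looks like an item type character plus display
string, then a selector, then a host, separated by tabs or runs of spaces;
the Veronica line looks like a word, a colon, and "|"-terminated groups of
numbers. What the numbers in each group mean is not stated here, so the
sketch leaves them as plain tuples. The function names are hypothetical.

```python
import re

def parse_jughead_line(line):
    """Split a Jughead-style line into type, display string, selector, host.

    Assumes fields are separated by tabs or by runs of two or more spaces,
    as in the sample line.
    """
    fields = re.split(r"\t+| {2,}", line.strip())
    if len(fields) != 3 or not fields[0]:
        raise ValueError("unexpected Jughead line: %r" % line)
    head, selector, host = fields
    return {"type": head[0], "display": head[1:],
            "selector": selector, "host": host}

def parse_veronica_line(line):
    """Split a Veronica-style line into (word, list of number tuples).

    The meaning of the numbers is not interpreted; each "|"-terminated
    group is returned as a tuple of ints.
    """
    word, sep, rest = line.strip().partition(":")
    if not sep:
        raise ValueError("unexpected Veronica line: %r" % line)
    postings = []
    for chunk in rest.split("|"):
        if chunk.strip():
            postings.append(tuple(int(n) for n in chunk.split()))
    return word, postings
```

Run on the samples, the Jughead line yields one record for one site, while
the Veronica line yields one word with three groups, which matches the
"one word at more than one gopher" difference described above.
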


On Tue, 20 Dec 2005 16:30:50 -0800 (PST)
Cameron Kaiser <spectre@xxxxxxxxxxxx> wrote:

> > The box Veronica is on is a p200mmx
> > Jughead is on another p200mmx
> > Freebsd for both.
> > The list of sites is included with the 
> > About_Veronica_Search text and
> > About_Multi_Search talks of Jughead
> 
> Ah, thanks (*reads it*).
> 
> > I am having problems with the .tree script in 
> > that there is not any decent fall backs for things
> > like high latency or lost connection, there is 
> > an "Alarm" sent in text and that ends the "tree-ing"
> > for that site. This may be why the results are so far
> > differing at times with yours Cameron.
> 
> Is the set up actually a crawler? It's not clear to me if you're using a
> predigested index the outside sites provide, or if you're crawling it
> yourself. I'm assuming based on
> 
> > I have shown which sites "Alarmed" and therefore
> > are incomplete.
> > For instance:
> > gopher.semo.edu #alarm long way in
> > that is to say after a long time and quite far
> > in the tree I received an alarm which indicates
> > one of several things, timeout, loss of connection,
> > exceeded "depth" etc.
> 
> that you are crawling it yourself.
> 
> > Cameron I think you are indexing more than I atm as 
> > well, with my raw data being about 20M and the data 
> > file being 10M with a 1M offset file and a 5M "other" 
> > file .
> 
> How many selectors does that translate to? For the record,
> 
> gopher% ls -sk # in kilobytes
> total 956399
> 146408 history.MYD          3664 prospects.MYI          12 stats.frm
> 105496 history.MYI            12 prospects.frm      304968 textil.MYD
>     12 history.frm             6 stats.MYD          391104 textil.MYI
>   4696 prospects.MYD           9 stats.MYI              12 textil.frm
> 
> so not quite a gig so far. Note it is not full-text.
> 
> textil is the keyword and relevancy table, history is the selector/display
> string database, prospects is the workspace table and stats is cached
> precomputed statistics used for /world. This is with 1.1 million selectors,
> give or take a couple thousand, using my regular "stupid" crawler library.
> 
> Mind you, this is not a competition :) I'm just curious about how you're
> getting things up and running. So far you seem to be getting pretty good
> results for an early effort, so you are to be congratulated.
> 
> -- 
> --------------------------------- personal: http://www.armory.com/~spectre/ ---
>   Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckaiser@xxxxxxxxxxxx
> -- Hi! I am a .signature virus.  Copy me into your .signature to join in! -----
> 
> 
> 
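
On the "Alarm" problem in the quoted text above (no fallback for high
latency or a lost connection): one possible fallback is to retry the fetch
a few times with a growing pause before giving up on the site. The Python
sketch below is only an illustration of that idea, not a patch to the .tree
script; fetch_with_retries and its parameters are hypothetical names, and
fetch is assumed to be any zero-argument callable that raises OSError (which
covers socket timeouts and dropped connections) on failure.

```python
import time

def fetch_with_retries(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying on timeouts and lost connections.

    Waits base_delay, then 2*base_delay, then 4*base_delay, ... between
    attempts. After the last retry fails, the last error is re-raised so
    the caller can mark the site as incomplete.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError as exc:       # timeouts and connection errors are OSError subclasses
            last_error = exc
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

A crawler could wrap each menu request this way and only record a site as
"alarmed" once all the retries are used up.
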


-- 
Join FSF as an Associate Member at:
<URL:http://member.fsf.org/join?referrer=3014>


