
[gopher] Veronica-2 and robot exclusion

To: gopher@xxxxxxxxxxxx
Subject: [gopher] Veronica-2 and robot exclusion
From: Cameron Kaiser <spectre@xxxxxxxxxxxx>
Date: Wed, 18 Jun 2003 06:24:18 -0700 (PDT)
Reply-to: gopher@xxxxxxxxxxxx

Let's revisit an old issue since we're not talking about anything right now :-)

Veronica-2 is moving right along. The client libraries are now on the new
server, and I'm setting up a test network of three simulated gopher servers
to bounce requests off of and study autoprune and performance before
unleashing it on an unsuspecting populace.

I'm rewriting the crawler from scratch to get rid of some old inefficiencies
that were in the previous version (and also to leverage better database
support, which will further improve its indexing capabilities). Now's the
time to get your requests in.

I re-read the long thread back in January '01 where we went over how to
implement a robot exclusion standard for Gopherspace, and the solution that
had the widest support (except from yours truly at that time) was to
re-use the existing robots.txt (for your convenience, the current 1994
standard is available at

        gopher://gopher.floodgap.com/0/v2/robotstxt.txt

). Now that the crawler is off the University network and on my dime, I see
no reason not to let it shuffle around 24/7/365, which reduces the re-fetch
penalty of using a global robots.txt file for each gopher and pretty much
disposes of my only major complaint.

Reviewing the robots.txt spec, it seems that this can pretty much be adapted
as is for gopherspace. The Veronica-2 robot's User-Agent (like it really
matters, since it's the only crawler that I know of) will be "veronica-2".
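
To make that concrete, a gopher's exclusion file would look just like its
web cousin; something along these lines (the paths here are made up,
obviously):

    User-agent: veronica-2
    Disallow: /private
    Disallow: /tmp

    # everyone else may fetch anything
    User-agent: *
    Disallow: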

A few things will be grokked differently by this robot, and they're worth
talking about. If you specify a Disallow of "/", not only will V-2 not index
your gopher, it will not even mark you as an active server, and you will not
appear in its server statistics list (a consequence of where robot exclusion
filtering is done when selecting new prospects and updating the global
statistics table). Is this desirable/correct behaviour?
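
For the curious, here's a back-of-the-envelope sketch in Python of where
that check sits (all names are invented for illustration; this is not the
actual crawler code):

    active_servers = set()   # stand-in for the global statistics table
    crawl_queue = []         # stand-in for the prospect/crawl queue

    def disallows_root(rules, agent="veronica-2"):
        # rules: list of (user-agent, set-of-disallowed-paths) pairs
        return any(ua in (agent, "*") and "/" in paths
                   for ua, paths in rules)

    def consider_prospect(host, rules):
        if disallows_root(rules):
            return                    # never marked active, never indexed
        active_servers.add(host)      # shows up in server statistics
        crawl_queue.append(host)      # and gets crawled

    consider_prospect("blocked.example", [("veronica-2", {"/"})])
    assert "blocked.example" not in active_servers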

The current timeout on such files is twenty-four hours from first server
access (it may be fetched slightly more frequently under extraordinary
circumstances).
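
In other words, something like this (again, names purely illustrative):

    import time

    ROBOTS_TTL = 24 * 60 * 60        # twenty-four hours, as above
    _robots_cache = {}               # host -> (fetched_at, file body)

    def cached_robots(host, fetch):
        # fetch: any callable that retrieves the file for a host
        now = time.time()
        entry = _robots_cache.get(host)
        if entry is None or now - entry[0] > ROBOTS_TTL:
            entry = (now, fetch(host))
            _robots_cache[host] = entry
        return entry[1]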

Also, I was thinking of making the file ".robots.txt", since many Unix
gophers don't serve dot-files, although there's a growing number of
Windows-hosted gophers and I don't know whether it will break them (I don't
do x86 myself).
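
One way out might be to try the dot-file name first and fall back to the
plain one. A rough sketch of the fetch, using the standard RFC 1436
transaction (send the selector plus CRLF, read until the server closes);
the selector names are speculative, not settled behaviour:

    import socket

    def gopher_fetch(host, selector, port=70, timeout=30):
        # RFC 1436: send selector + CRLF, then read until EOF
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(selector.encode("ascii") + b"\r\n")
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

    def fetch_robots(host):
        # try the hidden name first, then the conventional one
        for name in ("/.robots.txt", "/robots.txt"):
            try:
                body = gopher_fetch(host, name)
            except OSError:
                continue
            if body:
                return body
        return b""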

Comments requested, since I'm writing this Right This Minute (tm).

-- 
---------------------------------- personal: http://www.armory.com/~spectre/ --
 Cameron Kaiser, Floodgap Systems Ltd * So. Calif., USA * ckaiser@xxxxxxxxxxxx
-- The early bird may get the worm, but the second mouse gets the cheese. -----

