[gopher] Re: New Gopher Wayback Machine Bot

To: gopher@xxxxxxxxxxxx
Subject: [gopher] Re: New Gopher Wayback Machine Bot
From: Cameron Kaiser <spectre@xxxxxxxxxxxx>
Date: Wed, 12 Oct 2005 22:23:09 -0700 (PDT)
Reply-to: gopher@xxxxxxxxxxxx

> > > Cameron, floodgap.com seems to have some sort of rate limiting and keeps
> > > giving me a Connection refused error after a certain number of documents
> > > have been spidered.
> > 
> > I'm a little concerned about your project since I do host a number of large
> > subparts which are actually proxied services, and I think even a gentle bot
> > going methodically through them would not be pleasant for the other side
> > (especially if you mean to regularly update your snapshot).
> 
> Valid concern.  I had actually already marked your site off-limits
> because I noticed that.  Incidentally, your robots.txt doesn't seem to
> disallow anything -- might want to take a look at that ;-)

I know ;) That's because Veronica-2 won't harm the proxied services due to
the way it operates. However, I should be able to accommodate other bots that
may be around or come on board, so I'll rectify this.

> > I do support robots.txt, see
> > 
> >     gopher.floodgap.com/0/v2/help/indexer
> 
> Do you happen to have the source code for that available?  I've got
> some questions for you that it could explain (or you could), such as:
> 
>  1. Which would you use?  (Do you expect URLs to be HTTP-escaped?)
> 
>     Disallow: /Applications and Games
>     Disallow: /Applications%20and%20Games
> 
> 2. Do you assume that all Disallow patterns begin with a slash as they
>    do in HTML, even if the Gopher selector doesn't?
> 
> 3. Do you have any special code to handle the UMN case where
>    1/foo, /foo, and foo all refer to the same document?
> 
> I will be adding robots.txt support to my bot and restarting it shortly.

It does not understand URL escaping; it matches literal selectors only. In the
case of #2/#3, well, maybe it would be better just to post the relevant code.
It should be relatively easy to understand (in Perl, from the V-2 iteration
library). $psr is the persistent state hash reference, and key "xcnd" holds the
selectors generated from Disallow: lines under User-agent: veronica or *.

        # filter on exclusions
        my %excludes = %{ $psr->{"$host:$port"}->{"xcnd"} };
        my $key;
        # try shorter exclusions first
        foreach $key (sort { length($a) <=> length($b) } keys %excludes) {
                # exclude on an exact match, on the selector plus a trailing
                # slash, or on a prefix match when the exclusion ends in "/"
                return (undef, undef, undef, undef, undef,
                                'excluded by robots.txt', 1)
                        if ($key eq $sel || $key eq "$sel/" ||
                                ($key =~ m#/$# &&
                                substr($sel, 0, length($key)) eq $key));
        }
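
For illustration only, here is a simplified sketch of how Disallow: lines could
end up in that "xcnd" hash. This is not the actual V-2 parser; the host, port,
and robots.txt contents below are placeholder values, and proper record
grouping in robots.txt is glossed over.

        # Hypothetical sketch only -- not the actual V-2 parser.
        use strict;

        my ($host, $port) = ('gopher.example.com', 70);  # placeholder values
        my $psr = {};                            # persistent state hash ref
        my $robots_txt = "User-agent: *\n" .
                         "Disallow: /Applications and Games\n";

        my $applies = 0;
        foreach my $line (split(/\r?\n/, $robots_txt)) {
                $line =~ s/#.*$//;                      # strip comments
                $line =~ s/^\s+//; $line =~ s/\s+$//;   # trim whitespace
                next unless length($line);
                if ($line =~ /^User-agent:\s*(.+)$/i) {
                        # remember whether this record applies to us
                        $applies = (lc($1) eq 'veronica' || $1 eq '*');
                } elsif ($applies && $line =~ /^Disallow:\s*(.*)$/i) {
                        # store the literal selector; no URL-unescaping
                        $psr->{"$host:$port"}->{"xcnd"}->{$1} = 1
                                if (length($1));
                }
        }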

As you can see from the filter code above, selectors such as 1/foo, /foo and
foo would need to be specified separately, since other servers might not treat
them as the same document.
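
For example, a robots.txt on such a server could spell out each variant
explicitly (just a sketch, using the placeholder selectors from the question
above):

        User-agent: *
        Disallow: 1/foo
        Disallow: /foo
        Disallow: foo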

-- 
---------------------------------- personal: http://www.armory.com/~spectre/ --
 Cameron Kaiser, Floodgap Systems Ltd * So. Calif., USA * ckaiser@xxxxxxxxxxxx
-- An apple every eight hours will keep three doctors away. -------------------


