[gopher] Re: New Gopher Wayback Machine Bot
> > > Cameron, floodgap.com seems to have some sort of rate limiting and keeps
> > > giving me a Connection refused error after a certain number of documents
> > > have been spidered.
> >
> > I'm a little concerned about your project since I do host a number of large
> > subparts which are actually proxied services, and I think even a gentle bot
> > going methodically through them would not be pleasant for the other side
> > (especially if you mean to regularly update your snapshot).
>
> Valid concern. I had actually already marked your site off-limits
> because I noticed that. Incidentally, your robots.txt doesn't seem to
> disallow anything -- might want to take a look at that ;-)
I know ;) it's because Veronica-2 won't harm the proxied services due to
the way it operates. However, I should be able to accommodate other bots that
may be around or come on board, so I'll rectify this.
> > I do support robots.txt, see
> >
> > gopher.floodgap.com/0/v2/help/indexer
>
> Do you happen to have the source code for that available? I've got
> some questions for you that it could explain (or you could), such as:
>
> 1. Which would you use? (Do you expect URLs to be HTTP-escaped?)
>
> Disallow: /Applications and Games
> Disallow: /Applications%20and%20Games
>
> 2. Do you assume that all Disallow patterns begin with a slash as they
> do in HTML, even if the Gopher selector doesn't?
>
> 3. Do you have any special code to handle the UMN case where
> 1/foo, /foo, and foo all refer to the same document?
>
> I will be adding robots.txt support to my bot and restarting it shortly.
It does not understand URL escaping; it matches literal selectors only. In the
case of #2/#3, well, maybe it would be better just to post the relevant code.
It should be relatively easy to understand (in Perl, from the V-2 iteration
library). $psr is the persistent state hash reference, and key "xcnd" contains
a list of selectors generated from Disallow: lines with User-agent: veronica
or *.
# filter on exclusions
my %excludes = %{ $psr->{"$host:$port"}->{"xcnd"} };
my $key;
foreach $key (sort { length($a) <=> length($b) } keys %excludes) {
    return (undef, undef, undef, undef, undef,
            'excluded by robots.txt', 1)
        if ($key eq $sel || $key eq "$sel/" ||
            ($key =~ m#/$# &&
             substr($sel, 0, length($key)) eq $key));
}
As you can see from here, they would need to be specified separately, since
other servers might not treat them the same.
--
---------------------------------- personal: http://www.armory.com/~spectre/ --
Cameron Kaiser, Floodgap Systems Ltd * So. Calif., USA * ckaiser@xxxxxxxxxxxx
-- An apple every eight hours will keep three doctors away. -------------------