[gopher] Re: Improved binary file detection in Bucktooth 0.2.2

To: gopher@xxxxxxxxxxxx
Subject: [gopher] Re: Improved binary file detection in Bucktooth 0.2.2
From: Cameron Kaiser <spectre@xxxxxxxxxxxx>
Date: Fri, 28 Dec 2007 05:49:59 -0800 (PST)
Reply-to: gopher@xxxxxxxxxxxx

> I'm using buckd to serve up binary files, and noticed that several
> binary files (mostly older PDFs with a lot of text in the file header)
> were being identified as item type "0" rather than "9". It turns out
> that buckd uses the Perl -B operator to determine binary files.  To do
> this, Perl examines the first block of the file for certain
> characteristics (nul bytes, high-order bits set, etc.) and if the
> proportion of such bytes exceeds 30%, identifies it as a binary file.
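
For anyone following along, here is a rough sketch of that kind of
heuristic. It is written in Python purely for illustration and is an
approximation, not Perl's actual -B implementation and not Bucktooth
code. The function names, the 512-byte block size, the set of
"acceptable" control characters and the treatment of empty input are
all assumptions.

```python
def looks_binary(block: bytes, threshold: float = 0.30) -> bool:
    """Approximate a -B style check on the first block of a file."""
    if not block:
        # Simplifying assumption: treat empty input as text.
        return False
    if b"\x00" in block:
        # A nul byte in the examined block marks the file as binary.
        return True
    # Control characters that commonly appear in plain text (assumed set).
    text_controls = {7, 8, 9, 10, 11, 12, 13, 27}
    odd = sum(
        1 for b in block
        if b > 127 or (b < 32 and b not in text_controls)
    )
    # Binary only if the share of odd bytes exceeds the threshold.
    return odd / len(block) > threshold


def gopher_item_type(path: str, block_size: int = 512) -> str:
    """Return "9" (binary) or "0" (text) based on the first block only."""
    with open(path, "rb") as fh:
        block = fh.read(block_size)
    return "9" if looks_binary(block) else "0"
```

A file that opens with a long run of plain text and only a handful of
high-bit bytes stays well under the threshold, which would explain why
text-heavy PDF headers come back as type "0".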
> 
> This wasn't accurate enough for my purposes, so I modified buckd.in so
> that it calls the UNIX "file" command and greps for the string "text"
> (guaranteed to be returned if a file is identified as a text file).
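
A minimal sketch of the 'file' approach, again in Python for
illustration only. This is not the actual buckd.in patch, which was not
posted here; the function names are hypothetical, and it assumes a
'file' utility on the PATH that accepts -b (brief output, no filename
prefix).

```python
import subprocess


def description_is_text(description: str) -> bool:
    """True if the description from 'file' contains the word "text"."""
    return "text" in description


def item_type_via_file(path: str) -> str:
    """Ask the external 'file' command and map the answer to 0 or 9."""
    result = subprocess.run(
        ["file", "-b", path],   # -b: print only the description
        capture_output=True,
        text=True,
        check=True,             # raise if 'file' itself fails
    )
    return "0" if description_is_text(result.stdout) else "9"
```

With sample descriptions such as "ASCII text" and "PDF document,
version 1.2", the first maps to item type 0 and the second to 9.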

The other thing I might do is just expand the number of file extensions
Bucktooth recognizes and generates item types for, since -B is the
fall-through case and there will always be datasets falling in the tails
of the bell curve.
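
To make the idea concrete, an extension lookup with a content check as
the fall-through might look something like the sketch below (Python,
illustration only). The table is hypothetical and is not Bucktooth's
actual list; the item types 0, 9, g and I are from RFC 1436, and h is
the commonly used HTML extension to it.

```python
import os

# Hypothetical extension table; not Bucktooth's real mapping.
EXTENSION_ITEM_TYPES = {
    ".txt": "0",    # plain text
    ".pdf": "9",    # binary
    ".zip": "9",    # binary
    ".gif": "g",    # GIF image
    ".jpg": "I",    # other image
    ".png": "I",    # other image
    ".html": "h",   # HTML (non-RFC extension)
}


def item_type_for(filename, fallback=lambda name: "0"):
    """Use the extension table first; otherwise defer to the fallback.

    The fallback stands in for a content-based check such as -B.
    """
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_ITEM_TYPES:
        return EXTENSION_ITEM_TYPES[ext]
    return fallback(filename)
```

With a table like this, a text-heavy PDF named old-paper.PDF gets item
type 9 without its contents ever being inspected.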

The 'file' command approach is ingenious but does of course have performance
implications, since it means spawning an external process for the check
(though since I'm running out of inetd I'm obviously not that concerned
with performance ;-).

I appreciate the heads-up.

-- 
------------------------------------ personal: http://www.cameronkaiser.com/ --
  Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckaiser@xxxxxxxxxxxx
-- Life isn't fair. But having the root password helps. -----------------------
