Summary: Cluster Analysis software

stolcke@ICSI.Berkeley.EDU
Mon Nov 12 23:34:53 EST 1990



Here is a short summary of the replies I got to my request for clustering
software.  Thanks to all who replied; the problem is more than solved.

A couple of home-brewed programs were mentioned.  One that apparently has been
floating around quite a bit is "cluster" by Yoshiro Miyata.
This turned out to be the program that draws cluster trees in ASCII
that I had mentioned in my original message.

Many people mentioned "S" (or "S+" in a more recent version), a general
statistics package that does cluster analysis among other things.
Reportedly PostScript and X Windows graphics output are supported.
This is commercial software; the contact address is
STATSCI, PO Box 85625, Seattle, WA 98145-1625, 206-322-8707
(standard disclaimer: I have no personal interest in the economic success
of this company, I don't even know their product yet).
There is also a book about S by R.A. Becker & J.M. Chambers, "S: An
Interactive Environment for Data Analysis and Graphics," Wadsworth, Belmont,
Ca., 1984 (2nd edition, probably not the most recent one).

An alternative (possibly less expensive, but then I have no pricing info
yet) might be Berkeley's BLSS software package, available through
BLSS Project, Dept. of Statistics, University of California, Berkeley,
CA 94720, 415-642-5258, blss at back.berkeley.edu.  Again, there's a book out,
by D.M. Abrahams & F. Rizzardi, "BLSS: The Berkeley Interactive Statistical
System," W.W. Norton & Company, 1988.

Finally, since I needed something FAST, I created my own solution.
I added a couple of features to Miyata's cluster program to make it more
useful in conjunction with other programs.  It now can be used conveniently
as a UNIX filter, and optionally produces device-independent graphics output 
that may be rendered on a variety of output media through the family of
standard UNIX programs dealing with plot(5) format (including PostScript
and X Windows previewing).  Other features added include options for
output format selection, distance metric selection, and scaling of
individual dimensions.
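
To make the last two options concrete, here is a minimal sketch of what
"distance metric selection" and "scaling of individual dimensions" can mean
for a pair of input vectors.  It is an illustration only, not code from the
cluster program: the function name scaled_distance, the metric names, and
the scales parameter are all made up for this example.

```python
import math

def scaled_distance(a, b, scales=None, metric="euclidean"):
    """Distance between two equal-length vectors.

    Hypothetical helper for illustration: `scales` holds one weight per
    dimension (default 1.0 each), and `metric` picks how the weighted
    per-dimension differences are combined.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if scales is None:
        scales = [1.0] * len(a)
    if len(scales) != len(a):
        raise ValueError("need exactly one scale factor per dimension")

    # Weighted absolute difference in each dimension.
    diffs = [s * abs(x - y) for x, y, s in zip(a, b, scales)]

    if metric == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if metric == "manhattan":
        return sum(diffs)
    if metric == "max":
        return max(diffs, default=0.0)
    raise ValueError("unknown metric: " + metric)
```

Setting a scale factor to 0 removes that dimension from the comparison
altogether, and a factor above 1 makes differences in that dimension count
for more.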

I have done absolutely nothing to improve the algorithm used to
do the actual clustering, since my data sets are typically small and I needed
to quickly hack up something that produces acceptable output.
(The complexity of the algorithm currently used is n^3, whereas n log n is
feasible, I am told.)  The program is written modularly enough so that
it shouldn't be too hard to plug in a better algorithm if available.
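
For readers wondering where a cubic cost comes from, the following is a
rough sketch of a naive agglomerative clustering loop.  It is not the code
in the cluster program; the single-linkage merge rule, the function name
naive_agglomerative, and the output format are assumptions made for this
example.  Each of the n-1 merges rescans all remaining cluster pairs, which
is on the order of n^2 work per merge.

```python
import math

def naive_agglomerative(points, dist=math.dist):
    """Single-linkage agglomerative clustering, naive O(n^3) version.

    Returns the merges in the order performed, each as a tuple
    (members_of_first, members_of_second, distance), where members are
    tuples of indices into `points`.
    """
    # Active clusters: id -> tuple of point indices.
    clusters = {i: (i,) for i in range(len(points))}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist(points[i], points[j])
         for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(points)

    while len(clusters) > 1:
        # Full scan for the closest pair: O(n^2) per merge, n-1 merges.
        (i, j), best = min(d.items(), key=lambda kv: kv[1])
        merges.append((clusters[i], clusters[j], best))
        merged = clusters.pop(i) + clusters.pop(j)

        # Single linkage: the new cluster is as close to k as the closer
        # of its two parts was.
        for k in clusters:
            dik = d.pop((min(i, k), max(i, k)))
            djk = d.pop((min(j, k), max(j, k)))
            d[(k, next_id)] = min(dik, djk)
        del d[(i, j)]

        clusters[next_id] = merged
        next_id += 1

    return merges
```

The closest-pair scan is the part a better algorithm would replace; the
bookkeeping around it could stay much the same.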

The program as it stands now should be flexible enough to be of public use.
It could be a poor man's solution for those who need to produce cluster
analyses but do not have access to fancy statistics software packages or 
don't want to deal with one.  I'll make it available through anonymous FTP
from icsi-ftp.berkeley.edu (128.32.201.55).  To get it, use FTP as follows:

% ftp icsi-ftp.berkeley.edu
Connected to icsic.Berkeley.EDU.
220 icsi-ftp (icsic) FTP server (Version 5.60 local) ready.
Name (icsic.Berkeley.EDU:stolcke): anonymous
Password (icsic.Berkeley.EDU:anonymous):
331 Guest login ok, send ident as password.
230 Guest login Ok, access restrictions apply.
ftp> cd pub
250 CWD command successful.
ftp> binary
200 Type set to I.
ftp> get cluster.tar.Z
200 PORT command successful.
150 Opening BINARY mode data connection for cluster.tar.Z (15531 bytes).
226 Transfer complete.
15531 bytes received in 0.08 seconds (1.9e+02 Kbytes/s)
ftp> quit
221 Goodbye.

Then unpack and compile using

% zcat cluster.tar.Z | tar xf -
% make

Please don't ask me to mail it if you can get it through FTP.
Of course I cannot guarantee the usefulness of the program or promise
any level of support etc.  However, I'd like to know of any bug fixes and
improvements, especially to the algorithm (hint, hint ...)

Again, thanks for all the replies,

				Andreas




