Summary of responses on data transformation tools

Danny L. Silver dsilver at csd.uwo.ca
Fri Jul 7 09:35:45 EDT 1995


Some time ago (May/95), I requested additional information on data
transformation tools:

> Many of us spend hours preparing data files for acceptance by
> machine learning systems.  Typically, I use awk or C code to transform
> ASCII records into numeric or symbolic attribute tuples for a neural net,
> inductive decision tree, etc.  Before re-inventing the wheel, has anyone
> developed a general tool for performing some of the more common
> transformations?  Any related suggestions would be of great use to many
> on the network.

Below is a summary of the most informative responses I received.
Sorry for the delay.
. Danny
-- 

=========================================================================
=  Daniel L. Silver    University of Western Ontario, London, Canada    =
=                      N6A 3K7 - Dept. of Comp. Sci. - Office: MC27b    =
=  dsilver at csd.uwo.ca  H: (519)473-6168   O: (519)679-2111 (ext.6903)   =
=========================================================================

From: A. Famili

I have done quite a bit of work in this area, on data preparation and data
pre-processing, and also on rule post-processing in induction. The induction
system that we have built includes some data pre-processing capabilities.
I am also organizing and will be chairing a panel on the "Role of data
pre-processing in Intelligent Data Analysis" at the IDA-95 Symposium
(Intelligent Data Analysis Symposium, to be held in Germany in Aug. 1995).

The most common tool on the market is NeuralWare's Data Sculptor
(I have only seen the brochure and a demo).
It is claimed to be a general-purpose tool. Others are described
in a short report that I am sending you below.

A. Famili, Ph.D.
Senior Research Scientist
Knowledge Systems Lab.
IIT- NRC, Bldg. M-50
Montreal Rd. Ottawa, Ont. 
K1A 0R6  Canada

Phone: (613) 993-8554
Fax  : (613) 952-7151
email: famili at ai.iit.nrc.ca

---------------------------


A. Famili
Knowledge Systems Laboratory
Institute for Information Technology
National Research Council Canada

1.0  Introduction

This report outlines a comparison of three commercial data pre-processing
tools that are available on the market. The purpose of the study was to
identify useful features of these tools that could be helpful in intelligent
filtering and data analysis for the IDS project. The comparison did not
involve use or evaluation of any of the tools on real data. Two of these
tools (LabView and OS/2 Visualizer) are available in the KSL.

2.0  Data Sculptor

Data Sculptor was developed by NeuralWare on the premise that, in neural
network data analysis applications, 80 percent of the time is spent on data
preprocessing. The tool was developed to handle any type of transformation
or manipulation of data before the data is analysed. The graphics
capabilities include histograms, bar charts, and line, pie and scatter
plots. There are several statistical functions that can be applied to the
data. There are also options to create new variables (attribute vectors)
based on transformations of other variables. Following are some important
specifications, as explained in the fact sheets and demo version:

- Input Data Formats: DBase, Excel, Paradox, Fixed Field, ASCII,
  and Binary.

- Output Data Formats: Fixed Field, Delimited ASCII and Binary

- General Data Transformations: Sorting, File Merge, Field Comparison, 
  Sieve and Duplicate and Neighborhood. 

- Math. Transformations: Arithmetic, Trigonometric, and Exponential.

- Special Transformations: Encodings of the type One-of-N, Fuzzy
  One-of-N, R-of-N, Analog R-of-N, Thermometer, Circular, and
  Inverse Thermometer, Normalizing Functions, Fast Fourier
  Transforms and some more.

- Stat. Functions: Count, Sum, Mean, Chi-square, Min, Max, STD,
  Variance, Correlation and some more.

- Graph Formats: Bar chart, Histogram, Scatter Plot, Pie, etc.

- Spreadsheet: Data Viewing, and Search Function.
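
As an illustration of the special transformations listed above, here is a
minimal Python sketch of three of them: One-of-N encoding, Thermometer
encoding, and a simple normalizing function. The definitions used are the
common textbook ones and are an assumption; the fact sheets do not say
exactly how Data Sculptor defines each encoding. The function names
(one_of_n, thermometer, min_max_normalize) are hypothetical and are not
part of any product.

```python
def one_of_n(index, n):
    """One-of-N: n units, with a single 1 at position `index`."""
    if not 0 <= index < n:
        raise ValueError("index must be in range 0..n-1")
    return [1 if i == index else 0 for i in range(n)]


def thermometer(index, n):
    """Thermometer: units 0..index are 1, the remaining units are 0.

    Assumed definition; some tools switch on `index` units instead of
    `index + 1`.
    """
    if not 0 <= index < n:
        raise ValueError("index must be in range 0..n-1")
    return [1 if i <= index else 0 for i in range(n)]


def min_max_normalize(values, lo=0.0, hi=1.0):
    """Linearly rescale a column of numbers into the interval [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant column carries no information; map it to `lo`.
        return [lo for _ in values]
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]


# Example: a symbolic attribute with four possible values, third value.
print(one_of_n(2, 4))
print(thermometer(2, 4))
print(min_max_normalize([2.0, 4.0, 6.0]))
```

One-of-N suits unordered symbolic attributes, while the thermometer form
keeps the ordering of ordinal values visible to a neural net.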

A data pre-processing application can be built by using (or defining)
icons and assembling the entire application in the Data Sculptor
environment, which is quite easy to use. A number of demo applications
come with the demo diskettes. An on-line hypertext help facility is
also available.

Data Sculptor runs under Windows. The information on Data Sculptor given
here comes from the literature and two demo diskettes.

3.0  LabView and Data Engine

LabView (Laboratory Virtual Instrument Engineering Workbench) is a
product developed by National Instruments. It is, however, available with
Data Engine, a data analysis product developed by MIT in Germany.
LabView, a high-level programming environment, has been developed to
simplify scientific computation, process control analysis, and test and
measurement applications. It is far more sophisticated than other data
pre-processing systems. Unlike other programming systems, which are text
based, LabView is graphics based and lets users create data viewing and
simulation programs in block diagram form. LabView also contains
application-specific libraries for data acquisition, data analysis, data
presentation, and data storage. It even comes with its own GUI builder
facilities (called the front panel), so that an application can be
monitored and run through an interface that simulates the panel of a
physical instrument. There are also a number of LabView companion products
that have been developed by users or suppliers of this product.

4.0  OS/2 Visualizer

The Visualizer comes with OS/2 and is installed on the PCs of the IDS
project. Its main function is support for data visualization, and it
consists of three modules: (i) Charts, (ii) Statistics, and (iii) Query.
The Visualizer Charts module supports a variety of chart types, for
example line, pie, bar, scatter, surface, and mixed charts. The Visualizer
Statistics module supports 57 statistical methods in seven categories:
(i) Exploratory methods, (ii) Distributions, (iii) Relations, (iv) Quality
control, (v) Model fitting, (vi) Analysis of variance, and (vii) Tests.
Each of these categories consists of several features that are useful
for statistical analysis of data. The Visualizer Query module supports a
number of query tasks on the data. These include accessing and working
with the data currently in use, creating and storing new tables in the
database, combining data from many tables, and many more. It is not
evident from the documentation whether we can perform some form of data
transformation or preprocessing on the queried data so that a
preprocessed data file is created for further analysis.


================================================================
From: Matthijs Kadijk

I personally think that AWK is the best and most general tool for those
purposes, but for those who want something less general but easy to use I
suggest dm (a data manipulator), which is part of Gary Perlman's
UNIX|STAT package. It should be no problem to find it on the net.

I also use the UNIX|STAT programs to analyse the results of simulations
with my NN programs.

I'll attach the dm tutorial to this mail  (DLS: not included in this summary).

Matthijs Kadijk
 _____________________   ______________________________ 
/ Matthijs Kadijk     \ / email: kkm at bouw.tno.nl        \ 
| TNO-Bouw, Postbus 49 |  www:   http://www.bouw.tno.nl  \___________________
| NL-2600 AA Delft     |  tel: +31 - 15 - 842 195  /\     fax: +31 15 843975 \
\_____________________/ \ ________________________/  \_ _____________________/


=====================================================================

From: stefanos at vuse.vanderbilt.edu (Stefanos Manganaris)

I have been using this code to read UCI and C4.5 data files into LISP.
It will enable you to manipulate the records in LISP.  All you need to
do is define, once, an appropriate "make-instance" function for each of
the learning systems you use.

Stef.

-- 
Stefanos Manganaris.
Computer Science Department, Vanderbilt University, Nashville, Tennessee.
http://www.vuse.vanderbilt.edu/~stefanos/stefanos.html


-------------------------------- cut here ------------------------------------

#|============================================================================

		  READ IN LISP UCI and C4.5 DATA FILES

  $Id: read-data.cl,v 1.1 1995/04/12 04:13:51 stefanos Exp $

  Last Edited: Apr 11/95 23:08 CDT by stefanos at worf (Stefanos Manganaris)

  Written by Stefanos Manganaris, Computer Sciences, Vanderbilt University.

  stefanos at vuse.vanderbilt.edu

  http://www.vuse.vanderbilt.edu/~stefanos/stefanos.html

============================================================================|#

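;; ANSI Common Lisp dropped the old "USER" package name; "CL-USER" is the
;; standard nickname for COMMON-LISP-USER.
(in-package "CL-USER")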

(defvar *eol* nil)


(defun make-simple-instance (class attributes)
  "A simple example for read-data-file's make-instance-f argument.  Change
this function to return instances in whatever format your learner expects."
  (cons class attributes))

;; Usage:
;; (read-data-file "file.data" #'make-simple-instance)


#|____________________________________________________________Sat Feb  4/95____

   Function  - READ-DATA-FILE

       Reads the UCI or C4.5 FILE and returns a list of instances.  Each
   instance is created by supplying its class and attribute values to
   MAKE-INSTANCE-F.  Note:

   * The list of instances is returned in reverse order.
   * Spaces are not allowed as part of class names or values.
   * Make sure there is a new line before EOF.

   Inputs    -> file make-instance-f

   Returns   -> list of instances

   History   -> 
     Sat Feb  4/95: Created 
_______________________________________________________________________Stef__|#

(defun read-data-file (file make-instance-f)
  "Args: file make-instance-f
   Reads the UCI or C4.5 FILE and returns a list of instances."
  (let ((instances nil)
	(last-token nil))
    (multiple-value-bind (f-comma commap)
	(get-macro-character #\,)
      (set-macro-character #\, #'comma-reader nil)
      (set-macro-character #\newline #'newline-reader nil)
      (with-open-file (stream file :direction :input)
	(loop
	  (setq *eol* nil)
	  (setq last-token
		(do ((token (read stream t)
			    (read stream t))
		     (attribute-values nil))
		    (*eol*
		     (if last-token
			 (push (funcall make-instance-f
					last-token
					attribute-values)
			       instances))
		     (return token))
		  (if last-token
		      (setq attribute-values
			    (nconc attribute-values
				   (cons last-token nil))))
		  (setq last-token token)))
	  (if (null last-token)
	      (return))))
      (set-macro-character #\, f-comma commap)
      (set-syntax-from-char #\newline #\newline))
    (return-from read-data-file instances)))


#|____________________________________________________________Sat Feb  4/95____

   Function  - COMMA-READER

       Special reader function for comma characters in UCI and C4.5 files.

   Inputs    -> stream char

   Returns   -> no values, so the reader skips the comma like whitespace

   History   -> 
     Sat Feb  4/95: Created 
_______________________________________________________________________Stef__|#

(defun comma-reader (stream char)
  "Args: stream char
   Special reader function for comma characters in UCI and C4.5 files."
  (declare (ignore stream char))
  (values))


#|____________________________________________________________Sat Feb  4/95____

   Function  - NEWLINE-READER

       Special reader function for newline characters in UCI and C4.5 files.

   Inputs    -> stream char

   Returns   -> the next object read from STREAM, or NIL at end of file

   History   -> 
     Sat Feb  4/95: Created 
_______________________________________________________________________Stef__|#

(defun newline-reader (stream char)
  "Args: stream char
   Special reader function for newline characters in UCI and C4.5 files."
  (declare (ignore char))
  (setq *eol* t)
  (read stream nil nil t))

;; EOF
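
For readers without a LISP system, the same idea can be sketched in Python.
This is only a rough analogue of the code above, under stated assumptions:
each non-blank line holds comma-separated attribute values with the class
as the last field, values are kept as strings rather than parsed by a
reader, and instances come back in file order (the LISP version returns
them in reverse order). The names read_data_lines, read_data_file and
make_simple_instance are hypothetical and chosen to mirror the LISP names.

```python
def make_simple_instance(cls, attributes):
    """Mirror of make-simple-instance: class first, then attribute values."""
    return (cls, *attributes)


def read_data_lines(lines, make_instance):
    """Build one instance per non-blank line of comma-separated fields.

    The last field is taken as the class; all preceding fields are the
    attribute values. Whitespace around each field is stripped.
    """
    instances = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(",")]
        attributes, cls = fields[:-1], fields[-1]
        instances.append(make_instance(cls, attributes))
    return instances


def read_data_file(path, make_instance):
    """Read the file at `path` and return a list of instances."""
    with open(path) as stream:
        return read_data_lines(stream, make_instance)


# Example with in-memory lines instead of a file on disk.
sample = [
    "5.1,3.5,1.4,0.2,Iris-setosa\n",
    "\n",
    "7.0, 3.2, 4.7, 1.4, Iris-versicolor\n",
]
for instance in read_data_lines(sample, make_simple_instance):
    print(instance)
```

As in the LISP version, only the make_instance function needs to change
to suit a different learning system.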

=======================================================================



