Summary of responses on data transformation tools
Danny L. Silver
dsilver at csd.uwo.ca
Fri Jul 7 09:35:45 EDT 1995
Some time ago (May/95), I requested additional information on data
transformation tools:
> Many of us spend hours preparing data files for acceptance by
> machine learning systems. Typically, I use awk or C code to transform
> ASCII records into numeric or symbolic attribute tuples for a neural net,
> inductive decision tree, etc. Before re-inventing the wheel, has anyone
> developed a general tool for performing some of the more common
> transformations? Any related suggestions would be of great use to many
> on the network.
Below is a summary of the most informative responses I received.
Sorry for the delay.
. Danny
--
=========================================================================
= Daniel L. Silver University of Western Ontario, London, Canada =
= N6A 3K7 - Dept. of Comp. Sci. - Office: MC27b =
= dsilver at csd.uwo.ca H: (519)473-6168 O: (519)679-2111 (ext.6903) =
=========================================================================
From: A. Famili
I have done quite a bit of work in this area, on data preparation and data
pre-processing, and also on rule post-processing in induction. The
induction system that we have built includes some data pre-processing
capabilities. I am also organizing, and will be chairing, a panel
on the "Role of data pre-processing in Intelligent Data Analysis" at the
IDA-95 Symposium (Intelligent Data Analysis Symposium, to be held in
Germany in Aug. 1995).
The most common tool in the market is NeuralWare's Data Sculptor
(I have only seen the brochure and a demo).
It is claimed to be a general-purpose tool. Others are
covered in the short report that I am sending you below.
A. Famili, Ph.D.
Senior Research Scientist
Knowledge Systems Lab.
IIT- NRC, Bldg. M-50
Montreal Rd. Ottawa, Ont.
K1A 0R6 Canada
Phone: (613) 993-8554
Fax : (613) 952-7151
email: famili at ai.iit.nrc.ca
---------------------------
A. Famili
Knowledge Systems Laboratory
Institute for Information Technology
National Research Council Canada
1.0 Introduction
This report outlines a comparison of three commercial data
pre-processing tools that are available in the market. The purpose
of the study was to identify features of these tools that could be
helpful in intelligent filtering and data analysis for the IDS
project. The comparison did not involve use or evaluation of any of
these tools on real data. Two of the tools (LabView and OS/2
Visualizer) are available in the KSL.
2.0 Data Sculptor
Data Sculptor was developed by NeuralWare on the premise that, in neural
network data analysis applications, 80 percent of the time is spent on
data preprocessing. The tool is meant to handle any type of
transformation or manipulation of data before the data are analysed.
The graphics capabilities include: histograms, bar charts, line,
pie and scatter plots. There are several stat. functions to be used on
the data. There are also options to create new variables (attribute
vectors) based on transformation of other variables. Following are some
important specifications, as explained in the fact sheets and demo
version:
- Input Data Formats: dBase, Excel, Paradox, Fixed Field, ASCII,
and Binary.
- Output Data Formats: Fixed Field, Delimited ASCII and Binary
- General Data Transformations: Sorting, File Merge, Field Comparison,
Sieve and Duplicate and Neighborhood.
- Math. Transformations: Arithmetic, Trigonometric, and Exponential.
- Special Transformations: Encodings of the type One-of-N, Fuzzy
One-of-N, R-of-N, Analog R-of-N, Thermometer, Circular, and
Inverse Thermometer, Normalizing Functions, Fast Fourier
Transformations and some more.
- Stat. Functions: Count, Sum, Mean, Chi-square, Min, Max, STD,
Variance, Correlation and some more.
- Graph Formats: Bar chart, Histogram, Scatter Plot, Pie, etc.
- Spreadsheet: Data Viewing, and Search Function.
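To make a few of the listed transformations concrete, here is a minimal Python sketch of One-of-N encoding, thermometer encoding, and a min-max normalizing function. It is an illustration only and is not taken from Data Sculptor: the function names are hypothetical, and the conventions (1 marks the selected category; a thermometer code sets every level up to and including the value) are assumptions, since the fact sheets do not spell them out.

```python
def one_of_n(value, categories):
    """One-of-N: a vector of len(categories) with a single 1 at value's slot."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


def thermometer(value, levels):
    """Thermometer: levels are ordered; every level up to value is set to 1."""
    rank = levels.index(value)  # raises ValueError for an unknown level
    return [1 if i <= rank else 0 for i in range(len(levels))]


def min_max_normalize(xs):
    """Rescale numbers linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]
```

For instance, with the ordered levels low, medium, high, the value "medium" becomes a one-of-N vector with only the middle unit on, while its thermometer code has the first two units on.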
A data pre-processing application can be built by using (or defining)
icons and assembling the entire application in the Data Sculptor
environment, which is quite easy to use. A number of demo
applications come with the demo diskettes. An on-line hypertext help
facility is also available.
Data Sculptor runs under Windows. Information for Data Sculptor comes
from the literature and two demo diskettes.
3.0 LabView and Data Engine
LabView (Laboratory Virtual Instrument Engineering Workbench) is a
product developed by National Instruments. It is, however, available with
Data Engine, a data analysis product developed by MIT in Germany.
LabView, a high-level programming environment, was developed to
simplify scientific computation, process control analysis, and
test and measurement applications. It is far more sophisticated than
other data pre-processing systems. Unlike other programming systems,
which are text based, LabView is graphics based and lets users create
data viewing and simulation programs in block diagram form. LabView also
contains application-specific libraries for data acquisition, data
analysis, data presentation, and data storage. It even comes with its
own GUI builder facility (called the front panel), so that an application
can be monitored and run from a simulated panel of a physical instrument.
There are also a number of LabView companion products that have been
developed by users or suppliers of this product.
4.0 OS/2 Visualizer
The Visualizer comes with OS/2 and is installed on the PCs of the IDS
project. Its main function is support for data visualization, and it
consists of three modules: (i) Charts, (ii) Statistics, and (iii) Query.
The Visualizer Charts module supports a variety of chart types.
Examples are line, pie, bar, scatter, surface, and mixed charts.
The Visualizer Statistics module supports 57 statistical methods
in seven categories: (i) Exploratory methods, (ii) Distributions,
(iii) Relations, (iv) Quality control, (v) Model fitting, (vi) Analysis
of variance, and (vii) Tests. Each of these categories consists of
several features that are useful for statistical analysis of data.
The Visualizer Query module supports a number of query tasks on the
data. These include accessing and working with the data currently in
use, creating and storing new tables in the database, combining data
from many tables, and many more. It is not evident from the
documentation whether we can perform some form of data transformation
or preprocessing on the queried data so that a preprocessed data file
is created for further analysis.
================================================================
From: Matthijs Kadijk
I personally think that AWK is the best and most general tool for those
purposes, but for those who want something less general but easy to use, I
suggest dm (a data manipulator), which is part of Gary Perlman's
UNIX|STAT package. It should be no problem to find it on the net.
I also use the UNIX|STAT programs to analyse the results of simulations
with my NN programs.
I'll attach the dm tutorial to this mail (DLS: not included in this summary).
Matthijs Kadijk
_____________________ ______________________________
/ Matthijs Kadijk \ / email: kkm at bouw.tno.nl \
| TNO-Bouw, Postbus 49 | www: http://www.bouw.tno.nl \___________________
| NL-2600 AA Delft | tel: +31 - 15 - 842 195 /\ fax: +31 15 843975 \
\_____________________/ \ ________________________/ \_ _____________________/
=====================================================================
From: stefanos at vuse.vanderbilt.edu (Stefanos Manganaris)
I have been using this code to read UCI and C4.5 data files into LISP.
It will enable you to manipulate the records in LISP. All you need to
do is define, once, an appropriate "make-instance" function for each of
the learning systems you use.
Stef.
--
Stefanos Manganaris.
Computer Science Department, Vanderbilt University, Nashville, Tennessee.
http://www.vuse.vanderbilt.edu/~stefanos/stefanos.html
-------------------------------- cut here ------------------------------------
#|============================================================================
READ IN LISP UCI and C4.5 DATA FILES
$Id: read-data.cl,v 1.1 1995/04/12 04:13:51 stefanos Exp $
Last Edited: Apr 11/95 23:08 CDT by stefanos at worf (Stefanos Manganaris)
Written by Stefanos Manganaris, Computer Sciences, Vanderbilt University.
stefanos at vuse.vanderbilt.edu
http://www.vuse.vanderbilt.edu/~stefanos/stefanos.html
============================================================================|#
(in-package "USER")
(defvar *eol* nil)
(defun make-simple-instance (class attributes)
  "A simple example for read-data-file's make-instance-f argument. Change
this function to return instances in whatever format your learner expects."
  (cons class attributes))
;; Usage:
;; (read-data-file "file.data" #'make-simple-instance)
#|____________________________________________________________Sat Feb 4/95____
Function - READ-DATA-FILE
Reads the UCI or C4.5 FILE and returns a list of instances. Each
instance is created by supplying its class and attribute values to
MAKE-INSTANCE-F. Note:
* The list of instances is returned in reverse order.
* Spaces are not allowed as part of class names or values.
* Make sure there is a new line before EOF.
Inputs -> file make-instance-f
Returns -> list of instances
History ->
Sat Feb 4/95: Created
_______________________________________________________________________Stef__|#
(defun read-data-file (file make-instance-f)
  "Args: file make-instance-f
Reads the UCI or C4.5 FILE and returns a list of instances."
  (let ((instances nil)
        (last-token nil))
    (multiple-value-bind (f-comma commap)
        (get-macro-character #\,)
      (set-macro-character #\, #'comma-reader nil)
      (set-macro-character #\newline #'newline-reader nil)
      (with-open-file (stream file :direction :input)
        (loop
          (setq *eol* nil)
          (setq last-token
                (do ((token (read stream t)
                            (read stream t))
                     (attribute-values nil))
                    (*eol*
                     (if last-token
                         (push (funcall make-instance-f
                                        last-token
                                        attribute-values)
                               instances))
                     (return token))
                  (if last-token
                      (setq attribute-values
                            (nconc attribute-values
                                   (cons last-token nil))))
                  (setq last-token token)))
          (if (null last-token)
              (return))))
      (set-macro-character #\, f-comma commap)
      (set-syntax-from-char #\newline #\newline))
    (return-from read-data-file instances)))
#|____________________________________________________________Sat Feb 4/95____
Function - COMMA-READER
Special reader function for comma characters in UCI and C4.5 files.
Inputs -> stream char
Returns ->
History ->
Sat Feb 4/95: Created
_______________________________________________________________________Stef__|#
(defun comma-reader (stream char)
  "Args: stream char
Special reader function for comma characters in UCI and C4.5 files."
  (declare (ignore stream char))
  (values))
#|____________________________________________________________Sat Feb 4/95____
Function - NEWLINE-READER
Special reader function for newline characters in UCI and C4.5 files.
Inputs -> stream char
Returns ->
History ->
Sat Feb 4/95: Created
_______________________________________________________________________Stef__|#
(defun newline-reader (stream char)
  "Args: stream char
Special reader function for newline characters in UCI and C4.5 files."
  (declare (ignore char))
  (setq *eol* t)
  (read stream nil nil t))
;; EOF
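
For readers without a LISP system at hand, the same idea (the last field on each line is the class, the preceding fields are the attribute values, and a user-supplied function builds the instance) can be sketched in Python. This is a rough analogue and not a translation of the code above. The names read_data_file and make_simple_instance are reused only for illustration. It assumes plain comma-separated lines with no comment or trailing-period handling, it keeps every value as a string, and it returns instances in file order rather than reversed.

```python
import io


def make_simple_instance(cls, attributes):
    """Example instance builder: pair the class label with its attributes."""
    return (cls, attributes)


def read_data_file(lines, make_instance):
    """Build one instance per non-blank line of comma-separated fields.

    `lines` is any iterable of strings, such as an open text file.
    The last field is taken as the class; the rest are attribute values.
    """
    instances = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(",")]
        *attributes, cls = fields
        instances.append(make_instance(cls, attributes))
    return instances


if __name__ == "__main__":
    # Hypothetical two-record sample standing in for a real data file.
    sample = io.StringIO("5.1,3.5,setosa\n4.9,3.0,setosa\n")
    for instance in read_data_file(sample, make_simple_instance):
        print(instance)
```

As in the LISP version, supporting another learning system only requires passing a different instance-building function.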
=======================================================================