[Research] internal representation of datsets

John K. Ostlund jostlund at cs.cmu.edu
Fri Apr 16 09:59:48 EDT 2010


Hi Jeff (et al),

This is something "we" have known about for some
time, but I guess you weren't part of "we"--sorry!
Let me guess: You got burned by large integer ID
numbers losing precision in the last few digits?
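For the record, here is a tiny standalone C program (nothing to do with the
datset code itself) showing the effect: a 32-bit float only carries about 7
significant decimal digits, so a large integer ID comes back with its
low-order digits mangled.

    #include <stdio.h>

    int main(void)
    {
        double id_in  = 1234567890123.0;   /* a 13-digit ID */
        float  stored = (float) id_in;     /* what a float-backed column keeps */
        double id_out = (double) stored;   /* what the caller gets back */

        printf("original:  %.1f\n", id_in);
        printf("via float: %.1f\n", id_out);
        /* On an IEEE 754 machine this prints:
         *   original:  1234567890123.0
         *   via float: 1234567954432.0   <- the last several digits are gone */
        return 0;
    }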

We (which included Artur, as I recall) punted on
the issue because, as you say, this is a change in
the core of the code that (a) will cause datsets
to take more memory and (b) will change the results
produced by algorithms in some cases when sorting
classification results that are very close.
It's also part of a larger issue, in terms of what
a datset should contain and how smart our datset
loading algorithm should be.

My own observation is that, most of the time, the
size of the datset in memory is not nearly as large
as the size of the other data structures built from
the datset, so changing float to double is probably
a good idea without serious memory implications.
But this isn't true in *all* cases.  Also, a considerable
amount of testing will need to be done to validate all
the datset interaction functions.  Are there hidden
dependencies on float-vs-double?
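To make that question concrete, these are the kinds of hidden dependencies
I'd grep for (illustrative C only, not actual datset code):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void suspicious_patterns(const float *values, size_t n,
                                    FILE *fp, float a, float b)
    {
        /* 1. Binary I/O sized to the element type: switching to double
         *    silently changes any cached or on-disk format. */
        fwrite(values, sizeof(float), n, fp);

        /* 2. Tolerances tuned for single precision: values that used to
         *    tie may now compare (and sort) differently. */
        if (fabsf(a - b) < 1e-6f)
            printf("treated as equal\n");

        /* 3. Hard-coded element sizes instead of sizeof(type). */
        float *buf = malloc(n * 4);   /* should be n * sizeof(*buf) */
        free(buf);
    }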

My pie-in-the-sky preferred solution to the whole
business would be to (a) use Microsoft Excel's rules
for smart .csv file loading, (b) make it easy within
the .csv file header AND via command line options/
load function arguments for the user to specify exact
handling of each column, (c) distinguish between string,
int, bool, and double, not just string and float,
(d) never lose track of the original string read in
for each cell from the .csv file, and, (e) (big deep
breath) use this as part of an excuse to switch to C++.
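For (c) and (d), I'm picturing something like the following cell layout
(names are mine, not anything that exists today), where each cell keeps a
parsed, typed value alongside the exact text that came out of the .csv file:

    #include <stdbool.h>

    typedef enum { CELL_STRING, CELL_INT, CELL_BOOL, CELL_DOUBLE } cell_type;

    typedef struct {
        cell_type type;
        char *raw;            /* the exact string read from the .csv file */
        union {
            long long i;      /* CELL_INT  */
            bool      b;      /* CELL_BOOL */
            double    d;      /* CELL_DOUBLE */
            /* CELL_STRING: just use raw */
        } value;
    } cell;

The per-column overrides from (b) would then just be a cell_type chosen for
each column before parsing starts.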

But in the short run, changing float to double and doing
exhaustive testing would be easier.
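By "exhaustive testing" I mainly mean round-trip checks with values that
only survive at double precision; I've left the datset accessors out of the
sketch below rather than guess at names, but the idea is that get(set(x))
should come back bit-for-bit equal:

    #include <assert.h>

    int main(void)
    {
        const double tricky[] = { 1234567890123.0,  /* large integer ID */
                                  0.1,              /* not exact in binary */
                                  1.0 + 1e-12 };    /* collapses to 1.0 in float */

        for (int i = 0; i < 3; i++) {
            /* Today, going through a float breaks the round trip... */
            assert((double)(float) tricky[i] != tricky[i]);
            /* ...after the change, the same values pushed through the real
             * datset set/get functions should come back exactly equal. */
        }
        return 0;
    }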

My two cents,

- John O.



On Thu, April 15, 2010 10:38 pm, Jeff Schneider wrote:
Hi guys,

I just (quite painfully) discovered that our datset implementation actually
stores doubles as floats internally, in something called a pvector.

I'd like to change these to be doubles so that what I experienced doesn't
happen to anyone else.  HOWEVER, that seems like a big change to the very
core of the code.  And at the least it will certainly cause datsets to
consume more memory internally.

Any thoughts/advice on doing this?  Or suggestions on alternate ways to not
get burned by this again in those cases where you really want double
precision?

Jeff.