[Research] internal representation of datsets

Steven Brudenell sbrudene at andrew.cmu.edu
Fri Apr 16 11:17:50 EDT 2010


To add to John's two cents:

Plus one vote in favor of a well-planned, coordinated transition to C++. This would solve quite a lot of problems.

In the future, we should probably borrow heavily from the various SQL APIs in the world. They solve many of the same problems as our datset API in terms of data access, and they are also very well-traveled.
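
To make the SQL comparison concrete, here is a rough sketch of what per-column, per-type access could look like. Everything below is made up (none of these names exist in our code); the shape is just borrowed from the sqlite3_column_* style of getters, and it happens to cover John's points (c) and (d) below as well:

// Hypothetical sketch only -- none of these names exist in our datset code.
// The idea borrowed from SQL APIs: the caller asks for the type it wants,
// per column, instead of every value being squeezed through a float.
#include <string>

class Datset;   // placeholder for whatever the loaded dataset object becomes

enum ColType { COL_STRING, COL_INT, COL_BOOL, COL_DOUBLE };

class DatsetCursor {
public:
    bool        step();                        // advance to the next row
    ColType     column_type(int col) const;    // declared or inferred column type
    double      column_double(int col) const;  // full double precision
    long long   column_int(int col) const;     // exact for large integer IDs
    bool        column_bool(int col) const;
    std::string column_text(int col) const;    // original string read from the .csv
};

DatsetCursor open_cursor(const Datset &ds);    // also hypothetical

Under something like that, the ID problem goes away for anyone who asks for column_int or column_text, and column_double is free to really be a double.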

I do think changing floats to doubles, plus rigorous testing, is the best solution for now.
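
For anyone who hasn't been bitten yet, the failure Jeff hit is easy to reproduce outside our code. The snippet below is only an illustration with a made-up 14-digit ID (it does not touch our pvector): a float carries roughly 7 significant decimal digits, so the low digits of a large ID are rounded away when a double is stored as a float.

// Standalone illustration of the round-off; not our pvector code.
// A float has a 24-bit significand, so large integers (well above 2^24)
// can no longer be represented exactly.
#include <cstdio>

int main() {
    double id = 20100416123456.0;   // made-up 14-digit ID, exact as a double
    float  f  = (float) id;         // what a float-backed store effectively keeps

    printf("as double: %.0f\n", id);
    printf("as float : %.0f\n", (double) f);   // the last several digits are gone
    return 0;
}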

"John K. Ostlund" <jostlund at cs.cmu.edu> wrote:

>Hi Jeff (et al),
>
>This is something "we" have known about for some
>time, but I guess you weren't part of "we"--sorry!
>Let me guess: You got burned by large integer ID
>numbers losing precision in the last few digits?
>
>We (which included Artur, as I recall) punted on
>the issue because, as you say, this is a change in
>the core of the code that (a) will cause datsets
>to take more memory and (b) will change the results
>produced by algorithms in some cases, in terms of
>sorting classification results that are very close.
>It's also part of a larger issue, in terms of what
>a datset should contain and how smart our datset
>loading algorithm should be.
>
>My own observation is that, most of the time, the
>size of the datset in memory is not nearly as large
>as the size of the other data structures built from
>the datset, so changing float to double is probably
>a good idea without too many implications that way.
>But this isn't true in *all* cases.  Also, a considerable
>amount of testing will need to be done to validate all
>the datset interaction functions.  Are there hidden
>dependencies on float-vs-double?
>
>My pie-in-the-sky preferred solution to the whole
>business would be to (a) use Microsoft Excel's rules
>for smart .csv file loading, (b) make it easy within
>the .csv file header AND via command line options/
>load function arguments for the user to specify exact
>handling of each column, (c) distinguish between string,
>int, bool, and double, not just string and float,
>(d) never lose track of the original string read in
>for each cell from the .csv file, and, (e) (big deep
>breath) use this as part of an excuse to switch to C++.
>
>But in the short run, changing float to double and doing
>exhaustive testing would be easier.
>
>My two cents,
>
>- John O.
>
>
>
>On Thu, April 15, 2010 10:38 pm, Jeff Schneider wrote:
>Hi guys,
>
>I just (quite painfully) discovered that our datset implementation actually
>stores doubles as floats internally in something called a pvector.
>
>I'd like to change these to be doubles so what I experienced doesn't
>happen to anyone else.  HOWEVER, that seems like a big change to the
>very core of the code.  And at the least it will certainly cause
>datsets to consume more memory internally.
>
>Any thoughts/advice on doing this?  Or suggestions on alternate ways to
>not get burned by this again in those cases where you really want
>double precision?
>
>Jeff.
>_______________________________________________
>Research mailing list
>Research at autonlab.org
>https://www.autonlab.org/mailman/listinfo/research
