[Research] internal representation of datsets
Robin Sabhnani
sabhnani+ at cs.cmu.edu
Fri Apr 16 11:00:12 EDT 2010
How about implementing a compile-time flag that loads higher-precision data?
We could default to float to keep it backward compatible and memory
efficient.
Robin
John K. Ostlund wrote:
> Hi Jeff (et al),
>
> This is something "we" have known about for some
> time, but I guess you weren't part of "we"--sorry!
> Let me guess: You got burned by large integer ID
> numbers losing precision in the last few digits?
>
> We (which included Artur, as I recall) punted on
> the issue because, as you say, this is a change in
> the core of the code that (a) will cause datsets
> to take more memory and (b) will change the results
> produced by algorithms in some cases, in terms of
> sorting classification results that are very close.
> It's also part of a larger issue, in terms of what
> a datset should contain and how smart our datset
> loading algorithm should be.
>
> My own observation is that, most of the time, the
> size of the datset in memory is not nearly as large
> as the size of the other data structures built from
> the datset, so changing float to double is probably
> a good idea without too many implications that way.
> But this isn't true in *all* cases. Also, a considerable
> amount of testing will need to be done to validate all
> the datset interaction functions. Are there hidden
> dependencies on float-vs-double?
>
> My pie-in-the-sky preferred solution to the whole
> business would be to (a) use Microsoft Excel's rules
> for smart .csv file loading, (b) make it easy within
> the .csv file header AND via command line options/
> load function arguments for the user to specify exact
> handling of each column, (c) distinguish between string,
> int, bool, and double, not just string and float,
> (d) never lose track of the original string read in
> for each cell from the .csv file, and, (e) (big deep
> breath) use this as part of an excuse to switch to C++.
>
> But in the short run, changing float to double and doing
> exhaustive testing would be easier.
>
> My two cents,
>
> - John O.
>
>
>
> On Thu, April 15, 2010 10:38 pm, Jeff Schneider wrote:
> Hi guys,
>
> I just (quite painfully) discovered that our datset implementation actually
> stores doubles as floats internally, in something called a pvector.
>
> I'd like to change these to be doubles so what I experienced doesn't
> happen to
> anyone else. HOWEVER, that seems like a big change to the very core of the
> code. And at the least it will certainly cause datsets to consume more
> memory
> internally.
>
> Any thoughts/advice on doing this? Or suggestions on alternate ways to
> not get
> burned by this again in those cases where you really want double precision?
>
> Jeff.
> _______________________________________________
> Research mailing list
> Research at autonlab.org
> https://www.autonlab.org/mailman/listinfo/research
>