[Research] internal representation of datsets

Paul Komarek komarek.paul at gmail.com
Fri Apr 16 11:18:24 EDT 2010


I remember this, and at the time I decided that I didn't care about
the representation on disk.  However, if you store intermediate
computation results in an fds file, I can totally see that you'd be
unhappy with the result.
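
To make the precision loss concrete, here is a minimal sketch in Python. It assumes only that a value is narrowed to a 32-bit IEEE 754 float and then widened again, which is roughly what storing a double in a float field does. The helper `to_float32` and the sample ID are hypothetical and are not taken from the datset or pvector code.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest 32-bit float, then widen it back to a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Hypothetical 9-digit integer ID, as might appear in a .csv column.
record_id = 123456789
stored = to_float32(float(record_id))

print(int(stored))               # the ID after a trip through a float
print(int(stored) == record_id)  # False: the last digits changed

# A float has a 24-bit significand, so integers are exact only up to 2**24.
print(to_float32(2.0**24) == 2.0**24)          # True
print(to_float32(2.0**24 + 1) == 2.0**24 + 1)  # False
```

A double has a 53-bit significand, so the same IDs survive unchanged as long as they stay below 2**53.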

In order to prevent backwards compatibility issues, I wouldn't change
the .fds format.  Instead, I'd look for a compressed csv format.  When
I open sourced my LR code, this is what I ended up doing.  You can
download my stuff from http://komarix.org/lr if you want to see
details.  There are many benefits, including cross-platform
compatibility and less code on your side.
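
As a rough illustration of the compressed csv idea, the sketch below writes doubles to a gzip-compressed csv file and reads them back unchanged. It is not the format used by the LR code at the URL above; the function names, the file name, and the choice of gzip are assumptions made for the example.

```python
import csv
import gzip
import os
import tempfile

def write_csv_gz(path, header, rows):
    """Write a header and rows of floats as gzip-compressed csv text."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            # repr() emits the shortest text that parses back to the same double.
            writer.writerow([repr(v) for v in row])

def read_csv_gz(path):
    """Read the file back, parsing every data cell as a double."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [[float(cell) for cell in row] for row in reader]
    return header, rows

header = ["id", "score"]
rows = [[123456789.0, 0.1], [16777217.0, 1.0 / 3.0]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "intermediate.csv.gz")  # hypothetical file name
    write_csv_gz(path, header, rows)
    loaded_header, loaded_rows = read_csv_gz(path)

# Text output of doubles round-trips exactly, so nothing is lost on disk.
print(loaded_header == header and loaded_rows == rows)
```

Because the file is plain text under the compression, any tool that can read gzip and csv can consume it, which is where the cross-platform benefit comes from.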

On Fri, Apr 16, 2010 at 8:00 AM, Robin Sabhnani <sabhnani+ at cs.cmu.edu> wrote:
> How about implementing a compiler flag that loads higher precision data?
> We could default to float to make it backward compatible and memory
> efficient.
>
> Robin
>
> John K. Ostlund wrote:
>> Hi Jeff (et al),
>>
>> This is something "we" have known about for some
>> time, but I guess you weren't part of "we"--sorry!
>> Let me guess: You got burned by large integer ID
>> numbers losing precision in the last few digits?
>>
>> We (which included Artur, as I recall) punted on
>> the issue because, as you say, this is a change in
>> the core of the code that (a) will cause datsets
>> to take more memory and (b) will change the results
>> produced by algorithms in some cases, in terms of
>> sorting classification results that are very close.
>> It's also part of a larger issue, in terms of what
>> a datset should contain and how smart our datset
>> loading algorithm should be.
>>
>> My own observation is that, most of the time, the
>> size of the datset in memory is not nearly as large
>> as the size of the other data structures built from
>> the datset, so changing float to double is probably
>> a good idea without too many implications that way.
>> But this isn't true in *all* cases.  Also, a considerable
>> amount of testing will need to be done to validate all
>> the datset interaction functions.  Are there hidden
>> dependencies on float-vs-double?
>>
>> My pie-in-the-sky preferred solution to the whole
>> business would be to (a) use Microsoft Excel's rules
>> for smart .csv file loading, (b) make it easy within
>> the .csv file header AND via command line options/
>> load function arguments for the user to specify exact
>> handling of each column, (c) distinguish between string,
>> int, bool, and double, not just string and float,
>> (d) never lose track of the original string read in
>> for each cell from the .csv file, and, (e) (big deep
>> breath) use this as part of an excuse to switch to C++.
>>
>> But in the short run, changing float to double and doing
>> exhaustive testing would be easier.
>>
>> My two cents,
>>
>> - John O.
>>
>>
>>
>> On Thu, April 15, 2010 10:38 pm, Jeff Schneider wrote:
>> Hi guys,
>>
>> I just (quite painfully) discovered that our datset implementation actually
>> stores doubles as floats internally in something called a pvector.
>>
>> I'd like to change these to be doubles so what I experienced doesn't
>> happen to anyone else.  HOWEVER, that seems like a big change to the very
>> core of the code.  And at the least it will certainly cause datsets to
>> consume more memory internally.
>>
>> Any thoughts/advice on doing this?  Or suggestions on alternate ways to
>> not get burned by this again in those cases where you really want double
>> precision?
>>
>> Jeff.
>> _______________________________________________
>> Research mailing list
>> Research at autonlab.org
>> https://www.autonlab.org/mailman/listinfo/research


