From schneide at cs.cmu.edu Thu Apr 15 22:38:33 2010
From: schneide at cs.cmu.edu (Jeff Schneider)
Date: Thu, 15 Apr 2010 22:38:33 -0400
Subject: [Research] internal representation of datsets
Message-ID: <4BC7CDA9.70902@cs.cmu.edu>

Hi guys,

I just (quite painfully) discovered that our datset implementation actually stores doubles as floats internally in something called a pvector.

I'd like to change these to be doubles so what I experienced doesn't happen to anyone else. HOWEVER, that seems like a big change to the very core of the code. And at the least it will certainly cause datsets to consume more memory internally.

Any thoughts/advice on doing this? Or suggestions on alternate ways to not get burned by this again in those cases where you really want double precision?

Jeff.

From jostlund at cs.cmu.edu Fri Apr 16 09:59:48 2010
From: jostlund at cs.cmu.edu (John K. Ostlund)
Date: Fri, 16 Apr 2010 09:59:48 -0400 (EDT)
Subject: [Research] internal representation of datsets
In-Reply-To: <4BC7CDA9.70902@cs.cmu.edu>
References: <4BC7CDA9.70902@cs.cmu.edu>
Message-ID: <38428.24.3.147.252.1271426388.squirrel@webmail.cs.cmu.edu>

Hi Jeff (et al),

This is something "we" have known about for some time, but I guess you weren't part of "we"--sorry! Let me guess: you got burned by large integer ID numbers losing precision in the last few digits?

We (which included Artur, as I recall) punted on the issue because, as you say, this is a change in the core of the code that (a) will cause datsets to take more memory and (b) will change the results produced by algorithms in some cases, in terms of sorting classification results that are very close. It's also part of a larger issue, in terms of what a datset should contain and how smart our datset loading algorithm should be.

My own observation is that, most of the time, the size of the datset in memory is not nearly as large as the size of the other data structures built from the datset, so changing float to double is probably a good idea without too many implications that way. But this isn't true in *all* cases. Also, a considerable amount of testing will need to be done to validate all the datset interaction functions. Are there hidden dependencies on float-vs-double?

My pie-in-the-sky preferred solution to the whole business would be to (a) use Microsoft Excel's rules for smart .csv file loading, (b) make it easy within the .csv file header AND via command line options / load function arguments for the user to specify exact handling of each column, (c) distinguish between string, int, bool, and double, not just string and float, (d) never lose track of the original string read in for each cell from the .csv file, and (e) (big deep breath) use this as part of an excuse to switch to C++.

But in the short run, changing float to double and doing exhaustive testing would be easier.

My two cents,

- John O.
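The failure mode John guesses at above is a property of IEEE single precision rather than of the datset code specifically: a float carries a 24-bit significand, so integers above 2^24 (about 16.7 million) can no longer all be represented exactly, and a large record ID silently comes back with its last digits changed. A minimal, self-contained C illustration (no Auton code involved; the ID value is made up):

  #include <stdio.h>

  int main(void)
  {
      /* A large integer ID of the kind a datset column might hold. */
      double id = 20100415223833.0;

      float  f = (float) id;       /* what a float-backed store keeps */
      double r = (double) f;       /* what comes back out             */

      printf("stored    : %.1f\n", id);
      printf("recovered : %.1f\n", r);
      printf("error     : %.1f\n", id - r);
      return 0;
  }

At this magnitude adjacent floats are 2^21 apart, so the recovered ID is off by several hundred thousand, which is exactly the "last few digits" effect described above.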
From sabhnani+ at cs.cmu.edu Fri Apr 16 11:00:12 2010
From: sabhnani+ at cs.cmu.edu (Robin Sabhnani)
Date: Fri, 16 Apr 2010 11:00:12 -0400
Subject: [Research] internal representation of datsets
In-Reply-To: <38428.24.3.147.252.1271426388.squirrel@webmail.cs.cmu.edu>
References: <4BC7CDA9.70902@cs.cmu.edu> <38428.24.3.147.252.1271426388.squirrel@webmail.cs.cmu.edu>
Message-ID: <4BC87B7C.7060208@cs.cmu.edu>

How about implementing a compiler flag that loads higher precision data? We could default to float to make it backward compatible and memory efficient.

Robin
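One concrete (and entirely hypothetical) shape for Robin's suggestion: a single typedef selected by a preprocessor flag, so the default build keeps today's float behaviour and memory footprint, while anyone who needs full precision rebuilds with the flag turned on. The type and macro names below are invented for illustration; they are not in the Auton source tree.

  /* datset_real.h -- sketch only; names are made up. */
  #ifndef DATSET_REAL_H
  #define DATSET_REAL_H

  #ifdef DATSET_DOUBLE_PRECISION
    typedef double dset_real;       /* opt in: build with -DDATSET_DOUBLE_PRECISION */
    #define DSET_REAL_FMT "%.17g"   /* enough digits to round-trip a double */
  #else
    typedef float dset_real;        /* default: current behaviour, half the memory */
    #define DSET_REAL_FMT "%.9g"    /* enough digits to round-trip a float */
  #endif

  #endif /* DATSET_REAL_H */

The caveat John raises still applies either way: any code that quietly assumes sizeof(float) (binary file I/O, memcpy tricks, printf formats) has to be found and audited, so the flag saves the memory argument but not the testing effort.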
From sbrudene at andrew.cmu.edu Fri Apr 16 11:17:50 2010
From: sbrudene at andrew.cmu.edu (Steven Brudenell)
Date: Fri, 16 Apr 2010 11:17:50 -0400
Subject: [Research] internal representation of datsets
Message-ID: <2329wnlsij14t3q53r1o4j4o.1271430654453@email.android.com>

To add to John's two cents: plus one vote in favor of a well-planned, coordinated transition to C++. This would solve quite a lot of problems.

In the future, we should probably borrow heavily from the various SQL APIs in the world. They solve many of the same problems as our datset API, in terms of data access. They are also very well-traveled.

I do think changing floats to doubles, plus rigorous testing, is the best solution for now.
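To make the SQL-API comparison concrete: libraries such as SQLite hand each cell to the caller through typed getters (sqlite3_column_int64, sqlite3_column_double, sqlite3_column_text), so the caller decides how a value is interpreted and the underlying text is never thrown away. A datset-flavoured version of that idea might look like the sketch below; none of these functions exist today, they are only a shape for the discussion, and they would also cover John's points (c) and (d).

  /* Hypothetical typed accessors, loosely modelled on the sqlite3_column_* family. */
  typedef struct datset datset;     /* opaque handle */

  double      datset_cell_double(const datset *ds, int row, int col);
  long long   datset_cell_int64 (const datset *ds, int row, int col);
  int         datset_cell_bool  (const datset *ds, int row, int col);
  const char *datset_cell_text  (const datset *ds, int row, int col);  /* original .csv string */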
From komarek.paul at gmail.com Fri Apr 16 11:18:24 2010
From: komarek.paul at gmail.com (Paul Komarek)
Date: Fri, 16 Apr 2010 08:18:24 -0700
Subject: [Research] internal representation of datsets
In-Reply-To: <4BC87B7C.7060208@cs.cmu.edu>
References: <4BC7CDA9.70902@cs.cmu.edu> <38428.24.3.147.252.1271426388.squirrel@webmail.cs.cmu.edu> <4BC87B7C.7060208@cs.cmu.edu>
Message-ID:

I remember this, and at the time I decided that I didn't care about the representation on disk. However, if you store intermediate computation results in an fds file, I can totally see that you'd be unhappy with the result.

In order to prevent backwards compatibility issues, I wouldn't change the .fds format. Instead, I'd look for a compressed csv format. When I open sourced my LR code, this is what I ended up doing. You can download my stuff from http://komarix.org/lr if you want to see details. There are many benefits, including cross-platform compatibility and less code on your side.
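For anyone who wants to try the compressed-csv route, zlib's gz* calls are nearly a drop-in replacement for stdio, so a loader can keep parsing plain text (and therefore keep every digit of the original value) while the file on disk stays small. A rough sketch of the idea, not the loader Paul links to:

  #include <stdio.h>
  #include <zlib.h>                          /* link with -lz */

  int main(int argc, char **argv)
  {
      if (argc != 2) {
          fprintf(stderr, "usage: %s data.csv.gz\n", argv[0]);
          return 1;
      }
      gzFile in = gzopen(argv[1], "rb");     /* transparently reads uncompressed files too */
      if (in == NULL) {
          fprintf(stderr, "cannot open %s\n", argv[1]);
          return 1;
      }

      char line[65536];
      long n = 0;
      while (gzgets(in, line, (int) sizeof line) != NULL) {
          /* Parse fields here with strtod()/strtoll(); the text still has full precision. */
          n++;
      }
      gzclose(in);
      printf("read %ld lines\n", n);
      return 0;
  }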
From awd at cs.cmu.edu Fri Apr 16 17:17:50 2010
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Fri, 16 Apr 2010 17:17:50 -0400
Subject: [Research] Auton Lab meeting: Wednesday April 21 at 12noon
Message-ID: <4BC8D3FE.6010708@cs.cmu.edu>

Dear Autonians,

The Lab meeting next week will involve practice talks by Liang and Yi before their appearances at the upcoming SDM conference. Abstracts are provided below. Please come along and provide them with your feedback!

Time: Wednesday April 21, 12noon
Karen will confirm the location.

See you,
Artur

---
Liang: Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization

Real-world relational data are seldom stationary, yet traditional collaborative filtering algorithms generally rely on this assumption. Motivated by our sales prediction problem, we propose a factor-based algorithm that is able to take time into account. By introducing additional factors for time, we formalize this problem as a tensor factorization with a special constraint on the time dimension. Further, we provide a fully Bayesian treatment to avoid tuning parameters and achieve automatic model complexity control. To learn the model we develop an efficient sampling procedure that is capable of analysing large-scale data sets. This new algorithm, called Bayesian Probabilistic Tensor Factorization (BPTF), is evaluated on several real-world problems including sales prediction and movie recommendation. Empirical results demonstrate the superiority of our temporal model.
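For anyone who will not read the paper before the talk, the object behind this abstract is a three-way factor model; the notation below is the generic form of such a model, not necessarily the paper's own symbols. Each user i, item j, and time slice k gets a D-dimensional latent vector, and the predicted rating is

    \hat{r}_{ijk} = \sum_{d=1}^{D} u_{id} \, v_{jd} \, t_{kd}

The "special constraint on the time dimension" is then a prior tying consecutive time factors together, roughly t_k ~ N(t_{k-1}, sigma_T^2 I), and the fully Bayesian treatment samples over the factors and hyperparameters instead of tuning them by hand.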
---
Yi: Learning Compressible Models

In this paper, we study the combination of compression and L1-norm regularization in a machine learning context: learning compressible models. By including a compression operation into the L1 regularization, the assumption on model sparsity is relaxed to compressibility: model coefficients are compressed before being penalized, and sparsity is achieved in a compressed domain rather than the original space. We focus on the design of different compression operations, by which we can encode various compressibility assumptions and inductive biases, e.g., piecewise local smoothness, compacted energy in the frequency domain, and semantic correlation. We show that use of a compression operation provides an opportunity to leverage auxiliary information from various sources, e.g., domain knowledge, coding theories, unlabeled data. We conduct extensive experiments on brain-computer interfacing, handwritten character recognition and text classification. Empirical results show clear improvements in prediction performance by including compression in L1 regularization. We also analyze the learned model coefficients under appropriate compressibility assumptions, which further demonstrate the advantages of learning compressible models instead of sparse models.
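In generic notation (not necessarily the paper's), the compression operation inside the L1 regularization amounts to penalizing the coefficients in a transformed domain: for a compression operator P such as a DCT, wavelet, or local-smoothing transform, the model is fit by

    \min_{w} \sum_{i=1}^{n} \ell(y_i, w^\top x_i) + \lambda \, \| P w \|_1

so P = I recovers the usual sparse (lasso-style) model, while other choices of P encode the smoothness, frequency-domain, or semantic-correlation assumptions listed in the abstract.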
From krw at andrew.cmu.edu Mon Apr 19 08:36:53 2010
From: krw at andrew.cmu.edu (Karen Widmaier)
Date: Mon, 19 Apr 2010 08:36:53 -0400
Subject: [Research] Auton Lab meeting: Wednesday April 21 at 12noon
In-Reply-To: <4BC8D3FE.6010708@cs.cmu.edu>
References: <4BC8D3FE.6010708@cs.cmu.edu>
Message-ID: <000c01cadfbd$005a4790$010ed6b0$@cmu.edu>

Hello,

I've reserved NSH 1507 for the Lab meeting on Wednesday, April 21.

Karen

From awd at cs.cmu.edu Tue Apr 27 11:41:09 2010
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Tue, 27 Apr 2010 11:41:09 -0400
Subject: [Research] Lab Meeting Tomorrow!
Message-ID: <4BD70595.1050400@cs.cmu.edu>

Dear Autonians,

Please join Purna and me on Wednesday April 28th (tomorrow) at noon to hear Purna's talk on "Tractable Ranking with Random Walks in Large Graphs". Look below for the summary. This will be a practice talk for the upcoming thesis defense, so please come ready to provide valuable feedback.

Karen Widmaier will confirm the location.

See you!
Artur

---
Summary:

A wide variety of interesting real-world applications, e.g. friend suggestion in social networks, keyword search in databases, web-spam detection, etc., can be framed as ranking entities in a graph. While random walk based proximity measures are popular tools for ranking, they are hard to compute in real-world networks with millions of entities. We will present algorithms which improve both quality and speed of relevance search in large real-world graphs. We mainly design local algorithms, which are highly generalizable to different random walk based measures and disk-based clustered graph representations. All our algorithms are evaluated using various link-prediction tasks. Earlier work has shown that different heuristics behave differently on link prediction tasks on different graphs. In the last part of this talk, I will describe how to justify useful link prediction heuristics by bringing together generative models for link formation and geometric intuitions.
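For context on "random walk based proximity measures": a familiar member of the family is personalized PageRank, where the relevance of every node to a query node s is the stationary distribution of a walk that restarts at s with probability alpha at each step,

    \pi_s = \alpha \, e_s + (1 - \alpha) \, \pi_s P

with P the row-stochastic transition matrix of the graph and e_s the indicator vector of s. Solving this exactly touches the whole graph, which is why million-node networks call for local, approximate algorithms. (Personalized PageRank is used here only as an illustration; the talk may emphasize other measures, such as hitting or commute times.)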
From krw at andrew.cmu.edu Tue Apr 27 12:53:59 2010
From: krw at andrew.cmu.edu (Karen Widmaier)
Date: Tue, 27 Apr 2010 12:53:59 -0400
Subject: [Research] Lab Meeting Tomorrow!
In-Reply-To: <4BD70595.1050400@cs.cmu.edu>
References: <4BD70595.1050400@cs.cmu.edu>
Message-ID: <013a01cae62a$3b7c57a0$b27506e0$@cmu.edu>

I've reserved GHC 6501 from noon until 1:30 for tomorrow's meeting.

Karen

From krw at andrew.cmu.edu Tue Apr 27 13:18:19 2010
From: krw at andrew.cmu.edu (Karen Widmaier)
Date: Tue, 27 Apr 2010 13:18:19 -0400
Subject: [Research] ROOM CHANGE Lab Meeting Tomorrow!
In-Reply-To: <013a01cae62a$3b7c57a0$b27506e0$@cmu.edu>
References: <4BD70595.1050400@cs.cmu.edu> <013a01cae62a$3b7c57a0$b27506e0$@cmu.edu>
Message-ID: <01ad01cae62d$a1acbe90$e5063bb0$@cmu.edu>

Hello all,

The lab meeting for tomorrow, April 28 will be in NSH 1507.

Karen
From krw at andrew.cmu.edu Wed Apr 28 11:50:19 2010
From: krw at andrew.cmu.edu (Karen Widmaier)
Date: Wed, 28 Apr 2010 11:50:19 -0400
Subject: [Research] Lab Meeting Tomorrow!
In-Reply-To: <4BD70595.1050400@cs.cmu.edu>
References: <4BD70595.1050400@cs.cmu.edu>
Message-ID: <00eb01cae6ea$80fbc5c0$82f35140$@cmu.edu>

Reminder...meeting is in NSH 1507.