From schneide at cs.cmu.edu Thu Apr 15 22:38:33 2010
From: schneide at cs.cmu.edu (Jeff Schneider)
Date: Thu, 15 Apr 2010 22:38:33 -0400
Subject: [Research] internal representation of datsets
Message-ID: <4BC7CDA9.70902@cs.cmu.edu>

Hi guys,

I just (quite painfully) discovered that our datset implementation actually stores doubles as floats internally in something called a pvector.

I'd like to change these to be doubles so what I experienced doesn't happen to anyone else. HOWEVER, that seems like a big change to the very core of the code. And at the least it will certainly cause datsets to consume more memory internally.

Any thoughts/advice on doing this? Or suggestions on alternate ways to not get burned by this again in those cases where you really want double precision?

Jeff.

From jostlund at cs.cmu.edu Fri Apr 16 09:59:48 2010
From: jostlund at cs.cmu.edu (John K. Ostlund)
Date: Fri, 16 Apr 2010 09:59:48 -0400 (EDT)
Subject: [Research] internal representation of datsets
In-Reply-To: <4BC7CDA9.70902@cs.cmu.edu>
References: <4BC7CDA9.70902@cs.cmu.edu>
Message-ID: <38428.24.3.147.252.1271426388.squirrel@webmail.cs.cmu.edu>

Hi Jeff (et al),

This is something "we" have known about for some time, but I guess you weren't part of "we"--sorry! Let me guess: you got burned by large integer ID numbers losing precision in the last few digits?

We (which included Artur, as I recall) punted on the issue because, as you say, this is a change in the core of the code that (a) will cause datsets to take more memory and (b) will change the results produced by algorithms in some cases, in terms of sorting classification results that are very close. It's also part of a larger issue, in terms of what a datset should contain and how smart our datset loading algorithm should be.

My own observation is that, most of the time, the size of the datset in memory is not nearly as large as the size of the other data structures built from the datset, so changing float to double is probably a good idea without too many implications that way. But this isn't true in *all* cases. Also, a considerable amount of testing will need to be done to validate all the datset interaction functions. Are there hidden dependencies on float-vs-double?

My pie-in-the-sky preferred solution to the whole business would be to (a) use Microsoft Excel's rules for smart .csv file loading, (b) make it easy within the .csv file header AND via command line options / load function arguments for the user to specify exact handling of each column, (c) distinguish between string, int, bool, and double, not just string and float, (d) never lose track of the original string read in for each cell from the .csv file, and (e) (big deep breath) use this as part of an excuse to switch to C++.

But in the short run, changing float to double and doing exhaustive testing would be easier.

My two cents,

- John O.
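The failure mode John guesses at above is a property of IEEE single precision rather than of the datset code specifically: a float carries a 24-bit significand, so integers above 2^24 (about 16.7 million) can no longer all be represented exactly, and a large record ID silently comes back with its last digits changed. A minimal, self-contained C illustration (no Auton code involved; the ID value is made up):

  #include <stdio.h>

  int main(void)
  {
      /* A large integer ID of the kind a datset column might hold. */
      double id = 20100415223833.0;

      float  f = (float) id;       /* what a float-backed store keeps */
      double r = (double) f;       /* what comes back out             */

      printf("stored    : %.1f\n", id);
      printf("recovered : %.1f\n", r);
      printf("error     : %.1f\n", id - r);
      return 0;
  }

At this magnitude adjacent floats are 2^21 apart, so the recovered ID is off by several hundred thousand, which is exactly the "last few digits" effect described above.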
From sabhnani+ at cs.cmu.edu Fri Apr 16 11:00:12 2010
From: sabhnani+ at cs.cmu.edu (Robin Sabhnani)
Date: Fri, 16 Apr 2010 11:00:12 -0400
Subject: [Research] internal representation of datsets
In-Reply-To: <38428.24.3.147.252.1271426388.squirrel@webmail.cs.cmu.edu>
References: <4BC7CDA9.70902@cs.cmu.edu> <38428.24.3.147.252.1271426388.squirrel@webmail.cs.cmu.edu>
Message-ID: <4BC87B7C.7060208@cs.cmu.edu>

How about implementing a compiler flag that loads higher precision data? We could default to float to make it backward compatible and memory efficient.

Robin
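One concrete (and entirely hypothetical) shape for Robin's suggestion: a single typedef selected by a preprocessor flag, so the default build keeps today's float behaviour and memory footprint, while anyone who needs full precision rebuilds with the flag turned on. The type and macro names below are invented for illustration; they are not in the Auton source tree.

  /* datset_real.h -- sketch only; names are made up. */
  #ifndef DATSET_REAL_H
  #define DATSET_REAL_H

  #ifdef DATSET_DOUBLE_PRECISION
    typedef double dset_real;       /* opt in: build with -DDATSET_DOUBLE_PRECISION */
    #define DSET_REAL_FMT "%.17g"   /* enough digits to round-trip a double */
  #else
    typedef float dset_real;        /* default: current behaviour, half the memory */
    #define DSET_REAL_FMT "%.9g"    /* enough digits to round-trip a float */
  #endif

  #endif /* DATSET_REAL_H */

The caveat John raises still applies either way: any code that quietly assumes sizeof(float) (binary file I/O, memcpy tricks, printf formats) has to be found and audited, so the flag saves the memory argument but not the testing effort.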
From sbrudene at andrew.cmu.edu Fri Apr 16 11:17:50 2010
From: sbrudene at andrew.cmu.edu (Steven Brudenell)
Date: Fri, 16 Apr 2010 11:17:50 -0400
Subject: [Research] internal representation of datsets
Message-ID: <2329wnlsij14t3q53r1o4j4o.1271430654453@email.android.com>

To add to John's two cents: plus one vote in favor of a well-planned, coordinated transition to C++. This would solve quite a lot of problems.

In the future, we should probably borrow heavily from the various SQL APIs in the world. They solve many of the same problems as our datset API, in terms of data access. They are also very well-traveled.

I do think changing floats to doubles, plus rigorous testing, is the best solution for now.
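To make the SQL-API comparison concrete: libraries such as SQLite hand each cell to the caller through typed getters (sqlite3_column_int64, sqlite3_column_double, sqlite3_column_text), so the caller decides how a value is interpreted and the underlying text is never thrown away. A datset-flavoured version of that idea might look like the sketch below; none of these functions exist today, they are only a shape for the discussion, and they would also cover John's points (c) and (d).

  /* Hypothetical typed accessors, loosely modelled on the sqlite3_column_* family. */
  typedef struct datset datset;     /* opaque handle */

  double      datset_cell_double(const datset *ds, int row, int col);
  long long   datset_cell_int64 (const datset *ds, int row, int col);
  int         datset_cell_bool  (const datset *ds, int row, int col);
  const char *datset_cell_text  (const datset *ds, int row, int col);  /* original .csv string */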
From komarek.paul at gmail.com Fri Apr 16 11:18:24 2010
From: komarek.paul at gmail.com (Paul Komarek)
Date: Fri, 16 Apr 2010 08:18:24 -0700
Subject: [Research] internal representation of datsets
In-Reply-To: <4BC87B7C.7060208@cs.cmu.edu>
References: <4BC7CDA9.70902@cs.cmu.edu> <38428.24.3.147.252.1271426388.squirrel@webmail.cs.cmu.edu> <4BC87B7C.7060208@cs.cmu.edu>
Message-ID:

I remember this, and at the time I decided that I didn't care about the representation on disk. However, if you store intermediate computation results in an fds file, I can totally see that you'd be unhappy with the result.

In order to prevent backwards compatibility issues, I wouldn't change the .fds format. Instead, I'd look for a compressed csv format. When I open sourced my LR code, this is what I ended up doing. You can download my stuff from http://komarix.org/lr if you want to see details. There are many benefits, including cross-platform compatibility and less code on your side.
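For anyone who wants to try the compressed-csv route, zlib's gz* calls are nearly a drop-in replacement for stdio, so a loader can keep parsing plain text (and therefore keep every digit of the original value) while the file on disk stays small. A rough sketch of the idea, not the loader Paul links to:

  #include <stdio.h>
  #include <zlib.h>                          /* link with -lz */

  int main(int argc, char **argv)
  {
      if (argc != 2) {
          fprintf(stderr, "usage: %s data.csv.gz\n", argv[0]);
          return 1;
      }
      gzFile in = gzopen(argv[1], "rb");     /* transparently reads uncompressed files too */
      if (in == NULL) {
          fprintf(stderr, "cannot open %s\n", argv[1]);
          return 1;
      }

      char line[65536];
      long n = 0;
      while (gzgets(in, line, (int) sizeof line) != NULL) {
          /* Parse fields here with strtod()/strtoll(); the text still has full precision. */
          n++;
      }
      gzclose(in);
      printf("read %ld lines\n", n);
      return 0;
  }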
From awd at cs.cmu.edu Fri Apr 16 17:17:50 2010
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Fri, 16 Apr 2010 17:17:50 -0400
Subject: [Research] Auton Lab meeting: Wednesday April 21 at 12noon
Message-ID: <4BC8D3FE.6010708@cs.cmu.edu>

Dear Autonians,

The Lab meeting next week will involve practice talks by Liang and Yi before their appearances at the upcoming SDM conference. Abstracts are provided below. Please come along and provide them with your feedback!

Time: Wednesday April 21, 12noon
Karen will confirm the location.

See you,
Artur

---
Liang: Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization

Real-world relational data are seldom stationary, yet traditional collaborative filtering algorithms generally rely on this assumption. Motivated by our sales prediction problem, we propose a factor-based algorithm that is able to take time into account. By introducing additional factors for time, we formalize this problem as a tensor factorization with a special constraint on the time dimension. Further, we provide a fully Bayesian treatment to avoid tuning parameters and achieve automatic model complexity control. To learn the model we develop an efficient sampling procedure that is capable of analysing large-scale data sets. This new algorithm, called Bayesian Probabilistic Tensor Factorization (BPTF), is evaluated on several real-world problems including sales prediction and movie recommendation. Empirical results demonstrate the superiority of our temporal model.
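For anyone who will not read the paper before the talk, the object behind this abstract is a three-way factor model; the notation below is the generic form of such a model, not necessarily the paper's own symbols. Each user i, item j, and time slice k gets a D-dimensional latent vector, and the predicted rating is

    \hat{r}_{ijk} = \sum_{d=1}^{D} u_{id} \, v_{jd} \, t_{kd}

The "special constraint on the time dimension" is then a prior tying consecutive time factors together, roughly t_k ~ N(t_{k-1}, sigma_T^2 I), and the fully Bayesian treatment samples over the factors and hyperparameters instead of tuning them by hand.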
---
Yi: Learning Compressible Models

In this paper, we study the combination of compression and L1-norm regularization in a machine learning context: learning compressible models. By including a compression operation into the L1 regularization, the assumption on model sparsity is relaxed to compressibility: model coefficients are compressed before being penalized, and sparsity is achieved in a compressed domain rather than the original space. We focus on the design of different compression operations, by which we can encode various compressibility assumptions and inductive biases, e.g., piecewise local smoothness, compacted energy in the frequency domain, and semantic correlation. We show that use of a compression operation provides an opportunity to leverage auxiliary information from various sources, e.g., domain knowledge, coding theories, unlabeled data. We conduct extensive experiments on brain-computer interfacing, handwritten character recognition and text classification. Empirical results show clear improvements in prediction performance by including compression in L1 regularization. We also analyze the learned model coefficients under appropriate compressibility assumptions, which further demonstrate the advantages of learning compressible models instead of sparse models.
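In generic notation (not necessarily the paper's), the compression operation inside the L1 regularization amounts to penalizing the coefficients in a transformed domain: for a compression operator P such as a DCT, wavelet, or local-smoothing transform, the model is fit by

    \min_{w} \sum_{i=1}^{n} \ell(y_i, w^\top x_i) + \lambda \, \| P w \|_1

so P = I recovers the usual sparse (lasso-style) model, while other choices of P encode the smoothness, frequency-domain, or semantic-correlation assumptions listed in the abstract.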
From krw at andrew.cmu.edu Mon Apr 19 08:36:53 2010
From: krw at andrew.cmu.edu (Karen Widmaier)
Date: Mon, 19 Apr 2010 08:36:53 -0400
Subject: [Research] Auton Lab meeting: Wednesday April 21 at 12noon
In-Reply-To: <4BC8D3FE.6010708@cs.cmu.edu>
References: <4BC8D3FE.6010708@cs.cmu.edu>
Message-ID: <000c01cadfbd$005a4790$010ed6b0$@cmu.edu>

Hello,

I've reserved NSH 1507 for the Lab meeting on Wednesday, April 21.

Karen

From awd at cs.cmu.edu Tue Apr 27 11:41:09 2010
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Tue, 27 Apr 2010 11:41:09 -0400
Subject: [Research] Lab Meeting Tomorrow!
Message-ID: <4BD70595.1050400@cs.cmu.edu>

Dear Autonians,

Please join Purna and me on Wednesday April 28th (tomorrow) at noon to hear Purna's talk on "Tractable Ranking with Random Walks in Large Graphs". Look below for the summary. This will be a practice talk for the upcoming thesis defense, so please come ready to provide valuable feedback.

Karen Widmaier will confirm the location.

See you!
Artur

---
Summary:

A wide variety of interesting real-world applications, e.g. friend suggestion in social networks, keyword search in databases, web-spam detection, etc., can be framed as ranking entities in a graph. While random walk based proximity measures are popular tools for ranking, they are hard to compute in real-world networks with millions of entities. We will present algorithms which improve both quality and speed of relevance search in large real-world graphs. We mainly design local algorithms, which are highly generalizable to different random walk based measures and disk-based clustered graph representations. All our algorithms are evaluated using various link-prediction tasks. Earlier work has shown that different heuristics behave differently on link prediction tasks on different graphs. In the last part of this talk, I will describe how to justify useful link prediction heuristics by bringing together generative models for link formation and geometric intuitions.
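For context on "random walk based proximity measures": a familiar member of the family is personalized PageRank, where the relevance of every node to a query node s is the stationary distribution of a walk that restarts at s with probability alpha at each step,

    \pi_s = \alpha \, e_s + (1 - \alpha) \, \pi_s P

with P the row-stochastic transition matrix of the graph and e_s the indicator vector of s. Solving this exactly touches the whole graph, which is why million-node networks call for local, approximate algorithms. (Personalized PageRank is used here only as an illustration; the talk may emphasize other measures, such as hitting or commute times.)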
From krw at andrew.cmu.edu Tue Apr 27 12:53:59 2010
From: krw at andrew.cmu.edu (Karen Widmaier)
Date: Tue, 27 Apr 2010 12:53:59 -0400
Subject: [Research] Lab Meeting Tomorrow!
In-Reply-To: <4BD70595.1050400@cs.cmu.edu>
References: <4BD70595.1050400@cs.cmu.edu>
Message-ID: <013a01cae62a$3b7c57a0$b27506e0$@cmu.edu>

I've reserved GHC 6501 from noon until 1:30 for tomorrow's meeting.

Karen

From krw at andrew.cmu.edu Tue Apr 27 13:18:19 2010
From: krw at andrew.cmu.edu (Karen Widmaier)
Date: Tue, 27 Apr 2010 13:18:19 -0400
Subject: [Research] ROOM CHANGE Lab Meeting Tomorrow!
In-Reply-To: <013a01cae62a$3b7c57a0$b27506e0$@cmu.edu>
References: <4BD70595.1050400@cs.cmu.edu> <013a01cae62a$3b7c57a0$b27506e0$@cmu.edu>
Message-ID: <01ad01cae62d$a1acbe90$e5063bb0$@cmu.edu>

Hello all,

The lab meeting for tomorrow, April 28 will be in NSH 1507.

Karen
From krw at andrew.cmu.edu Wed Apr 28 11:50:19 2010
From: krw at andrew.cmu.edu (Karen Widmaier)
Date: Wed, 28 Apr 2010 11:50:19 -0400
Subject: [Research] Lab Meeting Tomorrow!
In-Reply-To: <4BD70595.1050400@cs.cmu.edu>
References: <4BD70595.1050400@cs.cmu.edu>
Message-ID: <00eb01cae6ea$80fbc5c0$82f35140$@cmu.edu>

Reminder...meeting is in NSH 1507.