
Monday, February 01, 2016

On "the moral hazard of complexity-theoretic assumptions"

(ed: In a recent CACM editorial,  CACM editor-in-chief +Moshe Vardi  discussed Babai's result on graph isomorphism and the recent hardness result for edit distance in the context of a larger discussion on the significance of complexity-theoretic assumptions. +Piotr Indyk  (one of the authors of the edit distance result and an occasional guest blogger) posted the following as a response. This response has also been posted as a comment on Moshe's post.)

(ed: Update: After Piotr posted the comment, Moshe responded, and then Piotr responded again. Please visit the article page to read the exchange between the two). 

In a recent CACM editorial, Dr. Vardi addresses what he calls "a moral hazard of complexity-theoretic assumptions" and "a growing gap between current theory and practice of complexity and algorithms". Given that the article mentions a paper that I am a co-author of ["Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false)", STOC'15], I believe it is appropriate for me to respond. In short, I believe that much of the analysis in the article stems from a confusion between press coverage and actual scientific inquiry. Indeed, it is unclear whether Dr. Vardi addresses what he believes to be a "media" phenomenon (how journalists describe scientific developments to a broad public) or a "scientific" phenomenon (how and why scientists make and use assumptions in their research and describe them in their papers). In order to avoid conflating these two issues, I will address them one by one, focusing on our paper as an example.

  1. Media aspects: The bulk of the editorial is focused on some of the press coverage describing recent scientific developments in algorithms and complexity. In particular, Dr. Vardi mentions the title of a Boston Globe article covering our work ("For 40 Years, Computer Scientists Looked for a Solution that Doesn't Exist."). As I have already discussed this elsewhere (https://liorpachter.wordpress.com/2015/08/14/in-biology-n-does-not-go-to-infinity/#comment-4792 ), I completely agree that the title and some other parts of the article leave a lot to be desired. Among other things, the conditional aspects of the result are discussed only at the end of the article, and are therefore easy to miss. At the same time, I believe some perspective is needed. Inaccuracy or confusion in popular reporting of scientific results is an unfortunate but common and longstanding phenomenon (see e.g., this account https://lpsdp.files.wordpress.com/2011/10/ellipsoid-stories.pdf of the press coverage of Khachiyan's famous linear programming algorithm in the 1970s). There are many reasons for this. Perhaps the chief one is the cultural gap between the press and the scientists: journalists emphasize accessibility and newsworthiness, while scientists emphasize precision. As a result, simplification in scientific reporting is a necessity, and the risk of oversimplification, inaccuracy or incorrectness is high. Fortunately, more time and effort spent on both sides can lead to more thorough and nuanced articles (e.g., see https://www.quantamagazine.org/20150929-edit-distance-computational-complexity/ ). Given that the coverage of algorithms and complexity results in the popular press is growing, I believe that, in time, both scientists and journalists will gain valuable experience in this process.
  2. Scientific aspects: Dr. Vardi also raises some scientific points. In particular:
    • Dr. Vardi is critical of the title of our paper, "Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false)". I can only say that, given that we state the assumption explicitly in the title, in the abstract, in the introduction, and in the main body of the paper, I believe the title and the paper accurately represent its contribution.
    • Dr. Vardi is critical of the validity of SETH as a hardness assumption: this question is indeed a subject of robust discussion and investigation (see e.g., the aforementioned Quanta article). My best guess, and that of most of the people I have discussed this with, is that the assumption is true. However, this is far from a universal opinion. Quantifying the level of support for this conjecture would be an interesting project, perhaps along the lines of similar efforts concerning the P vs. NP conjecture ( https://www.cs.umd.edu/~gasarch/papers/poll2012.pdf ). In any case, it is crucial to strengthen the framework by relying on weaker assumptions, or by replacing one-way reductions with equivalences; both are subjects of ongoing research. However, even the existing developments have already led to concrete benefits. For example, failed attempts to prove conditional hardness of certain problems have led to better algorithms for those tasks.

Finally, let me point out that one of the key motivations for this line of research is precisely the strengthening of the relationship between theory and practice in complexity and algorithms, a goal that Dr. Vardi refers to as an important challenge. Specifically, this research provides a framework for establishing evidence that certain computational questions cannot be solved within concrete (e.g., sub-quadratic) polynomial time bounds. In general, I believe that a careful examination of the developments in algorithms and complexity over the last decade would show that the gap between theory and practice is shrinking, not growing. But that is a topic for another discussion.

Monday, October 26, 2015

Data and Civil Rights

I'm in DC right now for a one-day conference on Data and Civil Rights, run by the Data and Society Institute.

This is an annual event (this is the second such conference). Last year's conference was themed "Why Big Data is a civil rights issue", and this year's conference focuses on the very hot-button topic of big data and criminal justice.

Needless to say, issues of fairness and discrimination are front and center in an area like this, and so I'm hoping to learn a lot about the state of play (and maybe contribute as well).

This is more of a working meeting than a traditional conference: all material is private during the conference and we're expected not to talk about the discussions outside the event (à la the Chatham House Rule). Digested material from the different working groups will be posted in November.


JHU Workshop on Sublinear Algorithms

The latest in a long line of workshops on sublinear algorithms (streaming! sketching ! property testing ! all of the above !) will be held at JHU this year just before SODA 2016. The message from the organizers is below: do consider attending if you're planning to attend SODA. (Disclaimer: I'm giving one of the 20+ talks, but I will not promise that it's excellent). 

Dear colleagues,

We are organizing a Sublinear Algorithms workshop that will take place at Johns Hopkins University, January 7-9, 2016. The workshop will bring together researchers interested in sublinear algorithms, including sublinear-time algorithms (e.g., property testing and distribution testing), sublinear-space algorithms (e.g., sketching and streaming) and sublinear measurements (e.g., sparse recovery and compressive sensing).

The workshop will be held right before SODA’16, which starts on January 10 in Arlington, VA (about 50 miles from JHU).

Participation in this workshop is open to all, with free registration. In addition to 20+ excellent invited talks, the program will include short contributed talks by graduating students and postdocs, as well as a poster session. To participate in the contributed talk session and/or the poster session, apply by December 1.

For further details and registration, please visit
http://www.cs.jhu.edu/~vova/sublinear2016/main.html .



Best,
Vladimir Braverman, Johns Hopkins University
Piotr Indyk, MIT
Robert Krauthgamer, Weizmann Institute of Science
Sofya Raskhodnikova, Pennsylvania State University

Tuesday, July 28, 2015

The 2nd Workshop on Fairness, Accuracy and Transparency in Machine Learning: A review

I was one of the organizers of the 2nd workshop on Fairness, Accuracy and Transparency in Machine Learning (FATML) at ICML 2015, and in my alternate career as moderator of data mining panels, I moderated the closing panel. The panelists were Fernando Diaz from MSR New York, Sorelle Friedler from Haverford College, Mykola Pechenizkiy from Eindhoven University of Technology and Hanna Wallach from UMass-Amherst and MSR.

While my original intent was to do a review of the panel, it became clear that the panel discussion touched on themes that were bubbling up throughout the day. So what follows is organized by panel questions, but weaves in discussion from outside the panel as well.

This year's workshop, unlike the one at NIPS 2014, had a bit more of a technical focus: we had some of the early researchers in fairness work from Europe and Japan give talks on their work.  So as a counterweight, I thought I'd ask the panel to look beyond the presentations for the first question:
Question 1 
What is one thing you think we are missing (or should be paying more attention to) in our current discussions of fairness and discrimination  in terms of interaction with the social/policy/legal world OUTSIDE CS ?

Some themes that emerged:

...on the difference between Europe and the US


Mykola made an interesting observation based on his experience in educational data mining. Governments around Europe are very concerned about the use of student data (including health records and academic information) for any kind of data driven tools, and have severely regulated use of such data. As a result, the nascent field of educational data mining has been crippled by lack of access to data.

This is almost the opposite of the situation in the US, where data driven policy is the newest buzzword in town, and those of us who are interested in issues of fairness and transparency feel like we're constantly on the outside looking in (though attention to these issues is increasing).

...connecting to other communities


It's been clear from the beginning that discourse on fairness and transparency in machine learning must draw on the corresponding discourses in society at large. Which means that before we can start solving problems, we have to understand what the problems really are. This came through very strongly in the discussions. To paraphrase one of the panelists,  "Computer science likes to solve problems, and that's the problem !" (also known as  "slap a metric on it and optimize").

So what are the different communities we should be connecting to, and how ?

a) Connecting with social science


A major concern is the "prediction vs understanding" problem. For the most part, machine learning is about prediction: you classify, label, cluster, regress, rank and so on. But in the study of society and the dynamics of human interactions, the goal is not just to predict how humans might behave, but to understand their behavior. Which is to say, data analysis (or even fairness-aware analysis) has to be the first step in a larger conversation, rather than the last one.

While I don't think this issue is specific to fairness and transparency, it plays a role in understanding the sources of inequality and discrimination. It's not enough to detect examples of bias: what must happen next is an investigation of why the bias is happening.

(ed: personally, while I understand this concern, I don't think it's necessarily something computer scientists need to prioritize. This is after all what the social sciences do, and it doesn't make sense for us as computer scientists to merely try to acquire those skills. I think we need to be aware of the deeper issues of understanding a domain, but we also have strengths that we bring to the table and I'll say more about that later)

"galaxies don't care how they are studied, but people do"

Another point that was made over and over is that  issues of fairness and bias are not abstract: they affect actual people. Keeping the human in focus is important for the ethical underpinning of what we do, and even how we might design experiments.

b) connecting with journalists


Nick Diakopoulos gave a talk on "algorithmic accountability" in journalism. In addition to talking about what made research on fairness newsworthy:

  • discriminatory/unfair practices
  • mistakes that deny a service
  • censorship
  • activities that break the law or social norms
  • false prediction

he made the strong argument that (government) legitimacy comes from transparency, and talked about what that might entail in the age of data driven policy, including transparency involving data collection, the algorithms used, the inferences generated, and the humans involved in the process.

(ed: I don't think that our demands on transparency should be limited to government entities: the sad fact is that at least in the US, much of what would be considered basic internet infrastructure is controlled by private corporations, and they should be held to similar standards: if not for legitimacy, at least for fairness)

c) connecting with the law


Our fearless leader Solon Barocas made a number of interesting observations on the connection between algorithmic fairness  and the law, all the while disclaiming IANAL :). But his point (which he's made before) is worth repeating. One of the things that computer science can do well is  make precise concepts that might be defined vaguely or only indirectly through case law. And then we can get to work teasing out the relationships between different concepts (both abstractly and computationally). Indeed, the idea of a "reduction" between concepts in fairness might be one of the most useful things that computer science can uniquely contribute.

It's clear we're in a "let a thousand definitions bloom" phase in fairness research. And it's interesting to see the different reactions to this: on the social science side, there appears to be some nervousness that we're "playing games with math", but from Solon's comments this doesn't seem like a bad thing as long as we're also trying to connect the definitions together.


 Question 2 
In your view, what’s the next most pressing question we should be asking (limited to INSIDE CS to distinguish from the previous question) ?

...better definitions


It was very clear from the discussion that we need broader definitions of F-A-T beyond what's mathematically plausible. One particular example that's reminiscent of the metrics for privacy: There's a notion of "utility": how much can we make the data or the task "fair" without changing the underlying results produced by the "unfair" data/algorithm. The problem is that utility itself is not very well defined. Firstly, you might be benefiting from discriminatory policies, so your perceived "utility" itself is a problem. Trying to maintain this defeats the purpose of fairness. Secondly, even if this is not the case, the framing of the question as a tradeoff implies that these two notions are necessarily in opposition. That shortchanges the moral imperative of fairness and is different from the parallel situation in privacy. Finally, we measure utility in terms of classifier accuracy. But that's a very poor notion of overall task effectiveness. For example, is there a Bayesian perspective to bring to this ?

At any rate, since we are good at understanding tradeoffs in computer science, we should understand the different dimensions of the space of fairness preserving methods, rather than limiting ourselves to a one-dimensional false dichotomy of "fairness vs utility".

...better usable artifacts


Nick asked us the following question at the panel:

when a CEO or an agency head comes to us and asks "what should we do about this fairness stuff", what do we tell them ?

We didn't have a good response, and that was interesting. While we're beginning to explore the space of what's possible, we don't have clear examples of artifacts to hand over and say "use this".

As usual, the topic of benchmarking came up. I joke that when industry folks bring up the issue of benchmarking, I always ask "so where's the data" and they usually go very silent. But I do think there are useful data sets to be explored that come to us from the government. Some common data sets that get used are the US census data on salaries and a German data set on consumer credit. The entire data set from the Ricci court case is also available (even though it's tiny), and there are Bureau of Justice recidivism data sets to play with.

Of course this goes against the imperative coming from the social scientists to look at specific domains and ask meaningful questions in that domain. And I think we need to look more at the literature on fairness and bias over the decades and extract data that people have studied.

...better problems


For the most part, researchers have been considering binary classification as the suspect task. But of course there are much more general tasks that we could be considering: what about unsupervised learning ? what about structured prediction ? Is there a way to define fairness when you don't have a simple binary response variable and binary attributes ?

One final question I asked was this:
 Question 3 
do we have to solve the causality problem in order to talk about fairness ? 

This question was possibly not as well-posed as I would have liked, but it led to interesting discussions.

The law deals with intent, because the goal of the law is to assign responsibility. Algorithms are not agents and can't exhibit intent. Causality is a proxy for intent, in that if we can say that something caused something else, we can assign blame in a different way. In fact there were two talks at the workshop that talked about causality directly in the context of fairness.

But causality is a very hard problem. It's extremely subtle (if you doubt this, read through some of the examples Judea Pearl discusses in his book), and extremely controversial: different camps have their own view of how to mechanize causal inference, and the battles there make frequentists and Bayesians look like life-long friends.

In the discussion that followed, it became clear that there were really two ways of thinking about causality as it relates to fairness. The first way is to think about the underlying causal mechanisms that might lead to otherwise innocent features leading to biased outcomes: that is, how might zip code correlate with racial identity for example. The second way, which is closer to what I had in mind, is to think about the behavior of an algorithm causally: the use of these inputs or this algorithm *caused* a particular decision to be made. This second idea is not as far-fetched as it seems: some work in the database community has looked at trying to find which tuples "caused" a certain output to be generated from a query.

If you think it's not important to understand causality as it comes to automated methods, you might not want to drive a self-driving car or fly a plane. But as Solon suggested in the discussion, one way of getting around causality is to think about negligence with respect to algorithms: can we design reasonable best practices for predictive tools and argue that a failure to use these methods is negligence ? The legal ramifications of this idea have been explored in the context of robotics (article, and response) but more work is yet to be done.

...back to narratives


Another comment by Nick D, again connecting to the journalism perspective: narratives and story telling are a powerful way to explain the results of data mining. I haven't talked much about interpretability, which is an important part of the larger discussion of transparency and accountability. But one way to communicate the results of (say) a fairness audit would be to provide a human-interpretable linkage between the problematic attributes being used for prediction and the protected attribute. For more on this, see Michael Nielsen's very timely new Quanta article on machine-generated explanations.

It's clear from all the discussion that there's a lot of work to be done and a small but active community of people interested in pushing these issues forward. Fairness, and algorithmic bias, are hot topics in the news nowadays, and it's a good time to take advantage of this burst of interest.

Friday, May 15, 2015

Higher order interactions and SDM 2015

This year I'm one of the PC Chairs for SIAM Data Mining (along with Jieping Ye), and so I've been spending time in decidedly-not-sunny Vancouver. Being a PC Chair, or even being on a PC, is an exercise in constant deja vu: I hear a talk and think "Where have I heard this before" before realizing that I've probably reviewed the paper, or looked at its reviews or meta-reviews.

Being the PC chair means though that I can float around the conference freely without feeling the pressure to attend talks, network or otherwise be social: I've earned my keep :). Following standard conference networking maxims, I made an effort to meet at least one coffee shop that I'd met before, and to introduce myself to at least one new coffee shop.

But I am attending talks ! And I enjoyed listening to some work on tensor spectral clustering by Benson, Gleich and Leskovec. It got me thinking about the larger issue of modeling higher-order interactions and what appear to be many different ways of modeling the problem.

The problem.

Imagine that you have a number of interacting entities. These could be points in a space, or vertices in a graph, or even dimensions of a point (aka variables). The easiest way to model a collection of such entities is to assume they're independent of each other. For example, I might draw i.i.d samples from a distribution. Or I might be looking at a collection of independent features describing an object, and so on. 

Independence assumptions are powerful. They allow us to "factorize" structure into combinations of individual elements, which leads to simple summations, either directly or after a log transformation of a product. This means we can deal with entities independently, and the "inference complexity blowup" is linear in the number of entities. A good example of this is the Naive Bayes approach to learning, where assuming all entities are independent leads to a likelihood cost function that's just a sum of terms, one for each entity.
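
To make the factorization concrete, here is a minimal sketch in Python (with made-up per-feature numbers) of how independence turns a product of per-feature likelihoods into a sum of log terms, so that evaluation cost grows linearly with the number of entities:

    import math

    # Hypothetical per-feature likelihoods P(x_i | class) for one object.
    feature_likelihoods = [0.9, 0.7, 0.8, 0.6]

    # Under independence, the joint likelihood factorizes into a product...
    joint = math.prod(feature_likelihoods)

    # ...and the log-likelihood into a sum with one term per feature,
    # as in the Naive Bayes cost function mentioned above.
    log_joint = sum(math.log(p) for p in feature_likelihoods)

    assert abs(math.log(joint) - log_joint) < 1e-12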

I'm necessarily being rather loose with my statements here. Making them precise is possible in specific contexts, but it gets unwieldy. 

But independence is too restrictive an assumption. It limits modeling power, and therefore will be unable to capture interactions that might make understanding structure a lot easier. For one thing, you'd never find correlations if you assumed that all features are independent.

Graphs. 

The easiest form of interaction is a pairwise interaction. Modeling pairwise interactions gets us to a graph, and who doesn't love a graph ! More importantly for what follows,

a graph and its associated structures form a rich representation of a system of pairwise interactions

in that we have a colorful vocabulary and an arsenal of algorithms for talking about pairwise interactions and structures built on them.

Of course we've paid a price - in complexity. Instead of the linear cost incurred by independent entities, we now have quadratically many potential pairwise interactions to model. But (and here's the key) we can interpret a sparse graph as capturing weak interactions, and it still remains a rich way to model different phenomena.

Higher-order interactions.

But what happens if we want to model interactions that aren't just pairwise ? What is the correct higher-order structure to model such interactions as effectively as graphs ? It turns out that there are many different ways to do this, and they all can be reduced to a sentence (pace Saunders MacLane) of the form

A graph is just a very special kind of X
for different values of X. 
1. The graphical model view.

A graph is just a special kind of clique intersection structure, if you only have 2-cliques.

One way to manage a collection of higher order interactions is to factorize them in a more general way. This is the basis for the clique-tree idea in graphical models, where the interaction structure is viewed as a set of complete joint interactions (aka cliques) all connected together in a tree structure for easy inference. Another name for this would be the 'bounded treewidth' model, though that name misses the fact that we are still allowing higher-order interactions, just in a controlled way.

The advantage of this representation is that in a true parametrized complexity way, it isolates where the true complexity is coming from (the size of each clique) from the overall complexity (the size of the graph). 
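
As a toy illustration (my own, not from any particular source), here is that factorization at work in Python: a joint function over five binary variables built from two cliques glued on one shared variable. Brute-force summation touches $2^5$ terms, while clique-by-clique elimination pays exponentially only in the clique size:

    import itertools
    import numpy as np

    rng = np.random.default_rng(2)

    # Cliques {A,B,C} and {C,D,E} glued on C: a clique tree with two nodes.
    phi1 = rng.random((2, 2, 2))   # potential over (A, B, C)
    phi2 = rng.random((2, 2, 2))   # potential over (C, D, E)

    # Brute force: sum the factorized joint over all 2^5 assignments.
    Z_brute = sum(phi1[a, b, c] * phi2[c, d, e]
                  for a, b, c, d, e in itertools.product(range(2), repeat=5))

    # Clique-tree style elimination: sum out each clique locally and pass
    # a message over the separator variable C.
    msg = phi1.sum(axis=(0, 1))                      # message indexed by C
    Z_elim = float((msg[:, None, None] * phi2).sum())

    assert np.isclose(Z_brute, Z_elim)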

A spectral perspective.

When graphs arise from natural sources of data (social networks and the like), we have to deal with noise and spurious signals. And the simple language of connectivity and paths is no longer robust enough. For example, a graph might be connected, but only because there's one edge connecting two huge components. If this edge was spurious, we've just made a huge mistake in modeling this graph structure.

Spectral methods are currently our best way of dealing with noisy interactions. By focusing not on the topological structure of connectivity but on the amount of connectivity measured via cuts, spectral analysis of graphs has become perhaps the best way of finding structures in large graphs.
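
Here is a minimal sketch of that lens in Python (the graph is made up: two triangles joined by a single weak edge). The sign pattern of the eigenvector for the second-smallest Laplacian eigenvalue (the Fiedler vector) cuts the graph where connectivity is weakest:

    import numpy as np

    # Two triangles (vertices 0-2 and 3-5) joined by the bridge edge (2, 3).
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    n = 6
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0

    # Graph Laplacian L = D - A.
    L = np.diag(A.sum(axis=1)) - A

    # Eigenvectors come back sorted by eigenvalue; the second one is the
    # Fiedler vector, and its signs split off the two communities.
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    print(sorted(i for i in range(n) if fiedler[i] < 0),
          sorted(i for i in range(n) if fiedler[i] >= 0))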

The spectral lens sees a graph through random walks on the edges. This is great for modeling a collection of pairwise interactions between entities, but not for modeling interactions among sets of entities. We have to be careful here. Spectral methods are actually quite good at finding community structure in graphs (i.e., a partition into sets of vertices). What they can't do is find higher order partitionings in graphs (i.e., sets of triangles or sets of special 4-vertex structures). And that's where the next three higher-order methods enter the picture.

2. The algebraic topology view.

A graph is just the 1-skeleton of a simplicial complex. 

If we're looking to model higher-order interactions, we need a language for describing a collection of well-defined higher order structures. That's what a simplicial complex is. I'll skip the formal definition, but the basic idea is that if you have a simplex (an interacting group of entities), then all subsets must be simplices as well. But you declare the simplices first, which means that the single simplex $\{a,b,c\}$ is different from the three simplices $\{a,b\}, \{b, c\}, \{c, a\}$, even though declaring the former forces the latter three to be present.

A simplicial complex is a topological object. It generalizes a graph because a graph is what you get if you limit yourself to simplices of size at most 2. Because it's a discrete topological object, you can now play with it using all the tools of topology, and in particular very powerful tools like homology and homotopy that reveal all kinds of hidden structure not accessible via a graph metaphor.
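
As a small worked example (a made-up complex), here is how the boundary matrices of a simplicial complex expose structure the graph metaphor misses: a hollow triangle, with three 1-simplices declared but not the 2-simplex, has Betti numbers $\beta_0 = 1$ (one component) and $\beta_1 = 1$ (one hole):

    import numpy as np

    # Hollow triangle: vertices {0,1,2}, edges {01, 12, 02}, no 2-simplex.
    # Boundary map d1 : edges -> vertices (one column per oriented edge).
    d1 = np.array([[-1,  0, -1],
                   [ 1, -1,  0],
                   [ 0,  1,  1]])
    # No declared triangles, so d2 is the zero map into the edge space.
    d2 = np.zeros((3, 0))

    rank_d1 = np.linalg.matrix_rank(d1)
    rank_d2 = np.linalg.matrix_rank(d2) if d2.size else 0

    betti0 = d1.shape[0] - rank_d1               # components: 1
    betti1 = (d1.shape[1] - rank_d1) - rank_d2   # 1-dimensional holes: 1
    print(betti0, betti1)                        # -> 1 1

Declaring the simplex $\{0,1,2\}$ would add a column to d2 and kill the hole, which is exactly the sense in which the filled triangle differs from its three edges alone.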

While simplicial complexes allow you to express higher order interactions (tunnels! holes!), they don't remove the problem of noise: one edge/simplex can still change the structure in nontrivial ways. There are two approaches that researchers have taken to this problem: one spectral, and one not. 

The non-spectral approach is by far the most well-established one. It is based on the idea of persistence:  a way to determine from the homology groups of the simplicial complex what structures are interesting and which ones are not. Persistence is the dominant weapon in the arsenal of topological data analysis, and I'll say no more about it here, so as to keep focus on spectral methods. 

The spectral approach is less well developed, but is quite interesting. The idea is to generalize the notion of expansion from a graph to higher-order simplices, as well as generalizing the Laplacian operator to higher order simplices (or homology groups). Then a random walk on the simplicial complex can be linked to the Laplacian operator, and the eigenstructure of the operator can be linked to the existence (or nonexistence) of certain homology groups. Two places to start with this topic are one on capturing generalizations of edge expansion, and another on building random walks on simplicial complexes and connecting them to a combinatorial Laplacian. 

3. The differential geometry view.

A graph is just a discrete approximation of (scalar functions on) a manifold.

The graph Laplacian is a discrete approximation of the Laplacian second-order differential operator, and more generally the Laplace-Beltrami operator on a manifold. Indeed, one way to build intuition for what the graph Laplacian means is that it captures heat diffusion on an implicit manifold that the graph merely approximates.

The Laplace-Beltrami operator is a "zeroth-order" operator in that it applies to the zero-dimensional entities on a manifold: namely, scalar fields over the points of the manifold. Suppose you were to build a vector field over the manifold instead, and wished to reason about it. Then the generalization of the L-B operator that you'd need is called the Laplace-de Rham operator, which acts like a Laplacian on the higher order differential forms defined over the manifold (formally, on sections of exterior powers of the cotangent bundle). Writing down the L-R operator is a little tricky: it involves a combination of the exterior derivative and its dual (via the Hodge * operator). But one useful observation is that the L-R operator on graphs amounts to a Laplacian on the set of edges, rather than vertices.

This means that you can now treat edges as first-class objects for grouping, rather than vertices. And this is useful for higher-order clustering. Whether this can be generalized even further remains to be seen.
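
A minimal sketch of that observation, using the standard combinatorial stand-in for the edge-level operator: the Hodge 1-Laplacian $L_1 = \partial_1^T \partial_1 + \partial_2 \partial_2^T$, whose eigenvectors live on edges rather than vertices (the complex below, two filled triangles sharing a vertex, is made up):

    import numpy as np

    # Vertices 0..4; edges 01, 12, 02, 23, 34, 24; filled triangles 012, 234.
    d1 = np.array([[-1,  0, -1,  0,  0,  0],
                   [ 1, -1,  0,  0,  0,  0],
                   [ 0,  1,  1, -1,  0, -1],
                   [ 0,  0,  0,  1, -1,  0],
                   [ 0,  0,  0,  0,  1,  1]])
    d2 = np.array([[ 1,  0],
                   [ 1,  0],
                   [-1,  0],
                   [ 0,  1],
                   [ 0,  1],
                   [ 0, -1]])

    # Hodge 1-Laplacian: the edge-space analogue of the graph Laplacian.
    L1 = d1.T @ d1 + d2 @ d2.T
    vals, vecs = np.linalg.eigh(L1)
    print(np.round(vals, 3))   # a zero eigenvalue would witness a 1-dim hole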

4. The linear algebraic view.

A graph (adjacency matrix) is just a special case of a tensor structure on the entities.

This is perhaps the most well-known of the different higher-order approaches to modeling interactions, and is making the most waves right now. The idea is very simple. If we think of a graph in terms of its adjacency matrix, then each entry encodes a relation between two basic entities. If we wished to encode relations between (say) three entities, then we need a "3D matrix", or more precisely, a tensor.

Of course a tensor is more than just a way to assemble a collection of triples into a box, just like a matrix is much more than just a grid of numbers. The most basic question in tensor modeling is factorization: just as we can use the SVD to write a matrix as a linear combination of outer products of basis vectors, can we write a tensor as a sum of 3-way outer products of basic vectors ? If so, then we've been able to identify the key 3-way factors controlling an interaction.

Tensor factorization is a hard problem: unlike the SVD, most formulations of tensor factorization are NP-hard. I won't get into this very rich topic right now, and will instead point you to some slides by Ravi Kannan as well as an older paper of his.

But can we cluster using tensors ? Or more generally, is there a spectral theory of tensors, and maybe even the analog of Cheeger's inequality ? It turns out that it's possible to define eigenvalues and eigenvectors of tensors, and at least some of the standard spectral theory can be made to carry over - for more on this see these slides by Lek-Heng Lim. But we're still looking for a theory that can truly work on hypergraphs or tensors in general. In the meantime, tensor-based approaches to clustering boil down to factorization and PCA-like methods, of which there are many.
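
For a flavor of what a tensor eigenvector even is, here is a minimal sketch of the higher-order power method for a symmetric 3-tensor: iterate $x \leftarrow T(I, x, x)$ and normalize. (Unlike the matrix power method, convergence is not guaranteed in general; the tensor here is a random symmetric one.)

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5

    # Build a random symmetric 3-tensor by averaging over all index orders.
    G = rng.standard_normal((n, n, n))
    perms = [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]
    T = sum(np.transpose(G, p) for p in perms) / 6

    # Higher-order power iteration: x <- T(I, x, x), normalized.
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    for _ in range(200):
        y = np.einsum('ijk,j,k->i', T, x, x)
        x = y / np.linalg.norm(y)

    # The associated (Z-)eigenvalue is T(x, x, x).
    print(np.einsum('ijk,i,j,k->', T, x, x, x))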


5. Coda

Are these approaches in any sense equivalent ? It's hard to say in general, although for a particular problem there might be connections. Certainly, the de Rham cohomology yields a nice filtration over homology groups that could be wrestled into a persistence framework (research question !!!) thus allowing us to extract robust higher-order structures from both geometric and topological considerations. The tensor approach is closely linked to probabilistic models, not surprisingly. But whether the spectral perspective (and all its attendant intuition and value) will extend nicely across these different frameworks remains to be seen. 




Monday, March 30, 2015

Revisiting the Misra-Gries estimator

If you've ever taken a class on streaming algorithms, or want to ace some tech company puzzle interview question, you've probably heard of this little gem:
Given n items, determine whether a majority element (one occurring strictly more than n/2 times) exists. You are allowed one pass over the data (which means you can read the elements in sequence, and that's it), and exactly TWO units of working storage.
This is a special case of the general Misra-Gries estimator for finding frequent items in a stream with small space. The MG estimator allows you to find (approximately) all items that occur more than $\epsilon n$ times using $O(1/\epsilon)$ space. Setting $\epsilon = 1/2$ yields the desired result.

This result has been proved and reproved a number of times, and there's a fairly general proof based on the fact that what you're really doing is approximating the true frequency to an additive error of $\epsilon n$ (and so any item that occurs more often than that will be retained).

What's more interesting is how you might go about proving the basic statement from first principles if you don't know the proof. This came up at dinner, when I was talking to people who hadn't heard of the problem before and were able to reconstruct a proof on the fly.

So let's return to the basic statement, and let's be a little more precise. The precise guarantee required is as follows:

  • If a strict majority element exists, you MUST return it
  • If not, you can return any element you want (even the most infrequent one)
Because of the second statement above, we only have to worry about streams in which a strict majority element does exist. 

Observation 1: if you haven't seen a majority element in any prefix of the stream, you can throw that prefix away. 

Why: we know there's a majority element somewhere. If it didn't show up so far, it has to be at least as much of a majority in what's remaining. 

Observation 2: This prefix could be as short as two elements long. 

In other words, we see an element. Call it $x$. If we now see $y \ne x$, then $x$ is no longer a majority element so far, and we can chuck it, starting a new stream at $y$.

But if we do see $x$ again, then it could be a majority element. How do we keep track of whether it ever stops being a majority element ? 

We could keep track of the number of items seen so far. But that's an extra counter. Why not just pair instances of $x$ with instances of other elements seen so far, by subtraction ? If at any time we hit zero, we can invoke Observation 1 and start again. 

If not, then we will end with a nonzero count for $x$, which must be the correct element. 

And this gives us the exact algorithm we need. 
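
For completeness, here is a minimal Python sketch (my rendition; bookkeeping conventions vary slightly across presentations) of both the two-counter algorithm just derived and the general Misra-Gries summary:

    def majority_candidate(stream):
        """One pass, two units of storage: a candidate and a counter.
        If a strict majority element exists, it is returned."""
        candidate, count = None, 0
        for x in stream:
            if count == 0:           # Observation 1: discard the prefix
                candidate, count = x, 1
            elif x == candidate:     # another copy of the candidate
                count += 1
            else:                    # cancel one copy against a non-match
                count -= 1
        return candidate

    def misra_gries(stream, k):
        """At most k-1 counters. Any item occurring more than n/k times
        survives, with its count underestimated by at most n/k."""
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k - 1:
                counters[x] = 1
            else:                    # decrement: cancels k distinct items
                for y in list(counters):
                    counters[y] -= 1
                    if counters[y] == 0:
                        del counters[y]
        return counters

    print(majority_candidate([1, 2, 1, 3, 1, 1]))     # -> 1
    print(misra_gries([1, 2, 1, 3, 1, 1, 2, 1], 3))   # -> {1: 4, 2: 1}

With $k = 2$, the general version collapses to exactly the candidate-and-counter scheme above.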


Friday, February 06, 2015

Friday chest-thumping.

  • My paper with Amirali Abdullah (or should I say Amir's paper with me) got into STOC ! One aspect not discussed in the linked blog post is the connection to Partial Match (a notoriously hard problem in the data structures literature).  In brief, our result allows for a "smooth-ish" interpolation between approximate near neighbor lower bounds in $\ell_1$ and general partial match lower bounds, reinforcing the intuition that Partial Match is an "extremely asymmetric" nearest neighbor problem. 
  • Another paper with 'the Streaming Kings' (said to the sound of a flamenco strum) of Cormode, Chakrabarti, McGregor and Thaler got into CCC (and this is my first paper at Complexity). I'll have more to say about this work in a separate post: in brief, we looked at streaming interactive proofs (of the kind first developed by Cormode, Thaler and Yi) where you have a prover and a streaming verifier and wish to verify a computation with constant rounds of communication. There's a 3-round nearest neighbor protocol in this paper, as well as a number of subtle results about the nature of multi-round communication protocols in the streaming setting. 
"Well Suresh, you have two new papers, what are you going to do next" ?


Wednesday, February 04, 2015

No one is in control...

The New Scientist has a cover this week on the way in which algorithms have taken over our lives. They interviewed some of the participants at the FATML workshop I participated in (thanks to Solon Barocas for sending the reporter our way), and Sorelle Friedler got some nice quotes in regarding the work she, Carlos Scheidegger and I have been doing.

The article is currently paywalled, but a friend of mine sent me the text. The relevant paragraph, lightly edited (the "she" here is Sorelle):

By understanding the biases inherent in the underlying data, she hopes to eliminate bias in the algorithm. Her system looks for correlations between arbitrary properties – like height or address – and demographic groupings like race or gender. If the correlation is expected to lead to unwanted bias, then it would make sense to normalise the data. It is essentially affirmative action for algorithms, she says.
I'm not sure I'd describe our work as affirmative action :), but that's a longer argument.

Thursday, January 29, 2015

More FOCS 2014-blogging

In the spirit of better late than never, some more updates from Amirali Abdullah from his sojourn at FOCS 2014. Previously, he had blogged about the higher-order Fourier analysis workshop at FOCS.

I'll discuss now the first official day of FOCS, with a quick digression into the food first: the reception was lovely, with some nice quality beverages, and delectable appetizers which I munched on to perhaps some slight excess. As for the lunches given to participants, I will think twice in future about selecting a kosher option under dietary restrictions. One hopes for a little better than a microwave instant meal at a catered lunch, with the clear plastic covering still awaiting being peeled off. In fairness to the organizers, once I decided to revert to the regular menu on the remaining days, the chicken and fish were perfectly tasty.

I will pick out a couple of the talks I was most interested in to summarize briefly. This is of course not necessarily a reflection of comparative quality or scientific value; just which talk titles caught my eye.

The first talk is "Discrepancy minimization for convex sets" by Thomas Rothvoss. The basic setup of a discrepancy problem is this: consider a universe of $n$ elements, $[n]$, and a set system of $m$ sets ($m$ may also be infinite), $S = \{S_1, S_2, \ldots, S_m \}$, where $S_i \subset [n]$. Then we want to find a $2$-coloring $\chi : [n] \to \{-1, +1 \}$ such that each set is as evenly colored as possible. The discrepancy then measures how unevenly colored some set $S_i \in S$ must be under the best possible coloring.

One fundamental result is that of Spencer, which shows that (when the number of sets is $O(n)$) there always exists a coloring of discrepancy $O(\sqrt{n})$. This shaves a logarithmic factor off of a simple random coloring, and the proof is non-constructive. This paper by Rothvoss gives an algorithm that serves as a constructive proof of the theorem, in the more general setting of convex sets.
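
To pin down the definitions, here is a tiny Python sketch (with a made-up random set system) that computes the discrepancy of a coloring, $\max_i | \sum_{j \in S_i} \chi(j) |$:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 64
    sets = [rng.choice(n, size=rng.integers(2, n), replace=False)
            for _ in range(n)]

    def discrepancy(chi):
        return max(abs(chi[S].sum()) for S in sets)

    # A uniformly random coloring achieves O(sqrt(n log n)) w.h.p.;
    # Spencer's theorem guarantees a coloring of discrepancy O(sqrt(n)).
    chi = rng.choice([-1, 1], size=n)
    print(discrepancy(chi))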

The first (well-known) step is that Spencer's theorem can be recast as a problem in convex geometry. Each set $S_i$ can be converted to a geometric constraint in $R^n$: namely, the region $\{ x \in R^n : | \sum_{j \in S_i} x_j | \leq 100 \sqrt{n} \}$. The intersection of this set of constraints defines a polytope $K$, and $K$ contains a point of the hypercube $\{-1 , +1 \}^n$ iff a valid low discrepancy coloring exists.

One can also of course do a partial coloring iteratively - if a constant fraction of the elements can be colored with low discrepancy, it suffices to repeat.

The algorithm is surprisingly simple and follows from the traditional idea of trying to solve a discrete problem via its relaxation. Take a point $y$ generated from the spherical $n$-dimensional Gaussian with variance 1. Now find the point $x$ closest to $y$ that lies in the intersection of the constraint set $K$ with the continuous hypercube $[-1, +1]^n$ (for example, by using the polynomial time ellipsoid method). It turns out some constant fraction of the coordinates of $x$ are actually tight (i.e., integer valued in $\{-1, +1 \}$), and so $x$ turns out to be a good partial coloring.

To prove this, the paper shows that with high probability all subsets of $[-1, +1]^n$ with very few tight coordinates are far from the starting point $y$, whereas with high probability the intersection of $K$ with some set having many tight coordinates is close to $y$. This boils down to showing that the latter has sufficiently large Gaussian measure, which can be shown by standard tools in convex analysis and probability theory. Or to rephrase, the proof works by arguing about the isoperimetry of the concerned sets.

The other talk I'm going to mention from the first day is by Karl Bringmann on the hardness of computing the Frechet distance between two curves. The Frechet distance is a measure of curve similarity, and is often popularly described as follows: "if a man and a dog each walk along two curves, each with a designated start and finish point, what is the shortest length leash required?"

The problem is solvable in $O(n^2)$ time by simple dynamic programming, and has since been improved to $O(n^2 \log \log n / \log n)$ (for the discrete version) by Agarwal, Ben Avraham, Kaplan and Sharir. It has long been conjectured that there is no strongly subquadratic algorithm for the Frechet distance. (A strongly subquadratic algorithm being defined as one with $O(n^{2 -\delta})$ complexity for some constant $\delta > 0$, as opposed to, say, $O(n^2 / \text{polylog}(n))$.)
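
For reference, here is a minimal sketch of the classical quadratic dynamic program for the discrete version (the man and the dog each advance along the vertices of their polylines; the curves below are made up):

    from functools import lru_cache
    from math import dist

    def discrete_frechet(P, Q):
        """O(n^2) DP: dp(i, j) is the shortest leash that lets the man
        reach P[i] and the dog reach Q[j], each moving only forward."""
        @lru_cache(maxsize=None)
        def dp(i, j):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                return d
            if i == 0:
                return max(dp(0, j - 1), d)
            if j == 0:
                return max(dp(i - 1, 0), d)
            return max(min(dp(i - 1, j), dp(i - 1, j - 1), dp(i, j - 1)), d)
        return dp(len(P) - 1, len(Q) - 1)

    print(discrete_frechet([(0, 0), (1, 0), (2, 0)],
                           [(0, 1), (1, 1), (2, 1)]))   # -> 1.0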

The work by Bringmann shows this conjecture to be true, assuming SETH (the Strong Exponential Time Hypothesis), or more precisely that there is no $O^*((2- \delta)^N)$ algorithm for CNF-SAT. The hardness result holds for both the discrete and continuous versions of the Frechet distance, as well as for any $1.001$-approximation.

The proof works on a high level by directly reducing an instance of CNF-SAT to two curves where the Frechet distance is smaller than $1$ iff the instance is satisfiable. Logically, one can imagine the set of variables being split into two halves, and assigned to each curve. Each curve consists of a collection of "clause and assignment" gadgets, which encode whether all clauses are satisfied by a particular partial assignment. A different such gadget is created for each possible partial assignment, so that there are $O^*(2^{N/2})$ vertices in each curve. (This is why solving Frechet distance by a strongly subquadratic algorithm would imply a violation of SETH.)

There are many technical and geometric details required in the gadgets which I won't go into here. I will note admiringly that the proof is surprisingly elementary. No involved machinery or complexity result is needed in the clever construction of the main result; mostly just explicit computations of the pairwise distances between the vertices of the gadgets.


I will have one more blog post in a few days about another couple of results I thought were interesting, and then comment on the Knuth Prize lecture by the distinguished Dick Lipton.

Tuesday, January 27, 2015

Streaming @ SODA: Part II

This is the second of two posts by Samira Daruki on the streaming sessions at SODA 2015. For the first post, see here. 

In the third paper from the streaming graph family at SODA15, "Parameterized Streaming: Maximal Matching and Vertex Cover", Chitnis, Cormode, Hajiaghayi and Monemizadeh introduce a new approach to handling graph streams called parameterized streaming algorithms. In addition to the insertion-only model, they also consider the dynamic model of streaming graphs, in which the input is a sequence of insertions and deletions of edges.

This dynamic model of streaming graph processing is popular when the graph size is changing, and has recently received much attention due to breakthroughs by Ahn, Guha and McGregor (one, and two). Over these two papers, they showed the first results for a number of graph problems over dynamic streams. This has provoked much interest in what can be computed over dynamic graph streams, although there is still not much work on solving graph optimization problems in this model. The challenge here is that when an edge is deleted, substantial work is sometimes required to repair the solution, so we need to make sure that the algorithm has enough information to do so, while keeping only a bounded amount of working space. (ed: not surprisingly, some of these ideas are useful for non-streaming dynamic graph algorithms: witness the paper by Kapron, King and Mountjoy on dynamic connectivity in (randomized) worst-case polylog time from SODA a few years ago)

Returning to parameterized streaming: in this paper, instead of solving the optimization problem on the graph stream exactly, the goal is to solve the "parameterized" version of the problem, where the parameter $k$ is given and we want to solve the following decision problem:
Is there a solution with size bounded by $k$? 
The motivation for parameterizing the problem comes from real world applications in which the solution to the graph problem is small compared to the size of the input (i.e., sublinear in the size of the input). In these cases, the interesting challenge is to solve optimization problems on graphs in streaming fashion using space bounded by some function of the "solution size" instead of the "input size".

To solve parameterized problems, one technique that is often used is kernelization, which uses polynomial-time preprocessing to map the input to another equivalent input of smaller size $f(k)$ (called a kernel) with a new parameter value $k' \le g(k)$, for a computable function $g$.

In this paper, by combining kernelization techniques with randomized sketch structures, the first streaming algorithm for the parameterized version of the Vertex Cover problem is obtained. The main idea is to maintain a maximal matching of the underlying graph in a streaming fashion, and then run the well-known kernelization algorithm for Vertex Cover on the maintained maximal matching. The data structure that maintains the maximal matching uses the $k$-sample recovery sketching algorithm, a generalization of linear sketching for $\ell_0$-sampling, as its main tool, applying it to the neighborhood (incident edges) of each vertex in the resulting matching. As edges are inserted or deleted, these sketches can be updated without needing knowledge of the full neighborhood of nodes. However, there are some challenges with deletion of edges: as edges are deleted, we need an intelligent mechanism to ensure the matching remains maximal using only the limited stored information.
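
To make the insertion-only case concrete, here is a minimal Python sketch of the maximal-matching idea (my simplification: the paper's dynamic algorithm uses the sketches above to handle deletions, and its kernel also stores extra edges incident to matched vertices so that an exact answer can be recovered):

    def parameterized_vc_stream(edges, k):
        """Insertion-only sketch: grow a maximal matching greedily.
        A maximal matching of size > k certifies that no vertex cover
        of size <= k exists; otherwise its <= 2k endpoints are a cover
        and the starting point of the Vertex Cover kernel."""
        matching, matched = [], set()
        for u, v in edges:
            if u not in matched and v not in matched:
                matching.append((u, v))
                matched.update((u, v))
                if len(matching) > k:
                    return None      # answer NO: vertex cover exceeds k
        return matching              # kernelize/solve exactly from here

    print(parameterized_vc_stream([(1, 2), (2, 3), (4, 5)], 2))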

Another nice result here is a tight lower bound of $\Omega(k^2)$ (by reduction from the INDEX problem in communication complexity) on the space complexity of any (randomized) streaming algorithm for parameterized Vertex Cover, which holds even in the insertion-only model.

Besides the general insertion-only and dynamic models, a restricted dynamic model is also discussed, in which we know for sure that at time $i$, the size of the vertex cover of the graph induced by the edges seen so far is at most $k$. With this promise, they develop a dynamic parameterized streaming algorithm whose space usage matches the proved lower bound.

It is interesting to think about other NP-hard problems in the framework of parameterized streaming, to explore how kernelization can be helpful in this direction, and to see whether we can find other powerful hammers to overcome the challenges that arise in designing streaming algorithms for hard problems.


Coda:
Along with the three papers discussed above, there was another paper on streaming presented at SODA (by Naumovitz and Saks) which provides a deterministic polylogarithmic-space streaming algorithm for approximating distance to monotonicity for a sequence of $n$ numbers, compared to the corresponding randomized result presented at SODA two years ago.

While I won't discuss this work here so as to keep these posts  just about streaming graph algorithms, I encourage the interested readers to take a look at this paper as well, as the last one in the streaming family of SODA15.

Streaming @ SODA: Part I

This two-part series is written by my student Samira Daruki.

Modern graph data sets are too large to fit in memory. And so the streaming model is one of the more popular and attractive ones for analyzing massive graphs: in this model, for an input graph $G = (V, E)$ with $n$ vertices and $m$ edges, the edges arrive in an arbitrary order in a stream and the algorithm can only use $\tilde{O}(n)$ space. The algorithm is allowed several passes over the stream, but usually the ideal is just one pass.

For many graph problems such as matching, spanning tree, graph sparsification, approximate distance and counting subgraphs there now exist streaming algorithms with space complexity $O(n \text{poly} (\log n))$. In these algorithms, usually we assume that the nodes of the graphs can be stored in memory but edges cannot. Notice that for constructing  matchings, spanners and sparsifiers, the output size is often $\Omega(n)$, so it forces the streaming algorithm to use $\Omega(n)$ space.


But if you don't need to report the output, then this can be avoided. For an example, see the work of Kapralov, Khanna and Sudan from SODA 2014, which approximates the matching size to a $\text{poly}(\log n)$ factor using $\text{poly}(\log n)$ space in a random stream (where edges appear in a random order rather than arbitrarily).

Thus, the question now is:
can we obtain a $\text{poly}\log$ space streaming algorithm for approximating the solution cost for other graph problems? 
Consider for instance MAX-CUT. There are several results on approximating the maximum cut in graphs in a non-streaming model; the trivial approach is to take a random cut. This selects half of the edges in expectation, resulting in a factor $1/2$-approximation.

This implies that in a streaming model we can just count the number of edges $m$ and output $\frac{m}{2}$, which results in an $O(\log n)$-space algorithm. By keeping a sample of the edge set we can get a different tradeoff: a $(1+\epsilon)$-approximation algorithm which uses $O(\frac{n}{\epsilon^2})$ space.
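
For concreteness, a sketch of the trivial single-counter estimator just described:

    def maxcut_half_edges(edge_stream):
        """One O(log n)-bit counter. Since a uniformly random cut keeps
        each edge with probability 1/2 and the max cut is at most m,
        m/2 is a factor-2 approximation of the max-cut value."""
        m = 0
        for _ in edge_stream:
            m += 1
        return m / 2

    print(maxcut_half_edges(iter([(0, 1), (1, 2), (2, 0)])))   # -> 1.5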

Can we get a streaming algorithm with a better than factor-$2$ approximation using just $\text{poly}(\log n)$ space?

A paper by Kapralov, Khanna and Sudan in the streaming session of SODA15 this year answers this question. This is the latest in a line of results on streaming graph problems by Kapralov and others from SODA 2012, 2013 and 2014 (mentioned above).

Here is their main result:
For any constant $\epsilon > 0$ a single pass streaming algorithm for approximating the value of MAX-CUT  to a factor $2 − \epsilon$ requires $\Omega(\sqrt{n})$ space, even in the random order model.
This result rules out the possibility of $\text{poly}(\log n)$ space, but suggests that $\tilde{O}(\sqrt{n})$ space cost might be possible in some specific settings.

Another major result of this paper is as follows:

For any constant $\epsilon > 0$ a single pass streaming algorithm for approximating MAX-CUT value to factor $1 + \epsilon$ requires $n^{1−O(\epsilon)}$ space in adversarial streams.

The main fact they use here is the connection between the MAX CUT value and (distance from) bipartiteness:
if a graph $G$ with $m$ edges is $\beta$-far from being bipartite, then the max-cut value of $G$ is at most $(1-\beta)m$. 
The other simple observation is that any algorithm that computes a $\gamma$-approximation to MAX CUT distinguishes between bipartite graphs and graphs that are $1-\frac{1}{\gamma}$-far from being bipartite. Thus to show that no streaming algorithm using space $c$ can achieve a $\gamma$-approximation with failure probability at most $\delta$, it's enough to show that no streaming algorithm using space $c$ can distinguish between bipartite graphs and graphs which are $1- \frac{1}{\gamma}$-far from being bipartite with probability at least $1- \delta$.

Given these facts, the core idea in proving the main result is to exhibit a distribution over inputs on which a $(2-\epsilon)$-approximation to MAX-CUT requires $\Omega(\sqrt{n})$ space.

More precisely, the input graph instances are based on random Erdos-Renyi graphs, which are bipartite in the YES case and non-bipartite in the NO case. In order to achieve a $(2-\epsilon)$-factor gap for MAX CUT in this structure, the expected degree of vertices is chosen to be $\Theta(\frac{1}{\epsilon^2})$.

This way, the input streaming graph can be partitioned and given to the algorithm in $\Omega(\frac{1}{\epsilon^2})$ phases, which can be simulated as a $\frac{1}{\epsilon^2}$-party one-way communication game.

Then, by giving a reduction from a variant of Boolean Hidden Matching (BHM) called Distributional Boolean Hidden Partition (D-BHP) to MAX-CUT on the input instances, and showing that $\Omega(\sqrt{n})$ space is necessary to differentiate between the two cases, the main streaming lower bound for approximating MAX-CUT follows.

There are many technical details in performing this reduction, but roughly speaking they show that any algorithm that solves MAX-CUT on the constructed instances must solve the two-party communication problem in at least one of the phases.

There are still some interesting open problems left by this paper:

  • One is whether breaking the $2$-approximation barrier in $\sqrt{n}$ space is possible if we are allowed $\text{poly}(\log n)$ passes over the input stream.
  • Also it is interesting to think about designing streaming algorithms for other graph problems using $o(n)$ space.

This brings us to another paper presented in this session. In Streaming Algorithms for Estimating the Matching Size in Planar Graphs and Beyond (by Esfandiari, Hajiaghayi, Liaghat, Monemizadeh and Onak), this latter question is answered for approximating the maximum matching in planar graphs using $o(n)$ space.

Here is the main result:
If the underlying graph is planar, then there is a streaming algorithm which provides a $O(1)$-approximation solution to maximum matching with high probability using $O(n^{\frac{2}{3}})$ space.
The main idea for proving the result here is to use a known structural graph property:
If we characterize the nodes of the input graph, based on their degrees, into two groups of light (deg < 9) and heavy (deg > 9) vertices, and then define the shallow edges as the ones with two light endpoints, then we have the following nice property (writing $m$ for the size of a maximum matching, $s$ for the number of shallow edges and $h$ for the number of heavy vertices): 
$$ \frac{\max (s, h)}{12} \leq m \leq h + s$$

Then, using this structural fact as the main tool, constructing a small matching (of size bounded by $c n^{\frac{2}{3}}$) greedily as we read the edges, and estimating the number of shallow edges and heavy vertices in the subgraph induced by a sample of $c n^{\frac{2}{3}}$ vertices, we can approximate the size of the maximum matching to within a constant factor.

In addition to the planar case, they show that similar results are achievable for approximating the maximum matching in other graph classes such as $d$-degenerate graphs and forests.

Coming up: parametrized streaming and VERTEX COVER.


Thursday, January 15, 2015

Brief Review of the post-SODA workshop on algorithmic challenges in machine learning.

My student +John Moeller attended the workshop on algorithmic challenges in machine learning organized by +Shachar Lovett and +Kamalika Chaudhuri after SODA. He wrote up a brief report of some papers that he liked.

Monday, January 12, 2015

Privacy is dead, but not because Scott McNealy said so.

I decided to experiment with using Medium for a long-form not-quite technical post on the evolution of research in data privacy.
The idea of privacy has driven much of our concern over data for the last few years, and has been the driver of extremely successful research efforts, most notably in differential privacy. What we're seeing now though is a pivot away from the undifferentiated and problematic notion of privacy in data towards a more subtle and nuanced notion of how data is actually used, and how it should be used.

Here's the post: Medium allows this nifty way of commenting inline, so comment away there, or here.

Thursday, December 11, 2014

Accountability in data mining

For a while now, I've been thinking about the endgame in data mining. Or more precisely, about a world where everything is being mined for all kinds of patterns, and the contours of our life are shaped by the patterns learnt about us.

We're not too far from that world right now - especially if you use intelligent mail filtering, or get your news through social media. And when you live in such a world, the question is not "how do I find patterns in data", but rather
How can I trust the patterns being discovered ? 
When you unpack this question, all kinds of questions come to mind: How do you verify the results of a computation ? How do you explain the results of a learning algorithm ? How do you ensure that algorithms are doing not just what they claim to be doing, but what they should be doing ?

The problem with algorithms is that they can often be opaque: but opacity is not the same as fairness, and this is now a growing concern amongst people who haven't yet drunk the machine learning kool-aid.

All of this is a lead up to two things. Firstly, the NIPS 2014 workshop on Fairness, Accountability and Transparency in Machine Learning, organized by +Moritz Hardt and +Solon Barocas and written about by Moritz here.

But secondly, it's about work that +Sorelle Friedler, +Carlos Scheidegger and I are doing on detecting when algorithms run afoul of a legal notion of bias called disparate impact. What we're trying to do is pull out from a legal notion of bias something more computational that we can use to test and certify whether algorithms are indulging in legally proscribed discriminatory behavior.

This is preliminary work, and we're grateful for the opportunity to air it out in a forum perfectly designed for it.

And here's the paper:
What does it mean for an algorithm to be biased?
In U.S. law, the notion of bias is typically encoded through the idea of disparate impact: namely, that a process (hiring, selection, etc) that on the surface seems completely neutral might still have widely different impacts on different groups. This legal determination expects an explicit understanding of the selection process. 
If the process is an algorithm though (as is common these days), the process of determining disparate impact (and hence bias) becomes trickier. First, it might not be possible to disclose the process. Second, even if the process is open, it might be too complex to ascertain how the algorithm is making its decisions. In effect, since we don't have access to the algorithm, we must make inferences based on the data it uses. 
We make three contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has not received as much attention as more traditional notions of accuracy. Second, we propose a test for the possibility of disparate impact based on analyzing the information leakage of protected information from the data. Finally, we describe methods by which data might be made "unbiased" in order to test an algorithm. Interestingly, our approach bears some resemblance to actual practices that have recently received legal scrutiny.
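To make the legal notion slightly more concrete, here is a minimal sketch of the standard "four-fifths rule" check commonly used to flag disparate impact. Note that this is the usual operationalization of the legal test, not the specific measure proposed in our paper, and the data below is made up.

```python
def disparate_impact_ratio(outcomes, groups, protected, favorable=1):
    """Four-fifths rule check: the ratio of favorable-outcome rates between
    the protected group and everyone else. A ratio below 0.8 is the usual
    legal red flag for disparate impact."""
    prot = [o for o, g in zip(outcomes, groups) if g == protected]
    rest = [o for o, g in zip(outcomes, groups) if g != protected]
    rate = lambda xs: sum(x == favorable for x in xs) / len(xs)
    return rate(prot) / rate(rest)

# Made-up toy data: 1 = hired, 0 = not hired.
outcomes = [1, 0, 0, 1, 1, 1, 0, 1]
groups   = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(disparate_impact_ratio(outcomes, groups, protected="A"))  # 0.5/0.75 = 0.67
```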
In one form or another, most of my recent research touches upon questions of accountability in data mining. It's a good thing that it's now a "trend for 2015". I'll have more to say about this in the next few months, and I hope that the issue of accountability stays relevant for more than a year.

Monday, October 06, 2014

Algorithms is the new Algorithms...

Hal Daumé wrote a rather provocative post titled ‘Machine Learning is the new algorithms’, and has begged someone to throw his gauntlet back at him. Consider it now thrown !

His thesis is the following quote:
Everything that algorithms was to computer science 15 years ago, machine learning is today
And among the conclusions he draws is the following:
we should yank algorithms out as an UG requirement and replace it with machine learning
Having spent many years having coffees and writing papers with Hal, I know that he truly does understand algorithms and isn’t just trying to be a troll (at least I hope not). So I’m trying to figure out exactly what he’s trying to say. It will help if you read his article first before returning…

First off, I don’t understand the conclusion. Why not (say) replace architecture with ML, or databases with ML? Or why replace anything at all? The assumption is that ML is a core tool that students will need more than any other topic. Now I have no problem with adding ML to the list of "things a CS person ought to know", but I don’t see why it’s not important for a CS person to understand how a database works, or how a compiler processes code, or even how we design an algorithm. This fake mutual exclusiveness appears to be without basis.

But maybe he’s saying that algorithms and ML are two flavors of the same object, and hence one can be replaced by the other. If so, what exactly is that object ? In his view, that object is:
given an input, what’s the best way to produce an (optimal) output ? 
This seems to be an overly charitable view of ML. In ML, the goal is to, well, learn. Or to be more stodgy about it, ML provides tools for making systematic inferences and predictions from data.

Which suggests that the concerns are fundamentally orthogonal, not in opposition (and Sasho Nikolov makes this point very well in a comment on Hal’s post). As Hal correctly argues, the hard work in ML is front-loaded: modeling, feature extraction, and the like. The algorithm itself is mostly an afterthought.

But what’s ironic is that one of the most important trends in ML of late is the conversion of an ML problem into an optimization problem. The goal is to make good modeling choices that lead to an optimization problem that can be solved well. But wait: what do you need in order to solve an optimization problem? Wait for it…… ALGORITHMS !!
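A toy illustration of this point: once the modeling choices are fixed (say, logistic regression), what remains is a bare optimization problem, and plain gradient descent does the rest. A minimal sketch, with made-up data:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=1000):
    """Fit logistic regression by plain gradient descent. The 'ML' was in
    choosing the model and the loss; what's left is an algorithms problem."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)    # gradient of the average log-loss
    return w

# Made-up data: the first column is a bias term.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.7]])
y = np.array([1.0, 0.0, 1.0, 0.0])
print(fit_logistic(X, y))
```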

The argument about stability in algorithms is an odd one, especially coming from someone who’s just written a book on ML. Yes, some core algorithms techniques haven’t changed much in the last many years, but did you see that 2014 paper on improvements in recurrence analysis ? Or the new sorting algorithm by Mike Goodrich ? or even the flurry of new results for Hal’s beloved flow problems ?

As for “everything’s in a library”, I’ll take your boost graph library and give you back WEKA, or libSVM, or even scikit-learn. I don’t need to know anything from Hal’s book to do some basic futzing around in ML using libraries: much like I could invoke some standard hashing subroutine without knowing much about universal hash families.

In one sense though, Hal is right: ML is indeed where algorithms was 15 years ago. Because 15 years ago (approximately) is when the streaming revolution started, and with it the new interest in sublinear algorithms, communication complexity, matrix approximations, distributed algorithms with minimal communication, and the whole "theory of big data" effort. And ML is now trying to catch up to all of this, with some help from algorithms folks :).

What is true is this though: it wouldn’t hurt us to revisit how we construct the core algorithms classes (undergrad and grad). I realize that CLRS is the canon, and it would cause serious heart palpitations to contemplate not stuffing every piece of paper of that book down students’ throats, but maybe we should be adapting algorithms courses to the new and exciting developments in algorithms itself. I bring in heavy doses of approximation and randomization in my grad algorithms class, and before we forked off a whole new class, I used to teach bits of streaming, bloom filters, min-wise hashing and the like as well. My geometry class used to be all about the core concepts from the 3 Marks book, but now I throw in lots of high dimensional geometry, approximate geometry, kernels and the like.

Ultimately, I think a claim of fake mutual exclusivity is lazy, and ignores the true synergy between ML and algorithms. I think algorithms research has a lot to learn from ML about robust solution design and the value of "noise-tolerance", and ML has plenty to learn from algorithms about how to organize problems and solutions, and how deep dives into the structure of a problem can yield insights that are resilient to changes in the problem specification.



Wednesday, April 30, 2014

SIAM Data Mining 2014: On differential privacy

After my trip to Haverford, I attended the SIAM Data Mining (SDM) conference in Philly. For those who aren't that familiar with the data mining universe, SDM is the SIAM entrant in the data mining conference sweepstakes, along with ACM (KDD) and IEEE (ICDM). SDM is probably also the smallest of the three venues, which makes it comparable in feel to SODA (helped by the common SIAM organization). The conference attracts the usual data mining suspects, but also more of the applied math folks.

I was the tutorials chair this year, and there were a number of very well-attended tutorials ranging from applications to core mining to theory. In particular, +Moritz Hardt and +Aleksandar Nikolov did a very nice tutorial on differential privacy entitled 'Safer Data Mining'.


SDM is a good venue for theory folks wanting to "test the waters" with data mining: the papers are consistently more mathematically oriented and less "business-heavy", and it's a friendly crowd :).

Shameless plug: I'm the PC co-chair next year along with Jieping Ye and I'd encourage more algorithms folks to submit, and visit Vancouver in April.

In a future post I'll talk more about a panel I also ran at the conference, titled 'Ethics in Data Mining'.

Tuesday, April 22, 2014

The Shape of Information

A brief synopsis of my general-audience talk at Haverford College. 

I'm currently visiting Haverford College at the invitation of +Sorelle Friedler as part of Haverford's big data lecture series. Today's talk was a general audience talk about data mining, titled 'The Shape Of Information': (GDrive link)
The Shape Of Information 
What makes data mining so powerful, and so ubiquitous? How can the same set of techniques identify patients at risk for a rare genetic disorder, consumers most likely to like Beyonce's latest album, or even a new star from a sky survey ?
The answer starts with an idea Descartes had nearly 400 years ago. He suggested expressing geometry in terms of numbers (coordinates). This turned out to be a powerful technique that led (among other things) to the development of the calculus. Data mining returns the favor. It starts with sets of numbers that describe a collection of objects. To find patterns in these objects, we create a geometry in which the numbers are coordinates. And just like that, objects become shapes, and the search for information becomes a quest for common structure in these shapes. 
In this search, we are not limited by the geometry of our world: we can dream up ever more intricate geometries that capture the shape of the information that we seek to find in our data. In this sense, data mining is the best kind of science fiction come to life: we craft a world out of our imagination, and let the laws of this world lead us to fascinating discoveries about the data that inhabits it.
I had a great time visiting with Sorelle's students in their data mining class. Haverford College continues to impress me with the quality of their undergrads (and their faculty !)

Monday, April 07, 2014

Directed isoperimetry and Bregman divergences.

A new cell probe lower bound for Bregman near neighbor search via directed hypercontractivity. 

Amirali Abdullah and I have been pondering Bregman divergences for a while now. You can read all about them in my previous post on the topic. While they generalize Euclidean geometry in some nice ways, they are also quite nasty. The most important nastiness that they exhibit is asymmetry:
\[ D_\phi(x, y) \ne D_\phi(y, x)\]
What's worse is that this asymmetry can grow without bound. In particular, we can quantify the degree of asymmetry by the parameter $\mu$:
\[ \mu = \max_{x,y \in \Delta} \frac{D_\phi(x,y)}{D_\phi(y,x)} \]
where $\Delta$ is the domain of interest.
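A quick numerical illustration (my own toy example): the KL divergence is the Bregman divergence generated by the negative entropy $\phi(p) = \sum_i p_i \log p_i$, and its asymmetry (and hence $\mu$) blows up as points approach the boundary of the simplex.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# Negative entropy on the simplex generates the KL divergence.
phi      = lambda p: np.sum(p * np.log(p))
grad_phi = lambda p: np.log(p) + 1.0

x = np.array([0.90, 0.05, 0.05])
y = np.array([1/3, 1/3, 1/3])
print(bregman(phi, grad_phi, x, y))  # D(x, y) = KL(x || y)
print(bregman(phi, grad_phi, y, x))  # D(y, x) = KL(y || x): noticeably different
```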

There's been a ton of work on clustering with Bregman divergences, as well as our own work on low-dimensional approximate nearest neighbor search. In almost all these results, $\mu$ shows up in the various resource bounds (space and/or time), and it seems hard to get rid of it without making other assumptions.

After our low-D ANN result, we started trying to work on the high-dimensional version, and our thoughts turned to locality-sensitive hashing. After struggling for a while to get something that might be useful, we started pondering lower bounds. In particular, it seemed clear that the worse $\mu$ was, the harder it became to design an algorithm. So could we actually come up with a lower bound that depended on $\mu$ ?

Our first result was via a reduction from the Hamming cube (and $\ell_1$ near neighbors). While this gave us the desired lower bound, it wasn't $\mu$-sensitive. So we went looking for something stronger.

And that's where isoperimetry made an entrance. There's been a long line of results that prove lower bounds for LSH-like data structures for near neighbor search. Roughly speaking, they work in the following manner:

  1. Construct a gap distribution: a distribution over inputs and queries such that a query point is very close to its nearest neighbor and very far from any other point in the space. In the $\ell_1$ case, what you want is something like $\epsilon d$ distance to the nearest neighbor, and $\Omega(d)$ distance to the second nearest neighbor. 
  2. Imagine your data structure to be a function over the Hamming cube (assigning blocks of points to cells) and look at what happens when you perturb it (formally, by defining the function value to be its average over all nearby points).
  3. Use isoperimetry (or hypercontractivity) to argue that the function "expands" a lot. What this implies is that points get scattered everywhere, and so any particular cell you probe doesn't have enough information to determine where the true nearest neighbor actually is. This last step can be proved in different ways, either using Fourier methods, or via the use of Poincaré-type inequalities (a standard form of the hypercontractive inequality is recorded just below).
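For reference, the symmetric statement invoked in step 3 is usually the Bonami-Beckner hypercontractive inequality; in standard notation (mine, not any particular paper's):
\[ \|T_\rho f\|_q \le \|f\|_p \quad \text{for } 1 < p \le q \text{ and } \rho \le \sqrt{\tfrac{p-1}{q-1}}, \]
where $(T_\rho f)(x) = \mathbb{E}[f(y)]$, with $y$ obtained from $x$ by flipping each coordinate independently with probability $(1-\rho)/2$. It is exactly this kind of statement that fails once the noise is directed.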

Our initial hope was to use this framework to prove our bound, while incorporating terms relating to the asymmetry of Bregman divergences more directly.

This turned out to be hard. The asymmetry destroys many of the nice algebraic properties associated with the noise operator and related inner products. There's even a deeper problem: if you think of an asymmetric noise operator as pushing things "forward" on a cube with one probability and backwards with another, then it's not hard to imagine a scenario where applying it actually ends up concentrating the mass of the function in a smaller region (which would break hypercontractivity).

We needed two things: a way to get around the not-so-small issue that hypercontractivity isn't true in a directed setting, and a way to analyze the asymmetric noise operator. It turned out that we could circumvent the first problem: we could restrict the class of hash functions we considered in a way that didn't significantly change their entropy (only by a constant). Alternatively, we could view hypercontractivity as being true "on average".

The second problem was a bit harder. The solution we came up with was to try and link the directed noise operator to a symmetric operator on a biased space. The intuition here was that the directed operator was creating a nonuniform distribution, so maybe starting off with one might give us what we need.

This actually worked ! There are versions of the Bonami-Beckner inequality in biased spaces that look almost just like their counterparts in uniform space (you merely set the probability of a bit being 1 to $p \ne 1/2$, and the Fourier coefficients are defined in terms of $p$).
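For the record, the $p$-biased setup looks like this (standard notation, not necessarily the paper's): under the product measure on $\{0,1\}^n$ in which each bit equals $1$ with probability $p$, the characters become
\[ \phi_i(x) = \frac{x_i - p}{\sqrt{p(1-p)}}, \qquad \chi_S(x) = \prod_{i \in S} \phi_i(x), \qquad f = \sum_{S \subseteq [n]} \hat{f}(S)\, \chi_S, \]
which reduces to the usual Fourier basis at $p = 1/2$.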

There was lots more to address even after the isoperimetry result, but you'll have to read the paper for that :).

I should note that in many ways, this was a "Simons baby": Amir visited during one of the workshops for the program on real analysis, and we had fruitful conversations with Elchanan Mossel and Nathan Keller in early stages of this work.


Tuesday, March 04, 2014

Two big data talks at Haverford College

I'm giving two talks at Haverford College just before SDM 2014 in Philadelphia. If you're in the area, stop by and say hello !


Thursday, February 06, 2014

On "reverse engineering" algorithms, contextual readings, and the invisible choices that add value to research.

tl;dr: Understanding the value in a piece of work (a paper, an algorithm, a system) is not just a matter of understanding the various parts and how they fit. It is also a matter of understanding why the problem is decomposed that way. This observation has implications for how we read, write and evaluate research. 

Nick Seaver is a cultural anthropologist at UC Irvine. He just wrote an article at Medium on reverse engineering algorithms. In it, he distinguished between the factual reconstruction of an algorithm ("how does it work") and a more probing examination of the way it was designed ("why is it broken down in a particular way").

He comes to a conclusion that is quite sound and not at all controversial:
I want to suggest here that, while reverse engineering might be a useful strategy for figuring out how an existing technology works, it is less useful for telling us how it came to work that way. Because reverse engineering starts from a finished technical object, it misses the accidents that happened along the way — the abandoned paths, the unusual stories behind features that made it to release, moments of interpretation, arbitrary choice, and failure. Decisions that seemed rather uncertain and subjective as they were being made come to appear necessary in retrospect. Engineering looks a lot different in reverse.
But along the way, he makes an insightful observation about the very idea of structuralism as applied to algorithm design: namely, the idea that by breaking down the parts of the algorithm and understanding the pieces and how they fit together we can "understand" the algorithm at some higher level.
When you break an object down into its parts and put it back together again, you have not simply copied it — you’ve made something new. A movie’s set of microtags, no matter how fine-grained, is not the same thing as the movie. It is, as Barthes writes, a “directed, interested simulacrum” of the movie, a re-creation made with particular goals in mind. If you had different goals — different ideas about what the significant parts of movies were, different imagined use-cases — you might decompose differently. There is more than one way to tear apart content. (emphasis added)
In other words, the value of a reductionist analysis of a piece of work is not just in understanding the parts and how they fit, but in understanding the choices that led to that particular decomposition.

I think there are important lessons here for anyone involved in the creation and evaluation of new work. In particular:

Reading: While this is mostly advice for students, it applies whenever you're reading unfamiliar material. The particular form of the paper -- how the proofs are structured, how the system is built, or how the algorithm components work -- is a choice made by the authors and should not be left unexamined or unquestioned. All too often a student will read a paper as a factual news report, rather than reading it as a specific interpretation of a problem/algorithm/theorem that could lend itself to multiple interpretations (just a few days ago I was discussing some JL results with a colleague and realized that we had totally distinct and valid interpretations of some recent work in the area).

Reviewing: It's very easy (and common) to read a paper, understand the proofs, and then be completely underwhelmed by it: the dreaded "too simple" feeling that leads papers to get rejected from unnamed theory conferences. This is especially true when you're not familiar with an area, and don't realize that it was the choices the author(s) made that made the paper look simple. And so your assessment has to factor that choice in, rather than taking it for granted.

Writing/Presenting: Of course all of this impacts how you as a creator choose to present your work. There is not one way to tell a story (in writing or in a talk). But you do make choices about how to tell it. And so it's important to make those choices visible (either by explaining why they are needed, or why different choices don't work, or how they get around certain roadblocks) so that your work can receive a fair evaluation.

This can be excruciating, especially when (like many good papers) your work admits multiple interpretations, and you have to gamble on the right one to please the fickle and time-stretched reviewer. But be mindful of these choices in your work.

p.s On a separate note, it's intriguing to me how so many tools from the study of literature (narrative, contextual analysis, critical reading) show up in other forms of writing far removed from the humanities. This is yet another reason why I'm saddened (even though I'm on the "right" team) by the relentless focus on STEM in opposition to the liberal arts. It's a particularly virulent form of divide-and-conquer (the British colonial technique, not the algorithmic paradigm).

p.p.s Has it really been three months since I wrote anything ? I must be working !!
