Visualizing Friendships

Data,Statistics,Visualizations — Zac Townsend @ December 14, 2010 10:57 am

An intern at Facebook has created a world map that visualizes the connections in the social graph:

Facebook World Map of Relationships

What's fascinating, though, is how he did it:

I began by taking a sample of about ten million pairs of friends from Apache Hive, our data warehouse. I combined that data with each user's current city and summed the number of friends between each pair of cities. Then I merged the data with the longitude and latitude of each city.

At that point, I began exploring it in R, an open-source statistics environment. As a sanity check, I plotted points at some of the latitude and longitude coordinates. To my relief, what I saw was roughly an outline of the world. Next I erased the dots and plotted lines between the points. After a few minutes of rendering, a big white blob appeared in the center of the map. Some of the outer edges of the blob vaguely resembled the continents, but it was clear that I had too much data to get interesting results just by drawing lines. I thought that making the lines semi-transparent would do the trick, but I quickly realized that my graphing environment couldn't handle enough shades of color for it to work the way I wanted.

Instead I found a way to simulate the effect I wanted. I defined weights for each pair of cities as a function of the Euclidean distance between them and the number of friends between them. Then I plotted lines between the pairs by weight, so that pairs of cities with the most friendships between them were drawn on top of the others. I used a color ramp from black to blue to white, with each line's color depending on its weight. I also transformed some of the lines to wrap around the image, rather than spanning more than halfway around the world.
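
The exact weighting function isn't public, but the drawing-order and color-ramp tricks are easy to sketch in base R. Here's a minimal, made-up version, assuming a data frame of city pairs with columns lon1, lat1, lon2, lat2, and friends (all invented names) and an invented weight function:

```r
# A minimal sketch of the weight-ordered, color-ramped line plot in base R.
# `pairs` is assumed to have columns lon1, lat1, lon2, lat2, friends.
plot_friend_map <- function(pairs) {
  # Invented weight: more friends raises the weight, distance discounts it.
  dist <- sqrt((pairs$lon1 - pairs$lon2)^2 + (pairs$lat1 - pairs$lat2)^2)
  w <- log(pairs$friends + 1) / (1 + dist)

  # Draw low-weight pairs first so the strongest connections land on top.
  ord <- order(w)
  pairs <- pairs[ord, ]
  w <- w[ord]

  # Color ramp from black through blue to white, indexed by weight.
  ramp <- colorRampPalette(c("black", "blue", "white"))(100)
  cols <- ramp[cut(w, breaks = 100, labels = FALSE)]

  plot(0, 0, type = "n", xlim = c(-180, 180), ylim = c(-90, 90),
       xlab = "", ylab = "", axes = FALSE)
  segments(pairs$lon1, pairs$lat1, pairs$lon2, pairs$lat2,
           col = cols, lwd = 0.2)
}
```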

What If We Tested Laws Before Passing Them?

Social Science,Statistics — Zac Townsend @ December 13, 2010 2:16 pm

An interesting article in the Boston Globe today on whether we should use randomized trials to test laws before they are passed.

There are certainly potential problems with this vision. First is the question of effectiveness: In some cases, it may prove too difficult to run an accurate test. The full repercussions of laws often take years to manifest themselves, and small-scale experiments do not always translate well to larger settings. Also at issue is fairness. Americans expect to be treated equally under the law, and this approach, by definition, entails disparate treatment.

“The problem is, we’re dealing with laws that have a huge impact on people’s lives,” says Barry Friedman, a law professor at New York University. “These aren’t casual tests. It’s not, you try Tide or you try laundry detergent X....Here we’re talking about basic benefits and fundamental rights.” Though Friedman is sympathetic to the goal of gaining better empirical knowledge, he says, “My guess is some of it’s doable in some contexts, and a lot of it’s not doable in other contexts.”

But others are more sanguine, and they make the opposite argument: That precisely because the stakes are so high, the laws that we enact on a large-scale, long-term basis must be more rigorously tested. This wave of thinking is part of a broader trend in fields from health care to education: Our practices should be “evidence-based,” rather than deriving from theories and unproven assumptions. The question is whether this kind of scientific approach can successfully take on a project as unruly as our society — and our politics.

From my earlier post, I think it is clear that I fall into the "the stakes are so high, let's test" group.

Learning A New Statistical Method: Bayesian Additive Regression Trees

Social Science,Statistics — Zac Townsend @ December 13, 2010 1:53 pm

I may do some work for Jennifer Hill, an applied statistics professor at NYU's Steinhardt School. If I go the PhD route, a career like hers is exactly what I'm interested in: she got her doctorate in statistics, focuses on applications to social science, and works on interesting causal inference problems.

This last weekend I read a paper she sent me on Bayesian Additive Regression Trees (BART), which is quite interesting. The article, "Bayesian Nonparametric Modeling for Causal Inference," is coming out this January in the Journal of Computational and Graphical Statistics. The abstract:

Researchers have long struggled to identify causal effects in nonexperimental settings. Many recently proposed strategies assume ignorability of the treatment assignment mechanism and require fitting two models—one for the assignment mechanism and one for the response surface. This article proposes a strategy that instead focuses on very flexibly modeling just the response surface using a Bayesian nonparametric modeling procedure, Bayesian Additive Regression Trees (BART). BART has several advantages: it is far simpler to use than many recent competitors, requires less guesswork in model fitting, handles a large number of predictors, yields coherent uncertainty intervals, and fluidly handles continuous treatment variables and missing data for the outcome variable. BART also naturally identifies heterogeneous treatment effects. BART produces more accurate estimates of average treatment effects compared to propensity score matching, propensity-weighted estimators, and regression adjustment in the nonlinear simulation situations examined. Further, it is highly competitive in linear settings with the “correct” model, linear regression. Supplemental materials including code and data to replicate simulations and examples from the article as well as methods for population inference are available online.

(This is perhaps more for me than for any reader.) Basically, when using some methods to improve causal inference, such as matching, you're often fitting two models: one for whether or not a unit was treated, and then the more easily (or commonly) understood "response surface," which is the model for the outcome conditional on treatment and all the confounders. BART is a method to estimate the response surface non-parametrically, while being (it appears) at least as robust as other methods.
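
To make the two-model setup concrete, here is a generic illustration on simulated data (not the paper's method): one model for the assignment mechanism and one for the response surface.

```r
# Toy illustration of the two models, on simulated data.
set.seed(1)
n <- 500
x <- rnorm(n)                        # a single confounder
z <- rbinom(n, 1, plogis(0.8 * x))   # treatment assignment depends on x
y <- 2 * z + x + rnorm(n)            # outcome depends on treatment and x

# Model 1: the assignment mechanism (a propensity score model).
assignment_model <- glm(z ~ x, family = binomial)

# Model 2: the response surface, the outcome given treatment and confounders.
response_surface <- lm(y ~ z + x)
summary(response_surface)
```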

When trying to figure out how effective a treatment of some kind is, you cannot observe both the outcome when an individual receives the treatment, Y_i(1), and the outcome when that individual does not, Y_i(0). A fancy way of saying that is Y_i = Y_i(1)Z_i + Y_i(0)(1 - Z_i), where Z_i is an indicator of whether or not unit i got the treatment. That equation says that if you got the treatment, the second term on the right side of the equals sign is zero, and in the alternative case, the first term is zero.
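
A toy example with three units (numbers invented) makes the bookkeeping explicit:

```r
# Three units, each with two potential outcomes; only the one matching
# the treatment indicator is ever observed.
y1 <- c(5, 7, 6)                # Y_i(1), outcome if treated
y0 <- c(3, 4, 6)                # Y_i(0), outcome if untreated
z  <- c(1, 0, 1)                # Z_i, treatment indicator
y_obs <- y1 * z + y0 * (1 - z)  # observed outcomes: 5, 4, 6
```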

When doing causal inference, you want to compare two groups, one that received the treatment and one that did not, that are as similar as possible. That is, the only difference between the comparison groups is that one got the treatment and the other didn't. In this way, you can be sure that any observed difference between the groups is due to the treatment. This idea is formalized through the term ignorability. That is, if the two groups cannot be distinguished on all the observable characteristics (they have "balance"), the assignment to the treatment group is ignorable. (More formally, the potential outcomes are independent of treatment assignment given the covariates, or Y(0), Y(1) \perp\!\!\!\perp Z | X, where X are the confounders and \perp\!\!\!\perp means conditional independence.) Ignorability also requires overlap, or common support, in the covariates across the two groups.

So, in the end, with ignorability we're left to estimate E[Y(1)|X] = E[Y|X, Z=1] and E[Y(0)|X] = E[Y|X, Z=0]. Unfortunately, this estimation can be very difficult if the treatment outcomes are not linearly related to the covariates, if the distributions of the covariates differ across the two groups, or, as is often the case in a world with ever-increasing data, if there are tons of confounding covariates or (and this happens all the time) you really don't know which of them are needed to satisfy ignorability. A bunch of methods have been proposed to address this estimation problem (see the paper for a ton of citations), but the BART method, as I mentioned earlier, is different because it "focuses solely on precise estimation of the response surface." Also, part of BART's advantage is that it doesn't require as many researcher choices:

Nonparametric and semiparametric versions of these [other cited] methods are more robust but require a higher level of researcher sophistication to understand and implement (e.g., to specify smoothing parameters such as number of terms in a series estimator or bandwidth for a kernel estimator). This article proposes that the benefits of the BART strategy in terms of simplicity, precision, robustness, and lack of required researcher interference outweigh the potential benefit of having an estimator that is strictly consistent under certain sets of conditions.
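
To sketch what that looks like in practice: fit one flexible model for E[Y|X, Z], ask it to predict every unit's outcome with Z set to 1 and again with Z set to 0, and average the differences. Here's a rough version on simulated data, assuming the interface of the BayesTree package's bart() function (a sketch, not the paper's replication code):

```r
# Sketch: estimate the response surface with BART, then predict both
# counterfactuals for every unit. Assumes BayesTree's bart() interface.
library(BayesTree)

set.seed(1)
n <- 500
x <- matrix(rnorm(n * 3), n, 3)              # confounders
z <- rbinom(n, 1, plogis(x[, 1]))            # treatment depends on x
y <- 2 * z + x[, 1]^2 + x[, 2] + rnorm(n)    # nonlinear response surface

# One flexible model for E[Y | X, Z]; predictions requested for every
# unit under Z = 1 and again under Z = 0.
train <- cbind(z = z, x)
test  <- rbind(cbind(z = 1, x), cbind(z = 0, x))
fit   <- bart(x.train = train, y.train = y, x.test = test)

# yhat.test holds posterior draws for each test row; average the draws,
# then average the unit-level differences for an ATE estimate.
pred    <- colMeans(fit$yhat.test)
ate_hat <- mean(pred[1:n] - pred[(n + 1):(2 * n)])
ate_hat
```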

I think I'll save a careful description of the trees themselves for a later post, even though that is most of the paper. Basically, though, BART is a sum-of-trees model that uses a set of binary trees to split up the observations on the confounders. What's most fascinating is that BART is defined as a statistical model, with a prior put on its parameters, which is quite different from the other learning/mining models I've learned about. For those happy few who might be interested, BART is described in even greater detail in "BART: Bayesian additive regression trees." Abstract:

We develop a Bayesian “sum-of-trees” model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior. Effectively, BART is a nonparametric Bayesian regression approach which uses dimensionally adaptive random basis elements. Motivated by ensemble methods in general, and boosting algorithms in particular, BART is defined by a statistical model: a prior and a likelihood. This approach enables full posterior inference including point and interval estimates of the unknown regression function as well as the marginal effects of potential predictors. By keeping track of predictor inclusion frequencies, BART can also be used for model-free variable selection. BART’s many features are illustrated with a bake-off against competing methods on 42 different data sets, with a simulation experiment and on a drug discovery classification problem.

Testing Housing Aid

New York City,Social Science,Statistics — Zac Townsend @ December 12, 2010 5:27 pm

New York City is randomizing who gets a housing aid program called Homebase:

It has long been the standard practice in medical testing: Give drug treatment to one group while another, the control group, goes without.

Now, New York City is applying the same methodology to assess one of its programs to prevent homelessness. Half of the test subjects — people who are behind on rent and in danger of being evicted — are being denied assistance from the program for two years, with researchers tracking them to see if they end up homeless.

The city’s Department of Homeless Services said the study was necessary to determine whether the $23 million program, called Homebase, helped the people for whom it was intended. Homebase, begun in 2004, offers job training, counseling services and emergency money to help people stay in their homes.

But some public officials and legal aid groups have denounced the study as unethical and cruel, and have called on the city to stop the study and to grant help to all the test subjects who had been denied assistance.

“They should immediately stop this experiment,” said the Manhattan borough president, Scott M. Stringer. “The city shouldn’t be making guinea pigs out of its most vulnerable.”

On a listserv I'm on, there has been a lot of ethical handwringing about this program, but these people weren't randomly assigned to poverty. They were randomly assigned not to receive a program.

If you agree with Stringer that citizens shouldn't be treated like lab rats, then the conclusion should be that they should receive no treatment. We have no idea whether this program is effective or not. We have no idea whether enrolling people in this program might, in the long term, increase the time they spend homeless. We have no idea if the program leads to more crime or less. We have no idea if the program does anything. So if you're not interested in throwing people into some unproven, untested, possibly ill-designed program at politicians' whims, the only option is to stop the intervention altogether.

Alternatively, perhaps we can test the program. We can see if the program is effective. We can learn whether the program meets its goals. Not necessarily on a cost-benefit basis, but at all. By any standard. To do that we turn to the randomized experiment.

Now, what is experimentation? In an ideal multiverse we could take the exact same people and give them the intervention in one case, and not give them the intervention in the other. Then we could observe the difference and know that it was due to the Homebase program.

Absent that, we have only one tool at our disposal that gets at causal inference with almost no exceptions, and that is the well-designed randomized experiment. (Note all the caveats: the randomized experiment is the gold standard, and there are SO many statistical and design tools for turning quasi-experiments and correlational studies into something approaching the ideal that NYC is implementing.)

To do this you find two groups as alike as possible and you compare them. You give one of them the intervention, and you don't give it to the other group. You can't just give the program to as many people as apply and then pick some other group of people as a comparison: applying is, itself, a factor you want to be equal across the groups. That's why in randomized experiments you tend to look for twice as many people as you can enroll, randomly enroll half of them, and then collect data on both groups.

A large number of families (1,500) are denied due to lack of funding. Another way to think of the study is that there are 1,700 families who would be rejected, and we found money to serve 200 of them. What is the best way to pick those people? The answer, to me, is a lottery. So 200 of those 1,700 families are assigned the intervention, and we randomly study another 200 of them. These two groups--all people who applied to the program--we can assume are basically similar (have something called "balance") across all observable and unobservable characteristics (we can measure the former and assume the latter).

Now I'm masking a bunch of statistics that show that random assignment leads to balance, on average, but whatever. The point is that we're creating a counterfactual: we compare people who applied to the program and got the intervention with people who applied and didn't. The selection was done by lottery--not by some other method such as whom you're best friends with, or whether your name sounds right, or whatever. Doesn't that seem like a just way to assign spots in a program?
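
Here's a quick simulation of the lottery design in R, using the post's numbers and an invented covariate, just to show what "balance, on average" looks like:

```r
# Simulated lottery: 1,700 applicant families, 200 program slots, plus
# 200 randomly chosen comparison families. The covariate is invented.
set.seed(1)
applicants <- data.frame(id = 1:1700,
                         months_behind_on_rent = rpois(1700, 3))

drawn   <- sample(applicants$id, 400)             # the lottery
treated <- subset(applicants, id %in% drawn[1:200])
control <- subset(applicants, id %in% drawn[201:400])

# Balance check: on average, the groups should look alike on observables.
t.test(treated$months_behind_on_rent, control$months_behind_on_rent)
```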