To Attract the Next Google, New York City Seeks to Create an Engineering School

Zac Townsend @ 2:03 pm on December, 16

On Thursday morning, Robert K. Steel, the deputy mayor for economic development, announced that the city would seek a “top caliber academic institution” as a partner in building a school for applied science and engineering. The city is willing to consider locating it on one or more of its properties, including the old hospital campuses at the Brooklyn Navy Yard and on Roosevelt Island.

I’m not entirely sure what that means. Are they looking for a new school to form a campus in the city, or for an already existing university to expand? Are they willing to give land/money to Columbia’s Fu, NYU-Poly or Cooper Union or do they want MIT to have a satellite campus? The NYT article continues:

“New York has had some of the best science in the world for years, and it hasn’t translated into a first-rate center for technology start-ups the way it has elsewhere,” said Jonathan Bowles, director of the Center for an Urban Future. “It’s a mistake to think that any other region can become the next Silicon Valley, but New York can and should develop more of a technology presence than it has now.”

New Advice On How to Survive Nuclear Attack

Zac Townsend @ 8:58 am on December, 16

A new “scientific analysis” shows that if a nuclear attack happens near you, you shouldn’t flee. Your chances of survival go up a great deal if you stay inside some structure (even a car, but preferably something like a basement): “a nuclear attack is much more survivable if you immediately shield yourself from the lethal radiation that follows a blast, a simple tactic seen as saving hundreds of thousands of lives.” The Times’ reporting on the results continues:

The results were revealing. For instance, the scientists found that a bomb’s flash would blind many drivers, causing accidents and complicating evacuation.

The big surprise was how taking shelter for as little as several hours made a huge difference in survival rates.

“This has been a game changer,” Brooke Buddemeier, a Livermore health physicist, told a Los Angeles conference. He showed a slide labeled “How Many Lives Can Sheltering Save?”

If people in Los Angeles a mile or more from ground zero of an attack took no shelter, Mr. Buddemeier said, there would be 285,000 casualties from fallout in that region.

Taking shelter in a place with minimal protection, like a car, would cut that figure to 125,000 deaths or injuries, he said. A shallow basement would further reduce it to 45,000 casualties. And the core of a big office building or an underground garage would provide the best shelter of all.

Trade Deficits A Little Bit Skewed

Zac Townsend @ 2:16 pm on December, 15

I think it’s obvious from many of my posts that I am interested in statistics and data analysis. I’m also interested in the failure of unexamined metrics. Ethan, an old friend of mine from Brown, shared WSJ’s Tech Supply Chain Exposes Limits of Trade Metrics, which notes that Apple’s iPhone, as it is produced in China, adds to the US trade deficit. I think the director-general of the World Trade Organization explains the failure of that stat best:

“What we call ‘Made in China’ is indeed assembled in China, but what makes up the commercial value of the product comes from the numerous countries that preceded its assembly in China in the global value chain,” Pascal Lamy, director-general of the World Trade Organization, said in a speech in October. “The concept of country of origin for manufactured goods has gradually become obsolete.” Mr. Lamy said that if trade statistics were adjusted to reflect the actual value contributed to a product by different countries, the size of the U.S. trade deficit with China—$226.88 billion, according to U.S. figures —would be cut in half. That means, he argued, that political tensions over trade deficits are probably larger than they should be. “The statistical bias created by attributing the full commercial value to the last country of origin can pervert the political debate on the origin of the imbalances and lead to misguided, and hence counterproductive, decisions,” Mr. Lamy said in his speech to the French Senate in Paris.
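To make Mr. Lamy’s point concrete, here is a minimal sketch of the two ways of booking one exported device. All the dollar figures and the "Supplier" labels are made up for illustration; none of them come from the WSJ article or from WTO data.

```python
# Hypothetical value contributed to one exported device, in USD,
# keyed by the country that added it. Numbers are invented.
value_added = {
    "Supplier A": 60.0,
    "Supplier B": 30.0,
    "Supplier C": 25.0,
    "United States": 10.0,
    "China": 7.0,      # final assembly only
    "Other": 48.0,
}

def gross_attribution(value_added, last_country):
    """Conventional statistics: the full value is booked to the last country."""
    total = sum(value_added.values())
    return {c: (total if c == last_country else 0.0) for c in value_added}

def value_added_attribution(value_added, importer):
    """Value-added view: each country is booked only what it contributed.
    The importer's own contribution is not counted as an import."""
    return {c: (0.0 if c == importer else v) for c, v in value_added.items()}

gross = gross_attribution(value_added, "China")
adjusted = value_added_attribution(value_added, "United States")

print(gross["China"])     # 180.0
print(adjusted["China"])  # 7.0
```

Under the conventional method the whole 180 lands on China; under the value-added view China is booked only for its assembly step, and the rest is spread across the countries earlier in the chain.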

Journalism in the Age of Data

Zac Townsend @ 9:20 am on December, 15

Spencer sent me a very cool site yesterday called Journalism in the Age of Data. The main content is a 54-minute video report on “data visualization as a storytelling medium.” It has a great interface that contains a lot of extra content. There is a really interesting thread in the video about how hard it is to make good visualizations, and about the proliferation of bad and confusing ones. The “key points”:

The explosion of data has brought a complementary need for tools to analyze it
Researchers in visualization are helping by building tools for non-experts
Journalists are finding ways to adapt to the challenge of telling stories with data
With experience in charting data, infographics designers are well suited to bring data vis to journalism, but they debate how effective it is at explaining concepts
In a wired world, data is increasingly becoming a medium of personal expression
Data will increasingly arrive in real time, challenging our ability to absorb, analyze and display it
Technologies for creating online visualizations are in transition, but there are new tools coming out that will make the process easier
Data analysis is at least as important as visually displaying it; there are tools that help with this process

Some cool visualizations I saw in the video:
Budget Forecasts, Compared With Reality
The Crisis of Credit Visualized
San Francisco Crimespotting

And, a reference to a very cool paper on Narrative Visualization: Telling Stories with Data and a very cool JS library: Protovis.

How New York’s Racial Makeup Has Changed Since 2000

Zac Townsend @ 10:28 pm on December, 14

The Times has visualized the change in the ethnic breakdown of the City by census tract. Here is the map for Black New Yorkers:

Black New Yorkers Movement (map and key)

The map text from the Times:

Canarsie, Brooklyn, had one of the greatest increases in its share of black residents in 2009 (to 81% from 67%), while recently gentrified neighborhoods like Prospect Heights, Clinton Hill and Fort Greene saw double-digit decreases.

See the rest of the maps here.

There is also an accompanying article Region Is Reshaped as Minorities Go to Suburbs:

Metropolitan New York is being rapidly reshaped as blacks, Latinos, Asians and immigrants surge into the suburbs, while gentrification by whites is widening the income gap in neighborhoods in Manhattan and Brooklyn, according to new census figures released on Tuesday.

Jon Stewart On The GOP For Blocking Health Care For 9/11 First Responders

Zac Townsend @ 2:14 pm on December, 14

“Here’s a tribute to a few Republican senators who find comfort and advantage in invoking the heroes of 9/11 but refuse to give them health care:”

The Daily Show With Jon Stewart: Lame-as-F@#k Congress
www.thedailyshow.com

Gawker Passwords

Zac Townsend @ 1:02 pm on December, 14

This weekend the Gawker network of blogs was hacked, and a bunch of user passwords were compromised. For an interesting analysis of the hack, see this post at Coding Horror. What I found most interesting, though, was a Wall Street Journal article on the passwords exposed in the hack:

On Sunday night, hackers posted online a trove of data from Gawker Media’s servers, including the usernames, email addresses and passwords of more than one million registered users. The passwords were originally encrypted, but 188,279 of them were decoded and made public as part of the hack.

Then, using that dataset, the WSJ found the 50 most-popular Gawker Media passwords and made this interesting graph:
The Top 50 Gawker Media Passwords
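The counting step behind a chart like that is simple. Here is a minimal sketch, assuming the decoded passwords are already in a plain Python list; the sample list below is invented and is not the leaked data.

```python
from collections import Counter

# Made-up sample standing in for the decoded passwords.
passwords = ["123456", "password", "123456", "qwerty",
             "password", "123456", "letmein"]

def most_common_passwords(passwords, n=50):
    """Return the n most frequent passwords with their counts, most frequent first."""
    return Counter(passwords).most_common(n)

for pw, count in most_common_passwords(passwords, n=2):
    print(pw, count)
# 123456 3
# password 2
```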

Visualizing Friendships

Zac Townsend @ 10:57 am on December, 14

An intern at Facebook has created a world map that visualizes the connections in the social graph:

Facebook World Map of Relationships

What’s fascinating, though, is how he did it:

I began by taking a sample of about ten million pairs of friends from Apache Hive, our data warehouse. I combined that data with each user’s current city and summed the number of friends between each pair of cities. Then I merged the data with the longitude and latitude of each city.

At that point, I began exploring it in R, an open-source statistics environment. As a sanity check, I plotted points at some of the latitude and longitude coordinates. To my relief, what I saw was roughly an outline of the world. Next I erased the dots and plotted lines between the points. After a few minutes of rendering, a big white blob appeared in the center of the map. Some of the outer edges of the blob vaguely resembled the continents, but it was clear that I had too much data to get interesting results just by drawing lines. I thought that making the lines semi-transparent would do the trick, but I quickly realized that my graphing environment couldn’t handle enough shades of color for it to work the way I wanted.

Instead I found a way to simulate the effect I wanted. I defined weights for each pair of cities as a function of the Euclidean distance between them and the number of friends between them. Then I plotted lines between the pairs by weight, so that pairs of cities with the most friendships between them were drawn on top of the others. I used a color ramp from black to blue to white, with each line’s color depending on its weight. I also transformed some of the lines to wrap around the image, rather than spanning more than halfway around the world.
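The pipeline described above can be sketched in a few lines. The original work was done in R; this is a Python sketch under stated assumptions. The city names, coordinates, and friend pairs are invented, and the `weight` function is a guess, since the post only says the weight depends on distance and number of friends. It stops at computing draw order and colors, and does not render anything or handle the wrap-around lines.

```python
import math
from collections import Counter

# Hypothetical inputs: friend pairs already mapped to each user's city,
# plus (longitude, latitude) per city.
friend_pairs = [("NYC", "London"), ("London", "NYC"), ("NYC", "Tokyo"),
                ("London", "Tokyo"), ("NYC", "London")]
coords = {"NYC": (-74.0, 40.7), "London": (-0.1, 51.5), "Tokyo": (139.7, 35.7)}

def count_city_pairs(pairs):
    """Sum friendships per unordered pair of distinct cities."""
    counts = Counter()
    for a, b in pairs:
        if a != b:
            counts[tuple(sorted((a, b)))] += 1
    return counts

def weight(n_friends, a, b):
    """ASSUMED formula: more friends and a shorter distance give a larger weight."""
    (x1, y1), (x2, y2) = coords[a], coords[b]
    dist = math.hypot(x2 - x1, y2 - y1)  # Euclidean distance in degrees
    return n_friends / (1.0 + dist)

def ramp(t):
    """Color ramp black -> blue -> white for t in [0, 1], as an (r, g, b) tuple."""
    t = min(max(t, 0.0), 1.0)
    if t < 0.5:
        return (0.0, 0.0, 2 * t)   # black to blue
    s = 2 * (t - 0.5)
    return (s, s, 1.0)             # blue to white

counts = count_city_pairs(friend_pairs)
weighted = {pair: weight(n, *pair) for pair, n in counts.items()}
top = max(weighted.values())

# Lightest pairs first, so the heaviest pairs are drawn on top.
draw_order = sorted(weighted, key=weighted.get)
lines = [(pair, ramp(weighted[pair] / top)) for pair in draw_order]
```

Sorting by weight before drawing is what stands in for transparency here: the strongest connections are drawn last and in the brightest color.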

What If We Tested Laws Before Passing Them?

Zac Townsend @ 2:16 pm on December, 13

An interesting article in the Boston Globe today on whether we should use randomized trials to test laws before they are passed.

There are certainly potential problems with this vision. First is the question of effectiveness: In some cases, it may prove too difficult to run an accurate test. The full repercussions of laws often take years to manifest themselves, and small-scale experiments do not always translate well to larger settings. Also at issue is fairness. Americans expect to be treated equally under the law, and this approach, by definition, entails disparate treatment.

“The problem is, we’re dealing with laws that have a huge impact on people’s lives,” says Barry Friedman, a law professor at New York University. “These aren’t casual tests. It’s not, you try Tide or you try laundry detergent X….Here we’re talking about basic benefits and fundamental rights.” Though Friedman is sympathetic to the goal of gaining better empirical knowledge, he says, “My guess is some of it’s doable in some contexts, and a lot of it’s not doable in other contexts.”

But others are more sanguine, and they make the opposite argument: That precisely because the stakes are so high, the laws that we enact on a large-scale, long-term basis must be more rigorously tested. This wave of thinking is part of a broader trend in fields from health care to education: Our practices should be “evidence-based,” rather than deriving from theories and unproven assumptions. The question is whether this kind of scientific approach can successfully take on a project as unruly as our society — and our politics.

From my earlier post, I think it is clear that I fall in the “the stakes are so high, so let’s test” group.

Learning A New Statistical Method: Bayesian Additive Regression Trees

Zac Townsend @ 1:53 pm on December, 13

I may do some work for Jennifer Hill, an applied statistics professor at NYU’s Steinhardt School. A career like hers is something I’d be very interested in if I go the PhD route: getting a doctorate in statistics, focusing on applications to social science, and working on interesting causal inference problems.

This last weekend I read a paper she sent me on Bayesian Additive Regression Trees (BART), which is quite interesting. The article, Bayesian Nonparametric Modeling for Causal Inference, is coming out this January in the Journal of Computational and Graphical Statistics. The abstract:

Researchers have long struggled to identify causal effects in nonexperimental settings. Many recently proposed strategies assume ignorability of the treatment assignment mechanism and require fitting two models—one for the assignment mechanism and one for the response surface. This article proposes a strategy that instead focuses on very flexibly modeling just the response surface using a Bayesian nonparametric modeling procedure, Bayesian Additive Regression Trees (BART). BART has several advantages: it is far simpler to use than many recent competitors, requires less guesswork in model fitting, handles a large number of predictors, yields coherent uncertainty intervals, and fluidly handles continuous treatment variables and missing data for the outcome variable. BART also naturally identifies heterogeneous treatment effects. BART produces more accurate estimates of average treatment effects compared to propensity score matching, propensity-weighted estimators, and regression adjustment in the nonlinear simulation situations examined. Further, it is highly competitive in linear settings with the “correct” model, linear regression. Supplemental materials including code and data to replicate simulations and examples from the article as well as methods for population inference are available online.

(This is perhaps more for me than for any reader.) Basically, when using some methods to improve causal inference, such as matching, you’re often fitting two models: one for whether or not a unit was treated, and then the more easily (or commonly) understood “response surface,” which is the model for the outcome conditional on treatment and all the confounders. BART is a method to estimate the response surface non-parametrically, while being (it appears) as robust as or more robust than other methods.

When trying to figure out how effective a treatment of some kind is, you cannot observe, for the same individual, both the outcome when they receive the treatment, Y_i(1), and the outcome when they do not, Y_i(0). A fancy way of saying that is Y_i=Y_i(1)Z_i+Y_i(0)(1-Z_i), where Z_i is an indicator of whether you have (Z_i=1) or have not (Z_i=0) gotten the treatment. So that equation is saying that if you got the treatment, the second term on the right side of the equal sign is zero, and in the alternative case, the first term is zero.
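As a quick check on that equation, here is a tiny sketch. The unit and its two potential outcomes (12 and 9) are made up; in real data you would only ever see one of them.

```python
def observed_outcome(y1, y0, z):
    """Y_i = Y_i(1) * Z_i + Y_i(0) * (1 - Z_i), with z in {0, 1}."""
    return y1 * z + y0 * (1 - z)

# Hypothetical unit: outcome 12 if treated, 9 if untreated.
y1, y0 = 12, 9

print(observed_outcome(y1, y0, z=1))  # 12, the untreated term drops out
print(observed_outcome(y1, y0, z=0))  # 9, the treated term drops out
```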

When doing causal inference, you want to compare two groups, one that received the treatment and one that did not, that are as similar as possible. That is, the only difference in the comparison groups is that one got the treatment and the other didn’t. In this way, you can be sure that any observed difference in the groups is due to the treatment. This idea is formalized through the term ignorability. That is, if the two groups cannot be distinguished on all the observable characteristics (they have “balance”), the assignment to the treatment group is ignorable. (More formally, the potential outcomes are independent of treatment assignment, given the covariates, or Y(0),Y(1) \perp\!\!\!\perp Z | X, where X are the confounders and \perp\!\!\!\perp denotes independence, here conditional on X.) Ignorability also requires overlap or common support in the covariates across the two groups.
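A small simulation can show what ignorability given X buys you. This is a sketch with invented parameters: one binary confounder x, a true treatment effect of 2.0, and treatment probabilities of 0.8 and 0.2 depending on x. It is not from the paper, and it uses plain stratification rather than BART.

```python
import random

def simulate(n=20000, effect=2.0, seed=0):
    """Units with one binary confounder x that raises both the chance of
    treatment and the outcome. Treatment is random *given* x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p_treat = 0.8 if x == 1 else 0.2     # assignment depends only on x
        z = 1 if rng.random() < p_treat else 0
        y0 = 5.0 * x + rng.gauss(0.0, 1.0)   # potential outcome, untreated
        y1 = y0 + effect                     # potential outcome, treated
        rows.append((x, z, y1 * z + y0 * (1 - z)))
    return rows

def mean(values):
    return sum(values) / len(values)

def naive_difference(rows):
    """Difference in mean outcomes between treated and untreated, ignoring x."""
    return (mean([y for _, z, y in rows if z == 1])
            - mean([y for _, z, y in rows if z == 0]))

def stratified_difference(rows):
    """Difference within each level of x, averaged by the share of units at that level."""
    total = 0.0
    for level in (0, 1):
        stratum = [r for r in rows if r[0] == level]
        total += naive_difference(stratum) * len(stratum) / len(rows)
    return total

rows = simulate()
print("naive:", round(naive_difference(rows), 2))
print("stratified:", round(stratified_difference(rows), 2))
```

The naive comparison mixes the effect of the treatment with the effect of x, so it should land far from 2.0, while comparing within levels of x should land close to it. Both strata contain treated and untreated units, which is the overlap condition in miniature.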

So, in the end, with ignorability, we’re left to estimate E[Y(1)|X]=E[Y|X,Z=1] and E[Y(0)|X]=E[Y|X,Z=0]. Unfortunately, this estimation can be very difficult if the treatment outcomes are not linearly related to the covariates, the distributions of the covariates differ across the two groups, or, as is often the case in a world with increasing data, there are tons of confounding covariates or (and this happens all the time) you really don’t know which of them are needed to satisfy ignorability. A bunch of methods have been proposed to address this estimation problem (see the paper for a ton of citations), but the BART method, as I mentioned earlier, is different because it “focuses solely on precise estimation of the response surface.” Also, part of BART’s advantage is that it doesn’t require as many researcher choices:

Nonparametric and semiparametric versions of these [other cited] methods are more robust but require a higher level of researcher sophistication to understand and implement (e.g., to specify smoothing parameters such as number of terms in a series estimator or bandwidth for a kernel estimator). This article proposes that the benefits of the BART strategy in terms of simplicity, precision, robustness, and lack of required researcher interference outweigh the potential benefit of having an estimator that is strictly consistent under certain sets of conditions.

I think I’ll save a careful description of the trees themselves for a later post, even though that is most of the paper. Basically, though, BART is a sum-of-trees model that uses a set of binary trees to split up the observations on the confounders. What’s most fascinating, though, is that BART is defined as a statistical model, with a prior put on the parameters, which is quite different from the other learning/mining models I’ve learned about. For those happy few who might be interested, BART is described in even greater detail in “BART: Bayesian additive regression trees.” Abstract:

We develop a Bayesian “sum-of-trees” model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior. Effectively, BART is a nonparametric Bayesian regression approach which uses dimensionally adaptive random basis elements. Motivated by ensemble methods in general, and boosting algorithms in particular, BART is defined by a statistical model: a prior and a likelihood. This approach enables full posterior inference including point and interval estimates of the unknown regression function as well as the marginal effects of potential predictors. By keeping track of predictor inclusion frequencies, BART can also be used for model-free variable selection. BART’s many features are illustrated with a bake-off against competing methods on 42 different data sets, with a simulation experiment and on a drug discovery classification problem.
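To give a feel for the “sum-of-trees” form only, here is a minimal sketch. The three trees, the feature names (`age`, `income`), and the leaf values are hand-picked for illustration. Real BART does not pick trees by hand; it samples the trees and their leaf values from a posterior via MCMC, which this sketch does not attempt.

```python
def stump(feature, threshold, left_value, right_value):
    """A one-split binary tree: left_value if x[feature] < threshold, else right_value."""
    def predict(x):
        return left_value if x[feature] < threshold else right_value
    return predict

# Hand-picked trees with small leaf values, so each one is a "weak learner"
# that contributes only a little to the total.
trees = [
    stump("age", 30, -1.0, 1.0),
    stump("income", 50, -0.5, 0.5),
    stump("age", 60, 0.0, 2.0),
]

def sum_of_trees(trees, x):
    """The prediction is the sum of each tree's leaf value for x."""
    return sum(tree(x) for tree in trees)

print(sum_of_trees(trees, {"age": 45, "income": 40}))  # 0.5
```

For the unit above, the three trees contribute 1.0, -0.5, and 0.0. The regularization prior in the abstract plays the role of keeping each tree’s contribution small like this.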

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2012 The Forward Lean | powered by WordPress with Barecity