Microtargeting in the Election

Data,Politics — Zac Townsend @ October 29, 2012 7:05 pm

I try to avoid sending out political stories that may appear overly partisan, but having read Sasha Issenberg's book, The Victory Lab, I know that he reports carefully on data use by both Democrats and Republicans (I mailed the book to our resident former Republican congressional candidate, Ethan Wingfield). This is a great story on the use of randomized experiments in voter contact, persuasion, and turnout, and the seeming advantage that Democrats have on this front:

In fact, when it comes to the use of voter data and analytics, the two sides appear to be as unmatched as they have ever been on a specific electioneering tactic in the modern campaign era. No party has ever had such a durable structural advantage over the other on polling, making television ads, or fundraising, for example. And the reason may be that the most important developments in how to analyze voter behavior have not emerged from within the political profession.

The Interesting Math Behind Congressional Reapportionment

Data,Politics — Zac Townsend @ December 22, 2010 12:31 pm

Computational Complexity has a short blog post on the algorithm used to find the new apportionment of the House of Representatives. The method currently in use is called the Huntington–Hill method. To give you a snippet of this problem's illustrious past: solutions that were put into practice in earlier eras include ones devised by Daniel Webster, Thomas Jefferson, and Alexander Hamilton.

Why is it called Huntington-Hill? A column I found from the AMS outlines the history of the method. Its introduction also puts the problem quite well:

We can formulate the [apportionment problem] mathematically as follows:

Given states s_1, ..., s_n with populations P_1, ..., P_n and a positive integer h (think of h as the number of seats in the legislature), determine non-negative integers a_1, ..., a_n where a_1 + ... + a_n = h. (It is customary to think of the value h as given in advance and fixed, since currently the size of the House of Representatives is fixed; however, for some applications one might have the freedom to vary h as part of solving the problem.)

The CAP problem differs from the one above in requiring that each a_i be greater than or equal to 1, or more generally (mathematicians like to generalize!) greater than or equal to b_i, where b_i is some positive integer. The Constitution does not specify h, which started at 65 in 1790 and has grown to the now permanent value of 435, though when Alaska and Hawaii were admitted to the Union the value of h rose temporarily to 437.

At first glance the AP problem does not seem hard. If a state has 10 percent of the population and there are 37 items (seats in the parliament, computer systems, libraries, etc.) to apportion, then .10 × 37 equals 3.7. In a parliament interpretation, the problem is we cannot send 3.7 people to the legislature (though some feel they do not get full representation from whole bodies); 3.7 is not an integer! What should be done with those nuisance fractions? The quota principle (fairness rule) would say, in this example, that 3 or 4 representatives be assigned. With 3 representatives a state would be underrepresented; with 4 it would be overrepresented. But the method we currently use to apportion the House of Representatives could assign fewer than 3 or more than 4 representatives!

The algorithm ultimately devised by Huntington (improving on the work of Hill) works as follows:

  1. Calculate something called the standard divisor, which is the total population of the US divided by the number of seats (that is, the average number of people per district). That is roughly 309 million divided by 435, or about 710,000 people.
  2. Calculate each state’s standard quota, which is the state's population divided by the standard divisor. This is how you get a number like 3.7 above.
  3. For each state, take the lower rounding bound (3) and the upper rounding bound (4) and compute their geometric mean, \sqrt{U\cdot L}. Then compare the quota (3.7) to this mean: round down if it is below the mean and up if it is above. In this case the geometric mean is \sqrt{3\cdot 4}\approx 3.46, so you would round 3.7 up to 4.
  4. Add up all of these rounded quotas. If they sum to 435, you're done. If they don't, repeat steps 2 and 3 with a modified divisor: larger than the standard one if your summed quotas came out above 435, smaller if they came out below, as sketched in code below.
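A minimal sketch of this divisor-adjusting loop in Python (my own illustration, not the Census Bureau's implementation; the example populations and the ten-seat "house" are made up):

import math

def huntington_hill(pops, seats=435):
    # Round each quota against the geometric mean of its floor and ceiling.
    def rounded_quotas(divisor):
        quotas = []
        for p in pops:
            q = p / divisor
            lower = math.floor(q)
            quotas.append(lower + 1 if q > math.sqrt(lower * (lower + 1)) else lower)
        return quotas

    # Start from the standard divisor, then bisect until the rounded quotas
    # sum to the target number of seats.
    lo, hi = 1.0, float(sum(pops))
    divisor = sum(pops) / seats
    for _ in range(100):
        total = sum(rounded_quotas(divisor))
        if total == seats:
            break
        if total > seats:
            lo = divisor   # too many seats: try a larger divisor
        else:
            hi = divisor   # too few seats: try a smaller divisor
        divisor = (lo + hi) / 2
    return rounded_quotas(divisor)

# Toy example with made-up populations:
print(huntington_hill([9_000_000, 4_500_000, 1_200_000, 300_000], seats=10))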

A not particularly efficient algorithm for this process is given by Computational Complexity:

Input: Pop, a population array for the 50 states.
Output: Rep, a representatives array for the 50 states.

Let Rep[i] = 1 for each state i.
For j = 51 to 435
Let i = arg max Pop[i]/sqrt(Rep[i]*(Rep[i]+1))
Rep[i] = Rep[i]+1

This algorithm doesn't proceed the way I described above; instead it uses the same priority criterion to hand individual representatives, one at a time, to the state "most deserving" of the next seat. It's equivalent to the divisor method, and it shows how efficiently Huntington-Hill adapts when the number of representatives changes. The Census Bureau also has a well-made video explaining the whole process.

Trade Deficits A Little Bit Skewed

Data,Economics — Zac Townsend @ December 15, 2010 2:16 pm

I think it's obvious from many of my posts that I am interested in statistics and data analysis. I'm also interested in the failure of unexamined metrics. Ethan, an old friend of mine from Brown, shared the WSJ's Tech Supply Chain Exposes Limits of Trade Metrics, which notes that Apple's iPhone, because it is assembled in China, adds to the US trade deficit. I think the director-general of the World Trade Organization explains the failure of that statistic best:

"What we call 'Made in China' is indeed assembled in China, but what makes up the commercial value of the product comes from the numerous countries that preceded its assembly in China in the global value chain," Pascal Lamy, director-general of the World Trade Organization, said in a speech in October. "The concept of country of origin for manufactured goods has gradually become obsolete." Mr. Lamy said that if trade statistics were adjusted to reflect the actual value contributed to a product by different countries, the size of the U.S. trade deficit with China—$226.88 billion, according to U.S. figures —would be cut in half. That means, he argued, that political tensions over trade deficits are probably larger than they should be. "The statistical bias created by attributing the full commercial value to the last country of origin can pervert the political debate on the origin of the imbalances and lead to misguided, and hence counterproductive, decisions," Mr. Lamy said in his speech to the French Senate in Paris.

Journalism in the Age of Data

Data,Visualizations — Zac Townsend @ December 15, 2010 9:20 am

Spencer sent me a very cool site yesterday called Journalism in the Age of Data. The main content is a 54-minute video report on "data visualization as a storytelling medium," presented in a great interface with a lot of extra content. There is a really interesting thread in the video about how hard it is to make good visualizations, and about the proliferation of bad and confusing ones. The "key points":

The explosion of data has brought a complementary need for tools to analyze it
Researchers in visualization are helping by building tools for non-experts
Journalists are finding ways to adapt to the challenge of telling stories with data
With experience in charting data, infographics designers are well suited to bring data vis to journalism, but they debate how effective it is at explaining concepts
In a wired world, data is increasingly becoming a medium of personal expression
Data will increasingly arrive in real time, challenging our ability to absorb, analyze and display it
Technologies for creating online visualizations are in transition, but there are new tools coming out that will make the process easier
Data analysis is at least as important as visually displaying it; there are tools that help with this process

Some cool visualizations I saw in the video:
Budget Forecasts, Compared With Reality
The Crisis of Credit Visualized
San Francisco Crimespotting

There is also a reference to a very cool paper, Narrative Visualization: Telling Stories with Data, and a very cool JS library, Protovis.

Gawker Passwords

Data,Visualizations — Zac Townsend @ December 14, 2010 1:02 pm

This weekend the Gawker network of blogs was hacked, and a bunch of user passwords were compromised. For a good analysis of the hack, see this post at Coding Horror. What I found most interesting, though, was a Wall Street Journal article on the passwords exposed in the breach:

On Sunday night, hackers posted online a trove of data from Gawker Media’s servers, including the usernames, email addresses and passwords of more than one million registered users. The passwords were originally encrypted, but 188,279 of them were decoded and made public as part of the hack.

Then, using that dataset, the WSJ found the 50 most-popular Gawker Media passwords and made this interesting graph:
The Top 50 Gawker Media Passwords
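The counting itself is the easy part; here is a sketch of that step in Python, with a made-up stand-in for the decoded password list (not the WSJ's actual code):

from collections import Counter

# Stand-in for the 188,279 decoded plaintext passwords.
passwords = ["123456", "password", "123456", "qwerty", "123456", "password"]

# Tally the passwords and keep the 50 most common.
top_50 = Counter(passwords).most_common(50)
for pw, count in top_50:
    print(count, pw)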

Visualizing Friendships

Data,Statistics,Visualizations — Zac Townsend @ December 14, 2010 10:57 am

An intern at Facebook has created a world map that visualizes the connections in the social graph:

Facebook World Map of Relationships

What's fascinating, though, is how he did it:

I began by taking a sample of about ten million pairs of friends from Apache Hive, our data warehouse. I combined that data with each user's current city and summed the number of friends between each pair of cities. Then I merged the data with the longitude and latitude of each city.

At that point, I began exploring it in R, an open-source statistics environment. As a sanity check, I plotted points at some of the latitude and longitude coordinates. To my relief, what I saw was roughly an outline of the world. Next I erased the dots and plotted lines between the points. After a few minutes of rendering, a big white blob appeared in the center of the map. Some of the outer edges of the blob vaguely resembled the continents, but it was clear that I had too much data to get interesting results just by drawing lines. I thought that making the lines semi-transparent would do the trick, but I quickly realized that my graphing environment couldn't handle enough shades of color for it to work the way I wanted.

Instead I found a way to simulate the effect I wanted. I defined weights for each pair of cities as a function of the Euclidean distance between them and the number of friends between them. Then I plotted lines between the pairs by weight, so that pairs of cities with the most friendships between them were drawn on top of the others. I used a color ramp from black to blue to white, with each line's color depending on its weight. I also transformed some of the lines to wrap around the image, rather than spanning more than halfway around the world.
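A toy recreation of that weighting-and-layering idea in Python (my own sketch, not Butler's R code; the city coordinates and friendship counts are invented):

import math
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Made-up (longitude, latitude) coordinates and friendship counts between city pairs.
cities = {"NYC": (-74.0, 40.7), "London": (-0.1, 51.5), "Tokyo": (139.7, 35.7)}
friends = {("NYC", "London"): 5000, ("London", "Tokyo"): 1200, ("NYC", "Tokyo"): 300}

def weight(pair, count):
    # Weight grows with the number of friendships and shrinks with distance,
    # so short, heavily connected pairs end up brightest and drawn on top.
    (x1, y1), (x2, y2) = cities[pair[0]], cities[pair[1]]
    return count / (1.0 + math.hypot(x2 - x1, y2 - y1))

ramp = LinearSegmentedColormap.from_list("fb", ["black", "blue", "white"])
max_w = max(weight(p, c) for p, c in friends.items())

fig, ax = plt.subplots(facecolor="black")
ax.set_facecolor("black")
# Draw the lowest-weight lines first so the strongest connections sit on top.
for pair, count in sorted(friends.items(), key=lambda kv: weight(*kv)):
    (x1, y1), (x2, y2) = cities[pair[0]], cities[pair[1]]
    ax.plot([x1, x2], [y1, y2], color=ramp(weight(pair, count) / max_w), linewidth=0.6)
plt.show()

Sorting by weight before plotting is what produces the layering effect Butler describes, without needing true transparency.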

The 70 Online Databases That Define Our Planet

Data — Zac Townsend @ December 13, 2010 1:15 pm

The MIT Technology Review has the list.