Shared Names and Voter Fraud

Posted: April 6, 2014 in analysis

RedState.com contributor streiff wrote an analysis of a Fox News story about database matching of North Carolina’s voter rolls. The original story, while somewhat alarmist, at least acknowledges that the analysis is preliminary at best. What North Carolina did was compare a complete list of everyone who voted in NC in 2012 to a list of voters in 28 other states. 35,750 voters matched on first name, last name, and birthdate. 765 voters also matched on the last 4 digits of their social security number. The Fox News article points out that these aren’t guaranteed cases of voter fraud, but streiff ignores this, with his headline “Rampant Vote Fraud Uncovered In North Carolina” and comments like this:

This is not a minor problem, this is an industry. Under a most favorable scenario one has to expect the overwhelming majority of the voters matching name and date of birth are the same person.

I don’t think we can accept that claim without some analysis. The math on matching in large groups is a funny thing, and most people don’t judge it very well.

Birthdays In A Room

The classic example of this is to ask “How many people need to be in a room before there is a 50% chance that two of them share a birthday?”.  Most people think that, with 365 days to choose from, it would need to be in the hundreds. The actual answer is 23.  The first person to enter the room can have any birthday.  The second person has one day they cannot have.  The third has two, and so on.  While the chance for each person colliding with someone else as they enter the room is low, the cumulative chance that every person will dodge every other person shrinks faster than most people think.
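The birthday argument above can be checked with a few lines of Python. This is only a sketch: it assumes 365 equally likely birthdays and ignores leap years, and the function names are mine.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Chance that at least two of `people` share a birthday.

    Assumes every day is equally likely (no leap years, no seasonal effects).
    """
    all_distinct = 1.0
    for already_in_room in range(people):
        # The next person must dodge every birthday already taken.
        all_distinct *= (days - already_in_room) / days
    return 1.0 - all_distinct


def smallest_room(target: float = 0.5, days: int = 365) -> int:
    """Smallest group size whose chance of a shared birthday reaches `target`."""
    people = 1
    while shared_birthday_probability(people, days) < target:
        people += 1
    return people


if __name__ == "__main__":
    print(smallest_room())
    print(shared_birthday_probability(23))
```

Each person's individual chance of a collision is small, but the product of all the "dodge" probabilities falls below one half once the 23rd person walks in.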

Names And Birthdays Of Voters

The math for voters isn’t the same, but it’s not too hard to figure out. There were 6.5 million NC voters in the data, and 101 million non-NC voters. Voter birthdays are not evenly distributed, which means they will tend to match more often than by pure chance, but to keep the math simple, I assumed all voters were evenly spread between the ages of 18 and 62. The lessened range is more than compensated for by the flattening of the voter density. That gives 44 years of 365 days, or roughly 16,000 possible birthdates. Therefore, each voter in NC has a 1 in 16,000 chance of sharing a birthdate with any particular voter in another state who also shares their name.

Now, consider John Smith. Howmanyofme.com says that there are 45,963 people named John Smith in the United States. Assuming they are evenly proportioned, 939 would be among the 6.5 million NC voters, and 14,670 would have voted in another state. Each of the NC John Smiths has a 60% chance of sharing a birthdate with at least one out-of-state John Smith who voted, which means that about 564 of them would turn up on this list.
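For anyone who wants to check the John Smith arithmetic, here is a rough sketch. It uses the figures quoted above (939 in-state and 14,670 out-of-state John Smiths, 16,000 possible birthdates) and assumes every birthdate is equally likely and independent; the function names are mine.

```python
POSSIBLE_BIRTHDATES = 16_000  # roughly 44 years of 365 days, as assumed above


def chance_of_any_match(out_of_state_namesakes: int,
                        possibilities: int = POSSIBLE_BIRTHDATES) -> float:
    """Chance that one NC voter shares a birthdate with at least one namesake."""
    # Probability of dodging every single out-of-state namesake.
    dodge_everyone = (1.0 - 1.0 / possibilities) ** out_of_state_namesakes
    return 1.0 - dodge_everyone


def expected_matches(nc_namesakes: int, out_of_state_namesakes: int,
                     possibilities: int = POSSIBLE_BIRTHDATES) -> float:
    """Expected number of NC voters with this name who show up as a match."""
    return nc_namesakes * chance_of_any_match(out_of_state_namesakes, possibilities)


if __name__ == "__main__":
    print(chance_of_any_match(14_670))    # about 0.60
    print(expected_matches(939, 14_670))  # about 564
```

The same two functions work for any other name once you plug in its in-state and out-of-state counts.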

And that’s just one name. There’s also James Smith, Michael Smith, Robert Smith… and John Jones, James Jones… and so on and so on and so on. In fact, just the top 10 first names paired with the top 10 last names on howmanyofme.com predict nearly 25,000 matches. So, the 35,750 is easily reachable with no foul play whatsoever. The key in the math is that not only is a common name more likely to match, because there are more of them in other states, it will also account for more matches, because there are more of them in NC.

An alternative analysis is to take a sample of real names and average them. Here, we only count each name in NC once, but we still use the frequency to predict the chances of a match. Finding a good sample is hard; I used the officers of several NC clubs with webpages, and the rosters of several NC high school sports teams.

The results were interesting.  Many names are unique or nearly unique in the country, so have very little chance of matching.  But a few spike very high, and account for the vast majority of the matches.  My results were trending considerably lower than the reported results until I came across a Michael Smith, which pulled the average up to nearly double reality.

It seems that this method is very sensitive to the sample, so I will consider the other method more reliable. If the top 100 names can account for a significant fraction of the reported value, we can assume that the rest of the names will account for the rest.

Names, Birthdate, and Last 4 SS#

But what about the matches that also included social security numbers? The math for this is really the same as above, just with a 1 in 159,984,000 (16,000*9999) chance that two voters with matching names also match on both birthdate and last 4 digits. Now, however, the odds do grow long. There is only about 0.08 of a John Smith match, and the top 100 names only account for about 6 total matches. It is still possible, and indeed likely, that some of the 765 voters who matched in this way are pure coincidence. But not all of them. Voter fraud isn’t the only explanation for this. But it is one worth looking into.
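The same calculation with the longer odds looks like this. Again a sketch only: it reuses the John Smith counts from earlier in the post, treats the last 4 digits as 9,999 equally likely values independent of birthdate, and the names are mine.

```python
import math

POSSIBLE_BIRTHDATES = 16_000
POSSIBLE_LAST_FOUR = 9_999  # 0001 through 9999, as used in the text
COMBINED = POSSIBLE_BIRTHDATES * POSSIBLE_LAST_FOUR  # 159,984,000


def expected_full_matches(nc_namesakes: int, out_of_state_namesakes: int,
                          possibilities: int = COMBINED) -> float:
    """Expected NC voters matching a namesake on birthdate AND last 4 digits."""
    # log1p/expm1 keep precision when the per-pair chance is this tiny.
    log_dodge_everyone = out_of_state_namesakes * math.log1p(-1.0 / possibilities)
    chance_of_any_match = -math.expm1(log_dodge_everyone)
    return nc_namesakes * chance_of_any_match


if __name__ == "__main__":
    print(COMBINED)
    print(expected_full_matches(939, 14_670))  # a small fraction of one match
```

Even for the most common name in the country, chance alone produces less than a tenth of one full match, which is why the 765 can't all be waved away as coincidence.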

Conclusion

Streiff’s claim that this report supports 1 million cases of voter fraud is preposterous.  Matching names and birthdates is a completely insufficient tool to make any claim about the uniqueness of the people in question. Including social security numbers goes a long way towards resolving this concern, and those results warrant further investigation.  However, they are well within the potential margin of human error.

The Null Hypothesis and Bias

Posted: May 24, 2013 in analysis

It’s been a long time, and this is actually a boring subject, but it’s something I wanted to talk about.

Last week Nate Silver of FiveThirtyEight wrote a piece about flawed statistical thinking in an op-ed by Peggy Noonan.  He used some simple calculations to show his point.

Dan McLaughlin of RedState had an issue with Mr. Silver’s piece.

Silver concedes of his statistical analysis that “this calculation assumes that individuals’ risk of being audited is independent of their political views,” which of course is the very thing in dispute; it’s like the old joke about an economist stranded on a desert island with a stack of canned goods whose solution begins, “assume a can opener.” All things being equal, all things are equal.

Mr. McLaughlin fundamentally misidentified what Silver was doing when he made that assumption.  It doesn’t weaken his argument; it is necessary to make it, statistically.

The Null Hypothesis

What Mr. Silver was doing in his piece was using an informal version of the null hypothesis, which is the foundation of much of modern statistics.  The fundamental mathematical principle behind statistical significance relies, not on proving a hypothesis, but on disproving the null hypothesis.

Thus, if a statistician wants to show that smoking causes cancer, they do the math assuming that smoking has no effect on cancer. If the math leads to an unlikely result, they have disproven the null hypothesis. If the math doesn’t, then they have failed to disprove the null hypothesis. A statistician never proves a positive hypothesis; they simply disprove null hypotheses.

This is a bit tough to grasp, so I’ll try to explain it with a very simple example. If I have a coin that I think might be weighted, the way I test that is to flip it a bunch of times and write down the results. Then, I assume it was a fair 50-50 coin, and ask, “If the coin weren’t biased, how unlikely would it be that I got the results I just did?” In 100 flips, if my sample came out 47-53, sure, the most likely answer is that it is slightly weighted. But that result is entirely ordinary for a fair coin, so I would not reject the null hypothesis. If my sample were 82-18, however, that would be a staggeringly unlikely event with a fair coin, so it is probably safe to reject the null. If I only flipped the coin twice, however, I couldn’t disprove the null even if both flips were heads, since that has a good chance of happening regardless. Mathematically, this is what is represented by the p-value: the probability of a result at least as extreme as the one in question, given the null hypothesis.
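The three coin examples can be worked out exactly. This sketch computes a two-sided p-value under the fair-coin null, counting every outcome at least as lopsided as the one observed; the two-sided convention and the function name are my choices, not anything from Mr. Silver's piece.

```python
from math import comb


def fair_coin_p_value(heads: int, flips: int) -> float:
    """Two-sided p-value for `heads` out of `flips`, assuming a fair coin.

    Adds up the probability of every outcome at least as far from an
    even split as the observed one.
    """
    observed_imbalance = abs(2 * heads - flips)
    as_extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(2 * k - flips) >= observed_imbalance
    )
    return as_extreme / 2 ** flips


if __name__ == "__main__":
    print(fair_coin_p_value(47, 100))  # large: nothing unusual for a fair coin
    print(fair_coin_p_value(82, 100))  # vanishingly small: reject the null
    print(fair_coin_p_value(2, 2))     # two flips can never be convincing
```

A 47-53 split gives a p-value well above one half, an 82-18 split gives something on the order of one in ten billion, and two heads in two flips gives exactly 0.5.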

That an individual’s risk of audit is independent of their political views is a null hypothesis. Mr. Silver proposes it, then shows that Peggy Noonan’s evidence does not disprove it. Thus, statistically, Peggy Noonan has very weak evidence. He does this by showing that, if the null is true, it would not be unusual to find four or five (indeed, four or five thousand) Republican donors that were audited. So, the fact that Peggy Noonan did find four or five Republican donors that were audited is not statistical evidence that the null is false.

The null hypothesis itself, however, is not a political statement.  It is simply the way one has to formulate the problem in order to use the mathematical tools available.

As a side note, I was banned from RedState some time ago for formulating a statistical query in this way, because the null hypothesis looked like a political position, so the fact that it has come up again is of some interest to me.

Unionization and Economic Growth

Posted: December 16, 2012 in analysis

Today we look at the following article on RedState, which purports to show that

…Without cherry-picking data as union bosses must in order to defend forced unionism, total seasonally adjusted non-farm employment growth shows a huge advantage for residents of right to work states.

The actual data presented, however, are the employment growth over 20 years for all right to work states, but only a few union-friendly states.

This, particularly considering that the introduction explicitly calls out cherry-picking, triggered my sensors.  So, let’s see what happens if we look at data for all states.

First, I have to find the data used to create this chart.  It took some rummaging on the BLS website, but I finally found numbers that almost, but not quite, recreate the numbers on the chart in the original article.  My version of the chart is below.

[Image: 20year-righttowork]

As you can see, it looks essentially like the version in the article, though because I am using slightly different data, the percentages vary by a point or two.

Now, I’d like to present a different chart, this time comparing Ohio to other union-friendly states.

[Image: 20year-unionfriendly]

Here is where the cherry picking comes in.  Ohio job growth isn’t terrible because it is union friendly, it’s terrible because it’s terrible.  Nearly everyone does better than Ohio regardless of their labor policy.

Now, there may be something to the claim that right-to-work states have gained more over the last 20 years than union-friendly states have. But that wasn’t the argument; Jason Hart asserted a “huge advantage” for right-to-work states, and presented evidence that was built around a comparison to the second-worst performing state of any kind.

Presenting the data like that, particularly in the same sentence as calling out others for cherry-picking data, is disingenuous at best.

Speaking on CNN (and as reported at talkingpointsmemo), Governor Bob McDonnell, in an effort to link the economic recovery to Republican governors and not President Obama, said:

 “There’s something going on with Republican-governed states. Seven out of the 10 states nationwide, Candy, that have the lowest unemployment rates: Republican governor states.”

By now you should have figured out the drill; does this show real evidence that Republican-governed states have lower unemployment than Democratic-governed ones?

The short answer is no. The long answer is nnnnnnooooooooooooooooooooooooooooooooooooooooooo. (sorry, bad joke). Simply put, there are more Republican governors than Democratic ones, so more Republican states appear in every part of the unemployment list. Only 3 of the 10 lowest-unemployment states are governed by Democrats, but only 3 of the 10 highest-unemployment states are as well (one of those is governed by an independent, so only 6 of them are Republican).

Of course, we can do better than just counting from the top 10 and bottom 10. A simple statistical model can give a much better sense of whether governor party affiliation affects unemployment. The answer is no. While Republican-governed states have slightly lower average unemployment, the difference is tiny (0.3%) and is very likely caused by random chance (p=0.54). Trying to refine the model by adding length of incumbency or length of party incumbency does not produce any results other than noise. Sometimes one party is a little ahead, sometimes the other, but the results are never significant.
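One simple way to ask "could this gap be random chance?" is a permutation test, sketched below. To be clear, this is not necessarily the model behind the p=0.54 figure above, and the numbers in the example are made-up toy values, not real state unemployment rates.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means.

    Tries every way of relabelling the pooled values into two groups of the
    original sizes, and reports the share of relabellings whose gap is at
    least as large as the one actually observed.
    """
    pooled = list(group_a) + list(group_b)
    observed_gap = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    as_extreme = 0
    total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        relabelled_a = [pooled[i] for i in indices if i in chosen_set]
        relabelled_b = [pooled[i] for i in indices if i not in chosen_set]
        gap = abs(mean(relabelled_a) - mean(relabelled_b))
        if gap >= observed_gap - 1e-12:  # small tolerance for float rounding
            as_extreme += 1
        total += 1
    return as_extreme / total


if __name__ == "__main__":
    # Toy numbers only, NOT real unemployment data.
    toy_party_one = [5.0, 6.0, 7.0]
    toy_party_two = [5.5, 6.5, 7.5]
    print(permutation_p_value(toy_party_one, toy_party_two))
```

With 50 states, trying every relabelling is not practical, so in practice one would shuffle the labels at random a few thousand times instead. The idea is the same: if shuffling the party labels produces a gap this big most of the time, the gap tells you nothing about the parties.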

The conclusion is pretty clear, then.  Mr. McDonnell’s statement is factually true only in the most technical sense, and any implication he tries to draw from it is faulty.

Pay Equality at the White House

Posted: April 20, 2012 in analysis

An article on The Free Beacon here makes the simple claim that the White House pays women less than men, according to public records.  They then go on to imply (as others who link to them do more explicitly, like here) that this is demonstrative of an anti-woman attitude in the administration.

So, let’s take a look at the numbers, shall we?

A few months ago, Supreme Court Justice Ruth Bader Ginsburg was discussing the drafting of a new constitution in Egypt, and she said that she didn’t believe that the US Constitution was the best model. This terribly offended many commentators, as can be seen here, here, and here.

But could there be a good reason why Justice Ginsburg doesn’t think the US Constitution is a good model?  I think there is; Presidential systems do not lend themselves well to long-term stable democracies, which is the goal of a well-written Constitution.  Of course, the United States is the exception, but how well do other countries with a Presidential system fare?

Political Sex Scandals Revisited

Posted: November 8, 2011 in analysis

Some of you may recall that last summer, I tried to build a model to predict the results of political sex scandals, and documented my efforts here and here.  The model was unusual, and it turned out to predict the then-current sex scandal (David Wu) very poorly.

Well, another sex scandal has made the news, so it’s time to put my model to the test again. Herman Cain’s scandal isn’t very interesting, quite frankly. The variables that matter to the model are pretty straightforward; Mr. Cain’s scandal is nothing special.

  • Intensity: 5 – multiple instances of sexual advances, but no actual sex.
  • Unfaithfulness: 7 – Cain has been married for 40+ years, but hasn’t quite been accused of actually cheating on his wife.
  • Kinkiness: 3 – Nothing more than a little dirty talk.
  • Hypocrisy: 4 – Courting the religious right but having adulterous intentions.
  • Coercion: 6 – The actions were non-consensual.

The other ratings (such as Contrition, which is 1 (Cain denies the events), and Plausibility, which is 6 (there isn’t very strong evidence that they happened)) aren’t a part of the model.

So, as a low intensity Republican with a coercive but not kinky scandal, the model does not predict a happy outcome for Mr. Cain.  Specifically, the result is a value of 0.16, which means he will most likely drop out of the race or lose the nomination.  But, this is the same model that predicted that David Wu wasn’t going anywhere on the precise day he announced his resignation, so take that with a grain of salt.

It should also be noted that only one of my model data cases (Jack Ryan) was a non-incumbent candidate for election, so the dynamics may be very different.  But I have the model, so it’s worth testing it again.  And the best way to test is to make the prediction in advance of the event, so there you are.