Archive for the 'Uncategorized' Category

On Mathematical models of the recall vote and fraud, part X: 2nd. Simon Bolivar Seminar

September 19, 2004

On Thursday the second Simon Bolivar University seminar on statistical analysis of the referendum process was held. There were supposed to be three talks, but nature conspired against Luis Raul Pericchi, who was in Puerto Rico and unable to come to Venezuela because of Hurricane Jeanne. A videoconference was planned instead, but unfortunately the island lost all electric power, making it impossible to set up. His talk has been tentatively rescheduled for next Thursday.


You can find the program for these conferences here. I thought all the presentations would be posted there, but only one of them has been so far; more on that particular one later.


 


-There was a talk by Rafael Torrealba from the Math Department at Universidad Centro Occidental Lisandro Alvarado. The talk would have been useful two or three weeks ago, but by now the model is too simplistic to be useful. Basically, Torrealba calculated the probability of coincidences assuming all machines have 500 voters and approximating the binomial distribution by a "box": uniform within one standard deviation of the mean and zero outside it. With this, Torrealba found that coincidences were about as likely as those observed in the recall vote. He cited Rubin's work, but was unaware of that of Taylor, Valladares and Jimenez, so at this point the model is too crude to make a point.


 


Torrealba also showed some voter distributions from the Barquisimeto area, where he lives, to discuss the implications of applying a binomial distribution.


 


-There was a second talk by Isbelia Martin on the binomial distribution and the recall vote. She gave a more complete presentation of the results I summarized here, with much more material than what I showed; if she places her presentation online, I will link to it here.


 


What she did was to present the data for a textbook binomial state, Vargas State, and compare it to the data I presented on Miranda State. There are more anomalies in the data than the ones I discussed, including the fact that if one fits the "clouds" of results to obtain the average of each cloud, the fitted lines do not pass through zero as they should. Additionally, she and her colleagues found that in some cases the same center has machines in both clouds, which obviously makes no sense.


 


-Jimenez, Jimenez and Marcano have now placed a simplified version of their work on coincidences here. I wish everyone would make their work available like that; it would make discussions more lively and interesting.


 


What they have done is essentially to use what is called a bootstrap method, which is basically a simulation of the vote using the actual data from the recall referendum, modeling the details of the structure of the centers, tables (mesas) and machines. They allow all variables to fluctuate so that they do not have to assume the data is random, which it would not be if it had been tampered with.
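The bootstrap idea above can be sketched in a few lines. This is only an illustration under my own assumptions (resampling each machine's totals within its own center, with made-up toy numbers); the authors' actual simulation models the center/mesa/machine structure in much more detail.

```python
import random

def simulate_coincidences(centers, n_sims=1000, seed=0):
    """Hypothetical bootstrap sketch: for each simulated election,
    resample every machine's vote total (with replacement) from the
    observed totals of its own center, then count how many centers
    end up with at least two machines showing identical totals."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sims):
        hits = 0
        for machines in centers:  # each entry: list of per-machine totals
            resampled = [rng.choice(machines) for _ in machines]
            if len(set(resampled)) < len(resampled):  # a coincidence
                hits += 1
        counts.append(hits)
    return counts

# Toy data: three centers with per-machine vote totals (invented numbers).
centers = [[210, 215, 214], [180, 181], [330, 331, 333, 329]]
sims = simulate_coincidences(centers)
```

Because the resampling draws from the actual observed totals, no assumption about the underlying distribution of the vote is needed, which is the point of the method described above.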


 


Jimenez et al. also do a more detailed calculation, looking not only at the number of coincidences in the Si or No votes, but at the Si, the No and the total votes, comparing the probability of coincidences for each type of center. That is, they not only calculate how many centers had coincidences in two machines, but how many centers with two machines had coincidences in any of the three numbers (Si, No, or total votes), how many centers with three machines did, how many with four, and so on. In this fashion one has a wider set of probabilities against which to compare the real data to the simulations.


 


They then ran 1238 simulations and calculated the same probabilities for centers with 2 to 11 machines. They found that, in general, the proportion of coincidences is higher in the actual vote than in the simulations, which led them to do a rank test, calculating the probability that the number of coincidences observed in the recall vote could occur for centers with n = 2, 3, 4, ..., 11 machines. Thus, it is not simply a matter of asking what the probability of two machines coinciding is, but what the probability is that centers with two machines had the level of coincidences observed.
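The rank-test probability described above is, in its simplest reading, an empirical p-value: the fraction of simulations that produced at least as many coincidences as the real vote. A minimal sketch (my own simplification, not the authors' exact statistic):

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated elections with at least as many
    coincidences as actually observed. A probability of exactly 0
    just means no simulation ever reached the observed count."""
    return sum(1 for s in simulated if s >= observed) / len(simulated)

# Toy simulated coincidence counts for centers of a given size:
sim_counts = [3, 5, 4, 6, 2, 7, 5, 4]
p_moderate = empirical_p_value(4, sim_counts)   # observed value is typical
p_extreme = empirical_p_value(12, sim_counts)   # observed value never occurs
```

This is why a reported probability of "ZERO" below is meaningful with 1238 simulations: the observed count was beyond anything the simulations produced.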


 


You can see the results in Table 3 of their paper, but I will summarize some cases with examples:


 


Centers with two machines: the probability of observing the number of Si coincidences seen was 0.0323, of No coincidences 0.7746, and of total-vote coincidences 0.0638. Thus, while low, it was plausible that there were that many coincidences.


 


Centers with four machines: the probability of observing that number of Si coincidences was ZERO, the probability of No coincidences 0.2883, and the probability of the total votes coinciding 0.00807. Similarly low probabilities were observed for total-vote coincidences in centers with 6 and 7 machines, and extremely low probabilities for Si coincidences in centers with six machines.


 


The authors conclude:


 


-The repetitions observed in the Si vote and in the total number of voters per machine within one center are considerably larger than expected. It is strange, but probable.


 


-The repetitions observed in the NO votes are absolutely credible and in many cases, close to what was expected.


 


-The repetitions observed in the Si votes in centers with 4 machines and in the number of voters in centers with six machines are the extreme cases of their analysis. In these cases the authors CANNOT accept the hypothesis that the repetitions are due to randomness.


 


This last conclusion is the strongest one found in the study of the coincidences in the number of votes within one center: it says the data could not have been random.

On Mathematical models of the recall vote and fraud, part IX: Too much correlation between the 2000 and 2004 vote?

September 15, 2004

Jose Huerta, whose page you can find here (he has some interesting statistics about education and poverty in Venezuela on his page), has been comparing the data from 4565 centers in the 2000 Presidential vote and the recent recall referendum, as well as in the 1999 referendum and the 2000 Presidential vote. Essentially, one notes that, as usual, time was working against Chavez: with each vote the anti-Chavez vote went up and the pro-Chavez vote went down. (Here is the full presentation and details in PowerPoint format.) However, this trend stopped between 2000 and 2004, despite the fact that the time span between the two was much longer.


What is interesting about his results is that if you look at the number of anti-Chavez votes at the municipal level, there is a high correlation between the 1999 and 2000 votes, with R^2 = 0.9784:


 



 


 


 


But what is remarkable is that the correlation actually went UP between 2000 and 2004, with an R^2 of 0.9866 at the municipal level:


 



 


 


which is somewhat counterintuitive given the time frame and everything that happened in Venezuela in those four years, including new voters, changes in voting centers, migrations and political unrest.
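For readers who want to check the R^2 numbers themselves: R^2 here is just the squared correlation coefficient of the two sets of municipal totals. A minimal sketch with invented toy numbers (not Huerta's data):

```python
def r_squared(x, y):
    """Coefficient of determination of a least-squares line of y on x,
    computed as the squared correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Toy anti-Chavez municipal totals for two consecutive elections:
votes_2000 = [1200, 4500, 800, 9700]
votes_2004 = [1300, 4700, 900, 9900]
```

An R^2 near 0.99 means almost all the variation across municipalities in one election is explained by the other election's totals, which is exactly what makes the 2000-2004 figure so striking.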


 


Certainly very intriguing.


Seminar at Simon Bolivar University: Two presentations

September 12, 2004

Besides the overview presentation by Isabel Llatas, there were two talks at this first seminar:


1)      Statistical study of the CNE data by Bernardo Marquez et al.


 


This is a group of engineers who looked at the statistical properties of the electronic results at the two lowest levels of detail: the center level and the parish level.


 


Basically, the CNE divided the nation into 321 municipalities. Each municipality was itself divided into parishes, and the parishes into centers. There were on average 2.6 parishes per municipality and 5.4 centers per parish, with on average 10,098 votes per center and 26,486 votes per parish.


 


The study was a statistical hypothesis test of all of the CNE data. The basic hypothesis was that the CNE data is valid, and thus, by looking at averages and standard deviations, one should be able to establish confidence intervals at both the parish and the center level for whether the votes within a parish or center fall in the expected range. That is, they look at the final result of a machine and check whether that result is within what is expected from the statistics of its center or parish.
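The test described above can be sketched roughly as follows. This is my own simplified guess at the idea (comparing each machine's vote share against the mean and standard deviation of the other machines in its group); the grouping and the exact statistic Marquez et al. used may well differ.

```python
import math

def flag_unexpected(shares, z=1.96):
    """Flag machines whose vote share deviates from the rest of the
    group's mean by more than z standard deviations (z = 1.96 roughly
    corresponds to a 95% confidence level). Each machine is compared
    against the other machines, so one outlier cannot hide itself by
    inflating the group's standard deviation."""
    flagged = []
    for i, v in enumerate(shares):
        rest = shares[:i] + shares[i + 1:]
        if len(rest) < 2:
            continue  # cannot estimate a spread from a single machine
        m = sum(rest) / len(rest)
        var = sum((r - m) ** 2 for r in rest) / (len(rest) - 1)
        sd = math.sqrt(var)
        if sd > 0 and abs(v - m) > z * sd:
            flagged.append(v)
    return flagged

# Toy Si shares for five machines in one center (invented numbers):
outliers = flag_unexpected([0.41, 0.43, 0.42, 0.44, 0.61])
```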


 


The authors found that, with a 95% confidence level, only 7% of the machines showed unexpected results with respect to their center. In contrast, they found that 62% of the parishes showed unexpected results; at a 99% confidence level, 51% of the machines had unexpected results.


 


They then looked at each parish to see how much the centers within it differed, by looking at the standard deviations of each center. They then eliminated what they called the non-homogeneous parishes, those in which the centers showed significant differences in the standard deviations of their distributions. Keeping only the "homogeneous" parishes, they found that at a 95% confidence level 42% of the machines showed unexpected results, and at 99% confidence, 26% did.


 


2)      A study of the coincidences in the votes per machine, by Raul Jimenez (USB), Alfredo Marcano (USB) and Juan Jimenez (UCV)


 


This talk discussed the various simulations that have been done to study the coincidences. It was very critical of Rubin's and Taylor's from the technical point of view. I must say that I was not able to understand the details of what they did; it was beyond my understanding, and I tried. Basically, they are using fairly sophisticated mathematical theory to look at the problem and study the probabilities of occurrences.


 


In their most detailed work, they looked at the probability of Si and No coincidences, as well as the probability that the sum of the Si and No votes also coincides. They obtained a probability of 3.5 in 10,000 for the Si coincidences, a reasonable one (I think it was 0.3) for the No, and 1 in 1,000,000 for the sum of the Si and No votes to coincide.


 


This result is being submitted as a scientific paper next week, and the authors said they will send me a copy when they submit it to the journal.


Seminar at Simon Bolivar University on the mathematics of the recall results: An Overview

September 9, 2004

There was a seminar today at Simon Bolivar University (USB), the leading technical university in Venezuela, on mathematical studies of the recall vote. The event, which was also sponsored by Universidad Central de Venezuela (UCV), was quite interesting. I was planning to write a full report, but unfortunately (for you), maybe fortunately (for me), I forgot my notes at my office, and if I want to speak with precision, I need them.


Perhaps the most interesting part is the effect this is having on the academic community. You have a bunch of mathematicians and physicists applying the tools of their academic and research trade to a real life problem. Additionally, many people are working on the same problem so there is a lively and daily exchange of ideas. This is good for Venezuelan science, independent of the final results.


 


The problem is being looked at from a variety of angles, from very pedestrian statistical analysis to sublime techniques, and I am sure some will soon get into divine ones that I will never be able to understand. Speakers were very careful not to use the word "fraud", concentrating on "probability", "likelihood" and other such terms.


 


The first talk was given by Isabel Llatas and it was an overview of the work that is being done or has been done so far. I counted 24 different names of scientists here or abroad looking at the problem from different angles.


 


Llatas showed partial results from the work of Sanso and Prado, which I have posted here, from that of Isbelia Martin, which I posted two nights ago, as well as from that of Luis Raul Pericchi, who has been using Benford's Law to study the results of the referendum vote. Pericchi will speak at the second of these seminars next Thursday, but I found the work very interesting and will mention it later in this post.


 


Llatas showed how people have looked at the available CNE data in many different forms, separating it into data counted electronically and manually, as well as by geographical distribution. What came across from the talk is that a lot of work has already been done in the last three weeks with the available data, and scientists are still checking their results before publishing or talking about them.


 


After this came two talks, which I will dwell on in detail later. The first was by a group of engineers who have looked at the statistical properties of the votes at the center and parish level, finding what they call "irregular" results at a significant number of machines. The second talk was by Raul Jimenez et al., who have been looking at the problem of coincidences and have some interesting formal and practical results suggesting the coincidences are quite unlikely. One of their most surprising statements was that there are also coincidences in the total number of votes per machine (Si's + No's), and these coincidences have the lowest probability of occurring: about one in a million.


 


Before today I had heard of Pericchi's work, but had no idea what it was about until I saw a graph of his results and decided to look into the background. (I have no more details than what I give at the end of this post.) His work is based on Benford's Law, a concept that, now that I know about it, makes me wonder how I could have lived all these years without it!


 


Benford’s Law


 


Imagine you have a table of the populations of towns and cities for a given country. These numbers are distributed according to some probability distribution with a mean and a standard deviation. But suppose that, rather than look at the full number, you looked at the first (leftmost) digit of each number, 1 through 9. Intuitively most people would think that the probability of that digit being 1, 2, 3, ... or 9 would be exactly the same. Well, it isn't. If you look at a wide range of statistical tables, such as the prices of stocks on the NYSE, baseball statistics, or even the numbers in a company's financial statements, you find that the probability of the first digit being 1 is 0.301, of it being 2 is 0.176, of 3 it is 0.124, all the way down to 9, whose probability is 0.0458.
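The probabilities above come from a simple formula: under Benford's Law, the probability of a first digit d is log10(1 + 1/d). A few lines reproduce the whole table:

```python
import math

# Benford first-digit probabilities: P(d) = log10(1 + 1/d) for d = 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(n):
    """Leading digit of a positive integer, for comparing observed
    frequencies in a data set against the Benford probabilities."""
    return int(str(n)[0])
```

Note that the nine probabilities sum to exactly 1, since the products (1 + 1/d) telescope from 1 up to 10.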


 


The following is a table taken from here with the first-digit probabilities found in numbers on the front page of a newspaper, in the 1990 census of county populations, and in the prices of the Dow Jones Industrial stocks from 1990-1993.


 



 


The reason for this is that such quantities tend to be evenly distributed on a logarithmic scale, because many of these processes are multiplicative. Think of stock prices. If you issue stock at $10 and your company grows 100% every five years, the digit 1 would be the first digit of your stock price for the first five years, but after that the digit 2 would be the first digit for less than two and a half years, and the interval gets shorter as the stock price grows. So, if you have hundreds of stocks, you will always observe more first digits equal to one than to any other number.


 


This turns out to have important consequences in real-life testing. Supposedly (I haven't found the reference) the first time someone saw something fishy in Enron's numbers was because a particular table of numbers did not fit Benford's Law.


 


The IRS uses Benford's Law to detect tax fraud, auditors use it to detect fraud in companies, and companies use it to detect fraud by employees. The reason is simple: if someone tampers with the data, they will likely spread the numbers uniformly, so that a 1 as a first digit would be as likely as any other digit. The same thing happens when people commit fraud; they spread the amounts around evenly, thinking it will not be noticed. Auditing firms apparently have many tests like this for companies' data, such as customer refund tables and accounts receivable.


 


You can extend the calculation from the first digit to the first two digits and calculate the corresponding probabilities in that case too.
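The same formula extends directly: the probability that a number starts with the two-digit string k (for k from 10 to 99) is log10(1 + 1/k). A quick sketch:

```python
import math

# Two-digit Benford probabilities: P(k) = log10(1 + 1/k) for k = 10..99.
# E.g. a number is far more likely to start with "10" than with "99".
benford_two_digit = {k: math.log10(1 + 1 / k) for k in range(10, 100)}
```

As with the single-digit case, the ninety probabilities sum to exactly 1, and they decrease monotonically from "10" down to "99", giving a finer-grained test than the single-digit version.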


 


What I understood today is that Pericchi et al. have applied Benford's Law to the election results, looking at the total votes at each "cuaderno" level. Reportedly (I will report the details when I hear their talk next Thursday), they have found that the machine results do not fit Benford's Law at all, while the manual ones fit it quite well.


On Mathematical models of the recall vote and fraud, part VIII: The physicists’ chopped up binomial distribution

September 7, 2004


 


This is a rewrite of last night's post. Quico wanted it to be clear to 12-year-olds; that might be stretching it, but I hope it works for 21-year-olds:


 


Isbelia Martin and a group of physicists at Simón Bolívar University have been looking at the statistics of the number of votes for each voting machine and by states.


 


The behaviour of the votes in an electoral process should follow what is called a binomial distribution. A binomial distribution occurs whenever a process has two outcomes; the classical example is flipping a coin. If the coin is fair, half of the time you get heads and half of the time you get tails. You get a distribution when you do an experiment many times: suppose you flip a coin 100 times and record how many heads you get, and then you repeat the whole experiment 100 times. You then record how many times you got only one head (very unlikely), two, three, and so on. At the end you divide the frequency of each of these cases by the number of experiments and you get a probability distribution like this one, which I stole from this site:


 



 


 


The voting process is similar in that the voters are, in theory, fairly independent in their decisions. In the case of the voting process, the coin flips take place at each voting center, which has a certain total number of registered voters. So you could construct a probability distribution, much like the one above, in which you plot, as a first simple case, the number of people who actually voted. This is a binomial, because each voter decides between two choices: whether to go and vote or not. Each voter is assumed to be independent of the others, even though there may be family pressures to go and vote. The main difference between this problem and the coin problem is that the probabilities are not 0.5 for each outcome. In fact, in the recall vote abstention was approximately 32%, so you could say that the probability of any given person voting was p = 0.68 and the probability of not voting was 0.32.


 


What is different in this problem is that the machines have different sizes, so what you can count is how many people n voted at each machine of size N, and then plot the frequency of occurrences for each machine size N. What you get for the voters in the recall referendum is something very similar to the distribution in the coin-toss problem, which is expected, since both are binomial processes.


 


Mathematically, you can calculate the probability for a binomial process that you will get a value of n voters showing up to vote for each machine with N registered voters. Thus, if we have machines with N voters each, the probability that a voter will go vote is p and the probability that it will not go and vote is q=p-1, thus if we have M machines with N voters, the number of voters n that do go and vote, will be between “0 and N and will follow what is called a binomial distribution given by


 


P(n)=(N!/n!(N-n)!) p^n x q^(N-n)


 


This is a bell shaped curve like the one plotted above


 


Supposed we now plot instead the number of voters that did go and vote (abstention) as a function of the number of voters per machine that were registered at each machine, if the distribution is binomial the points for the abstention should form a cloud of points that open up like the tail of a comet with the greatest density along an imaginary line with a slope proportional to the average attendance of voters in that population. If half the people abstained, this cloud would be along the 22.5 degree line with respect to the horizontal, but since in the case of the recall the percentage of abstention was 32%, this cloud would be below the 22.5 degree line.


 


Below is a plot of such graph for the number of people n that did not go and vote in all of the centers in Miranda state as a function of the number of voters registered per machine N:


 



 


 


 


Plot of the number of voters that that abstained as a function of the number of registered voters N at each machine for Miranda state.


 


This is a textbook type of example of what one should get for a process that should follow a binomial distribution. Thus, the first conclusion is that the data from the recall vote in terms of the choice between going to vote or not behaves in Miranda state and nationally, much like what is expected from a binomial distribution.


 


The same logic should apply to the SI and NO votes. It should be a binomial distribution since it represents a choice between two possibilities. If the vote split were a perfect 50%/50% for the Si and the No, and one plotted the number of votes n for one or the other possibility as a function of the number of actual voters at each machine N, the cloud would spread below the 45 degree line that divides the plane, along a 22.5 degree imaginary line. In the recall vote, since the No won then, if one plots the dispersion plot for the NO votes on would get a cloud above the 22.5 degree line and a similar one below that line for the corresponding cloud of SI votes which is also plotted below.


 


However, what is observed is completely different as seen in the next graph for the number of NO votes n, in Miranda state as an example, as a function of the number of voters in each machine N:


 



 


 


 


Plot of the number of  No voters as a function of the number of voters N at each machine


 


 


Instead of obtaining a single cloud, one obtains two separate areas of high density with a valley of low density separating them. I have drawn three imaginary lines to guide the idea to the valley (area with low density between the two thicker clouds) as well as imaginary lines along the two separate clouds at each side.


 


Exactly the same type of behaviour is seen for the number of SI votes n in Miranda state as a function of the number of voters in each voting center N:


 



 


 


 


Plot of the number of Si voters as a function of the number of voters N at each machine


 


 


This shows the same low density valley, where I have drawn a line to guide the eye and two clouds at each side.


 


Thus, Miranda data, which conforms to a binomial distribution when one looks at the binomial process of abstention versus voting, does not conform to a binomial distribution. In fact, according to the authors, the data for Miranda state would NEVER conform to a binomial distribution. This is the second conclusion: The data for the Si and No votes does not conform to a binomial but is part of the same data that did conform to a binomial in the case of the abstention. In fact it would never conform to it.


 


Even more interesting, the same type of behaviour has been seen in Zulia, Carabobo, Anzoategui, Tachira and  Lara, but “textbook” type of behaviour is found in other states such as Falcon and Vargas. Other smaller states also show classical behaviour. This creates a big problem, how would one explain that some states behave exactly like a binomial, textbooks cases, no discrepancies, while certain selected states do not?


 


In order to try to understand this unusual behaviour, the authors plotted the histogram of occurrences for the Si and NO votes as shown in the next figure:


 



 


Histogram of the occurrences of the Si (red), No (blue) votes as a function of the number of votes.


 


There are two distributions plotted in this figure: The Si bars are the distribution of occurrences of the number of Si voters for each machine with N voters, in the blue the distribution for the number of NO voters as a function of the number of N voters in the machine. As you can see it is as if the Si votes had had a piece chopped up for machines in which the number of registered voters was above 250 and up to 350. This data is for Miranda state, but if one looks at a similar histogram at the national level, the same type of “chopped up” binomial distribution is observed. This is the third conclusion: The distribution is a binomial that appears to have part of it “chopped up” as if part of the Si votes were shifted to No votes.


 


It is this same chopping up which accounts for the valleys in the two unusual dispersion curves.


 


Thus, it would seem as if the process is not at all like a binomial as it should be, but follows instead a distribution which appears to have some form of artificiality and selectively introduced into it, creating two types of distribution. Curiously, the abstention had the proper behaviour expected from a binomial, but is part of a process within the same data. This result is consistent with the hypothesis of Haussman and Rigobon that only a certain number of machines may have been manipulated, in this case, the data suggests if was a selection based on the number of registered voters per machine, which determined whether the data was manipulated or not.

On Mathematical models of the recall vote and fraud, part VIII: The physicists’ chopped up binomial distribution

September 7, 2004

This is a rewrite of last night's post. Quico wanted it to be clear to 12-year-olds; that might be stretching it, but I hope it works for 21-year-olds:

Isbelia Martin and a group of physicists at Simón Bolívar University have been looking at the statistics of the number of votes per voting machine and per state.

The behaviour of the votes in an electoral process should follow what is called a binomial distribution. A binomial distribution arises whenever a process has two possible outcomes; the classical example is flipping a coin. If the coin is fair, half of the time you get heads and half of the time you get tails. You get a distribution when you repeat an experiment many times: suppose you flip a coin 100 times and record how many heads you get, and then you repeat that experiment 100 times. You then record how many times you got only one head (very unlikely), two, three and so on. At the end you divide the frequency of each case by the number of experiments and you get a probability distribution like the one below, which I stole from this site:

[Figure: binomial probability distribution for repeated coin tosses]

The voting process is similar in that the voters are, in theory, fairly independent in their decisions. In the case of the voting process, each voting center plays the role of the repeated coin flips, with some total number of registered voters. So you could construct a probability distribution, much like the one above, in which you plot, as a first simple case, the number of people who actually voted. This is binomial, because each voter decides between two choices: to go and vote, or not. Each voter is assumed to be independent of the others, even though there may be family pressures to go and vote. The main difference between this problem and the coin problem is that the probabilities are not 0.5 for each outcome. In fact, in the recall vote abstention was approximately 32%, so you could say that the probability of any given person voting was p=0.68 and the probability of not voting was 0.32.

What is different in this problem is that the machines have different sizes, so what you can count is how many people voted, n, in each machine of size N, and then plot the frequency of occurrences for each machine size N. What you get in the case of the recall referendum is something very similar to the distribution of the coin-toss problem, which is expected, since both are binomial processes.

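A quick way to see this shape (a hypothetical simulation with made-up numbers, not the actual recall data) is to simulate many machines of the same size and tally how often each turnout count occurs:

```python
import random
from collections import Counter

random.seed(1)
N = 500          # registered voters per machine (hypothetical)
p = 0.68         # probability that any given voter turns out
machines = 5000  # number of simulated machines

# For each machine, count how many of the N voters showed up.
turnout = [sum(random.random() < p for _ in range(N)) for _ in range(machines)]

# Tally how many machines produced each turnout count.
freq = Counter(turnout)
mode = max(freq, key=freq.get)
print(mode)  # clusters near N*p = 340, the peak of the binomial
```

Plotting `freq` would reproduce the bell shape of the figure above, centered on N*p rather than on N/2.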
Mathematically, you can calculate the probability for a binomial process that n voters show up to vote at a machine with N registered voters. If the probability that a given voter goes to vote is p, and the probability that he or she does not is q = 1 - p, then for machines with N voters each, the number of voters n that do go and vote will be between 0 and N and will follow what is called a binomial distribution, given by

P(n) = (N!/(n!(N-n)!)) p^n q^(N-n)

This is a bell-shaped curve like the one plotted above.

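The formula above can be evaluated directly; here is a small sketch (N and p are illustrative values, not the real machine sizes) using Python's exact binomial coefficient:

```python
from math import comb

def binom_pmf(n, N, p):
    """P(n) = (N! / (n! (N-n)!)) * p**n * q**(N-n), with q = 1 - p."""
    q = 1.0 - p
    return comb(N, n) * p**n * q**(N - n)

N, p = 500, 0.68  # illustrative machine size and turnout probability
probs = [binom_pmf(n, N, p) for n in range(N + 1)]

peak = max(range(N + 1), key=lambda n: probs[n])
print(peak)                   # the bell curve peaks at floor((N+1)*p) = 340
print(abs(sum(probs) - 1.0))  # the probabilities sum to 1 (up to rounding)
```

The peak sitting at roughly N*p is exactly why the dispersion clouds discussed next should line up along a straight line through the origin.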
Suppose we now plot instead the number of voters n that did not go and vote (the abstention) as a function of the number of registered voters N at each machine. If the distribution is binomial, the points should form a cloud that opens up like the tail of a comet, with the greatest density along an imaginary line whose slope is the average abstention rate of that population. If half the people had abstained, this cloud would lie along the line n = N/2, halfway up to the 45-degree line n = N; since in the case of the recall the abstention was 32%, the cloud lies along the lower line n = 0.32N.

Below is such a plot of the number of people n that did not go and vote, for all of the centers in Miranda state, as a function of the number of voters registered per machine N:

[Figure] Plot of the number of voters that abstained as a function of the number of registered voters N at each machine in Miranda state.

This is a textbook example of what one should get from a process that follows a binomial distribution. Thus, the first conclusion is that, in terms of the choice between going to vote or not, the data from the recall vote behaves in Miranda state, and nationally, much as expected from a binomial distribution.

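A toy version of this dispersion plot (again a simulation with assumed numbers, not the Miranda data) generates binomial abstention counts for machines of varying size and checks that the axis of the "comet" cloud has the expected slope:

```python
import random

random.seed(7)
p_abstain = 0.32  # abstention rate quoted in the text

# Hypothetical machines with varying registered-voter counts.
sizes = [random.randint(200, 600) for _ in range(2000)]
abstained = [sum(random.random() < p_abstain for _ in range(N)) for N in sizes]

# Least-squares slope of a line through the origin: the axis of the cloud.
slope = sum(n * N for n, N in zip(abstained, sizes)) / sum(N * N for N in sizes)
print(round(slope, 2))  # very close to 0.32, well below the n = N/2 line
```

A scatter of `abstained` against `sizes` would show the single comet-tail cloud described above, widening as N grows.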
The same logic should apply to the SI and NO votes, since each represents a choice between two possibilities and should therefore also follow a binomial distribution. If the vote had split exactly 50%/50% between Si and No, and one plotted the number of votes n for either option as a function of the number of actual voters N at each machine, the cloud would spread below the 45-degree line n = N, centered along the line n = N/2. Since the No won the recall vote, the dispersion plot for the NO votes should show a cloud above the n = N/2 line, and the corresponding cloud of SI votes, also plotted below, should fall below that line.

However, what is observed is completely different, as seen in the next graph of the number of NO votes n (in Miranda state, as an example) as a function of the number of voters N in each machine:

[Figure] Plot of the number of No voters as a function of the number of voters N at each machine.

Instead of obtaining a single cloud, one obtains two separate areas of high density with a valley of low density separating them. I have drawn three imaginary lines to guide the eye: one along the valley (the area of low density between the two thicker clouds) and one along each of the two separate clouds on either side.

Exactly the same type of behaviour is seen for the number of SI votes n in Miranda state as a function of the number of voters N in each voting center:

[Figure] Plot of the number of Si voters as a function of the number of voters N at each machine.

This plot shows the same low-density valley, where I have drawn a line to guide the eye, with the two clouds on either side of it.

Thus, the Miranda data, which conforms to a binomial distribution when one looks at the binomial process of abstention versus voting, does not conform to one when one looks at the votes themselves. In fact, according to the authors, the Si and No data for Miranda state could NEVER conform to a binomial distribution. This is the second conclusion: the data for the Si and No votes does not conform to a binomial distribution, and never could, even though it is part of the same data set that did conform to a binomial in the case of the abstention.

Even more interesting, the same type of behaviour has been seen in Zulia, Carabobo, Anzoategui, Tachira and Lara, while “textbook” behaviour is found in other states such as Falcon and Vargas; other smaller states also show the classical behaviour. This creates a big problem: how would one explain that some states behave exactly like a binomial, textbook cases with no discrepancies, while certain other states do not?

In order to try to understand this unusual behaviour, the authors plotted the histogram of occurrences for the Si and NO votes, shown in the next figure:

[Figure] Histogram of the occurrences of the Si (red) and No (blue) votes as a function of the number of votes.

There are two distributions plotted in this figure: the red bars give the distribution of occurrences of the number of Si votes for machines with N voters, and the blue bars the corresponding distribution for the number of NO votes. As you can see, it is as if the Si votes had had a piece chopped out for machines in which the number of registered voters was above 250 and up to 350. This data is for Miranda state, but if one looks at a similar histogram at the national level, the same type of “chopped up” binomial distribution is observed. This is the third conclusion: the distribution is a binomial that appears to have part of it “chopped up”, as if part of the Si votes had been shifted to No votes.

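As an illustration of the idea (a toy simulation with invented parameters, not the authors' analysis; for simplicity the affected machines are selected here by their Si count rather than by their registered voters), shifting a fraction of Si votes to No on a selected band of machines carves exactly this kind of gap out of an otherwise binomial Si histogram:

```python
import random
from collections import Counter

random.seed(3)
p_si = 0.45  # hypothetical Si share among actual voters

# Generate binomial Si counts for machines of varying size.
machines = []
for _ in range(5000):
    N = random.randint(300, 800)                        # voters in the machine
    si = sum(random.random() < p_si for _ in range(N))  # binomial Si count
    machines.append(si)

# Toy manipulation: where the Si count falls in a chosen band,
# shift 20% of the Si votes over to No.
tampered = [si - int(0.2 * si) if 250 <= si <= 350 else si for si in machines]

def hist(counts, width=25):
    """Bucket the counts into bins of the given width."""
    return Counter(si // width for si in counts)

before, after = hist(machines), hist(tampered)
band = range(250 // 25, 350 // 25)  # bins covering the 250-349 band
lost = sum(before[b] - after[b] for b in band)
print(lost > 0)  # prints True: the band has lost mass, leaving a valley
```

The machines whose counts were moved out of the band are the same ones that populate the second cloud in the dispersion plots, which is why one tampering mechanism explains both anomalies.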
It is this same chopping that accounts for the valleys in the two unusual dispersion plots shown above.

Thus, it would seem as if the process is not at all binomial, as it should be, but instead follows a distribution into which some form of artificiality appears to have been selectively introduced, creating two types of distribution. Curiously, the abstention, which is part of the very same data, shows exactly the behaviour expected from a binomial. This result is consistent with the hypothesis of Hausmann and Rigobon that only a certain number of machines may have been manipulated; in this case, the data suggests that it was a selection based on the number of registered voters per machine that determined whether a machine's data was manipulated or not.