
October 5, 2004


I have talked about Benford’s law, its predictions, and some quoted results in previous posts, but today I finally received the green light to discuss the details of the work by Pericchi and Torres, which you can find in detail here.


Recall that Benford’s law, or the Newcomb-Benford (NB) law, applies to the distribution of first and second digits in a table of numbers. That is, if you take a sample of numbers from many “natural” populations, the first or second digits are usually not evenly distributed, but instead occur with the following frequencies:


 


Prob(1st digit = d) = log10(1 + 1/d);     d = 1, . . . , 9


For the first digit, and:


Prob(2nd digit = d) = Sum (k=1 to 9) log10(1 + 1/(10k + d));       d = 0, 1, . . . , 9


For the second digit.
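These two formulas are easy to evaluate numerically. The short Python sketch below is my own illustration of the two equations above, not code from the Pericchi and Torres paper:

```python
import math

def nb_first_digit(d: int) -> float:
    """Newcomb-Benford probability that the first digit equals d (d = 1..9)."""
    return math.log10(1 + 1 / d)

def nb_second_digit(d: int) -> float:
    """Newcomb-Benford probability that the second digit equals d (d = 0..9),
    summing over all possible first digits k = 1..9."""
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))

first = {d: nb_first_digit(d) for d in range(1, 10)}
second = {d: nb_second_digit(d) for d in range(10)}

print(f"P(1st digit = 1) = {first[1]:.4f}")   # log10(2), about 0.3010
print(f"P(2nd digit = 0) = {second[0]:.4f}")  # about 0.1197
```

Note how much flatter the second-digit distribution is than the first: it runs from about 12% for digit 0 down to about 8.5% for digit 9, which is why deviations from it are judged statistically rather than by eye.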


What Pericchi and Torres have done is check for NB behavior in the first and second digits of the results of the August 15th recall vote, for both automated and manual votes. They concentrate their analysis on the distribution of second digits because it is not affected by limited ranges of numbers. For example, if one studies the first digit and no voting machine had more than 600 Si or No votes, there will be fewer first digits from 7 to 9, since the only contributing tallies would be those from 7 to 9 and from 70 to 99; none from 700 to 999 could appear.
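To see why the second digit is more robust to range limits, here is a small simulation of my own, using an invented Benford-like sample (mantissas of 10 raised to a uniform power are Benford-distributed), not the recall data. Capping the values at 600 visibly depresses the first-digit frequencies for 7 through 9, while the second-digit frequencies barely move:

```python
import math, random

random.seed(0)

# Invented Benford-like sample: 10**U with U uniform on (0, 5) gives
# Benford-distributed leading digits; values fall in roughly 1..100000.
sample = [int(10 ** random.uniform(0, 5)) for _ in range(50_000)]

def first_digit_freq(values):
    counts = [0] * 10
    for v in values:
        counts[int(str(v)[0])] += 1
    n = len(values)
    return {d: counts[d] / n for d in range(1, 10)}

def second_digit_freq(values):
    counts = [0] * 10
    vals = [v for v in values if v >= 10]  # need at least two digits
    for v in vals:
        counts[int(str(v)[1])] += 1
    return {d: counts[d] / len(vals) for d in range(10)}

capped_sample = [v for v in sample if v <= 600]  # no tally above 600

full1, capped1 = first_digit_freq(sample), first_digit_freq(capped_sample)
full2, capped2 = second_digit_freq(sample), second_digit_freq(capped_sample)

# Benford predicts P(1st = 7) of about 0.058; capping at 600 depresses it
# because only 7..9 and 70..99 contribute, never 700..999.
print(f"P(1st = 7): full {full1[7]:.3f} vs capped {capped1[7]:.3f}")
print(f"P(2nd = 7): full {full2[7]:.3f} vs capped {capped2[7]:.3f}")
```

In this run the capped first-digit frequency for 7 drops well below the Benford value, while the second-digit frequency changes only at the third decimal place, which is the point of working with second digits.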


The first figure below compares the manual (top) and automated (bottom) distributions of the second digit of the Si vote across all voting notebooks in the recall vote.


 



 


Figure 1. Manual (top) and Automated (bottom) results for the second digit of the total of Si votes in each notebook. The smooth line in both cases is the theoretical NB-law value, and the broken line shows the result of analyzing the recall data.


Note that in the case of the Si vote, the data from the recall vote closely follows what is expected from the NB law. In fact, as will be shown below, the observed result is quite probable under the law.


However, the results are quite different for the No vote as shown below:



 


 


 


Figure 2. Manual (top) and Automated (bottom) results for the second digit of the total of NO votes in each notebook. The smooth line in both cases is the theoretical NB-law value, and the broken line shows the result of analyzing the recall data.


In the case of the NO results, while the comparison is quite reasonable for the manual notebooks, the same cannot be said for the automated machines, where an essentially flat distribution of second digits was obtained, very different from what the NB law predicts and quite different from Figure 1 for the Si vote.


In fact, one can apply exactly the same analysis to the total number of votes per machine, Si+No, and one finds the following behavior:


 



 


Figure 3. Manual (top) and Automated (bottom) results for the second digit of the total of SI+NO votes in each notebook. The smooth line in both cases is the theoretical NB-law value, and the broken line shows the result of analyzing the recall data.


In the case of the total number of votes, once again there are very important discrepancies between the predictions of the NB law and the results.


What Pericchi and Torres did then was take as the null hypothesis Ho that there was no tampering with the data. They then calculate both the P-value and the probability of the observed data occurring, assuming no tampering or intervention took place.


The P-value is defined as the probability of obtaining a result like the one measured, or a more extreme one, given the null hypothesis, i.e. assuming there was no intervention. Pericchi and Torres also calculate the approximate probability according to the Bayesian Information Criterion (BIC), which takes into account the size of the sample.
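A standard way to obtain such a P-value is a Pearson chi-square goodness-of-fit test of the observed second-digit counts against the NB expectation; whether this matches the exact procedure of Pericchi and Torres is an assumption of mine. The sketch below uses invented counts, not the actual recall data, and compares the statistic to the 5% critical value rather than reproducing their BIC calibration:

```python
import math

def nb_second_digit(d):
    """Newcomb-Benford probability of second digit d (d = 0..9)."""
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))

def chi_square_vs_nb(counts):
    """Pearson chi-square statistic of observed second-digit counts
    against the NB expectation; counts[d] is how many tallies had
    second digit d."""
    n = sum(counts)
    stat = 0.0
    for d in range(10):
        expected = n * nb_second_digit(d)
        stat += (counts[d] - expected) ** 2 / expected
    return stat

# Invented example (NOT the recall data): a perfectly flat distribution
# of second digits over 4000 tallies, like the automated No votes show.
flat = [400] * 10
stat = chi_square_vs_nb(flat)

# With 10 categories there are 9 degrees of freedom; the 5% critical
# value of the chi-square distribution with 9 d.o.f. is about 16.92.
print(f"chi-square = {stat:.1f}; reject Ho at 5%: {stat > 16.92}")
```

A flat distribution of this size is rejected decisively, while counts proportional to the NB frequencies give a statistic near zero. The BIC-based probabilities quoted by Pericchi and Torres additionally adjust the evidence for sample size, which this sketch does not attempt.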


The results for all cases are shown in the table below:


 



Table I. Evidence against the null hypothesis Ho. The data follow the Newcomb-Benford law except in the case of the automated No votes; the manual Si and No votes follow it, as do the results of the audit.


What is most remarkable about the quoted results is that the approximate probability of obtaining the measured result for the No vote is 1.34x10-36 (a decimal point followed by 35 zeros before the first significant digit!). Thus, the probability that the results were not tampered with is simply minuscule: the NB law is violated, and one should think harder about how the intervention in the data may have occurred. In my mind this proves fraud, because there is simply no way of explaining these results.


 


Even more remarkable, as quoted in the table above, is the fact that similar plots for the audited results from the cold audit performed on Aug. 18th show that they do follow the NB law:


 


 


 


Figure 4. Si (top) and No (bottom) results for the second digit of the audited results. The smooth line in both cases is the theoretical NB-law value, and the broken line shows the result of analyzing the audit data.


Thus, the audited results for the Si and the No follow the NB law, despite the much smaller sample size of the audit. Once again, the results from the audit and the actual vote are quite different, indicating not only fraud, but that the sample for the audit was carefully picked! I would like the Carter Center, Taylor, Rubin and Weisbrot to explain away this result. I challenge them to do so!

On Mathematical Models of the recall vote and fraud part XI: 3rd Simon Bolivar seminar with a very strong result.

September 25, 2004

The 3rd Simon Bolivar Seminar on statistical analysis of the recall referendum took place last Thursday, with two talks, by Jose Huerta and Luis Raul Pericchi, on studies that I have discussed here before, and a third talk by Carenne Ludeña looking at the results from a critical point of view. In some sense I did not learn that much from the talks, since I was already aware of the results, but I did learn quite a bit about the progress made by others, not only in studies of the results themselves but in new avenues that try to correlate, for example, the centers that had “anomalous” results with those that received data during the day on August 15th. But more on that later.


Luis Raul Pericchi et al: Methods to indirectly verify the non-intervention of an election


 


I have mentioned Pericchi’s work earlier, since he is the mathematician who has been applying Benford’s law to the recall vote. Basically, Benford’s law allows for the detection of manipulation of data, in this case the results of the recall vote: by looking at the frequency of occurrence of either the first or second digit of the numbers, one may be able to show that the data has been tampered with.


 


What Pericchi did was to look at the first two digits separately. This is done because the first-digit test may not be the most accurate: the numbers may be bounded in a range, for example if no voting machine had either Si or No votes above 900. (This is an invented example.)


 


The results are similar for the first two digits, but I will convey those for the second digit. What is done is to compare the observed frequency of the digits with the expected one and perform a statistical test, either to determine the probability of such an occurrence directly, or to calculate the so-called P-value, a number used to decide whether or not the data was intervened. A P-value below 0.05 is considered an indication that the data was intervened. In the case of the second digit I will quote both the P-value and the probability.


 


Second digit results:


 


- Manual centers, Si votes: P-value = 0.0032 and Prob ~ 5%. This is inconclusive.

- Automated centers, Si votes: P-value = 0.02 and Prob ~ 20%. Suggests non-intervention.

- Manual centers, No votes: P-value ~ 0.15 and Prob ~ 44%. Suggests non-intervention.

- Automated centers, No votes: P-value ~ 0.000… and Prob ~ 0%. Indicates intervention.


 


Essentially, the frequency distribution of the second digit of the No votes in the automated centers was found to be flat, a uniform distribution of digits, not at all what is expected from Benford’s law and very different from what is found for the No votes in the manual centers.


 


The same result was also found for the total number of votes at each center, that is, the sum of Si plus No votes, which showed the same pattern: the manual centers were fine, but the automated centers were intervened. This result is quite strong and cannot be dismissed easily; voting records usually follow Benford’s law, and in this case the manual centers are shown to behave correctly, making a very strong case for intervention of the data in the automated centers.


 


Even more interesting, when Benford’s law was tested on the No votes in the audited machines, the results were quite different, with a P-value of 0.24 and a probability of 48%, very different from the overall results, suggesting there was something different about the sample.


 


For skeptics, I repeat: similar behavior was found for both the first and second digits, in which the No vote numbers and the totals indicate intervention, since the probabilities of this happening otherwise are extremely low. It is going to be extremely difficult to “explain away” this result.


 


Recall also that my pedestrian use of Benford’s law to test the Proyecto Venezuela exit poll matches very well what is expected. While I did not perform any statistical tests, the deviations of both the Si and No numbers from what is expected do not appear to be significant, and the frequency distribution is certainly not flat.


 


To close, Pericchi also mentioned that he has obtained results similar to Jimenez’s on coincidences, using different techniques, in a so far less detailed study.


 


Jose Huerta and Jesus Gonzales: Comparison of the recall vote and other electoral processes


 


Huerta presented a more detailed version of the work I posted about earlier, in which he compared the votes from 1999, 2000 and 2004. He finds that there is more predictive correlation at the municipality level between the 2000 and the 2004 votes than between the 1999 and 2000 votes. Huerta, a social scientist who studies poverty, concludes that this is very surprising, not only from a political point of view, given what has happened in the country in those four years, but also from a social point of view, since poverty, unemployment and crime are all up.


 


Huerta made a couple of comments that I found to be quite interesting and inconsistent with what is known. One, the growth in the electoral registry is larger in rural areas than in urban areas, by 18% to 14%, inconsistent with statistical data from the Government and with the fact that there is no evidence of a reversal of the migration trend of the last forty years. But the second comment was perhaps the most surprising: Huerta finds that the largest proportion of changes in the electoral registry were from urban areas to rural areas, which makes no sense whatsoever. His suggestion is that this was done on purpose so that the manual centers would match the national automated vote.


 


Carenne Ludeña: A critical view of the models used to study the recall vote.


 


Ludeña basically tried to point out where bias or assumptions may affect the results, leading to a conclusion that may suggest fraud but is model-based. The talk had some interesting points and considerations, but I found nothing compelling in it. She pointed out, for example, how the Hausmann and Rigobon model of errors may be flawed, proposing an alternative, but I found the alternative less compelling than the original model. Essentially, she said that the exit polls and the signatures for the recall may have had a correlation factor due to external pressures. However, in my mind these correlations did not exist, as the two processes were different.


 


The signatures were going to be public, which meant that some of those who wanted to sign did not, for fear of retaliation. In the exit polls the situation is different: whether you are pressured into lying depends on where you are voting, not how. Essentially, in a Si-dominated center people may feel pressured to say they voted Si, but the opposite is true if the center is dominated by No voters. If it were true that the No won by 60-40%, then the correlation she points out should not exist, or should not be important.


 


Other comments:


 


1) There are many people working on this problem, and they are now getting into the details of how the intervention may have been implemented. Perhaps the most interesting comment I heard was about communications between the voting machines and the servers. Essentially, the machines were not supposed to communicate during the day at all, and the data flow was not supposed to be bidirectional, in the sense that while handshakes are to be expected, more data should not flow from the servers to the voting machines. That is not what happened. The data transmission record exists in detail for all machines, and the data is quite interesting:


 


- Not all machines had communications during the day.


- Calls were terminated in two different ways, either by the server or by the voting machine. In one of the two (I don’t remember which), the amount of data transmitted to the machine was larger than from the machine to the server. There appears to be a correlation between this and the “anomalous” centers with odd vote distributions.


 


 This work is still in progress.


 


2) In the work of Isbelia Martin et al that I reported on earlier, a peculiarity was observed: in some states, the dispersion of votes by machine size shows two “clouds” if one looks at the Si or No votes, instead of only one. Some have wanted to explain away this behavior by saying it reflects two geographic or social populations. The problem is that the mathematical properties of each “cloud” have inconsistencies, such as the fact that if you fit only one cloud, the intercept is not zero.


 


The above result could be explained away by artifacts in the data. But what cannot be explained away is that the intercept is the same for the Si and the No votes. There can be no correlation between the two! If anything should not be correlated, it is these two populations. There can be no justification for this coincidence, state after state, wherever the two clouds are observed!


 


If this last result is found in a few of the states where the binomial distribution is “chopped up”, in my mind there is no mathematical doubt that the data was intervened. This work is also in progress.
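The intercept argument above can be illustrated with a small sketch of my own, using synthetic data rather than the Martin et al. data or method: if votes are simply proportional to machine size, an ordinary least-squares fit should give an intercept near zero, while a displaced “cloud” produces a clearly nonzero intercept.

```python
import random

random.seed(1)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (intercept a, slope b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Synthetic machines: votes proportional to machine size plus noise.
sizes = [random.randint(200, 600) for _ in range(500)]
clean = [0.6 * s + random.gauss(0, 10) for s in sizes]          # intercept ~ 0
shifted = [0.6 * s + 50 + random.gauss(0, 10) for s in sizes]   # displaced cloud

a_clean, _ = fit_line(sizes, clean)
a_shifted, _ = fit_line(sizes, shifted)

print(f"intercept, proportional data: {a_clean:.1f}")    # close to 0
print(f"intercept, displaced cloud:   {a_shifted:.1f}")  # close to 50
```

The suspicious feature reported above is not just a nonzero intercept, which could have mundane causes, but the same intercept appearing independently in the Si and the No fits, state after state.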


 


3) One last conclusion for me is that the recall vote data shows quite a number of “strange” results. As someone said, the probability of a given person winning the lotto is very low, yet the fact that someone wins every week is not strange. In the recall vote, the mathematical studies show quite a number of strange results; it is as if the same person won the lotto week after week. In fact, few of these statistical studies show results for which the data looks reasonable or normal, and that may be the biggest abnormality or anomaly.



-There were two types of ways in which calls were terminated, either by the server or by the voting machine. In one of the two (Don’t remember which) the amount of data transmitted to the machine was larger than from the machine to the server. There appears to be a correlation between this and the “anomalous” centers with funny vote distributions.


 


 This work is still in progress.


 


2) In the work of Isbelia Martin et al that I reported earlier, a peculiarity was observed that the dispersion of votes by machine size showed two “clouds” if one looks at the Si or No votes, instead of only one in some states. Some have wanted to explain away this behavior by saying it reflects two geographic or social populations. The problem is that the mathematical properties of each “cloud” have inconsistencies, such as the fact that if you do a fit to only one cloud, the intercept is not zero.


 


The above result could be explained away by artificialities in the data. But what can not be explained away is that the intercept is the same for the Si and No votes. There can be no correlation between the two! If anything should no be correlated is these two populations. There can be no justification for this coincidence state after state where the two clouds are observed!


 


If this last result is found in a few of the states where the binomial distribution is “chopped up”, in my mind there is no doubt mathematically that the data was intervened. This work is also in progress


 


3) One last conclusion to me is that the recall vote data shows quite a number of “strange” results. As someone said, the probability of a person winning the lotto is very low, however the fact that a person does win every week is not strange. In the recall vote, mathematical studies show quiet a number of strange results; this is as if the same person wins the lotto week after week. In fact, few of this statistical studies show results for which the data is reasonable or normal and that may represent the biggest abnormality or anomaly.

On Mathematical Models of the recall vote and fraud, part XI: 3rd. Simon Bolivar seminar with a very strong result.

September 25, 2004

The 3rd. Simon Bolivar Seminar on statistical analysis of the recall referendum took place last Thursday, with two talks, by Jose Huerta and Luis Raul Pericchi, on studies that I have discussed here before, and a third talk by Carenne Ludeña looking at the results from a critical point of view. In some sense, I did not learn that much from the talks themselves, since I was already aware of the results, but I did learn quite a bit about the progress made by others, not only in studies of the results but also in new avenues, such as trying to correlate the centers that had “anomalous” results with those that received data during the day on August 15th. More on that later.


Luis Raul Pericchi et al: Methods to indirectly verify the non-intervention of an election

I have mentioned Pericchi’s work earlier, since he is the mathematician who has been applying Benford’s law to the recall vote. Basically, Benford’s law allows manipulation of data, in this case the results of the recall vote, to be detected: by looking at the frequency of occurrence of the first or second digit in a sequence of numbers, one may be able to show that the data has been tampered with.

What Pericchi did was to look at the first two digits. The second digit is emphasized because the first-digit test may not be the most accurate: the first digits may be bound in a range if, for example, no voting machine has either Si or No votes above 900 (this is an invented example).
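
For reference, the Benford (NB) frequencies for the first and second digits are easy to compute from the formulas quoted earlier on this page. A minimal sketch, with a quick check that they define proper probability distributions:

```python
import math

def benford_first(d):
    # Prob(1st digit = d) = log10(1 + 1/d), d = 1..9
    return math.log10(1 + 1 / d)

def benford_second(d):
    # Prob(2nd digit = d) = sum over k = 1..9 of log10(1 + 1/(10k + d)), d = 0..9
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))

first = [benford_first(d) for d in range(1, 10)]
second = [benford_second(d) for d in range(10)]

print([round(p, 4) for p in second])  # 0 is the most common second digit
print(round(sum(first), 6), round(sum(second), 6))  # both sum to 1.0
```

Note how much flatter the second-digit distribution is than the first-digit one, which is why detecting a deviation from it takes a proper statistical test.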

The results are similar for the first and second digits, but I will convey those for the second digit. What is done is to compare the observed frequency of the digits to the expected one and perform a statistical test, either calculating the simple probability of such an occurrence, or the so-called P value, a number used to determine whether or not the data was intervened. A P value below 0.05 is considered an indication that the data was intervened. In the case of the second digit I will quote both the P value and the probability.

Second digit results:

-Manual centers, Si votes: P value = 0.0032 and Prob ~5%. This is inconclusive.

-Automated centers, Si votes: P value = 0.02 and Prob ~20%. Suggests non-intervention.

-Manual centers, No votes: P value ~0.15 and Prob ~44%. Suggests non-intervention.

-Automated centers, No votes: P value ~0.000… and Prob ~0%. Indicates intervention.

Essentially, the frequency distribution of the No votes in the automated centers was found to be flat, a uniform distribution of digits, not at all what is expected from Benford’s Law and much different from what is found in the No votes in the manual centers.
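
To make the “flatness” test concrete, here is a sketch (not Pericchi’s actual code) of how one might put a P value on observed second-digit counts against the NB expectation: a chi-square statistic measures the distance from NB, and a Monte Carlo P value is the fraction of genuinely NB-distributed samples that land at least as far away. The counts below are invented.

```python
import math
import random
from collections import Counter

# Expected NB frequencies for the second digit, d = 0..9.
NB2 = [sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
       for d in range(10)]

def chi2(counts, probs):
    # Chi-square distance between observed digit counts and expected ones.
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def monte_carlo_p(counts, probs=NB2, trials=200, seed=1):
    # P value: fraction of samples drawn from the NB distribution whose
    # chi-square statistic is at least as large as the observed one.
    rng = random.Random(seed)
    n, observed = sum(counts), chi2(counts, probs)
    hits = 0
    for _ in range(trials):
        sim = Counter(rng.choices(range(10), weights=probs, k=n))
        if chi2([sim[d] for d in range(10)], probs) >= observed:
            hits += 1
    return hits / trials

# A perfectly flat second-digit distribution (5000 numbers, 500 of each
# digit) is essentially impossible under Benford's law:
print(monte_carlo_p([500] * 10))
```

This is only a toy: the actual study uses the real notebook counts and a proper test, but the qualitative conclusion is the same, a uniform digit distribution is wildly improbable under NB.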

The same pattern was also found for the total number of votes at each center, that is, the sum of Si plus No votes: the manual centers look fine, but the automated centers were intervened. This result is quite strong and cannot be dismissed easily; not only do voting records usually follow Benford’s law, but in this case the manual centers are shown to behave correctly, making a very strong case for intervention of the data in the automated centers.

Even more interesting, when Benford’s Law was tested on the No votes in the audited machines, the results were quite different, with a P value of 0.24 and a probability of 48%, very different from the overall results, suggesting there was something different about that sample.

For skeptics, I repeat: similar behavior was found for both the first and second digits, in which the No vote numbers and the total numbers indicate intervention, since the probabilities of this happening by chance are extremely low. It is going to be extremely difficult to “explain away” this result.

Recall also that my pedestrian use of Benford’s law to test the Proyecto Venezuela exit poll matches very well what is expected. While I did not perform any statistical tests, the differences in both Si and No numbers from what is expected do not appear to be significant and the frequency distribution is certainly not flat.

To close, Pericchi also mentioned that he has obtained results similar to Jimenez’s on coincidences, using different techniques, in a less detailed study so far.

Jose Huerta and Jesus Gonzales: Comparison of the recall vote and other electoral processes

Huerta presented a more detailed version of the work I posted earlier, in which he compared the votes from the 1999, 2000 and 2004 processes. Huerta finds that there is more predictive correlation at the municipality level between the 2000 and the 2004 vote than between the 1999 and 2000 vote. Huerta, who is a social scientist who studies poverty, concludes that this is very surprising not only from a political point of view, given what has happened in the country in those four years, but also from a social point of view, since poverty, unemployment and crime are up.

Huerta made a couple of comments that I found to be quite interesting and inconsistent with what is known. One, that the growth in the electoral registry is larger in the rural areas than in the urban areas, by 18% to 14%, inconsistent with statistical data from the Government and with the fact that there is no evidence of a reversal of the migration trend of the last forty years. But the second comment was perhaps the most surprising: Huerta finds that the largest proportion of changes in the electoral registry were from urban areas to rural areas, which makes no sense whatsoever. His suggestion is that this was done on purpose to have the manual centers match the national automated vote.

Carenne Ludeña: A critical look at the models used to study the recall vote.

Ludeña basically tried to point out where bias or assumptions may affect the results, leading to a conclusion that may suggest fraud when the conclusion is actually model-based. The talk had some interesting points and considerations, but I found nothing compelling about it. She pointed out, for example, how the Hausmann and Rigobon model of errors may be flawed, by proposing an alternative, but I found the alternative less compelling than the original model. Essentially, she said that the exit polls and the signatures for the recall may have had a correlation factor due to external pressures. However, in my mind these correlations did not exist, as the two processes were different.

The signatures were going to be public, which meant that some of those who wanted to sign did not, for fear of retaliation. In the exit polls the situation is different: whether you are pressured into lying depends on where you are voting, not how. Essentially, in a Si-dominated center people may feel pressured to say they voted Si, but the opposite is true if the center is dominated by No voters. If it is true that the No won by 60-40%, then the correlation she points out should not exist, or should not be important.

Other comments:

1) There are many people working on this problem who are now getting into the details of how the intervention may have been implemented. Perhaps the most interesting comment I heard was about communications between the voting machines and the servers. Essentially, the machines were not supposed to communicate during the day at all, and the data flow was not supposed to be bidirectional, in the sense that while handshakes are to be expected, additional data should not flow from the servers to the voting machines. That is not what happened. The data transmission record exists in detail for all machines and the data is quite interesting:

-Not all machines had communications during the day.


-Calls were terminated in two ways, either by the server or by the voting machine. In one of the two (I don’t remember which) the amount of data transmitted to the machine was larger than from the machine to the server. There appears to be a correlation between this and the “anomalous” centers with funny vote distributions.

This work is still in progress.

2) In the work of Isbelia Martin et al that I reported on earlier, a peculiarity was observed: in some states, the dispersion of votes by machine size shows two “clouds”, instead of only one, if one looks at the Si or No votes. Some have wanted to explain away this behavior by saying it reflects two geographic or social populations. The problem is that the mathematical properties of each “cloud” have inconsistencies, such as the fact that if you do a fit to only one cloud, the intercept is not zero.

The above result could perhaps be explained away by artificialities in the data. But what cannot be explained away is that the intercept is the same for the Si and No votes. There can be no correlation between the two; if anything should not be correlated, it is these two populations. There can be no justification for this coincidence, state after state, wherever the two clouds are observed!

If this last result is found in a few of the states where the binomial distribution is “chopped up”, in my mind there is no doubt mathematically that the data was intervened. This work is also in progress.

3) One last conclusion, to me, is that the recall vote data shows quite a number of “strange” results. As someone said, the probability of a given person winning the lotto is very low; however, the fact that some person does win every week is not strange. In the recall vote, mathematical studies show quite a number of strange results; this is as if the same person won the lotto week after week. In fact, few of these statistical studies show results for which the data looks reasonable or normal, and that may be the biggest abnormality or anomaly of all.

Smartmatic Seminar in Miami

September 20, 2004

When I saw this invitation to a seminar by Smartmatic at the South Florida Tech organization on September 30th., many things went through my head, but speechless may be the best way to describe me, particularly after the part about the “recent success in Venezuela”. I hope some of you in Miami can attend and give us* a blow by blow account (bold lettering by the blogger):


MEETING THE CHALLENGES OF ELECTRONIC VOTING

Summary

A Presentation by Antonio Mugica, CEO, Smartmatic, Boca Raton

One of the companies hard at work to make large-scale electronic voting tamper-proof, verifiable and affordable is Boca Raton-based Smartmatic. Having designed the technological infrastructure deployed nationwide in the recent Venezuelan presidential referendum, Smartmatic has been chosen as the special guest presenter at the South Florida Technology Alliance September 30 meeting at the Davie Campus of Nova Southeastern University.

Details

Antonio Mugica, CEO, Smartmatic Corp., will examine the many challenges faced by developers seeking to improve electoral processes with digital technology and provide a close-up look at Smartmatic’s Automated Electoral Solution (SAES). He will discuss Smartmatic’s business strategies and recent success in Venezuela. Mr. Mugica holds an Electrical Engineering degree from Simon Bolivar University. He created Smartmatic’s vision and holds over nine pending patents under his name in the U.S. 

*Devil’s Excrement will give a free one year subscription to this blog to the first report from the seminar and will post the first written description at no charge!

Rigobon on Carter Center response: Statistically incorrect

September 19, 2004

As I said in a previous post, I did not want to give my opinion on the Carter Center response to Rigobon and Hausmann until I heard from the experts, but I did use the word “silly” to refer to some parts of that report. Maybe I should have used “amateur”, and Roberto Rigobon from MIT agrees. A reader points out in the comments that Rigobon’s response is in El Universal; it must not have been in the print edition which I read.


Rigobon’s response centers on two issues:

1) The Carter Center said that the correlation between signatures and votes was the same for the full vote and for the audited sample.

2) The Carter Center said the averages in the audited sample match the averages of the vote.

These are the arguments in each case:

1) It was with respect to this part that I used the word silly, and Rigobon seems to agree. He says: “This argument is statistically incorrect because i) the correlation between a variable and itself is one, ii) the correlation between a variable and 10% of itself is also one.”

Basically, what Rigobon is saying is that the correlation coefficient, which measures how well two things follow each other, will be very similar whether you compare the signatures to the votes or the signatures to only part of the vote. Thus, if you removed part of the Si votes when you tampered with the votes, the correlation will be the same or similar, and the Carter Center has proven absolutely nothing about the problem at hand.

2) The Carter Center argues that the averages in the audited sample are similar to those of the vote. Rigobon says that this is statistically incorrect, and that you can construct a set of results that maintains the averages but in no way reflects the true results.

Rigobon gives an example using a Florida election to show how you could maintain the same averages while tampering with the results. Basically, in order for the averages to come out the same, the audit has to weight the tampered machines in the same proportion as the fraud. Imagine this: suppose the fraud involved half the machines being tampered with; then the audit would have to be performed half in the correct machines and half in the tampered ones.
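
A toy version of this construction, with invented machine counts, assuming (as in Rigobon’s argument) that the tampering also survives the audit, i.e. the audited count on a tampered machine reproduces its reported total:

```python
import random

rng = random.Random(7)

# Invented "true" Si counts for 1000 machines.
true_si = [rng.randint(150, 450) for _ in range(1000)]

# Tamper with every other machine: shave 25% of its Si vote.
reported = [v if i % 2 else int(v * 0.75) for i, v in enumerate(true_si)]

# A uniform audit samples tampered and clean machines in the same 50/50
# proportion as the full set, so if the tampering survives the audit,
# the audit average tracks the reported average, not the true one.
audit = rng.sample(range(1000), 200)
audit_avg = sum(reported[i] for i in audit) / len(audit)
reported_avg = sum(reported) / len(reported)
true_avg = sum(true_si) / len(true_si)

print(round(reported_avg, 1), round(audit_avg, 1), round(true_avg, 1))
```

The audit average is statistically indistinguishable from the reported average, even though both sit well below the true average; matching averages, by themselves, certify nothing.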

By the way, the Carter Center says that the averages were the same; however, the average number of voters per machine in the audit was 404, while in the election it was 440. I don’t know if this difference is statistically significant, but the numbers are certainly not the same. Did the Carter Center notice this difference?

While Rigobon makes no mention of it, the Carter Center report mentions a study of the random number generator, checking that it was indeed random by making it generate samples of voting machines. To me this was also silly: the random number generator in my Excel spreadsheet would do the same, today and now, but on the day of the selection of the ballot boxes to be audited I could have used it (or not!) in such a way that it picked a certain sequence of boxes, or generated an output that was internally replaced (even within Excel!) by a prearranged table.
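
The point is easy to demonstrate: a generator that looks perfectly random under later testing can still have been driven to a prearranged pick on the day it mattered. A sketch with a toy universe of 50 ballot boxes (all numbers hypothetical), brute-forcing a seed that makes the “random” selection come out as desired:

```python
import random

machines = list(range(1, 51))  # 50 hypothetical ballot boxes
wanted = [13, 42]              # the boxes an insider wants "randomly" drawn

# Search for a seed that makes the "random" audit selection come out
# exactly as prearranged; in this small space one turns up quickly.
for seed in range(10 ** 6):
    if random.Random(seed).sample(machines, 2) == wanted:
        break

print(seed, random.Random(seed).sample(machines, 2))  # the found seed, then [13, 42]
```

Anyone who later tests this generator will find it behaves perfectly randomly; only the choice of seed, made in advance, was rigged.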

Sumate has criticized the fact that the Carter Center does not identify who did this report. I imagine that the reason is to avoid the problem they have had with people directly contacting their experts to show they are wrong; academics like to preserve their academic reputation and can be convinced to change their minds, a “non-political” consequence. With this report nobody knows the author, so there is no intellectual integrity or honesty to be compromised other than that of the Carter Center. Thus, the Carter Center continues to act with superficiality and, in this new case, with less transparency than ever.

On Mathematical models of the recall vote and fraud, part X: 2nd. Simon Bolivar Seminar

September 19, 2004

On Thursday the second Simon Bolivar University seminar on Statistical Analysis of the referendum process was held. There were supposed to be three talks, but nature conspired against Luis Raul Pericchi, who was in Puerto Rico and unable to come to Venezuela due to hurricane Jeanne. A videoconference was then planned, but unfortunately the island lost all electric power, making it impossible to set up. His talk is tentatively scheduled for next Thursday.


You can find the program for these conferences here. I thought all presentations would be placed there, but only one of them has been posted so far; more on that particular one later.

-There was a talk by Rafael Torrealba from the Math Department at Universidad Centro Occidental Lisandro Alvarado. The talk would have been useful two or three weeks ago, but by now the model is too simplistic to be useful. Basically, Torrealba calculated the probability of coincidences assuming all machines have 500 voters and approximating the binomial distribution by a “box”, with zero probability beyond one standard deviation above or below the mean. Using this, Torrealba found that coincidences were about as likely as observed in the recall vote; he cited Rubin’s work, but was unaware of that of Taylor, Valladares and Jimenez. Thus, at this point the calculation is too crude to make a point.

Torrealba also showed some voter distributions from the Barquisimeto area where he lives to discuss the implications of applying a binomial distribution.

-There was a second talk by Isbelia Martin on the binomial distribution and the recall vote. She gave a more complete presentation of the results I summarized here, with much more material than what I showed; if she places her presentation online I will link it here in the future.

What she did was to present the data for a textbook binomial state, Vargas State, and compare it to the data I presented on Miranda State. There are more anomalies in the data than those I discussed, including the fact that if one does a fit through the “clouds” of results to obtain the average for each cloud, the fits do not intercept zero as they should. Additionally, she and her colleagues find that in some cases the same center has machines in both clouds, which obviously makes no sense.
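
The intercept argument is simple to see with synthetic data: if each machine’s Si votes are just a proportional sample of its voters, a least-squares line through votes versus machine size passes near the origin, while a cloud offset by a constant does not. A sketch with invented numbers, not the actual data:

```python
import random

def fit_line(xs, ys):
    # Ordinary least squares y = a*x + b; returns (slope, intercept).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(3)
sizes = [rng.randint(200, 600) for _ in range(400)]  # voters per machine

# Proportional cloud: Si votes are ~40% of machine size plus noise.
si = [int(0.4 * s + rng.gauss(0, 10)) for s in sizes]
a1, b1 = fit_line(sizes, si)

# Offset cloud: the same votes shifted by a constant 50 votes per machine.
a2, b2 = fit_line(sizes, [v + 50 for v in si])

print(round(a1, 3), round(b1, 1))  # slope near 0.4, intercept near 0
print(round(a2, 3), round(b2, 1))  # same slope, intercept near 50
```

A cloud whose fitted line misses the origin by a clearly nonzero intercept is therefore not behaving like a simple proportional sample of its machines’ voters.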

-Jimenez, Jimenez and Marcano have now placed a simplified version of their work on coincidences here. I wish everyone would make their work available like that; it would make discussions more lively and interesting.

What they have done is essentially to use what is called a bootstrap method, which is basically a simulation of the vote using the actual data from the recall referendum, modeling the details of the structure of the centers, tables (mesas) and machines. They allow all variables to fluctuate so that they do not have to assume the data is random, which it would not be if it had been intervened.
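
To give an idea of the method (this is a sketch, not the authors’ code): for each center, redistribute its actual total of Si votes at random among its machines, here under the simplifying assumption that each voter lands on each machine with equal probability, and count how often two machines in a center end up with identical totals. Repeating this many times gives a simulated probability of coincidences to compare with the observed number.

```python
import random
from collections import Counter

def coincidence_rate(centers, trials=200, seed=11):
    # centers: list of (n_machines, total_si_votes).  Returns the simulated
    # probability that a center ends up with at least two machines showing
    # exactly the same Si total, assuming each voter is assigned to a
    # machine uniformly at random (a modeling assumption, for illustration).
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        for n_machines, total in centers:
            tally = Counter(rng.randrange(n_machines) for _ in range(total))
            counts = [tally[m] for m in range(n_machines)]
            if len(set(counts)) < n_machines:  # some repeated total
                hits += 1
    return hits / (trials * len(centers))

# A toy set of centers: (machines, total Si votes).
print(coincidence_rate([(2, 800), (3, 1200), (4, 1600)]))
```

The actual study is richer, resampling at the level of tables and machines and tracking Si, No and total votes separately, but the logic is the same: compare the observed coincidence counts with the simulated distribution.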

Jimenez et al. also do a more detailed calculation of the problem by looking not only at the number of coincidences in the Si or No votes, but at the Si, No and total votes, comparing the probability of coincidences for each type of center. That is, they not only calculate how many centers had coincidences in two machines; they calculate how many centers with two machines had coincidences in any of the three numbers (Si, No, or sum of votes), how many centers with three machines did, how many with four, etc. In this fashion one has a wider set of probabilities for comparing the real data with the simulations.

They then did 1238 simulations and calculated the same probabilities for centers with 2 to 11 machines. They found that, in general, the proportion of coincidences is higher in the actual vote than in the simulations, which led them to do a test of ranges, calculating the probability that the observed number of coincidences in the recall vote may occur for each center with n = 2, 3, 4, …, 11 machines. In this manner, it is not simply a matter of asking what the probability of two machines coinciding is, but what the probability is that centers with two machines had the level of coincidences observed.

You can see the results in their paper in Table 3, but I will summarize some cases with examples:

Centers with two machines: the probability of observing the number of Si coincidences seen was 0.0323, that of the No coincidences 0.7746, and that of the total vote coincidences 0.0638. Thus, while low, these numbers of coincidences were probable.

Centers with four machines: the probability of observing that number of Si coincidences was ZERO, with the probability of the No coincidences being 0.2883 and that of the total votes coinciding 0.00807. Similarly low probabilities were observed for the total-vote coincidences in centers with 6 and 7 machines, and extremely low probabilities for the Si coincidences in centers with six machines.

The authors conclude:

-The repetitions observed in the Si vote and in the total number of voters per machine within a center are considerably larger than expected. This is strange, but probable.

-The repetitions observed in the NO votes are absolutely credible and, in many cases, close to what was expected.

-The repetitions observed in the Si votes in centers with 4 machines, and in the number of voters in centers with six machines, are the extreme cases of their analysis. In these cases the authors CAN NOT accept the hypothesis that the repetitions are due to randomness.

This last conclusion is the strongest one found in the study of the coincidences in the number of votes within a center: it says the data could not have been random.
