Forensic Statistics to detect Data Fabrication

From The Embassy of Good Science

What is this about?

Fabrication of numerical data is frequently cited as an example of research misconduct that can occur in all areas of research. It can be detected with statistical tools, such as the chi-square test for uniformity of digit distributions.

Why is this important?

Data fabrication is a form of research misconduct that affects the credibility of research and decreases public trust in science. In addition, the misrepresentation of data in biomedical research can be a serious threat to public health and safety [1]. Furthermore, because biomedical research is largely funded by the public – the National Institutes of Health (NIH) within the U.S. Department of Health and Human Services invests more than $32 billion a year to improve public health [2] – data fabrication can lead to the loss of public funds.

When there is a suspicion of research misconduct, an investigation is conducted by the Division of Investigative Oversight (DIO) within the U.S. Office of Research Integrity (ORI) [3]. In recent years, numerous scientific papers have been retracted, and about two-thirds of those retractions were due to scientific misconduct [4]. While the ORI and the U.S. Department of Health and Human Services support and encourage the use of methods for detecting image manipulation, some argue that statistical methods, which can detect numerical data fabrication, “get much less attention,” even though this form of fabrication occurs regularly [4].

  1. Statistical Forensics. Universal Measurement. [cited 2020 Aug 24]. Available from: http://universalmeasurement.com.au/statistical-forensics/.
  2. National Institutes of Health. [cited 2020 Aug 24]. Available from: https://www.nih.gov/grants-funding.
  3. Mosimann JE, Wiseman CV, Edelman RE. Data fabrication: Can people generate random digits? Account Res. 1995;4(1): 31-55.
  4. Pitt JH, Hill HZ. Statistical Detection of Potentially Fabricated Numerical Data: A Case Study. 2013. [cited 2020 Aug 24]. Available from: https://www.semanticscholar.org/paper/Statistical-Detection-of-Potentially-Fabricated-A-Pitt-Hill/95e5805e45bae47e050b9f430bbd7e41b3045688.

For whom is this important?

What are the best practices?

One technique for detecting fabricated numbers is to check the “rightmost digits” of the collected data. The rightmost digit is the digit that a number ends in. It is considered to be “the most random digit of a number,” which means that the rightmost digits across a data set should be uniformly distributed, as in a lottery [1][2]. Since the rightmost digits in each study should be unpredictable, the appearance of any pattern is a reason to suspect data fabrication [2][3][4][5].

Research conducted by Mosimann et al. in 1995 showed that most people cannot generate random digits when fabricating data, which makes it possible to detect potentially fabricated data [6]. They also developed a program, the “chi-square test for uniformity of the digit distributions”, which checks whether the digits in a data set look as if they were produced at random [6]. If the distribution of digits departs significantly from uniform, the numbers may have been fabricated or falsified and warrant closer examination [1][7][8].
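The idea behind such a test can be illustrated with a minimal Python sketch. This is not the program developed by Mosimann et al.; it is a generic chi-square goodness-of-fit check written for illustration. The function names, the sample values, and the choice of a 5% significance level are all assumptions. The statistic is the sum over the ten digits of (observed count − expected count)² / expected count, compared against the critical value for 9 degrees of freedom.

```python
from collections import Counter

# Critical value of the chi-square distribution with 9 degrees of freedom
# (10 digit categories minus 1) at the 5% significance level.
# The 5% level is an assumption chosen for this illustration.
CHI2_CRITICAL_DF9_ALPHA05 = 16.919


def rightmost_digits(values):
    """Return the last digit of each value.

    Values are passed as strings exactly as recorded, so that trailing
    zeros (e.g. "1.50") are not lost. Each string must end in a digit.
    """
    return [int(v.strip()[-1]) for v in values]


def chi_square_uniformity(digits):
    """Chi-square statistic for the hypothesis that digits 0-9 are equally likely."""
    counts = Counter(digits)
    expected = len(digits) / 10  # each digit expected in 1/10 of the cases
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


# Hypothetical recorded measurements (illustrative only, not real data).
recorded = ["132", "145", "127", "135", "155",
            "142", "125", "137", "165", "152"] * 5

stat = chi_square_uniformity(rightmost_digits(recorded))
flagged = stat > CHI2_CRITICAL_DF9_ALPHA05
print(f"chi-square = {stat:.1f}, flagged for closer review: {flagged}")
```

In this hypothetical sample only the digits 2, 5 and 7 ever appear in the final position, so the statistic far exceeds the critical value. A flag of this kind is a reason to look more closely, not proof of fabrication: rounding, instrument resolution or small sample sizes can also produce non-uniform terminal digits, and the chi-square approximation is usually considered reliable only when each expected count is about 5 or more.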

There are other methods that can be used to detect fabricated numerical data. For example, some journals have adopted a policy of statistical review for all papers containing numerical data [5][9]. In addition, published graph data can be compared with “raw” notebook or computer data to determine whether the numbers have been reported correctly [3][7]. Authors should present the raw data that support their findings, while journals, universities and granting agencies should promote this practice [5][10]. Some argue that the use of statistical methods will significantly reduce fabrication of numerical data [10].

  1. Mosimann J, Dahlberg J, Davidian N, Krueger J. Terminal Digits and the Examination of Questioned Data. Account Res. 2002;9(2):75-92.
  2. Hartgerink CHJ, Voelkel JG, Wicherts JM, van Assen M. Detection of data fabrication using statistical tools. PsyArXiv. 2019 Aug 19. [cited 2020 Aug 24]. Available from: https://psyarxiv.com/.
  3. Pitt JH, Hill HZ. Statistical Detection of Potentially Fabricated Numerical Data: A Case Study. 2013. [cited 2020 Aug 24]. Available from: https://www.semanticscholar.org/paper/Statistical-Detection-of-Potentially-Fabricated-A-Pitt-Hill/95e5805e45bae47e050b9f430bbd7e41b3045688.
  4. Checking the Distribution of Rightmost Digits. General Information. Best Practices in Science. [cited 2020 Aug 24]. Available from: http://bps.stanford.edu/?page_id=10827.
  5. Evans S. Statistical aspects of the detection of fraud. In: Lock S, Wells F, Farthing M, eds. Fraud and misconduct in medical research. 3rd ed. London: BMJ Publishing Group; 2001. p. 186-204.
  6. Mosimann JE, Wiseman CV, Edelman RE. Data fabrication: Can people generate random digits? Account Res. 1995;4(1):31-55.
  7. Newman A. The art of detecting data and image manipulation. Elsevier. 2013 Nov 4. [cited 2020 Aug 24]. Available from: https://www.elsevier.com/editors-update/story/publishing-ethics/the-art-of-detecting-data-and-image-manipulation.
  8. Dahlberg JE, Davidian NM. Scientific Forensic: How The Office of Research Integrity can Assist Institutional Investigations of Research Misconduct During Oversight Review. Sci Eng Ethics. 2010;16:713-765.
  9. Altman DG. Statistical reviewing for medical journals. Stat Med. 1998;17:2661-2674.
  10. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychol Sci. 2013;24(10):1875-1888.