Improper data use (a bias distorting research results)

From The Embassy of Good Science

What is this about?

Researchers can handle data in a number of ways that make the results misleading.

Why is this important?

Improper data use undermines the ethos of science, and the misleading results it produces can misguide and distort the production of knowledge.

Examples of improper data use include:

  • Massaging: … extensive transformations or other maneuvers to make inconclusive data appear … conclusive
  • Extrapolating: … predicting future trends based on unsupported assumptions …
  • Smoothing: discarding data points too far removed from expected … values
  • Slanting: … selecting certain trends in the data, … discarding others which do not fit …
  • Fudging: creating data points to augment incomplete data sets …
  • Manufacturing: creating entire data sets de novo, … [1]

Data dredging means testing a large number of possible associations in a dataset to see if any of them are statistically significant. Because every additional test carries its own chance of a spurious result, data dredging tends to produce false positives.

“When a large number of associations can be looked at in a dataset where only a few real associations exist, a P value of 0.05 is compatible with the large majority of findings still being false positives.” [2]
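The arithmetic behind this point can be illustrated with a small simulation. The sketch below is illustrative only and is not taken from the cited sources: it assumes every tested association is truly null, compares two groups drawn from the same distribution with a simple z-test (unit variance assumed known), and counts how many comparisons still fall below the 0.05 threshold. The function names and the numbers of tests and observations are hypothetical choices.

```python
import random
from statistics import NormalDist, mean


def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** num_tests


def two_sided_p(sample_a, sample_b):
    """Two-sided z-test p-value for a difference in means (unit variance assumed)."""
    standard_error = (1 / len(sample_a) + 1 / len(sample_b)) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredge(num_tests, n_per_group, alpha=0.05, seed=0):
    """Run many tests on pure noise and count the 'significant' ones."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_tests):
        # Both groups come from the same distribution, so any hit is spurious.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(group_a, group_b) < alpha:
            hits += 1
    return hits


false_positives = dredge(num_tests=1000, n_per_group=30)
print("spurious hits among 1000 null tests:", false_positives)
print("chance of at least one hit in 20 tests:",
      round(familywise_error_rate(0.05, 20), 3))
```

Under these assumptions, roughly 5% of the null comparisons come out "significant", and with 20 independent tests the chance of at least one spurious hit is already well above one half. If only a handful of real associations exist among many candidates, the spurious hits can easily outnumber the real ones.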

Origin of words

There are several terms describing the act of data dredging. These include:

  • "Data Dredging"[3]
  • "Data Fishing"[4]
  • "Data Snooping"
  • "P-hacking"

  1. Sindermann, C. J. "Winning the games scientists play" (Plenum Press, NY, 1982).
  2. Smith, George Davey, and Shah Ebrahim. "Data dredging, bias, or confounding: They can all get you into the BMJ and the Friday papers."
  3. Selvin, H. C., & Stuart, A. (1966). Data-dredging procedures in survey analysis. The American Statistician, 20(3), 20-23.
  4. Grover, L. K., & Mehra, R. (2008). The lure of statistics in data mining. Journal of Statistics Education, 16(1).

For whom is this important?

What are the best practices?

Related tools

By Jensen (2000) [1]

  • New data and cross-validation
  • Sidak, Bonferroni, and other adjustments
  • Resampling and randomization techniques
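The Bonferroni and Šidák adjustments listed above can be sketched in a few lines. This is a minimal illustration and not code from Jensen's report: the p-values are made up, the function names are hypothetical, and the Šidák form shown assumes the tests are independent.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests


def sidak_threshold(alpha, num_tests):
    """Per-test threshold giving a family-wise error rate of alpha for independent tests."""
    return 1 - (1 - alpha) ** (1 / num_tests)


def flag_significant(p_values, threshold):
    """Mark which p-values fall below the adjusted per-test threshold."""
    return [p < threshold for p in p_values]


# Hypothetical p-values from five tests run on the same dataset.
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]
alpha = 0.05

bonferroni = bonferroni_threshold(alpha, len(p_values))
sidak = sidak_threshold(alpha, len(p_values))

flags_unadjusted = flag_significant(p_values, alpha)
flags_bonferroni = flag_significant(p_values, bonferroni)
flags_sidak = flag_significant(p_values, sidak)

print("unadjusted:", flags_unadjusted)
print("Bonferroni:", flags_bonferroni)
print("Sidak:     ", flags_sidak)
```

In this made-up example, four of the five results pass the unadjusted 0.05 threshold, but only the smallest p-value survives either adjustment. The Šidák threshold is slightly less strict than the Bonferroni one, and both become more demanding as the number of tests grows.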

By Suter & Cormier (2015) [2]

  • Performing one's own reviews of the sources of data,
  • Checking for retractions and corrections,
  • Requiring full disclosure of methods,
  • Acquiring original data and reanalyzing it,
  • Avoiding secondary sources,
  • Avoiding unreplicated studies or studies that are not concordant with related studies, and
  • Checking for funding or investigator biases.

Related cases

Convenience, dichotomization, stratification, regression to the mean, impact of sample size, competing risks, immortal time and survivor bias, management of missing values. [3] [4]

  1. Jensen, David. "Data Snooping, Dredging and Fishing: The Dark Side of Data Mining, A SIGKDD99 Panel Report." SIGKDD Explorations 1.2 (2000): 52-54.
  2. Suter, Glenn W., and Susan M. Cormier. "The problem of biased data and potential solutions for health and environmental assessments." Human and Ecological Risk Assessment: An International Journal 21.7 (2015): 1736-1752.
  3. Carmona-Bayonas A, Jimenez-Fonseca P, Fernandez-Somoano A, et al. Top ten errors of statistical analysis in observational studies for cancer research. Clinical & translational oncology: official publication of the Federation of Spanish Oncology Societies and of the National Cancer Institute of Mexico. 2018;20(8):954-965.
  4. Reanalysis: Ebrahim S, Sohani Z, Montoya L, et al. Reanalyses of randomized clinical trial data. JAMA 2014;312:1024-32.

Other information

Virtues & Values
Good Practices & Misconduct