Improper data use (a bias distorting research results)

What is this about?

Researchers can handle data in a number of ways that make the results misleading.

Why is this important?

Improper data use undermines the ethos of science, and the misleading results it produces can misguide and distort the production of knowledge.

Examples of improper data use include:

Massaging: … extensive transformations or other maneuvers to make inconclusive data appear … conclusive

Extrapolating: … predicting future trends based on unsupported assumptions …

Smoothing: discarding data points too far removed from expected … values

Slanting: … selecting certain trends in the data, … discarding others which do not fit …

Fudging: creating data points to augment incomplete data sets …

Manufacturing: creating entire data sets de novo, … ^[1]

Data dredging means testing so many possible associations in a dataset that some of them appear statistically significant by chance alone. It therefore produces false positive results.

“When a large number of associations can be looked at in a dataset where only a few real associations exist, a P value of 0.05 is compatible with the large majority of findings still being false positives.” ^[2]
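The arithmetic behind this quotation can be sketched in a few lines of Python. The figures used here (1,000 associations tested, 10 of them real, 80% statistical power) are illustrative assumptions, not values taken from the cited article, and the function names are hypothetical. The sketch also assumes the tests are independent.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests
    when no real association exists at all."""
    return 1 - (1 - alpha) ** n_tests


def false_positive_share(n_tests, n_real, alpha=0.05, power=0.8):
    """Expected fraction of 'significant' findings that are false positives.

    n_real is the (assumed) number of real associations among n_tests;
    power is the (assumed) chance of detecting a real association.
    """
    expected_true_positives = n_real * power
    expected_false_positives = (n_tests - n_real) * alpha
    return expected_false_positives / (
        expected_true_positives + expected_false_positives
    )


if __name__ == "__main__":
    # 20 unrelated tests at P < 0.05: about a 64% chance of a spurious hit.
    print(f"{familywise_error_rate(20):.2f}")
    # 1,000 tests, 10 real associations: about 86% of hits are false.
    print(f"{false_positive_share(1000, 10):.2f}")
```

Under these assumptions roughly 86% of the "significant" associations would be false positives, which is the situation the quotation describes.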

Origin: "Data dredging" (Selvin & Stuart, 1966); "data fishing" (Grover & Mehra, 2008); also "data snooping" and "p-hacking".

[1] Sindermann C. J. “Winning the games scientists play” (Plenum Press, NY, 1982)
[2] Smith, George Davey, and Shah Ebrahim. "Data dredging, bias, or confounding: They can all get you into the BMJ and the Friday papers."

For whom is this important?

Principal investigators, researchers, policy makers, supervisors, postdocs, journal publishers, journal editors

What are the best practices?

Related tools

By Jensen (2000) ^[1]

New data and cross-validation
Sidak, Bonferroni, and other adjustments
Resampling and randomization techniques
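Two of these safeguards can be illustrated with a minimal Python sketch: the Bonferroni and Sidak adjustments of the significance threshold, and a simple randomization (permutation) test for a difference in group means. The function names, the default of 2,000 permutations, and the fixed seed are assumptions made for illustration; they are not taken from Jensen (2000).

```python
import random
from statistics import mean


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_tests


def sidak_alpha(alpha, n_tests):
    """Per-test threshold under the Sidak adjustment (assumes independent tests)."""
    return 1 - (1 - alpha) ** (1 / n_tests)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    print(bonferroni_alpha(0.05, 20))
    print(sidak_alpha(0.05, 20))
    print(permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105]))
```

With 20 tests, both adjustments lower the per-test threshold from 0.05 to about 0.0025, so an association has to be far stronger before it counts as significant.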

By Suter & Cormier (2015) ^[2]

Performing own reviews of the sources of data,
Checking for retractions and corrections,
Requiring full disclosure of methods,
Acquiring original data and reanalyzing it,
Avoiding secondary sources,
Avoiding unreplicated studies or studies that are not concordant with related studies, and
Checking for funding or investigator biases.

Related cases

Convenience, dichotomization, stratification, regression to the mean, impact of sample size, competing risks, immortal time and survivor bias, management of missing values. ^[3] ^[4]

[1] Jensen, David. "Data Snooping, Dredging and Fishing: The Dark Side of Data Mining, A SIGKDD99 Panel Report." SIGKDD Explorations 1.2 (2000): 52-54.
[2] Suter, Glenn W., and Susan M. Cormier. "The problem of biased data and potential solutions for health and environmental assessments." Human and Ecological Risk Assessment: An International Journal 21.7 (2015): 1736-1752.
[3] Carmona-Bayonas A, Jimenez-Fonseca P, Fernandez-Somoano A, et al. Top ten errors of statistical analysis in observational studies for cancer research. Clinical & Translational Oncology: official publication of the Federation of Spanish Oncology Societies and of the National Cancer Institute of Mexico. 2018;20(8):954-965.
[4] Reanalysis: Ebrahim S, Sohani Z, Montoya L, et al. Reanalyses of randomized clinical trial data. JAMA 2014;312:1024-32.

The Embassy Editorial team, Iris Lechner, Natalie Evans, and Bjørn Hofmann contributed to this theme. The latest contribution was on Feb 01, 2021.
