4.3 Article

Outlying observations and missing values: How should they be handled?

Journal

Publisher

BLACKWELL PUBLISHING
DOI: 10.1111/j.1440-1681.2007.04860.x

Keywords

bootstrapping; box-and-whisker plot; data transformation; imputation; percentiles; permutation tests; randomness; robust methods; scatterplot

Ask authors/readers for more resources

1. The problems of, and best solutions for, outlying observations and missing values are very dependent on the sizes of the experimental groups. For original articles published in Clinical and Experimental Pharmacology and Physiology during 2006-2007, the range of group sizes ranged from three to 44 ('small groups'). In surveys, epidemiological studies and clinical trials, the group sizes range from 100s to 1000s ('large groups'). 2. How can one detect outlying (extreme) observations? The best methods are graphical, for instance: (i) a scatterplot, often with mean +/- 2 s; and (ii) a box-and-whisker plot. Even with these, it is a matter of judgement whether observations are truly outlying. 3. It is permissable to delete or replace outlying observations if an independent explanation for them can be found. This may be, for instance, failure of a piece of measuring equipment or human error in operating it. If the observation is deleted, it can then be treated as a missing value. Rarely, the appropriate portion of the study can be repeated. 4. It is decidedly not permissable to delete unexplained extreme values. Some of the acceptable strategies for handling them are: (i) transform the data and proceed with conventional statistical analyses; (ii) use the mean for location, but use permutation (randomization) tests for comparing means; and (iii) use robust methods for describing location (e.g. median, geometric mean, trimmed mean), for indicating dispersion (range, percentiles), for comparing locations and for regression analysis. 5. What can be done about missing values? Some strategies are: (i) ignore them; (ii) replace them by hand if the data set is small; and (iii) use computerized imputation techniques to replace them if the data set is large (e.g. regression or EM (conditional (E) under bar xpectation, (M) under bar aximum likelihood estimation) methods). 6. If the missing values are ignored, or even if they are replaced, it is essential to test whether the individuals with missing values are otherwise indistinguishable from the remainder of the group. If the missing values have not occurred at random, but are associated with some property of the individuals being studied, the subsequent analysis may be biased.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.3
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available