Statistics


Chapter 38




Miriam Fine-Goulden, Victor Grech


Learning objectives




Types of data


Statistical methods can be applied to quantitative data, a set of numbers and values that have been measured. The type and/or method of recording of quantitative data is important since it influences the choice of statistical tests as well as the way in which the data is described and displayed. Qualitative data, by comparison, is descriptive and usually represents an expression of thoughts, feelings or experiences. There are resources available which detail appropriate methodologies for analysing qualitative data, but this will not be covered in this chapter.


Quantitative data (also referred to as variables, i.e. a characteristic, number or quantity that differs between individuals or items) can be numeric, in which a number is recorded, or categorical (Fig. 38.1).



Numeric data, in which a number is recorded, can be further subdivided into discrete or continuous datasets:



Categorical data can be:




Displaying data


The best method for displaying data depends on the type of data and the number of variables and datapoints. A good pictorial presentation of data can be an extremely effective and efficient means of communication. It is also crucial to plot the data:






Tables


Tables are a useful way to summarize and present data and can usually provide more precise numerical data than a graph.




Bar charts


Bar charts (Fig. 38.3) can be used to display a single variable, with the heights of the bars proportional to the frequency. They may also show the relationship between two variables by being grouped or stacked.
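As a rough illustration of bar heights being proportional to frequency, the sketch below counts a categorical variable and draws a text bar chart. The presenting complaints and the helper `text_bar_chart` are hypothetical and are not taken from the chapter's figures.

```python
from collections import Counter

# Hypothetical categorical data: presenting complaint for 12 children.
complaints = ["fever", "wheeze", "fever", "rash", "fever", "wheeze",
              "injury", "fever", "rash", "wheeze", "fever", "injury"]

def text_bar_chart(values):
    """Return one line per category; bar length equals the category frequency."""
    counts = Counter(values)
    width = max(len(category) for category in counts)
    return [f"{category.ljust(width)} | {'#' * n} ({n})"
            for category, n in counts.most_common()]

chart_lines = text_bar_chart(complaints)
print("\n".join(chart_lines))
```

Each `#` stands for one child, so the longest bar marks the most frequent category.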




Dot diagrams


Dot diagrams (Fig. 38.4) can be used to display continuous numeric data for a variable, for a single group or multiple groups. Each dot represents a single value. It is a simple method of conveying as much information as possible, and it is easy to see outliers and to compare the distribution of results in different groups, but it may not be practical where there are large numbers of measurements.







Describing data




Answer 38.2


A. Mean, median and confidence intervals.


See below for discussion.



Frequency distributions


The normal distribution is symmetrical and bell-shaped (Fig. 38.8). It is a familiar concept in medicine, as much of the data collected from human subjects is normally distributed, e.g. height and weight.



Data that has a non-normal distribution may be skewed, to the left or to the right. A good example of skewed data in medicine is length of hospital stay: most patients stay for a short period of time, but a small number of patients stay for an extended period, pulling the ‘tail’ of the distribution to the right.



Tests of normality and data transformation


Whether a dataset is normally distributed matters when choosing statistical tests, as some tests are only valid for normally distributed data. It may be possible to judge this by ‘eyeballing’ the data in graphical form. There are also mathematical tests that can be applied. These cannot confirm that the data is normally distributed, only that it is compatible with a normal distribution. In some cases, non-normally distributed data can be ‘transformed’, for example by taking logarithms or squaring, so that it takes on a normal distribution and certain statistical tests can be applied. The method used is determined by the nature of the data.
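A minimal sketch of a log transformation follows, assuming a small set of hypothetical lengths of stay (not data from the chapter). Skewness is used here only as a simple numerical companion to ‘eyeballing’ the distribution; it is not a formal test of normality, and the helper `skewness` is an illustrative function written for this example.

```python
import math

def skewness(values):
    """Moment-based skewness: about 0 for symmetric data, > 0 for a right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical lengths of hospital stay in days (right-skewed).
stay_days = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 14, 30]
log_stay = [math.log(x) for x in stay_days]

skew_raw = skewness(stay_days)
skew_log = skewness(log_stay)
print(f"skewness before: {skew_raw:.2f}, after log transform: {skew_log:.2f}")
```

With these values, the log transform pulls in the long right tail, so the skewness moves closer to zero.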


Tests that rely on the data being normally distributed are known as parametric tests. If datasets are large but not normally distributed, parametric tests may still work well: a property known as robustness.


Tests which make no assumptions about the normality of the data distribution are called non-parametric tests. These are almost as efficient as parametric tests for normally distributed data and superior for non-normally distributed data.



Mean and median


The mean – or average – is a familiar concept. It is calculated by adding up all the values and dividing by the total number of values. For example, the mean time (in minutes) from triage to assessment by a doctor for ten children with fever 40°C in an emergency department is the total of all the values divided by ten:


Group 1 mean:


$$\frac{39+22+48+11+19+33+42+27+28+31}{10} = 30\ \text{minutes}$$

The mean is a useful measure of the centre where values are normally distributed or close to normally distributed, but it can be affected dramatically by one or two extreme values. For example, in the group above, if there was one child who waited for a long time because the doctor was unavailable, this could have a significant effect on the results:


Group 2 mean:


$$\frac{39+22+48+11+19+33+42+27+28+231}{10} = 50\ \text{minutes}$$

The median value is another measure of the centre, and it is the actual middle value (or the mean of the two middle values if there is an even number of values), so there will be the same number of values above and below it. The median is less influenced by skewed data than the mean. In the example above, the median value will be in between the 5th and 6th values (as there are ten values, an even number – if there were 11, it would be the 6th value).


Group 1 median:


Ordered values: 11, 19, 22, 27, 28, 31, 33, 39, 42, 48

$$\frac{28+31}{2} = 29.5\ \text{minutes}$$


Group 2 median:


Ordered values: 11, 19, 22, 27, 28, 33, 39, 42, 48, 231

$$\frac{28+33}{2} = 30.5\ \text{minutes}$$


The single large value that influenced the mean in group 2 had much less effect on the median.
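The two groups can be checked with Python's standard `statistics` module. This is only a sketch; the variable names `group1` and `group2` are chosen here for illustration.

```python
import statistics

# Waiting times (minutes) from the worked example in the text.
group1 = [39, 22, 48, 11, 19, 33, 42, 27, 28, 31]
group2 = [39, 22, 48, 11, 19, 33, 42, 27, 28, 231]  # one extreme value

for name, times in (("Group 1", group1), ("Group 2", group2)):
    mean = statistics.mean(times)
    median = statistics.median(times)  # mean of the 5th and 6th ordered values
    print(f"{name}: mean = {mean} minutes, median = {median} minutes")
```

Replacing 31 with 231 moves the mean from 30 to 50 minutes, while the median only moves from 29.5 to 30.5 minutes.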


In data that is normally distributed, the mean and median values will be the same; the greater the skew of the data, the greater the difference between the median and the mean. In non-normally distributed data, the median is therefore usually more representative of the centre than the mean. However, because the median is less sensitive to changes in the data, it may be a less useful summary measure. In a table summarizing data, it may be helpful to display both values.



Data spread


As well as giving an idea of the centre of the data, we also need to know about its spread (also called variability or dispersion). The range is the difference between the highest and lowest values, and is usually reported as those two values. It is often given in brackets after the mean or median. For example, using our data for children with fever (above), ‘the mean time from triage to assessment was 30 minutes (11–48)’, or ‘the median time from triage to assessment was 30.5 minutes (11–231)’. One problem with the range is that it is influenced by outliers (extreme values). It can also depend on sample size, as the larger the sample size, the greater the range is likely to be.


A measure of spread that is not sensitive to outliers is the interquartile range, as described above under Box-and-whisker plot.
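The sketch below compares the range with the interquartile range for the two groups of waiting times. It assumes Python's `statistics.quantiles` with its default method; other software may use a slightly different quartile convention and give slightly different quartile values. The helper `spread` is written for this example only.

```python
import statistics

group1 = [39, 22, 48, 11, 19, 33, 42, 27, 28, 31]
group2 = [39, 22, 48, 11, 19, 33, 42, 27, 28, 231]  # one extreme value

def spread(values):
    """Return (range, interquartile range) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return max(values) - min(values), q3 - q1

range1, iqr1 = spread(group1)
range2, iqr2 = spread(group2)
print(f"Group 1: range = {range1}, IQR = {iqr1}")
print(f"Group 2: range = {range2}, IQR = {iqr2}")
```

The single outlier increases the range from 37 to 220 minutes, whereas the interquartile range changes far less.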


The standard deviation is a measure of the spread of data around the mean. Individual measurements will be either larger or smaller than the mean. Subtracting the mean from each value gives the difference between that value and the mean. For values below the mean this difference is negative (the sign is not important, because it is the size of the difference that matters), so all the differences are squared to make them positive and then added together.


If there is a wide spread about the mean, many values will be very different from the mean, giving a large sum; conversely, if they are tightly grouped around the mean, the sum will be small. The variance is the sum of all the squared differences divided by the total number in that sample minus one (so, for example, if there are 100 patient measurements in the sample, you would divide by 99 to get the variance). The square root of the variance is then obtained in order to ‘unsquare’ the value, and this is called the standard deviation (SD).
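The steps described above can be followed one at a time for the Group 1 waiting times. This is a sketch for illustration; the variable names are chosen here, and the result is compared with `statistics.stdev`, which also divides by the sample size minus one.

```python
import math
import statistics

group1 = [39, 22, 48, 11, 19, 33, 42, 27, 28, 31]  # minutes

mean = sum(group1) / len(group1)                    # 30.0
differences = [x - mean for x in group1]            # some negative, some positive
squared = [d ** 2 for d in differences]             # squaring makes all positive
sum_of_squares = sum(squared)                       # 1118.0
variance = sum_of_squares / (len(group1) - 1)       # divide by n - 1 = 9
sd = math.sqrt(variance)                            # 'unsquare' the variance

print(f"variance = {variance:.1f}, SD = {sd:.1f} minutes")
print(f"statistics.stdev gives {statistics.stdev(group1):.1f} minutes")
```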


Therefore:
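The calculation described in the preceding paragraphs can be written compactly as follows. The symbols are notation assumed here rather than taken from the original: $x_i$ for each measurement, $\bar{x}$ for the sample mean, $n$ for the number of measurements and $s^2$ for the variance.

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\qquad
\mathrm{SD} = \sqrt{s^2}
```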


Jun 15, 2016 | Posted by in PEDIATRICS | Comments Off on Statistics
