Given a dataset, your task may be to describe its main features. For example, you may want to know how dispersed the data are about a central value. This will give you an idea of the variability of the dataset. Descriptive statistics provide summaries of the main features of a dataset.
If all the dataset information is available to you and you have suitable software to analyse it, e.g. R, SPSS or Minitab, then your task is to put the information into the program and analyse it appropriately. However, the dataset information you want may not be fully available to you. This is particularly true for large populations where it is not feasible to extract all the data, e.g. voting preferences or the number of cars per household across the UK population. In this case the usual method is to sample the population and analyse the sample.
We take the relevant information from a small percentage of the actual population, adequate for our purpose, to draw conclusions and test hypotheses. For example, if we wanted to answer the question "How many hours do teachers in the state sector work in an average week in the UK?", it would not be feasible to collect this information from every such teacher. Therefore we take an appropriate sample of the population and make reasonable assumptions about the overall population from our sample data.
For more information on this see sampling.
\begin{equation} \bar {x} = \frac{1}{n}\sum\limits_{i=1}^{n}\ x_i. \end{equation}
For an odd number of data points the median is the value that is in the middle. For example in the ordered set $1,3,3,3,4,6,7,7,8$ the median is the fifth number in this set which is $4$.
If you have an even number of observations then there is no single middle number, so you must average the middle two values, i.e. the $\big(\frac{n}{2}\big)^\text{th}$ and $\big( \frac{n}{2} +1\big)^\text{th}$ values. For example, in the ordered set $10,13,14,14,16,17,17,17,18,19$, the median is the average of the middle two numbers: the fifth, $16$, and the sixth, $17$. Hence the median is $16.5$.
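The mean and median rules above can be checked with a short Python sketch. The function name `manual_median` is illustrative only; the two datasets are the ordered sets used in the examples above.

```python
from statistics import mean, median

odd_set = [1, 3, 3, 3, 4, 6, 7, 7, 8]
even_set = [10, 13, 14, 14, 16, 17, 17, 17, 18, 19]


def manual_median(values):
    """Median by the rule in the text: middle value, or average of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value (0-based index n // 2).
        return ordered[mid]
    # Even n: average of the (n/2)th and (n/2 + 1)th values (1-based).
    return (ordered[mid - 1] + ordered[mid]) / 2


print(manual_median(odd_set))   # 4
print(manual_median(even_set))  # 16.5
print(mean(even_set))           # 15.5, the arithmetic mean (sum 155 over n = 10)
```

The standard library functions `statistics.mean` and `statistics.median` give the same results, so the manual version is only there to make the rule explicit.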
\begin{equation} \text{Range} = \text{Largest value} - \text{Smallest value}. \end{equation}
\begin{equation} s^2 = \frac{1}{n-1}\sum\limits_{i=1}^n(x_i - \bar {x})^2. \end{equation}
\begin{equation} s = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n(x_i - \bar {x})^2}. \end{equation}
\begin{equation} \sigma^2 = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar {x})^2. \end{equation}
\begin{equation} \sigma = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar {x})^2}. \end{equation}
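As a rough illustration of the range, variance and standard deviation formulas above, the following Python sketch evaluates each one directly. The dataset is an arbitrary small example, and the variable names are illustrative.

```python
import math
import statistics

data = [1, 3, 3, 3, 4, 6, 7, 7, 8]
n = len(data)
x_bar = sum(data) / n

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Sum of squared deviations from the mean, shared by all four formulas.
squared_deviations = sum((x - x_bar) ** 2 for x in data)

# Sample versions divide by n - 1.
sample_variance = squared_deviations / (n - 1)
sample_sd = math.sqrt(sample_variance)

# Population versions divide by n.
population_variance = squared_deviations / n
population_sd = math.sqrt(population_variance)

print(data_range)                     # 7
print(round(sample_variance, 4))      # 5.75
print(round(population_variance, 4))  # 5.1111
```

For comparison, `statistics.variance` and `statistics.stdev` use the $n-1$ divisor, while `statistics.pvariance` and `statistics.pstdev` use $n$.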
\begin{equation} \begin{split} Q1 &= \left(\frac{n+1}{4}\right)^\text{th} \text{ smallest observation} \\ Q3 &= \left(\frac{3(n+1)}{4}\right)^\text{th} \text{ smallest observation}. \end{split} \end{equation}
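The quartile positions above are often not whole numbers. The Python sketch below assumes linear interpolation between the two neighbouring observations in that case; this is one common convention, not something stated in the text, and the function names are illustrative.

```python
def value_at_position(values, position):
    """Value at a 1-based, possibly fractional, position in the ordered data.

    Assumption: fractional positions are resolved by linear interpolation
    between the two neighbouring observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    if position <= 1:
        return float(ordered[0])
    if position >= n:
        return float(ordered[-1])
    lower_index = int(position)        # whole part of the position (1-based)
    fraction = position - lower_index  # how far towards the next observation
    lower = ordered[lower_index - 1]
    upper = ordered[lower_index]
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Return (Q1, Q3) using the (n + 1)/4 and 3(n + 1)/4 positions."""
    n = len(values)
    q1 = value_at_position(values, (n + 1) / 4)
    q3 = value_at_position(values, 3 * (n + 1) / 4)
    return q1, q3


q1, q3 = quartiles([10, 13, 14, 14, 16, 17, 17, 17, 18, 19])
print(q1, q3)   # 13.75 17.25
print(q3 - q1)  # interquartile range: 3.5
```

Other software may use a different interpolation rule, so quartiles reported by R, SPSS or Minitab can differ slightly from these values.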
\begin{equation} \text{Standard Error (SE)} = \frac{s}{\sqrt{n}}. \end{equation}
\begin{equation} \begin{split} \bar{x} &= \text{Arithmetic mean} \\ x_i &= \text{Individual data value} \\ s &= \text{Sample standard deviation} \\ n &= \text{Sample size} \\ \sum\limits_{i=1}^n x_i &= \text{Sum of the data values } (x_1,\ldots,x_n). \end{split} \end{equation}
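A minimal Python sketch of the standard error formula, using an arbitrary sample of ten values; the variable names are illustrative.

```python
import math
import statistics

sample = [10, 13, 14, 14, 16, 17, 17, 17, 18, 19]
n = len(sample)

# Sample standard deviation s (n - 1 divisor).
s = statistics.stdev(sample)

# Standard error of the mean: s divided by the square root of n.
standard_error = s / math.sqrt(n)

print(round(s, 3))               # 2.718
print(round(standard_error, 3))  # 0.86
```

Because of the $\sqrt{n}$ in the denominator, quadrupling the sample size only halves the standard error.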
There are various types of data, including:
When we only have grouped data (i.e. data already split into class intervals), we do not know the exact values of the data items, only which class they lie in, so we cannot calculate the exact mean, mode, median or variance. However, we can still make reasonable estimates of these values. This is done using the midpoint of each class interval (so we are effectively assuming that every data value in an interval equals its midpoint). We then use these midpoints, weighted by the class frequencies, to calculate estimates of the mean and variance in the usual way.
'''For example:''' suppose that you manage a big company that owns 50 different outlets and the sales turnover totals are as follows:
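[Table: number of outlets per sales turnover class. Class midpoints £50,000, £150,000, £250,000, £350,000; number of outlets 10, 14, 20, 6.]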
If you wanted to make an approximation for the mean and variance of the sales turnover in your stores you would do the following calculations:
Firstly you calculate the sum of the data: $\sum x = (10 \times £50,000) + (14 \times £150,000) + (20 \times £250,000)+ (6 \times £350,000) = £9,700,000$.
Then you calculate the sum of the squares of the data: $\sum x^2 = (10 \times £50,000^2) + (14 \times £150,000^2) + (20 \times £250,000^2)+ (6 \times £350,000^2) = 2.325 \times 10^{12}$.
Since the sample size is $50$ we have $n = 50$.
The approximate mean is thus: $\bar x = \dfrac{\sum x}{n} = \dfrac{£9,700,000}{50} = £194,000$
and the approximate variance is: $ s^2 = \dfrac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} = \dfrac{2.325\times10^{12} - 1.8818\times10^{12}}{49} \approx 9.04\times10^{9}$.
As you can see we are using the usual formulas but approximating the (grouped) values using the midpoints.
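The grouped-data calculation can be reproduced with a few lines of Python. This is a sketch using the class midpoints and frequencies from the sums above; the variable names are illustrative.

```python
import math

# Class midpoints (in pounds) and the number of outlets in each class.
midpoints = [50_000, 150_000, 250_000, 350_000]
frequencies = [10, 14, 20, 6]

n = sum(frequencies)

# Each midpoint stands in for every data value in its class.
sum_x = sum(f * m for f, m in zip(frequencies, midpoints))
sum_x2 = sum(f * m ** 2 for f, m in zip(frequencies, midpoints))

approx_mean = sum_x / n
approx_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
approx_sd = math.sqrt(approx_variance)

print(f"n = {n}")
print(f"sum x = {sum_x:,}")
print(f"sum x^2 = {sum_x2:.4g}")
print(f"approximate mean = {approx_mean:,.0f}")
print(f"approximate variance = {approx_variance:.3g}")
print(f"approximate standard deviation = {approx_sd:,.0f}")
```

Taking the square root of the variance gives the approximate standard deviation in pounds, which is easier to interpret than the variance itself.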
Test yourself: Calculate measures of central tendency and spread
For more information on the topics covered in this section see descriptive statistics and presenting data. To develop these ideas further see hypothesis testing.