Hypothesis test 2: the shuffle test

The table below shows the marks obtained by ten students in an exam.

Exam marks data

Mark    Sex
37      F
46      F
56      F
49      F
78      F
50      M
55      M
81      M
55      M
53      M

The average (mean) of the marks for the female (F) students is 53.2%, whereas for the males (M) it is 58.8%. The difference (female minus male) is 53.2% – 58.8%, or –5.6%: on average, the males scored 5.6 percentage points higher.
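The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable and function names are illustrative, and the sign convention (female average minus male average) is the one used in the text.

```python
# Exam marks from the table above, split by sex.
female = [37, 46, 56, 49, 78]
male = [50, 55, 81, 55, 53]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


female_mean = mean(female)
male_mean = mean(male)

# Sign convention from the text: female average minus male average.
difference = female_mean - male_mean

print(f"female mean = {female_mean:.1f}")
print(f"male mean   = {male_mean:.1f}")
print(f"difference  = {difference:.1f}")
```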

However, this is just a small group of students. With another group of students we might get a different answer. This is where the hypothesis test comes in.

The null hypothesis is that whether a student is male or female has no relationship to the mark they are likely to get. There are no systematic differences between males and females, and the average mark for all male students (who might have taken the exam) is the same as the average for all female students. The actual difference we have observed in our small sample of ten students is a 5.6% difference in the average marks of males and females. How likely is this to have occurred if this null hypothesis is true? We can work this out by a simple computer simulation.

The actual marks obtained by the female students were 37, 46, 56, 49 and 78. However, if we assume that the null hypothesis is true and there are no systematic differences between males and females, each female student was just as likely to have obtained any of the ten marks in the list. This means we can simulate what is likely to happen in a group of ten students by shuffling the marks at random – an example of a process known as re-sampling. The first shuffle (resample) tried was:

First Shuffle of Exam marks data

Mark    Sex
78      F
46      F
55      F
37      F
56      F
53      M
81      M
55      M
49      M
50      M

The marks here are just the same as the marks in the previous table, but they are arranged in a different order. If the null hypothesis is right, this order is just as likely as the original order, and any other shuffle.
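A single shuffle of this kind might be sketched in Python as follows. The seed is an assumption made only so that the example is repeatable; the shuffle it produces will not match the one tabulated above. The first five shuffled marks are treated as the “female” marks and the last five as the “male” marks.

```python
import random

# All ten marks, in the order of the original table (F first, then M).
marks = [37, 46, 56, 49, 78, 50, 55, 81, 55, 53]

# Assumed seed, used only so that rerunning gives the same shuffle.
rng = random.Random(1)

shuffled = marks[:]        # copy, so the original order is preserved
rng.shuffle(shuffled)      # rearrange the ten marks at random, in place

# The sex labels stay where they are: first five rows F, last five rows M.
pseudo_female = shuffled[:5]
pseudo_male = shuffled[5:]

print("'female' marks:", pseudo_female)
print("'male' marks:  ", pseudo_male)
```

The same ten marks appear before and after the shuffle; only their pairing with the F and M labels changes.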

It’s now easy to work out the average of the “female” marks (54.4%) in this shuffle, and of the “male” marks (57.6%), and the difference between them (–3.2%). The next shuffle was:

Second Shuffle of Exam marks data

Mark    Sex
78      F
55      F
50      F
81      F
55      F
46      M
56      M
53      M
49      M
37      M

This has a “female” average of 63.8%, and a “male” average of 48.2%. The male average is now lower than the female average, so this difference counts as positive: +15.6%.
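The differences quoted for the two shuffles can be reproduced with a small helper. This is a sketch only: `mean_difference` is a hypothetical name, and it assumes the marks are listed in table order with the first five rows labelled F.

```python
def mean_difference(marks_in_table_order, n_female=5):
    """Female average minus male average.

    The first n_female marks are taken as the rows labelled F,
    the remaining marks as the rows labelled M.
    """
    female = marks_in_table_order[:n_female]
    male = marks_in_table_order[n_female:]
    return sum(female) / len(female) - sum(male) / len(male)


# The two shuffles tabulated in the text.
first_shuffle = [78, 46, 55, 37, 56, 53, 81, 55, 49, 50]
second_shuffle = [78, 55, 50, 81, 55, 46, 56, 53, 49, 37]

d1 = mean_difference(first_shuffle)
d2 = mean_difference(second_shuffle)

print(f"first shuffle:  {d1:+.1f}")
print(f"second shuffle: {d2:+.1f}")
```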

If we do this lots of times, we will see how much difference there is likely to be between the average mark of five female students and five male students if the null hypothesis is true and there are no systematic differences between females and males. The graph below is based on 200 shuffles.

This graph shows how the results of all these shuffles compare with the actual data. The bars represent the differences between the average marks of females and males worked out from the shuffles. All of these differences are between –20 and +20, with the commonest values clustered around zero. The bar on the left, for example, represents 11 shuffles with differences between –19 and –15.5, and the next bar represents 17 shuffles with differences between –15.5 and –12. The solid line represents the actual difference from the data (–5.6), and the dotted line represents the situation where the females average 5.6 more than the males. This graph suggests that the data could easily have arisen from the shuffling process. In other words, the data is consistent with the null hypothesis. It is entirely plausible that there really are no systematic differences between males and females.
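A rough text version of such a graph can be produced by repeating the shuffle 200 times and counting how many differences fall in each bar. The sketch below is not the workbook's method. The bar boundaries (starting at –19, each 3.5 wide) are assumed from the two bars described above, and the seed is arbitrary, so the counts will differ from those in the original graph.

```python
import random
from collections import Counter

marks = [37, 46, 56, 49, 78, 50, 55, 81, 55, 53]

N_SHUFFLES = 200
BIN_START = -19.0   # assumed left edge of the first bar
BIN_WIDTH = 3.5     # assumed bar width
N_BINS = 11         # enough bars to cover -19 up to +19.5


def shuffled_difference(marks, rng, n_female=5):
    """Shuffle the marks and return 'female' mean minus 'male' mean."""
    pool = marks[:]
    rng.shuffle(pool)
    female, male = pool[:n_female], pool[n_female:]
    return sum(female) / len(female) - sum(male) / len(male)


rng = random.Random(2024)  # arbitrary seed, for repeatability only
differences = [shuffled_difference(marks, rng) for _ in range(N_SHUFFLES)]

# Assign each difference to a bar and count the shuffles in each bar.
counts = Counter(int((d - BIN_START) // BIN_WIDTH) for d in differences)

for b in range(N_BINS):
    low = BIN_START + b * BIN_WIDTH
    print(f"{low:6.1f} to {low + BIN_WIDTH:6.1f} | {'#' * counts[b]}")
```

With these ten marks the largest possible difference in either direction is 18, which is why every shuffle lands between –20 and +20.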

More formally, we work out the probability of getting a difference as extreme as –5.6 if the null hypothesis is true. In this case this means the probability of the difference being –5.6 or less, or +5.6 or more. What do you think this probability is? (You should be able to see roughly what this is from the graph.)

This probability is called a p value (or observed significance level). In this case it is fairly large, indicating that it is entirely plausible that the data could have arisen from the null hypothesis.
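One way to estimate this probability is to count the proportion of shuffles whose difference is at least as far from zero as the observed one. The following is a minimal sketch under that assumption; `shuffle_p_value` is a hypothetical helper, the seed is arbitrary, and an estimate from only 200 shuffles will vary noticeably from run to run.

```python
import random

female = [37, 46, 56, 49, 78]
male = [50, 55, 81, 55, 53]
marks = female + male

observed = sum(female) / len(female) - sum(male) / len(male)


def shuffle_p_value(marks, n_female, observed, n_shuffles, rng):
    """Estimate the two-sided p value by random shuffling.

    Returns the fraction of shuffles whose 'female' minus 'male'
    difference is at least as extreme as the observed difference.
    """
    pool = marks[:]
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pool)
        f, m = pool[:n_female], pool[n_female:]
        diff = sum(f) / len(f) - sum(m) / len(m)
        # Small tolerance so that exact ties count as "as extreme".
        if abs(diff) >= abs(observed) - 1e-9:
            extreme += 1
    return extreme / n_shuffles


p_estimate = shuffle_p_value(marks, 5, observed, 200, random.Random(0))
print(f"observed difference = {observed:.1f}")
print(f"estimated p value   = {p_estimate:.3f}")
```

Raising `n_shuffles` gives a steadier estimate at the cost of more computation.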

The next table shows a similar set of data – except that here the females do better, and the pattern looks more consistent.

Exam marks data (2)

Mark    Sex
55      F
60      F
56      F
72      F
78      F
50      M
55      M
50      M
55      M
53      M

A similar process of 200 shuffles produced this graph:

What do you think the p value is here? What would you conclude from this?
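With only ten students, the p value can also be worked out without random shuffling, by listing every possible way of choosing which five of the ten marks are “female” (there are 252 such splits). This exhaustive enumeration is a swapped-in alternative to the 200 random shuffles described here, not the method used to draw the graphs, and `exact_p_value` is a hypothetical helper name.

```python
from itertools import combinations


def exact_p_value(female, male):
    """Two-sided p value from every possible split of the pooled marks."""
    pool = female + male
    total = sum(pool)
    n_f, n_m = len(female), len(male)

    def scaled_diff(f_sum):
        # (female mean - male mean) multiplied by n_f * n_m,
        # so that the comparison uses whole numbers only.
        return f_sum * n_m - (total - f_sum) * n_f

    observed = abs(scaled_diff(sum(female)))
    splits = 0
    extreme = 0
    # Choose positions rather than values, so equal marks held by
    # different students are treated as distinct.
    for chosen in combinations(range(len(pool)), n_f):
        splits += 1
        f_sum = sum(pool[i] for i in chosen)
        if abs(scaled_diff(f_sum)) >= observed:
            extreme += 1
    return extreme / splits


p_first = exact_p_value([37, 46, 56, 49, 78], [50, 55, 81, 55, 53])
p_second = exact_p_value([55, 60, 56, 72, 78], [50, 55, 50, 55, 53])

print(f"first data set:  p = {p_first:.3f}")
print(f"second data set: p = {p_second:.3f}")
```

Comparing the two printed values shows how much less compatible the second data set is with the null hypothesis than the first.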

The shuffles, and the resulting graphs and p values, are produced by an Excel workbook, which is available from the companion website for Making Sense of Statistics. If you have Excel on your computer, you should be able to use this workbook to run through this example with “live” shuffles. The ‘Read this’ worksheet explains how to use the workbook. The data in the Sample sheet includes the first set of exam marks as the first 10 rows; to follow this example using the workbook, simply delete the rest of the data. You can also use this workbook for other examples: for instance, you might want to compare the profitability of two different types of company, or the proportions of males and females who smoke (the first column in the Data sheet should be headed smoker, with 1 representing a smoker and 0 representing a non-smoker), or a measure of the performance of two different drugs in a medical trial.

This content was written by Michael Wood, author of Making Sense of Statistics.

Palgrave Macmillan Ltd