A dummy variable is a dichotomous variable (coded 0 or 1) which indicates whether a case belongs to a particular category of a categorical variable. Dummy variables are often used in multiple linear regression (MLR).
Dummy coding refers to the process of coding a categorical variable into dichotomous variables. For example, we may have data about participants' religion, with each participant coded as follows:
A categorical or nominal variable with three categories
Religion | Code |
---|---|
Christian | 1 |
Muslim | 2 |
Atheist | 3 |
This is a nominal variable (see level of measurement): the codes 1, 2, and 3 are arbitrary labels with no order or magnitude, so the variable would be inappropriate as a predictor in MLR. However, this variable could be represented using a series of three dichotomous variables (coded as 0 or 1), as follows:
Full dummy coding for a categorical variable with three categories
Religion | Christian | Muslim | Atheist |
---|---|---|---|
Christian | 1 | 0 | 0 |
Muslim | 0 | 1 | 0 |
Atheist | 0 | 0 | 1 |
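The full dummy coding in the table above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `full_dummy_code` and the four example participants are assumptions made up for this sketch, not part of any particular statistics package.

```python
def full_dummy_code(values, categories):
    """Return one 0/1 indicator list per category (full dummy coding)."""
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical participants, for illustration only.
religion = ["Christian", "Muslim", "Atheist", "Muslim"]
categories = ["Christian", "Muslim", "Atheist"]

dummies = full_dummy_code(religion, categories)
for category in categories:
    print(category, dummies[category])
```

Each participant has a 1 in exactly one of the three lists and a 0 in the other two, which mirrors the rows of the table.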
There is some redundancy in this dummy coding. For instance, in this simplified data set, if we know that someone is not Christian and not Muslim, then they are Atheist.
So we only need to use two of these three dummy-coded variables as predictors. More generally, the number of dummy-coded variables needed is one less than the number of categories.
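The redundancy can also be written out explicitly. Writing D_C, D_M, and D_A for the Christian, Muslim, and Atheist dummy variables (symbols introduced here only for illustration), every participant has exactly one 1 across the three columns, so:

```latex
D_C + D_M + D_A = 1
\quad\Longrightarrow\quad
D_A = 1 - D_C - D_M
```

The third dummy variable is therefore completely determined by the other two. Entering all three in a regression that also has an intercept would make the predictors perfectly collinear.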
Which dummy variable to omit is up to the researcher and should follow the logic of the research question. For example, if I'm interested in the effect of being religious, my reference (or baseline) category would be Atheist. I would then be interested in the extent to which being Christian (0 = No, 1 = Yes) or Muslim (0 = No, 1 = Yes) predicts a dependent variable (such as Happiness) in a regression analysis. In this case, the dummy coding to be used would be the following subset of the previous full dummy coding table:
Dummy coding for a categorical variable with three categories, using Atheist as the reference category
Religion | Christian | Muslim |
---|---|---|
Christian | 1 | 0 |
Muslim | 0 | 1 |
Atheist | 0 | 0 |
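Dummy coding with a reference category, as in the table above, might be sketched as follows. The function name `dummy_code`, its `reference` parameter, and the example participants are hypothetical and chosen only for this illustration.

```python
def dummy_code(values, categories, reference):
    """Return 0/1 indicators for every category except the reference."""
    if reference not in categories:
        raise ValueError(f"unknown reference category: {reference!r}")
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
        if category != reference  # the reference category gets no column
    }


# Hypothetical participants, for illustration only.
religion = ["Christian", "Muslim", "Atheist", "Muslim"]
categories = ["Christian", "Muslim", "Atheist"]

predictors = dummy_code(religion, categories, reference="Atheist")
print(predictors)
```

The Atheist participant is the one with 0 on both predictors. When these two dummy variables are the only predictors in a regression with an intercept, each coefficient is the difference in the mean of the dependent variable between that category and the reference category.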
Alternatively, I may simply want to recode the variable into a single dichotomous variable which indicates, for example, whether a participant is Atheist (0) or Religious (1), where Religious is Christian or Muslim. The coding would be as follows:
A dichotomous variable recoded from the three-category variable
Religiosity | Code |
---|---|
Atheist | 0 |
Religious | 1 |
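A recode like the one in the table above could be sketched as below. The names `RELIGIOUS_CATEGORIES` and `recode_religiosity`, and the example participants, are assumptions made for this illustration.

```python
# Categories treated as "Religious" in this simplified example.
RELIGIOUS_CATEGORIES = {"Christian", "Muslim"}


def recode_religiosity(values):
    """Return 1 for Religious participants and 0 for Atheist participants."""
    recoded = []
    for value in values:
        if value in RELIGIOUS_CATEGORIES:
            recoded.append(1)
        elif value == "Atheist":
            recoded.append(0)
        else:
            # Refuse to guess a code for a category the scheme does not cover.
            raise ValueError(f"unexpected category: {value!r}")
    return recoded


# Hypothetical participants, for illustration only.
religion = ["Christian", "Muslim", "Atheist", "Muslim"]
religiosity = recode_religiosity(religion)
print(religiosity)
```

Unlike the two-variable coding with a reference category, this recode no longer distinguishes Christian from Muslim participants.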
See also
- Dummy variable (statistics) (Wikipedia)
External links
- http://www.slideshare.net/jtneill/multiple-linear-regression/14
- http://www.utexas.edu/courses/schwab/sw388r6_fall_2006/SolvingProblems/IncorporatingNonmetricDataWithDummyVariables.ppt
- http://dss.princeton.edu/online_help/analysis/dummy_variables.htm
- http://www.psychstat.missouristate.edu/multibook/mlt08m.html
- http://www.cscu.cornell.edu/news/statnews/stnews72.pdf
- http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm