Monday, February 10, 2014

Descriptive Data Analysis

This week we are going to learn about conducting basic data analysis such as tabulation, cross tabulation, mean, standard deviation, median, mode, minimum, and maximum. You can use the Happiness Survey data to conduct the analysis.


Tabulation

In Stata, the command "tab" (short for "tabulate") gives you three things: Freq. (frequency), Percent, and Cum. (cumulative percent). Use tab when your variables are categorical (e.g., gender, political affiliation, ethnicity). Below is an example of what the "tab" command looks like with Gender:


Note that this data set codes 1 as male and 2 as female. So the result shows that we have 8 males (33.33%) and 16 females (66.67%) in the data. Forget about the Cum. column for now, as you will not use it much (if you want to learn more about it, here is the link). Sometimes you may forget what you coded as 1 and 2, so I recommend labeling the variable Gender. You would type: label var Gender "gender: 1=male & 2=female". Here is how it looks after you enter that command:


Now you can see that 1 is male and 2 is female right in the results table.
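The label var command changes the description of the variable as a whole. Stata can also attach value labels to the codes themselves, so that the table rows read Male and Female instead of 1 and 2. Here is a minimal sketch, assuming toy data built only to match the counts quoted above (the label name gender_lbl is made up for illustration):

```stata
* Toy data matching the counts quoted above: 8 males (1) and 16 females (2).
* This is NOT the Happiness Survey file, just a stand-in for illustration.
clear
set obs 24
generate Gender = cond(_n <= 8, 1, 2)

* Variable label: describes the variable (shown in the table header).
label variable Gender "Gender of respondent"

* Value labels: attach text to each numeric code.
* "gender_lbl" is a hypothetical label name chosen for this sketch.
label define gender_lbl 1 "Male" 2 "Female"
label values Gender gender_lbl

tab Gender
```

With value labels attached, you no longer have to remember the coding each time you read a table.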

You can also use tab for continuous variables if you want to know the number or percentage of respondents with a specific value. For example, the Happiness Survey data has a question that asks for the participant's number of years of education. The responses start at 12 years of education and go up from there. Below is an example of tab Edu:

tab Edu


Suppose I want to know how many respondents have 20 years of education. The table shows that 17.39% (N=4) of the respondents have 20 years of education. It lists every value of education reported, so you can also see that one person has 23 years of education. Depending on the information you need, you can use the tab command for either categorical or continuous variables, but you will probably use it more for categorical variables.

Cross-Tabulation 

Cross-tabulation gives you the frequencies and percentages for the combinations of two categorical variables, such as gender (male vs. female) and pet ownership (owning vs. not owning). In Stata, you do this by typing the command "tab" followed by the two variables: "tab Gender Pet". Here is how it looks:

tab Gender Pet


Based on the above table, you can see that more females (coded as 2) own a pet (N=12) than males do. Pet is coded 1 for owning a pet and 0 for not owning one.

You can also request percentages to make the table easier to read. Add "column" (or "col"), "row", and/or "cell" right after the comma: "tab Gender Pet, col row cell". Here is how it looks:

tab Gender Pet, col row cell






Here is how you interpret the result above:


Row percentage (Gender): among all the females (coded as 2) who responded to the survey, 75% own a pet.

Column percentage (Pet): among all those who reported owning a pet (coded as 1), 80% are female.

Cell percentage: among all the respondents, 52.17% are females who own a pet.
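To see where the three percentages come from, you can rebuild the table from cell counts. The sketch below is only an illustration: the counts are worked backwards from the percentages quoted above (12 female pet owners being 75% of females, 80% of pet owners, and 52.17% of everyone implies 16 females, 15 pet owners, and 23 respondents who answered both questions). They are an assumption, not the actual survey file, and the variable n is a made-up helper holding the count for each cell.

```stata
* Cell counts reconstructed from the quoted percentages (an assumption).
clear
input Gender Pet n
1 0 4
1 1 3
2 0 4
2 1 12
end

* fweight treats each row as n identical respondents.
tab Gender Pet [fweight = n], row col cell

* Recompute the three percentages for the "female, owns a pet" cell by hand.
* Row: share of females who own a pet.
display "row:  " %5.2f 100 * 12 / 16
* Column: share of pet owners who are female.
display "col:  " %5.2f 100 * 12 / 15
* Cell: share of all respondents who are female pet owners.
display "cell: " %5.2f 100 * 12 / 23
```

The same count (12) is divided by a different total each time, which is why the three percentages differ.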



Summary 

In Stata, the command "summarize" (or "sum" for short) gives you Obs (the number of non-missing responses), Mean (the average score), Std. Dev. (standard deviation), Min (minimum), and Max (maximum). Use the "sum" command when you want to know the average of a variable (such as income, age, or a Likert-scale response). Your variables should be continuous (interval or ratio) or at least ordinal (for more information on scales of measurement, you can read via this link). Let's find the mean GPA in the Happiness Survey:

sum GPA


What you get here is the mean GPA, which is 3.79. Everyone seems to be doing well, and the standard deviation is small (SD=0.20), so the GPAs are clustered close to the mean. The Min and Max values tell you the lowest and highest GPA. We covered this in the previous class on data cleaning. Later, you will learn how to compare means (of GPA) across groups or categories (such as gender or ethnicity). The techniques used for mean comparisons are the t-test and ANOVA (one-way analysis of variance). For now, let's look at mean comparisons with a very basic command: tab Gender, sum (GPA). This command gives you the average GPA by gender. In other words, your question would be "Are there any differences in academic performance between male and female students?" Here is how it looks:

tab Gender, sum (GPA)


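The table above shows that, in this sample, female students have a slightly higher average GPA than their male counterparts: 3.834 for females and 3.7 for males. Whether that difference is statistically meaningful is a question for the t-test covered later.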
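If you do not have the Happiness Survey file at hand, you can try the same two commands on auto.dta, an example data set that ships with Stata. This is only a sketch: mpg (miles per gallon) stands in for GPA and foreign (domestic vs. foreign car) stands in for Gender.

```stata
* auto.dta ships with Stata; it stands in here for the Happiness Survey file.
sysuse auto, clear

* Obs, Mean, Std. Dev., Min, and Max for one variable.
summarize mpg

* summarize leaves its results in r(), so they can be reused afterwards.
display "mean = " r(mean) "   sd = " r(sd)

* Mean, standard deviation, and frequency of mpg within each category.
tab foreign, sum(mpg)
```

The stored results in r() are handy when you want to quote the mean or SD in a report without retyping it from the output window.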


Mode

The mode is the value that occurs most often. Stata does not have a stand-alone command for the mode (the egen command does offer a mode() function), but do not worry about it, as the mode is not commonly used. To show how to find the mode from your Stata output, I use the Edu variable as an example. The "tab" command lists all the responses. Type tab Edu and you will get the following:

tab Edu



Based on the table above, 17 years of education is the most common value reported by the participants: 6 people reported having 17 years of education. So 17 is the mode.
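One option worth knowing is that Stata's egen command has a mode() function, which stores the mode in a new variable so you do not have to read it off the table. Here is a minimal sketch on a handful of made-up Edu values (not the survey data); Edu_mode is a hypothetical name for the new variable.

```stata
* Made-up values for illustration only; 17 occurs most often.
clear
input Edu
12
16
17
17
17
18
20
20
end

* egen's mode() function fills the new variable with the most frequent value.
egen Edu_mode = mode(Edu)
display "mode of Edu = " Edu_mode[1]

* Compare with the frequency table.
tab Edu
```

If two values tie for most frequent, mode() returns missing unless you add an option such as minmode or maxmode to pick one of them.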


Median 

The median is the value in the middle. It is useful for variables such as income, where the mean may be misleading if a few people earn far more or far less than everyone else. The 2010 U.S. Census, for example, reports income using the median. An excerpt below shows how it is written (full link to the report is here):

For the purpose of illustration, let's use the variable Edu from our Happiness Survey and find its median (we do not have an income variable). In Stata, you get the median from "sum" by adding the "detail" option after the comma: "sum Edu, detail". You can shorten detail to "d".

sum Edu, detail 


Based on the above table, the median is the value in the 50% row (the 50th percentile), which is 18 years of education.
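To see why the median can be more informative than the mean, you can compare the two on a skewed variable. The sketch below uses price from auto.dta (an example data set that ships with Stata) as a stand-in for income, since the Happiness Survey has no income variable.

```stata
* auto.dta ships with Stata; price is right-skewed, much like income.
sysuse auto, clear

* The detail option adds percentiles; the 50% row is the median.
summarize price, detail

* The median is stored in r(p50) and the mean in r(mean).
display "median = " r(p50) "   mean = " r(mean)

* tabstat can show the mean and median side by side.
tabstat price, statistics(mean median)
```

A few expensive cars pull the mean above the median, which is the same pattern you would expect with income.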

Practice on your own

1. What is the average number of countries the respondents have been to? 
2. Are there any differences in the number of countries visited based on gender? 
3. Did those who voted in the election have a higher GPA than those who did not vote? 
4. Did those who voted in the election like to drink wine more than those who did not vote? 
5. Create two questions on your own and run the analysis. 

Good luck! 

