Welcome to CIE491: Statistical Data Analysis using STATA: April 2014

Monday, April 14, 2014

Analysis of Multiple Response Questions

Multiple response questions are commonly used in survey questionnaires; they let participants choose more than one answer. For example, students were asked to select the things they like the most about CFC (Caring for Cambodia) schools from 8 choices: school meal program, beautiful campus, beautiful garden, clean water, toilet, good time with friends, computers, and teachers. In Stata, this type of analysis is straightforward. First of all, as always, you need to check the responses of the variables being used with the "tab var" or "codebook var" command.

Before we get into the analysis, it is important to know how this set of variables is coded. In the questionnaire it appears as just one question, looking like this:

When you code this question, it becomes 8 different variables, looking like this in Stata:

Each variable is coded with the numerical value 1 if a respondent answers Yes and 0 if No. For example, if a respondent checks five of the choices (teacher, food, toilet, computer, garden), this is how it looks:

If a respondent did not choose any one of the choices, you can just leave it blank. When you perform the analysis, you can ask Stata to count only the responses coded 1.
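The coding scheme described above can be sketched with a tiny made-up dataset. This is only an illustration: the three respondents are invented, and every variable name except cfcteacher and cfcfriend is hypothetical.

```stata
* Toy illustration only: the respondents are invented, and variable names
* other than cfcteacher and cfcfriend are hypothetical.
clear
input id cfcteacher cfcmeal cfctoilet cfccomputer cfcgarden cfcfriend
1 1 1 1 1 1 0
2 0 1 0 0 0 1
3 1 0 0 1 0 1
end

* Number of choices each respondent ticked (rowtotal treats missing as 0)
egen nchoices = rowtotal(cfcteacher-cfcfriend)
list id nchoices

* Number of respondents who ticked each choice
tabstat cfcteacher-cfcfriend, statistics(sum) columns(statistics)
```

Respondent 1 mirrors the example in the text: five choices ticked, one left unticked.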

Stata does not come with a multiple response analysis command, so we need to install one; the command is called "mrtab". The installation process was covered in the ANOVA lecture, when we installed the "effectsize" command. To install "mrtab", type "findit mrtab" in your command box, and a small window will pop up for you to choose a package to install. After you have installed it, you can perform the analysis:
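If you prefer typing a command to clicking through the pop-up window, the sketch below may work as an alternative. It assumes that you have an internet connection and that the mrtab package is available from the SSC archive.

```stata
* Assumption: mrtab is hosted on the SSC archive and you are online
ssc install mrtab

* Confirm that Stata can now find the command, then read its help file
which mrtab
help mrtab
```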

use http://dl.dropboxusercontent.com/u/60032040/mrtab.dta

mrtab cfcteacher-cfcfriend, include response(1)

So what you see from the output above is that 62.75% of the participants chose good teachers as the thing they like the most about CFC schools. To make the results easier to read, we can ask Stata to sort them in descending order. We can just add "sort des" after the last option:

mrtab cfcteacher-cfcfriend, include response(1) sort des

Now you can see that the largest number is on the top and the smallest on the bottom.

You can also ask Stata to break it down by school by adding "by(school) col" after the last option:

mrtab cfcteacher-cfcfriend, include response(1) sort des by(school) col

Now the results are broken down by the three schools. Based on the output above, students at Bakong High ranked "good teachers" as the thing they like most about CFC schools (72.77%).

You can also ask Stata for a chi-square test on each individual question across the three schools, as well as an overall chi-square test of all questions together. You can do so by adding "mtest" and "chi2":

mrtab cfcteacher-cfcfriend, include response(1) sort des by(school) col mtest chi2

Look at the last column (chi2/p*). There are two rows of values there: the top one is the chi2 value and the bottom one is the significance value (p). If the p value is less than .05, you can say that the response differs significantly across schools.

Again, it is always easier to read these values in a graph. Here is how it looks as a bar graph:

Let's look at one more example with the same procedure, using this dataset that you used for your Practice Assignment earlier. Let's look at the parental involvement variables (19 of them). The responses to these variables range from 1 to 4. You want to see which involvement variables are the top choices reported by the students in terms of "often" and "most of the time." First of all, you need to look at the coding of any one of the parental involvement variables, since all 19 have the same response pattern. For example, I choose involvement item #1 (var: par_invol1):

codebook par_invol1

Now you know that the responses "often" and "most of the time" are coded 3 and 4. That means you want Stata to count only responses 3 and 4 in order to know which involvement items rank highest and lowest in the "high involvement" category (i.e., responses 3 and 4).

use http://dl.dropboxusercontent.com/u/60032040/assignment1.dta

mrtab par_invol1-par_invol19, include response(3 4) sort des

As you can see from the Stata output above, the item "My parents reminded me to study hard" ranks highest and "My parents attended meetings at my school" the lowest, for the responses "often" and "most of the time."

If you want to count those who reported just "most of the time" and see which involvement item ranks top, then you can just request Stata to include just "4". It looks like this:
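Assuming the same dataset and the same options as the earlier command, the request would presumably differ only in the number given to response():

```stata
* Assumes assignment1.dta is loaded and mrtab is installed.
* Same options as before, but counting only response 4 ("most of the time").
mrtab par_invol1-par_invol19, include response(4) sort des
```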

When you count only the "most of the time" response (as shown in the above output with the number 4 circled in red), involvement item 19 ranks top. Which responses to include depends on what you want to look into.

Again, you can break it down by gender and see whether male and female students differ significantly in their responses to each of the involvement items, for the responses "often" and "most of the time."

mrtab par_invol1-par_invol19, include response(3 4) sort des by(gender) col mtest

As you can see from the above output by gender, only involvement item #18 shows a significant difference between males' and females' reports of involvement. Boys are more likely than girls to report that their parents talked to them about a person they admire, χ²(1) = 4.14, p < .05.

You can also make a table out of the above results for easy reading. However, since there is hardly any significant difference by gender, I would not do it.

PRACTICE ON YOUR OWN

Use the question below to answer the following questions:

1. Which subject do students like the most? And the least?
2. Which subjects do students like when you count the responses from 6-8?
3. Are there any gender differences for each subject (with the responses from 6-8)?
4. Please describe your findings in APA format.

use http://dl.dropboxusercontent.com/u/60032040/mrtab.dta

Good luck!

Monday, April 7, 2014

Correlation

Update (May 12, 2014): there is a website that builds graphs showing correlations between two completely unrelated sets of data. The website's name is Spurious Correlations: Discover a New Correlation. By definition, a spurious correlation is a relationship between two variables that appears only because of (or in the presence of) a third factor. For example, a significant relationship between students dropping out of school and family socioeconomic status (SES) may depend on the students' own academic performance, meaning that family SES alone may not affect dropout if the students themselves are performing well. Therefore, to say that family SES is correlated with student dropout may be misleading, or you can say that the relationship between the two variables is spurious. This website posts interesting correlations between two things every day: http://www.tylervigen.com/

This week we are learning how to conduct correlation analysis. Correlation is a statistical technique that allows you to examine the relationship between two variables, both of which are continuous. Correlation does not tell you about a causal relationship; it describes a bi-directional relationship between two variables. The correlation is expressed by the r value, or Pearson correlation coefficient. The value of r ranges from -1 to 1; the closer it is to -1 or 1, the stronger the relationship between the two variables, and a value near 0 means little or no linear relationship. The sign (positive or negative) tells you the direction of the relationship.

For example, you may want to know if education of parents is correlated with academic involvement with their children, or if education of parents is correlated with age at first school enrollment for their children. Correlation gives you an answer in terms of the direction of the relationship. Based on the above questions, parents with higher levels of education may be more likely to be involved in their children's education (positive direction), or less likely to enroll their children late in school (negative direction).

Now let's look at our data as an example. We want to know if education of mother (var name: edumom) is correlated with mother involvement (var name: involvemother). The correlation command is pwcorr, so we type pwcorr edumom involvemother, sig. (Note that the IV and DV can be placed in any order after pwcorr. In ANOVA, the DV has to come first after oneway.)
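For reference, the Pearson r follows the standard textbook definition below. The notation is not from this lecture: x_i and y_i are the paired values of the two variables, the barred symbols are their means, and n is the number of complete pairs.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```

Because x and y enter the formula symmetrically, swapping the two variables gives the same r, which is why the order of the variables after pwcorr does not matter.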

use http://dl.dropboxusercontent.com/u/60032040/studentdata2013.dta

pwcorr edumom involvemother, sig

The above output shows that there is a correlation between education of mother and involvement. For your academic paper, you would say that:

"Mother education is positively correlated with mother involvement, r = .10, p<.01. Specifically, the higher levels of mother education, the higher the involvement the mothers have with their children's education."

You can also ask Stata to put a star on the pairs of variables that are significant by using the option star(#), where # is the significance level. You can also ask Stata to show the number of observations for each pair of variables with the option obs. It looks like this:

pwcorr edumom involvemother, sig obs star(.05)

The above output shows the star on the pair of variables, with N=804.

Now let's try correlation with more than two variables.

pwcorr edumom edudad involvemother involvefather gender ageenroll rank age, sig obs star(.05)
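With several variables, the number of observations can differ from pair to pair, because pwcorr computes each correlation from every observation available for that pair. The sketch below illustrates this on the auto dataset that ships with Stata (not the course data), where rep78 has some missing values; correlate is shown for contrast because it keeps only complete cases.

```stata
* Illustration on Stata's bundled auto dataset, not the course data
sysuse auto, clear

* Pairwise deletion: N is 74 for price-mpg but 69 for pairs involving rep78
pwcorr price mpg rep78, sig obs

* Listwise deletion: every correlation uses the same 69 complete cases
correlate price mpg rep78
```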

Now, try to put it into APA style on your own. Things that need to be reported include Pearson r, the significance value, and the number of observations (N). If the numbers of observations differ, specify the range from lowest to highest; you don't have to report all of them. They can differ because pwcorr uses pairwise deletion by default (an observation is dropped only from the pairs in which one of its values is missing). Try it! An APA-style correlation table looks like this:

Correlation examines the relationship between two or more variables one pair at a time, meaning that the relationship between two variables is assessed independently of other variables (i.e., it does not take into account the influence of other variables). It examines A-B, A-C, or B-C, so A-B is independent of the other two pairs. It is like looking at Income-Education, Education-Age, or Income-Age: correlation does not tell you if the relationship between income and education depends on age or on other variables such as gender. When you have other variables that you want to control for, you need multiple regression. Regression allows you to model your outcome variable based on two or more independent variables, all of which are continuous or dummy in nature. Categorical variables with more than two groups cannot be entered as they are in regression or correlation; they must first be converted to dummy variables.
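The difference between a pairwise correlation and a regression that controls for other variables can be sketched as follows. This is a rough illustration on Stata's bundled auto dataset rather than the course data, and it assumes a version of Stata that supports the i. factor-variable prefix (Stata 11 or later).

```stata
* Illustration on Stata's bundled auto dataset, not the course data
sysuse auto, clear

* Bivariate relationship: ignores every other variable
pwcorr price mpg, sig

* Multiple regression: the mpg coefficient now holds weight constant
regress price mpg weight

* A categorical predictor must enter as dummies; the i. prefix builds them
regress price mpg i.rep78
```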

You can also tell Stata to run the analysis for a particular group by using the "if" qualifier. "if" must be placed before the comma, not after it. For example,

pwcorr edumom edudad involvemother involvefather ageenroll rank age if gender==1, sig obs star(.05)
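If you want the same correlations for every group rather than just one, the by prefix may save some typing. A minimal sketch on Stata's bundled auto dataset (not the course data), using foreign as the grouping variable:

```stata
* Illustration on Stata's bundled auto dataset, not the course data
sysuse auto, clear

* One group only: the if qualifier goes before the comma
pwcorr price mpg weight if foreign == 1, sig obs star(.05)

* Every group in one pass: bysort sorts the data and repeats the command
bysort foreign: pwcorr price mpg weight, sig obs star(.05)
```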

For more information on correlation, you can type "help correlate" in your Stata command box. Or visit: http://www.stata.com/manuals13/rcorrelate.pdf

What if your variables are dichotomous--or more specifically binary?

A dichotomous variable is a categorical variable with only two groups. A binary variable is a type of dichotomous variable whose two groups are specifically assigned the values 0 and 1 (e.g., female=0 and male=1). A binary variable is the same as a dummy variable: it takes the values 0 and 1, representing the absence or presence of a characteristic.

Let's look at an example below between desk (having a desk at home ("1") or not ("0")), academic engagement (continuous var), and age (continuous var) of the students. We use the command:

pwcorr gender desk engagement age, sig star(.05)

What we can see based on the above output is that the variable desk is significantly correlated (or associated) with academic engagement (r=.09, p<.01) and age of the students (r= -.11, p<.001). The results specifically suggested that students who have a study desk at home tended to show higher level of academic engagement compared to those who do not have a study desk at home, and students who do not have a desk at home tended to be older compared to those who have a study desk at home.

So how do you read the binary variable "desk"? You know that having a study desk is coded as "1" and not having a study desk is coded as "0". So you look at the sign in front of the correlation (red circles). If it is positive, the group coded "1" (having a study desk) tends to score higher on the other variable. If it is negative (as in the case of the age var), the group coded "0" (not having a study desk) tends to score higher.
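One way to check your reading of the sign is to compare the group means directly. The sketch below uses Stata's bundled auto dataset (not the course data), where foreign is coded 0 for domestic and 1 for foreign cars: a positive correlation with mpg should go together with a higher mean mpg in the group coded 1.

```stata
* Illustration on Stata's bundled auto dataset, not the course data
sysuse auto, clear

* Correlation between a 0/1 variable and a continuous variable
pwcorr foreign mpg, sig

* Group means of mpg for foreign == 0 and foreign == 1
ttest mpg, by(foreign)
```

With the default equal-variance t-test, the two-sided p value should match the one pwcorr reports, since both test the same relationship.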

Now it's your turn to run your own analysis with a binary variable. Use rank and gender and then try to interpret the findings.

PRACTICE ON YOUR OWN

Examine the correlation among the following variables:

rank gender edumom edudad electricity tv cell desk calculator breakfast engagement genderrole

Then, build an APA-style table based on the results.

Finally, describe your findings in APA style.

use http://dl.dropboxusercontent.com/u/60032040/studentdata2013.dta

Friday, April 4, 2014

Creating Graphs

This week we are learning different ways of building graphs. Graphic illustrations in a report help readers quickly understand the results of your study. In this class, I am going to show you how to put together a graph in Excel, PowerPoint, or Word, or directly from Stata. There are many categories of graphs, including scatter and line plots, range and area plots, bar graphs, etc. For more information on graphs, you can type "help graph" in the Stata command box or you can click this link to visit the Stata website on graphs. For the purpose of this class, I am going to show you a very simple way of building bar graphs based on your ANOVA or t-test results.

For example, we are interested in examining the relationship between IT program participation (var name: it) and IT skills (var name: pcskills) of secondary school students in Cambodian schools. IT program participation is a categorical variable with four different groups: (1) those who completed the program, (2) those who passed the enrollment into the program in that year, (3) those who failed to enroll into the program in that year, and (4) those who have never attended the program before. The IT skills variable consists of 15 questions asking them about different IT skills that they know (e.g., do you know how to use Word, save the document, creating graph, creating PPP etc.). The response is yes or no. Those who say yes receive a score of "1" and those who say no a score of "0". We combined the 15 items and consider those with higher scores to have more IT skills. Cronbach's alpha of this variable is .89. Because the independent variable, IT program participation, has 4 groups (more than the 2 groups that a t-test can handle), we will use ANOVA for this. Thus, we run oneway pcskills it, t (the t option, short for tabulate, prints the mean of each group):

Link to data file: http://dl.dropboxusercontent.com/u/60032040/graphpractice.dta

oneway pcskills it, t

How do you create a bar graph in Word?

First of all, open the Word document, and under Insert, click on Charts, then choose the default one. A blank Excel sheet will appear for you to input your numbers. Based on the above Stata output, copy the mean values circled in red into Excel. Then replace the 4 categories shown in Excel with the 4 categories in this data (the four IT groups), and enter each mean value next to its corresponding category. Here is how the final Figure looks based on APA 6th Edition Style:

You can also expand this bar graph by student gender. Here is how it looks:

oneway pcskills it if gender==0, t
oneway pcskills it if gender==1, t

(Or you can use:
sort gender
by gender: oneway pcskills it, t)
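The same bar graphs can also be drawn directly in Stata instead of copying the means into Word. The sketch below is one possible way, not the only one; it assumes the data file linked above is reachable, and the axis title and graph names are my own choices.

```stata
* Assumes the course data file is reachable at the address given above
use http://dl.dropboxusercontent.com/u/60032040/graphpractice.dta, clear

* Mean IT skills score for each of the four IT program groups
graph bar (mean) pcskills, over(it) ytitle("Mean IT skills score") name(byprogram, replace)

* The same means split by gender within each IT program group
graph bar (mean) pcskills, over(gender) over(it) ytitle("Mean IT skills score") name(bygender, replace)
```

The name() option keeps both graphs open at once; without it, the second graph replaces the first in the Graph window.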

You can also change the look of the graph to something like this:

PRACTICE ON YOUR OWN

Create a bar graph based on the following questions:

use http://dl.dropboxusercontent.com/u/60032040/graphpractice.dta

1. Is there any significant relationship between lab use (var name: labuselast) and science interests (var name: scienceinterest)?

2. Is there any significant relationship between lab use (var name: labuselast) and science interests (var name: scienceinterest) for boys and for girls? (hint: run two separate ANOVAs, one for boys and one for girls).

3. Are students' IT skills (var name: pcskills) associated with education of mother (var name: edumom)?

4. Are students' IT skills (var name: pcskills) associated with education of father (var name: edudad)?