Tuesday, May 5, 2015

Multiple Regression Analysis

This week we are learning how to run a multiple regression analysis. Regression is one of the most popular statistical techniques; it is used to predict a single continuous outcome variable from multiple independent variables (continuous as well as binary). For example, you may want to predict students' academic performance (GPA on a scale from 1 to 4) from their gender (binary), family income, age, parental education, and self-efficacy (all continuous). Having multiple independent variables in a single model lets researchers draw conclusions with more confidence because other factors that may also influence academic performance are taken into account. As you recall, a correlation, t-test, or ANOVA gives results based on a one-to-one (bivariate) relationship. These bivariate analyses are still useful, however, because they provide a descriptive view of the data.
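For instance, the GPA example above would translate into a single Stata command along these lines (the variable names here are made up purely for illustration and are not in our dataset):

* Hypothetical example: predict GPA from gender, income, age, parental education, and self-efficacy
reg gpa gender familyincome age parentaleducation selfefficacy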

Now let's look at the data we used earlier in the ANOVA analysis. Say we are interested in predicting academic performance from a few other independent variables such as gender, age, education of mother and father, computer lab use, and parental involvement. The command for multiple regression is "regress", or "reg" for short. After "reg" comes your outcome variable, followed by the rest of your independent variables. Here is what it looks like:  reg rank gender age edumom edudad labuselast parentinvolvement, beta

However, before we conduct any analysis, we first need to check the distributions of the variables we are using. You will need to examine whether any of your variables are skewed, because regression analysis is susceptible to outliers. Here is what I would do for the following variables:

sum rank, detail
tab gender
sum age, detail
tab edumom  (you can also do sum edumom, d)
tab edudad (you can also do sum edudad, d)
tab labuselast
sum parentinvolvement, detail
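
If you want to see skewness and kurtosis for several continuous variables side by side, a tabstat line like the one below can do it in one table (this is just an optional shortcut; sum, detail gives you the same statistics one variable at a time):

* Optional: skewness and kurtosis for the continuous variables in one table
tabstat rank age parentinvolvement, statistics(skewness kurtosis) columns(statistics)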

As you know, tab gives you the percentage for each response, and sum with the detail option gives you the skewness value. As you recall from the previous class on variable manipulation, the rule of thumb for a roughly normally distributed variable is a skewness of 1 or lower (nearer to zero is better) and a kurtosis lower than 3. For more information on skewness and kurtosis, please refer to this website. All of these variables look okay except edumom and edudad. See the breakdown by each level of education from the tab command below:


As you can see, only a few fathers have education above high school (only 10 of them), and the distribution is even more disproportionate for mothers (only a few of them have a high school education (3.73%) or above high school education (.60%)). These variables look skewed. You can also check this with sum, detail to examine the skewness values.


As you can see, the skewness of edudad is almost 1 and its kurtosis exceeds the cut-off point of 3, and the same is true for edumom. These values indicate a problem with normality.

What can you do?

The best approach is to recode these two variables and then check the skewness again. Let's recode edudad first, and then you can recode edumom on your own. As you know, before you recode any variable, it is recommended that you create a new variable and work with that one, so you keep the original in case you need to go back to it later. See the whole process below:


Those are the eight steps I would take to recode one variable and check its skewness. As you can see, the skewness is reduced to almost zero, and the kurtosis to below 3.
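
If the screenshot is hard to read, here is a rough sketch of what that recoding might look like in Stata. The category codes below (0 = no education, 1 = primary, 2 = high school, 3 = above high school) are only an assumption for illustration; use the actual codes from your own codebook:

* Copy the variable so the original stays intact
gen edudad_new = edudad
* Collapse the sparse upper categories into one (codes assumed for illustration)
recode edudad_new (2 3 = 2)
* Check the new distribution and its skewness/kurtosis
tab edudad_new
sum edudad_new, detail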

Now, it's ready to run our regression analysis:

reg rank gender age edumom_new edudad_new labuselast parentinvolvement, beta



The regression output above shows that the overall regression model with all of these predictors is statistically significant, F(6, 556)=10.16, p<.001, Adjusted R-squared = .09. So first look at the overall probability (#1), then at R-squared or adjusted R-squared (#2), and if #1 is significant, look at the individual p-values of each predictor (#3). The adjusted R-squared of .09 means that 9% of the variance in the outcome variable is accounted for by the independent variables in the model. Adjusted R-squared is preferred when you have many independent variables in your model because it adjusts for the number of variables used. R-squared tends to increase whenever you add more variables to the model, so adjusted R-squared helps you judge whether the variables you include are actually meaningful rather than just junk you threw in. R-squared (or adjusted R-squared) also serves as an indicator of effect size: the larger the value, the larger the effect, and hence the more desirable. Cohen (1988) indicates that in the social sciences, typical effect sizes tend to be medium.
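
If you want to see exactly how that adjustment for the number of predictors works, you can compute adjusted R-squared yourself from Stata's stored results right after running regress. This is just an optional check, not something you need to report:

* Adjusted R2 = 1 - (1 - R2)*(N - 1)/(N - k - 1), where k is the number of predictors
* Run these lines immediately after regress, which stores e(r2), e(N), e(df_m), and e(r2_a)
display 1 - (1 - e(r2)) * (e(N) - 1) / (e(N) - e(df_m) - 1)
* Compare with the value Stata stores directly
display e(r2_a)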

Based on the regression output above, the significant predictors are gender (beta=.23, p<.001), age (beta=.10, p<.05), and parental involvement (beta= -.10, p<.05). Note that if Prob > F is not significant, you do not need to report these individual predictors.

Here is an example of what the multiple regression table and the write-up look like based on the regression output above:


Write Up:

"Table 1 shows that the overall model was significant, F(6, 556)=10.16, p<.001, Adjusted R2=.09. The model explains 9% of variance accounted for by the predictor variables. Factors that predict academic performance include gender (β=.23, p<.001), age (β=.10, p<.05), and parental involvement (β= -.10, p<.05). Specifically, the results suggest that being female, being younger, and having parents who are more involved are significantly associated with better academic performance."

Practice on Your Own

Using the same data as above, please predict parental involvement from gender, age, academic performance (rank), education of mother and father, and SES-related variables (electricity, tv, cell, desk, calculator). Your response should include the items below (a possible starting command is sketched after the list):

1. A check of whether your variables are skewed and the steps taken to reduce the skewness
2. The commands used at each step
3. Regression table (follow the APA style above)
4. Write up (follow the APA style above)
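
Once you have checked (and, if needed, recoded) the skewed variables, your final regression command might look something like the sketch below. The _new suffixes are placeholders for whichever recoded versions you end up creating:

* Possible starting point; adjust variable names to match your own recoded variables
reg parentinvolvement gender age rank edumom_new edudad_new electricity tv cell desk calculator, beta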

Note that it is easier and faster for me to read your output if you include screenshots.




Tuesday, February 10, 2015

Data Conversion between Software

Today we will learn how to convert data between software packages: Excel to Stata, SPSS to Stata, and vice versa. We will use your Happiness Survey data to practice the conversion.

First of all, let's try converting from Excel -> Stata. Here are the steps: 

1. Open Stata 
2. Click on "File" 
3. Click on "Import"
4. Click on "Excel spreadsheet" 
5. Browse to the Excel file you want to import and select it
6. Check the "Import first row as variable names" box
7. Click "OK"
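
If you prefer typing commands instead of clicking through the menus, the same import can be done with a single line. Here is a minimal sketch; the file name is just a placeholder for your own Happiness Survey file:

* Import an Excel file, using the first row as variable names
import excel using "HappinessSurvey.xlsx", firstrow clear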

Here is the screenshot at Step 6:

Then you will see your data in Stata that looks like this: 

You can also export your Stata file back to an Excel file using "File -> Export -> Excel spreadsheet".
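
The command-line equivalent is export excel. Here is a minimal sketch, again with a placeholder file name; the firstrow(variables) option writes the variable names into the first row of the spreadsheet:

* Export the current dataset to Excel, writing variable names in the first row
export excel using "HappinessSurvey_from_stata.xlsx", firstrow(variables) replace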

Now let's look at how we can convert this Excel file into SPSS and then to Stata. Here are the steps:

1. Open SPSS software (I still use version 16, very old)
2. Under "File", click on "Open" then "Data"
3. Under "Files of type:" choose "Excel" 
4. Browse to your Excel data file, then select it
5. Check the box that says "Read variable names from the first row of data" 
6. Click "Continue"

Here is the screenshot at Step 5: 

Now you have your data in SPSS (converted from Excel), and you want to convert this SPSS file to Stata. Here are the steps: 

1. Under "File" click on "Save As" 
2. Under "Save as type:" choose "Stata" (anyone of them that says Stata) 
3. Under "File name:" type the name you want to save
4. Click on "Save" 

Here is the screenshot at Steps 2-4:
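
After SPSS saves the file in Stata (.dta) format, you can open it in Stata with the use command. The file name below is just a placeholder for whatever name you typed in Step 3:

* Open the converted file in Stata
use "HappinessSurvey.dta", clear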