Monday, February 3, 2014

Data Cleaning and Editing

This week we will learn about data cleaning and editing. Data cleaning is a procedure to examine if the values of your variables are entered correctly. As you recall from the last week's class, we do data coding in order to make data (participants' responses) more manageable, in a numeric pattern so that statistical software such as Stata, SPSS, or Excel can read, hence allowing you to analyze them. So how do you go about checking if the values of your variables were entered correctly? We can check for accuracy to some extent but not entirely. For example, in your Happiness Survey data (code file), if we were to investigate if the values of variable "political affiliation" were entered correctly, we use the command: "summary" or "sum" in short, of that variable that was coded as “Pol”. The values of this variable range from 1-3 in which 1 indicates Democrat, 2 Republican, and 3 None. When you use the command “sum” it will give you Obs (your number of your respondents), Mean (or the average), Std. Dev. (standard deviation), Min (minimum) and Max (maximum).  So here how it looks like in Stata:  




Note that the possible values of Pol are 1-3, but what it shows you here is Min 1 and Max 4. So there is something wrong with the data that we entered; specifically, the value 4 should not be in the data. So the next step is to check to find out the number 4 based on the ID associated with this number. Now, we are going to use the command "tabulation" or "tab" in short. So we type "tab Pol" in the command box. Here how it looks like:

tab Pol 




Now what we see is that there are 6 respondents that have the value 4. And you know right away that Gail may have made a mistake when she entered the data with these 6 respondents. So we need to check the IDs of these 6 people. Once we know the IDs, then we can go back to your paper questionnaire. This is why we have to always ID your questionnaire before you enter the data, so that we can always go back to the original source. To find out who are these 6 respondents, we use the command: "tab". So now we "tab ID" with enforced condition only for the variable Pol that has the value of 4. The conditional command is "if". So here how it looks like with the conditional command: "tab ID if Pol==4". Note that the equal sign is double, and variables in STATA are case sensitive. Here is how the output looks like: 

tab ID if Pol==4



Now you know which questionnaire to look into. So if you have the questionnaires in front of you, you can look them up with the above IDs which are: 1, 5, 6, 7, 9, 14. Then, the next step is to fix it using the command "edit". Other conditional command will be used as well such as "|". This "|" tells Stata to select certain number of ID you would like obtain. Here, we want to obtain IDs: 1, 5, 6, 7, 9, and 14, only for Pol equal to 4. Here is how you should type in your command box: "edit Pol if (ID==1 | ID==5 | ID==6 | ID==7 | ID==9 | ID==14)". Note that we use parenthesis right after "if" so Stata knows that you want to request 1 and 5 and 6 and so on. You can also do it one at a time. For example, "edit Pol if ID==1", "edit Pol if ID==5" and so on. Either way that you are comfortable with. So here is how it looks like after you type, "edit Pol if (ID==1 | ID==5 | ID==6 | ID==7 | ID==9 | ID==14)" in your command box: 

edit Pol if (ID==1 | ID==5 | ID==6 | ID==7 | ID==9 | ID==14)




Now Stata gives you all the IDs that you requested that has the value 4 for Pol. Now you can go ahead and replace the value that you found in your paper questionnaire for Pol variable, right in above Window. 

PRACTICE ON YOUR OWN

Now it is your turn to practice on your own. To do this, please download the Excel file HERE. I changed some values in it for practice purpose. Then do the following things: 

1. Click on File
2. Import
3. Excel spreadsheet
4. Browse to the Excel file that you just downloaded from the CourseSite then Open 
5. Check on the box that says: Import first row as variable names 
6. Ok 

Here is what you are supposed to do. I have purposefully selected three variables for you to work on: 

1-Wine (Do you drink wine?). The response is 0 for No and 1 for Yes. So it is between 0 and 1. 
2-Hincred (I am incredibly happy). The response is 1-5 in which 1 is Less True and 5 is More True. 
3-Pfocus (I have difficulty concentrating). The response is 0-3 in which 0 is Never and 3 is Always. 

Your task is to find out if these three variables contain any values that are entered by mistakes. For example, you know that the variable Wine contains just 0 and 1, so the values that are larger than 1 indicates mistake and that is your task to find that out. More specifically, you what you will need to show me is the IDs with the values that were mistakenly entered. Here is exactly what I want from you: 

1-Wine, the IDs are: ..........................
2-Hincred, the IDs are: ......................
3-Pfocus, the IDs are: ........................

I look forward for your results. 

Good Luck! 

No comments:

Post a Comment