# Homework 7: Multiple Linear Regression (Chapters 10 and 11)

Requires SPSS software.

Linear regression, in one form or another, is one of the most widely used statistical techniques for examining the effects of a group of explanatory (predictor) variables on an outcome variable. We use multiple linear regression when the dependent variable is continuous (a scale or interval-level variable). The independent variables can be at any level of measurement (continuous, scale, interval, or categorical). However, any categorical independent variable has to be coded in a very specific way called dummy coding.
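As a rough illustration of dummy coding outside SPSS, the Python sketch below turns a categorical variable into 0/1 indicator variables. The region labels and the helper name `dummy_code` are hypothetical and are not taken from the states data file. One category is left out as the reference group, in the same way that non-Southern states serve as the reference group for South in the class example.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator list per category, skipping the reference category."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Hypothetical region labels for five cases (not taken from STATES07withSouth.sav).
regions = ["South", "West", "South", "Northeast", "West"]

# "Northeast" is the reference group, so it gets no indicator of its own.
dummies = dummy_code(regions, reference="Northeast")

for name, column in dummies.items():
    print(name, column)
```

A variable with k categories needs only k - 1 dummy variables, because a case that scores 0 on every dummy must belong to the reference group.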

One of the most important parts of regression analysis is choosing the variables that should be in the model. Always use theory to develop your regression models. The variables in our models are also influenced by our literature review (i.e., what variables have been used in the past).

Here’s the basic drop down menu path to run a regression model.

Analyze > Regression > Linear > choose your dependent variable and your independent variables (if there is a reason to enter the independent variables in a particular order or in steps, use the Next button) > Statistics > click on R squared change > Continue > Paste or OK.

Class example for the states data (STATES07withSouth.sav). The dependent variable is crime rate (CRS28). I am having you enter the independent variables in 2 steps. In the first step enter the variable South, and in the second step enter poverty rate (PVS493) and high school dropout rate (EDS130). Often we enter independent variables in 2 or more steps to test what we call mediated models (mediated models are beyond the scope of this class) or to see the added effect of an additional variable. I am having you use 2 steps to demonstrate what happens to the slopes (b and beta) when multicollinearity is present (that is, when independent variables are correlated with each other; they almost always are).

Here’s the path in the drop down menu

Analyze > Regression > Linear > for your dependent variable choose CRS28 (crime rate) and for your independent variables choose South, click Next to start the second step, then choose both PVS493 (poverty) and EDS130 (high school dropout) > Statistics > click on R squared change > Continue > Paste or OK.

Here’s what the command looks like:

    REGRESSION
      /MISSING LISTWISE
      /STATISTICS COEFF OUTS R ANOVA CHANGE
      /CRITERIA=PIN(.05) POUT(.10)
      /NOORIGIN
      /DEPENDENT CRS28
      /METHOD=ENTER South
      /METHOD=ENTER PVS493 EDS130.

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .317 (a) | .100 | .080 | 927.5063 | .100 | 4.914 | 1 | 44 | .032 |
| 2 | .579 (b) | .335 | .288 | 816.0265 | .235 | 7.422 | 2 | 42 | .002 |

The columns from R Square Change through Sig. F Change are the Change Statistics.

a. Predictors: (Constant), South

b. Predictors: (Constant), South, Public High School Drop Out Rate: 2002, Poverty Rate: 2005

There are 2 key statistics to interpret in the model summary table: the adjusted R square and the Sig. F Change. The adjusted R square is how much of the variation in the dependent variable is explained by the independent variables. We use the adjusted R square rather than the R square because the R square never goes down when a variable is added to the model, even a useless one; the adjusted R square corrects for the number of independent variables. The adjusted R square for model 1 is .080, so 8 percent of the variation in crime rate is explained by the variable South (South is the only variable in model 1). The adjusted R square for model 2 is .288, so 28.8 percent of the variation in crime rate is explained by the variables in model 2 (South, poverty rate, and high school dropout rate). Now look at Sig. F Change for model 2 (p=.002). It is less than .05, which means the change in R square from model 1 to model 2 (.235) is significantly greater than zero (i.e., statistically speaking, model 2 is better than model 1).
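To see where these numbers come from, here is a minimal Python sketch that recomputes the adjusted R square and the F Change by hand. It uses the standard textbook formulas, which I am assuming match what SPSS does, and it takes the sums of squares from the ANOVA table for this example because the R square values in the summary table are rounded. The function and constant names are my own.

```python
N_CASES = 46  # the Total df in the ANOVA table is 45, so n = 45 + 1

SS_TOTAL = 42079128.837
SS_REGRESSION = {1: 4227339.657, 2: 14111359.422}  # regression sum of squares by model
N_PREDICTORS = {1: 1, 2: 3}                        # independent variables in each model


def r_square(model):
    """Share of the total variation explained by the model."""
    return SS_REGRESSION[model] / SS_TOTAL


def adjusted_r_square(model):
    """R square corrected for the number of independent variables."""
    k = N_PREDICTORS[model]
    return 1 - (1 - r_square(model)) * (N_CASES - 1) / (N_CASES - k - 1)


def f_change(smaller, larger):
    """F test of the gain in R square when moving from the smaller to the larger model."""
    df1 = N_PREDICTORS[larger] - N_PREDICTORS[smaller]
    df2 = N_CASES - N_PREDICTORS[larger] - 1
    gain = r_square(larger) - r_square(smaller)
    return (gain / df1) / ((1 - r_square(larger)) / df2)


for model in (1, 2):
    print(f"Model {model}: R square = {r_square(model):.3f}, "
          f"adjusted R square = {adjusted_r_square(model):.3f}")
print(f"F change from model 1 to model 2 = {f_change(1, 2):.3f}")
```

Rounded to three decimals, the results agree with the Model Summary table. Note that the F Change is built from the change in R square, not from the change in adjusted R square.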

ANOVA (a)

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 4227339.657 | 1 | 4227339.657 | 4.914 | .032 (b) |
| | Residual | 37851789.179 | 44 | 860267.936 | | |
| | Total | 42079128.837 | 45 | | | |
| 2 | Regression | 14111359.422 | 3 | 4703786.474 | 7.064 | .001 (c) |
| | Residual | 27967769.415 | 42 | 665899.272 | | |
| | Total | 42079128.837 | 45 | | | |

a. Dependent Variable: Crime Rate per 100,000: 2005

b. Predictors: (Constant), South

c. Predictors: (Constant), South, Public High School Drop Out Rate: 2002, Poverty Rate: 2005

In the ANOVA table we want to know if the F statistic is significant. We look at the Sig. and see that for model 1 p=.032; since the Sig. is less than .05, the F is significant. That tells us that model 1 is more accurate in predicting a state's crime rate than if we simply used the mean of the states' crime rates. Model 2's F is also significant (p=.001). We do not use the ANOVA table to decide which model is best. Remember, we use the F Change (the test of the change in R square) in the model summary table to determine which model is best.
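The F statistic in the ANOVA table is the regression mean square divided by the residual mean square, and each mean square is a sum of squares divided by its df. The short Python sketch below redoes that arithmetic from the table values; the names are my own and it is only a check on the table, not a replacement for the SPSS output.

```python
# (sum of squares, df) for each model, copied from the ANOVA table.
ANOVA = {
    1: {"regression": (4227339.657, 1), "residual": (37851789.179, 44)},
    2: {"regression": (14111359.422, 3), "residual": (27967769.415, 42)},
}


def mean_square(sum_of_squares, df):
    """A mean square is a sum of squares divided by its degrees of freedom."""
    return sum_of_squares / df


def overall_f(model):
    """Overall F for a model: regression mean square over residual mean square."""
    ms_regression = mean_square(*ANOVA[model]["regression"])
    ms_residual = mean_square(*ANOVA[model]["residual"])
    return ms_regression / ms_residual


for model in (1, 2):
    print(f"Model {model}: F = {overall_f(model):.3f}")
```

Each F here compares one model against using the mean alone, which is why the ANOVA table cannot tell us whether model 2 improves on model 1.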

Coefficients (a)

| Model | Variable | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 3535.603 | 169.339 | | 20.879 | .000 |
| | South | 636.490 | 287.128 | .317 | 2.217 | .032 |
| 2 | (Constant) | 1911.262 | 575.062 | | 3.324 | .002 |
| | South | 440.686 | 293.502 | .219 | 1.501 | .141 |
| | Poverty Rate: 2005 | 53.544 | 48.931 | .163 | 1.094 | .280 |
| | Public High School Drop Out Rate: 2002 | 238.408 | 70.901 | .434 | 3.363 | .002 |

a. Dependent Variable: Crime Rate per 100,000: 2005

The Coefficients table is the key table for understanding the relationship between the dependent and independent variables. Typically we would only interpret model 2 because it is significantly better than model 1; however, I want to show you the effects of multicollinearity (when two or more independent variables are correlated with each other). Typically your independent variables are going to be correlated with each other; this is OK as long as they are not extremely correlated (r > .75). Perfect collinearity is a violation of the assumptions of multiple regression.

Now let’s look at the Coefficients table. The key statistics are the b, beta, and Sig. (p). The b and beta are both slopes; they tell us what happens to the dependent variable when the independent variable changes. The b tells us that for a one-unit change in the independent variable we expect a change of b units in the dependent variable, controlling for the other variables in the model. The beta is a standardized slope (in standard deviation units, like a z-score); it tells us that for a one standard deviation change in the independent variable we expect a change of beta standard deviations in the dependent variable, controlling for the other variables in the model. Sig. (p) tells us if the b and beta are significantly different from zero. Note that if the b is significant, the beta will be too. If we want to know which variable in a regression model has the strongest effect on the dependent variable, we use the beta (not the b) because the betas are standardized and can be compared across variables measured in different units. The variable with the largest absolute value of beta has the strongest effect in the model.
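Two quick checks on the Coefficients table can be sketched in Python. First, the t column is each b divided by its standard error, and Sig. is the p value for that t. Second, the strongest predictor is the one with the largest absolute beta. The values are copied from model 2 in the table; the function names are my own.

```python
# (variable, B, Std. Error, Beta) for model 2, copied from the Coefficients table.
MODEL_2 = [
    ("South", 440.686, 293.502, 0.219),
    ("Poverty Rate: 2005", 53.544, 48.931, 0.163),
    ("Public High School Drop Out Rate: 2002", 238.408, 70.901, 0.434),
]


def t_statistic(b, std_error):
    """The t value reported by SPSS is the slope divided by its standard error."""
    return b / std_error


def strongest_predictor(rows):
    """Return the variable with the largest absolute standardized slope (beta)."""
    return max(rows, key=lambda row: abs(row[3]))[0]


for name, b, std_error, beta in MODEL_2:
    print(f"{name}: t = {t_statistic(b, std_error):.3f}, |beta| = {abs(beta):.3f}")
print("Strongest effect:", strongest_predictor(MODEL_2))
```

Comparing the b values instead would be misleading here, because South is a 0/1 variable while poverty and dropout are rates measured on their own scales.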

Because South is a dummy variable (dichotomous; Southern states are coded 1 and all other states are coded 0), its interpretation is a little different than if it were a continuous variable (i.e., there can only be a 1-unit change because the actual values are 0 or 1). In model 1 the South's b is 636.49 and its Sig. is .032 (b=636.49, p=.032). The South's crime rate is on average 636.49 higher than the rest of the country's crime rate, and the difference is significant. Now let's look at the constant in model 1 (b=3535.603, p=.000). The constant in a regression model is the Y intercept; another way of saying that is the constant is the predicted value of the regression equation when all the independent variables equal zero. Sometimes the constant is meaningful and other times it is not (when there is a variable in the model that cannot be zero). In model 1 the constant is the mean crime rate of the non-Southern states (b=3535.603). If we want to know the mean crime rate of the Southern states, we add their b to the constant (3535.603+636.49=4172.093).

The regression equation for model 1 is ŷ = 3535.603 + 636.49(South),

or ŷ = 3535.603 + 636.49X1
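The model 1 equation can be turned into a small Python function to show how a dummy variable produces the two group means. This is only a sketch; the function name is my own and the coefficients are copied from the Coefficients table.

```python
CONSTANT = 3535.603  # mean crime rate of the non-Southern states
B_SOUTH = 636.490    # difference between the Southern and non-Southern means


def predicted_crime_rate(south):
    """Model 1 prediction; south is the dummy variable (1 = Southern, 0 = all others)."""
    if south not in (0, 1):
        raise ValueError("South is a dummy variable and must be 0 or 1")
    return CONSTANT + B_SOUTH * south


print(f"Non-Southern states: {predicted_crime_rate(0):.3f}")
print(f"Southern states:     {predicted_crime_rate(1):.3f}")
```

With only one dummy variable in the model, the two possible predictions are simply the two group means.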

Now let’s look at model 2. The regression equation (or line of best fit) for model 2 is:

ŷ = 1911.262 + 440.686(South) + 53.544(Poverty) + 238.408(HSDropout)

The b for the constant is 1911.262, so it is the expected crime rate when all the independent variables equal zero (a non-Southern state with zero poverty and zero high school dropouts). I don’t think there is such a place in the USA. Controlling for poverty and dropout, the South’s crime rate is on average 440.686 higher than the rest of the country’s crime rate; however, this is not significant (p=.141). Controlling for the South and dropout, for a one-unit increase in poverty we expect a 53.544 unit increase in the crime rate; however, this is not significant (p=.280). Controlling for the South and poverty, for a one-unit increase in the dropout rate we expect a 238.408 unit increase in the crime rate, and this is significant (p=.002). Looking at the betas, high school dropout has the strongest effect on crime rate (beta=.434).
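To make "controlling for the other variables" concrete, the Python sketch below plugs values into the model 2 equation. The poverty rate of 12 and dropout rate of 4 are made-up inputs chosen only for illustration, not values for any real state, and the function name is my own.

```python
# Coefficients for model 2, copied from the Coefficients table.
COEFFICIENTS = {
    "constant": 1911.262,
    "south": 440.686,
    "poverty": 53.544,
    "dropout": 238.408,
}


def predicted_crime_rate(south, poverty, dropout):
    """Model 2 prediction: the constant plus each slope times its variable."""
    return (COEFFICIENTS["constant"]
            + COEFFICIENTS["south"] * south
            + COEFFICIENTS["poverty"] * poverty
            + COEFFICIENTS["dropout"] * dropout)


# Hypothetical Southern state with a poverty rate of 12 and a dropout rate of 4.
example = predicted_crime_rate(south=1, poverty=12.0, dropout=4.0)

# Raise dropout by one unit while holding South and poverty fixed.
difference = predicted_crime_rate(south=1, poverty=12.0, dropout=5.0) - example

print(f"Predicted crime rate: {example:.3f}")
print(f"Change for one more unit of dropout: {difference:.3f}")
```

The change equals the b for dropout no matter which values South and poverty are held at, which is exactly what the b is meant to express.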

Why did the b and beta for the South get smaller from model 1 to model 2? It is because the South is correlated with poverty. So some of the South’s effect on crime rate is better explained by poverty; that is, there is some multicollinearity between the South and poverty. I ran the correlations among the three independent variables so you could see how they are related. We may want to consider dropping either the South or poverty from the model because both are non-significant and are correlated with each other. When we have non-significant variables in the model, the model is said to be over-specified, and in combination with multicollinearity we might make the mistake of saying poverty does not matter (i.e., it is not significant) when it really does.

Correlations

| Variable | Statistic | Poverty Rate: 2005 | Public High School Drop Out Rate: 2002 | South |
|---|---|---|---|---|
| Poverty Rate: 2005 | Pearson Correlation | 1 | .208 | .527 |
| | Sig. (2-tailed) | | .166 | .000 |
| | N | 51 | 46 | 51 |
| Public High School Drop Out Rate: 2002 | Pearson Correlation | .208 | 1 | .035 |
| | Sig. (2-tailed) | .166 | | .818 |
| | N | 46 | 46 | 46 |
| South | Pearson Correlation | .527 | .035 | 1 |
| | Sig. (2-tailed) | .000 | .818 | |
| | N | 51 | 46 | 51 |
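The correlation table can be checked against the r > .75 rule of thumb mentioned earlier. The Python sketch below does that with the three correlations copied from the table; the short variable labels and function names are my own.

```python
# Pearson correlations among the independent variables, copied from the Correlations table.
CORRELATIONS = {
    ("Poverty", "HSDropout"): 0.208,
    ("Poverty", "South"): 0.527,
    ("HSDropout", "South"): 0.035,
}

THRESHOLD = 0.75  # this handout's rule of thumb for "extremely correlated"


def extreme_pairs(correlations, threshold=THRESHOLD):
    """Pairs of independent variables whose absolute correlation exceeds the threshold."""
    return [pair for pair, r in correlations.items() if abs(r) > threshold]


def strongest_pair(correlations):
    """The pair of independent variables with the largest absolute correlation."""
    return max(correlations, key=lambda pair: abs(correlations[pair]))


print("Pairs above the threshold:", extreme_pairs(CORRELATIONS))
print("Most strongly correlated pair:", strongest_pair(CORRELATIONS))
```

No pair crosses the threshold, so the multicollinearity here is moderate rather than extreme, yet the South and poverty correlation is still large enough to shrink the South's slope once poverty enters the model.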

For your homework run and interpret the following regression models:

1. Rerun the example without South in the model [dependent variable: crime rate (CRS28); independent variables: PVS493 (poverty) and EDS130 (high school dropout)].

a. Interpret the adjusted R squared for the model.

b. Write out the regression equation for the model.

c. Interpret the bs and sig (p) in the coefficient table.

d. Which variable has the strongest effect on crime rate?

2. Create a regression model like the one in the example, but with violent crime rate as your dependent variable [dependent variable: violent crime rate (CRS32); independent variables: South, PVS493 (poverty), and EDS130 (high school dropout)].

a. Interpret the adjusted R squared.

b. Write out the regression equation for the model.

c. Interpret the bs and sig (p) in the coefficient table.

d. Which variable has the strongest effect on violent crime rate?

3. Create a regression model with a dependent variable that interests you in the states data and any 2 independent variables that interest you.

a. Interpret the adjusted R squared.

b. Write out the regression equation for the best fitting model.

c. Interpret the bs and sig (p) in the coefficient table.

d. Which variable has the strongest effect on your dependent variable?