> df=read.csv("states_m.csv")
> str(df)
'data.frame': 51 obs. of 64 variables:
 $ AggravatedAssaultRate : num 0.00388 0.00575 0.00345 0.00414 0.00264 ...
 $ Area : num 52420 665384 113990 53179 163695 ...
 $ AverageACTCompositeScore : num 19.1 21.1 19.9 20.4 22.5 20.7 24.4 23.5 21.1 19.9 ...
 $ AverageACTEnglishScore : num 18.8 20.1 18.8 20 22.1 20.2 24.5 23.2 20.5 18.9 ...
 $ AverageACTMathematicsScore : num 18.4 21.1 20.2 20 22.7 20.4 24.1 23 21.1 19.6 ...
 $ AverageACTReadingScore : num 19.7 21.9 20.2 20.9 22.6 21 24.7 24.1 21.5 21 ...
 $ AverageACTScienceScore : num 19.1 20.9 19.7 20.3 22 20.8 23.8 23.1 20.7 19.5 ...
 $ AverageSalePrice : int 153099 263931 182553 139124 391609 267924 341436 247793 506654 183843 ...
 $ AverageSATCriticalReadingScore : int 545 509 523 568 495 582 504 462 441 486 ...
 $ AverageSATMathematicsScore : int 538 503 527 569 506 587 506 461 440 480 ...
 $ AverageSATWritingScore : int 533 482 502 551 491 567 504 445 432 468 ...
 $ AverageTotalSATScore : int 1616 1494 1552 1688 1492 1736 1514 1368 1313 1434 ...
 $ BurglaryRate : num 0.00646 0.00564 0.00536 0.00728 0.00447 ...
 $ CivilianLaborForce : int 2240309 351994 3518773 1362631 19421468 3143351 1907514 488776 NA 10338233 ...
 $ CivilianUnemploymentRate : num 3.5 6.4 4.9 3.5 4.2 3 3.7 3.2 NA 3.4 ...
 $ CrimeRate : num 0.0348 0.0437 0.0342 0.0363 0.0295 ...
 $ EarningsAverage : int 41148 50830 43927 37501 54896 49594 63127 52446 89707 41895 ...
 $ Employment : int 2618073 445031 3520657 1606087 21245509 3215903 2235248 548130 813734 10679883 ...
 $ FarmLandArea : int 9033537 881585 26117899 13872862 25364695 31604911 405616 510253 0 9231570 ...
 $ FederalGovernmentExpenditure : int 56047825 11922341 62943007 28025429 331030869 46028683 53019842 7982219 60719306 180887431 ...
 $ FederalGovernmentExpenditurePerCapita : num 11.9 17.07 9.54 9.7 8.96 ...
 $ FIPSCode : int 1 2 4 5 6 8 9 10 11 12 ...
 $ ForcibleRapeRate : num 0.000416 0.001167 0.00051 0.000683 0.000372 ...
 $ ForeignBornFraction : num 3.4 7.2 14.2 4.3 27.2 9.8 13.2 8.2 13 19.2 ...
 $ FractionOfGraduatesTakingACT : int 100 39 56 93 30 100 32 21 42 79 ...
 $ FractionOfGraduatesTakingSAT : num 6.1 51.9 34.3 4.1 60.4 12.3 89.3 100 100 74.3 ...
 $ GiniIndex : num 0.477 0.417 0.468 0.471 0.488 ...
 $ GovernmentEmployment : int 409942 104824 448567 231865 2718385 423713 268267 70838 248990 1208984 ...
 $ GrossStateProduct : num 2.21e+11 5.40e+10 3.47e+11 1.28e+11 2.97e+12 ...
 $ HealthInsuranceCoverageRate : int 17 6 10 7 11 23 40 35 44 4 ...
 $ HighestElevation : num 2405 20335 12631 2756 14505 ...
 $ HomeOwnershipRate : num 70.3 63.7 65.7 64.7 55.1 64.4 65.3 70.8 NA 65.5 ...
 $ InfantDeaths : int 571 76 651 350 2835 404 260 99 96 1717 ...
 $ LandArea : num 50645 570641 113594 52035 155779 ...
 $ LarcenyTheftRate : num 0.0205 0.024 0.0211 0.0211 0.0162 ...
 $ LowestElevation : num 0 0 68.9 55.8 -282.2 ...
 $ MedianAge : num 38.6 33.6 37.1 37.7 36 36.4 40.6 39.6 33.8 41.6 ...
 $ MedianHouseholdIncome : int 44758 74444 51340 42336 63783 62520 71755 61017 72935 48900 ...
 $ MedianSalePrice : int 128969 241750 147669 120560 330037 217558 266845 216902 404380 144501 ...
 $ MinimumWage : num 7.25 9.89 11 9.25 12 ...
 $ MotorVehicleTheftRate : num 0.00263 0.00576 0.00272 0.00241 0.00426 ...
 $ MurderNonnegligentManslaughterRate : num 0.000083 0.000084 0.000059 0.000086 0.000046 0.000039 0.000028 0.000056 0.000167 0.00005 ...
 $ Name : Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ OwnerOccupiedHousingMedianValue : int 120800 235100 197400 105100 421600 236700 293100 244100 442600 188600 ...
 $ OwnerOccupiedHousingUnitsFraction : num 70.7 64.3 66.6 67.5 56.7 66.8 68.9 73 42.8 69 ...
 $ PerCapitaIncome : int 24736 34191 26686 23401 31458 33230 39906 31118 48781 27598 ...
 $ PerCapitaPersonalIncome : int 40805 57179 42280 41046 59796 54646 71823 49673 79989 47684 ...
 $ PersonsPerHousehold : num 2.62 2.86 2.75 2.61 3.01 ...
 $ Population : int 4849377 736732 6731484 2966369 38802500 5355866 3590886 935614 658893 19893297 ...
 $ PopulationDensity : num 92.51 1.11 59.05 55.78 237.04 ...
 $ PropertyCrimeRate : num 0.0296 0.0354 0.0291 0.0308 0.025 ...
 $ RobberyRate : num 0.000865 0.001285 0.00106 0.000644 0.001432 ...
 $ StateAbbreviation : Factor w/ 51 levels "AK","AL","AR",..: 2 1 4 3 5 6 7 9 8 10 ...
 $ StateGovernmentTaxCollections : num 1.10e+10 7.11e+08 1.62e+10 9.76e+09 1.75e+11 ...
 $ TotalVoterRegistrationRate : num 68 69.1 60.5 65.7 53.8 ...
 $ TotalVotingRate : num 56.4 59.4 53.3 56 48.2 ...
 $ ViolentCrimeRate : num 0.00524 0.00829 0.00508 0.00555 0.00449 ...
 $ Workforce : int 2088200 353100 2874700 1278000 17526300 2802900 1712400 472000 NA 8904000 ...
 $ BusinessEthnicOwnershipFractionsAmerindian : num 0.8 10 1.9 1.1 1.3 0.8 0.5 0 0.9 0.5 ...
 $ BusinessEthnicOwnershipFractionsAsian : num 1.8 3.1 3.3 1.4 14.9 2.6 3.3 4 5.9 3.2 ...
 $ BusinessEthnicOwnershipFractionsBlack : num 14.8 1.5 2 5.5 4 1.7 4.4 8.7 28.2 9 ...
 $ BusinessEthnicOwnershipFractionsHispanic : num 1.2 0 10.7 2.3 16.5 6.2 4.2 2.1 6.1 22.4 ...
 $ BusinessEthnicOwnershipFractionsPacificIslander: num 0.1 0.3 0 0.1 0.3 0.1 0 0 0 0.1 ...
 $ BusinessEthnicOwnershipFractionsWhite : num 81.3 85.1 82.1 89.6 63 88.6 87.6 85.2 58.9 64.8 ...

> out0=lm(GiniIndex ~ .,data=df[,27:47])
> summary(out0)

Call:
lm(formula = GiniIndex ~ ., data = df[, 27:47])

Residuals:
ALL 50 residuals are 0: no residual degrees of freedom!

Coefficients: (19 not defined because of singularities)
all NAs

note: columns 27:47 include Name (column 43), a factor with one level per state,
so the state dummies plus the intercept already use up all 50 observations and
the fit is saturated; restrict to columns 27:37 (all numeric) instead

> out1=lm(GiniIndex ~ .,data=df[,27:37])
> summary(out1)

Call:
lm(formula = GiniIndex ~ ., data = df[, 27:37])

Residuals:
      Min        1Q    Median        3Q       Max
-0.016944 -0.009164  0.000607  0.006296  0.032082

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                  5.038e-01  5.033e-02  10.011 2.48e-12 ***
GovernmentEmployment         1.362e-08  3.241e-08   0.420  0.67652
GrossStateProduct            9.044e-15  2.394e-14   0.378  0.70769
HealthInsuranceCoverageRate -6.452e-04  2.141e-04  -3.014  0.00452 **
HighestElevation            -1.936e-06  5.702e-07  -3.395  0.00159 **
HomeOwnershipRate           -1.479e-03  5.060e-04  -2.922  0.00576 **
InfantDeaths                -7.344e-06  1.581e-05  -0.465  0.64477
LandArea                    -5.142e-08  2.865e-08  -1.795  0.08043 .
LarcenyTheftRate             7.748e-01  5.832e-01   1.329  0.19167
LowestElevation              1.564e-06  2.689e-06   0.582  0.56399
MedianAge                    1.763e-03  9.464e-04   1.863  0.07005 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01183 on 39 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.6902, Adjusted R-squared:  0.6108
F-statistic:  8.69 on 10 and 39 DF,  p-value: 2.864e-07

> dfA=df[,c(3,16,47,30,21,18,39,56,27)]
> names(dfA)
[1] "AverageACTCompositeScore"
[2] "CrimeRate"
[3] "PerCapitaPersonalIncome"
[4] "HealthInsuranceCoverageRate"
[5] "FederalGovernmentExpenditurePerCapita"
[6] "Employment"
[7] "MedianSalePrice"
[8] "TotalVotingRate"
[9] "GiniIndex"
> dfB=df[,c(12,57,46,33,54,50,45,55,27)]
> names(dfB)
[1] "AverageTotalSATScore"              "ViolentCrimeRate"
[3] "PerCapitaIncome"                   "InfantDeaths"
[5] "StateGovernmentTaxCollections"     "PopulationDensity"
[7] "OwnerOccupiedHousingUnitsFraction" "TotalVoterRegistrationRate"
[9] "GiniIndex"

> outA1=lm(GiniIndex ~ ., data=dfA)
> summary(outA1)
(Intercept)                           ***
FederalGovernmentExpenditurePerCapita ***
Employment                            **
Residual standard error: 0.01623 on 42 degrees of freedom
Multiple R-squared:  0.5101, Adjusted R-squared:  0.4168
> AIC(outA1)
[1] -265.5299

> outA2=lm(GiniIndex ~ FederalGovernmentExpenditurePerCapita+Employment+MedianSalePrice, data=dfA)
same significant variables as outA1
Residual standard error: 0.01585 on 47 degrees of freedom
Multiple R-squared:  0.4766, Adjusted R-squared:  0.4432
> AIC(outA2)
[1] -272.1587

> outB1=lm(GiniIndex ~ ., data=dfB)
> summary(outB1)
(Intercept)        5.005e-01  5.069e-02   9.872 2.14e-12 ***
PerCapitaIncome   -1.742e-06  6.634e-07  -2.625   0.0121 *
InfantDeaths       1.422e-05  7.297e-06   1.948   0.0583 .
PopulationDensity  5.374e-05  1.223e-05   4.394 7.68e-05 ***

> outB2=lm(GiniIndex ~ PerCapitaIncome+InfantDeaths+PopulationDensity, data=dfB)
PerCapitaIncome   -4.482e-07  5.475e-07  -0.819    0.417
InfantDeaths       1.789e-05  3.646e-06   4.906 1.15e-05 ***
PopulationDensity  9.966e-06  2.029e-06   4.912 1.13e-05 ***
Residual standard error: 0.01511 on 47 degrees of freedom
Multiple R-squared:  0.5248, Adjusted R-squared:  0.4945
> AIC(outB2)
[1] -277.0839

interesting: PerCapitaIncome lost its significance (p 0.0121 -> 0.417), while
InfantDeaths was strengthened (p 0.0583 -> 1.15e-05); quality indicators all improved

----

> df=read.csv("schools.csv")
> str(df)
'data.frame': 160 obs. of 17 variables:
 $ UnitID : int 138600 168546 188641 210669 168591 164465 143084 222983 189088 189097 ...
 $ Name : Factor w/ 158 levels "Agnes Scott College",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ enrollment: int 872 1265 2407 1940 1348 1922 2475 1331 2442 2572 ...
 $ tuition : int 37236 39313 26261 42470 35428 50562 38466 36230 49906 47631 ...
 $ endowment : int 251001216 173147287 93754316 162801245 102679827 1823748203 127763186 124800382 161396088 240710000 ...
 $ netPrice : int 20378 24906 20306 27773 21815 14687 24651 21875 25275 21791 ...
 $ Istaff : int 73 109 151 188 92 220 197 97 180 214 ...
 $ retention : int 82 84 75 83 79 98 83 86 88 95 ...
 $ fresh : int 252 346 535 601 376 466 627 364 489 573 ...
 $ SFR : int 10 11 12 10 12 8 11 12 10 10 ...
 $ GradRate : int 73 72 55 78 61 94 74 76 75 89 ...
 $ Psalary : int 78102 63972 57654 73287 49248 108315 66627 68400 98118 108387 ...
 $ undergrad : int 873 1268 1920 2023 1396 1792 2497 1278 2112 2573 ...
 $ MATH : num 26.7 21 20 21.9 21 ...
 $ READ : num 24.5 21 19.9 23.3 20 ...
 $ Name2 : Factor w/ 158 levels " Oglethorpe University 1 ",..: 3 4 5 6 7 8 9 10 11 12 ...
 $ USNews : int 70 122 159 77 146 2 99 105 49 27 ...
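
The schools model that follows drops the identifier columns with `. - UnitID - Name - Name2`. A minimal sketch of why that matters, and of what went wrong with out0 in the states data: a factor with one level per row saturates the fit. The data here are simulated, and the names `d`, `id`, `x`, `y` are hypothetical stand-ins, not columns from either CSV.

```r
# Simulated stand-in data; 'id' plays the role of a per-row Name column.
set.seed(1)
n <- 20
d <- data.frame(
  id = factor(paste0("unit", seq_len(n))),  # one factor level per row
  x  = rnorm(n)
)
d$y <- 1 + 2 * d$x + rnorm(n, sd = 0.5)

# With the identifier included, the intercept plus the id dummies
# already account for every observation.
fit_all <- lm(y ~ ., data = d)
df.residual(fit_all)        # no residual degrees of freedom
sum(is.na(coef(fit_all)))   # terms dropped as singular

# Excluding the identifier in the formula leaves an ordinary regression.
fit_ok <- lm(y ~ . - id, data = d)
df.residual(fit_ok)         # n - 2
```

Subsetting the columns before fitting, as out1 does with df[,27:37] in the states data, achieves the same thing as removing terms in the formula.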
> out1=lm(log(USNews)~.-UnitID-Name-Name2,data=df)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.918e+00  5.162e-01  15.339  < 2e-16 ***
enrollment   2.849e-04  1.308e-04   2.178 0.031014 *
tuition     -4.344e-06  6.583e-06  -0.660 0.510298
endowment   -7.591e-10  1.220e-10  -6.222 4.91e-09 ***
netPrice     1.337e-05  6.516e-06   2.052 0.041934 *
Istaff      -3.635e-03  1.511e-03  -2.405 0.017407 *
retention    3.303e-03  8.048e-03   0.410 0.682135
fresh       -4.151e-05  7.079e-04  -0.059 0.953326
SFR         -2.454e-02  2.738e-02  -0.896 0.371572
GradRate    -1.899e-02  5.137e-03  -3.696 0.000309 ***
Psalary     -1.357e-05  3.892e-06  -3.487 0.000647 ***
undergrad    1.151e-04  2.502e-04   0.460 0.646308
MATH         1.566e-02  2.967e-02   0.528 0.598377
READ        -7.109e-02  2.068e-02  -3.438 0.000763 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.329 on 146 degrees of freedom
Multiple R-squared:  0.8901, Adjusted R-squared:  0.8803
F-statistic: 90.96 on 13 and 146 DF,  p-value: < 2.2e-16
> AIC(out1)
[1] 113.7029

the only terms anywhere near significance are the ones already significant,
so keep just those

> out2=lm(log(USNews)~enrollment+endowment+netPrice+Istaff+GradRate+Psalary+READ,data=df)
(Intercept)  7.840e+00  2.546e-01  30.798  < 2e-16 ***
enrollment   3.006e-04  8.457e-05   3.555 0.000505 ***
endowment   -7.545e-10  1.145e-10  -6.588 6.87e-10 ***
netPrice     1.172e-05  5.654e-06   2.073 0.039816 *
Istaff      -2.861e-03  1.092e-03  -2.621 0.009658 **
GradRate    -1.667e-02  3.780e-03  -4.410 1.95e-05 ***
Psalary     -1.324e-05  3.366e-06  -3.934 0.000127 ***
READ        -6.436e-02  1.293e-02  -4.978 1.73e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3242 on 152 degrees of freedom
Multiple R-squared:  0.8889, Adjusted R-squared:  0.8838
F-statistic: 173.8 on 7 and 152 DF,  p-value: < 2.2e-16
> AIC(out2)
[1] 103.3773

effect on standing (a lower USNews number is a better rank):
  decrease standing (positive coefficient): enrollment, netPrice
  increase standing (negative coefficient): endowment, Istaff, GradRate, Psalary, READ

if you look at the diagnostic plots:
  leverage: obs 136?
  big residuals for low-rank (good) schools
  non-normal distribution
  trends in scale-location

the following span a decade (a factor of 10): enrollment, endowment, netPrice, Istaff, fresh, undergrad
suggest: I(log(endowment/enrollment))+log(netPrice)+Istaff/enrollment

> out2=lm(log(USNews)~enrollment+I(log(endowment/enrollment))+log(netPrice)+(Istaff/enrollment)+GradRate+Psalary+READ,data=df)
(Intercept)                    6.219e+00  1.281e+00   4.856 2.97e-06 ***
enrollment                     2.022e-04  1.191e-04   1.697 0.091762 .
I(log(endowment/enrollment))  -1.391e-01  5.128e-02  -2.713 0.007438 **
log(netPrice)                  3.867e-01  1.067e-01   3.625 0.000394 ***
Istaff                        -8.256e-03  1.730e-03  -4.772 4.27e-06 ***
GradRate                      -1.098e-02  4.432e-03  -2.478 0.014325 *
Psalary                       -1.628e-05  3.798e-06  -4.286 3.23e-05 ***
READ                          -7.274e-02  1.452e-02  -5.010 1.50e-06 ***
enrollment:Istaff              1.198e-06  4.350e-07   2.753 0.006623 **
Residual standard error: 0.3586 on 151 degrees of freedom
Multiple R-squared:  0.865, Adjusted R-squared:  0.8578
> AIC(out2)
[1] 136.6208

note: inside a formula, (Istaff/enrollment) is the nesting operator, so R fitted
Istaff + enrollment:Istaff (see the table) rather than the staff-to-enrollment
ratio; I(Istaff/enrollment) would be needed to get the ratio itself

doesn't really change the suggestions; not as good a fit (AIC 136.6 vs 103.4)
enrollment less significant
leverage: obs 14?

----

> library(ISLR)
> str(Carseats)
'data.frame': 400 obs. of 11 variables:
 $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
 $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
 $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
 $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
 $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
 $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
 $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
 $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
 $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
 $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
 $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

> out=lm(Sales~Price+Urban+US,data=Carseats)
> summary(out)

Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max
-6.9206 -1.6220 -0.0564  1.5786  7.0581

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
> AIC(out)
[1] 1865.312

US=Yes raises Sales (the intercept) by 1.2 (significant) compared to US=No (US = is the store in the US)
Urban=Yes decreases Sales (the intercept) by 0.02 (not significant) compared to Urban=No

USNo+UrbanNo:    13.043469-0.054459*Price
USYes+UrbanNo:   13.043469+1.200573-0.054459*Price
USNo+UrbanYes:   13.043469-0.021916-0.054459*Price
USYes+UrbanYes:  13.043469+1.200573-0.021916-0.054459*Price

Urban is not significant, so drop it

> out2=lm(Sales~Price+US,data=Carseats)
> summary(out2)
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354
> AIC(out2)
[1] 1863.319
> confint(out2)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632

diagnostics look OK
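
The four Urban/US equations written out by hand in the notes on the first Carseats model can also be built from the coefficient vector. A small sketch follows, with the estimates hard-coded from the summary(out) table so it runs without the ISLR package. The names `b` and `grid` and the Price of 100 are illustrative assumptions, not part of the original session.

```r
# Estimates copied from the summary(out) table for Sales ~ Price + Urban + US.
b <- c(Intercept = 13.043469, Price = -0.054459,
       UrbanYes = -0.021916, USYes = 1.200573)

# One row per Urban/US combination; each dummy contributes its
# coefficient only when the factor level is "Yes".
grid <- expand.grid(Urban = c("No", "Yes"), US = c("No", "Yes"))
grid$intercept <- b[["Intercept"]] +
  b[["UrbanYes"]] * (grid$Urban == "Yes") +
  b[["USYes"]]    * (grid$US == "Yes")
grid$slope <- b[["Price"]]   # same Price slope in every group

# Fitted Sales at a hypothetical Price of 100, for illustration only.
grid$sales_at_100 <- grid$intercept + grid$slope * 100
print(grid)
```

With the fitted object available, the same grid could be passed to predict() as newdata together with a Price column, which avoids copying numbers by hand.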