Comparing models in multiple regression

Note: this chapter draws on Ahn's book together with R Cookbook (Paul Teetor, 2011) and R in Action (Robert I. Kabacoff, 2011).

(1) Introduction

Chapter 11 of Ahn's book describes how to select a model in multiple regression. There, anova() was used to compare two or three models. To refresh your memory, here is the example.

Example 1. attitude data

> out=lm(rating~.,data=attitude)
> out2=lm(rating~complaints+learning+advance,data=attitude)
> out3=lm(rating~complaints+learning,data=attitude)

Starting from the full model, variables with large p-values were removed to build two successively smaller models, and the three models were then compared with anova().

> anova(out3,out2,out)
Analysis of Variance Table

Model 1: rating ~ complaints + learning
Model 2: rating ~ complaints + learning + advance
Model 3: rating ~ complaints + privileges + learning + raises + critical +
    advance
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     27 1254.7
2     26 1179.1  1    75.540 1.5121 0.2312
3     23 1149.0  3    30.109 0.2009 0.8947

Neither of the larger models fits significantly better than the one before it (p = 0.2312 and p = 0.8947), so the smallest model is kept.

> summary(out3)

Call:
lm(formula = rating ~ complaints + learning, data = attitude)

Residuals:
     Min       1Q   Median       3Q      Max
-11.5568  -5.7331   0.6701   6.5341  10.3610

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.8709     7.0612   1.398    0.174
complaints    0.6435     0.1185   5.432 9.57e-06 ***
learning      0.2112     0.1344   1.571    0.128
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.817 on 27 degrees of freedom
Multiple R-squared: 0.708, Adjusted R-squared: 0.6864
F-statistic: 32.74 on 2 and 27 DF,  p-value: 6.058e-08

In the end, the model with the two independent variables complaints and learning was selected.

(2) Another way to compare models: using AIC

AIC (Akaike's An Information Criterion) provides another way to compare models. The index takes into account both a model's statistical fit and the number of parameters needed to achieve that fit. Models with smaller AIC values, that is, models that fit adequately with fewer parameters, are preferred. Two functions use this criterion, AIC() and step(); here we compare models with step().
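Before moving on to step(), here is a minimal sketch of the AIC() function mentioned above, applied to the three models from Example 1. It is an illustration added for clarity, not part of the original analysis. One thing worth knowing: for lm fits, step() reports the value from extractAIC(), which differs from AIC() by a constant when the models are fitted to the same data, so the two rank the models identically.

```r
# attitude ships with base R (datasets package), so nothing needs to be loaded.
out  <- lm(rating ~ ., data = attitude)
out2 <- lm(rating ~ complaints + learning + advance, data = attitude)
out3 <- lm(rating ~ complaints + learning, data = attitude)

# AIC() accepts several fitted models and returns a data frame (df, AIC),
# one row per model; the smallest AIC is preferred.
aic_table <- AIC(out3, out2, out)
print(aic_table)

# extractAIC() returns c(edf, AIC); the second element is the number that
# step() prints in its AIC column.
step_aic <- sapply(list(out3 = out3, out2 = out2, out = out),
                   function(m) extractAIC(m)[2])
print(round(step_aic, 2))

# The two scales differ by a constant, so differences between models agree.
print(diff(aic_table$AIC))
print(unname(diff(step_aic)))
```

Either scale points to the two-variable model out3 as the one with the smallest AIC, consistent with the anova() comparison.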
step() performs stepwise regression and can run both forward and backward selection. Backward regression starts with as many variables as possible and removes them one at a time; forward regression starts with few variables and adds them one at a time.

Backward regression is run as follows.

> full.model=lm(rating~.,data=attitude)
> reduced.model=step(full.model,direction="backward")

Forward regression is run as follows.

> min.model=lm(rating~1,data=attitude)
> fwd.model=step(min.model,direction="forward",scope=(rating~complaints+privileges+learning+raises+critical+advance))

Here is an actual example.

> # Backward stepwise regression
> full.model=lm(rating~.,data=attitude)
> summary(full.model)

Call:
lm(formula = rating ~ ., data = attitude)

Residuals:
     Min       1Q   Median       3Q      Max
-10.9418  -4.3555   0.3158   5.5425  11.5990

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.78708   11.58926   0.931 0.361634
complaints   0.61319    0.16098   3.809 0.000903 ***
privileges  -0.07305    0.13572  -0.538 0.595594
learning     0.32033    0.16852   1.901 0.069925 .
raises       0.08173    0.22148   0.369 0.715480
critical     0.03838    0.14700   0.261 0.796334
advance     -0.21706    0.17821  -1.218 0.235577
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.068 on 23 degrees of freedom
Multiple R-squared: 0.7326, Adjusted R-squared: 0.6628
F-statistic:  10.5 on 6 and 23 DF,  p-value: 1.24e-05

To remove the variables that contribute nothing, we use step() and store the result in reduced.model.
> reduced.model=step(full.model,direction="backward")
Start:  AIC=123.36
rating ~ complaints + privileges + learning + raises + critical +
    advance

             Df Sum of Sq    RSS    AIC
- critical    1      3.41 1152.4 121.45
- raises      1      6.80 1155.8 121.54
- privileges  1     14.47 1163.5 121.74
- advance     1     74.11 1223.1 123.24
<none>                    1149.0 123.36
- learning    1    180.50 1329.5 125.74
- complaints  1    724.80 1873.8 136.04

Step:  AIC=121.45
rating ~ complaints + privileges + learning + raises + advance

             Df Sum of Sq    RSS    AIC
- raises      1     10.61 1163.0 119.73
- privileges  1     14.16 1166.6 119.82
- advance     1     71.27 1223.7 121.25
<none>                    1152.4 121.45
- learning    1    177.74 1330.1 123.75
- complaints  1    724.70 1877.1 134.09

Step:  AIC=119.73
rating ~ complaints + privileges + learning + advance

             Df Sum of Sq    RSS    AIC
- privileges  1     16.10 1179.1 118.14
- advance     1     61.60 1224.6 119.28
<none>                    1163.0 119.73
- learning    1    197.03 1360.0 122.42
- complaints  1   1165.94 2328.9 138.56

Step:  AIC=118.14
rating ~ complaints + learning + advance

             Df Sum of Sq    RSS    AIC
- advance     1     75.54 1254.7 118.00
<none>                    1179.1 118.14
- learning    1    186.12 1365.2 120.54
- complaints  1   1259.91 2439.0 137.94

Step:  AIC=118
rating ~ complaints + learning

             Df Sum of Sq    RSS    AIC
<none>                    1254.7 118.00
- learning    1    114.73 1369.4 118.63
- complaints  1   1370.91 2625.6 138.16

The output of step() shows the sequence of models it worked through. At each step it removes the variable whose removal gives the lowest AIC, and it stops when removing any remaining variable would raise the AIC. The AIC started at 123.36 and fell to 118. The summary of the final model shows that two variables remain, which matches the result in Ahn's book.

> summary(reduced.model)

Call:
lm(formula = rating ~ complaints + learning, data = attitude)

Residuals:
     Min       1Q   Median       3Q      Max
-11.5568  -5.7331   0.6701   6.5341  10.3610

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.8709     7.0612   1.398    0.174
complaints    0.6435     0.1185   5.432 9.57e-06 ***
learning      0.2112     0.1344   1.571    0.128
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.817 on 27 degrees of freedom
Multiple R-squared: 0.708, Adjusted R-squared: 0.6864
F-statistic: 32.74 on 2 and 27 DF,  p-value: 6.058e-08

Backward regression is easy, but when there are very many candidate variables it can be hard to start with all of them in the model. In that case, use forward stepwise regression: start from nothing and add variables one at a time until there is no further improvement.

Let us run forward regression on the same data. Forward regression starts from an empty model.

> min.model=lm(rating~1,data=attitude)

This model has a response variable and no predictors. For forward selection, step() has to be told which variables are candidates. We pass them in the scope argument and add trace=0 to suppress the verbose output.

> fwd.model=step(min.model,direction="forward",scope=(rating~complaints+privileges+learning+raises+critical+advance),trace=0)
> summary(fwd.model)

Call:
lm(formula = rating ~ complaints + learning, data = attitude)

Residuals:
     Min       1Q   Median       3Q      Max
-11.5568  -5.7331   0.6701   6.5341  10.3610

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.8709     7.0612   1.398    0.174
complaints    0.6435     0.1185   5.432 9.57e-06 ***
learning      0.2112     0.1344   1.571    0.128
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.817 on 27 degrees of freedom
Multiple R-squared: 0.708, Adjusted R-squared: 0.6864
F-statistic: 32.74 on 2 and 27 DF,  p-value: 6.058e-08

The forward model arrives at the same result.

Finally, do not put too much faith in stepwise regression. It is not a cure-all. It cannot turn scrap metal into gold, and it is no substitute for choosing variables carefully and wisely. Some people may think: "Hmm... let me throw every possible interaction into my model and let step() pick the best ones. Let's see what it chooses!" If that is you, consider the following.
> full.model=lm(y~(x1+x2+x3+x4)^4)     # all possible interactions
> reduced.model=step(full.model,direction="backward")

This will probably not work out as hoped. Most of the interactions will be meaningless, step() will be overloaded, and you will be left with a huge amount of meaningless output.
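To make the size of that model concrete, here is a minimal sketch on simulated data. The data frame d, the sample size, the seed, and the true coefficients are all hypothetical choices for illustration; they are not from the original post. With only four predictors, the formula already expands to 15 terms plus the intercept.

```r
set.seed(1)  # hypothetical simulated data, for illustration only
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)  # by construction only x1 and x2 matter

# (x1+x2+x3+x4)^4 expands to 4 main effects, 6 two-way, 4 three-way
# and 1 four-way interaction.
full.model <- lm(y ~ (x1 + x2 + x3 + x4)^4, data = d)
print(length(coef(full.model)))  # intercept + 15 terms

# drop1() lists the terms that are candidates for removal right now.
# Lower-order terms contained in an interaction are not offered.
top_level <- drop1(full.model)
print(rownames(top_level))

# trace = 0 suppresses the long step-by-step listing.
reduced.model <- step(full.model, direction = "backward", trace = 0)
print(formula(reduced.model))
```

In this sketch drop1() offers only the four-way interaction at the first stage, so the search has to work down from the highest-order terms. Which terms survive in reduced.model depends on the simulated noise, so any spurious interaction that remains is exactly the kind of result the warning above is about.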

Posted by cardiomoon, 12.01.30 17:20 (2 comments)

License notice: attribution, content modification, non-commercial

안재형
12.01.30 23:26

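That is why I do not use automated methods like this. Especially when there are interactions: if a*b is in the model, the main effects a and b must be included even when they are not significant, and that is not possible with this kind of automated method.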
강성찬
12.01.31 09:27

You are really working hard at this. Impressive.