R Statistics on the Web

This post uses a few lines of code to look at the residuals of a regression in R and at how they are used to judge how well the regression fits.
1. First, create a sample data file in Excel.
<table class="txc-table" width="206px" cellspacing="0" cellpadding="0" border="0" style="font-size: 9pt; line-height: 1.6; border: none; border-collapse: collapse; width: 206px;"><tbody><tr><td style="width: 205px; height: 24px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/236340435718973D17" class="txc-image" hspace="1" vspace="1" border="0" actualwidth="195" width="195" exif="{}" data-filename="캡처.jpg" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_236340435718973D170BB5" id="A_236340435718973D170BB5"/> </td></tr></tbody></table>
- Type X and Y in row 1, then enter the numbers in the rows below.  <img src="https://t1.daumcdn.net/cfile/cafe/261D59425718979F19" class="txc-image" hspace="1" vspace="1" border="0" actualwidth="960" width="960" exif="{}" data-filename="캡처.jpg" style="clear:none;float:none;" id="A_261D59425718979F192D3A" id="A_261D59425718979F192D3A"/>
- To save, choose the CSV file format under Excel's Save As option and save the file in the folder you will use from RStudio. <div> </div>
2. Load the sample data file in RStudio (on Windows).
- First, start RStudio. <table class="txc-table" width="316px" cellspacing="0" cellpadding="0" border="0" style="font-size: 9pt; line-height: 1.6; border: none; border-collapse: collapse; width: 316px;"><tbody><tr><td style="width: 315px; height: 24px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/2756A84557189A3533" class="txc-image" hspace="1" vspace="1" border="0" actualwidth="306" width="306" exif="{}" data-filename="캡처23.jpg" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_2756A84557189A35332DD5" id="A_2756A84557189A35332DD5"/> </td></tr></tbody></table>
- Use the getwd() function to check the folder path that RStudio is currently set to (the working directory). 
<table class="txc-table" width="700" cellspacing="0" cellpadding="0" border="0" style="line-height: 1.6; border: none; border-collapse: collapse;"><tbody><tr><td style="width:700px;height:24px;border-bottom:1px solid #ccc;border-right:1px solid #ccc;border-top:1px solid #ccc;border-left:1px solid #ccc;;"><img src="https://t1.daumcdn.net/cfile/cafe/255AAF365718A77D22" class="txc-image" hspace="1" vspace="1" border="0" actualwidth="743" width="743" exif="{}" data-filename="캡처.jpg" style="font-size: 9pt; line-height: 19.2px; text-align: center; clear: none; float: none;" id="A_255AAF365718A77D226A3F" id="A_255AAF365718A77D226A3F"/> </td></tr></tbody></table>
- If this path is not the folder where you saved the CSV file, you can set the working directory in the RStudio options. 
- You can also set the working directory from the RStudio console with the setwd("path") command. 
3. Loading the data and drawing a plot
In the RStudio console, load the data saved as a CSV file with the command below.
> data <- read.csv('regression.csv', header = TRUE) 
Check what has been stored in data. 
> data
<table class="txc-table" width="94px" cellspacing="0" cellpadding="0" border="0" style="line-height: 1.6; border: none; border-collapse: collapse; width: 94px;"><tbody><tr><td style="width: 93px; height: 24px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/2269D0435738A4A81A" class="txc-image" actualwidth="86" hspace="1" vspace="1" border="0" width="86" exif="{}" data-filename="zz.jpg" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_2269D0435738A4A81AE69B" id="A_2269D0435738A4A81AE69B"/> </td></tr></tbody></table>
 For data with many rows, the head command shows only the first rows (six by default).
> head(data)
<table class="txc-table" width="70px" cellspacing="0" cellpadding="0" border="0" style="border: none; border-collapse: collapse; font-family: gulim; font-size: 12px; width: 70px;"><tbody><tr><td style="width: 69px; height: 24px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/2316FA365738A4D30F" class="txc-image" actualwidth="56" hspace="1" vspace="1" border="0" width="56" exif="{}" data-filename="1234.jpg" style="font-size: 9pt; line-height: 19.2px; text-align: center; clear: none; float: none;" id="A_2316FA365738A4D30FD571" id="A_2316FA365738A4D30FD571"/> </td></tr></tbody></table>
 Draw a two-dimensional plot with the X values on the horizontal axis and the Y values on the vertical axis.
> plot(data, pch = 1)
<table class="txc-table" width="560px" cellspacing="0" cellpadding="0" border="0" style="line-height: 1.6; border: none; border-collapse: collapse; width: 560px;"><tbody><tr><td style="width: 559px; height: 24px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/212CEC3F5718C0BF31" class="txc-image" hspace="1" vspace="1" border="0" actualwidth="556" width="556" exif="{}" data-filename="Rplot.png" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_212CEC3F5718C0BF31B5F6" id="A_212CEC3F5718C0BF31B5F6"/> </td></tr></tbody></table>
 4. Creating the regression model
Fit a regression model to the loaded data.
> model <- lm(Y ~ X, data) 
Check the information about the fitted regression model.
> summary(model)
<table class="txc-table" width="630px" cellspacing="0" cellpadding="0" border="0" style="line-height: 1.6; border: none; border-collapse: collapse; width: 630px;"><tbody><tr><td style="width: 629px; height: 322px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/210EF04A573DE57406" class="txc-image" hspace="1" vspace="1" border="0" actualwidth="629" width="629" exif="{}" data-filename="zzzz.jpg" style="font-size: 9pt; line-height: 1.6; clear: none; float: none;" id="A_210EF04A573DE574066EE8"/></td></tr></tbody></table>
The summary of a regression model is organized into Residuals, Coefficients, Signif. codes, Residual standard error, Multiple R-squared, Adjusted R-squared, F-statistic, and p-value. We start with the items related to the residuals. 
5. 
residual
<table class="txc-table" width="330px" cellspacing="0" cellpadding="0" border="0" style="font-size: 9pt; line-height: 1.6; border: none; border-collapse: collapse; width: 330px;"><tbody><tr><td style="width: 330px; height: 121px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/23285F405738ABB213" class="txc-image" actualwidth="314" hspace="1" vspace="1" border="0" width="314" exif="{}" data-filename="ddJFC.png" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_23285F405738ABB213444B" id="A_23285F405738ABB213444B"/> </td></tr></tbody></table>
A residual is 'the Y value of the data entered in the Excel file - the Y value on the fitted regression line'. In other words, it is the difference between the observed value and the predicted value. You can inspect the residuals with the command below. 
> model$residuals 
<table class="txc-table" width="700" cellspacing="0" cellpadding="0" border="0" style="line-height: 1.6; border: none; border-collapse: collapse;"><tbody><tr><td style="width: 700px; height: 122px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/213289355738A50E31" class="txc-image" actualwidth="960" hspace="1" vspace="1" border="0" width="960" exif="{}" data-filename="5678.jpg" style="font-size: 9pt; line-height: 19.2px; text-align: center; clear: none; float: none;" id="A_213289355738A50E319210" id="A_213289355738A50E319210"/> </td></tr></tbody></table>
 Because the residual is 'the Y value of the data entered in the Excel file - the Y value on the fitted regression line', it can also be computed by hand. The predict function plugs the X values of the data into the fitted regression line to obtain Y; use it to get the predicted values predictedY.
> predictedY <- predict(model, data) 
Written as code, 'the Y value of the data entered in the Excel file - the Y value on the fitted regression line' looks like this. 
> data$Y - predictedY
<table class="txc-table" width="700" cellspacing="0" cellpadding="0" border="0" style="border:none;border-collapse:collapse;;font-family:gulim;font-size:12px"><tbody><tr><td style="width:700px;height:24px;border-bottom:1px solid #ccc;border-right:1px solid #ccc;border-top:1px solid #ccc;border-left:1px solid #ccc;;"><img src="https://t1.daumcdn.net/cfile/cafe/2268304B5738A80B2F" class="txc-image" actualwidth="955" hspace="1" vspace="1" border="0" width="955" exif="{}" data-filename="aaaa.jpg" style="font-size: 9pt; line-height: 19.2px; text-align: center; clear: none; float: none;" id="A_2268304B5738A80B2F597F" id="A_2268304B5738A80B2F597F"/> </td></tr></tbody></table>
 You can see that model$residuals and (data$Y - predictedY) hold the same values. 
6. residual standard error
The residual standard error measures how far the values of data$Y are scattered from the values of predictedY, that is, from the values predicted by the fitted regression model. 
> sqrt(sum((predictedY - data$Y) ^ 2) / (nrow(data) - 2))
<table class="txc-table" width="124px" cellspacing="0" cellpadding="0" border="0" style="line-height: 1.6; border: none; border-collapse: collapse; width: 124px;"><tbody><tr><td style="width: 123px; height: 25px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/2319B94B5738A84606" class="txc-image" actualwidth="109" hspace="1" vspace="1" border="0" width="109" exif="{}" data-filename="7777.jpg" style="font-size: 9pt; line-height: 1.6; clear: none; float: none;" id="A_2319B94B5738A84606B0B3" id="A_2319B94B5738A84606B0B3"/> </td></tr></tbody></table>
The result is the same as the residual standard error reported by summary(model).  
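The comparison can also be done in code instead of by eye. The following is only a minimal sketch: regression.csv is not available here, so it uses a synthetic 20-row data frame with columns X and Y as a stand-in (an assumption), and the names rse_manual and rse_summary are made up for illustration.

```r
# Synthetic stand-in for regression.csv (assumption: 20 rows, columns X and Y)
set.seed(1)
data <- data.frame(X = 1:20)
data$Y <- 2 + 0.5 * data$X + rnorm(20)

model <- lm(Y ~ X, data = data)

# Manual residual standard error: sqrt(sum of squared residuals / (n - 2))
n <- nrow(data)
rse_manual <- sqrt(sum(residuals(model)^2) / (n - 2))

# Values stored by R itself
rse_summary <- summary(model)$sigma   # residual standard error from summary()
rse_df <- df.residual(model)          # residual degrees of freedom

print(c(manual = rse_manual, summary = rse_summary))
print(rse_df)
```

Here summary(model)$sigma is the number that summary prints as "Residual standard error", and df.residual(model) is the degrees of freedom printed next to it.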
<table class="txc-table" width="298px" cellspacing="0" cellpadding="0" border="0" style="line-height: 1.6; border: none; border-collapse: collapse; width: 298px;"><tbody><tr><td style="width: 297px; height: 24px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/2216A8505738AB8032" class="txc-image" actualwidth="295" hspace="1" vspace="1" border="0" width="295" exif="{}" data-filename="olsassumptions.jpg" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_2216A8505738AB8032E505" id="A_2216A8505738AB8032E505"/> </td></tr></tbody></table>
The degrees of freedom are 20, the number of rows in data, minus 2, because two parameters, b0 and b1, are estimated. 
7. multiple R-squared, adjusted R-squared
The most widely used single measure of how well a regression model fits is the coefficient of determination, which expresses the variation explained by the regression model as a share of the total variation. 
<table class="txc-table" width="446px" cellspacing="0" cellpadding="0" border="0" style="border: none; border-collapse: collapse; font-family: gulim; font-size: 12px; width: 446px;"><tbody><tr><td style="width: 445px; height: 24px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/23103D425738B81717" class="txc-image" actualwidth="444" hspace="1" vspace="1" border="0" width="444" exif="{}" data-filename="크기변환_reg4.jpg" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_23103D425738B81717CEE9" id="A_23103D425738B81717CEE9"/> </td></tr></tbody></table>
To compute the multiple R-squared, we first compute three quantities called SSE, SSR, and SST. The graph above shows what each of them means.  
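For readers who cannot see the figure, the three quantities are usually written as follows. The notation is an assumption of this note rather than taken from the images: y_i is an observed Y value, \hat{y}_i the value on the fitted line, \bar{y} the mean of Y, and n the number of rows. SSR here means the regression (explained) sum of squares and SSE the error (residual) sum of squares.

```latex
\begin{align*}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2       && \text{total variation} \\
SSR &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 && \text{variation explained by the regression} \\
SSE &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2     && \text{unexplained (residual) variation} \\
SST &= SSR + SSE, \qquad R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
\end{align*}
```

The decomposition SST = SSR + SSE holds for a least-squares fit that includes an intercept, which is the case for lm(Y ~ X, data).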
<table class="txc-table" width="579px" cellspacing="0" cellpadding="0" border="0" style="border: none; border-collapse: collapse; font-family: gulim; font-size: 12px; width: 579px;"><tbody><tr><td style="width: 578px; height: 25px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/2173E74F573967B011" class="txc-image" actualwidth="571" hspace="1" vspace="1" border="0" width="571" exif="{}" data-filename="zzzzzz.jpg" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_2173E74F573967B011B27D" id="A_2173E74F573967B011B27D"/> </td></tr></tbody></table>
This is a slightly more detailed description of SSE, SSR, and SST.  
First, define a function that squares its input values and adds them all up.
sigma <- function(term) { sum(term ^ 2) } 
Next, use this function to compute SSR, SSE, and SST. SST is obtained here as SSE + SSR, which for a least-squares line with an intercept equals sigma(data$Y - mean(data$Y)).
> SSR <- sigma(predictedY - mean(data$Y))
> SSE <- sigma(data$Y - predictedY)
> SST <- SSE + SSR 
<table class="txc-table" width="700" cellspacing="0" cellpadding="0" border="0" style="border:none;border-collapse:collapse;;font-family:gulim;font-size:12px"><tbody><tr><td style="width:700px;height:24px;border-bottom:1px solid #ccc;border-right:1px solid #ccc;border-top:1px solid #ccc;border-left:1px solid #ccc;;"><img src="https://t1.daumcdn.net/cfile/cafe/26785F4D573969E428" class="txc-image" actualwidth="684" hspace="1" vspace="1" border="0" width="684" exif="{}" data-filename="zxcv.jpg" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_26785F4D573969E428EE84" id="A_26785F4D573969E428EE84"/> </td></tr></tbody></table>
This is the formula for the multiple R-squared. Use the SSE, SSR, and SST computed above to compute the multiple R-squared. 
> Rsquared <- 1 - SSE / SST 
The value is the same as the multiple R-squared reported by summary.
> Rsquared
<table class="txc-table" width="140px" cellspacing="0" cellpadding="0" border="0" style="border: none; border-collapse: collapse; font-family: gulim; font-size: 12px; width: 140px;"><tbody><tr><td style="width: 139px; height: 24px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/2569C93B57397B8704" class="txc-image" actualwidth="128" hspace="1" vspace="1" border="0" width="128" exif="{}" data-filename="zzzzz.jpg" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_2569C93B57397B8704B5EB" id="A_2569C93B57397B8704B5EB"/> </td></tr></tbody></table>
 There is also the adjusted R-squared, whose formula is shown below.
<table class="txc-table" width="700" cellspacing="0" cellpadding="0" border="0" style="border:none;border-collapse:collapse;;font-family:gulim;font-size:12px"><tbody><tr><td style="width:700px;height:24px;border-bottom:1px solid #ccc;border-right:1px solid #ccc;border-top:1px solid #ccc;border-left:1px solid #ccc;;"><img src="https://t1.daumcdn.net/cfile/cafe/27468544573979910B" class="txc-image" actualwidth="814" hspace="1" vspace="1" border="0" width="814" exif="{}" data-filename="ggg.jpg" style="font-size: 9pt; line-height: 19.2px; text-align: center; clear: none; float: none;" id="A_27468544573979910BE702" id="A_27468544573979910BE702"/> </td></tr></tbody></table>
※ Reference link: <a href="http://www.statisticshowto.com/adjusted-r2/" target="_blank" class="tx-link" style="font-size: 9pt; line-height: 1.6;">http://www.statisticshowto.com/adjusted-r2/</a> 
Use the multiple R-squared computed above to compute the adjusted R-squared. With n = 20 observations and p = 1 predictor, 19 is n - 1 and 18 is n - p - 1.
> AdjustedRsquared <- 1 - (1 - Rsquared) * 19 / 18 
The value is the same as the adjusted R-squared reported by summary. 
> AdjustedRsquared<table class="txc-table" width="142px" cellspacing="0" cellpadding="0" border="0" style="line-height: 1.6; border: none; border-collapse: collapse; width: 142px;"><tbody><tr><td style="width: 141px; height: 24px; border: 1px solid rgb(204, 204, 204);"><img src="https://t1.daumcdn.net/cfile/cafe/213DF03F57397BD802" class="txc-image" actualwidth="130" hspace="1" vspace="1" border="0" width="130" exif="{}" data-filename="12345.jpg" style="font-size: 9pt; line-height: 19.2px; clear: none; float: none;" id="A_213DF03F57397BD802987F" id="A_213DF03F57397BD802987F"/> </td></tr></tbody></table>
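
The whole R-squared calculation can be checked against summary in one short script. This is a sketch under stated assumptions: the data frame is a synthetic 20-row stand-in for regression.csv, the helper is named sum_sq here (hypothetical, it does the same job as the sigma function in the post), and SST is computed directly from the mean so that the identity SST = SSE + SSR can be tested rather than assumed.

```r
# Synthetic stand-in for regression.csv (assumption: 20 rows, columns X and Y)
set.seed(1)
data <- data.frame(X = 1:20)
data$Y <- 2 + 0.5 * data$X + rnorm(20)

model <- lm(Y ~ X, data = data)
predictedY <- predict(model, data)

# Sum of squares helper (same role as sigma in the post)
sum_sq <- function(term) sum(term^2)

SSR <- sum_sq(predictedY - mean(data$Y))   # explained variation
SSE <- sum_sq(data$Y - predictedY)         # residual variation
SST <- sum_sq(data$Y - mean(data$Y))       # total variation, computed directly

n <- nrow(data)   # number of observations
p <- 1            # number of predictors (only X)

Rsquared <- 1 - SSE / SST
AdjustedRsquared <- 1 - (1 - Rsquared) * (n - 1) / (n - p - 1)

# Each comparison should print TRUE
print(all.equal(SST, SSE + SSR))
print(all.equal(Rsquared, summary(model)$r.squared))
print(all.equal(AdjustedRsquared, summary(model)$adj.r.squared))
```

Writing the adjustment as (n - 1) / (n - p - 1) instead of the literal 19 / 18 keeps the calculation valid if the number of rows or predictors changes.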