Information Retrieval and Text Mining, NLP and Machine Learning | (9/14) Lab 2. How to automatically detect the "Hangul code" of an input "Hangul string" - Daum Cafe

<p><span style="font-size: 12pt;"><strong>Types of Hangul codes -- note that English text also has EBCDIC in addition to ASCII!</strong></span></p>
<p><br></p>
<p><span style="font-size: 12pt;">1) n-byte, 3-byte, commercial Johab (combination) codes, etc.</span></p>
<p><span style="font-size: 12pt;">2) KS Wansung (precomposed) code -- KS C 5601-1987, KS X 1001, <span style="color: rgb(255, 0, 0);">EUC-KR, CP949</span></span></p>
<p><span style="font-size: 12pt;">3) Unicode -- UCS2, UCS4, <span style="color: rgb(255, 0, 0);">UTF8, UTF16BE, UTF16LE</span>, UTF32</span></p>
<p><span style="font-size: 12pt;"><br></span></p>
<p><span style="font-size: 12pt;">Goal: automatically detect which of the Hangul codes below an input "Hangul string" is encoded in!</span></p>
<p><br></p>
<p><strong><span style="color: rgb(255, 0, 0);"><font size="3"><span style="color: rgb(9, 0, 255);">1. </span></font></span><span style="color: rgb(255, 0, 0);"><font size="3"><span style="color: rgb(9, 0, 255);">KS Wansung code (EUC-KR or CP949)</span></font></span></strong></p>
<p><span style="color: rgb(255, 0, 0);"><font size="3">    Each character is 2 bytes -- in EUC-KR each byte is in the range 0xA1~0xFE</font></span></p>
<p><span style="color: rgb(255, 0, 0);"><font size="3">    Of all 11,172 Hangul syllables, 2,350 are assigned the codes &lt;0xB0~0xC8, 0xA1~0xFE&gt;; CP949 adds the remaining 8,822 as extension codes in which at least one byte is below 0xA1 (e.g. '힣': 0xC652)</font></span></p>
<p><span style="color: rgb(255, 0, 0);"><font size="3"><span style="color: rgb(0, 0, 0);">        '가': 0xB0A1, '힝': 0xC8FE</span></font></span></p>
<p><span style="color: rgb(255, 0, 0);"><font size="3"><br></font></span></p>
<p><span style="color: rgb(255, 0, 0);"><font size="3"><strong><span style="color: rgb(9, 0, 255);">2. Unicode (</span><span style="color: rgb(9, 0, 255);">UTF8, UTF16BE, UTF16LE</span><span style="color: rgb(9, 0, 255);">)</span></strong></font></span></p>
<p><span style="color: rgb(255, 0, 0);"><font size="3">    All 11,172 Hangul syllables are assigned code points in order -- 0xAC00~0xD7A3</font></span></p>
<p><span style="color: rgb(255, 0, 0);"><font size="3"><br></font></span></p>
<p><span style="color: rgb(255, 0, 0);"><font size="3"><strong><span style="color: rgb(9, 0, 255);">&lt;Note&gt; Unicode</span><span style="color: rgb(9, 0, 255);"> encoding schemes</span></strong> -- UTF8, UTF16, UTF32 (the same code points are serialized as different byte sequences)</font></span></p>
<p><br></p>
<span style="font-size: 12pt;"><p><span style="font-size: 12pt;"><br></span></p>
<p><span style="font-size: 12pt;"><strong>//==== Hangul code example: sample Hangul text </strong></span></p>
<p><span style="font-size: 12pt;"><strong>// Windows -- use Notepad; Linux -- use iconv</strong></span></p>
<p><strong>가각간갇<br>힝힣</strong></p></span>
<p><br></p>
<p><u style="color: rgb(255, 0, 0); font-size: 16px;">// (Note) All hex dumps below are in byte-swapped word format: the two bytes of each 16-bit word are printed in reverse order! On Windows...</u></p>
<p><span style="font-size: 12pt;">//==== EUC-KR<span style="color: rgb(255, 0, 0);"><span style="color: rgb(0, 0, 0);"> --&gt; </span></span></span><span style="font-size: 12pt;">the actual byte stream is "</span><span style="color: rgb(9, 0, 255); font-size: 12pt;">b0 a1 b0 a2 b0 a3 ..."</span></p>
<p><span style="font-size: 12pt;">// '힣' is not among the 2,350 KS X 1001 syllables, so it appears here as the CP949 extension code c6 52.</span></p>
<p><span style="font-size: 12pt;">0000000 a1b0 a2b0 a3b0 a4b0 0a0d fec8 52c6 0a0d<br>0000020 0a0d 0a0d</span></p>
<p><span style="font-size: 12pt;"><br></span></p>
<p><span style="font-size: 12pt;">//==== UTF8</span><span style="font-size: 12pt;">  --&gt; the actual byte stream is "</span><span style="color: rgb(9, 0, 255); font-size: 12pt;">ef bb bf ea b0 80 ea b0 81 ..."</span></p>
<p><span style="font-size: 12pt;">0000000 <span style="color: rgb(255, 0, 0);">bbef</span> ea<span style="color: rgb(255, 0, 0);">bf</span> 80b0 b0ea ea81 84b0 b0ea 0d87</span><br><span style="font-size: 12pt;">0000020 ed0a 9d9e 9eed 0da3 0d0a 0d0a 000a</span></p>
<p><br></p>
<p><span style="font-size: 12pt;">//==== UTF16LE  --&gt; the actual byte stream is "</span><span style="color: rgb(9, 0, 255); font-size: 12pt;">ff fe 00 ac 01 ac 04 ac ..."</span></p>
<p><span style="font-size: 12pt;">0000000 <span style="color: rgb(255, 0, 0);">feff</span> ac00 ac01 ac04 ac07 000d 000a d79d</span><br><span style="font-size: 12pt;">0000020 d7a3 000d 000a 000d 000a 000d 000a</span></p>
<p><br></p>
<p><span style="font-size: 12pt;">//==== </span><span style="font-size: 12pt;">UTF16BE  --&gt; the actual byte stream is "</span><span style="color: rgb(9, 0, 255); font-size: 12pt;">fe ff ac 00 ac 01 ac 04 ..."</span></p>
<p><span style="font-size: 12pt;">0000000 <span style="color: rgb(255, 0, 0);">fffe</span> 00ac 01ac 04ac 07ac 0d00 0a00 9dd7</span><br><span style="font-size: 12pt;">0000020 a3d7 0d00 0a00 0d00 0a00 0d00 0a00</span></p>
<p><br></p>
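<p><span style="font-size: 12pt;">The byte patterns above suggest one possible detection strategy, sketched here in Python. This is only a heuristic under stated assumptions, not the method prescribed by the lab: the function name <code>guess_hangul_encoding</code>, the candidate order, and the scoring rule (Hangul syllables in 0xAC00~0xD7A3 minus other non-ASCII characters) are all illustrative choices. A BOM, when present, decides immediately; otherwise each candidate is decoded strictly and scored.</span></p>

```python
import codecs

# Unicode Hangul syllable range given in the notes: 0xAC00 ~ 0xD7A3
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def hangul_score(text):
    """Hangul syllables count for a decoding, other non-ASCII characters against it."""
    hangul = sum(HANGUL_FIRST <= ord(ch) <= HANGUL_LAST for ch in text)
    other = sum(ord(ch) > 0x7F for ch in text) - hangul
    return hangul - other


def guess_hangul_encoding(data):
    """Return a best-guess Python codec name for bytes holding Hangul text, or None.

    Hypothetical helper for illustration; the candidate list and tie-breaking
    order (earlier candidates win ties) are assumptions of this sketch.
    """
    # 1) A byte order mark, when present, settles the question.
    if data.startswith(codecs.BOM_UTF8):       # ef bb bf
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):   # ff fe
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # fe ff
        return "utf-16-be"

    # 2) No BOM: decode strictly with each candidate and keep the best score.
    best_name, best_score = None, 0
    for name in ("utf-8", "cp949", "utf-16-le", "utf-16-be"):
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        score = hangul_score(text)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # Byte stream from the EUC-KR/CP949 dump above (bytes in true order).
    ks_bytes = bytes.fromhex("b0a1b0a2b0a3b0a40d0ac8fec6520d0a")
    print(guess_hangul_encoding(ks_bytes))

    # Start of the UTF16LE dump above, including the BOM.
    print(guess_hangul_encoding(bytes.fromhex("fffe00ac01ac04ac")))
```

<p><span style="font-size: 12pt;">Without a BOM the result is only a guess: KS codes such as b0 a1 also fall inside 0xAC00~0xD7A3 when read as UTF16BE, so very short inputs without line breaks or ASCII characters can be genuinely ambiguous. In this sketch such ties go to the earlier candidate in the list.</span></p>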


Labs and Assignments 2017: (9/14) Lab 2. How to automatically detect the "Hangul code" of an input "Hangul string"

Posted by nlp, 17.08.04 16:42 (recommendations 0, views 133, comments 0)


License: attribution, content modification, non-commercial
