How to Use BeautifulSoup and Scrape Web Documents

Reference sites
  <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank" class="ke-link">https://www.crummy.com/software/BeautifulSoup/bs4/doc/</a>   : official documentation (English)
  <a href="http://dplex.egloos.com/category/Python" target="_blank" class="ke-link">http://dplex.egloos.com/category/Python</a>   : BeautifulSoup example
  <a href="http://lxml.de" target="_blank" class="ke-link">http://lxml.de</a>   : the lxml library, which can process large numbers of files

--- Preparation before running BeautifulSoup ---

1) Two representative modules for reading data from the web
  Method a) the requests module
      ~ >pip install requests
      Using the requests module: <a href="https://3.python-requests.org/" target="_blank" class="ke-link">https://3.python-requests.org/</a>
  Method b) the urllib module (standard library, no installation needed)
      <a href="https://docs.python.org/ko/3/library/urllib.request.html" target="_blank" class="ke-link">https://docs.python.org/ko/3/library/urllib.request.html</a>

2) Install BeautifulSoup
      ~ >pip install beautifulsoup4   or   pip install bs4

3) Install lxml. Download the lxml file that matches your installed Python version from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml" target="_blank" class="ke-link">http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml</a>, unpack it, and paste the resulting folder into ~\Lib\site-packages. On most platforms pip install lxml also installs a prebuilt package. If you installed Anaconda, this step is not needed.
▣ Summary of the advantages and disadvantages of each parser library ▣
<div class="table-wrap"><table data-ke-type="table" data-ke-align="alignLeft" style="width: 100%;" border="1" data-ke-style="style2"><tbody>
<tr><td style="width: 12.6419%; text-align: center;">Parser</td><td style="width: 40.8891%; text-align: center;">Usage</td><td style="width: 24.4954%; text-align: center;">Advantages</td><td style="width: 21.9736%; text-align: center;">Disadvantages</td></tr>
<tr><td style="width: 12.6419%; text-align: center;">html.parser</td><td style="width: 40.8891%;">BeautifulSoup(markup, "html.parser"). BeautifulSoup('&lt;a&gt;', 'html.parser') is repaired to &lt;a&gt;&lt;/a&gt; before processing.</td><td style="width: 24.4954%;">Batteries included (standard library), decent speed</td><td style="width: 21.9736%;">Not as fast as lxml, less lenient than html5lib</td></tr>
<tr><td style="width: 12.6419%; text-align: center;">lxml</td><td style="width: 40.8891%;">BeautifulSoup(markup, "lxml"). BeautifulSoup('&lt;a&gt;', 'lxml') is repaired to &lt;html&gt;&lt;body&gt;&lt;a&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt; before processing.</td><td style="width: 24.4954%;">Very fast, lenient</td><td style="width: 21.9736%;">Depends on an external C library</td></tr>
<tr><td style="width: 12.6419%; text-align: center;">xml</td><td style="width: 40.8891%;">BeautifulSoup(markup, "xml"). BeautifulSoup('&lt;a&gt;', 'xml') is processed as an XML document that starts with &lt;?xml version="1.0" encoding="utf-8"?&gt; followed by the a element.</td><td style="width: 24.4954%;">Very fast, the only XML parser supported</td><td style="width: 21.9736%;">Depends on an external C library</td></tr>
</tbody></table></div>
For HTML files, html.parser is the usual choice; lxml can be installed and used instead when speed matters.
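The repair behavior summarized in the table can be checked directly. The following is a minimal sketch that assumes only that beautifulsoup4 is installed. `parse_with` is a hypothetical helper name introduced for this illustration (it is not part of bs4); it returns None when the requested parser, for example lxml, is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Return the markup as rebuilt by the given parser, or None if that parser is unavailable."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when no installed tree builder matches the name
        return None


# Feed the same broken fragment to each parser named in the table
for name in ("html.parser", "lxml", "xml"):
    print(name, "->", parse_with("<a>", name))
```

With html.parser the fragment comes back as `<a></a>`; the lxml and xml lines print None on a machine where lxml is missing.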
* find functions provided by the BeautifulSoup module
  - find()
  - find_next()
  - find_all()

1) Find all a tags
    soup.find_all("a")
    soup("a")

2) Find all strings inside the title tag
    soup.title.find_all(string=True)
    soup.title(string=True)

3) Get only the first two p tags
    soup.find_all("p", limit=2)

4) Search by string
    soup.find_all(string="Tom")                       # strings equal to "Tom"
    soup.find_all(string=["Tom", "Elsa", "Oscar"])    # OR search
    soup.find_all(string=re.compile(r"\d\d"))         # regular expression

5) p tags whose class attribute is "title"
    soup.find_all("p", "title")

6) Find a tags and b tags
    soup.find_all(["a", "b"])

7) Get attribute values
    soup.p['class']
    soup.p['id']

8) Replace a string with another string
    tag.string.replace_with("new value")

9) Pretty-print
    soup.b.prettify()

10) Simple navigation
    soup.body.b   # first b tag under the body tag
    soup.a        # first a tag

11) Print all attributes
    tag.attrs

12) class is a reserved word in Python, so write class_ instead
    soup.find_all("a", class_="sister")

13) Check the result of find before using it
    if soup.find("div", title=True) is not None:
        i = soup.find("div", title=True)

14) Find by an attribute that starts with data-
    soup.find("div", attrs={"data-value": True})

15) Get the tag name
    soup.find("div").name

16) Get an attribute
    soup.find("div")['class']      # raises an error (KeyError) if the attribute is missing
    soup.find("div").get('class')  # returns None if the attribute is missing

17) Check whether an attribute exists
    tag.has_attr('class')
    tag.has_attr('id')        # True if present, False otherwise

18) Remove a tag but keep its contents
    a_tag.img.unwrap()

19) Add a tag
    soup.p.string.wrap(soup.new_tag("b"))
    soup.p.wrap(soup.new_tag("div"))

* select functions of BeautifulSoup
They use the same syntax as CSS selectors.
1) select_one() : returns a single result (the first match)
2) select() : returns every match as a list

Ways to get the text inside a tag
* string : .string returns the string directly inside the tag as an object, or None if there is none.
           Caution: the tag must contain exactly one child. If that child is a string, it is returned; if it is a single tag, that tag's .string is used. Any mix of text and tags gives None.
* text or get_text() : .text returns the text of all descendant tags joined into one (Unicode) string.
           Use .text when the text of child tags should be included.
           Where .string gives None, get_text() still returns a string; when there is no text at all it returns an empty string, so nothing is printed.
- get_text() is the function that outputs only the text, without the tags.
- .string gives the same result as get_text() only when the tag holds a single string.

** The example source code in this post worked as of November 2021. Keep in mind that site contents change constantly.
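Several of the calls listed above can be tried offline on a small document. This is a minimal sketch: the markup and the variable names are made up for the illustration, and only html.parser is used so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration
html = """
<div id="box" class="outer" data-value="3">
<p class="title"><b>Heading</b></p>
<p class="story">Once <a class="sister" href="/elsie" id="link1">Elsie</a>
and <a class="sister" href="/lacie" id="link2">Lacie</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all with class_, and find with a regular expression on an attribute
hrefs = [a["href"] for a in soup.find_all("a", class_="sister")]
lacie = soup.find("a", href=re.compile(r"lacie$"))

# Attribute access: [] raises KeyError when missing, get() returns None
div = soup.find("div", attrs={"data-value": True})
missing = div.get("title")

# CSS selectors: select_one gives the first match, select gives a list
first_link = soup.select_one("p.story > a")
all_links = soup.select("a.sister")

# .string versus get_text()
title_p = soup.find("p", "title")   # p tag whose class is "title"
story_p = soup.find("p", "story")
print(hrefs)
print(title_p.string)     # single child tag <b>, so its string is used
print(story_p.string)     # mixed text and tags, so None
print(story_p.get_text(" ", strip=True))
```

The last two lines show the practical difference: `.string` gives None for the paragraph that mixes text and links, while `get_text()` still returns the combined text.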
** Simple example 1)

    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def go():
        base_url = "http://www.naver.com:80/index.html"

        # Fetch the page; the response holds the headers as well as the body
        source_code = requests.get(base_url)

        # Keep only the body as plain text
        plain_text = source_code.text

        # Convert plain_text to a BeautifulSoup object so the library can search it
        convert_data = BeautifulSoup(plain_text, 'lxml')

        for link in convert_data.find_all('a', href=True):   # skip a tags without href
            href = urljoin(base_url, link['href'])   # build a clickable URL
            print(href)                              # display href

    go()

Output:
    http://www.naver.com/index.html#newsstand
    http://www.naver.com/index.html#themecast
    http://www.naver.com/index.html#timesquare
    ...

Simple example 2) Using regular expressions

    from bs4 import BeautifulSoup
    import re

    html = '''<!DOCTYPE html>
    <html><head><title>story</title></head>
    <body>BeautifulSoup Test
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister brother" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </body></html>'''

    soup = BeautifulSoup(html, 'html.parser')

    ele = soup.find('a', {'href': re.compile('.*/lacie')})
    print(ele)    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

    ele = soup.find(href=re.compile('.*/lacie'))
    print(ele)    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

    ele = soup.find('a', {'href': lambda val: val and 'lacie' in val})
    print(ele)    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

    ele = soup.find(href=lambda val: val and 'lacie' in val)
    print(ele)    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

Simple example 3) Samsung Electronics stock prices (HTML parsing) and the KOSPI index, based on BeautifulSoup

    http://finance.naver.com/item/sise_day.nhn?code=<stock code>&page=<page number>

Page numbers start at 1, and each page lists 10 entries in reverse chronological order starting from the current date.
Stock codes : Samsung Electronics 005930, LG Electronics 066570

Practice source)

    import time
    from urllib.request import urlopen   # fetch text from the web
    from bs4 import BeautifulSoup        # parse HTML
    from pandas import Series

    # Goal: regression analysis between Samsung Electronics and the KOSPI index using the latest 30 data points
    # Fetch Samsung Electronics stock prices from the web and turn them into a pandas object
    # Two functions that convert values with thousands separators into numbers
    def f(st):
        return st.replace(',', '')    # "71,500" -> "71500"

    def func(st):
        return float(st)

    stockitem = '005930'    # stock code: fetch 30 Samsung Electronics prices
    samlist = []            # list that stores the actual data
    for page in range(1, 4):
        url = 'http://finance.naver.com/item/sise_day.nhn?code=' + stockitem + '&page=' + str(page)    # build the URL
        html = urlopen(url)
        source = BeautifulSoup(html.read(), 'html.parser')
        srllists = source.find_all("tr")
        time.sleep(2)           # pause for 2 seconds
        for i in range(1, len(srllists)-1):   # separate name so the page counter is not shadowed
            if srllists[i].span is not None:
                samlist.append(srllists[i].find_all("td", class_="num")[0].text)

    #print(samlist)
    '''
    Apply the functions to every element of the list and store the results back in the list
    for i in range(len(samlist)):
        samlist[i] = f(samlist[i])
        samlist[i] = func(samlist[i])
    '''
    # Convert the list to a Series object, then apply the functions to every element
    # Series.map applies a function to each element without an explicit loop
    data1 = Series(samlist)
    data1 = data1.map(f)
    data1 = data1.map(func)
    print(data1)

*** More documentation ***
<a href="https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html" target="_blank" class="ke-link">https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html</a>

Note : practice pages for web scraping
    http://www.pythonscraping.com/pages/warandpeace.html
    http://www.pythonscraping.com/pages/page3.html

Note : when a site does not recognize the client's request as a normal one

    # If a site does not treat the client's request as normal, passing the headers= argument can solve it.
    # This does not happen everywhere; it depends on the web server.
    # Example below: the data cannot be read properly and the message "You do not have permission to access this page." appears.

    import requests         # use the requests module
    res = requests.get("URL")
    print('Response code : ', res.status_code)   # check status_code (response code): 200, 404, 500 ...
    res.raise_for_status()    # stop processing if reading the document failed
    print(res.text, ', number of characters : ', len(res.text))

    # Example 1)
    import requests
    from bs4 import BeautifulSoup

    def crawler(page):
        url = 'https://creativeworks.tistory.com/' + str(page)
        html = requests.get(url)
        html.encoding = None   # remedy for garbled Korean text
        plain_text = html.text
        print(plain_text)

        soup = BeautifulSoup(plain_text, 'lxml')
        #print(soup)

        for link in soup.find_all('h3', class_='tit_post'):
            title = link.string
            print(title)

    crawler(7)

    # Example 2)
    # With headers= set, the web server treats the request as a normal one: requests.get(url, headers={ 'User-Agent': ...
    def crawler(page):
        url = 'https://creativeworks.tistory.com/' + str(page)
        html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'})
        html.encoding = None
        plain_text = html.text
        print(plain_text)

        soup = BeautifulSoup(plain_text, 'lxml')
        #print(soup)

        for link in soup.find_all('h3', class_='tit_post'):
            title = link.string
            print(title)

    crawler(7)

* User-Agent when reading web documents with the requests module : a request header that carries identifying information about the user's software, such as the device and browser sending the HTTP request. When a document cannot be read properly, specifying this header is often what is needed.
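To see which User-Agent a script would actually send, the request can be prepared without sending it. This is a minimal sketch assuming the requests package is installed. `prepare_get` and `BROWSER_UA` are hypothetical names introduced here, and the browser string is simply the one from the crawler example; no network traffic is generated.

```python
import requests

# Browser-like value copied from the crawler example; any realistic string can be substituted
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"


def prepare_get(url, user_agent=None):
    """Build a GET request without sending it, so its headers can be inspected."""
    session = requests.Session()
    if user_agent is not None:
        # Override the session-wide default header
        session.headers["User-Agent"] = user_agent
    return session.prepare_request(requests.Request("GET", url))


default_req = prepare_get("https://example.com/")
custom_req = prepare_get("https://example.com/", BROWSER_UA)
print(default_req.headers["User-Agent"])   # the library's own identifier
print(custom_req.headers["User-Agent"])    # the browser-like value
```

The first line printed starts with `python-requests/`, which is the value some servers refuse; the second is the overridden header.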
<a href="https://velog.io/@ggong/User-agent-%EC%A0%95%ED%99%95%ED%95%98%EA%B2%8C-%ED%95%B4%EC%84%9D%ED%95%98%EA%B8%B0" target="_top" class="ke-link">https://velog.io/@ggong/User-agent-%EC%A0%95%ED%99%95%ED%95%98%EA%B2%8C-%ED%95%B4%EC%84%9D%ED%95%98%EA%B8%B0</a>
<a href="https://www.whatismybrowser.com/detect/what-is-my-user-agent" target="_top" class="ke-link">https://www.whatismybrowser.com/detect/what-is-my-user-agent</a>

* Other approach 1
    get(baseUrl, verify=False)    # turns off TLS certificate verification, so use it only for sites you trust
* Other approach 2
<a href="https://0ver-grow.tistory.com/1003" target="_blank" class="ke-link">https://0ver-grow.tistory.com/1003</a>
* Other approach 3
<a href="https://stackoverflow.com/questions/54175042/python-3-7-anaconda-environment-import-ssl-dll-load-fail-error" target="_blank" class="ke-link">https://stackoverflow.com/questions/54175042/python-3-7-anaconda-environment-import-ssl-dll-load-fail-error</a>

Note : images can also be downloaded simply with google_images_download.
<a href="https://pypi.org/project/google_images_download/" target="_blank" class="ke-link">https://pypi.org/project/google_images_download/</a>
<a href="https://hhahn.tistory.com/2" target="_blank" class="ke-link">https://hhahn.tistory.com/2</a>

    # A scraper in which the crawler moves from page to page and collects (prints) the page title,
    # the first paragraph, and the link to the edit page if there is one.
    # It follows the a tags starting from the main page of http://en.wikipedia.org

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    pages = set()

    def getLinks(pageUrl):
        #print('pageUrl : ', pageUrl)
        global pages
        html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
        bs = BeautifulSoup(html, 'html.parser')
        try:
            print(bs.h1.get_text())
            print(bs.find(id='mw-content-text').find_all('p')[0])
            print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
        except AttributeError:
            print('This page is missing something! Continuing.')

        for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
            if 'href' in link.attrs:
                if link.attrs['href'] not in pages:
                    # A new page was found
                    newPage = link.attrs['href']
                    print('-' * 20)
                    print(newPage)
                    pages.add(newPage)
                    getLinks(newPage)  # recursive call; a long crawl can hit Python's recursion limit

    getLinks('')

    # Output (with the pageUrl print enabled)
    # pageUrl :
    # Main Page
    # /wiki/Wikipedia
    # pageUrl :  /wiki/Wikipedia
    # Wikipedia
    #
    #
    # This page is missing something! Continuing.
    # --------------------
    # ......

python-scraping/crawling reference
https://github.com/REMitchell/python-scraping

Crawler solution : scrapy
<a href="https://engkimbs.tistory.com/893" target="_blank" class="ke-link">https://engkimbs.tistory.com/893</a>
<a href="https://scrapy.org/" target="_blank" class="ke-link">https://scrapy.org/</a>
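The recursive Wikipedia crawler in this post keeps calling itself for every new link, so a long run can exceed Python's recursion limit. One possible alternative is sketched below: the same /wiki/ link filter, but with a queue and a page cap instead of recursion. This is a swapped-in technique, not the approach used above. `extract_wiki_links`, `crawl`, `max_pages`, and `FAKE_SITE` are all hypothetical names, and the in-memory site stands in for real HTTP requests so the sketch runs offline.

```python
import re
from collections import deque
from bs4 import BeautifulSoup

WIKI_LINK = re.compile(r"^/wiki/")


def extract_wiki_links(html):
    """Return the href of every a tag whose href starts with /wiki/."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=WIKI_LINK)]


def crawl(start, fetch, max_pages=50):
    """Breadth-first crawl; fetch is any callable that maps a path to HTML text."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        path = queue.popleft()
        visited.append(path)
        for href in extract_wiki_links(fetch(path)):
            if href not in seen:      # enqueue each page only once
                seen.add(href)
                queue.append(href)
    return visited


# Offline stand-in for a web site (assumption: real code would download the pages)
FAKE_SITE = {
    "": '<a href="/wiki/A">A</a><a href="/wiki/B">B</a><a href="/other">x</a>',
    "/wiki/A": '<a href="/wiki/B">B</a><a href="/wiki/C">C</a>',
    "/wiki/B": '<a href="/wiki/A">A</a>',
    "/wiki/C": "",
}

print(crawl("", FAKE_SITE.__getitem__))
```

To run it against a real site, `fetch` could be replaced with a function that downloads the page (for example with urlopen, as in the crawler above) and a pause between requests.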