[Crawling] Notes on things that were hard to scrape with BeautifulSoup

돔돔이부하 2021. 11. 23. 19:16

Some HTML structures always turn out to be tricky to parse.

I'm going to collect those cases here.

I'll keep adding to this post as I continue studying crawling.

1. Getting text that sits between HTML elements without a tag of its own

<div>
	<h1>Post 1</h1>
	2021.11.23
	<p>Preview text</p>
	<h1>Post 2</h1>
	2021.11.27
	<p>Preview text</p>
</div>

Let's write code that pulls only the dates out of the HTML above.

from bs4 import BeautifulSoup

html = '''
<div>
<h1>Post 1</h1>
2021.11.23
<p>Preview text</p>
<h1>Post 2</h1>
2021.11.27
<p>Preview text</p>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
titles = soup.find_all('h1')
for title in titles:
    # The node right after each <h1> is the bare date text (a NavigableString)
    print(title.next_sibling.strip())

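Here next_sibling selects the node that immediately follows a given element. In this HTML that node is the bare text after each h1, so strip() is enough to clean it up.

CSS selectors match elements, not bare text nodes, so text like this is genuinely awkward to reach with a selector alone.

Navigating from a neighboring element, as above, makes it much easier.

The output is as follows.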

2021.11.23
2021.11.27
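One caveat: next_sibling returns whatever node comes next, which may be another tag or None rather than loose text, and calling strip() on those raises an error. The following is a minimal sketch of a more defensive loop, assuming a modified sample in which the second post has no date. It uses the built-in 'html.parser' instead of 'lxml' only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, NavigableString

# Modified sample (an assumption for illustration): "Post 2" has no date text
html = '''
<div>
<h1>Post 1</h1>
2021.11.23
<p>Preview text</p>
<h1>Post 2</h1><p>Preview without a date</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
dates = []
for title in soup.find_all('h1'):
    node = title.next_sibling
    # Accept only a non-empty text node; a Tag or None means no loose text here
    if isinstance(node, NavigableString) and node.strip():
        dates.append(node.strip())
    else:
        dates.append(None)

print(dates)  # ['2021.11.23', None]
```

Recording None for a missing date keeps the result aligned with the list of titles.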

2. Getting only the text from elements whose class names keep changing

<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h3 class="txt_21">my name is</h3>
<h4 class="txt_38a">domdomi</h4>
</div>

When the tags vary like this and the class names change irregularly, how can we get a list of the texts?

With a little imagination there are several ways to do it.

Here is the code I came up with first.

from bs4 import BeautifulSoup

html = '''
<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h3 class="txt_21">my name is</h3>
<h4 class="txt_38a">domdomi</h4>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# stripped_strings yields every descendant string with surrounding whitespace removed
texts = list(soup.find('div').stripped_strings)
print(texts)

I used stripped_strings to get the strings of every element nested inside the div. Plain strings would also work, but it also returns the newline characters that sit between the tags, so I used stripped_strings instead.

The result is as follows.

['hello', 'how', 'are', 'you', 'my name is', 'domdomi']

Only the strings we wanted came back.
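To make the difference concrete, here is a small sketch comparing strings, stripped_strings, and get_text() on a shortened version of the sample. The variable names are mine, and 'html.parser' is used in place of 'lxml' only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

html = '''
<div>
<span class="txt_01">hello</span>
<em class="txt_13">are</em>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

raw = list(div.strings)                 # also contains the '\n' between tags
clean = list(div.stripped_strings)      # whitespace-only strings are dropped
joined = div.get_text(' ', strip=True)  # the same strings joined into one str

print(raw)
print(clean)   # ['hello', 'are']
print(joined)  # hello are
```

get_text() with a separator is convenient when a single string is needed rather than a list.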

stripped_strings returns a generator object, so I converted it with list() to view the results as a list.

If you only need to print each item, the list() conversion is not necessary.

texts = soup.find('div').stripped_strings
for text in texts:
	print(text)

Just iterate over it in a loop and print each item, as above.

hello
how
are
you
my name is
domdomi
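Because the value is a generator, it can only be consumed once. The sketch below, on a one-line sample of my own, shows that a second pass over the same generator object yields nothing, which is a reason to convert to a list when the texts are needed more than once.

```python
from bs4 import BeautifulSoup

# One-line sample (an assumption for illustration)
html = '<div><span>hello</span> <em>are</em></div>'
soup = BeautifulSoup(html, 'html.parser')

texts = soup.find('div').stripped_strings  # a generator, not a list
first_pass = list(texts)    # consumes the generator
second_pass = list(texts)   # nothing is left to yield

print(first_pass)   # ['hello', 'are']
print(second_pass)  # []
```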

3. Getting only the text from elements whose class names keep changing (2)

This time, let's use the repeating pattern in the class names from the example above.

<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h3 class="txt_21">my name is</h3>
<h4 class="txt_38a">domdomi</h4>
</div>

As you can see, every class above starts with txt_.

Let's use a regular expression to select on that.

import re
from bs4 import BeautifulSoup

html = '''
<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h3 class="txt_21">my name is</h3>
<h4 class="txt_38a">domdomi</h4>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# A raw string avoids the invalid "\_" escape; "_" needs no escaping in a regex
texts = soup.find_all(attrs={'class': re.compile(r'^txt_')})
for text in texts:
    print(text.string)

The regular expression selects only the elements whose class starts with txt_.

The output is, again, as follows.

hello
how
are
you
my name is
domdomi
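A CSS attribute selector can express a similar idea, with one difference worth knowing. As a hedged sketch on a modified sample (the extra p element is my addition): [class^="txt_"] tests the whole class attribute string, while the regex filter is tested against each class individually, so the two disagree when the txt_ class is not listed first.

```python
import re
from bs4 import BeautifulSoup

# Modified sample (an assumption): the <p> has two classes, txt_ one listed second
html = '''
<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<p class="note txt_99">extra</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# The selector sees "note txt_99" as one string, which does not start with txt_
by_selector = [el.get_text() for el in soup.select('[class^="txt_"]')]
# The regex is tried on "note" and "txt_99" separately, so the <p> matches
by_regex = [el.get_text() for el in soup.find_all(class_=re.compile(r'^txt_'))]

print(by_selector)  # ['hello', 'how', 'are']
print(by_regex)     # ['hello', 'how', 'are', 'extra']
```

For pages where elements carry several classes, the regex form is usually the safer of the two.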

4. Getting several kinds of elements at once

<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h3 class="txt_21">my name is</h3>
<h4 class="txt_38a">domdomi</h4>
</div>

From the HTML above, let's get only the em, pre, and h4 tags.

from bs4 import BeautifulSoup

html = '''
<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h3 class="txt_21">my name is</h3>
<h4 class="txt_38a">domdomi</h4>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# Passing a list of tag names matches any element whose name is in the list
texts = soup.find_all(['em', 'pre', 'h4'])
for text in texts:
    print(text.string)

The result is as follows.

are
you
domdomi
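When several tag names are mixed like this, it can help to record which tag each text came from. A minimal sketch, using a shortened sample and variable names of my own, that also shows the equivalent CSS selector group:

```python
from bs4 import BeautifulSoup

html = '''
<div>
<span class="txt_01">hello</span>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h4 class="txt_38a">domdomi</h4>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Pair each matched element's tag name with its text, in document order
pairs = [(el.name, el.string) for el in soup.find_all(['em', 'pre', 'h4'])]
# A comma-separated selector group selects the same elements
selected = [el.string for el in soup.select('em, pre, h4')]

print(pairs)     # [('em', 'are'), ('pre', 'you'), ('h4', 'domdomi')]
print(selected)  # ['are', 'you', 'domdomi']
```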

5. Getting several kinds of elements at once (2)

<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h3 class="txt_21">my name is</h3>
<h4 class="txt_38a">domdomi</h4>
</div>

This time, let's write code that gets every element that has a class and whose class starts with txt_.

Unlike the earlier examples, this one uses a function.

The advantage of a function is that the matching logic can be customized further.

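import re
from bs4 import BeautifulSoup

html = '''
<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h3 class="txt_21">my name is</h3>
<h4 class="txt_38a">domdomi</h4>
</div>
'''

TXT_PREFIX = re.compile(r'^txt_')

def are_you_domdomi(class_name):
    # class_name is None when the element has no class attribute.
    # The pattern must actually be applied: returning the compiled pattern
    # itself is always truthy and would match every element that has a class.
    return class_name is not None and TXT_PREFIX.search(class_name) is not None

soup = BeautifulSoup(html, 'lxml')
texts = soup.find_all(attrs={'class': are_you_domdomi})
for text in texts:
    print(text.string)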

The output is as follows.

hello
how
are
you
my name is
domdomi
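Since every element in the sample has a txt_ class, that output alone does not show whether the predicate filters anything. The sketch below checks a class predicate against a modified sample of my own that includes an unrelated class, an element with no class, and an element with two classes. The names has_txt_class and TXT_PREFIX are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Modified sample (an assumption for illustration)
html = '''
<div>
<span class="txt_01">hello</span>
<p class="note">skip me</p>
<p>no class here</p>
<em class="big txt_13">are</em>
</div>
'''

TXT_PREFIX = re.compile(r'^txt_')

def has_txt_class(class_name):
    # Guard against None, which is passed for elements without a class
    return class_name is not None and TXT_PREFIX.search(class_name) is not None

soup = BeautifulSoup(html, 'html.parser')
matched = [el.get_text() for el in soup.find_all(class_=has_txt_class)]

print(matched)  # ['hello', 'are']
```

The em element matches because each of its classes is tested separately.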

Let's customize it further so that only elements whose text is exactly are, you, or domdomi are extracted.

from bs4 import BeautifulSoup

html = '''
<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h3 class="txt_21">my name is</h3>
<h4 class="txt_38a">domdomi</h4>
</div>
'''

def are_you_domdomi(tag):
    # 1. The element must have a class attribute
    classes = tag.get('class') or []
    # 2. At least one of its classes must start with txt_
    if not any(c.startswith('txt_') for c in classes):
        return False
    # 3. The element's text must be exactly one of these words
    return tag.string in ('are', 'you', 'domdomi')

soup = BeautifulSoup(html, 'lxml')
texts = soup.find_all(are_you_domdomi)
for text in texts:
    print(text.string)

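The code is a little more involved because it is meant as a starting point for other cases. The are_you_domdomi function checks the following:

1. Does the element have a class?
2. If it does, does it have a class that starts with txt_?
3. Is the element's string are, you, or domdomi?

An element is returned only when all three conditions hold.

Let's look at the output.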

are
you
domdomi
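If the prefix and the word list need to vary, the same checks can be wrapped in a small factory. This is only a sketch under my own assumptions: make_filter and its parameters are hypothetical names, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = '''
<div>
<span class="txt_01">hello</span>
<div class="txt_04">how</div>
<em class="txt_13">are</em>
<pre class="txt_10b">you</pre>
<h3 class="txt_21">my name is</h3>
<h4 class="txt_38a">domdomi</h4>
</div>
'''

def make_filter(prefix, wanted):
    """Build a tag predicate for find_all (hypothetical helper)."""
    wanted = set(wanted)

    def matches(tag):
        classes = tag.get('class') or []  # no class attribute gives []
        has_prefix = any(c.startswith(prefix) for c in classes)
        return has_prefix and tag.string in wanted

    return matches

soup = BeautifulSoup(html, 'html.parser')
found = [el.string for el in soup.find_all(make_filter('txt_', ['are', 'you', 'domdomi']))]

print(found)  # ['are', 'you', 'domdomi']
```

Keeping the conditions as parameters means the same predicate can be reused for other prefixes or word lists without editing the function body.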

If I come across more tips worth writing up, I'll add them here.

If there's another case you're curious about beyond the examples above, leave a comment and I'll add it!

- End -
