멋쟁이사자처럼 ໒(⊙ᴗ⊙)७✎

[Likelion (멋쟁이사자처럼) Data Analysis Bootcamp, Cohort 5] Practicing Web Crawling

감자슈니 2025. 4. 30. 17:22

 

0. Learning goals

I want to practice, on my own (emphasis on "on my own"), the web crawling I learned in the data collection class.
I will work from the code the instructor gave us, plus GPT.

Scrape two pages by myself!!


1. Scraping the table of contents of the library reference on the Python website


Page to crawl:

 

The Python Standard Library

While The Python Language Reference describes the exact syntax and semantics of the Python language, this library reference manual describes the standard library that is distributed with Python. It...

docs.python.org

 

I will crawl the table-of-contents entries shown in blue, together with their links, and save them to a CSV file.

 

# Import libraries
import bs4
import requests
import pandas as pd
import numpy as np
import os
from IPython.display import clear_output

# Function that requests the page source
def getSource(site):
    header_info = {
        'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36'
    }

    response = requests.get(site, headers = header_info)

    # Create the bs4 object
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    return soup

 

📌 Summary of the getSource(site) function

  1. Set the header information
    • Set the User-Agent so that the request looks like it comes from a normal browser.
  2. Send the HTTP request
    • Use requests.get() to send a GET request to the given site URL and receive the response.
  3. Create the BeautifulSoup object
    • Parse response.text with the lxml parser to create a BeautifulSoup object (soup).
  4. Return the parsed result
    • Return the soup object that was created.

 


Now I need to extract the part I want from the source I received.
I only need the table-of-contents data, so I find that part in the browser's developer tools (fn+F12)
and use Copy selector on it.

 

< Access order >

#the-python-standard-library > div (wrapper around the whole table of contents)
⬇️
#the-python-standard-library > div > ul (table-of-contents text)
⬇️
#the-python-standard-library > div > ul > li:nth-child(1) > a (link)


# Collect & save the data
def getData(soup, file_name) :
    # Select the whole block that contains the data
    a1 = soup.select_one('#the-python-standard-library > div')

    # Dictionary used to hold the collected data
    data_dict = {
        '제목' :[],   # title
        '링크' :[],   # link
    }

    li_list = a1.select('ul > li')

    for li in li_list:
        a_tag = li.select_one('a')
        # Append only when an <a> tag was found, so title and link are always defined
        if a_tag :
            title = a_tag.text.strip()
            link = a_tag.get('href')
            data_dict['제목'].append(title)
            data_dict['링크'].append(link)


    df1 = pd.DataFrame(data_dict)
    # display(df1)

    # Save: a new file gets the header row
    if not os.path.exists(file_name) :
        df1.to_csv(file_name, encoding = 'utf-8-sig', index = False)

    # If the file already exists, append without the header
    else :
        df1.to_csv(file_name, encoding = 'utf-8-sig', index = False, header = False, mode = 'a')

 

📌 Summary of the getData(soup, file_name) function

  1. Get the HTML element
    • Use soup.select_one('#the-python-standard-library > div') to select the main content area.
  2. Create an empty dictionary
    • Initialize data_dict with lists that will hold '제목' (title) and '링크' (link).
  3. Extract the list items
    • Get all ul > li elements inside the selected div.
  4. Extract the title and link from each item
    • Find the <a> tag in each li, take its text (title) and href (link), and add them to data_dict.
  5. Create the DataFrame
    • Convert the collected dictionary into a pandas.DataFrame so it is laid out as a table.
  6. Save to a CSV file
    • If the file does not exist: save a new file (with header).
    • If the file exists: append to the existing file (no header).
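
The save step (item 6) can also be sketched with only the standard library. This is a minimal illustration, not the code used in this post: `append_rows` and the file name `demo.csv` are hypothetical, and the `csv` module stands in for `DataFrame.to_csv`.

```python
import csv
import os
import tempfile


def append_rows(file_name, fieldnames, rows):
    """Write rows to a CSV file, adding the header only when the file is new."""
    is_new = not os.path.exists(file_name)
    # utf-8-sig writes a BOM so spreadsheet apps detect UTF-8;
    # it is used only when the file is created, plain utf-8 when appending
    encoding = "utf-8-sig" if is_new else "utf-8"
    with open(file_name, "a", newline="", encoding=encoding) as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory: two calls, one header
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    append_rows(demo_path, ["title", "link"], [{"title": "A", "link": "a.html"}])
    append_rows(demo_path, ["title", "link"], [{"title": "B", "link": "b.html"}])
    with open(demo_path, newline="", encoding="utf-8-sig") as f:
        saved_rows = list(csv.DictReader(f))

print(saved_rows)
```

Because an existing file is always appended to, running the crawler twice stores every row twice, so it may be worth deleting the old CSV before a fresh run.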

 

Now let's run the code we put together.

soup = getSource('https://docs.python.org/3/library/index.html')
getData(soup, 'Pydata.csv')

 

I could confirm that the Pydata.csv file was created!
The titles and links were saved properly, just as intended. 👍
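
One possible follow-up: if the stored href values are relative paths such as `intro.html` (an assumption about the page's markup, not something checked in this post), they can be turned into full URLs with `urllib.parse.urljoin`. The two sample hrefs below are made-up examples.

```python
from urllib.parse import urljoin

# The page that was crawled; relative links are resolved against it
base_url = "https://docs.python.org/3/library/index.html"

# Example href values (assumed for illustration)
relative_links = ["intro.html", "functions.html"]

# urljoin replaces the last path segment of the base with the relative href
absolute_links = [urljoin(base_url, href) for href in relative_links]

for link in absolute_links:
    print(link)
```

The same function could be applied to the link column before saving, so that the CSV contains links that open directly.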

 


 

2. Crawling the post titles, categories, and dates from my Tistory blog

 

1) Basic setup (same as in problem 1)

import bs4
import requests
import pandas as pd
import numpy as np
import os
import time
from IPython.display import clear_output

# Function that requests the page source
def getSource(site):
    header_info = {
        'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36'
    }

    # Send the request.
    response = requests.get(site, headers=header_info)
    # Create the bs4 object.
    ## BeautifulSoup parses and structures HTML or XML documents so that data can be extracted easily
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    return soup

 

< Access order >

#mArticle (the whole area containing the posts)
⬇️
#mArticle > div (each individual post)
⬇️
#mArticle > div > a.link_post > strong (post title)
⬇️
#mArticle > div:nth-child(3) > div > a (category)
⬇️
#mArticle > div:nth-child(3) > div > span.txt_date (date written)


2) Data collection function

# Function that collects and saves the data of one page
def getData(soup, file_name):

    # Select the whole block that contains the data
    a1 = soup.select_one('#mArticle')

    # List that will hold the collected data
    data_list = []

    
    # Get the tags inside it
    a2 = a1.select('div')
    for div in a2 :
        # Get the title
        # #mArticle > div:nth-child(3) > a.link_post > strong
        a3 = div.select_one('a.link_post > strong')
        title = a3.text.strip() if a3 else None


        # Get the category
        # #mArticle > div:nth-child(3) > div > a
        a4 = div.select_one('a')
        category = a4.text.strip() if a4 else None


        # Get the date written
        # #mArticle > div:nth-child(3) > div > span.txt_date
        a5 = div.select_one('span.txt_date')
        date = a5.text.strip() if a5 else None

        
        # Put the data into the list
        data_list.append({
            '제목': title,        # title
            '카테고리': category,  # category
            '작성날짜': date       # date written
        })
        

    # Build a DataFrame from the list of dictionaries.
    df1 = pd.DataFrame(data_list)
    #display(df1)


    # Save to a file. (The index is not saved)
    if os.path.exists(file_name) == False:
        df1.to_csv(file_name, encoding = 'utf-8-sig', index = False)

    else:
        df1.to_csv(file_name, encoding = 'utf-8-sig', index = False, header = None, mode = 'a')

 

This time the site has multiple pages,
so I also need a function that gets the address of the next page.

# Function that gets the address of the next page
def getNextPage(soup) :

    # Get the tag of the Next button.
    next_tag = soup.select_one('#mArticle > div.area_paging > span > a.btn_next')
    print(next_tag)

    if next_tag != None :
        # Get the value of the href attribute.
        # This is the link to the next page
        href = next_tag.get('href')
        if href :
            return href
        else :
            return None
    
    else:
        return None

# Address of the site to collect from
site = 'https://tobepotato.tistory.com/'

# Value that indicates which page to collect
# Empty at first
page = ''

while True :
    time.sleep(1)

    # Clear what was printed before.
    clear_output(wait=True)

    print(f'{site}{page} collecting...')

    # Request the page
    soup = getSource(site + page)

    # Collect and save the data from the current page.
    getData(soup, 'tistoryData.csv')

    # Get the next page information.
    page = getNextPage(soup)

    if page == None :
        print('Collection complete')
        break
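
The paging loop can be sketched without any network access. This is only an illustration under assumptions: `crawl_all`, `fetch`, and `max_pages` are hypothetical names, the `/?page=2` href format is assumed rather than taken from the blog's markup, and `urljoin` replaces plain string concatenation so that a leading slash in the href does not produce a double slash.

```python
from urllib.parse import urljoin


def crawl_all(start_url, fetch, max_pages=50):
    """Follow next-page hrefs until there are none, or max_pages is reached."""
    url = start_url
    visited = []
    while url is not None and len(visited) < max_pages:
        visited.append(url)
        next_href = fetch(url)  # returns the next href, or None on the last page
        url = urljoin(url, next_href) if next_href else None
    return visited


# Fake site for the demo: maps each URL to the href of its "next" button
fake_next = {
    "https://example.com/": "/?page=2",
    "https://example.com/?page=2": "/?page=3",
    "https://example.com/?page=3": None,
}

pages = crawl_all("https://example.com/", fake_next.get)
for p in pages:
    print(p)
```

In a real crawl, `fetch` would request the page, save its rows, and return what getNextPage finds; the `max_pages` cap is just a guard against an endless loop.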

 

But..
when I run the code above? The CSV file is a mess... 💢💢💢

 

< Problems 🚨 >

1. Instead of one row of title-category-date per post, each post comes out as two rows: one with the title and one with category-date.
2. The "previous" (prev) button is being scraped as well.
3. There are many None values, so there are many blank cells.


< Fixing it ✅ >

First, to collect only the data that is not None:

if title and category and date :
    data_list.append({
        '제목': title,
        '카테고리': category,
        '작성날짜': date
    })


I wrapped the append in a condition
so that a row is stored only when title, category, and date are all not None.
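
The effect of that condition can be checked on a few hand-made rows. The sample data and English keys below are invented for illustration. Note that `if title and category and date` is a truthiness test, so it drops empty strings as well as None.

```python
# Hand-made sample rows (invented): one complete, two incomplete
rows = [
    {"title": "Post A", "category": "Demo", "date": "2025. 4. 30."},
    {"title": None, "category": "Demo", "date": "2025. 4. 30."},
    {"title": "", "category": None, "date": None},
]

# Keep a row only when every field is truthy (not None and not empty)
complete_rows = [row for row in rows if all(row.values())]

print(len(rows), "rows before,", len(complete_rows), "row(s) after")
```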

< Problems 🚨 >
1. Instead of one row of title-category-date per post, each post comes out as two rows: one with the title and one with category-date.
2. The "previous" (prev) button is being scraped as well.
3. There are many None values, so there are many blank cells. (solved ❗️)

 

And when fetching the inner tags, the lines that strip whitespace from both ends:

title = a3.text.strip() if a3 else None
category = a4.text.strip() if a4 else None
date = a5.text.strip() if a5 else None

I deleted all of the lines above.
The reason was to look at the raw output first and see what needed fixing.

With that, the file was created!!
The prev and next buttons are gone, and each post comes out as a single row.

But now the problem is that HTML code comes attached before and after each value..

< Problems 🚨 >
1. Instead of one row of title-category-date per post, each post comes out as two rows: one with the title and one with category-date. (solved ❗️)
2. The "previous" (prev) button is being scraped as well. (solved ❗️)
3. There are many None values, so there are many blank cells. (solved ❗️)
4. HTML tags are printed before and after the values (new problem 💢)

 

What GPT said when I asked...

a4 = div.select_one('a')

It says the problem starts with how the a tag is selected.

 

✅ Quick summary:

  1. The page being crawled has no clearly marked category text,
  2. and the a tag only acts as an image or post link.
  3. So even with .text, the result is empty or meaningless text.

 

...or so it says. So instead of selecting 'a', I selected 'a.link_cate',
and finally got the result I wanted!
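
Why 'a' fails and 'a.link_cate' works can be reproduced on a small hand-written snippet. The HTML below is hypothetical, only modeled on the selectors in the access order above (the class names `list_content` and `detail_info` are invented), and it uses the built-in `html.parser` so that lxml is not required. It suggests that `select_one('a')` simply returns the first `<a>` inside the block, which is the post link, and that `select('div')` also matches nested divs, which would explain the two rows per post.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the selectors listed in this post
html = """
<div id="mArticle">
  <div class="list_content">
    <a href="/1" class="link_post"><strong>Post A</strong></a>
    <div class="detail_info">
      <a href="/category/demo" class="link_cate">Demo category</a>
      <span class="txt_date">2025. 4. 30.</span>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
article = soup.select_one("#mArticle")

# select('div') matches every descendant div, including the nested info block
all_divs = article.select("div")
# '#mArticle > div' keeps only the direct children, one per post
post_divs = soup.select("#mArticle > div")

post = post_divs[0]
first_a = post.select_one("a")           # first <a> in document order: the post link
cate_a = post.select_one("a.link_cate")  # the <a> whose class is link_cate

print(len(all_divs), len(post_divs))
print(first_a.text.strip())
print(cate_a.text.strip())
print(str(cate_a))  # the Tag itself renders as markup; .text is only the label
```

The last line also hints at where stray HTML in a CSV can come from: storing the Tag object instead of its `.text` writes the whole element, tags included.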

What is link_cate? It is the part shown in red: the class attribute of the category link.

<a href="/category/%EB%A9%8B%EC%9F%81%EC%9D%B4%EC%82%AC%EC%9E%90%EC%B2%98%EB%9F%BC%20%E0%BB%92%28%E2%8A%99%E1%B4%97%E2%8A%99%29%E0%A5%AD%E2%9C%8E" class="link_cate">멋쟁이사자처럼 ໒(⊙ᴗ⊙)७✎</a>

< Problems 🚨 >
1. Instead of one row of title-category-date per post, each post comes out as two rows: one with the title and one with category-date. (solved ❗️)
2. The "previous" (prev) button is being scraped as well. (solved ❗️)
3. There are many None values, so there are many blank cells. (solved ❗️)
4. HTML tags are printed before and after the values (new problem 💢) (solved ❗️)

 

Pasting the code as well would make this post too long, so I'll upload the Py file instead (for my records)

tistoryData.csv
0.00MB
티스토리 데이터 수집.ipynb
0.00MB

I did it in the end, so kudos to me!!

 


💭 Reflections

The first project wasn't very hard, since the instructor gave us an easy site and I only had to follow the existing code,
but the second project I did entirely on my own, so whenever I got stuck it was really hard.
In particular, I assumed the category could be selected with just the a tag, and never thought about the class attribute.
Anyway, clean crawling would have been hard without GPT ㅜ

Still, since I fought my way through this crawl, I think the next project will go more smoothly ^^
Things you learn the hard way really do stick longer 😌

 

Source: 멋쟁이사자처럼 (Likelion), 감자슈니