감자의 Data Lab 📊

[멋쟁이사자처럼 (Likelion) Data Analysis Bootcamp, Cohort 5] Machine Learning Overview and Review

감자슈니 2025. 6. 10. 17:47

0. Learning objective

Skim the core ideas of machine learning.


1. What is machine learning?

  • Definition:
    Machine learning is an AI technique in which a program learns patterns from data and makes predictions, without being explicitly programmed for the task.
  • Background / why it emerged:
    It emerged to tackle complex problems (e.g., face recognition, recommender systems) that traditional rule-based programs could not solve.
  • Use cases:
    • YouTube recommendations
    • Self-driving cars
    • Disease prediction
    • Recruitment automation, and more

2. Hands-on machine learning with Python

1) Import the required libraries

# Basics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model evaluation
from sklearn.metrics import accuracy_score

# Algorithm to use
from sklearn.linear_model import LogisticRegression

2) Load the data for training and prediction

iris_df = pd.read_csv('data/iris.csv')
iris_df

Today we will use the iris dataset.
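If `data/iris.csv` is not at hand, a similar frame can be built from the iris data bundled with scikit-learn. This is only a sketch: the CSV's exact layout is not shown in this post, so the feature column names and the string-valued `target` column below are assumptions chosen to resemble it.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data (features, integer labels, class names).
iris = load_iris()

# Assumed layout: four numeric feature columns plus a string 'target' column.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = [str(iris.target_names[i]) for i in iris.target]

print(iris_df.shape)
print(iris_df['target'].unique())
```

With a frame like this, the rest of the walkthrough can be followed without the CSV file, although column names may differ from the original.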

 

3) EDA (exploratory data analysis)

EDA is where we get a feel for the data: its types, categories, and shape.

# Just print info() for a quick look
iris_df.info()

 


The column to predict, target, has dtype object, i.e., it holds strings.
Most machine learning models learn from numeric columns, however,
so it has to be converted to numeric data.
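Beyond `info()`, a few more pandas calls make this kind of check explicit. The sketch below uses a tiny stand-in frame (`toy_df`, a hypothetical name with illustrative values only) rather than the real `iris_df`, and lists the columns that would need encoding.

```python
import pandas as pd

# Tiny stand-in for iris_df; the values are illustrative only.
toy_df = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3],
    'target': ['setosa', 'versicolor', 'virginica'],
})

# Columns that are not numeric will need encoding before modeling.
non_numeric = toy_df.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)

# How many rows fall in each class, and how many values are missing.
print(toy_df['target'].value_counts())
print(toy_df.isna().sum())
```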

 

4) Data preprocessing

Encoding and standardization.


1️⃣ Label Encoding

Assigns one integer to each category.

ex)

Value   Code
Red     0
Blue    1
Green   2
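To make the idea concrete, here is a minimal hand-rolled sketch in plain Python. It numbers categories in order of first appearance, which reproduces the table above. Note that scikit-learn's LabelEncoder numbers classes in sorted order instead, so the same colors would get different codes there.

```python
colors = ['Red', 'Blue', 'Green', 'Blue']

# Number each category in order of first appearance, as in the table.
mapping = {}
for color in colors:
    if color not in mapping:
        mapping[color] = len(mapping)

encoded = [mapping[color] for color in colors]
print(mapping)   # {'Red': 0, 'Blue': 1, 'Green': 2}
print(encoded)   # [0, 1, 2, 1]

# Sorted order, which is how scikit-learn's LabelEncoder assigns codes.
sorted_mapping = {c: i for i, c in enumerate(sorted(set(colors)))}
print(sorted_mapping)  # {'Blue': 0, 'Green': 1, 'Red': 2}
```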

2️⃣ One-Hot Encoding

Creates one column per category; the column matching the value is set to 1 and all others to 0.

ex)

Original value   Red   Blue   Green
Red              1     0      0
Blue             0     1      0
Green            0     0      1
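The same table can be produced by hand in a few lines of plain Python. This is a sketch of the idea only; the column order is taken from the example, and in practice `pd.get_dummies` performs this expansion.

```python
colors = ['Red', 'Blue', 'Green']
categories = ['Red', 'Blue', 'Green']  # column order taken from the example

# One row per value: 1 in the matching column, 0 everywhere else.
one_hot = [[1 if value == cat else 0 for cat in categories] for value in colors]

for value, row in zip(colors, one_hot):
    print(value, row)
```

Unlike label encoding, this does not imply any ordering between categories, at the cost of one extra column per category.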

Let's encode the target column of iris_df with LabelEncoder!

# Label encoder
encoder1 = LabelEncoder()
# Fit: learn the set of classes in the column
encoder1.fit(iris_df['target'])
# Transform: convert each label to its integer code
df_enc1 = encoder1.transform(iris_df['target'])
df_enc1

➡️ The values in target are now labeled 0, 1, and 2.

 

What if you want to restore the encoded values?

# Restore the original labels
df_enc2 = encoder1.inverse_transform(df_enc1)
df_enc2
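To see which label received which code, the fitted encoder's `classes_` attribute can be inspected: the position of a label in that array is its code. The sketch below uses a short made-up list of species names instead of the real column, and `fit_transform` as a shortcut for the separate fit and transform calls.

```python
from sklearn.preprocessing import LabelEncoder

# Short made-up label list standing in for iris_df['target'].
labels = ['setosa', 'virginica', 'versicolor', 'setosa']

enc = LabelEncoder()
codes = enc.fit_transform(labels)  # fit and transform in one call

print(list(enc.classes_))                  # index in this list == assigned code
print(list(codes))                         # [0, 2, 1, 0]
print(list(enc.inverse_transform(codes)))  # back to the original strings
```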

3️⃣ Standardization

Standardization removes differences in scale between features by rescaling each one to mean 0 and standard deviation 1.
One feature might range over 0~1 and another over 0~10,000, so to compare all features on a similar footing
they need to be standardized!
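The arithmetic behind this is z = (x - mean) / std for each feature. A minimal sketch with made-up numbers follows; it uses the population standard deviation, which is also what StandardScaler uses.

```python
from statistics import mean, pstdev

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up numbers

mu = mean(values)       # 5.0
sigma = pstdev(values)  # 2.0 (population standard deviation)

# Standardize: subtract the mean, divide by the standard deviation.
z = [(v - mu) / sigma for v in values]
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# Undo it: multiply by the standard deviation, add the mean back.
restored = [zi * sigma + mu for zi in z]
print(restored)
```

After the transform the values have mean 0 and standard deviation 1, regardless of the original unit.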

Let's standardize the four columns other than target.

# Drop the target column; only the features are standardized
X = iris_df.drop('target', axis=1)
X
# Standardization
scaler1 = StandardScaler()
# Fit: learn each column's mean and standard deviation
scaler1.fit(X)
# Transform: rescale using the learned statistics
scaler_df = scaler1.transform(X)
scaler_df

 

What if you want to undo the standardization?

# Restore the original scale
scaler_df2 = scaler1.inverse_transform(scaler_df)
scaler_df2

 

 

5) Modeling - training / prediction

# Split into training and evaluation data
x = scaler_df
y = df_enc1

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=0)

# Train the classification model
md = LogisticRegression()
md.fit(train_x, train_y)

# Predict
pred = md.predict(test_x)
pred

# Evaluate
accuracy_point = accuracy_score(test_y, pred)
accuracy_point
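One caveat about the flow above: `scaler1` was fitted on all rows before the split, so the evaluation rows influenced the mean and standard deviation used for scaling. A common alternative is to split first and put the scaler and model in a pipeline, so the scaler is fitted on training rows only. The sketch below assumes scikit-learn's bundled iris data as a stand-in for the CSV, and the `stratify` argument is an addition not used in this post.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled iris data as a stand-in for data/iris.csv (labels are already 0, 1, 2).
X, y = load_iris(return_X_y=True)

# Split first so the evaluation rows never touch the scaler's statistics.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The pipeline fits the scaler on train_x only, then fits the model.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(train_x, train_y)

# At predict time, the scaler learned from the training rows is reused.
pred = model.predict(test_x)
acc = accuracy_score(test_y, pred)
print(acc)
```

On a dataset as small and clean as iris the difference in score is usually negligible, but the habit matters on real data.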

 

In keywords, the modeling process is:

split into training/evaluation data - create the model - fit the model on the training data - predict on the held-out data - evaluate

in that order.
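The final evaluation step here uses accuracy, which is simply the fraction of predictions that match the true labels. A sketch with made-up labels shows what `accuracy_score` computes:

```python
test_y = [0, 1, 2, 2, 1, 0]  # made-up true labels
pred = [0, 1, 1, 2, 1, 0]    # made-up predictions

# Count the positions where prediction and truth agree.
correct = sum(t == p for t, p in zip(test_y, pred))
accuracy = correct / len(test_y)

print(correct, len(test_y))  # 5 6
print(accuracy)
```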

Everything so far is only the very, very basic skeleton,
and with a real dataset things will get much more complicated.

In the next post I plan to study the evaluation metrics and write them up.


🪽 Reflections

I have done machine learning several times while preparing for certifications and entering Dacon competitions on my own, but it still feels like I have a long way to go.
There is so much to learn, and there are many concepts I still do not fully understand.
And since it fundamentally requires understanding the underlying math, it makes my head hurt, haha.

Still, I want to work with more varied datasets and definitely do a project of my own.
I also can't wait to learn random forests, my favorite!

 

Source: 멋쟁이사자처럼, 소프트캠퍼스