๐Ÿ ๋จธ์‹ ๋Ÿฌ๋‹ ๋‘ ๋ฒˆ์งธ ์—ฌ์ •: ์˜ˆ์ธก๊ณผ ๋ถ„๋ฅ˜์˜ ์„ธ๊ณ„๋กœ

Python for Machine Learning, part 2
Tags: AI · Python · Machine Learning · Linear Regression · Logistic Regression · Classification · boostcourse · Data Science
2025.05.17 · 9 min read

Introduction

I've finished the second part of the boostcourse machine learning Python course. Working through linear regression, logistic regression, and classification algorithms, I'm getting steadily closer to the practical applications of machine learning. As before, I approached it with the mindset of "understand every line and implement it in code."

4. Linear Regression: Finding Linear Relationships in Data

Linear regression models a linear relationship between independent and dependent variables: in the simplest case, ŷ = wx + b, where w is the slope and b the intercept. It is widely used to predict continuous values such as house prices or sales volume.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data (house sizes and prices)
np.random.seed(42)
house_size = np.random.normal(150, 40, 100)  # 100 house sizes: mean 150 m², std dev 40 m²
noise = np.random.normal(0, 50, 100)
house_price = 1500 * house_size + 10000 + noise  # price = 1500 × size + 10000 + noise

# ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
X = house_size.reshape(-1, 1)
y = house_price
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Coefficient (slope): {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Mean squared error (MSE): {mse:.2f}")
print(f"Coefficient of determination (R²): {r2:.2f}")

# Visualize with a plot
plt.scatter(X_test, y_test, color='black', label='Actual data')
plt.plot(X_test, y_pred, color='blue', linewidth=3, label='Regression line')
plt.xlabel('House size (m²)')
plt.ylabel('House price')
plt.title('Linear regression: predicting price from house size')
plt.legend()
plt.show()

The linear regression model is simple but powerful. In particular, a high coefficient of determination (R²) means the model explains the data well. However, if the data contains nonlinear relationships or interactions between features, you should consider polynomial regression or another nonlinear model, as in the sketch below.
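
For instance, a quick way to capture a nonlinear relationship is to expand the inputs with polynomial features and fit the same linear model on top. A minimal sketch (the quadratic data and degree=2 choice here are illustrative, not from the course):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data: y = 2x² + noise
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = 2 * x.ravel() ** 2 + rng.normal(0, 1, 200)

# Expand the input to [1, x, x²], then fit ordinary linear regression on top
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)
print(f"R² on the training data: {poly_model.score(x, y):.3f}")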

5. Logistic Regression: Classifying with Probabilities

Despite having "regression" in its name, logistic regression is actually a classification algorithm. It excels at binary classification (separating data into two classes) and converts its output into a probability between 0 and 1.
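
That conversion is done by the sigmoid (logistic) function, which squashes any real-valued score z = wx + b into the interval (0, 1). A minimal sketch of the idea:

import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Large negative scores map near 0, large positive scores near 1
for z in [-5, -1, 0, 1, 5]:
    print(f"z = {z:+d} -> sigmoid(z) = {sigmoid(z):.4f}")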

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

# Generate sample data for binary classification (e.g., exam pass/fail)
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, 
                           n_clusters_per_class=1, random_state=42)

# ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ๋ฐ์ดํ„ฐ ์Šค์ผ€์ผ๋ง (๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์—์„œ ์ค‘์š”)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]  # probability of class 1

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("Confusion matrix:")
print(conf_matrix)
print("\nClassification report:")
print(classification_report(y_test, y_pred))

# Visualize the decision boundary
def plot_decision_boundary(X, y, model, scaler):
    h = 0.02  # grid step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    # Predict over the grid points
    Z = model.predict(scaler.transform(np.c_[xx.ravel(), yy.ravel()]))
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Decision boundary of logistic regression')
    plt.show()

plot_decision_boundary(X_test, y_test, model, scaler)

Logistic regression is easy to interpret and yields probability estimates, making it useful for tasks such as risk assessment and customer churn prediction. It also supports L1 and L2 regularization to prevent overfitting.
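
As a sketch, scikit-learn exposes this through the penalty and C parameters of LogisticRegression (C is the inverse of regularization strength; the values below are illustrative, and the l1 penalty needs a solver such as liblinear):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# L2 regularization is the default; smaller C means stronger regularization
l2_model = LogisticRegression(penalty='l2', C=0.1, random_state=42).fit(X_tr, y_tr)

# L1 regularization can drive some coefficients exactly to zero
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear',
                              random_state=42).fit(X_tr, y_tr)

print(f"L2 test accuracy: {l2_model.score(X_te, y_te):.4f}")
print(f"L1 test accuracy: {l1_model.score(X_te, y_te):.4f}")
print(f"Coefficients zeroed out by L1: {(l1_model.coef_ == 0).sum()}")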

6. Classification: Exploring a Variety of Algorithms

Classification is the task of assigning data to predefined categories. Beyond binary classification, you frequently run into multi-class problems as well. Let's compare several classification algorithms.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import seaborn as sns

# ์™€์ธ ๋ฐ์ดํ„ฐ์…‹ ๋กœ๋“œ (๋‹ค์ค‘ ๋ถ„๋ฅ˜)
wine = load_wine()
X = wine.data
y = wine.target

# ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A variety of classification algorithms
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
}

# Train and evaluate each algorithm
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train_scaled, y_train)
    y_pred = clf.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    cv_scores = cross_val_score(clf, X_train_scaled, y_train, cv=5)
    results[name] = {
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }
    print(f"\n{name}:")
    print(f"Test accuracy: {accuracy:.4f}")
    print(f"Cross-validation accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
    print(classification_report(y_test, y_pred, target_names=wine.target_names))

# Compare the results visually
results_df = pd.DataFrame({
    'Algorithm': list(results.keys()),
    'Test Accuracy': [results[name]['accuracy'] for name in results],
    'CV Accuracy': [results[name]['cv_mean'] for name in results]
})

plt.figure(figsize=(12, 6))
sns.barplot(x='Algorithm', y='value', hue='variable', 
            data=pd.melt(results_df, id_vars='Algorithm', 
                          value_vars=['Test Accuracy', 'CV Accuracy']))
plt.title('Performance comparison of classification algorithms')
plt.ylim(0.7, 1.0)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Feature importance (based on the random forest)
rf = classifiers['Random Forest']
feature_importance = pd.DataFrame({
    'Feature': wine.feature_names,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance)
plt.title('Feature importance (random forest)')
plt.tight_layout()
plt.show()

Comparing these classification algorithms shows that their relative performance depends on the dataset. Model selection should weigh not only accuracy but also interpretability, training/prediction speed, and robustness to overfitting.
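
For example, training and prediction speed can be measured directly. A rough sketch that reuses the classifiers dictionary and the scaled splits from above (absolute timings will vary by machine):

import time

for name, clf in classifiers.items():
    start = time.perf_counter()
    clf.fit(X_train_scaled, y_train)  # training time
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    clf.predict(X_test_scaled)  # prediction time
    predict_time = time.perf_counter() - start

    print(f"{name}: fit {fit_time * 1000:.1f} ms, predict {predict_time * 1000:.1f} ms")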

ํ•™์Šตํ•˜๋ฉด์„œ ๋А๋‚€ ์ 

1. ์ ์ ˆํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ ํƒ์˜ ์ค‘์š”์„ฑ

  • ๋ชจ๋“  ๋ฐ์ดํ„ฐ์— ๋งŒ๋Šฅ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์—†๋‹ค

  • ๋ฐ์ดํ„ฐ ํŠน์„ฑ๊ณผ ๋ชฉ์ ์— ๋งž๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ ํƒ์ด ์ค‘์š”

  • ์—ฌ๋Ÿฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋น„๊ตํ•˜๋Š” ์Šต๊ด€์„ ๋“ค์ด์ž

2. ์ „์ฒ˜๋ฆฌ์™€ ํ”ผ์ฒ˜ ์—”์ง€๋‹ˆ์–ด๋ง์˜ ํž˜

  • ์Šค์ผ€์ผ๋ง์ด ๋ชจ๋ธ ์„ฑ๋Šฅ์— ํฌ๊ฒŒ ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค

  • ํ”ผ์ฒ˜ ์„ ํƒ๊ณผ ๊ฐ€๊ณต์— ๋” ๋งŽ์€ ์‹œ๊ฐ„์„ ํˆฌ์žํ•˜์ž

  • ๋„๋ฉ”์ธ ์ง€์‹์„ ์ ๊ทน ํ™œ์šฉํ•˜์ž

3. ๋ชจ๋ธ ํ‰๊ฐ€๋Š” ๋‹ค๊ฐ๋„๋กœ

  • ์ •ํ™•๋„๋งŒ์œผ๋กœ๋Š” ๋ถ€์กฑํ•˜๋‹ค

  • ํ˜ผ๋™ ํ–‰๋ ฌ, ์ •๋ฐ€๋„, ์žฌํ˜„์œจ, F1 ์Šค์ฝ”์–ด ๋“ฑ ๋‹ค์–‘ํ•œ ์ง€ํ‘œ๋ฅผ ์‚ดํŽด๋ณด์ž

  • ๊ต์ฐจ ๊ฒ€์ฆ์„ ํ†ตํ•ด ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ํ™•์ธํ•˜์ž

Closing Thoughts

Every machine learning algorithm has its own characteristics, strengths, and weaknesses. From linear regression to more complex classification algorithms, what matters is understanding how each model works and applying the right one in the right place. I'm building that intuition by combining theory with hands-on practice.

"Models are just tools; what ultimately matters is the problem definition and the data." - a data scientist's insight