How XGBoost works:
Reference link:
1. Prepare the dataset
This uses the classic breast cancer dataset: 569 samples with 30 features.
Breast cancer dataset:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from xgboost import plot_tree

# Load the local CSV copy of the dataset
df = pd.read_csv('../data/data/breast_cancer.csv')

# The "label" column is the prediction target
label = df.loc[:, "label"]
print(label)

# The first 30 columns are the features
features = df.iloc[:, :30]
print(features)

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=3)
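If the local CSV file is not available, the same data can probably be obtained from the copy bundled with scikit-learn. The following is a minimal sketch under that assumption: `load_breast_cancer(as_frame=True)` needs scikit-learn 0.23 or newer plus pandas, and renaming the target to `label` is only done here to match the variable names used above. The 0/1 encoding of the bundled copy may differ from the one in the CSV.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Bundled copy of the Wisconsin breast cancer dataset (assumed equivalent to the CSV)
data = load_breast_cancer(as_frame=True)
features = data.data                  # DataFrame with the 30 feature columns
label = data.target.rename("label")   # Series; in this copy 0 = malignant, 1 = benign

# Same split settings as in the tutorial
X_train, X_test, y_train, y_test = train_test_split(
    features, label, test_size=0.2, random_state=3)

print(features.shape, len(X_train), len(X_test))
```

With `test_size=0.2` the test set size is rounded up, so 569 rows are split into 455 training rows and 114 test rows.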
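2. Train the model
sklearn provides interfaces to a wide range of machine learning algorithms, which makes them much easier to apply (XGBClassifier follows the same fit/predict interface). That said, before relying on sklearn or similar libraries, it is still worth working through the formulas yourself and writing the algorithm from scratch.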
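model = XGBClassifier(
    learning_rate=0.01,    # shrinkage applied to each tree's output
    n_estimators=10,       # number of boosted trees
    max_depth=4,           # maximum depth of each tree
    min_child_weight=1,    # minimum sum of hessians required in a child node
    gamma=0.,              # minimum loss reduction required to make a split
    subsample=1,           # fraction of rows sampled per tree
    colsample_bytree=1,    # fraction of features sampled per tree
    scale_pos_weight=1,    # weight of the positive class
    random_state=27,
    verbosity=1,           # 1 = print warnings only
)
model.fit(X_train, y_train)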
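To connect the `gamma` and `min_child_weight` arguments with the formulas mentioned above, here is a small from-scratch sketch of the leaf weight and split gain used in XGBoost's regularized objective, with log loss as the training loss. The function names are hypothetical and not part of the xgboost API. It assumes L2 regularization `reg_lambda = 1` (the library default) and that every sample starts from a raw score of 0, i.e. a predicted probability of 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_hess(y_true, raw_score):
    """Gradient and hessian of log loss with respect to the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)

def leaf_weight(G, H, reg_lambda=1.0):
    """Optimal leaf output: w* = -G / (H + lambda)."""
    return -G / (H + reg_lambda)

def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split; a split is only worthwhile if this is > 0."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Toy node: 4 positives go left, 4 negatives go right, all at raw score 0
left = [grad_hess(1, 0.0) for _ in range(4)]
right = [grad_hess(0, 0.0) for _ in range(4)]
GL, HL = sum(g for g, _ in left), sum(h for _, h in left)
GR, HR = sum(g for g, _ in right), sum(h for _, h in right)

gain = split_gain(GL, HL, GR, HR, gamma=0.0)
w_left, w_right = leaf_weight(GL, HL), leaf_weight(GR, HR)

# min_child_weight = 1 requires each child's hessian sum to be at least 1
min_child_weight = 1
split_allowed = min(HL, HR) >= min_child_weight

print(gain, w_left, w_right, split_allowed)
```

In this toy case each child has a hessian sum of exactly 1, so the split just passes the `min_child_weight=1` check; with only two samples per child it would be rejected. In the trained model, each leaf weight is additionally multiplied by `learning_rate` before being added to the prediction.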
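3. Visualize the model:
Here the tree is drawn directly with matplotlib; you can also use the graphviz library (plot_tree itself needs the graphviz package to be installed).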
plot_tree(model)
plt.show()
4. Predict with the model
y_pred = model.predict(X_test)
print("Accuracy : %.4g" % metrics.accuracy_score(y_test, y_pred))
y_train_proba = model.predict_proba(X_train)[:,1]
print("AUC Score (Train): %f" % metrics.roc_auc_score(y_train, y_train_proba))
y_proba = model.predict_proba(X_test)[:,1]
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test, y_proba))
Results:
AUC is used here as the evaluation metric, alongside accuracy:
Accuracy : 0.9123
AUC Score (Train): 0.995224
AUC Score (Test): 0.923986
The accuracy is about 91%, and the test AUC (0.924) is noticeably lower than the training AUC (0.995).
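The AUC values reported above can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes that quantity by brute force over all positive/negative pairs. `pairwise_auc` is a hypothetical helper written for illustration, not the implementation behind `metrics.roc_auc_score`, and it assumes binary labels encoded as 0 and 1.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0    # positive ranked above negative
            elif p == n:
                wins += 0.5    # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75
```

The double loop is quadratic in the number of samples, which is fine for a test set of this size; library implementations typically use sorting instead.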