Introduction to XGBoost

The Tech Platform
Feb 6, 2021

Extreme Gradient Boosting with XGBoost!

XGBoost is an optimized gradient boosting machine learning library. It is written in C++ and has APIs in several other languages. The core XGBoost algorithm is parallelizable: the work of building a single tree is spread across CPU cores. Some of the advantages of using XGBoost are:

It is one of the most powerful algorithms with high speed and performance.
It can harness all the processing power of modern multicore computers.
It is feasible to train on large datasets.
It often outperforms single-algorithm methods, particularly on tabular data.

Here is a simple code example to illustrate the basics of XGBoost.

Code:

Step 1: Import all the necessary libraries. For XGBoost we need to import the “xgboost” library; pandas is used to read the file in the next step.

import numpy as np
import pandas as pd
import xgboost as xgb 
from sklearn.model_selection import train_test_split

Step 2: Load the dataset and split it into a matrix of samples by features called X and a vector of targets called y, as shown below:

#Load dataset
class_data = pd.read_csv("classification.csv")

#Split dataset into features (X) and target vector (y)
X, y = class_data.iloc[:,:-1], class_data.iloc[:,-1]

Step 3: Split the dataset for training and testing (here 80% training and 20% testing), and then instantiate the XGBClassifier, since the output needs to be a class label, either 1 or 0. Two of its hyperparameters are ‘objective’, which specifies the learning task (here “binary:logistic”, meaning logistic regression for binary classification, which outputs a probability), and ‘n_estimators’, which sets the number of decision trees.

#Splitting
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=123)

#Instantiate
xg_cl=xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, random_state=123)

Step 4: Fit the model, predict on the test set, and then calculate the accuracy.

#Fit the model
xg_cl.fit(X_train,y_train)

#Predict on the test set
preds=xg_cl.predict(X_test)

#Accuracy
accuracy=float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f"%(accuracy))

accuracy: 0.78333

Let's dive more deeply into this super powerful algorithm!

XGBoost is usually used with decision trees as the base learners. A decision tree is composed of a series of binary questions, and the final prediction happens at a leaf. XGBoost is itself an ensemble method: trees are constructed iteratively, each one added to correct the errors of the trees built so far, until a stopping criterion is met.
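The iterative construction described above can be illustrated with a toy sketch. This is not XGBoost's actual tree learner: it assumes squared-error loss, a single numeric feature, and one-split trees (stumps), and the names fit_stump, predict_stump, and boost are made up for illustration.

```python
# Toy gradient boosting for squared-error regression with one-split trees.
# Illustrative only: XGBoost's real tree learner is far more sophisticated.

def fit_stump(xs, residuals):
    """Find the single threshold on xs that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    """Answer one binary question and return the score stored in that leaf."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=10, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    predictions = [0.0] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
        # Mean squared error on the training data after this round
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return stumps, losses


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1]
stumps, losses = boost(xs, ys)
print("training loss per round:", [round(loss, 4) for loss in losses])
```

The fixed number of rounds plays the role of the stopping criterion here, and the training loss should shrink as stumps are added.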

XGBoost uses CART (Classification and Regression Trees) decision trees. CART trees contain a real-valued score in each leaf, regardless of whether they are used for classification or regression. The real-valued scores can then be converted to categories for classification, if necessary.
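As a rough sketch of how real-valued leaf scores can become categories: for a binary logistic objective, the scores from the trees are summed, passed through a sigmoid to get a probability, and then thresholded. The helper names and the leaf scores below are hypothetical, and the sketch ignores details such as the model's base score.

```python
import math


def sigmoid(score):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def classify(leaf_scores, threshold=0.5):
    """Sum one leaf score per tree, then convert to a probability and a label."""
    raw_score = sum(leaf_scores)
    probability = sigmoid(raw_score)
    label = 1 if probability >= threshold else 0
    return probability, label


# Hypothetical leaf scores that three trees assign to two different samples
prob_pos, label_pos = classify([0.4, 0.3, -0.1])
prob_neg, label_neg = classify([-0.6, -0.2, 0.1])
print(label_pos, label_neg)
```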

Model Evaluation in XGBoost

Here, we will look at model evaluation using cross-validation. So, what is cross-validation?

Cross-validation

Cross-validation is a robust method for estimating the performance of a model on unseen data. It generates many non-overlapping train/test splits of the training data and reports the average test-set performance across all splits.
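To make the splitting idea concrete, here is a minimal pure-Python sketch of k-fold splitting. It is only an illustration of the concept, not how XGBoost implements it internally; the function names k_fold_indices and cross_validate and the dummy scoring function are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=123):
    """Return (train_indices, test_indices) pairs with non-overlapping test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


def cross_validate(n_samples, n_folds, score_fn):
    """Average a user-supplied score over all train/test splits."""
    scores = [score_fn(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


splits = k_fold_indices(n_samples=10, n_folds=4)
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

In practice score_fn would train a model on the training indices and return its metric on the test indices; every sample ends up in exactly one test fold.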

Here is an example using XGBoost's cv function:

#Necessary imports

import xgboost as xgb
import pandas as pd

#Loading example dataset
churn_data=pd.read_csv("classification_data.csv")

#Dataset Conversion to DMatrix
churn_dmatrix=xgb.DMatrix(data=churn_data.iloc[:,:-1], label=churn_data.month_5_still_here)

#Specifying parameters
params= {"objective" : "binary:logistic", "max_depth" : 4}

#Performing Cross-validation
cv_results=xgb.cv(dtrain=churn_dmatrix, params=params, nfold=4, num_boost_round=10, metrics="error", as_pandas=True)

#Accuracy
print("Accuracy: %f"%((1-cv_results["test-error-mean"]).iloc[-1]))

Below are the steps involved in the above code:

The first two statements are the necessary imports. Next, the dataset is loaded with pandas. The dataset is then converted into a DMatrix, an optimized data structure that the creators of XGBoost made and that gives the package its performance and efficiency gains. In order to use the XGBoost cv function, which is part of XGBoost’s learning API, we first have to explicitly convert our data into a DMatrix.

The parameter dictionary is created to pass into cross-validation; this is necessary because the cv function otherwise has no idea what kind of XGBoost model is being used. The call to cv passes in the DMatrix object storing all the data, the parameter dictionary, the number of cross-validation folds, how many trees to build, the metric to compute, and whether the output should be returned as a pandas DataFrame. The last statement converts the error metric to accuracy, which came out to be 0.88315.

XGBoost vs Gradient Boosting

XGBoost is a more regularized form of gradient boosting. It uses L1 and L2 regularization, which improves the model's ability to generalize.
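One way to see the effect of the two penalties is through the score assigned to a single leaf. The sketch below follows the closed-form leaf weight as I understand it from the XGBoost derivation, where G and H are the summed first and second derivatives of the loss over the samples in the leaf. The function name leaf_weight is made up for this example; reg_lambda (L2) and reg_alpha (L1) are named after the corresponding XGBClassifier parameters.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf score for summed gradient G and hessian H under L1/L2 penalties."""
    # L1 (alpha) soft-thresholds the gradient sum: small sums become exactly 0
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    # L2 (lambda) enlarges the denominator, pulling the score toward 0
    return -shrunk / (hess_sum + reg_lambda)


print(leaf_weight(-4.0, 3.0, reg_lambda=0.0))                 # no regularization
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0))                 # L2 only
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=1.0))  # L1 and L2
```

Both penalties shrink the leaf score, so each individual tree contributes a more conservative correction.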

XGBoost typically trains faster than a plain gradient boosting implementation. Training is parallelized across CPU cores and can also be distributed across clusters.

When to use XGBoost?

When there is a large number of training samples: ideally more than 1000 training samples and fewer than 100 features, or more generally when the number of features is smaller than the number of training samples.
When there is a mixture of categorical and numeric features or just numeric features.