Introduction
In this project, I used XGBoost to predict investment risk levels based on various economic indicators. A key highlight is that I did not remove outliers, relying instead on XGBoost's robustness to handle them. This post covers the dataset, model building, and evaluation.
Dataset overview
The dataset contains 14 key financial and economic features that are important in assessing investment risk. These features are:
- Capital adequacy ratio (%), 5-year average
- GDP per capita (USD)
- Gross External Debt (% of GDP), 5-year average
- Growth of consumer prices (%), 5-year average
- Growth of population (%), 5-year average
- Growth of Real GDP (%), 5-year average
- Growth of Real GDP per capita (%), 5-year average
- Loan-to-deposit ratio (%), 5-year average
- Net External Debt (% of GDP), 5-year average
- Nominal GDP (USD bn)
- Non-performing loans (% of gross loans), 5-year average
- Proportion of gross domestic investment to GDP (%), 5-year average
- Proportion of gross domestic saving to GDP (%), 5-year average
- Unemployment rate (% of labour force), 5-year average
The target variable (Risk Level) indicates whether a country's investment risk is low or high.
These features and the target are used to train the XGBoost model to predict whether the investment risk is low or high. XGBoost's ability to handle outliers simplified preprocessing, as I chose not to remove any outliers.
Missing values
I addressed missing values in the dataset using the MICE (Multiple Imputation by Chained Equations) method. MICE performs imputations in a way that preserves the relationships between variables, ensuring that missing values are filled in without distorting the underlying data patterns. This produced a more robust dataset for training the XGBoost model without having to remove any outliers. Here is the code:
## Data cleaning: missing value imputation with the Multiple Imputation by Chained Equations (MICE) method
library(mice)
# Impute with predictive mean matching (pmm), producing m = 5 imputed datasets
imputed_data <- mice(data, m = 5, method = 'pmm', seed = 123)
# Extract the first completed dataset
complete_data <- complete(imputed_data, 1)
head(complete_data)
Outliers
I found a significant number of outliers in my data, but I decided not to remove them. Instead, I relied on XGBoost's robustness to handle them effectively. Because XGBoost is based on decision trees, which split on value thresholds rather than distances, it is much less sensitive to extreme values than algorithms like linear regression, which can be heavily influenced by outliers. By keeping the outliers in the data, I aimed to test how well the model could perform without extensive data cleaning.
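A toy illustration (not from the original analysis) of why tree-based splits are insensitive to outlier magnitude: a threshold split depends only on the ordering of the values, so making an extreme point even more extreme does not change which side of any split each observation falls on.

```r
# An outlier (100) versus a far more extreme version of it (1e6)
x       <- c(1, 2, 3, 4, 100)
x_worse <- c(1, 2, 3, 4, 1e6)

# Both vectors have identical ranks, so any threshold-based split
# (e.g. x < 2.5) partitions them in exactly the same way
identical(rank(x), rank(x_worse))  # TRUE
```

This is the ordering property that lets tree ensembles like XGBoost tolerate extreme values that would distort a least-squares fit.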
Data Scaling
To help the model handle outliers more effectively, I applied Robust Scaling to the dataset. This technique transforms the data using the median and the interquartile range (IQR), making it less sensitive to extreme values than standard approaches like min-max or z-score normalization. Robust Scaling kept the information carried by the outliers intact while putting the features on a comparable scale, which helped the performance of the XGBoost model without discarding any useful information.
## Data scaling using Robust Scaling (median / IQR)
feature_cols <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11", "X12", "X13", "X14")
predictors <- complete_data[, feature_cols]
# Note: caret's preProcess(method = c("center", "scale")) standardises with the
# mean and standard deviation; robust scaling uses the median and IQR instead
robust_scale <- function(x) (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
predictors_scaled <- as.data.frame(lapply(predictors, robust_scale))
complete_data[, feature_cols] <- predictors_scaled
head(complete_data)
Modeling
In the modeling phase, I used XGBoost to build a predictive model. First, I trained the model on the training dataset, allowing XGBoost to learn the patterns and relationships between the features and the target variable (investment risk level). Once the model was trained, I evaluated its performance on the testing dataset to assess how well it generalizes to unseen data. This approach helps ensure that the model is not overfitting and can make accurate predictions on new, real-world data. Here is the code for modeling on the training and test data:
## Modeling: using XGBoost (keeping the outlier data)
library(xgboost)
library(caret)

# Separate the predictor variables (features) from the target variable (Risk_level)
complete_data_numeric <- complete_data
# Encode any character columns as numeric factor codes
for (i in colnames(complete_data_numeric)) {
  if (is.character(complete_data_numeric[[i]])) {
    complete_data_numeric[[i]] <- as.numeric(as.factor(complete_data_numeric[[i]]))
  }
}
predictors <- as.matrix(complete_data_numeric[, c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11", "X12", "X13", "X14")])
target <- complete_data_numeric$Risk_level
train_data_matrix <- as.matrix(complete_data_numeric[, -which(names(complete_data_numeric) == "Risk_level")])
train_label <- as.numeric(target)

# Building the XGBoost model: wrap the training data in a DMatrix
dtrain <- xgb.DMatrix(data = train_data_matrix, label = train_label)

# XGBoost parameters for binary classification, using binary logistic loss
parameters_xgb <- list(
  objective = "binary:logistic",
  eval_metric = "error",
  max_depth = 6,
  eta = 0.3
)

# Sanity check: binary:logistic requires labels coded as 0/1
unique(train_label)

# Train the model for 100 rounds
model <- xgb.train(params = parameters_xgb, data = dtrain, nrounds = 100,
                   watchlist = list(train = dtrain), verbose = 1)
## Testing the XGBoost model on the testing data
library(xgboost)

# Preprocessing for the testing data (same encoding as for training)
complete_data_testing <- data_testing
for (i in colnames(complete_data_testing)) {
  if (is.character(complete_data_testing[[i]])) {
    complete_data_testing[[i]] <- as.numeric(as.factor(complete_data_testing[[i]]))
  }
}

# Convert the testing data into a matrix
test_data_matrix <- as.matrix(complete_data_testing)

# Create a DMatrix for the testing data
dtest <- xgb.DMatrix(data = test_data_matrix)

# Make predictions with the trained model
predictions <- predict(model, dtest)

# Convert predicted probabilities to classes (binary classification)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

# Save the prediction results in a data frame for further analysis
results <- data.frame(Predicted_Risk_Level = predicted_classes)

# Prediction results
print(results)
Model Evaluation
In the model evaluation, I found that the XGBoost classifier showed perfect performance, with an accuracy of 100%. The confusion matrix showed no misclassifications: all samples from both classes were predicted correctly. Sensitivity, specificity, precision, and the other related metrics also reached 100%, indicating that the model successfully distinguished between the classes.
However, it is important to note that class imbalance is present in the dataset: the majority of samples belong to class 0, with only one sample in class 1. This could affect the model's generalizability on larger datasets. Still, despite the presence of outliers, XGBoost performed exceptionally well in this analysis, making it a robust choice for binary classification. The full model summary can be found at the RPubs link below.
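The post does not show the evaluation code itself; a minimal sketch of how these metrics can be computed with caret's confusionMatrix is below. The true test labels are not shown in the post, so test_label and predicted_classes here are illustrative stand-ins, not the actual data.

```r
library(caret)

# Hypothetical labels standing in for the real test set (the post's actual
# test labels are not shown); predictions match labels, mirroring the
# reported 100% accuracy
test_label        <- c(0, 0, 0, 1, 0)
predicted_classes <- c(0, 0, 0, 1, 0)

# Confusion matrix plus accuracy, sensitivity, specificity, precision, etc.,
# treating class 1 (high risk) as the positive class
conf_mat <- confusionMatrix(
  factor(predicted_classes, levels = c(0, 1)),
  factor(test_label, levels = c(0, 1)),
  positive = "1"
)
print(conf_mat)
```

With the real data, predicted_classes would come from the prediction step above and test_label from the labeled test set.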
More details can be accessed at the following link: https://rpubs.com/Alfian19/levellriskpredictionxgboost
Dataset: https://drive.google.com/drive/folders/1jmRINSevb5BkH_6qvXmPK763Ppp9-owS?usp=sharing