

Using cross-validation with gradient boosting trees

Mix.install([
  {:scholar, "~> 0.2.0"},
  {:explorer, "~> 0.6.1"},
  {:exla, "~> 0.6.0"},
  {:nx, "~> 0.6.0", override: true},
  {:exgboost, "~> 0.3"},
  {:req, "~> 0.3.9"},
  {:kino_vega_lite, "~> 0.1.9"},
  {:kino, "~> 0.10.0"},
  {:kino_explorer, "~> 0.1.7"}
])

Setup

require Explorer.DataFrame, as: DF
require Explorer.Series, as: S

In this notebook we are going to work with the Medical Cost Personal Datasets to predict the medical charges billed to each person in the dataset.

data =
  Req.get!(
    "https://gist.githubusercontent.com/meperezcuello/82a9f1c1c473d6585e750ad2e3c05a41/raw/d42d226d0dd64e7f5395a0eec1b9190a10edbc03/Medical_Cost.csv"
  ).body

df = DF.load_csv!(data)

The dataset consists of 7 columns: age, sex, BMI (body mass index), children (number of children), smoker (yes/no), region (NE, NW, SE, SW), and charges, the value we want to predict. Since the gradient boosting trees we use in this analysis accept only numerical data, we need to further process three columns, sex, smoker, and region, and encode them from categorical to numerical values.

y = DF.select(df, "charges") |> Nx.concatenate()

x =
  df
  |> DF.discard(["charges"])
  |> DF.mutate(
    sex: cast(sex, :category),
    smoker: cast(smoker, :category),
    region: cast(region, :category)
  )
  |> Nx.stack(axis: 1)

{x, y}
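If you are curious which string labels the :category cast maps to integer codes, Explorer can list them. A quick sketch, assuming Explorer.Series.categories/1 is available in the installed version:

# Cast a single column and list the distinct categories behind its
# integer codes (the codes are what end up in the x tensor).
df
|> DF.mutate(smoker: cast(smoker, :category))
|> DF.pull("smoker")
|> S.categories()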

Before training our model, we split the data into train and test sets.

{x_train, x_test} = Nx.split(x, 0.8)
{y_train, y_test} = Nx.split(y, 0.8)
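As a quick sanity check, x and y should have matching leading dimensions in each split:

# x and y must agree on the number of rows in both the train and test splits.
{Nx.shape(x_train), Nx.shape(x_test), Nx.shape(y_train), Nx.shape(y_test)}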

Training a gradient boosting tree

Gradient boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. Let’s go through a simple regression example, using decision trees as the base predictors; this is called gradient tree boosting, or gradient boosted regression trees (GBRT).
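To make the idea concrete, here is a minimal sketch of residual fitting with two single-tree boosters trained by hand on the tensors defined above. This is only an illustration of the principle, not how EXGBoost builds its ensemble internally:

# Fit a first single-tree model to the targets.
tree1 =
  EXGBoost.train(x_train, y_train,
    booster: :gbtree,
    objective: :reg_squarederror,
    num_boost_rounds: 1
  )

# Fit a second single-tree model to the residuals left by the first.
residuals = Nx.subtract(y_train, EXGBoost.predict(tree1, x_train))

tree2 =
  EXGBoost.train(x_train, residuals,
    booster: :gbtree,
    objective: :reg_squarederror,
    num_boost_rounds: 1
  )

# The ensemble prediction is the sum of the two trees' predictions.
Nx.add(EXGBoost.predict(tree1, x_test), EXGBoost.predict(tree2, x_test))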

EXGBoost provides an implementation of gradient boosting trees that accepts a wide range of hyperparameter configurations. For the full list of hyperparameters refer to the EXGBoost docs.

y_pred =
  EXGBoost.train(
    x_train,
    y_train,
    booster: :gbtree,
    tree_method: :auto,
    objective: :reg_squarederror,
    num_boost_rounds: 100,
    evals: [{x_train, y_train, "training"}],
    verbose_eval: true
  )
  |> EXGBoost.predict(x_test)

With our predictions in hand, we can measure performance by computing the root mean squared error of the predictions with respect to the target values.
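For reference, the metric computed below is $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ is the observed charge and $\hat{y}_i$ the model's prediction, so the error is expressed in the same units as charges.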

alias Scholar.Metrics.Regression, as: Metrics

Metrics.mean_square_error(y_test, y_pred)
|> Nx.sqrt()
|> Nx.to_number()

With very little preprocessing, we get results similar to the linear regression model. However, we can improve our model evaluation process by using cross-validation.

Evaluating with cross-validation

k-fold cross-validation works by splitting the training set into $k$ smaller sets (folds), so that the model is trained on $k - 1$ folds and validated on the remaining fold. When using this technique, the performance measure is the average of the values computed in each iteration.

Scholar provides tools for performing k-fold CV.

alias Scholar.ModelSelection
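To see what k_fold_split produces before wiring it into cross-validation, here is a small sketch on a toy tensor. It assumes k_fold_split/2 yields an enumerable of {train, validation} tensor pairs, which is how we use it below:

# 10 toy rows split into 5 folds: each element pairs 8 training rows
# with the 2 validation rows held out for that fold.
Nx.iota({10, 1})
|> ModelSelection.k_fold_split(5)
|> Enum.map(fn {train, validation} -> {Nx.shape(train), Nx.shape(validation)} end)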

First, we need to define a folding function that will perform the k-folds, and also a scoring function that will train the model and evaluate performance with each split.

folding_fn = fn x -> ModelSelection.k_fold_split(x, 5) end

scoring_fn = fn x, y ->
  {x_train, x_test} = x
  {y_train, y_test} = y

  y_pred =
    EXGBoost.train(
      x_train,
      y_train,
      booster: :gbtree,
      tree_method: :auto,
      objective: :reg_squarederror,
      num_boost_rounds: 100,
      evals: [{x_train, y_train, "training"}],
      verbose_eval: true
    )
    |> EXGBoost.predict(x_test)

  Metrics.mean_square_error(y_test, y_pred)
  |> Nx.sqrt()
end

Now let’s run the cross-validation function and put the scores tensor in a series.

cv_score =
  ModelSelection.cross_validate(
    x_train,
    y_train,
    folding_fn,
    scoring_fn
  )
  |> Nx.squeeze()
  |> S.from_tensor()

Taking the average gives us the performance reported by cross-validation.

S.mean(cv_score)
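Besides the mean, the spread of the fold scores tells us how stable the model is across folds. A quick check, assuming Explorer's Series.standard_deviation/1:

S.standard_deviation(cv_score)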

Fine-tuning our model with grid search

Finding the right configuration of hyperparameters is an important part of the process when selecting a model. One could try different combinations of hyperparameter values manually, but this can get tedious and time consuming. Instead, we can use the grid search method, an iterative process for finding an optimal configuration of hyperparameter values for a given model.

First, we need to provide a “grid” of hyperparameter values, so that the algorithm can train and evaluate our model with all possible combinations.

grid = [
  booster: [:gbtree],
  objective: [:reg_squarederror],
  evals: [[{x_train, y_train, "training"}]],
  verbose_eval: [true],
  tree_method: [:approx, :exact],
  max_depth: [2, 3, 4, 5, 6],
  num_boost_rounds: [20, 50, 90],
  subsample: [0.25, 0.5, 0.75, 1.0]
]
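Note that this grid already defines 2 × 5 × 3 × 4 = 120 combinations (tree_method × max_depth × num_boost_rounds × subsample), and since each combination is evaluated on every one of the 5 folds, grid search will fit the model 600 times.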

We also need to adapt our scoring function in order to use the hyperparameter values for each grid search iteration.

gs_scoring_fn = fn x, y, hyperparams ->
  {x_train, x_test} = x
  {y_train, y_test} = y

  y_pred =
    x_train
    |> EXGBoost.train(y_train, hyperparams)
    |> EXGBoost.predict(x_test)

  Metrics.mean_square_error(y_test, y_pred)
  |> Nx.sqrt()
end

Let’s run the grid search and see the results. Remember that the more hyperparameter values you add to the grid, the longer the algorithm will take to finish.

gs_scores =
  ModelSelection.grid_search(
    x_train,
    y_train,
    folding_fn,
    gs_scoring_fn,
    grid
  )

The output is a list of maps, each corresponding to an iteration of the grid search algorithm. Every iteration yields a score calculated by our scoring function. Let’s find the set of hyperparameters that optimizes the score.

best_config =
  Enum.min_by(gs_scores, fn %{score: score} ->
    score
    |> Nx.squeeze()
    |> Nx.to_number()
  end)
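We can also read off the cross-validated RMSE of that configuration before retraining on the full training set:

best_config.score
|> Nx.squeeze()
|> Nx.to_number()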

Finally, we train and evaluate a model using the best hyperparameter configuration found by grid search.

%{hyperparameters: opts} = best_config

model = EXGBoost.train(x_train, y_train, opts)
y_pred = EXGBoost.predict(model, x_test)

rmse =
  Metrics.mean_square_error(y_test, y_pred)
  |> Nx.sqrt()

"RMSE: #{Nx.to_number(rmse)}"