Evolution of Time Dimension in OLTP Systems 🚀


Mix.install([
  {:scholar, "~> 0.3.1"},
  {:exla, "~> 0.7.3"},
  {:nx, "~> 0.7.3", override: true},
  {:explorer, "~> 0.8.3"},
  {:stb_image, "~> 0.6.9"},
  {:scidata, "~> 0.1.8"},
  {:req, "~> 0.5.2"},
  {:kino, "~> 0.13.2"},
  {:kino_vega_lite, "~> 0.1.13"},
  {:tucan, "~> 0.3.1"}
])

Setup

This notebook introduces the KMeans clustering algorithm through three different use cases. Let’s set up some aliases:

alias Scholar.Cluster.KMeans
require Explorer.DataFrame, as: DF

And let’s configure EXLA as our default backend (where our tensors are stored) and compiler (which compiles Scholar code) across the notebook and all branched sections:

Nx.global_default_backend(EXLA.Backend)
Nx.Defn.global_default_options(compiler: EXLA)
key = Nx.Random.key(42)

Iris Dataset

The first example uses the Iris Dataset, one of the most renowned datasets. It consists of 150 records describing three iris species: Iris Setosa, Iris Virginica, and Iris Versicolor. Our task will be to predict the species of given flowers.

Firstly, we load the data, then we split it into Training Data (x) and Target (y) and cast those into Nx tensors.

df = Explorer.Datasets.iris()
x = df |> DF.discard(["species"]) |> Nx.stack(axis: 1)

y =
  df[["species"]]
  |> DF.dummies(["species"])
  |> Nx.stack(axis: 1)
  |> Nx.argmax(axis: 1)

{x, y}

Exploratory Data Analysis

An important part of a Data Science workflow is the so-called Exploratory Data Analysis (EDA). EDA helps us understand the data better and suggests efficient strategies for solving the problem. No single course of action defines good EDA, but it should contain tabular summaries and plots showing relations between features.

We start our EDA by finding the mean values of each feature by species.

grouped_data = DF.group_by(df, "species")

DF.summarise(
  grouped_data,
  petal_length: mean(petal_length),
  petal_width: mean(petal_width),
  sepal_width: mean(sepal_width),
  sepal_length: mean(sepal_length)
)

We see that petal_length and petal_width are the most distinguishing features. Let’s explore them a little bit more.

Tucan.histogram(df, "petal_length", color_by: "species")
|> Tucan.facet_by(:column, "species")
|> Tucan.Scale.set_y_domain(0, 55)
|> Tucan.set_size(200, 200)
|> Tucan.set_title("Histograms of petal_length column by species", offset: 25, anchor: :middle)

Tucan.scatter(df, "petal_length", "petal_width", filled: true, color_by: "species")
|> Tucan.set_size(300, 300)
|> Tucan.set_title("Scatterplot of data samples projected on plane petal_width x petal_length",
  offset: 25
)

Tucan.scatter(df, "petal_length", "petal_width")
|> Tucan.facet_by(:column, "species")
|> Tucan.set_title(
  "Scatterplot of data samples projected on plane petal_width x petal_length by species",
  offset: 25
)

Now we have a better understanding of the data. Iris species have different petal widths and petal lengths: Iris Setosa has the smallest petals, Versicolor is medium-sized, and Virginica has the largest. To check that our analysis is correct, we can draw the so-called Elbow plot, which shows Inertia against the number of clusters. If the curve has a characteristic elbow, that is a strong suggestion for the right number of clusters. Let’s train KMeans models for every number of clusters from 1 to 11.

clusterings = 1..11

models =
  for num_clusters <- clusterings do
    KMeans.fit(x, num_clusters: num_clusters, key: key)
  end

inertias = for model <- models, do: Nx.to_number(model.inertia)
Tucan.lineplot([num_clusters: clusterings, inertia: inertias], "num_clusters", "inertia",
  x: [type: :nominal, axis: [label_angle: 0]],
  title: "Elbow Plot"
)
|> Tucan.Axes.set_xy_titles("Number of Clusters", "Inertia")
|> Tucan.set_size(600, 300)

As you can see, we have the elbow when the number of clusters equals three. So this value of the parameter seems to be the best.
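For reference, the inertia on the y-axis of the Elbow plot is the sum of squared distances from each sample to its closest centroid. Written out, with notation introduced only for this note ($x_i$ for the $n$ samples and $\mu_k$ for the $K$ centroids):

```latex
\text{inertia} = \sum_{i=1}^{n} \min_{1 \le k \le K} \lVert x_i - \mu_k \rVert^2
```

Adding clusters can only keep this value the same or lower it, which is why we look for the point where the curve stops dropping sharply rather than for its minimum.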

In order to compare our clustering with the target labels, we need to ensure our clusters are in a matching order.

defmodule Iris.Clusters do
  import Nx.Defn

  defn sort_clusters(model) do
    # We sort clusters by the first coordinate
    order = Nx.argsort(model.clusters[[.., 0]])
    # The argsort of a permutation is its inverse: old label -> new label
    labels_mapping = Nx.argsort(order)

    %{
      model
      | labels: Nx.take(labels_mapping, model.labels),
        clusters: Nx.take(model.clusters, order)
    }
  end
end

best_model = Enum.at(models, 2)
best_model = Iris.Clusters.sort_clusters(best_model)
accuracy = Scholar.Metrics.Classification.accuracy(best_model.labels, y)
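The relabeling in `sort_clusters` applies `argsort` twice: once to order the centroids, and once more to turn that ordering into a map from old labels to new ones. The sketch below reproduces the idea on plain lists so it can be followed by hand. The `argsort` helper and all values are made up for illustration and are not part of the notebook or of Nx.

```elixir
# Plain-Elixir stand-in for Nx.argsort: the indices that sort the list ascending.
argsort = fn list ->
  list
  |> Enum.with_index()
  |> Enum.sort_by(fn {value, _index} -> value end)
  |> Enum.map(fn {_value, index} -> index end)
end

# Hypothetical first coordinates of three centroids, in the order KMeans returned them.
first_coords = [6.8, 5.0, 5.9]

# order[new_label] = old_label
order = argsort.(first_coords)

# Inverse permutation: mapping[old_label] = new_label
mapping = argsort.(order)

# Hypothetical labels assigned to four samples before sorting.
labels = [0, 1, 2, 0]
relabeled = Enum.map(labels, fn old -> Enum.at(mapping, old) end)

IO.inspect({order, mapping, relabeled})
```

The old cluster 0 has the largest first coordinate, so its samples end up with label 2 after sorting.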

Accuracy is nearly 90% - that’s pretty decent! Let’s look at our results plotted on one of the previous plots.

coords = [
  cluster_petal_length: best_model.clusters[[.., 2]] |> Nx.to_flat_list(),
  cluster_petal_width: best_model.clusters[[.., 3]] |> Nx.to_flat_list()
]

Tucan.layers([
  Tucan.scatter(df, "petal_length", "petal_width", color_by: "species", filled: true),
  Tucan.scatter(coords, "cluster_petal_length", "cluster_petal_width",
    filled: true,
    point_size: 100,
    point_color: "green"
  )
])
|> Tucan.set_size(300, 300)
|> Tucan.set_title(
  "Scatterplot of data samples projected on plane petal_width x petal_length with calculated centroids",
  offset: 25
)

As we expect 😎

Clustering of pixel colors

Another interesting use case of KMeans clustering is pixel clustering. This technique replaces each pixel with the centroid of its cluster, so pixels with similar colors (similar in terms of Euclidean distance between RGB values) end up with the same color.
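As a minimal sketch of the repainting step only, the snippet below assigns a few pixels to their nearest centroid using plain tuples. The centroids here are fixed by hand as an assumption, whereas KMeans learns them from the image, and all names and values are illustrative.

```elixir
# Squared Euclidean distance between two RGB triples.
sq_dist = fn {r1, g1, b1}, {r2, g2, b2} ->
  (r1 - r2) * (r1 - r2) + (g1 - g2) * (g1 - g2) + (b1 - b2) * (b1 - b2)
end

# Hypothetical centroids (a two-color palette) and three pixels.
centroids = [{250, 250, 250}, {10, 10, 10}]
pixels = [{255, 240, 245}, {0, 20, 5}, {200, 210, 190}]

# Repaint: each pixel becomes its nearest centroid.
repainted =
  Enum.map(pixels, fn pixel ->
    Enum.min_by(centroids, fn centroid -> sq_dist.(pixel, centroid) end)
  end)

IO.inspect(repainted)
```

The two light pixels collapse to the light centroid and the dark pixel to the dark one, so the image needs only as many distinct colors as there are centroids.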

Let us start by loading the reference image.

url =
  "https://pix4free.org/assets/library/2021-01-12/originals/san_francisco_california_golden_gate_bridge_water.jpg"

%{body: raw_image} = Req.get!(url)
image = StbImage.read_binary!(raw_image)

{height, width, _num_channels} = image.shape
image = StbImage.resize(image, div(height, 3), div(width, 3))
shape = image.shape

image_kino = image |> StbImage.to_binary(:jpg) |> Kino.Image.new(:jpeg)

Now we will try to use only ten colors to represent the same picture.

x = image |> StbImage.to_nx() |> Nx.reshape({:auto, 3})

model =
  KMeans.fit(x,
    num_clusters: 10,
    num_runs: 10,
    max_iterations: 200,
    key: key
  )

repainted_x = Nx.take(model.clusters, model.labels)

tensor_to_image = fn x ->
  x
  |> Nx.reshape(shape)
  |> Nx.round()
  |> Nx.as_type({:u, 8})
  |> StbImage.from_nx()
  |> StbImage.to_binary(:jpg)
  |> Kino.Image.new(:jpeg)
end

repainted_x = tensor_to_image.(repainted_x)

Notice that even though we use only ten colors, the image is still clearly recognizable as the same picture. Let’s experiment further: we will try 5, 10, 15, 20 and 40 colors and compare the processed images with the original one.

clusterings = [5, 10, 15, 20, 40]

models =
  for num_clusters <- clusterings do
    KMeans.fit(x, num_clusters: num_clusters, key: key)
  end

image_boxes =
  for {model, num_clusters} <- Enum.zip(models, clusterings) do
    repainted_x = Nx.take(model.clusters, model.labels)

    image_kino = tensor_to_image.(repainted_x)

    Kino.Layout.grid(
      [Kino.Markdown.new("### Number of colors: #{num_clusters}"), image_kino],
      boxed: true
    )
  end

image_box =
  Kino.Layout.grid(
    [Kino.Markdown.new("### Original image"), image_kino],
    boxed: true
  )

Kino.Layout.grid(image_boxes ++ [image_box], columns: 2)

Notice that even with only five colors we can recognize the Golden Gate Bridge in the image. With 40 colors we keep almost all details except in the sky and the water surface. Sky and water do not map well because their colors change along a subtle gradient. Pixel clustering is a great way to compress images drastically with only a small change in their appearance.

Clustering images from Fashion-MNIST

The last example is a clustering problem on the Fashion-MNIST Dataset. The dataset consists of 60,000 images, each 28 by 28 pixels, showing ten different kinds of clothing. Let’s dive into this clustering problem.

Before we start, we define the StratifiedSplit module. The module trims the input data so that every class keeps the same number of samples.

defmodule StratifiedSplit do
  import Nx.Defn

  defn trim_samples(x, labels, opts \\ []) do
    opts = keyword!(opts, [:num_classes, :samples_per_class])

    num_classes = opts[:num_classes]
    samples_per_class = opts[:samples_per_class]

    membership_mask = Nx.iota({1, num_classes}) == Nx.reshape(labels, {:auto, 1})

    indices =
      membership_mask
      |> Nx.argsort(axis: 0, direction: :desc)
      |> Nx.slice_along_axis(0, samples_per_class, axis: 0)
      |> Nx.flatten()

    {Nx.take(x, indices), Nx.take(labels, indices)}
  end
end
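To make the intended result of `trim_samples` concrete, here is a rough plain-list analogue that keeps a fixed number of samples per class. It is only a sketch: the module above works on tensors with a membership mask and `Nx.argsort`, so the order of its output may differ, and the `trim_samples` function and sample data below are hypothetical.

```elixir
# Keep at most `per_class` samples for each class label in 0..num_classes-1.
trim_samples = fn samples, num_classes, per_class ->
  Enum.flat_map(0..(num_classes - 1), fn class ->
    samples
    |> Enum.filter(fn {_features, label} -> label == class end)
    |> Enum.take(per_class)
  end)
end

# Hypothetical {features, label} pairs: three of class 0 and two of class 1.
samples = [{:a, 0}, {:b, 1}, {:c, 0}, {:d, 1}, {:e, 0}]

trimmed = trim_samples.(samples, 2, 2)
IO.inspect(trimmed)
```

After trimming, both classes contribute exactly two samples, so no class dominates the clustering input.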

First, we load the data and cast it into Nx tensors.

{image_data, labels_data} = Scidata.FashionMNIST.download()

{images_binary, images_type, images_shape} = image_data
{num_samples, _num_channels = 1, image_height, image_width} = images_shape

images =
  images_binary
  |> Nx.from_binary(images_type)
  |> Nx.reshape({num_samples, :auto})
  |> Nx.divide(255)

{labels_binary, labels_type, _shape} = labels_data
target = Nx.from_binary(labels_binary, labels_type)

num_classes = 10
samples_per_class = 20

{images, target} =
  StratifiedSplit.trim_samples(images, target,
    num_classes: num_classes,
    samples_per_class: samples_per_class
  )

num_images = num_classes * samples_per_class

Let’s also define a function that will visualize an image in the tensor format for us.

tensor_to_kino = fn x ->
  x
  |> Nx.reshape({image_height, image_width, 1})
  # Replicate the value into 3 channels for PNG
  |> Nx.broadcast({image_height, image_width, 3})
  |> Nx.multiply(255)
  |> Nx.as_type({:u, 8})
  |> StbImage.from_nx()
  |> StbImage.resize(112, 112)
  |> StbImage.to_binary(:png)
  |> Kino.Image.new(:png)
end

Here is one of the images.

tensor_to_kino.(images[0])

We will try some different numbers of clusters and then measure the quality of clustering.

nums_clusters = 2..20

models =
  for num_clusters <- nums_clusters do
    KMeans.fit(images, num_clusters: num_clusters, key: key)
  end

data = [
  num_clusters: nums_clusters,
  inertia: for(model <- models, do: Nx.to_number(model.inertia))
]

Tucan.lineplot(data, "num_clusters", "inertia",
  x: [type: :ordinal, axis: [label_angle: 0]],
  width: 600,
  height: 300
)
|> Tucan.Axes.set_xy_titles("Number of Clusters", "Inertia")
|> Tucan.Scale.set_y_domain(4800, 11500)
|> Tucan.set_title("Elbow Plot")

Notice that this time there is no elbow in the plot, so we need a different method to estimate the number of clusters. We will use the Silhouette Score, a metric that indicates the quality of a clustering: the higher the score, the better the clustering. However, we should be aware that the Silhouette Score is just a heuristic and does not always work.
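As a reminder of what the metric measures, the standard definition is sketched below. The symbols are notation for this note only: $a(i)$ is the mean distance from sample $i$ to the other samples in its own cluster, and $b(i)$ is the smallest mean distance from sample $i$ to the samples of any other cluster.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
\text{silhouette score} = \frac{1}{n} \sum_{i=1}^{n} s(i)
```

Each $s(i)$ lies between $-1$ and $1$. Values near $1$ mean a sample sits well inside its cluster, and values near $0$ mean it lies between two clusters.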

silhouette_scores =
  for {model, num_clusters} <- Enum.zip(models, nums_clusters) do
    Scholar.Metrics.Clustering.silhouette_score(images, model.labels, num_clusters: num_clusters)
    |> Nx.to_number()
  end

data = [num_clusters: nums_clusters, silhouette_scores: silhouette_scores]

Tucan.lineplot(data, "num_clusters", "silhouette_scores",
  points: true,
  point_color: "darkBlue",
  x: [type: :ordinal, axis: [label_angle: 0]]
)
|> Tucan.Axes.set_xy_titles("Number of Clusters", "Silhouette score")
|> Tucan.Scale.set_y_domain(0.088, 0.205)
|> Tucan.set_size(600, 300)
|> Tucan.set_title("Silhouette score vs Number of Clusters")

As we can see, the model with num_clusters equal to 3 has the highest Silhouette Score. Now we will visualize this clustering.

best_num_clusters = 3
best_model = Enum.at(models, 1)

predicted_cluster_with_indices =
  best_model.labels
  |> Nx.to_flat_list()
  |> Enum.with_index()
  |> Enum.group_by(&elem(&1, 0), &elem(&1, 1))

for cluster <- 0..(best_num_clusters - 1) do
  indices = predicted_cluster_with_indices[cluster]

  boxes =
    for index <- indices do
      original_cluster = Nx.to_number(target[index])

      Kino.Layout.grid([
        Kino.Markdown.new("Original cluster: #{original_cluster}"),
        tensor_to_kino.(images[index])
      ])
    end

  Kino.Layout.grid(
    [
      Kino.Markdown.new("## Cluster #{cluster}"),
      Kino.Layout.grid(boxes, columns: 5)
    ],
    boxed: true
  )
end
|> Kino.Layout.grid()

Oops, it doesn’t look right! That’s because our algorithm for three clusters gathers images by colors rather than shapes. To spot this, let’s plot the average image of each cluster.

for cluster <- 0..(best_num_clusters - 1) do
  indices = predicted_cluster_with_indices[cluster]

  mean_image =
    indices
    |> Enum.map(&images[&1])
    |> Nx.stack()
    |> Nx.mean(axes: [0])

  tensor_to_kino.(mean_image)
end
|> Kino.Layout.grid(columns: 3)

One of the images has a vertical line (something like trousers), the next image is almost all white (similar to a jumper), and the last one is mostly black. This time the Silhouette Score turns out not to be the best indicator. To get a better clustering, try rerunning the code with a higher number of clusters.
