Would you like to see your link here? Contact us

Notesclub

created by hec & contributors

terms privacy

Explorer PCA

livebooks/explorer/pca.livemd

Ryo Wakabayashi

@RyoWakabayashi

elixir-learning

Share to X

Share to Bluesky

More notebooks

Explorer PCA

Mix.install([
  {:explorer, "~> 0.9"},
  {:nx, "~> 0.9"},
  {:kino, "~> 0.15"},
  {:kino_vega_lite, "~> 0.1"}
])

Alias

alias Explorer.DataFrame
alias Explorer.Series
require Explorer.DataFrame

Load Datasets

Explorer に用意されているワインのデータセットを使用する

元データはこちら

> These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

このデータは、イタリアの同じ地域で栽培された、3種類の異なる品種に由来するワインの化学分析の結果です。この分析では、3種類のワインに含まれる13種類の成分の量を測定しています。

> The attributes are (dontated by Riccardo Leardi, riclea ‘@’ anchem.unige.it ) > > 1. Alcohol > 2. Malic acid > 3. Ash > 4. Alcalinity of ash > 5. Magnesium > 6. Total phenols > 7. Flavanoids > 8. Nonflavanoid phenols > 9. Proanthocyanins > 10. Color intensity > 11. Hue > 12. OD280/OD315 of diluted wines > 13. Proline

属性は以下の通りです（Riccardo Leardi, riclea@anchem.unige.it による寄贈）。

アルコール
リンゴ酸
灰分
灰のアルカリ度
マグネシウム
総フェノール量
フラバノイド
非フラバノイドフェノール類
プロアントシアニン
色の濃さ
色相
希釈ワインのOD280/OD315
プロリン

wine_df = Explorer.Datasets.wine()

データフレームをテーブル表示する

Kino.DataTable.new(wine_df)

クラスが3種類であることを確認する

wine_df
|> DataFrame.distinct(["class"])
|> DataFrame.pull("class")

分析に利用するため、 class 以外の列名を取得しておく

cols =
  wine_df
  |> DataFrame.discard("class")
  |> DataFrame.names()

Describe

Python における pandas の DataFrame.describe を再現する

各項目毎に集計値を取得し、テーブル表示する

count: データ数
mean: 平均
std: 標準偏差
min: 最小値
25%: 1/4 分位数
50%: 中央値
75%: 3/4 分位数
max: 最大値

データのおおよその傾向を知ることができる

wine_df
|> DataFrame.describe()
|> Kino.DataTable.new()

Histogram

ヒストグラム = 度数分布図を作成する

各値がどのように分布しているか（バラついているか）が可視化される

get_values = fn df, col ->
  df
  |> DataFrame.pull(col)
  |> Series.to_list()
end

histgram = fn df, col ->
  x = get_values.(df, col)

  VegaLite.new(width: 200, height: 200, title: col)
  |> VegaLite.data_from_values(x: x)
  |> VegaLite.mark(:bar)
  |> VegaLite.encode_field(
    :x,
    "x",
    type: :quantitative,
    bin: %{maxbins: 20},
    title: col
  )
  |> VegaLite.encode_field(
    :y,
    "x",
    type: :quantitative,
    aggregate: :count
  )
end

histgram.(wine_df, "alcohol")

すべての列に対してヒストグラムを作成する

cols
|> Enum.map(fn col ->
  histgram.(wine_df, col)
end)
|> Kino.Layout.grid(columns: 3)

クラス（ワインの品種）毎の分布を見る

all_class_histgram = fn df, col ->
  x = get_values.(df, col)
  color = get_values.(df, "class")

  VegaLite.new(width: 200, height: 200)
  |> VegaLite.data_from_values(x: x, color: color)
  |> VegaLite.mark(:bar, opacity: 0.5)
  |> VegaLite.encode_field(
    :x,
    "x",
    type: :quantitative,
    bin: %{maxbins: 20},
    title: co
  )
  |> VegaLite.encode_field(
    :y,
    "x",
    type: :quantitative,
    aggregate: :count
  )
  |> VegaLite.encode_field(
    :color,
    "color",
    type: :nominal
  )
end

all_class_histgram.(wine_df, "alcohol")

cols
|> Enum.map(fn col ->
  all_class_histgram.(wine_df, col)
end)
|> Kino.Layout.grid(columns: 3)

flavanoids 、 color_intensity あたりが、クラスによって分布の頂点が違います

この辺りの値で大体クラス分類ができそうなことが分かります

Scatter

散布図を作成する

項目間の関連具合を見ることができる

scatter = fn df, x_col, y_col ->
  x = get_values.(df, x_col)
  y = get_values.(df, y_col)
  class = get_values.(wine_df, "class")

  VegaLite.new(width: 200, height: 200)
  |> VegaLite.data_from_values(x: x, y: y, class: class)
  |> VegaLite.mark(:point)
  |> VegaLite.encode_field(:x, "x",
    type: :quantitative,
    scale: [domain: [Enum.min(x), Enum.max(x)]],
    title: x_col
  )
  |> VegaLite.encode_field(:y, "y",
    type: :quantitative,
    scale: [domain: [Enum.min(y), Enum.max(y)]],
    title: y_col
  )
  |> VegaLite.encode_field(:color, "class", type: :nominal)
end

alcohol と malic_acid の散布図は完全にバラバラになっている

クラス毎にもバラツキが多い

scatter.(wine_df, "alcohol", "malic_acid")

flavanoids と color_intensity の散布図は特定の箇所に集中している

クラス毎にも明確に分布が別れている

scatter.(wine_df, "flavanoids", "color_intensity")

Scatter Matrix

散布図行列を作成する

各項目の組み合わせで、同項目の場合はヒストグラム、別項目の場合は散布図を出す

graphs =
  cols
  |> Enum.map(fn col_1 ->
    h_graphs =
      cols
      |> Enum.map(fn col_2 ->
        cond do
          col_1 == col_2 ->
            all_class_histgram.(wine_df, col_1)

          true ->
            scatter.(wine_df, col_1, col_2)
        end
      end)

    VegaLite.new(width: 200 * Enum.count(cols), height: 200)
    |> VegaLite.concat(h_graphs, :horizontal)
  end)

VegaLite.new(width: 200 * Enum.count(cols), height: 200 * Enum.count(cols))
|> VegaLite.concat(graphs, :vertical)

Standardization

データを標準化する

データの大きさ、バラツキを統一するため、各値から平均を引いて標準偏差で割る

相関行列を計算しやすくするために標準化しておく

standardize = fn df, column ->
  mean =
    wine_df
    |> DataFrame.pull(column)
    |> Series.mean()

  std =
    wine_df
    |> DataFrame.pull(column)
    |> Series.standard_deviation()

  df
  |> DataFrame.mutate_with(fn in_df ->
    %{column => Series.subtract(in_df[column], mean)}
  end)
  |> DataFrame.mutate_with(fn in_df ->
    %{column => Series.divide(in_df[column], std)}
  end)
end

standardized_wine_df =
  cols
  |> Enum.reduce(wine_df, fn col, standardized_df ->
    standardize.(standardized_df, col)
  end)

standardized_wine_df
|> Kino.DataTable.new()

Correlation matrix

相関係数 = 項目間の関係性の強さ

相関行列 = 各項目の組み合わせに対して相関係数を算出したもの

標準化している場合、相関行列 = 分散共分散行列

行列計算を行うため、データフレームをテンソルに変換する

df_to_tensor = fn df ->
  df
  |> DataFrame.names()
  |> Enum.map(fn col ->
    standardized_wine_df
    |> DataFrame.pull(col)
    |> Series.to_tensor()
  end)
  |> Nx.concatenate()
  |> Nx.reshape({DataFrame.n_columns(df), DataFrame.n_rows(df)})
end

standardized_wine_tensor =
  standardized_wine_df
  |> DataFrame.select(cols)
  |> df_to_tensor.()
  |> Nx.transpose()

分散共分散行列は (標準化した行列T・標準化した行列) / データ数で求めることができる

covariance_tensor =
  standardized_wine_tensor
  |> Nx.transpose()
  |> Nx.dot(standardized_wine_tensor)
  |> Nx.divide(DataFrame.n_rows(standardized_wine_df))

テンソルをデータフレームに変換してテーブル表示する

add_cols_label = fn list, cols_ ->
  [{"x", cols_} | list]
end

covariance_df =
  cols
  |> Stream.with_index()
  |> Enum.map(fn {col, index} ->
    {col, Nx.to_flat_list(covariance_tensor[index])}
  end)
  |> add_cols_label.(cols)
  |> DataFrame.new()

covariance_df
|> Kino.DataTable.new(keys: ["x" | cols])

直感的に分かりづらいため、ヒートマップにして表示する

heatmap =
  cols
  |> Stream.with_index()
  |> Enum.map(fn {col_1, index_1} ->
    cols
    |> Stream.with_index()
    |> Enum.map(fn {col_2, index_2} ->
      covariance =
        covariance_tensor[index_1][index_2]
        |> Nx.to_number()

      %{
        x: col_1,
        y: col_2,
        covariance: covariance
      }
    end)
  end)
  |> List.flatten()

VegaLite.new(width: 300, height: 300)
|> VegaLite.data_from_values(heatmap)
|> VegaLite.mark(:rect)
|> VegaLite.encode_field(:x, "x", type: :nominal)
|> VegaLite.encode_field(:y, "y", type: :nominal)
|> VegaLite.encode_field(
  :fill,
  "covariance",
  type: :quantitative,
  scale: [
    domain: [-1, 1],
    scheme: :blueorange
  ]
)

Eigenvalues and Eigenvectors

PCA のために固有値、固有ベクトルを求める

{eigenvals, eigenvecs} = Nx.LinAlg.eigh(covariance_tensor)

Contribution rate

寄与率: 各主成分が、全体のうちどれくらいの情報を占めているか

寄与率 = 各固有値 / 総分散

総分散 = 項目数

get_contribution_rate = fn index ->
  Nx.to_number(eigenvals[index - 1]) / Nx.size(eigenvals)
end

contribution_rate_list =
  1..Nx.size(eigenvals)
  |> Enum.to_list()
  |> Enum.map(fn index ->
    get_contribution_rate.(index)
  end)

累積寄与率 = 当該成分までの合計寄与率

cumulative_contribution_rate_list =
  1..Nx.size(eigenvals)
  |> Enum.map(fn index ->
    contribution_rate_list
    |> Enum.slice(0, index)
    |> Enum.sum()
  end)

寄与率と累積寄与率をテーブル表示する

contribution_rate_df =
  %{
    "PC" => 1..Nx.size(eigenvals),
    "contribution_rate" => contribution_rate_list,
    "cumulative_contribution_rate" => cumulative_contribution_rate_list
  }
  |> DataFrame.new()

contribution_rate_df
|> Kino.DataTable.new()

累積寄与率を折れ線グラフで表示する

第１主成分、第２主成分で寄与率の半数以上を占めている

VegaLite.new(width: 300, height: 300)
|> VegaLite.data_from_values(
  x: 0..Nx.size(eigenvals),
  y: [0 | cumulative_contribution_rate_list]
)
|> VegaLite.mark(:line)
|> VegaLite.encode_field(
  :x,
  "x",
  type: :quantitative,
  title: "PC"
)
|> VegaLite.encode_field(
  :y,
  "y",
  type: :quantitative,
  title: "Cumulative Contribution Rate"
)

Dimensional compression

次元圧縮 = 各項目の値を主成分にまとめる

固有ベクトルの1番目 = 第１主成分得点

固有ベクトルの2番目 = 第２主成分得点

eigenvecs = Nx.transpose(eigenvecs)

w1 = eigenvecs[0]
w2 = eigenvecs[1]

IO.inspect(w1)
IO.inspect(w2)

w = Nx.stack([w1, w2], axis: 1)

標準化したデータと射影行列の内積を取ることで、第１主成分、第２主成分が取得できる

pca_wine_tensor =
  standardized_wine_tensor
  |> Nx.dot(w)
  |> Nx.transpose()

クラスと第１主成分、第２主成分でデータフレームを作る

classes = get_values.(wine_df, "class")

pca_wine_df =
  %{
    class: classes,
    pc1: Nx.to_flat_list(pca_wine_tensor[0]),
    pc2: Nx.to_flat_list(pca_wine_tensor[1])
  }
  |> DataFrame.new()

pca_wine_df
|> Kino.DataTable.new()

散布図にすると、クラス毎にまとまっていることが分かる

scatter.(pca_wine_df, "pc1", "pc2")

Other notebooks:

Michal Slaski
@michalslaski

livebook_examples

Salary predictions

salary_prediction.livemd

exla axon nx

2022-8-18
Dr. Christian Geuer-Pollmann
@chgeuer

livebook_on_azure

Christian's first LiveBook test

notebook1.livemd

axon exla nx

2022-8-18
@andyl

elix_util

MNIST

mnist.livemd

req axon exla nx

2022-8-18
@TomBers

livebookNotes

Attractors

attractors.livemd

decimal vega_lite kino

2022-8-18
Nick C
@flowerett

aoc

Day9

day9.livemd

kino_aoc benchee nimble_parsec libgraph math

2023-12-11
Ryo Wakabayashi
@RyoWakabayashi

elixir-learning

Evision DetectionModel

detection_model.livemd

nx evision req kino

2023-2-3
@andyl

livebooks

Untitled notebook

evision.livemd

kino kino_vega_lite nx evision explorer download

2023-12-4

Back