Monitor errors

livebooks/bayes.livemd

Metrist

@Metrist-Software

backend

Share to X

Share to Bluesky

More notebooks

Monitor errors

Work in progress!

This Livebook is a work in progress. It has been added to the Backend livebooks directory mostly to illustrate the use of S3-hosted data files in concert with the s3fs mount that happens before make livebook.

Parsing error message export

We have a dump of the PostgreSQL monitor_errors table, with tabs as separators so splitting the CSV is simple.

csv = File.read!("livebooks/data/bayes/errors.csv")
lines = String.split(csv, "\n")
tabbed = Enum.map(lines, &amp;String.split(&amp;1, "\t"))

For moving this data into the histograms we need for Naive Bayesian classification, we want to make sure we strip down all special characters from the error messages and convert the remaining words into a set. Note that we keep numbers as well - error messages, like 500, may provide important clues.

We also provide a simple struct to keep the error entries. The hash field is there to make sure that we treat exact same messages exactly the same (maybe “YAGNI”, but there’s a good chance we’ll classify them together).

defmodule S do
  def t(n) do
    n
    |> String.downcase()
    |> String.split(~r/[^[:alnum:]]/)
    |> Enum.filter(fn s -> String.length(s) > 0 end)
  end
end

S.t("This is (an error)\n")

defmodule Entry do
  defstruct [:id, :monitor, :check, :error, :words, :hash]
end

Finally, we map all the parsed error messages into Entry structs. We use sha1 for the hash because it is available and should be reasonable fast to calculate.

Note that we stuff the code in a module so it gets compiled, that’s way faster.

defmodule Mapper do
  def map(tabbed) do
    Enum.map(tabbed, fn items ->
      err = Enum.at(items, 4) || ""
      hash = :crypto.hash(:sha, err)
      words = err |> S.t() |> Enum.join(" ")

      %Entry{
        id: Enum.at(items, 0),
        monitor: Enum.at(items, 1),
        check: Enum.at(items, 2),
        error: err,
        words: words,
        hash: hash
      }
    end)
  end
end

mapped = Mapper.map(tabbed)

Naive Bayes classification

Now the hard part: classifying. We start with some fake test data and canned answers. The idea is that we will at one point present new messages to a user with a suggested classification (based on what we know so far) and then ask for confirmation or a new classification.

Because we’re lazy, we install the Simple Bayes Hex package which seems to provide what we need. We also install Kino so we can get interactive.

Mix.install([{:simple_bayes, "1.0.0"}, {:kino, "~> 0.5.1"}])
Application.ensure_started(:simple_bayes)

A quick test drive to see whether everything is legit:

bayes =
  SimpleBayes.init()
  |> SimpleBayes.train(:apple, "red sweet")
  |> SimpleBayes.train(:apple, "green", weight: 0.5)
  |> SimpleBayes.train(:apple, "round", weight: 2)
  |> SimpleBayes.train(:banana, "sweet")
  |> SimpleBayes.train(:banana, "green", weight: 0.5)
  |> SimpleBayes.train(:banana, "yellow long", weight: 2)
  |> SimpleBayes.train(:orange, "red")
  |> SimpleBayes.train(:orange, "yellow sweet", weight: 0.5)
  |> SimpleBayes.train(:orange, "round", weight: 2)

bayes |> SimpleBayes.classify_one("Maybe green maybe red but definitely round and sweet.")

bayes |> SimpleBayes.classify("Maybe green maybe red but definitely round and sweet.", top: 2)

An important question is what categories we want. For now, I’m going with these four:

down means that the service is actually down.
timeout means that the service timed out, it may be slow or “almost down”.
bug means that our code has a bug, in other words “it’s our fault”.
quota means that we ran out of quota. This can mean that we’re not cleaning up correctly or that we’re running into vendor quota, rate limits, etc.
false_positive means that it’s not actually an issue.

Just for fun, let’s define them.

categories = [:down, :timeout, :bug, :quota, :false_positive]

Using Kino, we ask for each unique error message what the classification is. Answers get written to an answers file so that when we need to repeat this (which we likely will) we can fast-forward by replaying the answers.

Other notebooks:

@TomBers

livebookNotes

Attractors

attractors.livemd

decimal vega_lite kino

2022-8-18
Kevin Pan
@feng19

spider_man

ElixirJobs

elixirjobs.livemd

spider_man floki nimble_csv kino

2022-8-18
@TomBers

livebookNotes

Fun with Graphs

graphs.livemd

vega_lite kino math

2022-8-18
@TomBers

livebookNotes

Epicycloid - draw Curves with Straight Lines

Epicycloid.livemd

vega_lite kino math

2022-8-18
Ryan Young
@ryoung786

AdventOfCode

Template

template.livemd

req vega_lite kino_vega_lite

2022-12-5
Daniel Kukula
@dkuku

livebooks

Untitled notebook

thinking_elixir.livemd

bumblebee exla nx

2024-4-23
@alde103

Build_Large_Language_Mode...

Introduction to PyTorch (Elixir mimic)

AA.livemd

axon nx kino scidata exla table_rex

2024-11-26

Back