Should I use GPT to autogenerate schema validations?
Mix.install([
  {:jason, "> 0.0.0"},
  {:vega_lite, "~> 0.1.7"},
  {:kino_vega_lite, "~> 0.1.8"},
  {:benchee, "~> 1.1.0"},
  {:exonerate, "~> 0.3.0"}
])

~w(test.ex schema.ex)
|> Enum.each(fn file ->
  __DIR__
  |> Path.join("benchmark/#{file}")
  |> Code.compile_file()
end)
alias Benchmark.Schema
alias Benchmark.Test
Benchmark.Test
Motivation
This entire month (March 2023), I spent a ton of effort completing a major refactor of my json-schema library for Elixir. As I was toiling away handcrafting macros to generate optimized, bespoke, yet generalizable code, GPT-4 rolled onto the scene and awed all of us in the industry with its almost magical ability to craft code out of whole cloth. I felt a little bit like John Henry battling against the steam drill, only to win but die from his exertion. (Image: https://upload.wikimedia.org/wikipedia/commons/0/00/John_Henry-27527.jpg)
With the advent of LLM-based code generation, programmers are leveraging LLMs such as GPT to rapidly produce code that is difficult or fussy to write by hand. Is this a good idea? I wanted to test this out.
Note that compared to a schema compiler, LLM-generated code may be able to see some nice optimizations for simple schemas. This is roughly equivalent to a human claiming to be able to write better assembly language than a low-level language compiler. In some cases, the human may access extra knowledge about the structure of the data being handled, and thus the claim may be justified.
On the other hand, JSONSchema validations are typically used at the edge of a system, especially when interfacing with a 3rd party system (or human) with QC that is not under the control of the publisher of the JSONSchema. In these situations, strict adherence to JSONSchema is desirable. An early 422 rejection with a reason explaining where the data are misshapen is generally more desirable than a typically more opaque 500 rejection because the data do not match the expectations of the internal system.
With these considerations, I decided to test just how good GPT is at writing JSONSchema validation code, and answer the question “Should I use GPT to autogenerate schema validations?”
Methodology
To test this question, the roughly 250 JSONSchemas provided as part of the JSONSchema engine validation suite (website) were each injected into the following templated query, and GPT-3.5 and GPT-4 were asked to provide a response.
Hi, ChatGPT! I would love your help writing an Elixir public function `validate/1`, which takes
one parameter, which is a decoded JSON value. The function should return :ok if the following
jsonschema validates, and an error if it does not:
```
#{schema}
```
The function should NOT store or parse the schema, it should translate the instructions in the schema directly as
elixir code. For example:
```
{"type": "object"}
```
should emit the following code:
```
def validate(object) when is_map(object), do: :ok
def validate(_), do: :error
```
DO NOT STORE THE SCHEMA or EXAMINE THE SCHEMA anywhere in the code. There should not be any
`schema` variables anywhere in the code. please name the module with the atom `:"#{group}-#{title}"
Thank you!
From each response, the code inside the elixir fenced block was extracted and saved into a .exs file for processing, as shown below in this live notebook. GPT-3.5 was not capable of correctly wrapping the elixir module, so its output required an automated curation step; GPT-4’s code could be used as-is. Some further manual curation was performed (see Systematic code generation issues).
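The extraction step described above can be sketched as follows. This is a minimal illustration, not the script actually used for this study: the `ExtractSketch` module, the `extract_code/1` function, and the sample response are all hypothetical, and the sketch assumes the response contains at least one fenced block tagged `elixir`.

```elixir
defmodule ExtractSketch do
  # Hypothetical helper (not part of the original notebook).
  # The fence marker is built at compile time from single backticks so that
  # this example does not need to contain a literal code fence of its own.
  @fence String.duplicate("`", 3)

  # Returns {:ok, code} for the first elixir-tagged fenced block, or :error
  # when the response contains no such block.
  def extract_code(response) when is_binary(response) do
    # "s" (dotall) lets `.` match newlines; `.*?` stops at the first closing fence.
    pattern = Regex.compile!(@fence <> "elixir\\n(.*?)" <> @fence, "s")

    case Regex.run(pattern, response, capture: :all_but_first) do
      [code] -> {:ok, code}
      nil -> :error
    end
  end
end

fence = String.duplicate("`", 3)

# A made-up response in the shape the prompt asks for.
response = """
Sure! Here is the module:

#{fence}elixir
defmodule :"type-object" do
  def validate(object) when is_map(object), do: :ok
  def validate(_), do: :error
end
#{fence}

Let me know if you need anything else.
"""

{:ok, code} = ExtractSketch.extract_code(response)
IO.puts(code)
```

The extracted string could then be written to a `.exs` file with `File.write!/2`, which is the form the benchmark code in this notebook expects to load.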
Limitations
The biggest limitation of this approach is the nature of the examples provided in the JSONSchema validation suite. These validations exist to help JSONSchema implementers understand “gotchas” in the JSONSchema standard. As such, they don’t feature “real-world” payloads and their complexity is mostly limited to testing a single JSONSchema filter, in some cases, a handful of JSONSchema filters, where the filters have a long-distance interaction as part of the specification.
As a result, the optimizations that GPT performs may not really be scalable to real-world cases, and it’s not clear if GPT will have sufficient attention to handle the more complex cases.
Future studies, possibly involving schema generation and a property testing approach, could yield a more comprehensive understanding of GPT code generation.
Note that the source data for GPT is more heavily biased towards imperative programming languages, so despite the claim that AI-assisted code-generation is likely to be more fruitful for languages (like Elixir) with term-immutability, any deficiencies in the code may also be a result of a deficiency in the LLM’s understanding of Elixir.
Benchmarking Accuracy
We’re going to marshal our results into the following struct, which carries information for visualization:
defmodule Benchmark.Result do
  @enforce_keys [:schema, :type]
  defstruct @enforce_keys ++ [fail: [], pass: [], pct: 0.0, exception: nil]
end
{:module, Benchmark.Result, <<70, 79, 82, 49, 0, 0, 11, ...>>,
%Benchmark.Result{schema: nil, type: nil, fail: [], pass: [], pct: 0.0, exception: nil}}
The following code is used to profile our GPT-generated code. The directory structure is expected to be that of the https://github.com/E-xyza/exonerate repository, and this notebook is expected to be in the ./bench/ directory; otherwise the relative directory paths won’t work.
Note that the Schema and Test modules should be in ./bench/benchmark/schema.ex and ./bench/benchmark/test.ex, respectively; these are loaded in the dependencies section.
defmodule Benchmark do
  alias Benchmark.Result

  # test files that are left out of the benchmark
  @omit ~w(anchor.json refRemote.json dynamicRef.json)
  @test_directory Path.join(__DIR__, "../test/_draft2020-12")

  def get_test_content do
    Schema.stream_from_directory(@test_directory, omit: @omit)
  end

  def run(gpt, test_content) do
    code_directory = Path.join(__DIR__, gpt)

    test_content
    |> Stream.map(&compile_schema(&1, code_directory))
    |> Stream.map(&evaluate_test/1)
    |> Enum.to_list()
  end

  defp escape(string), do: String.replace(string, "/", "-")

  defp compile_schema(schema, code_directory) do
    filename = "#{schema.group}-#{escape(schema.description)}.exs"
    code_path = Path.join(code_directory, filename)

    # yields the compiled module, or the exception if the file fails to compile
    module =
      try do
        {{:module, module, _, _}, _} = Code.eval_file(code_path)
        module
      rescue
        error -> error
      end

    {schema, module}
  end

  defp evaluate_test({schema, exception}) when is_exception(exception) do
    %Result{schema: schema, type: :compile, exception: exception}
  end

  defp evaluate_test({schema, module}) do
    # check to make sure module exports the validate function.
    if function_exported?(module, :validate, 1) do
      # each passing test contributes an equal share of 100%
      increment = 100.0 / length(schema.tests)

      schema.tests
      |> Enum.reduce(%Result{schema: schema, type: :ok}, fn test, result ->
        expected = if test.valid, do: :ok, else: :error

        try do
          if module.validate(test.data) === expected do
            %{result | pct: result.pct + increment, pass: [test.description | result.pass]}
          else
            %{result | type: :partial, fail: [{test.description, :incorrect} | result.fail]}
          end
        rescue
          e ->
            %{result | type: :partial, fail: [{test.description, e} | result.fail]}
        end
      end)
      |> set_total_failure()
    else
      %Result{schema: schema, type: :compile, exception: :not_generated}
    end
  end

  # if absolutely none of the answers is correct, then set the type to :failure
  defp set_total_failure(result = %Result{pct: 0.0}), do: %{result | type: :failure}
  defp set_total_failure(result), do: result
end
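To make the scoring arithmetic in `evaluate_test/1` concrete, here is a small self-contained sketch. The `ScoringSketch` module, its nested `Validator`, and the four test cases are all made up for illustration. The validator stands in for a generated module for `{"type": "object"}` that wrongly accepts lists as well as maps.

```elixir
defmodule ScoringSketch do
  # Hypothetical stand-in for a generated validator of {"type": "object"}
  # that (incorrectly) also accepts lists.
  defmodule Validator do
    def validate(value) when is_map(value) or is_list(value), do: :ok
    def validate(_), do: :error
  end

  # Mirrors the percentage arithmetic in the benchmark: every test whose
  # outcome matches the expectation adds 100 / length(tests) points.
  def score(module, tests) do
    increment = 100.0 / length(tests)

    Enum.reduce(tests, 0.0, fn test, pct ->
      expected = if test.valid, do: :ok, else: :error
      if module.validate(test.data) === expected, do: pct + increment, else: pct
    end)
  end
end

# Made-up test cases in the same shape as the suite: data plus a validity flag.
tests = [
  %{data: %{}, valid: true},
  %{data: %{"a" => 1}, valid: true},
  %{data: "string", valid: false},
  %{data: [], valid: false}
]

pct = ScoringSketch.score(ScoringSketch.Validator, tests)
IO.inspect(pct)
```

Three of the four cases match, so the score is 75.0. In the benchmark itself, such a result would be tagged `:partial`; only a score of exactly 0.0 is reclassified as `:failure`.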
tests = Benchmark.get_test_content()
gpt_3_results = Benchmark.run("gpt-3.5", tests)
gpt_4_results = Benchmark.run("gpt-4", tests)
:ok
warning: variable "map" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:30: :"if-then-else-if appears at the end when serialized (keyword processing sequence)-gpt-3.5".validate_map/1
warning: function validate_array/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:38
warning: function validate_bool/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:58
warning: function validate_null/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:62
warning: function validate_number/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:54
warning: function validate_string/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:46
warning: this clause cannot match because a previous clause at line 22 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:25
warning: Map.get!/2 is undefined or private. Did you mean:
* get/2
* get/3
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-root pointer ref.exs:4: :"ref-root pointer ref-gpt-3.5".validate/1
warning: variable "errors" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-relative pointer ref to object.exs:13: :"ref-relative pointer ref to object-gpt-3.5".validate_map/2
warning: variable "map" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-relative pointer ref to object.exs:10: :"ref-relative pointer ref to object-gpt-3.5".validate_map/2
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-escaped pointer ref.exs:6: :"ref-escaped pointer ref-gpt-3.5".validate/1
warning: this clause cannot match because a previous clause at line 19 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-escaped pointer ref.exs:27
warning: variable "k" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-nested refs.exs:32: :"ref-nested refs-gpt-3.5".validate_object/1
warning: undefined module attribute @schemas, please remove access to @schemas or explicitly set it before access
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-nested refs.exs:50: :"ref-nested refs-gpt-3.5" (module)
warning: this clause cannot match because a previous clause at line 30 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-nested refs.exs:31
warning: module attribute @schemas was set but never used
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-nested refs.exs:57
warning: variable "schema" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref applies alongside sibling keywords.exs:10: :"ref-ref applies alongside sibling keywords-gpt-3.5".validate/1
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref applies alongside sibling keywords.exs:10
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref applies alongside sibling keywords.exs:14
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-$ref to boolean schema true.exs:10
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-$ref to boolean schema true.exs:14
warning: undefined module attribute @schema, please remove access to @schema or explicitly set it before access
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-refs with quote.exs:50: :"ref-refs with quote-gpt-3.5" (module)
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:21: :"ref-ref creates new scope when adjacent to keywords-gpt-3.5".validate/1
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:25: :"ref-ref creates new scope when adjacent to keywords-gpt-3.5".validate/1
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:10
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:17
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:21
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:25
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-unevaluatedProperties with adjacent additionalProperties.exs:9: :"unevaluatedProperties-unevaluatedProperties with adjacent additionalProperties-gpt-3.5".validate/1
warning: variable "result" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-unevaluatedProperties with adjacent additionalProperties.exs:63: :"unevaluatedProperties-unevaluatedProperties with adjacent additionalProperties-gpt-3.5".validate_properties_with_properties/2
warning: variable "default" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-unevaluatedProperties with nested properties.exs:48: :"unevaluatedProperties-unevaluatedProperties with nested properties-gpt-3.5".validate_object_properties/4
warning: Map.equal/2 is undefined or private. Did you mean:
* equal?/2
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-unevaluatedProperties with nested properties.exs:35: :"unevaluatedProperties-unevaluatedProperties with nested properties-gpt-3.5".validate_object_properties/4
warning: MapSchema.validate/2 is undefined (module MapSchema is not available or is yet to be defined)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-cousin unevaluatedProperties, true and false, true with properties.exs:4: :"unevaluatedProperties-cousin unevaluatedProperties, true and false, true with properties-gpt-3.5".validate/1
warning: variable "schema" does not exist and is being expanded to "schema()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties.exs:31: :"unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties-gpt-3.5".validate_properties_schema/2
warning: variable "schema" does not exist and is being expanded to "schema()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties.exs:50: :"unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties-gpt-3.5".validate_unevaluated_properties/2
warning: undefined function schema/0 (expected :"unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties.exs:50
warning: this clause for validate_schema1/1 cannot match because a previous clause at line 13 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-allOf with base schema.exs:20
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-allOf with two empty schemas.exs:2: :"allOf-allOf with two empty schemas-gpt-3.5".validate/1
warning: variable "object" is unused (there is a variable with the same name in the context, use the pin operator (^) to match on it or prefix this variable with underscore if it is not meant to be used)
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-nested allOf, to check validation semantics.exs:92: :"allOf-nested allOf, to check validation semantics-gpt-3.5".validate_subschema/2
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-nested allOf, to check validation semantics.exs:10
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-nested allOf, to check validation semantics.exs:18
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-nested allOf, to check validation semantics.exs:27
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/required-required default validation.exs:2: :"required-required default validation-gpt-3.5".validate/1
warning: clauses with the same name should be grouped together, "def validate/1" was previously defined (/home/ityonemo/code/exonerate/bench/gpt-3.5/maximum-maximum validation.exs:2)
/home/ityonemo/code/exonerate/bench/gpt-3.5/maximum-maximum validation.exs:14
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/maximum-maximum validation.exs:14
warning: Map.fetch/3 is undefined or private. Did you mean:
* fetch/2
/home/ityonemo/code/exonerate/bench/gpt-3.5/uniqueItems-uniqueItems with an array of items and additionalItems=false.exs:31: :"uniqueItems-uniqueItems with an array of items and additionalItems=false-gpt-3.5".is_unique_items/1
warning: incompatible types:
map() !~ [dynamic()]
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/uniqueItems-uniqueItems=false validation.exs:5
is_list(object)
where "object" was given the type map() in:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/uniqueItems-uniqueItems=false validation.exs:5
%{uniqueItems: false} = object
where "object" was given the type [dynamic()] in:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/uniqueItems-uniqueItems=false validation.exs:5
is_list(object)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-3.5/uniqueItems-uniqueItems=false validation.exs:5: :"uniqueItems-uniqueItems=false validation-gpt-3.5".validate/1
warning: function is_integer/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/properties-object properties validation.exs:31
warning: variable "subprop" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/properties-properties, patternProperties, additionalProperties interaction.exs:64: :"properties-properties, patternProperties, additionalProperties interaction-gpt-3.5".validate_pattern_properties/2
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/additionalProperties-additionalProperties can exist by itself.exs:4: :"additionalProperties-additionalProperties can exist by itself-gpt-3.5".validate/1
warning: undefined function validate/2 (expected :"items-items should not look in applicators, valid case-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/items-items should not look in applicators, valid case.exs:86
warning: variable "null" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:2: :"anyOf-nested anyOf, to check validation semantics-gpt-3.5".validate/1
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:6
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:10
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:17
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:21
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:25
warning: found quoted keyword "pattern" but the quotes are not required. Note that keywords are always atoms, even when quoted. Similar to atoms, keywords made exclusively of ASCII letters, numbers, and underscores and not beginning with a number do not require quotes
/home/ityonemo/code/exonerate/bench/gpt-3.5/pattern-pattern is not anchored.exs:5:6
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/minimum-minimum validation with signed integer.exs:6
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:6: :"unevaluatedItems-unevaluatedItems with items-gpt-3.5".validate/1
warning: variable "prefix" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:31: :"unevaluatedItems-unevaluatedItems with items-gpt-3.5".validate_prefix_items/2
warning: function validate_items/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:14
warning: function validate_prefix_items/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:25
warning: function validate_schema/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:53
warning: function validate_with_prefix_schema/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:46
warning: variable "type" is unused (there is a variable with the same name in the context, use the pin operator (^) to match on it or prefix this variable with underscore if it is not meant to be used)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with nested tuple.exs:67: :"unevaluatedItems-unevaluatedItems with nested tuple-gpt-3.5".validate_prefix_items/2
warning: variable "type" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with nested tuple.exs:63: :"unevaluatedItems-unevaluatedItems with nested tuple-gpt-3.5".validate_prefix_items/2
warning: variable "data" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:9: :"unevaluatedItems-unevaluatedItems with not-gpt-3.5".validate/1
warning: variable "prefix_items" does not exist and is being expanded to "prefix_items()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:56: :"unevaluatedItems-unevaluatedItems with not-gpt-3.5".validate_item/2
warning: variable "const" does not exist and is being expanded to "const()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:57: :"unevaluatedItems-unevaluatedItems with not-gpt-3.5".validate_item/2
warning: function validate_item/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:24
warning: function validate_items/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:13
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:62
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:61
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:60
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:59
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:58
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:57
warning: undefined function const/0 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:57
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:56
warning: undefined function prefix_items/0 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:56
warning: this clause cannot match because a previous clause at line 23 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with boolean schemas.exs:27
warning: variable "properties" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/content-validation of binary-encoded media type documents with schema.exs:30: :"content-validation of binary-encoded media type documents with schema-gpt-3.5".validate_content_schema/1
warning: expected Kernel.rem/2 to have signature:
integer() | float(), float() -> dynamic()
but it has signature:
integer(), integer() -> integer()
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-by number.exs:2
rem(val, 1.5)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-by number.exs:2: :"multipleOf-by number-gpt-3.5".validate/1
warning: expected Kernel.rem/2 to have signature:
integer() | float(), float() -> dynamic()
but it has signature:
integer(), integer() -> integer()
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-by small number.exs:2
rem(value, 0.0001)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-by small number.exs:2: :"multipleOf-by small number-gpt-3.5".validate/1
warning: expected Kernel.rem/2 to have signature:
integer(), float() -> dynamic()
but it has signature:
integer(), integer() -> integer()
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-invalid instance should not raise error when float division = inf.exs:4
rem(object, 0.123456789)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-invalid instance should not raise error when float division = inf.exs:4: :"multipleOf-invalid instance should not raise error when float division = inf-gpt-3.5".validate/1
warning: variable "object" does not exist and is being expanded to "object()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-3.5/patternProperties-regexes are not anchored by default and are case sensitive.exs:15: :"patternProperties-regexes are not anchored by default and are case sensitive-gpt-3.5".is_valid_key/1
warning: variable "rest" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-4/unevaluatedProperties-unevaluatedProperties with adjacent properties.exs:4: :"unevaluatedProperties-unevaluatedProperties with adjacent properties".validate/1
warning: unused import Regex
/home/ityonemo/code/exonerate/bench/gpt-4/unevaluatedProperties-unevaluatedProperties with adjacent patternProperties.exs:2
warning: unused import Regex
/home/ityonemo/code/exonerate/bench/gpt-4/additionalProperties-non-ASCII pattern with additionalProperties.exs:2
warning: function is_boolean/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-4/additionalProperties-additionalProperties allows a schema which should validate.exs:19
warning: function is_boolean/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-4/additionalProperties-additionalProperties can exist by itself.exs:12
warning: variable "object" does not exist and is being expanded to "object()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-4/additionalProperties-additionalProperties should not look in applicators.exs:29: :"additionalProperties-additionalProperties should not look in applicators".is_additional_property_valid?/1
warning: :inet.parse_address/2 is undefined or private. Did you mean:
* parse_address/1
/home/ityonemo/code/exonerate/bench/gpt-4/format-validation of IPv6 addresses.exs:3: :"format-validation of IPv6 addresses".validate/1
warning: :idna.to_ascii/1 is undefined (module :idna is not available or is yet to be defined)
/home/ityonemo/code/exonerate/bench/gpt-4/format-validation of IDN hostnames.exs:15: :"format-validation of IDN hostnames".valid_idn_hostname?/1
warning: undefined function return/1 (expected :"content-validation of binary-encoded media type documents with schema" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-4/content-validation of binary-encoded media type documents with schema.exs:39
warning: undefined function return/1 (expected :"content-validation of binary-encoded media type documents with schema" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-4/content-validation of binary-encoded media type documents with schema.exs:36
warning: undefined function return/1 (expected :"content-validation of binary-encoded media type documents with schema" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-4/content-validation of binary-encoded media type documents with schema.exs:32
warning: function is_map_key/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-4/not-forbidden property.exs:5
warning: expected Kernel.rem/2 to have signature:
float(), float() -> dynamic()
but it has signature:
integer(), integer() -> integer()
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-4/multipleOf-by number.exs:2
rem(object, 1.5)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-4/multipleOf-by number.exs:2: :"multipleOf-by number".validate/1
warning: expected Kernel.rem/2 to have signature:
float(), float() -> dynamic()
but it has signature:
integer(), integer() -> integer()
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-4/multipleOf-by small number.exs:2
rem(number, 0.0001)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-4/multipleOf-by small number.exs:2: :"multipleOf-by small number".validate/1
warning: variable "null" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-4/oneOf-nested oneOf, to check validation semantics.exs:2: :"oneOf-nested oneOf, to check validation semantics".validate/1
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-4/oneOf-nested oneOf, to check validation semantics.exs:6
:ok
Systematic Issues
Atoms vs. Strings
Both GPT-3.5 and GPT-4 sometimes use atoms in their code instead of strings. This is understandable, since various Elixir JSON implementations may use atoms instead of strings in the internal representation of JSON, especially for object keys. However, validation of JSON is most likely going to operate on string keys, since atom keys for input are discouraged due to security concerns (atoms are never garbage collected, so creating them from untrusted input can exhaust the atom table). Here is some example code that GPT-4 generated:
defmodule :"oneOf-oneOf complex types" do
def validate(object) when is_map(object) do
case Enum.filter([:bar, :foo], &Map.has_key?(object, &1)) do
[:bar] ->
case Map.fetch(object, :bar) do
{:ok, value} when is_integer(value) -> :ok
_ -> :error
end
[:foo] ->
case Map.fetch(object, :foo) do
{:ok, value} when is_binary(value) -> :ok
_ -> :error
end
_ -> :error
end
end
def validate(_), do: :error
end
Code featuring atom keys in maps was manually converted prior to benchmarking accuracy. For example, the above code was converted to:
defmodule :"oneOf-oneOf complex types" do
def validate(object) when is_map(object) do
case Enum.filter(["bar", "foo"], &Map.has_key?(object, &1)) do
["bar"] ->
case Map.fetch(object, "bar") do
{:ok, value} when is_integer(value) -> :ok
_ -> :error
end
["foo"] ->
case Map.fetch(object, "foo") do
{:ok, value} when is_binary(value) -> :ok
_ -> :error
end
_ -> :error
end
end
def validate(_), do: :error
end
String length is UTF-8 grapheme count
Neither GPT understood that JSONSchema counts string length in characters (graphemes, in Elixir terms) rather than bytes. As an example, GPT-4 produced the following code:
defmodule :"maxLength-maxLength validation" do
def validate(string) when is_binary(string) do
if byte_size(string) <= 2, do: :ok, else: :error
end
def validate(_), do: :error
end
Instead, the if statement should have been:
if String.length(string) <= 2, do: :ok, else: :error
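To see why the distinction matters, here is a minimal sketch comparing the two functions on a string with multi-byte characters. The sample string is my own illustration and is not taken from the test suite.

```elixir
# "é" written as the single code point U+00E9 occupies two bytes in UTF-8.
string = "\u00E9\u00E9"

# byte_size/1 counts bytes, so this two-character string looks four long
# and would be rejected by the generated maxLength check.
IO.inspect(byte_size(string), label: "byte_size")

# String.length/1 counts graphemes, which is what the corrected check uses.
IO.inspect(String.length(string), label: "String.length")
```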
Integers need to match Floats
The JSONSchema standard requires that constant integers and enumerated integers also match floating point numbers with the same integral value. In Elixir, while the == operator will resolve as true when comparing an integer with an integral floating point, other operations, such as matching, will not. Both GPT-3.5 and GPT-4 struggled with this, and GPT-4 missed several validations as a result.
Example:
defmodule :"enum-enum with 0 does not match false" do
def validate(0), do: :ok
def validate(_), do: :error
end
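As a rough sketch of the difference, the first module below repeats the pattern-matching approach and the second compares with == inside a guard instead. Both module names are hypothetical and exist only for this illustration.

```elixir
defmodule PatternZeroSketch do
  # Same shape as the generated code: the literal 0 only matches the integer 0.
  def validate(0), do: :ok
  def validate(_), do: :error
end

defmodule NumericZeroSketch do
  # `==` compares numerically, so 0 and 0.0 are both accepted.
  # The is_number/1 guard keeps false (and everything else) out.
  def validate(value) when is_number(value) and value == 0, do: :ok
  def validate(_), do: :error
end

IO.inspect(PatternZeroSketch.validate(0.0), label: "pattern match on 0.0")
IO.inspect(NumericZeroSketch.validate(0.0), label: "numeric compare on 0.0")
```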
Filters only apply to their own types
This error, common to both GPT-3.5 and GPT-4, arises because GPT does not understand that a filter will not reject a type it is not designed to operate on. A good example of such code is the following (derived from the schema {"maxItems": 2}):
defmodule :"maxItems-maxItems validation" do
def validate(list) when is_list(list) and length(list) <= 2, do: :ok
def validate(_), do: :error
end
Note that validate/1 will return :error when confronted with a string, even though the JSONSchema spec says that the maxItems filter should not apply, defaulting to successful validation.
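A minimal sketch of a validator that respects this rule might look like the following. The module name is hypothetical, and this is not the code that Exonerate generates.

```elixir
defmodule MaxItemsSketch do
  # The filter only constrains lists: reject lists with more than two items.
  def validate(list) when is_list(list) and length(list) > 2, do: :error
  # Everything else, including strings, numbers and maps, passes untouched.
  def validate(_), do: :ok
end

IO.inspect(MaxItemsSketch.validate("a string"), label: "string")
IO.inspect(MaxItemsSketch.validate([1, 2, 3]), label: "three items")
```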
When given the schema {"maxItems": 2, "maxLength": 4} (not in the test suite), GPT-4 does something even stranger, applying the maxLength criterion to the inner elements of the list, even while accepting that the outer element can be either a list or a string.
defmodule :"maxItems-maxLength" do
def validate(value) when is_list(value) and length(value) <= 2 do
Enum.reduce(value, :ok, fn item, acc ->
if is_binary(item) and byte_size(item) <= 4, do: acc, else: :error
end)
end
def validate(value) when is_binary(value) and byte_size(value) <= 4 do
:ok
end
def validate(_), do: :error
end
When given {"maxLength": 4, "maximum": 3}, GPT-4 does apply each filter only to its own type, though the final catch-all clause still rejects every other type.
defmodule :"maxLength-maximum" do
def validate(value) when is_binary(value) and byte_size(value) <= 4, do: :ok
def validate(value) when is_number(value) and value <= 3, do: :ok
def validate(_), do: :error
end
In the GPT-4 accuracy benchmark, ~15 of the test schemas were prevented from passing all of their tests solely because of missing this criterion.
Format/Content
Neither GPT knew that format and content-* filters are off by default, and the test suite does not test against this validation. However, both GPTs reached for Elixir standard library tools, even though these do not necessarily fit the explicit requirements set forth by the JSONSchema standard, or for 3rd party tools (despite being told explicitly not to), to perform validations.
Examples (all GPT-4):
defmodule :"format-validation of date-time strings" do
def validate(datetime_string) when is_binary(datetime_string) do
if valid_datetime?(datetime_string) do
:ok
else
:error
end
end
def validate(_), do: :error
defp valid_datetime?(datetime_string) do
case DateTime.from_iso8601(datetime_string) do
{:ok, _} -> true
:error -> false
end
end
end
For date-time validation, the correct Elixir standard library module to use is NaiveDateTime, not DateTime. DateTime will fail without being given time-zone information.
defmodule :"format-validation of IDN hostnames" do
alias :idna, as: Idna
def validate(hostname) when is_binary(hostname) do
if valid_idn_hostname?(hostname) do
:ok
else
:error
end
end
def validate(_), do: :error
defp valid_idn_hostname?(hostname) do
case Idna.to_ascii(hostname) do
{:ok, ascii_hostname} -> valid_ascii_hostname?(ascii_hostname)
_ -> false
end
end
defp valid_ascii_hostname?(hostname) do
:inet.parse_strict_address(hostname) == :error and
Enum.all?(String.split(hostname, ".", trim: true), &valid_label?/1)
end
defp valid_label?(label) do
byte_size(label) in 1..63 and
String.match?(label, ~r/^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$/)
end
end
GPT-4 (impressively) reaches for the :idna Erlang library but, oddly, decides to alias it with an Elixir-style module name.
Accuracy Evaluation
Next, let’s look at how accurately GPT-3.5 and GPT-4 perform across all of the JSONSchema tests:
defmodule Benchmark.Plotter do
def format_passes(result) do
result.pass
|> Enum.map(&"✅ #{&1}")
|> Enum.join("\n")
end
def format_fails(result) do
result.fail
|> Enum.map(&"❌ #{elem(&1, 0)}")
|> Enum.join("\n")
end
@granularity 2
def tabularize(result) do
color =
case result.type do
:ok -> :green
:partial -> :yellow
:failure -> :orange
:compile -> :red
end
%{
group: result.schema.group,
test: result.schema.description,
schema: Jason.encode!(result.schema.schema),
pct: round(result.pct / @granularity) * @granularity,
color: color,
pass: format_passes(result),
fail: format_fails(result)
}
end
def nudge_data(results) do
# data points might overlap, so to make the visualization more effective,
# we should nudge the points apart from each other.
results
|> Enum.sort_by(&{&1.group, &1.pct})
|> Enum.map_reduce(MapSet.new(), &nudge/2)
|> elem(0)
end
@nudge 2
# points might overlap, so move them up or down accordingly for better
# visualization. Colors help us understand the qualitative results.
defp nudge(result = %{pct: pct}, seen) when pct == 100, do: nudge(result, seen, -@nudge)
defp nudge(result, seen), do: nudge(result, seen, @nudge)
defp nudge(result, seen, amount) do
if {result.group, result.pct} in seen do
nudge(%{result | pct: result.pct + amount}, seen, amount)
else
{result, MapSet.put(seen, {result.group, result.pct})}
end
end
def plot_one({title, results}) do
tabularized =
results
|> Enum.map(&tabularize/1)
|> nudge_data()
VegaLite.new(title: title)
|> VegaLite.data_from_values(tabularized)
|> VegaLite.mark(:circle)
|> VegaLite.encode_field(:x, "group", type: :nominal, title: false)
|> VegaLite.encode_field(:y, "pct", type: :quantitative, title: "percent correct")
|> VegaLite.encode_field(:color, "color", legend: false)
|> VegaLite.encode(:tooltip, [
[field: "group"],
[field: "test"],
[field: "schema"],
[field: "pass"],
[field: "fail"]
])
end
def plot(list_of_results) do
VegaLite.new()
|> VegaLite.concat(Enum.map(list_of_results, &plot_one/1), :vertical)
end
end
Benchmark.Plotter.plot("gpt-3.5": gpt_3_results, "gpt-4": gpt_4_results)
In the above chart, blue dots are 100% correct, green dots are partially correct, orange dots are completely incorrect, and red dots are compilation errors.
Selected Observations of interest
Incorrect Elixir
GPT-3.5 and GPT-4 are not aware that only certain functions can be called in function guards, causing a compilation error:
defmodule :"anyOf-anyOf with base schema" do
def validate(value) when is_binary(value) and (String.length(value) <= 2 or String.length(value) >= 4), do: :ok
def validate(_), do: :error
end
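One way to make this compile, sketched below, is to keep only the guard-safe is_binary/1 check in the guard and move the String.length/1 calls into the function body. The module name is hypothetical, and the sketch keeps GPT's logic as-is without claiming that the logic is right for the schema.

```elixir
defmodule GuardSketch do
  # is_binary/1 is allowed in guards; String.length/1 is not,
  # so the length check happens in the body instead.
  def validate(value) when is_binary(value) do
    length = String.length(value)
    if length <= 2 or length >= 4, do: :ok, else: :error
  end

  def validate(_), do: :error
end

IO.inspect(GuardSketch.validate("abc"), label: "three characters")
```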
Misunderstanding Elixir
GPT-4 attempts to directly match the result of Map.keys/1. This likely works, but in general there is no guarantee that the result of this function will have any particular order.
defmodule :"oneOf-oneOf with missing optional property" do
def validate(%{"foo" => _} = object) do
case Map.keys(object) do
["foo"] -> :ok
_ -> :error
end
end
def validate(%{"bar" => _} = object) do
case Map.keys(object) do
["bar", "baz"] -> :ok
_ -> :error
end
end
def validate(_), do: :error
end
GPT-3.5 also often attempts to match on the result of Map.keys/1:
defmodule :"unevaluatedProperties-nested unevaluatedProperties, outer false, inner true, properties inside-gpt-3.5" do
def validate(object) when is_map(object) do
case Map.keys(object) do
["foo" | _] -> :ok
_ -> :error
end
end
def validate(_) do
:error
end
end
In this case, simply
if Map.has_key?(object, "foo"), do: :ok, else: :error
would have done the trick.
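The GPT-4 module earlier in this section can be made order-independent in a similar spirit. This is only a sketch with a hypothetical module name: it makes the same accept/reject decisions as the generated code, but by matching on the map itself and guarding on map_size/1 rather than on the list returned by Map.keys/1.

```elixir
defmodule KeyMatchSketch do
  # Exactly the key "foo" and nothing else.
  def validate(%{"foo" => _} = object) when map_size(object) == 1, do: :ok
  # Any other map containing "foo" is rejected, as in the generated code.
  def validate(%{"foo" => _}), do: :error
  # Exactly the keys "bar" and "baz" ("foo" was handled by the clauses above).
  def validate(%{"bar" => _, "baz" => _} = object) when map_size(object) == 2, do: :ok
  def validate(_), do: :error
end

IO.inspect(KeyMatchSketch.validate(%{"baz" => 1, "bar" => 2}), label: "bar and baz")
```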
Hallucinations
GPT-3.5 was particularly prone to hallucinations. In one case, it hallucinated a json_schema library (and also flubbed the parameter it passed):
defmodule :"items-prefixItems with no additional items allowed-gpt-3.5" do
def validate(object) when is_map(object), do: validate_object(object)
def validate(_), do: :error
defp validate_object(object) do
case Map.has_key?(object, :items) and Map.has_key?(object, :prefixItems) and not Map.has_key?(object, :additionalItems) do
true -> Map.get(object, :prefixItems)
|> Enum.all?(fn _ -> %{} end)
|> :json_schema.validate(:#{false})
|> handle_validation_result()
false -> :error
end
end
defp handle_validation_result(result) do
case result do
{:ok, _} -> :ok
{:error, _, _} -> :error
end
end
end
Semantic misunderstanding
{"contains":{"minimum": 5}}
GPT-4 misinterprets the schema and generates the following code:
defmodule :"contains-contains keyword validation" do
def validate(object) when is_list(object) do
if Enum.count(object) >= 5 do
:ok
else
:error
end
end
def validate(_), do: :error
end
For lists, this would be the correct code for:
{"contains": {}, "minContains": 5}
The semantic error that GPT-4 makes is that it thinks that “minimum” is a qualifier on “contains”, when in fact the schema calls for a new “context”: at least one item in the list should validate as {"minimum": 5}, but this doesn’t apply to the list itself.
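To make the "new context" idea concrete, here is a rough sketch of what a contains check does. The module name is hypothetical, and the inner schema is taken to be a lower bound of 5, mirroring the `>= 5` comparison in the generated code above. The inner check runs against each element, and the list passes as soon as one element satisfies it.

```elixir
defmodule ContainsSketch do
  # The list is valid when at least one element passes the inner schema.
  def validate(list) when is_list(list) do
    if Enum.any?(list, &item_valid?/1), do: :ok, else: :error
  end

  # `contains` only applies to lists; other types pass.
  def validate(_), do: :ok

  # Inner schema (assumed lower bound of 5): constrains numbers only,
  # so non-numeric elements pass it by default.
  defp item_valid?(item) when is_number(item), do: item >= 5
  defp item_valid?(_), do: true
end

IO.inspect(ContainsSketch.validate([3, 4, 5]), label: "one element >= 5")
IO.inspect(ContainsSketch.validate([2, 3, 4]), label: "no element >= 5")
```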
Completely misunderstanding
Several times, GPT-3.5 gave up on doing the task properly and instead wandered off into matching against the schema itself, despite being told explicitly not to. Here is the simplest example:
defmodule :"uniqueItems-uniqueItems=false validation-gpt-3.5" do
@moduledoc "Validates a JSON object against the 'uniqueItems=false' schema.\n"
@doc "Validates the given JSON object against the schema.\n"
@spec validate(Map.t()) :: :ok | :error
def validate(%{uniqueItems: false} = object) when is_list(object) do
if Enum.uniq(object) == object do
:ok
else
:error
end
end
def validate(_) do
:error
end
end
Selected Performance Comparisons
Using the Benchee library, here I set up a framework by which we can test the speed of a few representative samples of generated code. The “John Henry” contender will be Exonerate, the Elixir library that this notebook lives in. Here we set up a compare/2 function that runs Benchee and reports the winner (ips = iterations per second, bigger is better). The module will also host the code generated by Exonerate.
defmodule ExonerateBenchmarks do
require Exonerate
def compare(scenario, value, raw \\ false) do
[exonerate_ips, gpt_ips] =
%{
gpt4: fn -> apply(scenario, :validate, [value]) end,
exonerate: fn -> apply(__MODULE__, scenario, [value]) end
}
|> Benchee.run()
|> Map.get(:scenarios)
|> Enum.sort_by(& &1.name)
|> Enum.map(& &1.run_time_data.statistics.ips)
cond do
raw ->
exonerate_ips / gpt_ips
gpt_ips > exonerate_ips ->
"gpt-4 faster than exonerate by #{gpt_ips / exonerate_ips}x"
true ->
"exonerate faster than gpt-4 by #{exonerate_ips / gpt_ips}x"
end
end
Exonerate.function_from_string(
:def,
:"allOf-allOf simple types",
~S({"allOf": [{"maximum": 30}, {"minimum": 20}]})
)
Exonerate.function_from_string(
:def,
:"uniqueItems-uniqueItems validation",
~S({"uniqueItems": true})
)
Exonerate.function_from_string(
:def,
:"oneOf-oneOf with required",
~S({
"type": "object",
"oneOf": [
{ "required": ["foo", "bar"] },
{ "required": ["foo", "baz"] }
]
})
)
end
{:module, ExonerateBenchmarks, <<70, 79, 82, 49, 0, 0, 58, ...>>, [[]]}
GPT-4 wins!
{"allOf": [{"maximum": 30}, {"minimum": 20}]}
Let’s take a look at a clear case where GPT-4 is the winning contender. In this code, we apply two filters to a number using the allOf construct so that the number is subjected to both schemata. This would not be the best way to do this (probably doing this without allOf is better) but it will be very illustrative of how GPT-4 can do better.
This is GPT-4’s code:
def validate(number) when is_number(number) do
if number >= 20 and number <= 30 do
:ok
else
:error
end
end
def validate(_), do: :error
Holy moly. GPT-4 was able to deduce the intent of the allOf and see clearly that the filters collapse into a single set of conditions that can be checked without indirection.
By contrast, this is what Exonerate creates:
def validate(data) do
unquote(:"exonerate://validate/#/")(data, "/")
end
defp unquote(:"exonerate://validate/#/")(array, path) when is_list(array) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(array, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(boolean, path) when is_boolean(boolean) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(boolean, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(integer, path) when is_integer(integer) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(integer, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(null, path) when is_nil(null) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(null, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(float, path) when is_float(float) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(float, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(object, path) when is_map(object) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(object, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(string, path) when is_binary(string) do
if String.valid?(string) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(string, path) do
:ok
end
else
require Exonerate.Tools
Exonerate.Tools.mismatch(string, "exonerate://validate/", ["type"], path)
end
end
defp unquote(:"exonerate://validate/#/")(content, path) do
require Exonerate.Tools
Exonerate.Tools.mismatch(content, "exonerate://validate/", ["type"], path)
end
defp unquote(:"exonerate://validate/#/allOf")(data, path) do
require Exonerate.Tools
Enum.reduce_while([
&unquote(:"exonerate://validate/#/allOf/0")/2,
&unquote(:"exonerate://validate/#/allOf/1")/2
],
:ok,
fn fun, :ok ->
case fun.(data, path) do
:ok -> {:cont, :ok}
Exonerate.Tools.error_match(error) -> {:halt, error}
end
end)
end
defp unquote(:"exonerate://validate/#/allOf/0")(integer, path) when is_integer(integer) do
with :ok <- unquote(:"exonerate://validate/#/allOf/0/maximum")(integer, path) do
:ok
end
end
# ... SNIP ...
defp unquote(:"exonerate://validate/#/allOf/1/minimum")(number, path) do
case number do
number when number >= 20 ->
:ok
_ ->
require Exonerate.Tools
Exonerate.Tools.mismatch(number, "exonerate://validate/", ["allOf", "1", "minimum"], path)
end
end
It was so long I had to trim it down to keep from boring you. But you should be able to get the point. The exonerate code painstakingly goes through every single branch of the schema giving it its own, legible function and when there’s an error it also goes ahead and annotates the location in the schema where the error occurred, and what filter the input violated. So it’s legitimately doing more than what the GPT-4 code does, which gleefully destroyed this information that could be useful to whoever is trying to send data.
Then again, I didn’t ask it to do that. Let’s see how much of a difference in performance all this makes.
ExonerateBenchmarks.compare(:"allOf-allOf simple types", 25)
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 7.94 M 125.94 ns ±10521.80% 113 ns 159 ns
exonerate 3.47 M 288.45 ns ±13170.20% 191 ns 463 ns
Comparison:
gpt4 7.94 M
exonerate 3.47 M - 2.29x slower +162.51 ns
"gpt-4 faster than exonerate by 2.2903531909721173x"
So above, we see that GPT-4 is a bit more than 2x faster than Exonerate (about 2.3x). John Henry is defeated in this round.
Hidden Regressions
{"uniqueItems": true}
Next, let’s take a look at a very simple filter where the GPT-4 code hides a regression that is tough to spot at a quick glance. Here, GPT-4 does an obvious thing:
def validate(list) when is_list(list) do
unique_list = Enum.uniq(list)
if length(list) == length(unique_list) do
:ok
else
:error
end
end
If you’re not familiar with how the BEAM works, the regression occurs because Enum.uniq/1 is O(N) in the length of the list; length/1 is O(N) as well, so in the worst case this algorithm runs through the length of the list three times.
I won’t show you the code Exonerate generated, but suffice it to say, the validator only loops through the list once. And it even quits early if it encounters a uniqueness violation.
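To give a feel for the single-pass, early-exit idea, here is a minimal sketch. It is not the code Exonerate generates, and the module name is hypothetical. It walks the list once with Enum.reduce_while/3, remembers what it has seen in a MapSet, and halts at the first repeat. One caveat: MapSet membership is strict, so 1 and 1.0 count as different items here.

```elixir
defmodule UniqueSketch do
  def validate(list) when is_list(list) do
    result =
      Enum.reduce_while(list, MapSet.new(), fn item, seen ->
        if MapSet.member?(seen, item) do
          # Stop walking the list at the first repeated item.
          {:halt, :duplicate}
        else
          {:cont, MapSet.put(seen, item)}
        end
      end)

    if result == :duplicate, do: :error, else: :ok
  end

  # uniqueItems only applies to lists; other types pass.
  def validate(_), do: :ok
end

IO.inspect(UniqueSketch.validate(List.duplicate(1, 1000)), label: "1000 duplicates")
```

With a list of duplicates, the reduction stops at the second element regardless of how long the list is.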
When we give it a short list, GPT-4 still wins.
ExonerateBenchmarks.compare(:"uniqueItems-uniqueItems validation", [1, 2, 3])
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 3.52 M 0.28 μs ±11454.15% 0.21 μs 0.50 μs
exonerate 0.67 M 1.50 μs ±2172.91% 1.22 μs 2.35 μs
Comparison:
gpt4 3.52 M
exonerate 0.67 M - 5.28x slower +1.22 μs
"gpt-4 faster than exonerate by 5.284881178051852x"
But given a longer list, we see that Exonerate wins out.
input = List.duplicate(1, 1000)
ExonerateBenchmarks.compare(:"uniqueItems-uniqueItems validation", input)
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 964.49 K 1.04 μs ±2805.64% 0.83 μs 1.74 μs
gpt4 107.92 K 9.27 μs ±148.84% 8.80 μs 16.01 μs
Comparison:
exonerate 964.49 K
gpt4 107.92 K - 8.94x slower +8.23 μs
"exonerate faster than gpt-4 by 8.93731926728509x"
We can run lists of different lengths in both the best-case and worst-case scenarios and see where the performance crosses over.
list_lengths = [1, 3, 10, 30, 100, 300, 1000]
worst_case =
Enum.map(
list_lengths,
&ExonerateBenchmarks.compare(:"uniqueItems-uniqueItems validation", Enum.to_list(1..&1), true)
)
best_case =
Enum.map(
list_lengths,
&ExonerateBenchmarks.compare(
:"uniqueItems-uniqueItems validation",
List.duplicate(1, &1),
true
)
)
tabularized =
worst_case
|> Enum.zip(best_case)
|> Enum.zip(list_lengths)
|> Enum.flat_map(fn {{worst, best}, list_length} ->
[
%{
relative: :math.log10(worst),
length: :math.log10(list_length),
label: list_length,
group: :worst
},
%{
relative: :math.log10(best),
length: :math.log10(list_length),
label: list_length,
group: :best
}
]
end)
VegaLite.new(width: 500)
|> VegaLite.data_from_values(tabularized)
|> VegaLite.mark(:circle)
|> VegaLite.encode_field(:x, "length", type: :quantitative, title: "log_10(list_length)")
|> VegaLite.encode_field(:y, "relative",
type: :quantitative,
title: "log_10(exonerate_ips/gpt_ips)"
)
|> VegaLite.encode_field(:color, "group")
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 4.34 M 230.44 ns ±13813.36% 147 ns 280 ns
exonerate 1.42 M 703.86 ns ±4488.49% 524 ns 1202 ns
Comparison:
gpt4 4.34 M
exonerate 1.42 M - 3.05x slower +473.42 ns
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 3.40 M 0.29 μs ±11722.80% 0.21 μs 0.53 μs
exonerate 0.65 M 1.53 μs ±1543.42% 1.25 μs 2.46 μs
Comparison:
gpt4 3.40 M
exonerate 0.65 M - 5.21x slower +1.24 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 1.37 M 0.73 μs ±5156.71% 0.52 μs 1.20 μs
exonerate 0.23 M 4.32 μs ±423.54% 3.79 μs 7.61 μs
Comparison:
gpt4 1.37 M
exonerate 0.23 M - 5.92x slower +3.59 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 335.74 K 2.98 μs ±650.16% 2.54 μs 5.31 μs
exonerate 76.87 K 13.01 μs ±115.33% 12.23 μs 22.70 μs
Comparison:
gpt4 335.74 K
exonerate 76.87 K - 4.37x slower +10.03 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 84.83 K 11.79 μs ±26.89% 10.86 μs 20.85 μs
exonerate 21.57 K 46.35 μs ±20.26% 44.33 μs 77.77 μs
Comparison:
gpt4 84.83 K
exonerate 21.57 K - 3.93x slower +34.57 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 27.27 K 36.67 μs ±17.70% 33.74 μs 63.48 μs
exonerate 6.66 K 150.09 μs ±15.95% 142.29 μs 245.64 μs
Comparison:
gpt4 27.27 K
exonerate 6.66 K - 4.09x slower +113.42 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 4.90 K 204.00 μs ±23.59% 193.91 μs 375.72 μs
exonerate 1.60 K 623.75 μs ±14.33% 595.98 μs 972.79 μs
Comparison:
gpt4 4.90 K
exonerate 1.60 K - 3.06x slower +419.75 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 4.29 M 233.22 ns ±13763.63% 149 ns 290 ns
exonerate 1.45 M 691.06 ns ±4714.55% 523 ns 1117 ns
Comparison:
gpt4 4.29 M
exonerate 1.45 M - 2.96x slower +457.85 ns
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 3.82 M 0.26 μs ±14476.78% 0.166 μs 0.37 μs
exonerate 0.95 M 1.05 μs ±2504.27% 0.84 μs 1.76 μs
Comparison:
gpt4 3.82 M
exonerate 0.95 M - 4.02x slower +0.79 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 3.21 M 0.31 μs ±10437.52% 0.22 μs 0.55 μs
exonerate 0.94 M 1.06 μs ±2758.44% 0.84 μs 1.75 μs
Comparison:
gpt4 3.21 M
exonerate 0.94 M - 3.42x slower +0.75 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.03 M 0.49 μs ±4837.45% 0.40 μs 0.78 μs
exonerate 0.96 M 1.04 μs ±2452.95% 0.84 μs 1.73 μs
Comparison:
gpt4 2.03 M
exonerate 0.96 M - 2.11x slower +0.55 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 947.66 K 1.06 μs ±2855.38% 0.84 μs 1.77 μs
gpt4 873.47 K 1.14 μs ±2024.78% 1.00 μs 1.85 μs
Comparison:
exonerate 947.66 K
gpt4 873.47 K - 1.08x slower +0.0896 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 972.48 K 1.03 μs ±2688.61% 0.84 μs 1.72 μs
gpt4 346.59 K 2.89 μs ±440.43% 2.67 μs 4.98 μs
Comparison:
exonerate 972.48 K
gpt4 346.59 K - 2.81x slower +1.86 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 966.65 K 1.03 μs ±2855.32% 0.84 μs 1.73 μs
gpt4 106.10 K 9.43 μs ±91.50% 8.84 μs 16.48 μs
Comparison:
exonerate 966.65 K
gpt4 106.10 K - 9.11x slower +8.39 μs
In the worst case scenario for Exonerate, we see that the relative speeds stay about the same. This makes sense, as both processes are O(N) in the size of the list, and the Exonerate overhead is the same per function instance, even if GPT-4 actually traverses the list more times.
In the best case scenario, the crossover occurs at around 80 items in the list. This isn’t terribly good, but a 3x slowdown for a 100ns function call isn’t the end of the world. Let’s take a look at another example.
{
"type": "object",
"oneOf": [
{ "required": ["foo", "bar"] },
{ "required": ["foo", "baz"] }
]
}
Here is the function that GPT-4 generates:
def validate(object) when is_map(object) do
case Enum.count(["foo", "bar"] -- Map.keys(object)) do
0 -> :ok
_ -> case Enum.count(["foo", "baz"] -- Map.keys(object)) do
0 -> :ok
_ -> :error
end
end
end
This too is O(N) in the size of the object, whereas the code generated by exonerate is O(1). Checking to see if a constant set of items are keys in the object should be a fixed-time process.
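To make the fixed-time claim concrete, here is a minimal sketch of a validator for the schema above that never materializes the key list. The module name `RequiredOneOfSketch` is hypothetical, and this is not the code Exonerate actually emits; it only illustrates that the number of key lookups depends on the schema, not on the size of the map.

```elixir
defmodule RequiredOneOfSketch do
  # Hypothetical illustration, NOT Exonerate's generated code.
  # Each Map.has_key?/2 call is a single key lookup, so the number of
  # lookups is fixed by the schema (three distinct keys), not by map size.
  def validate(object) when is_map(object) do
    foo_bar? = Map.has_key?(object, "foo") and Map.has_key?(object, "bar")
    foo_baz? = Map.has_key?(object, "foo") and Map.has_key?(object, "baz")

    # oneOf semantics: exactly one of the two subschemas may match.
    case {foo_bar?, foo_baz?} do
      {true, false} -> :ok
      {false, true} -> :ok
      _ -> :error
    end
  end

  # "type": "object" means anything that is not a map fails validation.
  def validate(_not_a_map), do: :error
end
```

Unlike `["foo", "bar"] -- Map.keys(object)`, this version never builds an intermediate list whose length grows with the map.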
In the next cell, we’ll test several different inputs: maps with “foo” and “bar” keys, maps with “foo” and “baz” keys, and maps that only have “foo” keys. To expand the size of the map, we’ll add string number keys. All keys will have the string “foo” as values. Note that the GPT-4 code doesn’t address a map with “foo”, “bar”, and “baz” keys, which should be rejected. We expect the performance regressions to be worse for the cases without “bar”, because these cases fail the first check and so run through the keys of the map twice.
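The correctness gap noted above can be checked directly. In the sketch below, the function body is copied from the GPT-4 output quoted earlier; only the wrapper module name `Gpt4OneOfQuoted` is an assumption added so the snippet can run on its own.

```elixir
defmodule Gpt4OneOfQuoted do
  # Body copied from the GPT-4 output quoted above; only this wrapper is new.
  def validate(object) when is_map(object) do
    case Enum.count(["foo", "bar"] -- Map.keys(object)) do
      0 ->
        :ok

      _ ->
        case Enum.count(["foo", "baz"] -- Map.keys(object)) do
          0 -> :ok
          _ -> :error
        end
    end
  end
end

all_three = %{"foo" => "foo", "bar" => "foo", "baz" => "foo"}

# Both `required` subschemas match this map, so a conforming oneOf
# validator must reject it. The quoted function stops after the first
# successful check and accepts it instead.
IO.inspect(Gpt4OneOfQuoted.validate(all_three))
# prints :ok
```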
with_bar =
Enum.map(
list_lengths,
fn list_length ->
input = Map.new(["foo", "bar"] ++ Enum.map(1..list_length, &"#{&1}"), &{&1, "foo"})
ExonerateBenchmarks.compare(
:"oneOf-oneOf with required",
input,
true
)
end
)
with_baz =
Enum.map(
list_lengths,
fn list_length ->
input = Map.new(["foo", "baz"] ++ Enum.map(1..list_length, &"#{&1}"), &{&1, "foo"})
ExonerateBenchmarks.compare(
:"oneOf-oneOf with required",
input,
true
)
end
)
with_none =
Enum.map(
list_lengths,
fn list_length ->
input = Map.new(["foo"] ++ Enum.map(1..list_length, &"#{&1}"), &{&1, "foo"})
ExonerateBenchmarks.compare(
:"oneOf-oneOf with required",
input,
true
)
end
)
tabularized =
with_bar
|> Enum.zip(with_baz)
|> Enum.zip(with_none)
|> Enum.zip(list_lengths)
|> Enum.flat_map(fn {{{bar, baz}, none}, list_length} ->
[
%{
relative: :math.log10(bar),
length: :math.log10(list_length),
label: list_length,
group: :bar
},
%{
relative: :math.log10(baz),
length: :math.log10(list_length),
label: list_length,
group: :baz
},
%{
relative: :math.log10(none),
length: :math.log10(list_length),
label: list_length,
group: :none
}
]
end)
VegaLite.new(width: 500)
|> VegaLite.data_from_values(tabularized)
|> VegaLite.mark(:circle)
|> VegaLite.encode_field(:x, "length", type: :quantitative, title: "log_10(list_length)")
|> VegaLite.encode_field(:y, "relative",
type: :quantitative,
title: "log_10(exonerate_ips/gpt_ips)"
)
|> VegaLite.encode_field(:color, "group")
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 4.02 M 0.25 μs ±453.97% 0.23 μs 0.43 μs
exonerate 0.63 M 1.60 μs ±1719.89% 1.19 μs 2.45 μs
Comparison:
gpt4 4.02 M
exonerate 0.63 M - 6.42x slower +1.35 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 3.30 M 0.30 μs ±9092.90% 0.26 μs 0.51 μs
exonerate 0.60 M 1.66 μs ±1651.95% 1.25 μs 2.46 μs
Comparison:
gpt4 3.30 M
exonerate 0.60 M - 5.47x slower +1.35 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.33 M 0.43 μs ±2160.37% 0.38 μs 0.74 μs
exonerate 0.54 M 1.86 μs ±1299.83% 1.49 μs 2.81 μs
Comparison:
gpt4 2.33 M
exonerate 0.54 M - 4.34x slower +1.43 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 1.20 M 0.84 μs ±3875.60% 0.70 μs 1.37 μs
exonerate 0.37 M 2.73 μs ±925.57% 2.23 μs 4.12 μs
Comparison:
gpt4 1.20 M
exonerate 0.37 M - 3.27x slower +1.90 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 606.10 K 1.65 μs ±1681.92% 1.23 μs 2.52 μs
gpt4 381.09 K 2.62 μs ±634.41% 2.38 μs 4.51 μs
Comparison:
exonerate 606.10 K
gpt4 381.09 K - 1.59x slower +0.97 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 613.11 K 1.63 μs ±1672.92% 1.23 μs 2.48 μs
gpt4 108.29 K 9.23 μs ±124.20% 8.68 μs 16.16 μs
Comparison:
exonerate 613.11 K
gpt4 108.29 K - 5.66x slower +7.60 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 628.85 K 1.59 μs ±1761.49% 1.23 μs 2.44 μs
gpt4 31.44 K 31.80 μs ±17.56% 30.57 μs 52.44 μs
Comparison:
exonerate 628.85 K
gpt4 31.44 K - 20.00x slower +30.21 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.62 M 0.38 μs ±1623.39% 0.35 μs 0.66 μs
exonerate 0.57 M 1.75 μs ±1708.48% 1.25 μs 2.66 μs
Comparison:
gpt4 2.62 M
exonerate 0.57 M - 4.58x slower +1.37 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.19 M 0.46 μs ±3810.88% 0.41 μs 0.79 μs
exonerate 0.56 M 1.77 μs ±1688.38% 1.29 μs 2.61 μs
Comparison:
gpt4 2.19 M
exonerate 0.56 M - 3.89x slower +1.32 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 1.33 M 0.75 μs ±2545.42% 0.66 μs 1.26 μs
exonerate 0.48 M 2.07 μs ±1351.64% 1.53 μs 3.10 μs
Comparison:
gpt4 1.33 M
exonerate 0.48 M - 2.75x slower +1.32 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 695.17 K 1.44 μs ±1680.97% 1.28 μs 2.40 μs
exonerate 363.43 K 2.75 μs ±871.41% 2.24 μs 4.27 μs
Comparison:
gpt4 695.17 K
exonerate 363.43 K - 1.91x slower +1.31 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 565.35 K 1.77 μs ±1681.01% 1.27 μs 2.67 μs
gpt4 186.24 K 5.37 μs ±155.12% 4.91 μs 9.17 μs
Comparison:
exonerate 565.35 K
gpt4 186.24 K - 3.04x slower +3.60 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 567.24 K 1.76 μs ±1671.49% 1.28 μs 2.61 μs
gpt4 50.96 K 19.62 μs ±19.43% 18.96 μs 34.09 μs
Comparison:
exonerate 567.24 K
gpt4 50.96 K - 11.13x slower +17.86 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 552.85 K 1.81 μs ±1699.22% 1.28 μs 2.60 μs
gpt4 13.87 K 72.10 μs ±14.63% 69.24 μs 118.93 μs
Comparison:
exonerate 552.85 K
gpt4 13.87 K - 39.86x slower +70.29 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.55 M 0.39 μs ±1614.09% 0.35 μs 0.69 μs
exonerate 0.56 M 1.78 μs ±1733.75% 1.24 μs 2.82 μs
Comparison:
gpt4 2.55 M
exonerate 0.56 M - 4.54x slower +1.38 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.15 M 0.47 μs ±3708.10% 0.41 μs 0.82 μs
exonerate 0.56 M 1.79 μs ±1697.45% 1.30 μs 2.70 μs
Comparison:
gpt4 2.15 M
exonerate 0.56 M - 3.85x slower +1.33 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 1.32 M 0.76 μs ±2584.80% 0.65 μs 1.28 μs
exonerate 0.48 M 2.08 μs ±1353.92% 1.53 μs 3.12 μs
Comparison:
gpt4 1.32 M
exonerate 0.48 M - 2.75x slower +1.33 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 693.10 K 1.44 μs ±1228.39% 1.28 μs 2.46 μs
exonerate 358.38 K 2.79 μs ±862.70% 2.27 μs 4.40 μs
Comparison:
gpt4 693.10 K
exonerate 358.38 K - 1.93x slower +1.35 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 565.23 K 1.77 μs ±1679.28% 1.27 μs 2.67 μs
gpt4 188.25 K 5.31 μs ±234.88% 4.96 μs 9.31 μs
Comparison:
exonerate 565.23 K
gpt4 188.25 K - 3.00x slower +3.54 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 571.25 K 1.75 μs ±1674.82% 1.28 μs 2.55 μs
gpt4 50.26 K 19.89 μs ±20.74% 19.00 μs 35.15 μs
Comparison:
exonerate 571.25 K
gpt4 50.26 K - 11.36x slower +18.14 μs
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 543.67 K 1.84 μs ±1503.21% 1.30 μs 2.71 μs
gpt4 13.84 K 72.26 μs ±16.23% 68.85 μs 119.16 μs
Comparison:
exonerate 543.67 K
gpt4 13.84 K - 39.29x slower +70.42 μs
Indeed, we see exactly the relationship we expect: for maps with “foo/baz” and “foo/none”, Exonerate’s advantage over the GPT-4 code grows more dramatically than for maps with “foo/bar”. Moreover, we see a “kink” in the performance regression around N=30. This is likely because, under the hood, the BEAM virtual machine switches maps from a flat, sorted key array (with O(N) worst-case search) to a hash-based trie once a map grows past 32 keys.
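One way to observe the change in map representation mentioned above, without any timing, is to look at key order. The sketch below relies on an implementation detail of current BEAM releases rather than a documented guarantee: small maps return their keys in sorted term order, while larger maps return them in hash order. The module name `MapLayoutSketch` and its function are hypothetical.

```elixir
defmodule MapLayoutSketch do
  # Hypothetical helper: builds a map with integer keys 1..n and reports
  # whether Map.keys/1 returns those keys in ascending order.
  def keys_sorted?(n) when is_integer(n) and n > 0 do
    keys = 1..n |> Map.new(&{&1, true}) |> Map.keys()
    keys == Enum.sort(keys)
  end
end

# Assumption: on the small-map representation (up to 32 keys) the keys
# come back sorted; above that threshold the order typically follows the
# hash layout and is no longer sorted.
for n <- [8, 32, 33, 64] do
  IO.inspect({n, MapLayoutSketch.keys_sorted?(n)})
end
```

If the threshold behaves as described, the result should flip between 32 and 33 keys, which is close to where the kink appears in the plot.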
Conclusions
So, should you use GPT to generate your OpenAPI validations? Probably not… yet.
GPT-3.5 and, even more so, GPT-4 are very impressive at generating correct validations for OpenAPI schemas in Elixir. The most common systematic errors (e.g. not being sure whether to use atoms or strings) are easily addressable using prompt engineering. However, the GPT-generated code is not quite right in many cases, and sometimes it dangerously misunderstands the schema (see Semantic Misunderstanding).
Although GPT does appear to be able to perform compiler-style optimizations that generate highly efficient code, this code is not composable, and the attention of current state-of-the-art LLMs may not scale to more complex schemas. In the small, GPT makes performance errors that are likely due to its lack of understanding of the VM architecture; without repeating this experiment in other languages, though, it’s not clear whether the results would be any better elsewhere.
The use case for autogenerating code with GPT, especially for something like this, is likely to be a developer with little experience in OpenAPI and/or little experience in Elixir. For these practitioners, using GPT in lieu of a purpose-built compiler is still generally not a good idea, though I’m looking forward to repeating this experiment with GPT-6.