In this example, a DSL for programming board games is embedded into TypeScript’s tagged template literals. When we’re done, it will look something like this:
Note: This guide will assume you have written a regular syntax-highlighting extension. VS Code has a guide on how to do this.
Note: The “Developer: Inspect Editor Tokens and Scopes” command in VS Code is useful for debugging which scopes are assigned.
First, create a grammar file like the following to find the injection points within your host language’s source. In this example, the begin pattern is a regex which finds the start of a template literal that uses a tag function called ludi.
syntaxes/ts-injection.tmLanguage.json
{
"scopeName": "source.ts.embedded.ludi",
// source.ts is the scope for our host language
// `L` means to inject this grammar before `source.ts`
// `-string -comment` means to exclude string and comment scopes
"injectionSelector": "L:source.ts -string -comment",
"patterns": [
{
"begin": "(?:ludi)\\s*`", // Look for strings like ludi`
"end": "`", // Look for the ending `
"contentName": "meta.embedded.block.ludi.typescript source.ludi",
"patterns": [
{
// Process the ludi scope inside
"include": "source.ludi"
}
]
}
]
}
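To make the role of the begin and end patterns concrete, here is a minimal Python sketch that applies the same two regexes to a line of host source and extracts the embedded region. It is only an approximation: VS Code evaluates grammars with Oniguruma and also honours the injection selector, neither of which is modelled here, and the Ludi snippet in the sample string is made up for illustration.

```python
import re

# Same begin/end patterns as the injection grammar. Python's re engine is
# assumed to agree with Oniguruma for patterns this simple.
BEGIN = re.compile(r"(?:ludi)\s*`")
END = re.compile(r"`")

def embedded_regions(source: str) -> list[str]:
    """Return the text found between each ludi` ... ` pair in source."""
    regions = []
    pos = 0
    while (begin := BEGIN.search(source, pos)) is not None:
        end = END.search(source, begin.end())
        if end is None:
            break  # unterminated template literal
        regions.append(source[begin.end():end.start()])
        pos = end.end()
    return regions

# "board 3x3" is a hypothetical piece of Ludi code.
sample = "const game = ludi`board 3x3`; const s = `plain`;"
print(embedded_regions(sample))
```

Only the tagged literal is picked up; the untagged template literal is left to the host grammar.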
Next, the grammar is added to package.json. In this example, support for the entire Ludi language already exists, so only the last section is added.
package.json
{
"name": "ludi",
...
"contributes": {
// Language configuration
"languages": [{
"id": "ludi",
"aliases": ["Ludi", "ludi"],
"extensions": [".ludi"],
"configuration": "./language-configuration.json"
}],
"grammars": [
// Grammar for Ludi
{
"language": "ludi",
"scopeName": "source.ludi",
"path": "./syntaxes/ludi.tmLanguage.json"
},
// Injecting into typescript's grammar. This part is new.
{
"path": "./syntaxes/ts-injection.tmLanguage.json",
"scopeName": "source.ts.embedded.ludi", // Scope matches `scopeName` in the grammar
"injectTo": ["source.ts"], // Host language's scope
"embeddedLanguages": {
// The key must match the first scope in the grammar's `contentName`
"meta.embedded.block.ludi.typescript": "ludi" // The language id defined above
}
}
]
}
}
That’s it, you’re done!
If a payload is “malformed” that means it is not syntactically valid. This means that there is some syntax issue keeping it from being well-formed, so trying to parse it will result in an error. If the payload happens to be in a binary format, it could be said that it cannot be deserialized.
For example, the following JSON is malformed for several reasons:
{
"key": "value,
"key2:" value",
"list": [
}
If a payload is “invalid” that means it has failed validation. The payload could be parsed and therefore is well-formed, but it does not meet certain validation constraints. Since a program must be able to parse a payload in order to validate it, all invalid payloads are also well-formed [1].
The fundamental difference is that these constraints enforce whether a program is willing to accept a payload rather than whether a program can understand it.
For example, age is not allowed to be negative in the following payload:
{
"age": -12
}
The JSON communicates that the age key maps to the number -12, but knowing that ages must be non-negative, a program may choose to reject it.
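The two failure modes can be separated in code. The sketch below uses Python's standard json module; the classify function and its single rule (age must be a non-negative number) are assumptions made for illustration, not a general validation scheme.

```python
import json

def classify(payload: str) -> str:
    """Classify a JSON payload as malformed, invalid, or accepted."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return "malformed"  # the parser could not understand the payload
    # Hypothetical validation rule: "age" must be a non-negative number.
    age = data.get("age") if isinstance(data, dict) else None
    is_number = isinstance(age, (int, float)) and not isinstance(age, bool)
    if not is_number or age < 0:
        return "invalid"  # parsed fine, but the program refuses it
    return "accepted"

print(classify('{"list": [}'))     # syntax error
print(classify('{"age": -12}'))    # well-formed, fails validation
print(classify('{"age": 12}'))     # well-formed and valid
```

Parsing happens first, so a payload can only be reported as invalid once it is known to be well-formed.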
What about this payload, where count is expected to be an integer?
{
"count": "123"
}
This case may seem ambiguous, but even though an integer is expected, the payload is syntactically valid [2]. The program reading this payload can choose whether it is invalid or not. The syntax of JSON allows any value to be of any type, so there are no problems with well-formedness.
The previous payload is distinct from this one below where the value of count is a number.
{
"count": 123
}
It’s worth noting that many JSON parsers will automatically convert strings to integers when possible, but this just changes what a program is willing to accept and not the definition of JSON syntax [3].
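As a concrete illustration, Python's json module keeps the two payloads distinct and performs no conversion, so any leniency has to be added by the program itself. The read_count helper and its lenient flag below are hypothetical names introduced for this sketch.

```python
import json

as_string = json.loads('{"count": "123"}')["count"]
as_number = json.loads('{"count": 123}')["count"]

# Both payloads are well-formed, but they carry different JSON types.
print(type(as_string).__name__, type(as_number).__name__)

def read_count(payload: str, lenient: bool = False) -> int:
    """Hypothetical validator: accept an integer, optionally a digit string."""
    value = json.loads(payload)["count"]
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    if lenient and isinstance(value, str) and value.isascii() and value.isdigit():
        return int(value)  # a choice made by the program, not by JSON
    raise ValueError(f"count must be an integer, got {value!r}")
```

Whether the string form is accepted is decided by the lenient flag, which is a validation policy and has nothing to do with JSON syntax.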
Not all formats have the same approach to mismatched types. For example, in a protobuf document the types are specified making the distinction a matter of well-formedness rather than validity.
message Payload {
int32 count = 1;
}
What about this payload with duplicated keys?
{
"key": "value1",
"key": "value2"
}
Again, it is syntactically valid but this time it is also ambiguous. I would argue that the most reasonable thing to do is to reject the payload as invalid due to the ambiguity. There are many cases where the payload must be interpreted anyways and some reasonable choice must be made.
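Both behaviours can be seen with Python's json module. By default it silently keeps the last value for a repeated key, while the object_pairs_hook parameter lets a program reject the payload instead. The reject_duplicates function is a sketch written for this example.

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that fails on repeated keys instead of picking one."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result

payload = '{"key": "value1", "key": "value2"}'

# Default behaviour: the payload is interpreted anyway and the last value wins.
default = json.loads(payload)
print(default)

# Strict behaviour: the ambiguity is treated as a validation failure.
try:
    json.loads(payload, object_pairs_hook=reject_duplicates)
except ValueError as error:
    print("rejected:", error)
```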
[1] It’s possible (and performant) to write a parser which validates fields in a streaming fashion as they are parsed rather than all at once after the parsing is complete. Therefore, a document could be rejected for being invalid even though it is later malformed. For example:
{
"age": -12,
"list": [
}
In this case, either message is useful. Both issues must be solved eventually.
[2] https://www.json.org/json-en.html
https://stackoverflow.com/questions/15368231/can-json-numbers-be-quoted
[3] There are situations where a number must be encoded as a string, but again this is beyond the scope of syntax.
When using floating point, it is easy to write programs which may seem to compute the right answer but are actually hiding subtle bugs. In serious applications numerical computing quickly gets complicated, requiring the consideration of many factors, like the accumulation of error, numerical stability, and how the numbers flow throughout the program. Knowing some floating point quirks provides a good foundation for when your math starts to look off.
While this list does not go into the details of correctly using floating point numbers, it does enumerate a number of assumptions often made by programmers.
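A few well-known quirks can be checked directly. The cases below were picked for this sketch as familiar examples under IEEE 754 double precision (which Python's float uses on common platforms); they are not taken from the list itself.

```python
import math

# Decimal fractions such as 0.1 have no exact binary representation.
sum_is_exact = (0.1 + 0.2 == 0.3)

# Rounding happens after every operation, so grouping changes the result.
is_associative = ((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

# NaN compares unequal to everything, including itself.
nan = float("nan")
nan_equals_itself = (nan == nan)

# Above 2**53 not every integer is representable, so adding 1 can be lost.
increment_is_lost = (2.0**53 + 1.0 == 2.0**53)

# Comparing with a tolerance is usually what was intended.
close_enough = math.isclose(0.1 + 0.2, 0.3)

print(sum_is_exact, is_associative, nan_equals_itself)
print(increment_is_lost, close_enough)
```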
All of these assumptions are wrong
2^n
bits
This post represents a snapshot in the year 2021. The guidelines here are not official guidance from the .NET team and are not endorsed by Microsoft, but represent a combination of what my team at Microsoft uses as well as my own personal preference. The project in this post will target .NET 5 and C# 9.0, and use the .NET 5 SDK.
The repository is available on GitHub here.
dotnet build and dotnet test. This makes integration with CI easy, and allows developers and CI to share the same pipeline.
With this setup, dependencies are so limited that Visual Studio is not required to be productive.
The following dependencies should be installed:
If it’s likely that team members have old .NET versions installed, you can enforce a minimum through a global.json file in the root. There’s also some versioning information here which will become relevant later.
global.json
{
"sdk": {
"version": "5.0.103",
"rollForward": "latestMajor"
},
"msbuild-sdks": {
"Microsoft.Build.Traversal": "3.0.3"
}
}
Below is a directory listing of the project. Each item will be explained in its section. Items marked with an asterisk are considered optional or project-dependent.
StarterProject
│ .gitignore
│ Directory.Packages.props
│ Directory.Build.props
│ dirs.proj
│ global.json
│ README.md
├───deployment*
├───docs*
├───shell
│ Init.ps1
│ MyTool.psm1
│ VisualStudio.psm1
├───src
│ │ dirs.proj
│ ├───MyComponent
│ │ │ StarterProject.MyComponent.csproj
│ │ │ Source.cs
│ │ └───Folder
│ │ MoreSource.cs
│ └───AnotherComponent
│ StarterProject.AnotherComponent.csproj
│ AnotherSource.cs
├───test
│ │ dirs.proj
│ └───MyComponent
│ StarterProject.Test.MyComponent.csproj
│ ExampleTest.cs
└───tools*
There are some properties not set by default which should be used on new .NET projects. These can be configured in Directory.Build.props, which is applied to all projects within the directory. Other global configurations can be made here as well. I have included some packaging-related ones for the sake of example.
Directory.Build.props
<Project>
<!-- General -->
<PropertyGroup>
<TargetFramework>net5.0</TargetFramework>
<LangVersion>9.0</LangVersion>
<Nullable>enable</Nullable>
<Features>strict</Features>
</PropertyGroup>
<!-- Build -->
<PropertyGroup>
<TreatWarningsAsErrors>true</TreatWarningsAsErrors>
<ManagePackageVersionsCentrally>true</ManagePackageVersionsCentrally>
<EnforceCodeStyleInBuild>true</EnforceCodeStyleInBuild> <!-- Enable linter -->
</PropertyGroup>
<!-- Packaging -->
<PropertyGroup>
<!-- Enable packaging on a per-project basis. -->
<IsPackable>false</IsPackable>
<IsPublishable>false</IsPublishable>
<IncludeSymbols>true</IncludeSymbols>
<SymbolPackageFormat>snupkg</SymbolPackageFormat>
<EmbedUntrackedSources>true</EmbedUntrackedSources>
<Authors>Author One; Author Two</Authors>
<PackageLicenseExpression>GPL-3.0-only</PackageLicenseExpression>
<Description>Example project description.</Description>
<PackageTags>dotnet</PackageTags>
</PropertyGroup>
</Project>
Here’s the .gitignore being used. Note that .sln files are being ignored because they will be generated as needed and not checked in. More on that later.
.gitignore
**/bin
**/obj
**/TestResults/
*.sln
.vs/
The key to the source organization is the use of the Microsoft.Build.Traversal SDK. It allows projects to be hierarchically structured within the repository. Each folder has either a file called dirs.proj or a .csproj for the project. The dirs.proj references where the child project files are located. The version of this SDK is specified in global.json.
dirs.proj
<Project Sdk="Microsoft.Build.Traversal">
<ItemGroup>
<ProjectReference Include="src/dirs.proj" />
<ProjectReference Include="test/dirs.proj" />
</ItemGroup>
</Project>
src/dirs.proj
<Project Sdk="Microsoft.Build.Traversal">
<ItemGroup>
<ProjectReference Include="MyComponent/StarterProject.MyComponent.csproj" />
<ProjectReference Include="AnotherComponent/StarterProject.AnotherComponent.csproj" />
</ItemGroup>
</Project>
It’s also possible to define one dirs.proj which automatically references any projects under src and test.
dirs.proj
<Project Sdk="Microsoft.Build.Traversal">
<ItemGroup>
<ProjectReference Include="src\**\*.*proj" />
<ProjectReference Include="test\**\*.*proj" />
</ItemGroup>
</Project>
Source files are split into two folders, src and test. Within each folder is a tree of projects.
StarterProject
│ dirs.proj
├───src
│ │ dirs.proj
│ ├───MyComponent
│ │ └───Folder
│ │ StarterProject.MyComponent.csproj
│ └───AnotherComponent
│ StarterProject.AnotherComponent.csproj
└───test
│ dirs.proj
└───MyComponent
StarterProject.Test.MyComponent.csproj
Dependencies are made between projects using project references. This implies that project boundaries are drawn around self-contained components. .NET will prohibit circular dependencies. Within a project, folders can be used to group files if more than one namespace is needed.
src/MyComponent/StarterProject.MyComponent.csproj
<Project Sdk="Microsoft.NET.Sdk">
<ItemGroup>
<ProjectReference Include="../AnotherComponent/StarterProject.AnotherComponent.csproj" />
</ItemGroup>
<ItemGroup>
<PackageReference Include="Newtonsoft.Json" />
<PackageReference Include="Serilog" />
</ItemGroup>
</Project>
NuGet is the package manager of choice for .NET applications. It is configured in two parts: Directory.Packages.props gives the version number for each package, and each project file references the packages it uses.
Note: The functionality described is currently in preview, but represents the direction of the .NET SDK. A stable alternative is the CentralPackageVersions SDK, which does the same thing with slightly more boilerplate.
Here’s what Directory.Packages.props may look like. Dependencies are sorted by usage, then alphabetically by package name.
Directory.Packages.props
<Project>
<!-- Runtime -->
<ItemGroup>
<PackageVersion Include="Newtonsoft.Json" Version="12.0.3" />
<PackageVersion Include="Serilog" Version="2.10.0" />
</ItemGroup>
<!-- Test -->
<ItemGroup>
<PackageVersion Include="Microsoft.NET.Test.Sdk" Version="16.8.0" />
<PackageVersion Include="Moq" Version="4.13.1" />
<PackageVersion Include="xunit" Version="2.4.1" />
<PackageVersion Include="xunit.runner.visualstudio" Version="2.4.1" />
</ItemGroup>
</Project>
A project can then reference one of these packages.
src/AnotherComponent/StarterProject.AnotherComponent.csproj
<Project Sdk="Microsoft.NET.Sdk">
<ItemGroup>
<PackageReference Include="Newtonsoft.Json" />
<PackageReference Include="Serilog" />
</ItemGroup>
</Project>
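The relationship between the two files can be modelled as a simple lookup: every PackageReference in a project takes its version from the matching PackageVersion entry. The Python sketch below is only a rough model of that idea, not how NuGet is implemented, and resolve_versions is a name invented for this example.

```python
import xml.etree.ElementTree as ET

PACKAGES_PROPS = """<Project>
  <ItemGroup>
    <PackageVersion Include="Newtonsoft.Json" Version="12.0.3" />
    <PackageVersion Include="Serilog" Version="2.10.0" />
  </ItemGroup>
</Project>"""

PROJECT = """<Project Sdk="Microsoft.NET.Sdk">
  <ItemGroup>
    <PackageReference Include="Newtonsoft.Json" />
    <PackageReference Include="Serilog" />
  </ItemGroup>
</Project>"""

def resolve_versions(props_xml: str, project_xml: str) -> dict[str, str]:
    """Pair each PackageReference with its centrally declared version."""
    versions = {
        item.get("Include"): item.get("Version")
        for item in ET.fromstring(props_xml).iter("PackageVersion")
    }
    resolved = {}
    for reference in ET.fromstring(project_xml).iter("PackageReference"):
        name = reference.get("Include")
        if name not in versions:
            raise KeyError(f"no PackageVersion declared for {name}")
        resolved[name] = versions[name]
    return resolved

print(resolve_versions(PACKAGES_PROPS, PROJECT))
```

A check like this could run in CI to catch a reference that has no central version, although the build itself is expected to report the same problem.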
Since this functionality is currently in preview, each project must have ManagePackageVersionsCentrally set to true. This can be done globally in Directory.Build.props. The default value of this property will be true in future versions of the .NET SDK.
Directory.Build.props
<PropertyGroup>
<ManagePackageVersionsCentrally>true</ManagePackageVersionsCentrally>
</PropertyGroup>
It can be useful to have a collection of scripts related to the project checked in. PowerShell is my automation language of choice, not only for its integration with .NET, but also because scripts tend to be easier to write and more maintainable than other scripting alternatives. PowerShell can be used on both Linux and Windows.
An entrypoint is defined as follows, which imports all other PowerShell scripts where commands are defined. In this case there are only two.
shell/Init.ps1
Import-Module $PSScriptRoot/VisualStudio.psm1
Import-Module $PSScriptRoot/MyTool.psm1 # Optional
Write-Host -ForegroundColor Cyan "Welcome to StarterProject shell"
This can be invoked directly when starting the shell. Running this script will load any commands that the .psm1 files export.
PS StarterProject> .\shell\Init.ps1
Welcome to StarterProject shell
While developers can use any editor, many will want to work from Visual Studio. Visual Studio requires a solution file in order to be run. Within Microsoft, it is quite common not to check in .sln files and instead generate them using one of many tools. Here is a short PowerShell script which can be used to do the same thing.
shell/VisualStudio.psm1
function Start-VisualStudio() {
$solutionName = (Get-Item .).Name
dotnet new sln --force --name $solutionName
Get-ChildItem -Recurse *.csproj | ForEach { dotnet sln add $_.FullName }
start "$solutionName.sln" # This part only works on windows
}
Export-ModuleMember *-*
Running it will generate the solution file and launch Visual Studio if installed.
PS StarterProject> .\shell\Init.ps1
Welcome to StarterProject shell
PS StarterProject> Start-VisualStudio
The template "Solution File" was created successfully.
Project `src\AnotherComponent\StarterProject.AnotherComponent.csproj` added to the solution.
Project `src\MyComponent\StarterProject.MyComponent.csproj` added to the solution.
Project `test\MyComponent\StarterProject.Test.MyComponent.csproj` added to the solution.
Project `tools\MyTool\StarterProject.MyTool.csproj` added to the solution.
The command can also be run from a different location within the repo to generate a solution with a smaller scope.
PS StarterProject\src\MyComponent> Start-VisualStudio
The template "Solution File" was created successfully.
Project `StarterProject.MyComponent.csproj` added to the solution.
Note: The slngen tool is a more robust alternative to this script with better MSBuild integration. However, because it has dependencies on Visual Studio and MSBuild which require extra configuration, it is not included in this guide.
There are several popular options for testing in .NET.
This guide will choose xUnit out of personal preference.
Tests are organized with a hierarchy that parallels the code being tested. This gives something like the following structure.
├───src
│ │ dirs.proj
│ └───MyComponent
│ StarterProject.MyComponent.csproj
│ Source.cs
└───test
│ dirs.proj
└───MyComponent
StarterProject.Test.MyComponent.csproj
SourceTest.cs
Tests use relative project references to refer to the code they are testing.
test/MyComponent/StarterProject.Test.MyComponent.csproj
<Project Sdk="Microsoft.NET.Sdk">
<ItemGroup>
<ProjectReference Include="../../src/MyComponent/StarterProject.MyComponent.csproj" />
</ItemGroup>
<ItemGroup>
<PackageReference Include="Microsoft.NET.Test.Sdk" />
<PackageReference Include="Moq" />
<PackageReference Include="xunit" />
<PackageReference Include="xunit.runner.visualstudio" />
</ItemGroup>
</Project>
Code style analyzers have been added to .NET 5. To enable them, a .editorconfig file must be created and the EnforceCodeStyleInBuild property should be enabled. Using this property will cause IDExxxx rules to be emitted.
Directory.Build.props
<PropertyGroup>
<EnforceCodeStyleInBuild>true</EnforceCodeStyleInBuild> <!-- Enable linter -->
</PropertyGroup>
The .editorconfig file is too large to reproduce here, but you can see an example in the SampleProject repo.
Code quality analyzers (CAxxxx) are enabled by default.
/docs
The /docs folder is a great place to store documentation alongside the code. A simple wiki can be created here as a collection of markdown files. By checking documentation into the repo through pull requests, it undergoes the same quality gates as the rest of the code.
/deployment
If the project will be run as a service, /deployment is a good place to put any configuration or automation related to making deployments.
/tools
Any ad-hoc tools can be placed here. If they are written in .NET, a simple wrapper in the shell folder can be written to invoke dotnet run. This will compile and run the program.
shell/MyTool.psm1
function Invoke-MyTool() {
dotnet run --project tools/MyTool/StarterProject.MyTool.csproj -- @args
}
Export-ModuleMember *-*
Running it:
PS StarterProject> .\shell\Init.ps1
Welcome to StarterProject shell
PS StarterProject> Invoke-MyTool arg1 arg2
Hello from MyTool! Arguments: [arg1,arg2]
Serialization converts in-memory data to an external representation, typically a list or stream of bytes, which is then ready to be stored on disk or sent over the network.
This post describes Cheerios, a verified library for serialization in Coq. Cheerios uses typeclasses to make it easy to create new serializers by composing existing serializers, such that the correctness proofs also compose. We first give an overview of the core definitions of Cheerios and show how to build simple serializers for booleans, natural numbers, and pairs. Then, we describe two generic strategies for serializing recursive “container-like” types, such as lists and trees, and discuss the tradeoffs in proof effort between the strategies. A recurring theme is the challenge of expressing decoders via structural recursion.
This post is generated from a literate Coq file, which we encourage you to step through.
In order to define serialization, three things are needed: types for the serialization and deserialization functions, and a correctness specification. The correctness spec should roughly show that serialization and deserialization are inverses. This enables the proof that any object can be serialized then deserialized into the same object. We’ll start with serialization because it conceptually comes first in the process.
In order to serialize something, all of its information must be mapped into bits. It makes sense then to define a serializer for some type A as A -> list bool. Take the following type for example, representing Olympic medals:
Inductive medal := Gold | Silver | Bronze.
A serialization function should map each case to a sequence of bits. There are many ways this could be done, each with different trade-offs that will be explored later. For now, we just pick one.
Definition medal_serialize (m: medal) : list bool :=
match m with
| Gold => [true; true]
| Silver => [true; false]
| Bronze => [false]
end.
As it turns out, this first attempt at a type will be exactly what is needed.
Now a type for the deserializer can be determined. We want something that acts as an inverse to the serialization function we picked. At first thought, list bool -> A seems like a good option. This would allow the correctness spec to be deserialize (serialize a) = a. However, this runs into problems pretty quickly.
Fail Definition medal_deserialize (bools: list bool) : medal :=
match bools with
| [true; true] => Gold
| [true; false] => Silver
| [false] => Bronze
end.
Coq catches the mistake and points out that bools is not exhaustively matched on. What if it’s empty? Fundamentally, this problem is encountered because not every sequence of booleans decodes into a medal. Even non-empty sequences such as [false; true] pose issues. Since these sequences are not produced by the serializer, they can be considered erroneous. In Cheerios we handle this case by returning the option constructor None to indicate an error.
This makes the spec become deserialize (serialize a) = Some a. In English: deserialization on any serialized stream always succeeds and returns the correct value.
Definition medal_deserialize1 (bools: list bool):option medal :=
match bools with
| [true; true] => Some Gold
| [true; false] => Some Silver
| [false] => Some Bronze
| _ => None
end.
This works for a single medal being encoded in the bitstream, but there are problems when the work from above is reused for a type which requires composition, like a pair of medals. Serialization works just fine, but deserialization is problematic.
Definition medal_serialize_pair (medals: medal * medal) :=
medal_serialize (fst medals) ++ medal_serialize (snd medals).
Fail Definition medal_deserialize_pair (bools: list bool)
: option (medal * medal) :=
(medal_deserialize1 bools, medal_deserialize1 hmmm).
When deserializing the first medal, the entire list is consumed. There is nothing to pass into the second call to medal_deserialize1 because it is not known how much of the list has been deserialized. The definition of deserialize needs a way to communicate how much of the stream is remaining back to the caller. In Cheerios, this is represented with the type medal * list bool, where the deserialized medal and remaining portion of the stream are returned. This is wrapped in an option to allow the entire deserialization operation to fail. Failure happens at this level because once an error is encountered, it is impossible in general to resume deserialization of the remaining content.
Definition medal_deserialize (bools: list bool)
: option (medal * list bool) :=
match bools with
| true :: true :: bools => Some (Gold, bools)
| true :: false :: bools => Some (Silver, bools)
| false :: bools => Some (Bronze, bools)
| _ => None
end.
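For readers who want to experiment outside of Coq, the same shape can be mirrored in Python. This is an executable sketch rather than part of Cheerios: medals are plain strings, bits are lists of booleans, and None plays the role of the failed option.

```python
from typing import Optional

Bits = list[bool]

# Same encoding as medal_serialize in the Coq code.
ENCODING = {"Gold": [True, True], "Silver": [True, False], "Bronze": [False]}

def medal_serialize(medal: str) -> Bits:
    return list(ENCODING[medal])

def medal_deserialize(bits: Bits) -> Optional[tuple[str, Bits]]:
    """Decode one medal from the front of bits and return it with the rest."""
    for medal, code in ENCODING.items():
        if bits[:len(code)] == code:
            return medal, bits[len(code):]
    return None  # no prefix of bits encodes a medal

stream = medal_serialize("Silver") + medal_serialize("Bronze")
first = medal_deserialize(stream)
print(first)
print(medal_deserialize(first[1]))
```

Because the unconsumed bits are handed back, the second call can pick up exactly where the first one stopped.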
As we will see shortly, this type is sufficient to support both composition and malformed inputs. Let’s take a moment to generalize these definitions before continuing so we can arrive at a definition for the spec.
Definition serializer (A: Type) := A -> list bool.
Definition deserializer (A: Type) :=
list bool -> option (A * list bool).
How does this alter the correctness specification? We can start by taking what we had last time and making it typecheck:
deser (ser a) = Some (a, [])
However this still doesn’t address the problem with the remaining bools. How do we reason about deserialization with any other input following? Another attempt leads us to something like this:
deser (ser a ++ ser b) = Some (a, ser b)
This works, but now exactly two objects must be encoded in the stream. We can’t easily reason about deserializing multiple objects or a single object this way. Generalizing again for what comes after gives:
deser (ser a ++ bools) = Some (a, bools)
Now the dependence on a second object is removed and as a side effect the spec becomes stronger, allowing any data to follow rather than just data produced by some serializer.
Note that the spec only needs to worry about encodings which the serializer produces. This eliminates our need to reason about the error cases that were necessary in the deserializer. However, in doing this, nothing is said about how malformed bitstrings are parsed, or that every deserialized value can be generated by exactly one bit string. These may be useful properties to prove, but Cheerios does not handle deserialization from unknown and unverified sources since this minimum spec is enough.
Definition ser_deser_spec A
(ser : serializer A)
(deser : deserializer A) :=
forall (a : A) (bools: list bool),
(deser (ser a ++ bools)) = Some (a, bools).
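The spec quantifies over every value and every trailing list of bits, which Coq establishes by proof. As a lightweight approximation, the same equation can be tested on a finite sample in Python. The check_spec function and the one-bit boolean serializer below are assumptions made for this sketch, and passing the check is evidence, not a proof.

```python
from itertools import product
from typing import Callable, Iterable, Optional, TypeVar

A = TypeVar("A")
Bits = list[bool]

def check_spec(ser: Callable[[A], Bits],
               deser: Callable[[Bits], Optional[tuple[A, Bits]]],
               values: Iterable[A],
               max_suffix_len: int = 3) -> bool:
    """Test deser(ser(a) + bools) == (a, bools) on sample values and suffixes."""
    suffixes = [list(s) for n in range(max_suffix_len + 1)
                for s in product([False, True], repeat=n)]
    return all(deser(ser(a) + bools) == (a, bools)
               for a in values for bools in suffixes)

# Hypothetical one-bit serializer for booleans, used only to exercise the check.
def bool_ser(b: bool) -> Bits:
    return [b]

def bool_deser(bits: Bits) -> Optional[tuple[bool, Bits]]:
    return (bits[0], bits[1:]) if bits else None

# A deserializer that throws away the remaining stream violates the spec.
def greedy_deser(bits: Bits) -> Optional[tuple[bool, Bits]]:
    return (bits[0], []) if bits else None

print(check_spec(bool_ser, bool_deser, [False, True]))
print(check_spec(bool_ser, greedy_deser, [False, True]))
```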
Wrapping this up in a class gives us the following definition, which bundles three things: a serializer, a deserializer, and a proof of correctness.
Class Serializer (A : Type) : Type := {
serialize : A -> list bool;
deserialize : list bool -> option (A * list bool);
ser_deser_identity : ser_deser_spec A serialize deserialize
}.
In general, the correctness proofs tend to be straightforward and repetitive, but this first one is included here to show the structure. Concretely this becomes:
Theorem medal_ser_deser_identity :
ser_deser_spec medal medal_serialize medal_deserialize.
Proof.
unfold ser_deser_spec.
unfold medal_deserialize.
unfold medal_serialize.
intros m.
destruct m; reflexivity.
Qed.
Instance MedalSerializer : Serializer medal.
Proof.
exact {| serialize := medal_serialize;
deserialize := medal_deserialize;
ser_deser_identity := medal_ser_deser_identity;
|}.
Defined.
Generalizing this pair deserializer for arbitrary types A and B comes naturally now that there are better type signatures for serialization and deserialization. Wrapping all three components in a section avoids some boilerplate. Note that the type system requires a serializer for A and B in order for the A * B serializer to function.
Section PairSerializer.
Variable A : Type.
Variable B : Type.
Variable serA : Serializer A.
Variable serB : Serializer B.
Definition pair_serialize (p : A * B) : list bool :=
serialize (fst p) ++ serialize (snd p).
Definition pair_deserialize bools
: option ((A * B) * list bool) :=
match deserialize bools with
| Some (a, bools) =>
match deserialize bools with
| Some (b, bools) => Some ((a, b), bools)
| None => None
end
| None => None
end.
Theorem pair_ser_deser_identity :
ser_deser_spec (A * B) pair_serialize pair_deserialize.
Proof.
unfold ser_deser_spec.
intros.
unfold pair_serialize.
rewrite app_ass.
unfold pair_deserialize.
rewrite ser_deser_identity, ser_deser_identity.
rewrite <- surjective_pairing.
reflexivity.
Qed.
Instance PairSerializer : Serializer (A * B).
Proof.
exact {| serialize := pair_serialize;
deserialize := pair_deserialize;
ser_deser_identity := pair_ser_deser_identity;
|}.
Defined.
End PairSerializer.
Note that the variable bools is shadowed several times in this definition. Normally this can complicate code, but in this case it improves clarity because bools always refers to “what’s left to parse”.
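The composition pattern carries over to other languages. Below is a Python sketch of a pair combinator in the same style, with the serializers passed explicitly instead of being found through typeclass resolution; pair_serializer and the one-bit boolean serializer are names invented for this example, and no correctness proof comes with it.

```python
Bits = list[bool]

def pair_serializer(ser_a, deser_a, ser_b, deser_b):
    """Build (serialize, deserialize) for pairs from two component serializers."""
    def serialize(pair) -> Bits:
        a, b = pair
        return ser_a(a) + ser_b(b)

    def deserialize(bits: Bits):
        first = deser_a(bits)
        if first is None:
            return None
        a, bits = first      # bits now means "what is left to parse"
        second = deser_b(bits)
        if second is None:
            return None
        b, bits = second
        return (a, b), bits

    return serialize, deserialize

# Hypothetical one-bit boolean serializer used as the component.
def bool_ser(b: bool) -> Bits:
    return [b]

def bool_deser(bits: Bits):
    return (bits[0], bits[1:]) if bits else None

ser, deser = pair_serializer(bool_ser, bool_deser, bool_ser, bool_deser)
print(deser(ser((True, False)) + [True]))

# Pair serializers compose again, here into ((bool, bool), bool).
ser2, deser2 = pair_serializer(ser, deser, bool_ser, bool_deser)
print(deser2(ser2(((True, False), True))))
```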
Now, we will build a simple (inefficient [1]) serializer/deserializer for a more useful datatype, nats. The encoding is essentially the unary representation of the natural number.
Fixpoint nat_serialize (n : nat) : list bool :=
match n with
| O => [false]
| S n => [true] ++ (nat_serialize n)
end.
Fixpoint nat_deserialize bools : option (nat * list bool) :=
match bools with
| true :: bools =>
match nat_deserialize bools with
| None => None
| Some (n, bools) => Some (S n, bools)
end
| false :: bools => Some (O, bools)
| [] => None (* Deserializing an empty stream *)
end.
Theorem nat_ser_deser_identity :
ser_deser_spec nat nat_serialize nat_deserialize.
Proof.
unfold ser_deser_spec.
intros n; induction n; intros.
- simpl. reflexivity.
- simpl.
rewrite IHn.
reflexivity.
Qed.
Instance NatSerializer : Serializer nat.
Proof.
exact {| serialize := nat_serialize;
deserialize := nat_deserialize;
ser_deser_identity := nat_ser_deser_identity;
|}.
Defined.
Notice that the information about when to stop deserialization of each element must be encoded into the stream itself. For example with the following definition of nat_serialize, deserialization of nat * nat would become problematic.
Fixpoint nat_serialize_broken (n : nat) : list bool :=
match n with
| O => []
| S n => [true] ++ (nat_serialize_broken n)
end.
Under this definition, it’s unclear what deserializing [true, true, true] as a pair of nats should return. It could be (0,3), (1,2), (2,1) or (3,0). To remove this ambiguity, the information about when to stop must be encoded in the stream itself in one form or another, rather than implicitly by using the end of the stream as a token.
Consider the serialized pair of nats [true, false, true, true, false], serialized using the not-broken serializer. It is unambiguously (1, 2). When deserializing, it is known precisely when each nat finishes (when false is reached), and when the pair finishes (when the second nat finishes). This information about the structure of the encoded data plays a crucial part in showing ser_deser_identity.
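The difference between the two encodings is easy to observe by running them. The Python sketch below mirrors the unary serializer and its broken variant, using int for nat and a loop in place of structural recursion; it is an illustration only and not part of Cheerios.

```python
from typing import Optional

Bits = list[bool]

def nat_serialize(n: int) -> Bits:
    """Unary encoding: n copies of True followed by a False terminator."""
    return [True] * n + [False]

def nat_deserialize(bits: Bits) -> Optional[tuple[int, Bits]]:
    for index, bit in enumerate(bits):
        if not bit:
            return index, bits[index + 1:]
    return None  # the stream ended before the terminator was seen

def nat_serialize_broken(n: int) -> Bits:
    """Same encoding without the terminator, so the end is not marked."""
    return [True] * n

pair_bits = nat_serialize(1) + nat_serialize(2)
first = nat_deserialize(pair_bits)
second = nat_deserialize(first[1])
print(pair_bits, first[0], second[0])

# With the broken encoding, different pairs produce the same stream.
print(nat_serialize_broken(0) + nat_serialize_broken(3)
      == nat_serialize_broken(1) + nat_serialize_broken(2))
```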
When serializing lists (or any variable sized collection) there must be some information about the structure in the serialized stream. Imagine this is not done, and a pair of lists is serialized into the byte stream. This would produce an encoding which looks like the figure below. It’s impossible to tell where one list stops and the next begins just by looking at the stream.
This serializer is broken for the same reason as the broken nat serializer: the information in a serialized object must be entirely contained within the bitstream. Note that we don’t run into this problem with any collection of fixed size, like a pair or vector. It is clear when to stop deserializing a Vec 5 because 5 elements have been deserialized. In this case, the information about the shape of the data is encoded in the type. Since the type is known to the serializer and the deserializer, it does not need to be encoded in the bitstream.
Let’s start with solving this problem by including a “continue” bit before every element. If it is true an element follows, and if it is false, the end of the list has been reached. This appears as follows:
Let’s see what this looks like in code.
Fixpoint list_serialize_inter (l : list A) : list bool :=
match l with
| [] => [false]
| h :: t => [true] ++ serialize h ++ list_serialize_inter t
end.
With this scheme, deserialization again proves to be difficult. In the definition below, because bools_after_elem is not a syntactic subterm of bools, the termination checker refuses to accept this definition. The fact that bools_after_elem is returned from a function hides the subterm property from the typechecker. When executed, the definition does terminate, since bools_after_elem is a strict suffix of bools, but the type system does not see this. An attempted definition is given below:
Fail Fixpoint list_deserialize_inter
(bools: list bool) : option (list A * list bool) :=
match bools with
| [] => None
| false :: bools => Some ([], bools)
| true :: bools =>
match deserialize bools with
| None => None
| Some (a, bools_after_elem) =>
match list_deserialize_inter bools_after_elem with
| None => None
| Some (tail, bools_after_list) =>
Some (a :: tail, bools_after_list)
end
end
end.
It is intuitively impossible to define this deserialization function without using general recursion. To solve this recursion problem, the same information encoded in the continuation bits can be moved to the front of the list’s encoding in the form of a size. Then the rest of the deserializer can recurse on the number of elements remaining.
Programmatically,
Fixpoint list_serialize_elts (l : list A) : list bool :=
match l with
| [] => []
| h :: t => serialize h ++ list_serialize_elts t
end.
Definition list_serialize (l : list A) : list bool :=
nat_serialize (length l) ++ list_serialize_elts l.
Fixpoint list_deserialize_elts (size : nat) (bools : list bool)
: option (list A * list bool) :=
match size with
| O => Some ([], bools)
| S size =>
match deserialize bools with
| None => None
| Some (n, bools) =>
match list_deserialize_elts size bools with
| None => None
| Some (tail, bools) => Some (n :: tail, bools)
end
end
end.
Definition list_deserialize bools :=
match deserialize bools with
| None => None
| Some (size, bools) => list_deserialize_elts size bools
end.
This gives a definition that uses only structural recursion, just by moving the information around. Because the size information is grouped together instead of spread apart, it would also be much easier to make the encoding format more efficient by swapping in a more efficient nat serializer. The only property lost with this encoding is that it is no longer possible to reason about a tail of the list in isolation; the size must also be considered.
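To make the size-first encoding concrete, here is a minimal self-contained sketch. It is not the definition used above: the elements are fixed to bool (one bit each) and nat is encoded in unary, both of which are assumptions made only so the example can be checked by computation.

```coq
Require Import List.
Import ListNotations.

(* Assumed unary nat encoding: n copies of true, then false. *)
Fixpoint nat_serialize (n : nat) : list bool :=
  match n with
  | O => [false]
  | S n' => true :: nat_serialize n'
  end.

Fixpoint nat_deserialize (bools : list bool) : option (nat * list bool) :=
  match bools with
  | [] => None
  | false :: rest => Some (O, rest)
  | true :: rest =>
    match nat_deserialize rest with
    | None => None
    | Some (n, rest') => Some (S n, rest')
    end
  end.

(* Assumed element type: a bool is serialized as a single bit. *)
Definition bool_deserialize (bools : list bool) : option (bool * list bool) :=
  match bools with
  | [] => None
  | b :: rest => Some (b, rest)
  end.

(* The size comes first, followed by the elements (one bit each). *)
Definition list_serialize (l : list bool) : list bool :=
  nat_serialize (length l) ++ l.

(* Structural recursion on the number of elements remaining. *)
Fixpoint elts_deserialize (size : nat) (bools : list bool)
    : option (list bool * list bool) :=
  match size with
  | O => Some ([], bools)
  | S size' =>
    match bool_deserialize bools with
    | None => None
    | Some (b, rest) =>
      match elts_deserialize size' rest with
      | None => None
      | Some (tail, rest') => Some (b :: tail, rest')
      end
    end
  end.

Definition list_deserialize (bools : list bool)
    : option (list bool * list bool) :=
  match nat_deserialize bools with
  | None => None
  | Some (size, rest) => elts_deserialize size rest
  end.

(* Round trip: data following the list is left untouched. *)
Example list_round_trip :
  list_deserialize (list_serialize [true; false; true] ++ [false; false])
  = Some ([true; false; true], [false; false]).
Proof. reflexivity. Qed.
```

The trailing [false; false] in the example stands in for whatever follows the list in the stream. The deserializer stops after exactly three elements because the size told it to.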
To continue exploring this idea of serializing shape, we need to look at a more complicated data structure such as a binary tree. Our definition of a binary tree is straightforward:
Inductive tree: Type :=
  | leaf : tree
  | node : A -> tree -> tree -> tree.
Just as with lists, there are two general approaches to serializing trees: interleaved and up-front.
For the interleaved shape tree serializer, the concept of a “path” is needed. A path is simply the list of directions taken from the root to reach some node. We’ll use true to represent left and false to represent right. The head of the list is the direction taken at the root of the tree.
Below is the path [true; false].
Using the concept of a path, the position and data of any node can be serialized. When this is done for all nodes in the tree, all information captured by the original data structure has been encoded. [2]
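As a small illustration of how a path addresses a node, the sketch below follows a path from the root and returns the element stored there. The tree type is restated with an explicit parameter so the block stands alone; tree_lookup and example_tree are hypothetical names introduced only for this example.

```coq
Require Import List.
Import ListNotations.

Inductive tree (A : Type) : Type :=
| leaf : tree A
| node : A -> tree A -> tree A -> tree A.
Arguments leaf {A}.
Arguments node {A} _ _ _.

(* Follow a path from the root: true goes left, false goes right. *)
Fixpoint tree_lookup {A : Type} (t : tree A) (path : list bool) : option A :=
  match t with
  | leaf => None
  | node a l r =>
    match path with
    | [] => Some a
    | true :: path' => tree_lookup l path'
    | false :: path' => tree_lookup r path'
    end
  end.

Definition example_tree : tree nat :=
  node 1 (node 2 leaf (node 3 leaf leaf)) leaf.

(* Left from the root, then right: the node holding 3. *)
Example lookup_left_right :
  tree_lookup example_tree [true; false] = Some 3.
Proof. reflexivity. Qed.
```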
Even though an interleaved structure is impossible to deserialize without general recursion, using an interleaved structure is still possible if there is just enough information up front to recurse on. The number of nodes in the tree provides a nice metric. Our serializer will not be truly interleaved since we require this header, but information about the shape will still be interleaved in the encoding.
The encoding using an interleaved structure looks like this:
Serialization is performed as follows:
Fixpoint tree_size (t : tree A) : nat :=
  match t with
  | leaf => 0
  | node _ l r => 1 + tree_size l + tree_size r
  end.

Fixpoint tree_serialize_subtree_inter
    (t: tree A) (path: list bool) :=
  match t with
  | leaf => []
  | node a l r => serialize path ++ serialize a
      ++ tree_serialize_subtree_inter l (path ++ [true])
      ++ tree_serialize_subtree_inter r (path ++ [false])
  end.

Definition tree_serialize_inter (t: tree A) : list bool :=
  nat_serialize (tree_size t) ++
  tree_serialize_subtree_inter t [].
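For a concrete feel of the resulting bit layout, here is a self-contained sketch specialized to bool elements. The unary nat encoding and the path encoding (length followed by the directions) are assumptions chosen to keep the example small; they are not necessarily the encodings used by the serializers above.

```coq
Require Import List.
Import ListNotations.

Inductive tree (A : Type) : Type :=
| leaf : tree A
| node : A -> tree A -> tree A -> tree A.
Arguments leaf {A}.
Arguments node {A} _ _ _.

(* Assumed unary nat encoding: n copies of true, then false. *)
Fixpoint nat_serialize (n : nat) : list bool :=
  match n with
  | O => [false]
  | S n' => true :: nat_serialize n'
  end.

(* Assumed path encoding: the length, then the directions themselves. *)
Definition path_serialize (path : list bool) : list bool :=
  nat_serialize (length path) ++ path.

Fixpoint tree_size {A : Type} (t : tree A) : nat :=
  match t with
  | leaf => 0
  | node _ l r => 1 + tree_size l + tree_size r
  end.

(* Each node is written as its path followed by its element (one bit). *)
Fixpoint subtree_serialize (t : tree bool) (path : list bool) : list bool :=
  match t with
  | leaf => []
  | node a l r =>
    path_serialize path ++ [a]
      ++ subtree_serialize l (path ++ [true])
      ++ subtree_serialize r (path ++ [false])
  end.

Definition tree_serialize (t : tree bool) : list bool :=
  nat_serialize (tree_size t) ++ subtree_serialize t [].

Definition example_tree : tree bool :=
  node true (node false leaf leaf) leaf.

(* Header (size 2), then root (path [], element true),
   then left child (path [true], element false). *)
Example example_encoding :
  tree_serialize example_tree
  = [true; true; false]
      ++ [false] ++ [true]
      ++ [true; false; true] ++ [false].
Proof. reflexivity. Qed.
```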
Deserialization is more complicated. As elements are parsed, they are inserted into the tree parsed so far. The insertion function is not particularly robust, but during deserialization no issues arise as long as every node is preceded by all of its ancestors. This holds for a preorder traversal (and for other traversals such as BFS), so it meets our needs.
Fixpoint tree_insert (into t: tree A)(path: list bool): tree A :=
  match into with
  | leaf => t
  | node a l r =>
    match path with
    | [] => t (* not supported *)
    | true :: path => node a (tree_insert l t path) r
    | false :: path => node a l (tree_insert r t path)
    end
  end.
Fixpoint tree_deserialize_inter_impl
    (remaining : nat) (root : tree A) (bools : list bool)
    : option (tree A * list bool) :=
  match remaining with
  | S n =>
    match deserialize bools with
    | None => None
    | Some (path, bools) =>
      match deserialize bools with
      | None => None
      | Some (a, bools) =>
        tree_deserialize_inter_impl
          n
          (tree_insert root (node a leaf leaf) path)
          bools
      end
    end
  | O => Some (root, bools)
  end.

Definition tree_deserialize_inter bools :=
  match nat_deserialize bools with
  | Some (size, bools) =>
    tree_deserialize_inter_impl size leaf bools
  | None => None
  end.
Because of this concept of a path, which is a global address of any particular node, reasoning about a tree becomes much more difficult. In particular, we must now prove that every insertion is made on a leaf of the tree so it does not overwrite data or fall off the end.
Fixpoint leaf_insertable (into: tree A)(path: list bool): Prop :=
  match into with
  | leaf =>
    (* Only if the path and tree run out at the same time
       should we be able to insert *)
    match path with
    | [] => True
    | _ => False
    end
  | node a l r =>
    match path with
    | [] => False
    | true :: path => (leaf_insertable l path)
    | false :: path => (leaf_insertable r path)
    end
  end.
The proof for this serializer is quite large (about 150 lines) and uninteresting, so it has been omitted. It can be found here.
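To see what the insertion property means on a small case, the following self-contained sketch rebuilds a three-node tree by inserting its nodes in preorder and checks leaf_insertable on two paths. The definitions mirror the ones above, restated with an explicit type parameter; single and rebuilt are hypothetical helper names.

```coq
Require Import List.
Import ListNotations.

Inductive tree (A : Type) : Type :=
| leaf : tree A
| node : A -> tree A -> tree A -> tree A.
Arguments leaf {A}.
Arguments node {A} _ _ _.

Fixpoint tree_insert {A : Type} (into t : tree A) (path : list bool)
    : tree A :=
  match into with
  | leaf => t
  | node a l r =>
    match path with
    | [] => t (* not supported: would overwrite an existing node *)
    | true :: path' => node a (tree_insert l t path') r
    | false :: path' => node a l (tree_insert r t path')
    end
  end.

Fixpoint leaf_insertable {A : Type} (into : tree A) (path : list bool)
    : Prop :=
  match into with
  | leaf =>
    match path with
    | [] => True
    | _ => False
    end
  | node a l r =>
    match path with
    | [] => False
    | true :: path' => leaf_insertable l path'
    | false :: path' => leaf_insertable r path'
    end
  end.

Definition single (n : nat) : tree nat := node n leaf leaf.

(* Insert the root, then its left child, then its right child. *)
Definition rebuilt : tree nat :=
  tree_insert
    (tree_insert (tree_insert leaf (single 1) []) (single 2) [true])
    (single 3) [false].

Example rebuilt_ok : rebuilt = node 1 (single 2) (single 3).
Proof. reflexivity. Qed.

(* The left child of a single node is a leaf, so inserting there is safe. *)
Example insertable_left : leaf_insertable (single 1) [true].
Proof. exact I. Qed.

(* The empty path points at the existing root, so inserting would overwrite. *)
Example not_insertable_root : ~ leaf_insertable (single 1) [].
Proof. intros H. exact H. Qed.
```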
Alternatively, the structure may be recorded at the beginning and then filled in as the tree is parsed. To do this, a tree’s shape can be represented by the type tree unit, and its elements by the type list A.
This technique requires serialization and deserialization to be a two-step process, which has the advantage of better mapping to the information stored in the tree (shape and element data), but the disadvantage of being more complicated.
The shape is encoded similarly to HTML with three symbols:
[true; true]: The beginning of a node
[true; false]: The end of a node
[false]: A leaf node
Each node requires exactly two subtrees between its start and end marker. Storing the shape as tree unit works because unit contains no information, so tree unit only contains the information that the tree portion of tree A describes, which is the shape. Since the shape is recorded in a preorder traversal, the elements are also encoded in the same order, which makes it easy to marry the two together.
A visual representation of this encoding:
And in code:
Fixpoint tree_serialize_shape (t : tree A) : list bool :=
  match t with
  | leaf => [false]
  | node _ l r => [true; true] ++ tree_serialize_shape l ++
                  tree_serialize_shape r ++ [true; false]
  end.

Fixpoint tree_serialize_data_preorder (t : tree A) : list bool :=
  match t with
  | leaf => [] (* No data contained within leaf nodes *)
  | node a l r => serialize a ++
                  tree_serialize_data_preorder l ++
                  tree_serialize_data_preorder r
  end.

Definition tree_serialize_front (t: tree A) : list bool :=
  tree_serialize_shape t ++ tree_serialize_data_preorder t.
Fixpoint tree_deserialize_shape
    (bools: list bool) (progress: list (list (tree unit)))
    : option (tree unit * list bool) :=
  match bools with
  | false :: bools =>
    match progress with
    | [] => Some (leaf, bools)
    | level :: progress =>
      tree_deserialize_shape
        bools
        ((leaf :: level) :: progress)
    end
  | true :: true :: bools =>
    tree_deserialize_shape bools ([] :: progress)
  | true :: false :: bools =>
    match progress with
    | [] => None (* end without a beginning *)
    | level :: [] =>
      match level with
      | [r; l] => Some (node tt l r, bools)
      | _ => None
      end
    | level :: parent :: progress =>
      match level with
      | [r; l] =>
        tree_deserialize_shape
          bools
          ((node tt l r :: parent) :: progress)
      | _ => None
      end
    end
  | _ => None
  end.
Fixpoint tree_deserialize_front_elts
    (shape : tree unit) (bools : list bool)
    : option (tree A * list bool) :=
  match shape with
  | leaf => Some (leaf, bools)
  | node _ l r =>
    match deserialize bools with
    | None => None
    | Some (a, bools) =>
      match tree_deserialize_front_elts l bools with
      | None => None
      | Some (l, bools) =>
        match tree_deserialize_front_elts r bools with
        | None => None
        | Some (r, bools) => Some (node a l r, bools)
        end
      end
    end
  end.

Definition tree_deserialize_front (bools : list bool)
    : option (tree A * list bool) :=
  match tree_deserialize_shape bools [] with
  | None => None
  | Some (shape, bools) =>
    tree_deserialize_front_elts shape bools
  end.
Because of the more recursive nature of the encoding, reasoning is significantly easier. We can consider any portion of the shape in isolation from all others because there are no ties to any global state.
Again, the proof for this serializer is large (about 70 lines) and uninteresting, so it has been omitted. It can be found here.
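As a quick sanity check of the shape encoding, the sketch below restates the shape serializer and deserializer in a self-contained form (with an explicit type parameter on tree) and runs them on a two-node tree. The example tree and the trailing bit are hypothetical values chosen for illustration.

```coq
Require Import List.
Import ListNotations.

Inductive tree (A : Type) : Type :=
| leaf : tree A
| node : A -> tree A -> tree A -> tree A.
Arguments leaf {A}.
Arguments node {A} _ _ _.

Fixpoint tree_serialize_shape {A : Type} (t : tree A) : list bool :=
  match t with
  | leaf => [false]
  | node _ l r => [true; true] ++ tree_serialize_shape l ++
                  tree_serialize_shape r ++ [true; false]
  end.

(* progress is a stack of partially built levels, innermost first. *)
Fixpoint tree_deserialize_shape
    (bools : list bool) (progress : list (list (tree unit)))
    : option (tree unit * list bool) :=
  match bools with
  | false :: bools =>
    match progress with
    | [] => Some (leaf, bools)
    | level :: progress =>
      tree_deserialize_shape bools ((leaf :: level) :: progress)
    end
  | true :: true :: bools =>
    tree_deserialize_shape bools ([] :: progress)
  | true :: false :: bools =>
    match progress with
    | [] => None (* end without a beginning *)
    | level :: [] =>
      match level with
      | [r; l] => Some (node tt l r, bools)
      | _ => None
      end
    | level :: parent :: progress =>
      match level with
      | [r; l] =>
        tree_deserialize_shape bools ((node tt l r :: parent) :: progress)
      | _ => None
      end
    end
  | _ => None
  end.

Definition example_tree : tree nat := node 1 (node 2 leaf leaf) leaf.

(* Start marker, left subtree, right leaf, end marker. *)
Example shape_bits :
  tree_serialize_shape example_tree
  = [true; true]
      ++ [true; true; false; false; true; false]
      ++ [false]
      ++ [true; false].
Proof. reflexivity. Qed.

(* The shape is recovered as a tree unit; the trailing bit is left over. *)
Example shape_round_trip :
  tree_deserialize_shape (tree_serialize_shape example_tree ++ [true]) []
  = Some (node tt (node tt leaf leaf) leaf, [true]).
Proof. reflexivity. Qed.
```

Note that the element values 1 and 2 play no role here. Only the nesting of start and end markers is recorded, which is exactly the information carried by tree unit.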
It’s worth noting that possible encodings for a given type are restricted by information dependencies within that type. Imagine a list is encoded as follows:
Since the size of the list is at the end, rather than at the beginning, information about how to deserialize the structure isn’t known until it’s too late. Similarly, the size can’t be put anywhere in the middle (say after the first element), because of the possibility of an empty list. Before deserializing each element, it must be known that it actually is an element of the list, and not some other data coming after the list.
This is why the interleaved list serializer is able to work. Right before each element is deserialized, we mark that the list continues with the continue bit.
This is also why the tree serializers are able to encode the shape at the front or the end. In both cases, the size is known so deserializing additional elements is justified. The question of how to arrange these elements can be reasoned about independently of the elements themselves, therefore the shape of the tree can be encoded without regard to where the element data is located.
One might expect to be able to speculatively parse elements of the bitstream and stop when an invalid element is reached. But this requires that we don’t accidentally interpret whatever came after in the bitstream as an element. If the encodings of different types were guaranteed not to overlap, this would be possible. But in our model, serializers can choose arbitrary encodings, so it is not.
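The failure mode is easy to reproduce. In the sketch below, a hypothetical speculative deserializer reads bool elements (one bit each, an assumed encoding) with no size and no continuation bits, stopping only when an element fails to parse. Since every bit is a valid bool, it never stops early.

```coq
Require Import List.
Import ListNotations.

(* Hypothetical speculative deserializer: keep parsing bool elements
   until parsing fails, which only happens at the end of the stream. *)
Fixpoint speculative_deserialize (bools : list bool)
    : list bool * list bool :=
  match bools with
  | [] => ([], [])
  | b :: rest =>
    let (elts, leftover) := speculative_deserialize rest in
    (b :: elts, leftover)
  end.

Definition encoded_list : list bool := [true; false].
Definition trailing_data : list bool := [true; true].

(* The data that followed the list is swallowed as extra elements. *)
Example swallows_trailing_data :
  speculative_deserialize (encoded_list ++ trailing_data)
  = ([true; false; true; true], []).
Proof. reflexivity. Qed.
```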
Beyond practical necessity, serialization can be used as a forcing function to understand the information contained within data structures. By requiring a well defined format, the information contained in that structure may be deduced and formalized. For example, a list needs to have a length, and a tree needs to have a shape. From there, the encoding of this information is flexible, although some encodings are easier to work with than others.
1. A linked list of booleans is not computationally efficient, and could be replaced with another more sensible structure such as a stream of bytes.
2. It’s worth noting that this representation could be made more efficient by recording locations relative to the previous node instead of absolute ones. However, this fact does not significantly change how hard it is to reason about the tree. Recording relative locations would allow us to reason about subtrees instead of parts of some tree, but we still must reason about insertions.