Insane in the Membrain

Date: July 23, 2019
Author: Veronika Romashkina

Intro🔗

This blog post is going to walk you through the membrain Haskell library, its purpose and implementation details. Here is the link to the source code:

Since the package has such an extraordinary name I feel like we owe you an explanation on what the package is doing. Membrain is not about creating memes (though we have good ones 😏), it’s about a type-safe memory data type with units being on the type level, so all the computations are guaranteed to be trustworthy and some of them even performed during compile time. The Haskell type system has a lot of advanced features. However, they are not used often, and common patterns around them are not yet established. Membrain combines clever type-level tricks and the best-practices of package development and maintenance to provide smooth integration with the rest of the ecosystem.

In order to be more useful for everybody, this blog post also offers a small tutorial on type-level operations along with the library description. If you are interested in any of this, we are going to start with a quick excursion to memory units in general so we are all on the same page and understand the problem we are trying to solve here.

A ‘bit’ of history🔗

We use memory units in our day-to-day lives to measure information/data and needless to say, there are officially standardised units of measurements which are used ubiquitously. At the base of the memory measurement system that we are covering in the library, we have bit — a binary digit. We are using the binary model of memory representation accepted by most computers in the world.

Bit is a binary base for a logarithmic tower of memory units. Note, that we are not going to take into consideration systems with other bases as they are rarely used nowadays for data measurement.

Usually we work with more than several bits. In order to ease this interaction, we can use measure prefixes to define huge amounts of bits concisely. But the story behind unit names is not that simple. There are two main naming conventions for prefixes: decimal and binary. The difference between them lies in the base of the multiplier. The standard range of prefixes are the multipliers of the powers of 10: kilo (103), mega (106), giga (109), etc. However, as we are working with memory units, it is more conventional to work with the powers of 2 in the computer science world. For this exact reason, there exist the binary prefixes: kibi (210), mebi (220), gibi (230), etc. Here is the table of supported units in the membrain library:

Table of units

Unit Symbol Multiplier Degree Number of Bits
Bit b 1 1
Nibble n 4 4
Byte B 8 8
Kilobyte kB 1000 * B 103 8000
Megabyte MB 1000 * kB 106 8000000
Gigabyte GB 1000 * MB 109 8000000000
Terabyte TB 1000 * GB 1012 8000000000000
Petabyte PB 1000 * TB 1015 8000000000000000
Exabyte EB 1000 * PB 1018 8000000000000000000
Zettabyte ZB 1000 * EB 1021 8000000000000000000000
Yottabyte YB 1000 * ZB 1024 8000000000000000000000000
Kibibyte KiB 1024 * B 210 8192
Mebibyte MiB 1024 * KiB 220 8388608
Gibibyte GiB 1024 * MiB 230 8589934592
Tebibyte TiB 1024 * GiB 240 8796093022208
Pebibyte PiB 1024 * TiB 250 9007199254740992
Exbibyte EiB 1024 * PiB 260 9223372036854775808
Zebibyte ZiB 1024 * EiB 270 9444732965739290427392
Yobibyte YiB 1024 * ZiB 280 9671406556917033397649408

As you can see, memory units represent a complicated system, and this is why it is better to be more confident that you are doing safe computations with them.

Memory Units confusion

Implementation🔗

Let’s move to the more technically exciting part — the implementation itself. The goal of the library is to ease the lives of people working with memory units. The intentions are:

  • to make it easy to define different units without the fear of forgetting to multiply/divide them during the process, combine different units and safely convert between them.
  • if some requirements for the code should be changed in the future users need to have guarantees that nothing is going to break with the code they wrote, or at least they want to be notified via compile time errors in such cases.

In order to convince you and not to be unsubstantiated that using type-safe libraries would pay off for you, here is a motivating example of how easily you can mess up with different units.

Example

Let’s say we have the following data

file1 :: Int
file1 = 2  -- in Gigabytes

file2 :: Int
file2 = 80  -- in Bytes

memoryToAlloc :: Int
memoryToAlloc = file1 + file2 -- FATALITY!

And it’s over. You allocated 82 gigabytes and probably ran out of memory.

But how can you avoid such situations? Obviously, using just type Memory = Int for that is not an option as it fails in the described situation as well. And generally using type aliases is not the right thing to do in most of these cases. Usually wrapping the value in a newtype is a common pattern in Haskell. It is a safer approach but as there are a lot of memory units, introducing an individual newtype for every one and a function to compose them with each other is a lot of boilerplate. So, it also doesn’t satisfy the goals we are aiming for. That’s why in membrain we want to push this idea even further and represent all memory units with a single newtype in a safe and convenient way.

Now, I hope you see the motivation behind the membrain library too. In order to bypass many crucial mistakes that could cost lots of money or even lives in especially important cases, we could use the Haskell type system to avoid this whole domain of errors.

The definition🔗

Memory definition galaxy brain meme

If we want to avoid runtime computations and manual supervision of units, we can delegate this job to the compiler. To do so we are using the following data type:

newtype Memory (mem :: Nat) = Memory
    { unMemory :: Natural
    } deriving stock   (Show, Read, Generic)
      deriving newtype (Eq, Ord)

Where mem is the phantom parameter that is going to be used to control the unit multiplier in the expression. As you can see mem has a kind Nat about which I’m going to talk more in the following section. We are storing Natural here for a reason. Internally, memory of any unit is represented as a number of bits. This not only helps to avoid dealing with division during compile time, but also gives us free-of-charge runtime computations! You can think of this approach in the following manner: as a user, you can explicitly specify the memory unit multiplier (Kilobyte, Gigabyte, etc.) and the amount you want, but we are actually storing everything in bits under the hood. The Memory data type constructor is not supposed to be used explicitly as it could cause the wrong situations because of this internal representation. Instead, we defined useful functions for you to construct memory values which are doing the conversion for you:

λ> bit 1
Memory {unMemory = 1}

λ> bit 32
Memory {unMemory = 32}

λ> byte 1
Memory {unMemory = 8}

We denote the type of the unit we are using on the type level by specifying the multiplier (Bit, Byte, etc.), which is just a type-level natural number, and then simply specify the amount of needed memory.

Type-level Nats🔗

For a better understanding of how we are storing different units we need to talk about Nats a bit. GHC has built-in support for some type-level literals: strings and natural numbers. To access natural numbers predefined in the compiler one can utilise the Nat data type. Nat is used as a kind of the type-level natural numbers (kind is the type of a type). So, in plain Haskell you can have values 0, 1, 2, 3 of the Int type. But now you can have types 0, 1, 2, 3 of the Nat kind. However, you can’t have values of the type 3 :: Nat. Type-level natural numbers exist only for compile time computations.

To play with Nat, the corresponding data type can be found in the GHC.TypeNats module along with other useful natural number related stuff. The data type itself is declared simply like this:

data Nat

As you can see, it doesn’t have any constructors or anything and looks not very useful at first glance, but the ability to be used as a parameter at the type-level gives us a lot of opportunities as this natural number can be retrieved at runtime.

Let’s explore the Nat kind in ghci for a better understanding:

λ> :set -XDataKinds
λ> import GHC.TypeNats
λ> :kind 42
42 :: Nat

The module even exports some type-level operators (type families), so you can write code like this:

λ> :set -XTypeOperators
λ> :kind! 2 + 2
2 + 2 :: Nat
= 4

In addition to performing numeric operations with type-level natural numbers, GHC gives the ability to extract a runtime value from the compile time type-level natural. For this, you can use the natVal function which returns the Natural:

-- to show the type after each command
λ> :set +t

λ> :set -XTypeApplications -XKindSignatures
λ> import Data.Proxy
λ> natVal $ Proxy @5
5
it :: GHC.Natural.Natural

λ>  natVal $ Proxy @(1 + 5)
6
it :: GHC.Natural.Natural

Note: GHC uses Proxy to pass type-level witnesses in runtime. However, in membrain we can leverage the AllowAmbiguousTypes extension to pass types explicitly as arguments. We even have a helper function for that:

nat :: forall (mem :: Nat) . KnownNat mem => Natural
nat = natVal (Proxy @mem)

λ>  nat @(1 + 5)
6
it :: GHC.Natural.Natural

If you would like to read more about type-level literals, you can read this section of the GHC user guide.

We are heavily using Nat as the core building block of membrain. Check out the following section to learn how and why.

Units and Type-level conversion🔗

Finally, now that we learned what a Nat is, I can say that memory units in our architecture are just type-level natural numbers. Here is the exact code we have:

type Bit       = 1
type Kilobyte  = 1000 * Byte
type Megabyte  = 1000 * Kilobyte
...
type Kibibyte  = 1024 * Byte
type Mebibyte  = 1024 * Kibibyte
...

Instead of writing computed Nats we are using * type family for type-level multiplication.

These units are type-level multipliers. To construct term-level values of the Memory type, you can use the memory smart constructor. As I mentioned earlier, we store all Memory values as bits, so this function performs some additional computation to translate the number of units given by the user to bits. It is an easy operation due to the fact that we can get the multiplier from the type, as it has this information in type-level natural numbers (which is exactly the required unit multiplier). Here the natVal function comes to our assistance. It returns Natural by the given Proxy of KnownNat. As all of our units are KnownNats, we come up with the following function to transform any unit to the number of bits:

memory :: forall (mem :: Nat) . KnownNat mem => Natural -> Memory mem
memory = Memory . (* nat @mem)

Using this general-purpose memory function we can implement smart constructors:

byte :: Natural -> Memory Byte
byte = memory

kilobyte :: Natural -> Memory Kilobyte
kilobyte = memory
...

With these smart constructors in place the conversion function becomes as simple as possible. The toMemory function is for view-changing only, so it has zero cost at runtime. The implementation is:

toMemory :: forall (to :: Nat) (from :: Nat) . Memory from -> Memory to
toMemory = coerce

Note: type variables are written explicitly under forall and in this particular order because it’s crucial for library usability. It makes usages of the toMemory function more convenient. This is an interesting insight into how the order of type variables can improve the UX of modern Haskell libraries.

Because the internal representation is not changed during this transformation.

And it works:

λ> byte 42
Memory {unMemory = 336}

λ> toMemory @Kilobyte $ byte 42
Memory {unMemory = 336}

λ> showMemory $ toMemory @Kilobyte $ byte 42
"0.042kB"

λ> showMemory $ byte 42
"42B"

Memory units exhi’bit’ion🔗

In this section, I am going to explain how the pretty output is implemented in the library. The Memory data type has derived instances of the Show and Read typeclasses. Besides that, we also implemented the special functions showMemory and readMemory which have the same types but different implementations and goals. We don’t put the implementation of the custom printing/parsing functions into the corresponding standard instances themselves instead of using the deriving mechanism on purpose. There is an (mostly) unspoken rule around these typeclasses: the output of the show function should be valid Haskell code that can replace the shown value. We decided not to break this principle and have our pretty-printing functions with the human-readable output. Though we strictly preserve the roundtrip law of the Show/Read typeclasses — read . show ≡ id — with the custom implementation.

Let’s look closer at the showMemory and readMemory functions.

The showMemory function shows a Memory value as a decimal fraction, along with the unit suffix. It shows Memory losslessly when used with standardized units of measurement. The following mathematical law is used to display Memory:

A decimal representation written with a repeating final 0 is supposed to terminate before these zeros. Instead of 1.585000… one simply writes 1.585. The decimal is also called a terminating decimal. Terminating decimals represent rational numbers of the form (k/2n 5m). If you use different forms of units then the show function for Memory hangs.

λ> showMemory (Memory 22 :: Memory Byte)
"2.75B"

readMemory in turn is the inverse of showMemory. For safety, it returns a Maybe memory value instead of throwing runtime exceptions. The readMemory function uses the following information from type for parsing:

  • unit text representation (UnitSymbol typeclass function) — to compare with the parsed unit name string,
  • the multiplier — because we are storing bits internally so we would need to convert it before returning the resulting Memory.
λ> readMemory @Byte "2.75B"
Just (Memory {unMemory = 22})

λ> readMemory @Bit "2.75B"
Nothing

These two functions implement a reliable communication channel between compile time and runtime values.

’Bit’tersweet Testing🔗

Even if we are heavily using the high-level capabilities of GHC to perform type-level computations, in order to have a strong foundation in this library that prevents a lot of ambiguous situations, we still think that testing is a crucial part of library or application development and maintenance. Existing tests already showed up to be really helpful and helped us find some inaccuracy during the development of this library.

We have different types of testing:

  1. Type-level testing (yes! They exist!) — to control some of the type-level magic.
  2. Doctests — to have documentation always up-to-date.
  3. Property testing — no need to say how important this one is.
  4. Unit testing — a sanity check.

In the following subsections, let’s talk about the most interesting ones.

Type-level testing🔗

Type-level testing power

As most of the library logic is taken upon types, it would be extremely useful to have tests on this level of computation. But how? The answer is the type-spec library. This package is created to write type-level unit tests. It provides a convenient EDSL for this purpose. In our case we are using it to add tests on our UnitSymbol instances I mention above. The cool thing is that you write your test cases at the type level as well.

Just look at our use case. To be able to show the Memory value in the standardised format like 5kB, we are using the UnitSymbol type family and a bunch of type instances for each unit. By design, every unit is the type alias of a corresponding type-level natural number. For instance, a kilobyte is defined as follows

type Kilobyte = 1000 * Byte

But you can see that in the instances of UnitSymbol we are using Nats instead.

-- 8000 = 8 * 1000 Bit = Kilobyte
type instance UnitSymbol 8000 = "kB"

This is because of the fact that GHC doesn’t allow having type family application as an argument of the instance declaration. If we try to define an instance for Kilobyte written above in the following form:

type instance UnitSymbol Kilobyte = "kB"

it would fail at compile time with the following error:

src/Membrain/Units.hs:110:15: error:
Illegal type synonym family application in instance: Kilobyte
In the type instance declaration for ‘UnitSymbol
    |
110 | type instance UnitSymbol Kilobyte = "kB"

In order to keep an eye on this inconvenience and to have these great natural numbers under control, we can get the help of type-level tests. So our tests should check that the unit specification symbols correspond to the correct units.

The test is written in the following form:

unitSymbolTests ::
    "UnitSymbol"
    ###

        "Type-level UnitSymbol Tests"
        ~~~
            It "Name Bit    = b" (UnitSymbol Bit    `Is` "b")
        -*- It "Name Nibble = n" (UnitSymbol Nibble `Is` "n")
unitSymbolTests = Valid

It uses some fancy type-level operators from the type-spec library to make it work. The most interesting part for us, the place where the type-level unit test is specified is here: UnitSymbol Bit 'Is' "b". It’s easy to understand that we are trying to check exactly what I was describing.

Property testing🔗

We’ve implemented a number of helpful instances and functions for the Memory data type that should preserve some properties associated with them. For example, Memory has manual implementations of the Semigroup and Monoid classes. These typeclasses have associativity and neutrality laws associated with them. Also, we provide custom show and read functions for pretty output which also should satisfy the roundtrip property. We define property-based tests using the hedgehog library.

The idea of property-based testing is that you specify the property that should be preserved and provide the generators that would randomly create data to test against. One of the distinctive characteristics of property-based testing is that it covers the scope of all possible inputs. In our case, as we are testing the Memory data type, that would mean that the input data would have different types due to the phantom parameter. That is puzzling. We need a generator that can create different Memory values that can also be of a different type in the context of any function. But the type of the randomly generated Memory is defined at runtime by the generator. Currently, it is not possible to make the type depend on the runtime value. To achieve this we would need dependent types, however, this feature is not yet implemented in Haskell at the moment of writing this post. What we can do as a workaround is to create the following existential data type:

data AnyMemory
    = forall (mem :: Nat) . (KnownNat mem, KnownUnitSymbol mem)
    => MkAnyMemory (Memory mem)

It is called existential because as you can see we have this forall in the right-most part of the function definition. This allows us to store a type-level natural number at runtime. You can think of the existential types as a trick to move type parameters from the definition site to the call site. Or, alternatively, you can think of it as moving type information from compile time to runtime. Such existential types are useful if you want to have collections of values of different types, or if you want the type to depend on some runtime value. However, this feature has its cost: you no longer have access to the type-level information at compile time, meaning you can inspect types only at runtime. Fortunately, this is a drawback we can accept to solve our problem.

To summarize, we need to implement the following two things for generating AnyMemory:

  1. Generate random Natural number.
  2. Randomly choose the unit and wrap it into the MkAnyMemory constructor.

Here is how the generator looks like:

genAnyMemory :: MonadGen m => m AnyMemory
genAnyMemory = genNatural >>= unitChooser

-- | Returns random 'AnyMemory'.
unitChooser :: (MonadGen m) => Natural -> m AnyMemory
unitChooser n = Gen.element
    [ MkAnyMemory (Memory @Mem.Bit       n)
    , MkAnyMemory (Memory @Mem.Kilobyte  n)
    ...
    , MkAnyMemory (Memory @Mem.Kilobyte  n)
    ...
    ]

And when we have the generator of the AnyMemory data type our property tests can be written in a simple readable way:

showReadLaw :: Property
showReadLaw = property $ do
    MkAnyMemory mem <- forAll genAnyMemory
    readMemory (showMemory mem) === Just mem

By the way, the AnyMemory data type is exported in case any user would also need to solve such problems of having different memory units in their collections.

Inventing your own mem🔗

Despite the fact that the library provides quite a large collection of supported memory units, the situations when you need to have another one are still possible (for example, a custom filesystem page size or very-very-very large data pieces of a brontobyte or a gegobyte). Don’t panic! The way of adding your own unit is quite straightforward and I am going to walk you through it.

Note: adding incorrect multipliers that could spoil some staff is the responsibility of the users. See memory units exhibition section for clarifications.

Note: the following code assumes that you enabled DataKinds, TypeOperators and TypeFamilies to work with type families.

Let’s say, that we need to use a brontobyte a lot. So it makes sense to create a type. By the definition, a brontobyte contains 103 yottabytes. Let’s write it down

type Brontobyte = 8000000000000000000000000000

Alternatively, with the usage of the * type family from GHC.TypeNats you can write the following clean definition:

type Brontobyte = 1000 * Yottabyte

Don’t forget to turn on the NoStarIsType extension.

And the instance of UnitSymbol in order to have custom show and read functions. It is quite easy to write:

type instance UnitSymbol 8000000000000000000000000000 = "bB"

Also, the smart constructor is helpful indeed:

brontobyte :: Natural -> Memory Brontobybte
brontobyte = memory
{-# INLINE brontobyte #-}

Aaand that’s all that is needed! We can check it in ghci now:

λ> showMemory $ brontobyte 2
"2bB"

λ> showMemory $ toMemory @Bit $ brontobyte 1
"8000000000000000000000000000b"

λ> showMemory $ brontobyte 2 `memoryPlus` bit 5
"16000000000000000000000000005b"

As you can see, the newly created type is fully composable with the existing interface and types.

Multiple public libraries vs Orphan instances🔗

The membrain package depends only on the base library which makes it extremely lightweight. However, it would be more useful to users if the library could provide ways to work with JSON, TOML, CSV, SQL, binary serialisation and other formats. Unfortunately, implementation of the required functionality requires membrain to depend on many other packages. This is an awkward tradeoff between making a small package with the minimal interface or a heavy package with a lot of dependencies. The more dependencies you add, the longer the build times of the package and the higher the maintenance cost. And this is a common problem in Haskell when you want to implement a minimal data type or some basic typeclass.

Moreover, some people may only need JSON serialisation. Why should they build CSV and SQL libraries for that? And what is worse, such encoding/decoding libraries are usually based on typeclasses which means that you need to use orphan instances if you don’t put those instances in the package with a data type or a typeclass. So what we want is a modular and scalable way to extend the capabilities of the package. By now in Haskell we have the following options to choose from to resolve this dilemma:

  1. Don’t provide those instances. Let the users implement them if they need.
  2. Depend on all packages and provide all the instances.
  3. Use Cabal flags and CPP to disable/enable specific extension points. This solution is inconvenient to use because you can’t specify these flags in the build-depends section.
  4. Put integration with each library into a separate package. Maintaining multiple packages around a single data type or typeclass is an extra headache.

Fortunately, since Cabal-3.0 there is a better way to approach the problem — multiple public libraries. This feature allows creating and exposing multiple independent libraries from a single package. These libraries may later be used individually by any other projects.

The main library stanza in membrain looks like this:

library
  import:              common-options
  exposed-modules:     Membrain
                         Membrain.Base
                         Membrain.Constructors
                         Membrain.Memory
                         Membrain.Units

Now, if we want to implement JSON instances from the aeson library, we can create a Membrain.Aeson module and put it into a separate public library:

library json
  import:              common-options
  exposed-modules:     Membrain.Aeson
  build-depends:       aeson ^>= 1.4

Note: membrain also uses the common stanzas feature to remove metadata duplication between different stanzas in the .cabal file.

The implementation of FromJSON and ToJSON instances from the aeson library goes into the Membrain.Aeson module. And if you want to use these instances in your package in addition to the main Memory type, you just need to add the json public library to your build-depends using the following syntax:

  build-depends:       membrain ^>= 0.0.0.0
                     , membrain:json

and then import the Membrain.Aeson module when you need it (you can even put this import in your custom local Prelude to automatically make this instance visible project-wide).

The approach of using multiple public libraries has the following benefits in comparison with other existing solutions:

  1. No need to maintain extra packages to support additional features.
  2. No CPP and compile time boolean flags involved.
  3. Additional features of the library are specified explicitly in the .cabal file, not in some other configuration file.
  4. Composition and modularity: you can depend only on those libraries that you need without having extra dependencies.
  5. All integrations with other libraries are in a single place which means that you can easily see what needs to be updated when you introduce breaking changes to the main library.

This looks like a lovely solution to a long-pending problem.

Summary🔗

This is membrain in a nutshell. To check out the full code and more tricks used in the package you can look into the sources:

And here is a short recap of membrain’s features:

  • The library provides a single data type Memory parameterized by the type-level memory units.
  • Memory unit multipliers are stored as type-level natural numbers.
  • Internally everything is stored as bits.
  • All computations/conversions are done at compile time, so minimum boilerplate is required from users.
  • The library provides its own version of base functions that work with memory units.
  • The implemented interface is extensible externally.
  • A multiple public libraries solution provides useful instances that help avoid redundant dependencies and orphan instances.