This blog post is going to walk you through the
membrain Haskell library, its purpose and implementation details. Here is the link to the source code:
Since the package has such an extraordinary name I feel like we owe you an explanation on what the package is doing.
Membrain is not about creating memes (though we have good ones 😏), it’s about a type-safe memory data type with units being on the type level, so all the computations are guaranteed to be trustworthy and some of them even performed during compile time. The Haskell type system has a lot of advanced features. However, they are not used often, and common patterns around them are not yet established.
Membrain combines clever type-level tricks and the best-practices of package development and maintenance to provide smooth integration with the rest of the ecosystem.
In order to be more useful for everybody, this blog post also offers a small tutorial on type-level operations along with the library description. If you are interested in any of this, we are going to start with a quick excursion to memory units in general so we are all on the same page and understand the problem we are trying to solve here.
A ‘bit’ of history🔗
We use memory units in our day-to-day lives to measure information/data and needless to say, there are officially standardised units of measurements which are used ubiquitously. At the base of the memory measurement system that we are covering in the library, we have bit — a binary digit. We are using the binary model of memory representation accepted by most computers in the world.
Bit is a binary base for a logarithmic tower of memory units. Note, that we are not going to take into consideration systems with other bases as they are rarely used nowadays for data measurement.
Usually we work with more than several bits. In order to ease this interaction, we can use measure prefixes to define huge amounts of bits concisely. But the story behind unit names is not that simple. There are two main naming conventions for prefixes: decimal and binary. The difference between them lies in the base of the multiplier. The standard range of prefixes are the multipliers of the powers of 10: kilo (103), mega (106), giga (109), etc. However, as we are working with memory units, it is more conventional to work with the powers of 2 in the computer science world. For this exact reason, there exist the binary prefixes: kibi (210), mebi (220), gibi (230), etc. Here is the table of supported units in the
Table of units
|Unit||Symbol||Multiplier||Degree||Number of Bits|
|Kilobyte||kB||1000 * B||103||8000|
|Megabyte||MB||1000 * kB||106||8000000|
|Gigabyte||GB||1000 * MB||109||8000000000|
|Terabyte||TB||1000 * GB||1012||8000000000000|
|Petabyte||PB||1000 * TB||1015||8000000000000000|
|Exabyte||EB||1000 * PB||1018||8000000000000000000|
|Zettabyte||ZB||1000 * EB||1021||8000000000000000000000|
|Yottabyte||YB||1000 * ZB||1024||8000000000000000000000000|
|Kibibyte||KiB||1024 * B||210||8192|
|Mebibyte||MiB||1024 * KiB||220||8388608|
|Gibibyte||GiB||1024 * MiB||230||8589934592|
|Tebibyte||TiB||1024 * GiB||240||8796093022208|
|Pebibyte||PiB||1024 * TiB||250||9007199254740992|
|Exbibyte||EiB||1024 * PiB||260||9223372036854775808|
|Zebibyte||ZiB||1024 * EiB||270||9444732965739290427392|
|Yobibyte||YiB||1024 * ZiB||280||9671406556917033397649408|
As you can see, memory units represent a complicated system, and this is why it is better to be more confident that you are doing safe computations with them.
Let’s move to the more technically exciting part — the implementation itself. The goal of the library is to ease the lives of people working with memory units. The intentions are:
- to make it easy to define different units without the fear of forgetting to multiply/divide them during the process, combine different units and safely convert between them.
- if some requirements for the code should be changed in the future users need to have guarantees that nothing is going to break with the code they wrote, or at least they want to be notified via compile time errors in such cases.
In order to convince you and not to be unsubstantiated that using type-safe libraries would pay off for you, here is a motivating example of how easily you can mess up with different units.
Let’s say we have the following data
And it’s over. You allocated 82 gigabytes and probably ran out of memory.
But how can you avoid such situations? Obviously, using just
type Memory = Int for that is not an option as it fails in the described situation as well. And generally using type aliases is not the right thing to do in most of these cases. Usually wrapping the value in a
newtype is a common pattern in Haskell. It is a safer approach but as there are a lot of memory units, introducing an individual newtype for every one and a function to compose them with each other is a lot of boilerplate. So, it also doesn’t satisfy the goals we are aiming for. That’s why in
membrain we want to push this idea even further and represent all memory units with a single newtype in a safe and convenient way.
Now, I hope you see the motivation behind the
membrain library too. In order to bypass many crucial mistakes that could cost lots of money or even lives in especially important cases, we could use the Haskell type system to avoid this whole domain of errors.
If we want to avoid runtime computations and manual supervision of units, we can delegate this job to the compiler. To do so we are using the following data type:
mem is the phantom parameter that is going to be used to control the unit multiplier in the expression. As you can see
mem has a kind
Nat about which I’m going to talk more in the following section. We are storing
Natural here for a reason. Internally, memory of any unit is represented as a number of bits. This not only helps to avoid dealing with division during compile time, but also gives us free-of-charge runtime computations! You can think of this approach in the following manner: as a user, you can explicitly specify the memory unit multiplier (Kilobyte, Gigabyte, etc.) and the amount you want, but we are actually storing everything in bits under the hood. The
Memory data type constructor is not supposed to be used explicitly as it could cause the wrong situations because of this internal representation. Instead, we defined useful functions for you to construct memory values which are doing the conversion for you:
We denote the type of the unit we are using on the type level by specifying the multiplier (
Byte, etc.), which is just a type-level natural number, and then simply specify the amount of needed memory.
For a better understanding of how we are storing different units we need to talk about
Nats a bit. GHC has built-in support for some type-level literals: strings and natural numbers. To access natural numbers predefined in the compiler one can utilise the
Nat data type.
Nat is used as a kind of the type-level natural numbers (kind is the type of a type). So, in plain Haskell you can have values
0, 1, 2, 3 of the
Int type. But now you can have types
0, 1, 2, 3 of the
Nat kind. However, you can’t have values of the type
3 :: Nat. Type-level natural numbers exist only for compile time computations.
To play with
Nat, the corresponding data type can be found in the
GHC.TypeNats module along with other useful natural number related stuff. The data type itself is declared simply like this:
As you can see, it doesn’t have any constructors or anything and looks not very useful at first glance, but the ability to be used as a parameter at the type-level gives us a lot of opportunities as this natural number can be retrieved at runtime.
Let’s explore the
Nat kind in
ghci for a better understanding:
The module even exports some type-level operators (type families), so you can write code like this:
In addition to performing numeric operations with type-level natural numbers, GHC gives the ability to extract a runtime value from the compile time type-level natural. For this, you can use the
natVal function which returns the
Note: GHC uses Proxy to pass type-level witnesses in runtime. However, in
membrainwe can leverage the
AllowAmbiguousTypesextension to pass types explicitly as arguments. We even have a helper function for that:
If you would like to read more about type-level literals, you can read this section of the GHC user guide.
We are heavily using
Nat as the core building block of
membrain. Check out the following section to learn how and why.
Units and Type-level conversion🔗
Finally, now that we learned what a Nat is, I can say that memory units in our architecture are just type-level natural numbers. Here is the exact code we have:
Instead of writing computed
Nats we are using
*type family for type-level multiplication.
These units are type-level multipliers. To construct term-level values of the
Memory type, you can use the
memory smart constructor. As I mentioned earlier, we store all
Memory values as bits, so this function performs some additional computation to translate the number of units given by the user to bits. It is an easy operation due to the fact that we can get the multiplier from the type, as it has this information in type-level natural numbers (which is exactly the required unit multiplier). Here the
natVal function comes to our assistance. It returns
Natural by the given
KnownNat. As all of our units are
KnownNats, we come up with the following function to transform any unit to the number of bits:
Using this general-purpose
memory function we can implement smart constructors:
With these smart constructors in place the conversion function becomes as simple as possible. The
toMemory function is for view-changing only, so it has zero cost at runtime. The implementation is:
Note: type variables are written explicitly under
foralland in this particular order because it’s crucial for library usability. It makes usages of the
toMemoryfunction more convenient. This is an interesting insight into how the order of type variables can improve the UX of modern Haskell libraries.
Because the internal representation is not changed during this transformation.
And it works:
Memory units exhi’bit’ion🔗
In this section, I am going to explain how the pretty output is implemented in the library. The
Memory data type has derived instances of the
Read typeclasses. Besides that, we also implemented the special functions
readMemory which have the same types but different implementations and goals. We don’t put the implementation of the custom printing/parsing functions into the corresponding standard instances themselves instead of using the deriving mechanism on purpose. There is an (mostly) unspoken rule around these typeclasses: the output of the
show function should be valid Haskell code that can replace the shown value. We decided not to break this principle and have our pretty-printing functions with the human-readable output. Though we strictly preserve the roundtrip law of the Show/Read typeclasses —
read . show ≡ id — with the custom implementation.
Let’s look closer at the
showMemory function shows a
Memory value as a decimal fraction, along with the unit suffix. It shows
Memory losslessly when used with standardized units of measurement. The following mathematical law is used to display
A decimal representation written with a repeating final
0 is supposed to terminate before these zeros. Instead of
1.585000… one simply writes
1.585. The decimal is also called a terminating decimal. Terminating decimals represent rational numbers of the form (k/2n 5m). If you use different forms of units then the
show function for
readMemory in turn is the inverse of
showMemory. For safety, it returns a
Maybe memory value instead of throwing runtime exceptions. The
readMemory function uses the following information from type for parsing:
- unit text representation (
UnitSymboltypeclass function) — to compare with the parsed unit name string,
- the multiplier — because we are storing bits internally so we would need to convert it before returning the resulting
These two functions implement a reliable communication channel between compile time and runtime values.
Even if we are heavily using the high-level capabilities of GHC to perform type-level computations, in order to have a strong foundation in this library that prevents a lot of ambiguous situations, we still think that testing is a crucial part of library or application development and maintenance. Existing tests already showed up to be really helpful and helped us find some inaccuracy during the development of this library.
We have different types of testing:
- Type-level testing (yes! They exist!) — to control some of the type-level magic.
- Doctests — to have documentation always up-to-date.
- Property testing — no need to say how important this one is.
- Unit testing — a sanity check.
In the following subsections, let’s talk about the most interesting ones.
As most of the library logic is taken upon types, it would be extremely useful to have tests on this level of computation. But how? The answer is the type-spec library. This package is created to write type-level unit tests. It provides a convenient EDSL for this purpose. In our case we are using it to add tests on our
UnitSymbol instances I mention above. The cool thing is that you write your test cases at the type level as well.
Just look at our use case. To be able to show the
Memory value in the standardised format like
5kB, we are using the
UnitSymbol type family and a bunch of type instances for each unit. By design, every unit is the type alias of a corresponding type-level natural number. For instance, a kilobyte is defined as follows
But you can see that in the instances of
UnitSymbol we are using
This is because of the fact that GHC doesn’t allow having type family application as an argument of the instance declaration. If we try to define an instance for
Kilobyte written above in the following form:
it would fail at compile time with the following error:
In order to keep an eye on this inconvenience and to have these great natural numbers under control, we can get the help of type-level tests. So our tests should check that the unit specification symbols correspond to the correct units.
The test is written in the following form:
It uses some fancy type-level operators from the
type-spec library to make it work. The most interesting part for us, the place where the type-level unit test is specified is here:
UnitSymbol Bit 'Is' "b". It’s easy to understand that we are trying to check exactly what I was describing.
We’ve implemented a number of helpful instances and functions for the
Memory data type that should preserve some properties associated with them. For example,
Memory has manual implementations of the
Monoid classes. These typeclasses have associativity and neutrality laws associated with them. Also, we provide custom show and read functions for pretty output which also should satisfy the roundtrip property. We define property-based tests using the hedgehog library.
The idea of property-based testing is that you specify the property that should be preserved and provide the generators that would randomly create data to test against. One of the distinctive characteristics of property-based testing is that it covers the scope of all possible inputs. In our case, as we are testing the
Memory data type, that would mean that the input data would have different types due to the phantom parameter. That is puzzling. We need a generator that can create different
Memory values that can also be of a different type in the context of any function. But the type of the randomly generated
Memory is defined at runtime by the generator. Currently, it is not possible to make the type depend on the runtime value. To achieve this we would need dependent types, however, this feature is not yet implemented in Haskell at the moment of writing this post. What we can do as a workaround is to create the following existential data type:
It is called existential because as you can see we have this
forall in the right-most part of the function definition. This allows us to store a type-level natural number at runtime. You can think of the existential types as a trick to move type parameters from the definition site to the call site. Or, alternatively, you can think of it as moving type information from compile time to runtime. Such existential types are useful if you want to have collections of values of different types, or if you want the type to depend on some runtime value. However, this feature has its cost: you no longer have access to the type-level information at compile time, meaning you can inspect types only at runtime. Fortunately, this is a drawback we can accept to solve our problem.
To summarize, we need to implement the following two things for generating
- Generate random Natural number.
- Randomly choose the unit and wrap it into the
Here is how the generator looks like:
genAnyMemory :: MonadGen m => m AnyMemory genAnyMemory = genNatural >>= unitChooser -- | Returns random 'AnyMemory'. unitChooser :: (MonadGen m) => Natural -> m AnyMemory unitChooser n = Gen.element [ MkAnyMemory (Memory @Mem.Bit n) , MkAnyMemory (Memory @Mem.Kilobyte n) ... , MkAnyMemory (Memory @Mem.Kilobyte n) ... ]
And when we have the generator of the
AnyMemory data type our property tests can be written in a simple readable way:
By the way, the
AnyMemory data type is exported in case any user would also need to solve such problems of having different memory units in their collections.
Inventing your own mem🔗
Despite the fact that the library provides quite a large collection of supported memory units, the situations when you need to have another one are still possible (for example, a custom filesystem page size or very-very-very large data pieces of a brontobyte or a gegobyte). Don’t panic! The way of adding your own unit is quite straightforward and I am going to walk you through it.
Note: adding incorrect multipliers that could spoil some staff is the responsibility of the users. See memory units exhibition section for clarifications.
Note: the following code assumes that you enabled
TypeFamiliesto work with type families.
Let’s say, that we need to use a brontobyte a lot. So it makes sense to create a type. By the definition, a brontobyte contains 103 yottabytes. Let’s write it down
Alternatively, with the usage of the
* type family from
GHC.TypeNats you can write the following clean definition:
Don’t forget to turn on the
And the instance of
UnitSymbol in order to have custom show and read functions. It is quite easy to write:
Also, the smart constructor is helpful indeed:
Aaand that’s all that is needed! We can check it in ghci now:
As you can see, the newly created type is fully composable with the existing interface and types.
Multiple public libraries vs Orphan instances🔗
membrain package depends only on the
base library which makes it extremely lightweight. However, it would be more useful to users if the library could provide ways to work with JSON, TOML, CSV, SQL, binary serialisation and other formats. Unfortunately, implementation of the required functionality requires
membrain to depend on many other packages. This is an awkward tradeoff between making a small package with the minimal interface or a heavy package with a lot of dependencies. The more dependencies you add, the longer the build times of the package and the higher the maintenance cost. And this is a common problem in Haskell when you want to implement a minimal data type or some basic typeclass.
Moreover, some people may only need JSON serialisation. Why should they build CSV and SQL libraries for that? And what is worse, such encoding/decoding libraries are usually based on typeclasses which means that you need to use orphan instances if you don’t put those instances in the package with a data type or a typeclass. So what we want is a modular and scalable way to extend the capabilities of the package. By now in Haskell we have the following options to choose from to resolve this dilemma:
- Don’t provide those instances. Let the users implement them if they need.
- Depend on all packages and provide all the instances.
- Use Cabal flags and CPP to disable/enable specific extension points. This solution is inconvenient to use because you can’t specify these flags in the
- Put integration with each library into a separate package. Maintaining multiple packages around a single data type or typeclass is an extra headache.
Fortunately, since Cabal-3.0 there is a better way to approach the problem — multiple public libraries. This feature allows creating and exposing multiple independent libraries from a single package. These libraries may later be used individually by any other projects.
library stanza in
membrain looks like this:
library import: common-options exposed-modules: Membrain Membrain.Base Membrain.Constructors Membrain.Memory Membrain.Units
Now, if we want to implement JSON instances from the
aeson library, we can create a
Membrain.Aeson module and put it into a separate public library:
library json import: common-options exposed-modules: Membrain.Aeson build-depends: aeson ^>= 1.4
membrainalso uses the common stanzas feature to remove metadata duplication between different stanzas in the
The implementation of
ToJSON instances from the
aeson library goes into the
Membrain.Aeson module. And if you want to use these instances in your package in addition to the main
Memory type, you just need to add the
json public library to your
build-depends using the following syntax:
build-depends: membrain ^>= 0.0.0.0 , membrain:json
and then import the
Membrain.Aeson module when you need it (you can even put this import in your custom local
Prelude to automatically make this instance visible project-wide).
The approach of using multiple public libraries has the following benefits in comparison with other existing solutions:
- No need to maintain extra packages to support additional features.
- No CPP and compile time boolean flags involved.
- Additional features of the library are specified explicitly in the
.cabalfile, not in some other configuration file.
- Composition and modularity: you can depend only on those libraries that you need without having extra dependencies.
- All integrations with other libraries are in a single place which means that you can easily see what needs to be updated when you introduce breaking changes to the main library.
This looks like a lovely solution to a long-pending problem.
membrain in a nutshell. To check out the full code and more tricks used in the package you can look into the sources:
And here is a short recap of
- The library provides a single data type
Memoryparameterized by the type-level memory units.
- Memory unit multipliers are stored as type-level natural numbers.
- Internally everything is stored as bits.
- All computations/conversions are done at compile time, so minimum boilerplate is required from users.
- The library provides its own version of
basefunctions that work with memory units.
- The implemented interface is extensible externally.
- A multiple public libraries solution provides useful instances that help avoid redundant dependencies and orphan instances.