<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michał Knasiecki</title>
    <description>The latest articles on DEV Community by Michał Knasiecki (@mknasiecki).</description>
    <link>https://dev.to/mknasiecki</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1192325%2F19b09340-d8aa-4c59-921a-20011b88fc3e.jpeg</url>
      <title>DEV Community: Michał Knasiecki</title>
      <link>https://dev.to/mknasiecki</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mknasiecki"/>
    <language>en</language>
    <item>
      <title>Scheduled stepdown for smoother MongoDB primary election</title>
      <dc:creator>Michał Knasiecki</dc:creator>
      <pubDate>Sat, 15 Jun 2024 12:02:59 +0000</pubDate>
      <link>https://dev.to/mknasiecki/scheduled-stepdown-for-smoother-mongodb-primary-election-4n0p</link>
      <guid>https://dev.to/mknasiecki/scheduled-stepdown-for-smoother-mongodb-primary-election-4n0p</guid>
      <description>&lt;p&gt;Stepdown is a great tool that allows us to keep clusters operating smoothly. We use it for example when we want to perform some maintenance work on the host where the primary is currently running, to perform a rolling upgrade, and in many other cases we need to switch the primary to another node. &lt;/p&gt;

&lt;p&gt;While electing a new primary is usually fast enough, for clusters with very high write traffic it sometimes, unfortunately, leads to write errors on the application side. The reason is that drivers need to disconnect from the previous primary and connect to the new one, and this takes time. All write attempts during this window end with an error. Of course, those operations can be retried after a short time, but it seems to me that this situation could be avoided fairly easily. &lt;/p&gt;
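&lt;p&gt;As a side note, modern drivers can already absorb a single transient failure of this kind via retryable writes, enabled through the connection string; a minimal sketch (host names are placeholders):&lt;/p&gt;

```
mongodb://host1,host2,host3/?retryWrites=true
```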

&lt;p&gt;Here's my idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We add an optional parameter to the stepDown method; let’s name it &lt;code&gt;electionEffectiveTimeSeconds&lt;/code&gt;. If we do not set it, stepDown works as before,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The new primary is elected, but the cluster and drivers do not use it as the primary yet; instead, they wait &lt;code&gt;electionEffectiveTimeSeconds&lt;/code&gt; seconds after the election is done,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Until then, the previous primary keeps handling write operations,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the meantime, all cluster nodes and drivers receive information about the new primary and about when the change will take effect - based on &lt;code&gt;electionEffectiveTimeSeconds&lt;/code&gt;,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This allows all parties to properly prepare for the switchover, which should minimize the period of primary unavailability,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the time expires, the cluster and drivers switch to the new primary.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
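&lt;p&gt;To illustrate the steps above: today a stepdown is triggered with &lt;code&gt;rs.stepDown()&lt;/code&gt;, and the proposal would extend it with the new parameter. The second call below is purely hypothetical - this parameter does not exist in MongoDB:&lt;/p&gt;

```javascript
// Existing API: step down for 120 seconds, allowing secondaries
// up to 10 seconds to catch up before the election starts.
rs.stepDown(120, 10)

// Hypothetical extension proposed above: the newly elected node
// would take over as primary only 10 seconds after the election,
// giving all nodes and drivers time to prepare for the switchover.
db.adminCommand({ replSetStepDown: 120, electionEffectiveTimeSeconds: 10 })
```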

&lt;p&gt;The key element will be determining the optimal range for this new parameter. On the one hand, it must be long enough to spread the information through the cluster and drivers. On the other hand, it must not be too long, because there is a risk that - during that time - the newly elected primary will accumulate replication lag and will not be able to take over. In that case, however, the election could be retried once more, and if that too fails, a normal stepdown could be performed on the third attempt.&lt;/p&gt;

&lt;p&gt;I think this method would be useful for all stepdowns that are planned in advance and do not need to take effect immediately.&lt;/p&gt;

&lt;p&gt;I am curious about your opinions; if you see value in this, I will be grateful for an upvote &lt;a href="https://feedback.mongodb.com/forums/924280-database/suggestions/48541538-scheduled-stepdown-for-smoother-primary-election"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mongodb</category>
    </item>
    <item>
      <title>How to reduce MongoDB database storage size by 95% with no effort?</title>
      <dc:creator>Michał Knasiecki</dc:creator>
      <pubDate>Tue, 12 Dec 2023 15:51:17 +0000</pubDate>
      <link>https://dev.to/mknasiecki/how-to-reduce-mongodb-database-storage-size-by-95-106j</link>
      <guid>https://dev.to/mknasiecki/how-to-reduce-mongodb-database-storage-size-by-95-106j</guid>
      <description>&lt;p&gt;One of the key things that I dedicate a lot of effort and attention to in my daily work is the size of my databases. Not everyone understands this, and some people may say I'm wasting time and money because, after all, disks are cheap. However, the truth is that I don't do it because of the disks, or at least not only. After all, keeping the size of the database under control has many other advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;the smaller the data, the more of the so-called working set will fit into RAM and the more efficiently the database will work; in short: smaller databases &lt;a href="https://www.mongodb.com/docs/manual/tutorial/ensure-indexes-fit-ram/"&gt;work faster&lt;/a&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;small databases back up faster, which is especially important if you manage hundreds of databases and must guarantee a backup for each of them, let’s say, once every 24 hours or even more often;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;in case of failure, a small database can be restored from the backup faster than a large one, so our system will return to operation more quickly;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;and finally, of course, infrastructure costs depend on the size of the database, so smaller databases are cheaper. This applies not only to on-prem environments, but also to the cloud; for many providers, the cost of renting servers depends largely on the size of the disks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So it's safe to say that database size is one of the most important parameters we should keep an eye on. The developers of MongoDB worked from a similar assumption, giving us the binary BSON format, many different data types, built-in compression, and dedicated time-series collections that &lt;a href="https://dev.to/mknasiecki/evaluating-performance-of-time-series-collections-3p73"&gt;can pack data&lt;/a&gt; even more densely.&lt;/p&gt;

&lt;p&gt;However, it turns out that in special cases we can go one step further and use non-standard solutions, such as &lt;a href="https://dev.to/mknasiecki/probabilistic-data-structures-and-algorithms-in-nosql-databases-4mp7"&gt;probabilistic data structures&lt;/a&gt;. They will allow even more efficient use of disk space, although there is a cost to be expected here, which we will discuss below.&lt;/p&gt;

&lt;h2&gt;Memory-efficient presence test&lt;/h2&gt;

&lt;p&gt;One such data structure is the Bloom Filter: a memory-saving structure used to implement the so-called presence test. &lt;/p&gt;

&lt;p&gt;To explain how it works, let's refer for a moment to one of the most popular data structures: a set. As we know, you can add elements to a set, you can remove elements from it, and you can also check whether an element is in the set.&lt;/p&gt;

&lt;p&gt;Sets are very useful, but not perfect; one disadvantage of this structure is the linear dependence of its size on the number of elements: the more elements we put into the set, the more memory is needed to store it. This can be a big problem if we want to operate on a very large number of elements - especially if we are not really interested in the contents of the set, but only want to check whether some element is in it. The operation of checking whether an element belongs to a set is called a presence test.&lt;/p&gt;

&lt;p&gt;And this is where Bloom Filters prove helpful: we can use one to test whether elements have been placed in it, albeit without being able to extract those elements. Thus, we can't iterate through the elements of the filter as we can with a set; we can only - knowing the element - check whether it has been placed in the Bloom Filter.&lt;/p&gt;

&lt;p&gt;An interesting feature of Bloom Filters is their fixed size (up to a certain maximum number of elements the filter can handle, of course): there’s no linear dependence of the data size on the number of its elements.&lt;/p&gt;

&lt;p&gt;Unfortunately, you have to pay a price for this fixed size: a Bloom Filter can sometimes return false-positive results, i.e. state that a certain element is in it while it was actually never added. On the other hand, a false-negative result cannot happen, so if the filter claims that an element is absent, it definitely wasn’t added.&lt;/p&gt;

&lt;p&gt;There are, of course, many fields in which a false positive result is unacceptable; one example would be medicine. However, there is quite a wide field of applications where we can accept a small number of false positive results in exchange for an efficient way of storing information.&lt;/p&gt;

&lt;h2&gt;Article recommendation system&lt;/h2&gt;

&lt;p&gt;Imagine building a social-media-style system in which some users create content for others. An obvious feature of such a system would be some sort of mechanism for promoting posts: we would like to recommend to users new posts that match their interests. We can expect that our system will have so many posts to promote that we would want to present each suggestion to a given user only once. This means saving, for each user, the URLs of posts that have already been suggested to them, so that we never recommend the same content twice.&lt;/p&gt;

&lt;p&gt;There are two ways to model the data described above. &lt;br&gt;
We could approach it the way we would in a relational database: keep our users in one collection and the information about the articles they visited in another, with an additional collection mapping identifiers to model the M:N relationship.&lt;/p&gt;

&lt;p&gt;This way, although possible, is not what is usually recommended for non-relational databases.&lt;br&gt;
When modeling this data in MongoDB, we would probably use its ability to model complex documents. This allows us to keep all the data in one collection, where each document describes a user and one of its fields is a set containing the URLs of recommended articles.&lt;/p&gt;
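&lt;p&gt;Under the single-collection approach, a user document might look like this (field names are illustrative):&lt;/p&gt;

```javascript
{
  "_id": ObjectId("..."),
  "visitedUrls": [
    "https://example.com/articles/1",
    "https://example.com/articles/2"
  ]
}
```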

&lt;p&gt;One of the advantages of this method is simplicity: instead of two (or even three) collections, we only need one; we also avoid the &lt;code&gt;$lookup&lt;/code&gt; aggregation stage, which can be expensive on a big database. On the other hand, the single-collection solution can also cause performance problems: the longer the list of articles, the more data we have to load into memory when loading a user document. The relationship is linear in this case.&lt;/p&gt;

&lt;p&gt;If, in addition, we do not need to know the actual URL of the articles, but only want to be able to check whether an article has been recommended to the user, it is worth considering an alternative solution based on Bloom Filters.&lt;/p&gt;

&lt;p&gt;This is because, instead of storing a set of recommended articles, you can store a serialized Bloom Filter in the document, which should be more compact than the set. This means that the collection of users will be much smaller than if you stored the URL list directly.&lt;/p&gt;

&lt;p&gt;Sounds promising, so let's see how it works in practice. And the best way to do it is to perform a simple experiment.&lt;/p&gt;
&lt;h2&gt;Time to do some experiments&lt;/h2&gt;

&lt;p&gt;There are many implementations of Bloom Filters; since I am a Java/Kotlin developer, I will use the &lt;a href="https://guava.dev/"&gt;Guava&lt;/a&gt; library for my tests. This library has a nice additional feature: its Bloom Filter implementation includes a &lt;a href="https://guava.dev/releases/20.0/api/docs/com/google/common/hash/BloomFilter.html#writeTo-java.io.OutputStream-"&gt;writeTo&lt;/a&gt; method that stores the filter in a very compact form, and this will be an additional benefit for us.&lt;/p&gt;

&lt;p&gt;Note that in the examples below I will use annotations from the &lt;a href="https://spring.io/projects/spring-data"&gt;spring-data&lt;/a&gt; library, but its use is not necessary and the same program can be implemented without problems using a pure driver for MongoDB.&lt;/p&gt;

&lt;p&gt;I have prepared two classes that map a user model to a MongoDB collection. The first one uses the classic approach: with a set containing article URLs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Document(collection = "users-plain")
data class PlainUser(
   @Id val id: ObjectId,
   val visitedUrls: Set&amp;lt;String&amp;gt;
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While in the second, instead of a set, I use a Bloom Filter, which is serialized to a binary form, mapped to an array of bytes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Document(collection = "users-bf")
data class BloomFilterUser(@Id val id: ObjectId, 
    val visitedUrlsFilter: ByteArray)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I decided to start the tests with the assumption that the average URL length is 50 characters (according to SEO recommendations, links of 50-60 characters are optimal) and that each document holds an average of 10 of them. I will perform the first test for 100,000 users.&lt;/p&gt;

&lt;p&gt;The following code shows how data are being generated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (i in 1..usersCount) {

   //Generate URLs:
   val urls = (1..urlsCount).map { 
       RandomStringUtils.randomAlphanumeric(urlLength) 
   }.toSet()

   //Store a user with URLs in a set:
   val nextPlainUser = PlainUser(ObjectId.get(), urls)
   plainUserRepository.save(nextPlainUser)

   //Store a user with URLs as Bloom Filter:
   val bloomFilter = BloomFilter.create(
       Funnels.stringFunnel(
           Charset.defaultCharset()), urlsCount, fpp)

   urls.forEach{
       bloomFilter.put(it)
   }

   val out = ByteArrayOutputStream()
   bloomFilter.writeTo(out)

   val nextBFUser = BloomFilterUser(ObjectId.get(),
      out.toByteArray())

   bloomFilterUserRepository.save(nextBFUser)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, the code could be written in a cleaner (and, by using bulk inserts, more efficient) way, but I wanted to keep it simple, so I decided to use a single loop with everything in it.&lt;/p&gt;

&lt;p&gt;Variables &lt;code&gt;usersCount&lt;/code&gt;, &lt;code&gt;urlsCount&lt;/code&gt; and &lt;code&gt;urlLength&lt;/code&gt; are obvious and define the test parameters mentioned above; the variable &lt;code&gt;fpp&lt;/code&gt;, however, needs clarification - it is the "false positive probability", which specifies the probability that the filter will return true for an object that has not actually been put into it. We can control this factor, but of course the smaller the &lt;code&gt;fpp&lt;/code&gt;, the smaller the benefit of using the Bloom Filter, because its storage size grows.&lt;/p&gt;
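&lt;p&gt;The size advantage can be estimated up front with the standard Bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2, where n is the number of elements and p the false positive probability; a quick sketch for the parameters used here:&lt;/p&gt;

```javascript
// Expected number of bits for n elements at false positive probability p
// (standard formula: m = -n * ln(p) / (ln 2)^2).
function bloomBits(n, p) {
  return Math.ceil((-n * Math.log(p)) / (Math.LN2 * Math.LN2));
}

const n = 10;   // URLs per user
const p = 0.01; // fpp
const filterBytes = Math.ceil(bloomBits(n, p) / 8);
const rawBytes = n * 50; // ten 50-character URLs stored as plain strings

console.log(filterBytes, rawBytes); // 12 500
```

&lt;p&gt;So even before any BSON or compression effects, the filter needs roughly 12 bytes per user where the raw URLs need about 500, which is in line with the measured reduction of around 95%.&lt;/p&gt;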

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;To compare the two methods, we will use the &lt;code&gt;storageSize&lt;/code&gt; property that can be found in &lt;a href="https://www.mongodb.com/docs/manual/reference/command/collStats/#mongodb-data-collStats.storageSize"&gt;db.collection.stats()&lt;/a&gt; output, so it will reflect the compressed size.&lt;/p&gt;
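&lt;p&gt;The value can be read in the mongo shell, for example (collection names taken from the mappings above):&lt;/p&gt;

```javascript
// Compressed on-disk size of each collection, in bytes
db.getCollection("users-plain").stats().storageSize
db.getCollection("users-bf").stats().storageSize
```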

&lt;p&gt;Let's start with some low numbers: &lt;code&gt;urlsCount = 10&lt;/code&gt;, so we will assume that we can save only the last 10 viewed articles.&lt;/p&gt;

&lt;p&gt;The following table shows the size of the collections modeled by both methods (plain and using Bloom filters), the last column shows the size reduction we achieved: &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Users count&lt;/th&gt;
&lt;th&gt;Array [B]&lt;/th&gt;
&lt;th&gt;Bloom Filter [B]&lt;/th&gt;
&lt;th&gt;Diff [%]&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100K&lt;/td&gt;
&lt;td&gt;45633536&lt;/td&gt;
&lt;td&gt;2420736&lt;/td&gt;
&lt;td&gt;-94.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500K&lt;/td&gt;
&lt;td&gt;218083328&lt;/td&gt;
&lt;td&gt;11304960&lt;/td&gt;
&lt;td&gt;-94.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;442531840&lt;/td&gt;
&lt;td&gt;22593536&lt;/td&gt;
&lt;td&gt;-94.8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can see that by using filters we can achieve a size reduction of 95%. A great result; now let's see what happens when we set &lt;code&gt;urlsCount = 50&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Users count&lt;/th&gt;
&lt;th&gt;Array [B]&lt;/th&gt;
&lt;th&gt;Bloom Filter [B]&lt;/th&gt;
&lt;th&gt;Diff [%]&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100K&lt;/td&gt;
&lt;td&gt;217296896&lt;/td&gt;
&lt;td&gt;7933952&lt;/td&gt;
&lt;td&gt;-96.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500K&lt;/td&gt;
&lt;td&gt;1068204032&lt;/td&gt;
&lt;td&gt;38817792&lt;/td&gt;
&lt;td&gt;-96.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;2135564288&lt;/td&gt;
&lt;td&gt;77930496&lt;/td&gt;
&lt;td&gt;-96.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The result has further improved, although only slightly.&lt;/p&gt;

&lt;p&gt;Now let's see what effect the value of the &lt;code&gt;fpp&lt;/code&gt; parameter has. We will perform a test for &lt;code&gt;urlsCount = 50&lt;/code&gt; and &lt;code&gt;users = 100K&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;fpp&lt;/th&gt;
&lt;th&gt;Bloom Filter [B]&lt;/th&gt;
&lt;th&gt;Diff [%]&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.001&lt;/td&gt;
&lt;td&gt;11337728&lt;/td&gt;
&lt;td&gt;+42.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;7933952&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;5005312&lt;/td&gt;
&lt;td&gt;-36.9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;According to the documentation, increasing the uncertainty of the result further reduces the size, and the test above confirms it.&lt;/p&gt;

&lt;p&gt;However, it should be noted here that 5% of false positives is already quite a lot, so it is worth thinking carefully before further increasing the value of this parameter.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;MongoDB uses very advanced techniques to store data in a compact way, and they should be more than enough for the vast majority of our daily projects. In some special cases, however, you can consider even more specialized data structures that reduce the size of the database to just a few percent of its initial size. We have demonstrated that, using Bloom Filters, it is possible to achieve excellent results with a reasonable level of uncertainty about the stored data.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>database</category>
      <category>json</category>
      <category>programming</category>
    </item>
    <item>
      <title>Comparing MongoDB composite indexes</title>
      <dc:creator>Michał Knasiecki</dc:creator>
      <pubDate>Mon, 27 Nov 2023 19:06:38 +0000</pubDate>
      <link>https://dev.to/mknasiecki/comparing-mongodb-composite-indexes-34bk</link>
      <guid>https://dev.to/mknasiecki/comparing-mongodb-composite-indexes-34bk</guid>
      <description>&lt;p&gt;One of the key elements ensuring efficient operation of the services we work on every day at &lt;a href="https://allegro.tech/" rel="noopener noreferrer"&gt;Allegro&lt;/a&gt; is fast responses from the database. We spend a lot of time to properly model the data so that storing and querying take as little time as possible. You can read more about why good schema design is important in one of my earlier &lt;a href="https://dev.to/mknasiecki/series/25227"&gt;posts&lt;/a&gt;. It’s also equally important to make sure that all queries are covered with indexes of the correct type whenever possible. Indexes are used to quickly search the database and under certain conditions even allow results to be returned directly from the index, without the need to access the data itself. However, indexes are not all the same and it’s important to learn more about their different types in order to make the right choices later on. &lt;/p&gt;

&lt;h2&gt;What’s the difference?&lt;/h2&gt;

&lt;p&gt;I had a conversation with a colleague of mine the other day about the point of using composite keys in a MongoDB database. I’ve always been a firm believer that it’s a good idea to use a composite key wherever possible, because searching this way is very fast. My colleague, on the other hand, advocates using artificial keys and creating separate &lt;a href="https://docs.mongodb.com/manual/core/index-compound/" rel="noopener noreferrer"&gt;composite indexes&lt;/a&gt; on the fields for which I would use a composite key. After a brief disagreement, I realized that, other than my intuition, I had no arguments to defend my beliefs. I decided to see how indexes on composite keys differ in practice from composite indexes created on regular fields. &lt;/p&gt;

&lt;p&gt;As an example for our considerations, we will use an entity describing a person by first and last name, let’s also assume that this pair is unique.&lt;/p&gt;

&lt;p&gt;Such data can be stored in a collection (let’s call it &lt;code&gt;composite&lt;/code&gt;) with a composite key that contains both fields.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"surname"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Doe"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A unique index on the pair of fields (as part of the &lt;code&gt;_id&lt;/code&gt; key) will be created automatically along with the collection.&lt;/p&gt;
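&lt;p&gt;This is easy to verify in the mongo shell:&lt;/p&gt;

```javascript
// Only the automatic unique index on _id exists; no extra index is needed
db.composite.getIndexes()
```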

&lt;p&gt;Many developers would most likely prefer to use an artificial key, and index the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;surname&lt;/code&gt; fields separately, as shown in collection &lt;code&gt;artificial&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ObjectId(&lt;/span&gt;&lt;span class="s2"&gt;"615eb18b76e172647f7462e2"&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"surname"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Doe"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it comes to the second model, only the artificial key will be automatically covered by the index. To be able to efficiently search the collection by first and last name, we need to manually create a composite index on both fields. To maintain consistency with the first collection, of course, uniqueness also needs to be enforced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;artificial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;surname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, we can clearly see that the first way of storing data is more compact, since we only store the two fields of interest, and only one index is required in addition to the data. For the second collection, in addition to the data itself, we need to store an artificial key; moreover, two indexes are required here: one on the artificial key and one on the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;surname&lt;/code&gt; fields. &lt;/p&gt;

&lt;p&gt;We can now move on to comparing the execution plans of queries against the two collections, so let’s take a look at the results of the &lt;code&gt;explain&lt;/code&gt; commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;composite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;John&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;surname&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Doe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;artificial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;John&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;surname&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Doe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s look at the result of the second query first. We can see that the optimiser chose the index we created manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"winningPlan"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FETCH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputStage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IXSCAN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keyPattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"surname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"indexName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name_1_surname_1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;[...]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"executionStats"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"executionSuccess"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nReturned"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the case of a collection with a composite key, however, the plan is different; it contains the word &lt;code&gt;IDHACK&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"winningPlan"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IDHACK"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;[...]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"executionStats"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"executionSuccess"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nReturned"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that the optimiser skipped the index selection phase entirely (in our case there were no other indexes anyway, but that doesn’t matter) and went straight to the key index. This access path is considered the fastest one possible: whenever the full key appears in the query, its index will be used, and conditions on any other fields are ignored during plan selection. &lt;/p&gt;
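&lt;p&gt;For quick checks it can be handy to pull just the stage names out of an &lt;code&gt;explain()&lt;/code&gt; result. A minimal JavaScript sketch; the helper name and the sample plan objects below are illustrative, not a driver API:&lt;/p&gt;

```javascript
// Walk an explain() plan tree along inputStage and collect the stage names,
// so we can tell at a glance whether a plan used IDHACK, IXSCAN or COLLSCAN.
function planStages(plan) {
  const stages = [];
  for (let node = plan; node; node = node.inputStage) {
    stages.push(node.stage);
  }
  return stages;
}

// Sample plans shaped like the explain() output shown above:
const byKey = { stage: "IDHACK" };
const byIndex = { stage: "FETCH", inputStage: { stage: "IXSCAN", indexName: "name_1_surname_1" } };

console.log(planStages(byKey));   // [ 'IDHACK' ]
console.log(planStages(byIndex)); // [ 'FETCH', 'IXSCAN' ]
```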

&lt;p&gt;Note also the entry &lt;code&gt;"nReturned" : 1&lt;/code&gt;, which tells us that each of the two queries returned a single document.&lt;/p&gt;

&lt;p&gt;We already know that queries against both collections are served by an index. But are there any differences between these two indexes?&lt;/p&gt;

&lt;p&gt;The first candidate is search time: since the key’s index is always chosen whenever the key appears in the query conditions, it should, in theory, be the fastest. We’ll get to that in a moment. For now, let’s see what happens if we search by only one field at a time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;composite&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;John&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;artificial&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;John&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the case of the first query, the index was indeed used, but we got no results, as evidenced by this entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"executionStages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IDHACK"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"nReturned"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happens because a composite key requires the query to specify all of its components, so it is impossible to search by a single field.&lt;/p&gt;
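&lt;p&gt;The matching semantics can be sketched in plain JavaScript: an &lt;code&gt;_id&lt;/code&gt; lookup succeeds only when the queried value equals the stored key as a whole document. &lt;code&gt;JSON.stringify&lt;/code&gt; preserves key order, which makes it a handy simplified stand-in for BSON equality here; this is an illustration, not the server’s actual implementation:&lt;/p&gt;

```javascript
// Simplified model of an _id lookup: the query value must equal the stored
// _id as a whole document (same fields, same values, same order).
function idMatches(storedId, queriedId) {
  return JSON.stringify(storedId) === JSON.stringify(queriedId);
}

const storedId = { name: "John", surname: "Doe" };

console.log(idMatches(storedId, { name: "John", surname: "Doe" })); // true
console.log(idMatches(storedId, { name: "John" }));                 // false: a component is missing
```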

&lt;p&gt;The situation is different when querying the second collection: here the index was also used, but this time the document was found:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"winningPlan"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FETCH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputStage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IXSCAN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keyPattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"surname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"indexName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name_1_surname_1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;[...]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"executionStats"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"executionSuccess"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nReturned"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What if we decide to search by &lt;code&gt;surname&lt;/code&gt; only?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;composite&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;surname&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Doe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;artificial&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;surname&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Doe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the collection with a composite key, we have a situation similar to the previous one: the index was used, but no document was returned. The reason, of course, is the same: we didn’t use all the key fields. &lt;/p&gt;

&lt;p&gt;Querying the collection with a separate compound index did return the document we were looking for, but it turns out that this time the index was not used at all; instead, the database had to scan the entire collection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"winningPlan"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"COLLSCAN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"surname"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"$eq"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Doe"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"direction"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"forward"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is because with compound indexes, while it is possible to omit some of the indexed fields, fields may only be omitted from the right: the query must specify a prefix of the indexed fields.&lt;br&gt;
Our index was created on the fields in the following order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;artificial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;surname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is therefore possible to omit the condition on the &lt;code&gt;surname&lt;/code&gt; field and search by &lt;code&gt;name&lt;/code&gt; alone, but not the other way around.&lt;/p&gt;
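&lt;p&gt;The prefix rule can be sketched as follows. This is a deliberately simplified model that assumes equality-only conditions; &lt;code&gt;canUseIndex&lt;/code&gt; is a hypothetical helper, not a MongoDB API:&lt;/p&gt;

```javascript
// A query can use a compound index when the queried fields cover a prefix of
// the index's field list; the order of conditions in the query is irrelevant.
function canUseIndex(indexFields, queriedFields) {
  const prefix = indexFields.slice(0, queriedFields.length);
  return prefix.every(function (field) {
    return queriedFields.includes(field);
  });
}

const index = ["name", "surname"];

console.log(canUseIndex(index, ["name", "surname"])); // true
console.log(canUseIndex(index, ["name"]));            // true: "name" is a prefix
console.log(canUseIndex(index, ["surname"]));         // false: skips "name"
```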

&lt;p&gt;We have found the first difference between the two types of indexes: composite-key indexes are less flexible and require all key fields to be specified, while regular compound indexes let us omit fields, starting from the right.&lt;/p&gt;

&lt;p&gt;Let’s also check whether the order of the conditions matters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;composite&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;surname&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Doe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;John&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;artificial&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;surname&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Doe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;John&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here’s another surprise: even though all the components of the key were provided, just in reverse order, the first query did not find the document. This is because the &lt;code&gt;_id&lt;/code&gt; is compared as a whole document, and in BSON the order of fields matters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"executionStages"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IDHACK"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"nReturned"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a regular compound index, the order of equality conditions in the query is irrelevant: the optimiser matched them to the index fields and found the document: &lt;code&gt;"nReturned" : 1&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"winningPlan"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FETCH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"inputStage"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IXSCAN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"keyPattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"surname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"indexName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name_1_surname_1"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
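&lt;p&gt;The difference between the two outcomes above can be sketched as two comparison semantics (an illustration, not server code): query predicates behave like an unordered set of conditions, while an &lt;code&gt;_id&lt;/code&gt; value is a single ordered document compared as a whole:&lt;/p&gt;

```javascript
// Regular query predicates: an unordered set of field conditions.
function predicatesMatch(doc, query) {
  return Object.keys(query).every(function (key) {
    return doc[key] === query[key];
  });
}

// _id lookup: whole-document, order-sensitive comparison (JSON.stringify
// preserves key order, serving as a simplified stand-in for BSON equality).
function idMatchesOrdered(storedId, queriedId) {
  return JSON.stringify(storedId) === JSON.stringify(queriedId);
}

const doc = { name: "John", surname: "Doe" };

console.log(predicatesMatch(doc, { surname: "Doe", name: "John" }));  // true: order irrelevant
console.log(idMatchesOrdered(doc, { surname: "Doe", name: "John" })); // false: order differs
```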



&lt;p&gt;For the time being, indexes on composite keys are losing against regular ones. This is a good moment to return to the question of search time. Now that we’ve established that indexes on composite keys are less flexible, it’s worth finding out what we gain in return for these limitations. We already know that &lt;code&gt;IDHACK&lt;/code&gt; skips plan selection and always uses the key’s index, so one might expect it to be the fastest available way to get to a document. I decided to check this myself. &lt;/p&gt;

&lt;h2&gt;
  
  
  It’s time to experiment
&lt;/h2&gt;

&lt;p&gt;I filled both previously used collections with 10 million documents. I used the following scripts for this purpose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;bulk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;composite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initializeUnorderedBulkOp&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;name_&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;NumberInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="na"&gt;surname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;surname_&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;NumberInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)}})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;bulk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;artificial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initializeUnorderedBulkOp&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;name_&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;NumberInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="na"&gt;surname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;surname_&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;NumberInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is worth noting here that I’m adding documents in batches. This is significantly faster than a series of single inserts and is useful when we need to generate a large amount of data quickly. Also note that I am reusing the existing collections, so my index on the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;surname&lt;/code&gt; fields already exists. I measured the execution time of both scripts using the following commands (to avoid network latency, I performed the measurements on my laptop, against a local instance of MongoDB 4.4.0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;fill1.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;fill2.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filling the collections with documents took &lt;strong&gt;58.95s&lt;/strong&gt; and &lt;strong&gt;76.48s&lt;/strong&gt; respectively. So when it comes to &lt;code&gt;insert&lt;/code&gt; time, the composite-key collection clearly wins. The reason, of course, is the simpler document structure and a single index instead of two. &lt;/p&gt;
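&lt;p&gt;The batching idea itself is easy to sketch: instead of one round trip per document, the documents are grouped into chunks and each chunk is sent as a single operation. &lt;code&gt;toBatches&lt;/code&gt; and the chunk size below are illustrative, not a driver API:&lt;/p&gt;

```javascript
// Split a list of documents into fixed-size chunks; each chunk would then be
// sent to the server as one bulk operation (chunkSize is assumed positive).
function toBatches(items, chunkSize) {
  const batches = [];
  let rest = items.slice();
  while (rest.length) {
    batches.push(rest.slice(0, chunkSize));
    rest = rest.slice(chunkSize);
  }
  return batches;
}

const docs = Array.from({ length: 10 }, function (_, i) {
  return { name: "name_" + i, surname: "surname_" + i };
});

console.log(toBatches(docs, 4).length); // 3 batches: 4 + 4 + 2 documents
```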

&lt;p&gt;I was more interested in read times, because in a typical workload data is read far more often than it is written. For each collection, I prepared a script containing &lt;code&gt;find&lt;/code&gt; commands for 1 million documents in random order. The order of queries in both scripts is, of course, the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;RANDOM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..1000000&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="nv"&gt;$RANDOM&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;10000000&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"db.composite.find({_id: {name: 'name_&lt;/span&gt;&lt;span class="nv"&gt;$x&lt;/span&gt;&lt;span class="s2"&gt;', surname: 'surname_&lt;/span&gt;&lt;span class="nv"&gt;$x&lt;/span&gt;&lt;span class="s2"&gt;'}})"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; find1.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;RANDOM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..1000000&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="nv"&gt;$RANDOM&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;10000000&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"db.artificial.find({name: 'name_&lt;/span&gt;&lt;span class="nv"&gt;$x&lt;/span&gt;&lt;span class="s2"&gt;', surname: 'surname_&lt;/span&gt;&lt;span class="nv"&gt;$x&lt;/span&gt;&lt;span class="s2"&gt;'})"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; find2.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, I measured the execution time of both scripts using the commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;find1.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;find2.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results surprised me. The script running on the first collection took &lt;strong&gt;11.19s&lt;/strong&gt; and the second took &lt;strong&gt;10.15s&lt;/strong&gt;. Although the differences are small, I was sure that using a composite key would be much faster than searching through regular fields and would make up for all the inconvenience of less flexible keys. Meanwhile, it turns out that a collection built with an artificial key and a separate index on two fields wins in terms of search speed. &lt;/p&gt;

&lt;p&gt;I had one more idea. Maybe searching for a document by a key that doesn’t exist would be faster?&lt;br&gt;
To test this, I generated another two scripts, this time searching for non-existent documents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;RANDOM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..1000000&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="nv"&gt;$RANDOM&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;10000000&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"db.composite.find({_id: {name: 'missing_name_&lt;/span&gt;&lt;span class="nv"&gt;$x&lt;/span&gt;&lt;span class="s2"&gt;', surname: 'missing_surname_&lt;/span&gt;&lt;span class="nv"&gt;$x&lt;/span&gt;&lt;span class="s2"&gt;'}})"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; find-missing1.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;RANDOM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..1000000&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="nv"&gt;$RANDOM&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;10000000&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"db.artificial.find({name: 'missing_name_&lt;/span&gt;&lt;span class="nv"&gt;$x&lt;/span&gt;&lt;span class="s2"&gt;', surname: 'missing_surname_&lt;/span&gt;&lt;span class="nv"&gt;$x&lt;/span&gt;&lt;span class="s2"&gt;'})"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; find-missing2.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, the composite key collection also lost in this case: &lt;strong&gt;12.44s&lt;/strong&gt; vs. &lt;strong&gt;10.26s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Finally, I decided to run one more test. Since with a composite key we have to pass the entire key to the query and cannot search by a fragment of it, I decided to create a third collection, this time with a key that is the concatenation of the first and last name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;bulk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;concatenation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initializeUnorderedBulkOp&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;name_&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;NumberInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;surname_&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;NumberInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
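&lt;p&gt;The key construction can be sketched as a small helper (the helper name is my own, and the delimiter choice matters: it must be a character that cannot appear in either field, otherwise different name/surname pairs could produce the same key):&lt;/p&gt;

```javascript
// Hypothetical helper: build a single _id string from two fields.
// The '-' separator must not occur in the fields themselves, or
// e.g. ('a-b', 'c') and ('a', 'b-c') would collide on 'a-b-c'.
function makeKey(name, surname) {
  return name + '-' + surname;
}

console.log(makeKey('name_42', 'surname_42')); // name_42-surname_42
```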



&lt;p&gt;I then prepared another script containing 1 million search operations (in the same order as in the previous attempts, of course):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;RANDOM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..1000000&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="nv"&gt;$RANDOM&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;10000000&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"db.concatenation.find({_id: 'name_&lt;/span&gt;&lt;span class="nv"&gt;$x&lt;/span&gt;&lt;span class="s2"&gt;-surname_&lt;/span&gt;&lt;span class="nv"&gt;$x&lt;/span&gt;&lt;span class="s2"&gt;'})"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; find3.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time I was finally able to get a result better than the one achieved with the artificial key approach: &lt;strong&gt;9.71s&lt;/strong&gt;.&lt;br&gt;
All results are summarized in the chart below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynqczaj3nx9vssq4o0dg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynqczaj3nx9vssq4o0dg.png" alt="Collection stats"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It appears that the &lt;code&gt;IDHACK&lt;/code&gt; operation is indeed the fastest possible way to access a document, but only for a simple key. As soon as we add an additional field to the key, the benefits of &lt;code&gt;IDHACK&lt;/code&gt; disappear and a regular index works better. &lt;/p&gt;

&lt;p&gt;In general, my experiments did not confirm the advantages of using a composite key. The artificial key model is more flexible in querying and returns results faster. However, if you want to search even faster, you may consider using a simple key created by concatenating selected fields. &lt;/p&gt;

&lt;p&gt;I realize, of course, that I did not use the latest available version of the MongoDB server. For this reason, I am posting all the scripts I used in my experiments and encourage you to try them on your own. &lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>database</category>
      <category>nosql</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Evaluating performance of time series collections</title>
      <dc:creator>Michał Knasiecki</dc:creator>
      <pubDate>Mon, 20 Nov 2023 10:57:43 +0000</pubDate>
      <link>https://dev.to/mknasiecki/evaluating-performance-of-time-series-collections-3p73</link>
      <guid>https://dev.to/mknasiecki/evaluating-performance-of-time-series-collections-3p73</guid>
      <description>&lt;p&gt;A few years ago, I was working on a new version of &lt;a href="https://allegro.tech"&gt;Allegro&lt;/a&gt; purchase ratings. It was a time of a pretty large revolution in the rating system when we moved this product away from our monolith, also introducing quite significant changes to the concept of purchase rating itself. Replacing the system of positive, neutral or negative rating, we introduced an approach based on “thumbs up” and “thumbs down” as well as the option to rate several elements of the purchase separately: the time and cost of delivery, or products’ conformity with its description. The product-related revolution was accompanied by a major change in technology. Apart from transitioning towards the microservices architecture, we also decided to migrate data from a large relational database to MongoDB. There were many reasons for this decision: from the non-relational nature of our model, through the need for easier scaling, to the wish for cost reduction. Upon completion of the works, we were for the most part content with the decision that we made. The new solution was more user-friendly, easier to maintain and worked smoothly. The sole exception was aggregation queries, specifically: determining the average of seller ratings in a specified period of time. While at the level of the 99th percentile times were very low, some queries were much slower. We spent a lot of time optimising both queries and the code, and had to use some programming tricks to achieve satisfactory results. While we were able to solve our problems in the end, the final conclusion was that the aggregation of data in large MongoDB collections is quite challenging. &lt;/p&gt;

&lt;h2&gt;
  
  
  To the rescue: Time series
&lt;/h2&gt;

&lt;p&gt;A new version of MongoDB, 5.0, has recently been released. The list of changes included one that I found particularly interesting: time series collections. This is a method of efficiently storing and processing time-ordered series of values. A classic example is measuring air temperature: measurements are taken periodically (for instance every hour), and their sequence forms a time series. We then often review such data in chronological order, as well as calculate the maximum, minimum or mean values. In this use case, a database must therefore be highly efficient at writing data, store records compactly due to their large number, and calculate aggregates quickly. Although temperature readings are written at regular intervals, it turns out that for time series this is not mandatory; the only thing that truly matters is the presence of a time field. While reading about this topic, I instantly remembered my countless late nights struggling with slow rating aggregations. Therefore, I decided to explore the topic and see how the solution works in practice. &lt;/p&gt;

&lt;p&gt;Before the release of MongoDB 5.0, the only way to efficiently process time series was to store pre-calculated aggregates, or to use a &lt;a href="https://www.mongodb.com/blog/post/building-with-patterns-the-bucket-pattern"&gt;bucket pattern&lt;/a&gt;, which obviously meant additional work and code complexity. Now things are much easier, as this extra complexity is hidden behind a convenient abstraction. In MongoDB, time series are not actual collections, but materialised views over physical collections. This abstraction is intended to simplify the complicated operations based on “buckets” of documents. You can read more about the concept of storing data in the new kind of collection on the official MongoDB &lt;a href="https://www.mongodb.com/developer/how-to/new-time-series-collections/"&gt;blog&lt;/a&gt;; everyone interested in using this solution should take a look. In this article, on the other hand, I would like to verify whether the processing of time series is really as fast as the authors promise. &lt;/p&gt;
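&lt;p&gt;For context, a hand-rolled bucket document might look like the sketch below (the field names and the pre-computed &lt;code&gt;sum&lt;/code&gt;/&lt;code&gt;count&lt;/code&gt; fields are my own illustration of the pattern, not MongoDB’s internal format):&lt;/p&gt;

```javascript
// Simplified bucket-pattern document: one document holds an hour's
// worth of ratings instead of one document per rating, which keeps
// the collection small and makes averages over a bucket cheap.
const bucket = {
  _id: '2021-10-22T00', // bucket key: the hour the ratings fall into
  count: 3,
  sum: 7, // maintained on every insert, so no full scan is needed
  ratings: [
    { timestamp: '2021-10-22T00:00:00Z', rating: 2 },
    { timestamp: '2021-10-22T00:01:00Z', rating: 4 },
    { timestamp: '2021-10-22T00:02:00Z', rating: 1 }
  ]
};

const average = bucket.sum / bucket.count;
console.log(average.toFixed(2)); // 2.33
```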

&lt;h2&gt;
  
  
  Data preparation
&lt;/h2&gt;

&lt;p&gt;For the purposes of our considerations I will use a simple document describing the rating in the form of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "timestamp" : ISODate("2021-10-22T00:00:00.000Z"),
    "rating" : 2.0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attentive readers will immediately notice the absence of a field containing the ID of the seller whom the rating concerns. This was intentional; otherwise I would have had to create an additional index covering this field, and I did not want to introduce any elements into my experiment that could affect the results. Let’s assume for this experiment that we are rating a restaurant rather than Allegro sellers, so all ratings in the collection concern a single restaurant. &lt;/p&gt;

&lt;p&gt;Now we can create two collections storing an identical set of data. One will be a standard collection, and the other a time series collection. The time series collection has to be created manually, indicating which field holds the time stamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;coll-ts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;timeseries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;timeField&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you did not install MongoDB 5.0 from scratch but upgraded from a previous version, you should make sure that it is set to an adequate feature compatibility level. Otherwise, the above command will not create a time series collection.&lt;br&gt;
You can check it with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;adminCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;getParameter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;featureCompatibilityVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the returned value is less than 5.0, you need to issue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;adminCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;setFeatureCompatibilityVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;5.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upon creating the collection, it is also worth checking that a time series collection has actually been created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;runCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;listCollections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We look for the &lt;code&gt;"type" : "timeseries"&lt;/code&gt; entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "name" : "coll-ts",
    "type" : "timeseries",
    "options" : {
        "timeseries" : {
            "timeField" : "timestamp",
            "granularity" : "seconds",
            "bucketMaxSpanSeconds" : 3600
        }
    },
    "info" : {
        "readOnly" : false
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will also create the second collection manually (although this is not strictly necessary, because it would be created automatically by the first insert). Since we will want to check the speed of searching data by the time field, we will also create a unique index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;coll-ord&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;coll-ord&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s use the following scripts to fill both collections with 10 million documents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Save as fill-ts.js&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;bulk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;coll-ts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;initializeUnorderedBulkOp&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;startTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2021-10-22T00:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
       &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;startTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getTime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rating&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Save as fill-ord.js&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;bulk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;coll-ord&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;initializeUnorderedBulkOp&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;startTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2021-10-22T00:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
       &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;startTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getTime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rating&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is worth measuring the execution time of the scripts to compare write times to both collections. For this purpose I use the &lt;code&gt;time&lt;/code&gt; command and the &lt;code&gt;mongo&lt;/code&gt; command-line client; &lt;code&gt;test&lt;/code&gt; is the name of my database.&lt;br&gt;
To avoid network latency, I perform the measurements on my laptop with a local instance of MongoDB version 5.0.3 running.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;fill-ts.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;fill-ord.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As expected, filling the time series collection was faster: 3:30.11 vs. 4:39.48. Such a difference can be significant if our system performs many write operations in a short period of time. So right at the start of our experiments, the new collection type takes the lead. &lt;/p&gt;
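&lt;p&gt;To put the difference in perspective, converting both timings to seconds shows a reduction in write time of roughly a quarter:&lt;/p&gt;

```javascript
// 4:39.48 (ordinary collection) and 3:30.11 (time series)
// converted to seconds.
const ordinary = 4 * 60 + 39.48;   // 279.48 s
const timeseries = 3 * 60 + 30.11; // 210.11 s

// Relative reduction in total write time.
const reduction = (ordinary - timeseries) / ordinary;
console.log((reduction * 100).toFixed(1) + '%'); // 24.8%
```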

&lt;p&gt;Now that our collections contain data, we can take a look at the size of the files. On the &lt;a href="https://docs.mongodb.com/manual/core/timeseries-collections/"&gt;manual page&lt;/a&gt; we can read that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Compared to normal collections, storing time series data in time series collections improves query efficiency and&lt;br&gt;
reduces the disk usage&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s find out how true that is.&lt;/p&gt;

&lt;h2&gt;
  
  
  A closer look at the data
&lt;/h2&gt;

&lt;p&gt;In the first place, it is worth making sure that documents in both collections look the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;coll-ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({}).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;coll-ord&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({}).&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both documents should be similar to this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "timestamp" : ISODate("2021-10-22T00:00:00.000Z"),
    "_id" : ObjectId("6184126fc42d1ab73d5208e4"),
    "rating" : 2.0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The documents have the same schema. In addition, the &lt;code&gt;_id&lt;/code&gt; key was automatically generated in both cases, even though our filling script did not set it. &lt;/p&gt;

&lt;p&gt;Let’s move on to indexes now and use the commands: &lt;code&gt;db.getCollection('coll-ord').getIndexes()&lt;/code&gt; and &lt;code&gt;db.getCollection('coll-ts').getIndexes()&lt;/code&gt; to get the indexes of both collections. &lt;/p&gt;

&lt;p&gt;The normal collection has two indexes, one that was created automatically for the &lt;code&gt;_id&lt;/code&gt; key and the one that we created manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// coll-ord:
[
    {
        "v" : 2,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_"
    },
    {
        "v" : 2,
        "key" : {
            "timestamp" : 1.0
        },
        "name" : "timestamp_1",
        "unique" : true
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What is interesting is that the time series collection has no index at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// coll-ts:
[]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lack of an index for the &lt;code&gt;_id&lt;/code&gt; key of course means that, by default, the time series collection will have to perform a &lt;code&gt;COLLSCAN&lt;/code&gt; operation if we want to search documents by &lt;code&gt;_id&lt;/code&gt;. This index is probably missing simply to save disc space and write time, stemming from the assumption that time series collections are searched mainly by time. The lack of an index for the timestamp field is much more surprising. Does it mean that time-based searches in time series will also cause a &lt;code&gt;COLLSCAN&lt;/code&gt; and work slowly? The answer to this question can be found in the documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The internal index for a time series collection is not displayed&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, there actually is an index, but it is different from those created manually, and even from indexes created automatically for the &lt;code&gt;_id&lt;/code&gt; key. As I wrote in another &lt;a href="https://dev.to%20post_url%202021-10-18-comparing-mongodb-composite-indexes%20"&gt;post&lt;/a&gt;, indexes are not all the same, so it’s worth taking a closer look at this one. Let’s check the query execution plans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;coll-ord&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2021-10-22T00:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}).&lt;/span&gt;&lt;span class="nx"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;coll-ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2021-10-22T00:00:00.000Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}).&lt;/span&gt;&lt;span class="nx"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//coll-ord
"winningPlan" : {
    "stage" : "FETCH",
    "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
            "timestamp" : 1.0
        }
    },
    "indexName" : "timestamp_1",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//coll-ts
"winningPlan" : {
    "stage" : "COLLSCAN",
    "filter" : {
        "$and" : [
            {
                "_id" : {
                    "$lte" : ObjectId("6171ff00ffffffffffffffff")
                }
            },
            {
                "_id" : {
                    "$gte" : ObjectId("6171f0f00000000000000000")
                }
            },
            {
                "control.max.timestamp" : {
                    "$_internalExprGte" : ISODate("2021-10-22T00:00:00.000Z")
                }
            },
            {
                "control.min.timestamp" : {
                    "$_internalExprLte" : ISODate("2021-10-22T00:00:00.000Z")
                }
            }
        ]
    },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It turns out that while the plan for the regular collection shows the use of the index, for the time series collection we see a &lt;code&gt;COLLSCAN&lt;/code&gt; operation. This doesn’t mean that the operation is slow, though: the execution times of both queries were similar. We will move on to a more detailed time comparison in a moment; for now, let’s just note that the hidden index of the time series collection follows its own rules: it is not only absent from &lt;code&gt;getIndexes()&lt;/code&gt;, it does not even appear in the execution plan, although it clearly affects the speed of the search. &lt;/p&gt;
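&lt;p&gt;Curiously, the &lt;code&gt;COLLSCAN&lt;/code&gt; filter above bounds &lt;code&gt;_id&lt;/code&gt; even though we queried by &lt;code&gt;timestamp&lt;/code&gt;. This works because the first four bytes of an ObjectId encode a creation time in Unix seconds, so a time predicate can be translated into a range of synthetic ObjectIds; the lower bound lies exactly one hour before the queried timestamp, which (if I read the docs correctly) matches the default bucket span for the &lt;code&gt;seconds&lt;/code&gt; granularity. A quick sketch in plain JavaScript decodes both bounds from the plan:&lt;/p&gt;

```javascript
// The first four bytes of an ObjectId encode its creation time in Unix seconds,
// so a time predicate can be rewritten as a range of synthetic ObjectIds.
// Decoding the two bounds from the execution plan above:
function objectIdTime(hex) {
  return new Date(parseInt(hex.slice(0, 8), 16) * 1000);
}

console.log(objectIdTime("6171f0f00000000000000000").toISOString()); // 2021-10-21T23:00:00.000Z
console.log(objectIdTime("6171ff00ffffffffffffffff").toISOString()); // 2021-10-22T00:00:00.000Z
```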

&lt;p&gt;And what happens if we add sorting?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;coll-ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({}).&lt;/span&gt;&lt;span class="nx"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
 "ok" : 0,
 "errmsg" : "PlanExecutor error during aggregation :: caused by :: Sort exceeded memory limit of 104857600 bytes,
 but did not opt in to external sorting.",
 "code" : 292,
 "codeName" : "QueryExceededMemoryLimitNoDiskUseAllowed"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Surprise! The internal index of the time series collection does not support sorting. This means that if we add a sort clause to our query, the operation will take a very long time, or even fail by exceeding the in-memory sort limit (unless we opt in to external sorting with &lt;code&gt;allowDiskUse&lt;/code&gt;, as the error message hints). This is surprising, because I did not find any information about it in the documentation. Therefore, if we plan to sort our data by the time field, we will need to index this field manually. That, of course, means that the benefits of lower disc usage and faster writes will unfortunately diminish. &lt;/p&gt;

&lt;p&gt;Since we are talking about the use of disc space, let’s check the data size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;coll-ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;coll-ord&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will compare several fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;size&lt;/code&gt;: data size before the compression,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;storageSize&lt;/code&gt;: size of data after the compression,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;totalIndexSize&lt;/code&gt;: size of indexes,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;totalSize&lt;/code&gt;: total size of data and indexes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results are gathered in the table below (in bytes, with a space as the thousands separator):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Normal collection&lt;/th&gt;
&lt;th&gt;Time series collection&lt;/th&gt;
&lt;th&gt;Diff&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;size&lt;/td&gt;
&lt;td&gt;570 000 000&lt;/td&gt;
&lt;td&gt;747 855 228&lt;/td&gt;
&lt;td&gt;+ 31%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;storageSize&lt;/td&gt;
&lt;td&gt;175 407 104&lt;/td&gt;
&lt;td&gt;119 410 688&lt;/td&gt;
&lt;td&gt;-31%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;totalIndexSize&lt;/td&gt;
&lt;td&gt;232 701 952&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;totalSize&lt;/td&gt;
&lt;td&gt;408 109 056&lt;/td&gt;
&lt;td&gt;119 410 688&lt;/td&gt;
&lt;td&gt;-70%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As you can see, the raw data of the time series collection takes up more space, but after compression the new collection type turns out to be the winner in the size-on-disc category. After adding the size of the indexes created on the regular collection, the difference grows even larger. We must admit that the way time series data is packed on disc is impressive. &lt;/p&gt;
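&lt;p&gt;For the record, the Diff column above is simply the percentage change relative to the normal collection, truncated toward zero; a quick check against the table’s byte values:&lt;/p&gt;

```javascript
// Reproducing the "Diff" column from the table above: percentage change of the
// time series collection relative to the normal one, truncated toward zero.
const ord = { size: 570000000, storageSize: 175407104, totalSize: 408109056 };
const ts  = { size: 747855228, storageSize: 119410688, totalSize: 119410688 };

const diff = (before, after) => Math.trunc(((after - before) / before) * 100);

console.log(diff(ord.size, ts.size));               // 31
console.log(diff(ord.storageSize, ts.storageSize)); // -31
console.log(diff(ord.totalSize, ts.totalSize));     // -70
```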

&lt;p&gt;Now it’s time to compare query execution times for both collections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed is all that matters
&lt;/h2&gt;

&lt;p&gt;Using the script below (saved as &lt;code&gt;gen-find.sh&lt;/code&gt; file), I generated two files containing commands getting documents from both collections based on the time label:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;RANDOM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;i &lt;span class="o"&gt;=&lt;/span&gt; 1&lt;span class="p"&gt;;&lt;/span&gt; i &amp;lt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; i++ &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="nv"&gt;$RANDOM&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;10000000&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
  &lt;span class="nv"&gt;t&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-jv&lt;/span&gt; +&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;M &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"%Y-%m-%d %H:%M:%S"&lt;/span&gt; &lt;span class="s2"&gt;"2021-10-22 0:00:00"&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"db.getCollection('&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;').find({'timestamp' : new ISODate('&lt;/span&gt;&lt;span class="nv"&gt;$t&lt;/span&gt;&lt;span class="s2"&gt;')})"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; find-&lt;span class="nv"&gt;$2&lt;/span&gt;.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script takes two parameters: the number of queries, and the name of the collection that we want to search. I generated a million queries (this may take some time depending on your hardware, so you can start with a smaller number of queries):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./gen-find.sh 1000000 coll-ts
./gen-find.sh 1000000 coll-ord
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
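&lt;p&gt;A side note: &lt;code&gt;gen-find.sh&lt;/code&gt; relies on the BSD/macOS flags of the &lt;code&gt;date&lt;/code&gt; command (&lt;code&gt;-jv&lt;/code&gt;), so it won’t run as-is on Linux. As a portable sketch, a similar generator can be written in Node.js, with a small seeded PRNG standing in for bash’s seeded &lt;code&gt;RANDOM&lt;/code&gt; (all helper names below are illustrative, not part of the original script):&lt;/p&gt;

```javascript
// Portable sketch of gen-find.sh in Node.js; mulberry32 is a tiny seeded PRNG
// standing in for bash's seeded RANDOM, so runs stay repeatable.
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function genFind(count, coll) {
  const rand = mulberry32(42);                     // fixed seed, like RANDOM=42
  const base = Date.parse("2021-10-22T00:00:00Z"); // same base date as the script
  const lines = [];
  for (let i = 0; i < count; i++) {
    const minutes = Math.floor(rand() * 32768) % 10000000; // mimics $RANDOM % 10000000
    const t = new Date(base + minutes * 60000).toISOString().slice(0, 19);
    lines.push(`db.getCollection('${coll}').find({'timestamp' : new ISODate('${t}')})`);
  }
  return lines.join("\n");
}

console.log(genFind(3, "coll-ts"));
```

Writing the output to a file instead of the console gives the same kind of query script as `find-coll-ts.js`.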



&lt;p&gt;Then I checked the time of the execution of both query sequences using the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;find-coll-ts.js
&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;find-coll-ord.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The standard collection was a bit slower: 16,854 for &lt;code&gt;coll-ord&lt;/code&gt; vs 16,038 for &lt;code&gt;coll-ts&lt;/code&gt;. Although the difference is small (about 5%), another point goes to time series: simple searches are slightly faster than in the regular collection. &lt;/p&gt;

&lt;p&gt;But we have yet to discuss the most interesting part. Time series collections are primarily intended for fast aggregations. Let’s see what the comparison looks like when calculating the arithmetic mean over a given time interval. &lt;/p&gt;

&lt;p&gt;The script below (saved as &lt;code&gt;gen-aggregate.sh&lt;/code&gt;) creates a list of queries calculating the arithmetic mean of ratings for a randomly selected six-hour interval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;RANDOM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;i &lt;span class="o"&gt;=&lt;/span&gt; 1&lt;span class="p"&gt;;&lt;/span&gt; i &amp;lt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; i++ &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;x1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="nv"&gt;$RANDOM&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;10000000&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
  &lt;span class="nv"&gt;x2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; x1 &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;360&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
  &lt;span class="nv"&gt;t1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-jv&lt;/span&gt; +&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;x1&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;M &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"%Y-%m-%d %H:%M:%S"&lt;/span&gt; &lt;span class="s2"&gt;"2021-10-22 0:00:00"&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;t2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-jv&lt;/span&gt; +&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;x2&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;M &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"%Y-%m-%d %H:%M:%S"&lt;/span&gt; &lt;span class="s2"&gt;"2021-10-22 0:00:00"&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"db.getCollection('&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;').aggregate([{ &lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;&lt;span class="s2"&gt;match: { "&lt;/span&gt;timestamp&lt;span class="s2"&gt;" : "&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;&lt;span class="s2"&gt;gte:new ISODate('&lt;/span&gt;&lt;span class="nv"&gt;$t1&lt;/span&gt;&lt;span class="s2"&gt;'),&lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;&lt;span class="s2"&gt;lt:new ISODate('&lt;/span&gt;&lt;span class="nv"&gt;$t2&lt;/span&gt;&lt;span class="s2"&gt;')} } },"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"{ &lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;&lt;span class="s2"&gt;group: { _id: null, avg: { &lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;&lt;span class="s2"&gt;avg: &lt;/span&gt;&lt;span class="se"&gt;\"\$&lt;/span&gt;&lt;span class="s2"&gt;rating&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; } } }])"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aggregate-&lt;span class="nv"&gt;$2&lt;/span&gt;-&lt;span class="nv"&gt;$1&lt;/span&gt;.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I prepared three script sets with: 10K, 50K and 100K queries for both collections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./gen-aggregate.sh 10000 coll-ts
./gen-aggregate.sh 50000 coll-ts
./gen-aggregate.sh 100000 coll-ts

./gen-aggregate.sh 10000 coll-ord
./gen-aggregate.sh 50000 coll-ord
./gen-aggregate.sh 100000 coll-ord
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I made the measurements using the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;aggregate-coll-ts-10000.js
&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;aggregate-coll-ord-10000.js

&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;aggregate-coll-ts-50000.js
&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;aggregate-coll-ord-50000.js

&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;aggregate-coll-ts-100000.js
&lt;span class="nb"&gt;time &lt;/span&gt;mongo &lt;span class="nb"&gt;test &lt;/span&gt;aggregate-coll-ord-100000.js

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results are shown in the table below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Number of queries&lt;/th&gt;
&lt;th&gt;Normal collection [min:sec]&lt;/th&gt;
&lt;th&gt;Time series [min:sec]&lt;/th&gt;
&lt;th&gt;Diff&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10000&lt;/td&gt;
&lt;td&gt;0:21,947&lt;/td&gt;
&lt;td&gt;0:16,835&lt;/td&gt;
&lt;td&gt;-23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50000&lt;/td&gt;
&lt;td&gt;1:37,02&lt;/td&gt;
&lt;td&gt;1:11,18&lt;/td&gt;
&lt;td&gt;-26%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100000&lt;/td&gt;
&lt;td&gt;4:33,29&lt;/td&gt;
&lt;td&gt;2:21,37&lt;/td&gt;
&lt;td&gt;-48%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Although there are far fewer queries this time than in the previous experiment, the differences in times are much greater. This clearly shows that time series collections really spread their wings on aggregation queries, and that they have been developed mainly for this purpose. The reason is probably the sequential way data is written, which gives a noticeable advantage in range queries. The results also show how much impact an element that often doesn’t get adequate attention, proper data modelling, can have on query performance. &lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;Of course, time series collections also have their limitations and should not be used as a golden hammer. We have already mentioned the first one: the lack of an index on the &lt;code&gt;_id&lt;/code&gt; key. It stems from the assumption that searches will be based primarily on time, so there is no point in creating an index that few users would benefit from. Of course, we can create this index ourselves. &lt;/p&gt;

&lt;p&gt;Sorting by the time field is not backed by an index either, which is another inconvenience. If we want sorting queries, we have to create an additional index ourselves. &lt;/p&gt;

&lt;p&gt;Although these two missing indexes may seem easy to fix, we must remember that adding them means extra disc space as well as longer indexing time during document writes, so the benefits stemming from the use of time series will be somewhat reduced. &lt;/p&gt;

&lt;p&gt;The third, and perhaps most important limitation is the immutability of documents. Once saved, documents cannot be updated or deleted. The only way to delete data is to use the &lt;code&gt;drop()&lt;/code&gt; command, or to define retention for the collection using the &lt;code&gt;expireAfterSeconds&lt;/code&gt; parameter, which, like TTL indexes, automatically deletes documents a certain time after creation. &lt;/p&gt;
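&lt;p&gt;To illustrate, retention is declared up front, when the collection is created. A minimal sketch, assuming MongoDB 5.0+ (collection and field names are illustrative):&lt;/p&gt;

```javascript
// Options for a time series collection with retention: the server removes
// documents once their timeField value is older than expireAfterSeconds.
const options = {
  timeseries: { timeField: "timestamp", granularity: "seconds" },
  expireAfterSeconds: 30 * 24 * 3600, // keep roughly 30 days of data
};

// Against a live database you would run:
// db.createCollection("coll-ts", options)
console.log(options.expireAfterSeconds); // 2592000
```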

&lt;p&gt;The inability to modify saved documents will probably be the main reason why programmers hesitate to use time series collections. We should mention, however, that the authors of MongoDB will probably add the possibility to edit and delete documents in the future: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;While we know some of these limitations may be impactful to your current use case, we promise we’re working on this&lt;br&gt;
right now and would love for you to provide your feedback!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Adding the ability to store time series in MongoDB is a step in the right direction. First tests show that in certain cases the new type of collection really does work better than the regular one: it uses less disc space, and is faster both at writing and at searching by time. But high performance always comes at a cost, in this case the cost of reduced flexibility. Therefore, the final decision to use time series should be preceded by an analysis of the advantages and disadvantages of both options in the particular case. We can also hope that the authors of the database are working on improvements and will soon eliminate most of these limitations.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>database</category>
      <category>json</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Easy tips to reduce MongoDB database size in no time</title>
      <dc:creator>Michał Knasiecki</dc:creator>
      <pubDate>Mon, 13 Nov 2023 07:58:55 +0000</pubDate>
      <link>https://dev.to/mknasiecki/three-tips-to-reduce-mongodb-database-size-8a6</link>
      <guid>https://dev.to/mknasiecki/three-tips-to-reduce-mongodb-database-size-8a6</guid>
      <description>&lt;p&gt;In this series of articles, I show how we can reduce database size and improve it’s performance just by using the right structures and proper data types. In the &lt;a href="https://dev.to/mknasiecki/how-to-reduce-mongodb-database-size-just-by-changing-json-structure-ddh"&gt;previous post&lt;/a&gt;, we were checking how json structure may influence database storage size. The results were promising, so it's worth taking a broader look at the topic. Therefore what else can we do to reduce the size of data on disk?&lt;/p&gt;

&lt;h3&gt;
  
  
  Enumerated types
&lt;/h3&gt;

&lt;p&gt;Now, let's also look at the enumerated types. You can store them in two ways, either by using a label:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ALLEGRO"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or by ordinal value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here an analogy with the first experiment can be found: again we replace a character string with a numerical value.&lt;br&gt;
Since we got a great result the first time, maybe we could repeat it here? &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xozyc9uK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5u5s6c6ejmm7z2dpvyoj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xozyc9uK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5u5s6c6ejmm7z2dpvyoj.png" alt="label vs index size" width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, the result is disappointing and the reduction in size is small. However, if we think about it more deeply, we will come to the conclusion that it could not have been otherwise. Enumerated types are not unique identifiers: we are dealing with only a few possible values, appearing many times in the whole collection, so it’s a great chance for MongoDB’s data compression to prove its worth. Most likely the repeated labels in the first collection compress down to nearly the same size as their ordinal counterparts. Additionally, there is no gain from the schema here, because it is the same in both cases. The situation might be different for collections that do not have data compression enabled.&lt;/p&gt;

&lt;p&gt;This is a good time to take a closer look at how &lt;a href="https://www.mongodb.com/blog/post/new-compression-options-mongodb-30"&gt;snappy&lt;/a&gt; compression works in MongoDB. I've prepared two more collections, with identical data, but with data compression turned off. The results are shown in the chart below, compiled together with the collections with data compression turned on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gEyGv2yX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oyo6wj2s8dhyuh0ogmfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gEyGv2yX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oyo6wj2s8dhyuh0ogmfp.png" alt="snappy vs plain size" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is clear that using an ordinal value instead of the label of an enumerated type brings considerable profit for collections with data compression disabled. When compression is off, it is definitely worth considering numerical values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Useless _class field
&lt;/h3&gt;

&lt;p&gt;If we use &lt;code&gt;spring-data&lt;/code&gt;, each of our collections additionally contains a &lt;code&gt;_class&lt;/code&gt; field storing the package and the name of the entity class containing the mapping for documents from this collection. This mechanism supports inheritance of model classes, which is not widely used. In most cases this field stores exactly the same value for every document, which makes it useless. And how much does it cost? Let’s compare the following documents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"_class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pl.allegro.some.project.domain.sample"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iwf9igu---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jxxn7ld4rdgc9qyom6ct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iwf9igu---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jxxn7ld4rdgc9qyom6ct.png" alt="_class field size" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The difference is considerable: over 50%. Bearing in mind that compression is enabled, I believe the impact of the data itself is small and the result comes from the schema: the document containing only the key is half the size of the one that also stores the &lt;code&gt;_class&lt;/code&gt; field (1 field vs 2 fields). After removing the &lt;code&gt;_class&lt;/code&gt; field from a document with more fields, the percentage difference will of course be much smaller. Still, storing useless data makes no sense.&lt;/p&gt;
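&lt;p&gt;The per-document cost of &lt;code&gt;_class&lt;/code&gt; is easy to estimate from the BSON format: every element stores a type byte and its NUL-terminated field name, and a string value adds a 4-byte length prefix plus a terminator. A quick sketch using the example class name from above:&lt;/p&gt;

```python
def bson_string_element_size(name: str, value: str) -> int:
    # 1 type byte + field name + NUL + 4-byte length prefix + UTF-8 bytes + NUL
    return 1 + len(name.encode()) + 1 + 4 + len(value.encode()) + 1

per_doc = bson_string_element_size("_class", "pl.allegro.some.project.domain.sample")
print(per_doc)                       # bytes added to every document
print(per_doc * 1_000_000 // 2**20)  # roughly how many MB per million documents, before compression
```

&lt;p&gt;In &lt;code&gt;spring-data&lt;/code&gt; the field can be removed by configuring the converter's type mapper (for example, a &lt;code&gt;DefaultMongoTypeMapper&lt;/code&gt; created with a &lt;code&gt;null&lt;/code&gt; type key), as long as you do not rely on polymorphic mappings.&lt;/p&gt;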

&lt;h3&gt;
  
  
  Useless indexes
&lt;/h3&gt;

&lt;p&gt;When investigating the case with identifiers stored as strings, I checked whether manipulating the data type could reduce the index size. Unfortunately I did not succeed, so I decided to focus on the problem of redundant indexes.&lt;/p&gt;

&lt;p&gt;It is usually good practice to cover each query with an index so that the number of &lt;code&gt;collscan&lt;/code&gt; operations is as small as possible. This often leads to a situation where we have too many indexes: we add more queries, create new indexes for them, and it often turns out that a newly created index takes over the queries of another one. As a result, we maintain many unnecessary indexes, wasting disk space and CPU time.&lt;/p&gt;

&lt;p&gt;Fortunately, it is possible to check the number of index uses with the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([{&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;indexStats&lt;/span&gt;&lt;span class="p"&gt;:{}}])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It allows you to find indexes that are not used so you can safely delete them.&lt;/p&gt;
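&lt;p&gt;Once we have the &lt;code&gt;$indexStats&lt;/code&gt; output (each result document carries the index name and an &lt;code&gt;accesses.ops&lt;/code&gt; usage counter, counted since the last server restart), finding candidates for deletion is a simple filter. A sketch in Python over a sample result (the index names are made up):&lt;/p&gt;

```python
# Sample $indexStats output, trimmed to the fields we need here
index_stats = [
    {"name": "_id_",        "accesses": {"ops": 52_311}},
    {"name": "userId_1",    "accesses": {"ops": 0}},
    {"name": "createdAt_1", "accesses": {"ops": 1_204}},
]

# Indexes that have never been used since the counters started
unused = [s["name"] for s in index_stats if s["accesses"]["ops"] == 0]
print(unused)  # candidates for hiding and, eventually, removal
```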

&lt;p&gt;And finally, one more thing. Indexes can be removed quickly, but they take much longer to rebuild. This means the consequences of removing an index by mistake can be severe, so it is better to make sure that the index to be deleted is definitely not used by any query. MongoDB 4.4 introduced a command that helps avoid such mistakes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hideIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;index&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command hides the index from the query optimizer: the optimizer no longer takes it into account when building query execution plans, but the index is still updated when documents are modified. Thanks to that, if it turns out the index was needed after all, we can restore it immediately and it will still be up to date.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Using a few simple techniques, preparing several versions of the schema and using the &lt;code&gt;stats()&lt;/code&gt; command, we are able to design a model that does not overload our infrastructure. I encourage you to experiment with your own databases. Maybe you too can save some disk space and CPU time.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>database</category>
      <category>json</category>
      <category>nosql</category>
    </item>
    <item>
      <title>How to reduce MongoDB database size just by changing json structure</title>
      <dc:creator>Michał Knasiecki</dc:creator>
      <pubDate>Mon, 06 Nov 2023 10:37:52 +0000</pubDate>
      <link>https://dev.to/mknasiecki/how-to-reduce-mongodb-database-size-just-by-changing-json-structure-ddh</link>
      <guid>https://dev.to/mknasiecki/how-to-reduce-mongodb-database-size-just-by-changing-json-structure-ddh</guid>
      <description>&lt;p&gt;In this series of articles, I show how we can reduce database size and improve it’s performance just by using the right structures and proper data types. In the &lt;a href="https://dev.to/mknasiecki/impact-of-id-field-type-on-mongodb-database-size-hm0"&gt;previous post&lt;/a&gt;, we looked at how the &lt;code&gt;_id&lt;/code&gt; field type affects the size of the database. The results were promising, so it's worth taking a broader look at the topic. Therefore what else can we do to reduce the size of data on disk?&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple and complex structures
&lt;/h3&gt;

&lt;p&gt;Let's move on to more complex data. Our models are often built from smaller structures grouping data such as user, order or offer. I decided to compare the size of documents storing such structures and their flat counterparts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jan"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"surname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Nowak"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"userName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jan"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"userSurname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Nowak"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In both cases we store identical data, the documents differ only in the schema. Take a look at the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F843wih2lncg0n06gjoae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F843wih2lncg0n06gjoae.png" alt="complex vs simple size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is a slight reduction of 0.4MB. It may not seem like much compared to the effect we achieved for the field containing an ID. However, we have to bear in mind that in this case we were dealing with a more complex document: in addition to the compared fields, it contained a numerical identifier that, as we remember from previous experiments, takes up about 5MB of disk space.&lt;/p&gt;

&lt;p&gt;Taking this into account, the above results represent a decrease from 3.4MB to 3MB. As a percentage it looks better: we saved 12% of the space needed to store the personal data.&lt;/p&gt;

&lt;p&gt;Let's go back to the discussed documents for a moment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jan"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"surname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Nowak"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"userName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jan"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"userSurname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Nowak"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A watchful eye will notice that I used longer field names in the flattened document: instead of &lt;code&gt;user.name&lt;/code&gt; and &lt;code&gt;user.surname&lt;/code&gt; I used &lt;code&gt;userName&lt;/code&gt; and &lt;code&gt;userSurname&lt;/code&gt;. I did it automatically, a bit unconsciously, to make the resulting &lt;code&gt;JSON&lt;/code&gt; more readable. However, if changing only the schema of the document from compound to flat reduced the size of the data, maybe it is worth going a step further and shortening the field names? &lt;/p&gt;

&lt;p&gt;I decided to add a third document for comparison, flattened and with shorter field names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jan"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"surname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Nowak"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results are shown in the chart below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0cbcfxec65chqf9za4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0cbcfxec65chqf9za4n.png" alt="complex vs simple vs short size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result is even better than with flattening alone: excluding the document key, we achieved a decrease from 3.4MB to 2MB. Why does this happen even though we store exactly the same data? &lt;/p&gt;

&lt;p&gt;The reason for the decrease lies in the nature of NoSQL databases, which, unlike relational ones, do not define a schema at the level of the entire collection. If someone is very stubborn, they can store user data, offers, orders and payments in one collection, and it would still be possible to read, index and search that data. This is because the database stores, in addition to the data itself, the schema with each document. Thus, the total size of a document consists of its data and its schema, and that explains the whole puzzle: by reducing the size of the schema, we also reduce the size of each document, i.e. the size of the final file with the collection data. It is worth taking this into account when designing a collection schema in order not to inflate it too much. Of course, you cannot go to extremes, because that would lead to fields named &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt;, which would make the collection completely illegible to a human. &lt;/p&gt;

&lt;p&gt;For very large collections, however, this approach is used; an example is the MongoDB operation log, which contains fields named:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;h&lt;/li&gt;
&lt;li&gt;op&lt;/li&gt;
&lt;li&gt;ns&lt;/li&gt;
&lt;li&gt;o&lt;/li&gt;
&lt;li&gt;o2&lt;/li&gt;
&lt;li&gt;b&lt;/li&gt;
&lt;/ul&gt;
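&lt;p&gt;The overhead of field names can be estimated the same way as before: in BSON, every element carries a type byte plus the NUL-terminated field name. A sketch comparing the long and the short flattened variants:&lt;/p&gt;

```python
def name_overhead(field_names) -> int:
    # per element: 1 type byte + UTF-8 field name + NUL terminator
    return sum(1 + len(n.encode()) + 1 for n in field_names)

long_variant = name_overhead(["_id", "userName", "userSurname"])
short_variant = name_overhead(["_id", "name", "surname"])

saved_per_doc = long_variant - short_variant
print(saved_per_doc)              # bytes saved per document
print(saved_per_doc * 1_000_000)  # bytes saved per million documents, before compression
```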

&lt;h3&gt;
  
  
  Empty fields
&lt;/h3&gt;

&lt;p&gt;While we are on the subject of the document schema, it is also worth looking at the problem of empty fields. In &lt;code&gt;JSON&lt;/code&gt; the absence of a value in a field can be expressed in two ways: either explicitly, by writing null as its value, or by not serializing the field at all. I prepared a comparison of documents with the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"phone"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The meaning of the data in both documents is identical - the user has no phone number. However, the schema of the second document is different from the first one because it contains two fields.&lt;/p&gt;

&lt;p&gt;Here are the results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe42xl0thczb0mb0gojlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe42xl0thczb0mb0gojlt.png" alt="null vs empty size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results are quite surprising: storing a million nulls is quite expensive, as it takes more than 1MB of disk space.&lt;/p&gt;
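&lt;p&gt;Most serializers can be told to skip empty fields instead of writing nulls; in Jackson, for example, this is the &lt;code&gt;NON_NULL&lt;/code&gt; inclusion setting. The effect can be sketched as follows:&lt;/p&gt;

```python
import json

def drop_nulls(doc: dict) -> dict:
    # Keep only the fields that actually carry a value
    return {k: v for k, v in doc.items() if v is not None}

doc = {"id": 1, "phone": None}
print(json.dumps(doc))              # {"id": 1, "phone": null}
print(json.dumps(drop_nulls(doc)))  # {"id": 1}
```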

&lt;p&gt;That's it for today. In the next entry we will take a look at enumerations and unused fields. If you find this topic interesting, please let me know by like/unicorn, so I know it's worth continuing.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>database</category>
      <category>json</category>
      <category>nosql</category>
    </item>
    <item>
      <title>How to shrink MongoDB database size just by changing _id field type</title>
      <dc:creator>Michał Knasiecki</dc:creator>
      <pubDate>Mon, 30 Oct 2023 14:52:41 +0000</pubDate>
      <link>https://dev.to/mknasiecki/impact-of-id-field-type-on-mongodb-database-size-hm0</link>
      <guid>https://dev.to/mknasiecki/impact-of-id-field-type-on-mongodb-database-size-hm0</guid>
      <description>&lt;p&gt;MongoDB is by far one of the most popular databases used in the modern web world. This is because it is based on a simple idea: since we send data as &lt;code&gt;json&lt;/code&gt; through the web, why not store it in the same form in a database? &lt;/p&gt;

&lt;p&gt;This definitely makes the job easier and faster for developers. But storing data in the same form as we send it over the network is not necessarily optimal from a database point of view. Database creators implement various solutions (sometimes very &lt;a href="https://dev.to/mknasiecki/probabilistic-data-structures-and-algorithms-in-nosql-databases-4mp7"&gt;smart ones&lt;/a&gt;) based on certain assumptions; if we ignore them, we will not use the full potential of their product.&lt;/p&gt;

&lt;p&gt;To shed some light on this interesting topic, in this series of articles I will show how we can reduce database size and improve its performance just by using the right structures and proper data types. From now on, every Monday I will publish another post on this topic. If you find it interesting, please let me know with a like/unicorn, so I know it's worth continuing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;So I was tuning one of our services in order to speed up some MongoDB queries. Incidentally, my attention was caught by the size of one of the collections, which contains archived objects and is therefore rarely used. Unfortunately I wasn't able to reduce the size of the documents stored there, but I started to wonder: is it possible to store the same data in a more compact way? Mongo stores &lt;code&gt;JSON&lt;/code&gt;, which allows many different ways of expressing similar data, so there seems to be room for improvement. &lt;/p&gt;

&lt;p&gt;It is worth asking: why make such an effort in the era of Big Data and unlimited resources? There are several reasons.&lt;/p&gt;

&lt;p&gt;First of all, the resources are not unlimited: at the end of the day there are physical drives that cost money to buy, replace, maintain and power. &lt;/p&gt;

&lt;p&gt;Secondly, less stored data means less time to access it. Moreover, less data means that more of it will fit into the cache, so subsequent accesses will be an order of magnitude faster. &lt;/p&gt;

&lt;p&gt;I decided to do some experiments and check how the model design affects the size of database files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;I used a local MongoDB Community Edition 4.4 installation and I initially tested collections containing 1 million and 10 million documents. One of the variants contained up to 100 million, but the results were proportional (nearly linear). Therefore, in the end I decided to stop at 1M collections, because loading the data was simply much faster. &lt;/p&gt;

&lt;p&gt;Having access to local database files, I could easily check the size of the files storing individual collections. However, it turned out to be unnecessary, because the same data can be obtained with the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'COLLECTION_NAME'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmwkoa1vrm6c3ionxjhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmwkoa1vrm6c3ionxjhy.png" alt="Collection stats"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following fields are expressed in bytes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;size&lt;/code&gt;: size of all collection documents, before compression,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;avgObjSize&lt;/code&gt;: average document size, before compression,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;storageSize&lt;/code&gt;: file size on the disk; this is the value after the data is compressed, the same value is returned by
&lt;code&gt;ls&lt;/code&gt; command,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;freeStorageSize&lt;/code&gt;: size of unused space allocated for the data; Mongo does not increase the file size byte-by-byte,
but allocates a percentage of the current file size and this value indicates how much data will still fit into the file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To present the results I used the (storageSize - freeStorageSize) value, which indicates the actual space occupied by the data.&lt;/p&gt;
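&lt;p&gt;In other words, given a &lt;code&gt;stats()&lt;/code&gt; result, the occupied space follows directly from those two fields (the numbers below are made up):&lt;/p&gt;

```python
# Hypothetical stats() output, trimmed to the relevant fields (bytes)
stats = {
    "size": 5_000_000,           # all documents, before compression
    "storageSize": 2_400_000,    # file size on disk, after compression
    "freeStorageSize": 400_000,  # allocated but unused space
}

actual = stats["storageSize"] - stats["freeStorageSize"]
print(actual)  # space actually occupied by the data
```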

&lt;p&gt;My local MongoDB instance had compression enabled. Not every storage engine has this option enabled by default, so when you start your own analysis it is worth checking how it is in your particular case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip 1: ID field type
&lt;/h3&gt;

&lt;p&gt;In the beginning I decided to check &lt;code&gt;ID&lt;/code&gt; fields. Not the collection primary key, mind you – it is of &lt;code&gt;ObjectId&lt;/code&gt; type by default and generally shouldn't be changed. I decided to focus on user and offer identifiers which, although numerical, are often saved as Strings in Mongo. I believe this is partly due to the contract of our services: identifiers in &lt;code&gt;JSON&lt;/code&gt; most often arrive as Strings, and in this form they are later stored in our databases. Let’s start with some theory: a number of &lt;code&gt;int32&lt;/code&gt; type in Mongo has a fixed size of 4 bytes. The same number written as a &lt;code&gt;String&lt;/code&gt; has a size that depends on the number of digits; additionally it contains the length of the text (4 bytes) and a terminator character. For example, the text "0" is 12 bytes long and "1234567890" is 25 bytes long. So in theory we should see interesting results, but what does it look like in reality? &lt;/p&gt;
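&lt;p&gt;The theory above is easy to check with quick arithmetic. The sketch below counts only the value encoding (the stored element additionally includes a type byte and the field name, which affects both variants equally and pushes the totals up):&lt;/p&gt;

```python
def bson_int32_size() -> int:
    return 4  # fixed-width value

def bson_string_size(s: str) -> int:
    # 4-byte length prefix + UTF-8 bytes + NUL terminator
    return 4 + len(s.encode()) + 1

for ident in ("0", "1234567890"):
    print(ident, bson_string_size(ident), "vs", bson_int32_size())
```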

&lt;p&gt;I prepared 2 collections, one million documents each, containing the following documents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The values of identifiers were consecutive natural numbers. Here is the comparison of results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs77d6u7ddzrfqvoo5w2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs77d6u7ddzrfqvoo5w2i.png" alt="String vs int32 size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the difference is significant: the size on disk decreased by half. It is also worth noting that my data is synthetic and the identifiers start from 1. The advantage of a numerical identifier over a &lt;code&gt;String&lt;/code&gt; grows with the number of digits, so the benefits should be even greater for real-life data. &lt;/p&gt;

&lt;p&gt;Encouraged by this success, I decided to check whether the field type had any influence on the size of an index created on it. Here, unfortunately, I was disappointed: the sizes were similar. This is due to the fact that MongoDB uses a hash function when creating indexes, so physically both indexes are composed of numerical values. However, if we are dealing with hashing, maybe at least searching by index on a numerical field is faster? &lt;/p&gt;

&lt;p&gt;I compared searching for one million and 10 million documents by indexed key, in a random but identical order for both collections. Again, a miss: both tests ended with similar results. So the conclusion is this: it is worth using numerical identifiers because they require less disk space, but we will not get additional index-related benefits from these fields. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zenowegble6eqykzggl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zenowegble6eqykzggl.png" alt="String vs int32 search time 1M"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmdl87gng3tw9rdq0qfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmdl87gng3tw9rdq0qfo.png" alt="String vs int32 search time 10M"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;That's it for today. In the next entry we will check whether it is better to use flat or nested &lt;code&gt;json&lt;/code&gt; structures.&lt;br&gt;
If you find this topic interesting, please let me know by like/unicorn, so I know it's worth continuing.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>database</category>
      <category>nosql</category>
      <category>json</category>
    </item>
    <item>
      <title>GC, hands off my data!</title>
      <dc:creator>Michał Knasiecki</dc:creator>
      <pubDate>Fri, 27 Oct 2023 17:54:34 +0000</pubDate>
      <link>https://dev.to/mknasiecki/gc-hands-off-my-data-ap5</link>
      <guid>https://dev.to/mknasiecki/gc-hands-off-my-data-ap5</guid>
      <description>&lt;p&gt;Certainly one of the main distinguishing features of the Java world is the Garbage Collector. Using it is safe and convenient, it allows us to forget about many tedious responsibilities, letting us focus on the pure joy of coding. Yet sometimes it can cause a headache too, especially when we notice that GC uses our resources too intensively. Each of us has probably experienced a time in our career when we wanted to get rid of the Garbage Collector from our application because it was running too long, too often, and perhaps even led to temporary system freezes. &lt;/p&gt;

&lt;p&gt;What if we could still benefit from the GC, but in special cases, also be able to store data beyond its control? We could still take advantage of its convenience and, at the same time, be able to easily get rid of long GC pauses.&lt;/p&gt;

&lt;p&gt;It turns out that it is possible. In this article, we will look at whether and when it is worth storing data beyond the reach of the Garbage Collector’s greedy hands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comfort comes at a price
&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://allegro.tech" rel="noopener noreferrer"&gt;Allegro&lt;/a&gt; we are very keen on metrics. We measure anything that can tell us something about the condition of our services. Apart from the most obvious metrics directly related to the application, such as throughput, the number of errors, CPU and memory usage, we also pay a great deal of attention to metrics related to the garbage collecting — GC working time and number of its cycles. Too much time spent on releasing the memory or too frequent GC launches may signal problems with memory leaks or indicate that it is worth considering optimising memory usage or switching to a different GC strategy. &lt;/p&gt;

&lt;p&gt;Following the example of large technology companies, we have been organising company meetups within the so-called guilds for some time now. In one of such guilds, over a hundred engineers meet regularly once a month and discuss various topics related to performance, scaling and service optimisation. At one of these meetings, our colleague discussed the method of determining the actual size of data stored in a cache. Apparently, this is not a simple matter, as internal mechanisms for optimising memory usage, such as deduplication or compression, must be taken into account. After the presentation, an interesting discussion ensued about how much memory on the heap is actually used by the cache and how long it takes to clean it up. Someone pointed out that there is a hidden cost of using the cache that takes the form of time needed to free the memory of expired cache items, which not everyone is aware of. What is more, the manner in which the cache works does not quite fit the &lt;a href="http://insightfullogic.com/2013/Feb/20/garbage-collection-java-1/" rel="noopener noreferrer"&gt;generational hypothesis&lt;/a&gt; and may mislead the JVM by preventing it from properly tuning the GC mechanism. I then began to wonder whether it might not be worth keeping the cache in an area excluded from the GC’s control? I knew this is possible, although I had never seen a practical implementation of this technique. This topic was bothering me for some time, so I decided to investigate. &lt;/p&gt;

&lt;h2&gt;
  
  
  Memory architecture
&lt;/h2&gt;

&lt;p&gt;Any skilled Java programmer knows the division of memory into the young and old generation areas. Those interested in the details are probably also familiar with the finer division into eden, survivor, tenured and (in older JVMs) perm. There are many excellent articles on this topic (like &lt;a href="https://www.betsol.com/blog/java-memory-management-for-java-virtual-machine-jvm/" rel="noopener noreferrer"&gt;this one&lt;/a&gt;), so we won’t go into details. Instead, we will focus on a very specialised area of memory over which the GC has no control: off-heap memory, sometimes also called native memory. This is a special area under the direct control of the operating system, which the JVM uses for its own purposes: it stores information about classes and methods, internal thread data and cached code necessary for operation. As I mentioned earlier, off-heap memory is not subject to the GC. In particular, it is excluded from garbage collection processes, which means that the programmers writing JVM code that uses this area are wholly responsible for freeing the memory allocated for variables. There is also a dedicated part of this space that we — the programmers — can access as well. We can write and read data there, remembering, of course, that the responsibility for cleaning up after unneeded variables lies entirely with us. &lt;/p&gt;

&lt;p&gt;This area can be accessed using a simple API.&lt;br&gt;
The following code allocates 100 bytes of off-heap memory and stores a String and an Integer.&lt;br&gt;
At the end the data are loaded from the off-heap memory and then printed out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;ByteBuffer&lt;/span&gt; &lt;span class="n"&gt;buff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ByteBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;allocateDirect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;buff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Michal"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="n"&gt;buff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;putInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;buff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;position&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// set the pointer back to the beginning&lt;/span&gt;

&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// length of my name&lt;/span&gt;
&lt;span class="n"&gt;buff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getInt&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;allocateDirect&lt;/code&gt; method, which allocates off-heap memory, unlike its sibling &lt;code&gt;allocate&lt;/code&gt;, which allocates on-heap memory. The behaviour of the two methods can be compared with the help of a profiler (I will use &lt;a href="https://openjdk.java.net/tools/svc/jconsole/" rel="noopener noreferrer"&gt;JConsole&lt;/a&gt;). The following programs allocate 1 GB of memory on-heap and off-heap, respectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ByteBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;allocate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ByteBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;allocateDirect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chart below shows heap memory profile comparison for both programs (on-heap on the left vs. off-heap on the right):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g45jd1an5g1vxyisz70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g45jd1an5g1vxyisz70.png" alt="on-heap vs off-heap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The possibility of bypassing the garbage collector may seem extremely tempting to developers struggling with long GC times. However, this raises some questions: what kind of usage justifies the extra effort of manually freeing memory and the risk of errors that comes with it? What are the advantages of using off-heap memory? Is it faster? How much time do we save by bypassing the GC? Why is this technique so uncommon? To put it simply: is it worth doing and, if so, when? &lt;/p&gt;

&lt;h2&gt;
  
  
  Begone, GC!
&lt;/h2&gt;

&lt;p&gt;The GC is a wonderful tool. It allows us – although sometimes only for a while – to forget about the pains of manual memory management. We can create variables of almost any type and scope freely, without worrying about what happens to the memory once we stop using them. That task is handled by the GC, which does it brilliantly. Each successive version of the JDK brings a new algorithm that, in specific cases, performs even better than the previous one. &lt;/p&gt;

&lt;p&gt;However, I’m quite sure that many of us have at some point encountered the problem of long GC times or overly frequent GC runs. Every developer has their own ideas on how to deal with this issue: we look for memory leaks, profile the application in search of hot spots, examine the scope of created variables, use object pools, verify the system behaviour with different GC algorithms, and check the cache configuration. &lt;/p&gt;

&lt;p&gt;In my case, it is often the cache that is responsible for long GC times. It sometimes stores large numbers of objects, usually complex ones containing references to other objects. What is more, the access pattern for cached objects is often not uniform: some objects are never queried after being inserted into the cache, while others are read throughout their whole lifecycle. This causes the cache to disrupt the somewhat ideal world order defined by the generational hypothesis. GC algorithms are then faced with the very difficult task of determining the optimal way to clean up the memory freed by items removed from the cache. All this makes cache cleanup expensive, and it made me wonder whether there was any benefit in storing cache data outside the heap. &lt;/p&gt;

&lt;h2&gt;
  
  
  Off-heap space: Pros and cons
&lt;/h2&gt;

&lt;p&gt;In a sense, the off-heap space lies outside the control of the JVM (though it belongs to the Java process), and for this reason it is not possible to write the complex structures used in JVM languages directly into it. This raises the need for an intermediate step: serialising the data into a plain byte array, which can then be stored in the off-heap area. When the data is loaded, the reverse process must be performed: deserialisation into a form that we can use in Java. These additional steps obviously come at an extra cost, which is why accessing off-heap data takes longer than accessing on-heap data directly. &lt;/p&gt;
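
&lt;p&gt;This round trip can be sketched with nothing but a direct &lt;code&gt;ByteBuffer&lt;/code&gt;: the value is turned into a length-prefixed UTF-8 byte array on the way in and rebuilt into a &lt;code&gt;String&lt;/code&gt; on the way out. This is a stand-in for what an off-heap cache does internally, not any particular library’s API:&lt;/p&gt;

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class OffHeapRoundTrip {
    public static void main(String[] args) {
        String offerDescription = "A very nice offer";

        // Serialize: turn the object into a plain byte array...
        byte[] payload = offerDescription.getBytes(StandardCharsets.UTF_8);

        // ...and copy it into off-heap memory, length-prefixed so we know
        // how many bytes to read back later.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(Integer.BYTES + payload.length);
        offHeap.putInt(payload.length);
        offHeap.put(payload);

        // Deserialize: rewind, read the length, then rebuild the String on the heap.
        offHeap.flip();
        byte[] read = new byte[offHeap.getInt()];
        offHeap.get(read);
        String restored = new String(read, StandardCharsets.UTF_8);

        System.out.println(restored); // prints "A very nice offer"
    }
}
```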

&lt;p&gt;Since writing and reading data in the off-heap space takes longer, what is the benefit of this approach? Data stored in the off-heap space is not subject to GC processes, so on the one hand we – the programmers – are responsible for freeing memory once a variable is no longer needed. On the other hand, we relieve the JVM’s management processes, freeing up CPU time for the rest of the application, so, theoretically, it should result in some resource savings. The question is whether these effects balance each other out to any degree. Will the savings in the GC process offset the longer data access times? If so, does it depend only on the amount of data, or is there a specific usage scenario? To answer these questions, we need to run a few experiments. &lt;/p&gt;

&lt;h2&gt;
  
  
  Experiments
&lt;/h2&gt;

&lt;p&gt;We can store any data structure in the on-heap area, so the advantage of this approach is that there is no overhead involved in transforming the data into another form, while its disadvantage is the additional cost incurred by the GC. In the case of off-heap storage, on the other hand, there is no extra GC cost, but there is the cost of serialising the data to a byte array. &lt;/p&gt;

&lt;p&gt;Over recent years, significant progress has been made in the field of garbage collection, and with the right match between algorithm and application profile, GC times can be very short. But is there any case where it is worth reaching into the unmanaged space after all? &lt;/p&gt;

&lt;p&gt;I decided to start with an overview of the open-source options currently available. When it comes to on-heap cache implementations, the options are numerous – there are the well-known &lt;a href="https://guava.dev/releases/21.0/api/docs/com/google/common/cache/Cache.html" rel="noopener noreferrer"&gt;Guava&lt;/a&gt;, &lt;a href="https://www.ehcache.org/" rel="noopener noreferrer"&gt;Ehcache&lt;/a&gt;, &lt;a href="https://github.com/ben-manes/caffeine" rel="noopener noreferrer"&gt;Caffeine&lt;/a&gt; and many other solutions. However, when I began researching cache mechanisms that can store data outside GC control, I found that very few solutions are left. Of the popular ones, only &lt;a href="https://www.terracotta.org/" rel="noopener noreferrer"&gt;Terracotta&lt;/a&gt; is still supported. This seems to be a very niche technique, and we do not have many options to choose from. Among less-known projects, I came across &lt;a href="https://github.com/OpenHFT/Chronicle-Map" rel="noopener noreferrer"&gt;Chronicle-Map&lt;/a&gt;, &lt;a href="https://github.com/jankotek/MapDB" rel="noopener noreferrer"&gt;MapDB&lt;/a&gt; and &lt;a href="https://github.com/snazy/ohc" rel="noopener noreferrer"&gt;OHC&lt;/a&gt;. I chose the last one because it was created as part of the Cassandra project, which I had some experience with, and I was curious about how this component worked: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;OHC was developed in 2014/15 for Apache Cassandra 2.2 and 3.0 to be used as the new row-cache backend.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To run the experiment, I built a service that returns an offer description based on the offer’s unique number. After an offer description is fetched from the repository, it is placed in the cache to speed up future calls. The cache has a limited capacity, chosen so that it forces the eviction of the items inserted longest ago. &lt;/p&gt;

&lt;p&gt;In our cache, the offer number is the key, while its description, in the form of a string of characters, is the value. This lets us easily simulate almost any size of cached data (all we have to do is make the offer description longer) and keeps the serialisation overhead mentioned above relatively small – serialising a text string is obviously faster than serialising a complex DTO object. &lt;/p&gt;
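
&lt;p&gt;The service logic itself is a plain cache-aside pattern. To keep the sketch below self-contained it uses a JDK &lt;code&gt;LinkedHashMap&lt;/code&gt; as the store (in the real experiment that role is played by Caffeine and OHC), and the repository function is a hypothetical stand-in for the real data source:&lt;/p&gt;

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class OfferDescriptionService {
    private final int maxElements;
    private final Map<Long, String> cache;
    private final Function<Long, String> repository; // hypothetical backing store

    public OfferDescriptionService(int maxElements, Function<Long, String> repository) {
        this.maxElements = maxElements;
        this.repository = repository;
        // Insertion-ordered map that evicts the entry inserted longest ago,
        // mimicking the eviction policy described above.
        this.cache = new LinkedHashMap<Long, String>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
                return size() > OfferDescriptionService.this.maxElements;
            }
        };
    }

    public String descriptionFor(long offerId) {
        String cached = cache.get(offerId);
        if (cached != null) return cached;              // cache hit
        String description = repository.apply(offerId); // cache miss: load...
        cache.put(offerId, description);                // ...and remember for next time
        return description;
    }
}
```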

&lt;p&gt;In my project, I used the &lt;a href="https://github.com/ben-manes/caffeine" rel="noopener noreferrer"&gt;Caffeine cache&lt;/a&gt; to store the data in the on-heap area and the OHC library to store it in the off-heap area. &lt;/p&gt;

&lt;p&gt;The test scenario consists of querying for descriptions of different offers. During the test, I will collect memory and GC metrics using JConsole. I will run the test scenario using &lt;a href="https://jmeter.apache.org/" rel="noopener noreferrer"&gt;JMeter&lt;/a&gt;, which will additionally allow me to measure response times. &lt;/p&gt;

&lt;p&gt;From my preliminary research I know that this method is only applicable to memory-intensive systems. Nevertheless, for completeness, let’s first run an experiment on a small cache with 5 KB elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maximum number of cached elements: 10,000&lt;/li&gt;
&lt;li&gt;cached element size: 5,000 bytes&lt;/li&gt;
&lt;li&gt;10 threads querying for random offers in a loop of 100,000 iterations each&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Take a look at the screenshots from JConsole below. The results are in line with expectations: there is no visible benefit from using off-heap space. Both the number of garbage collection cycles (63 vs. 65) and the GC run time (0.182 s vs. 0.235 s) are similar in both cases: &lt;/p&gt;

&lt;p&gt;&lt;em&gt;The GC profile of on-heap variant:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8glru3ludiz83427zyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8glru3ludiz83427zyo.png" alt="on-heap GC chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The GC profile of off-heap variant:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pz93l2rc732lyq97rfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pz93l2rc732lyq97rfu.png" alt="off-heap GC chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not much of an improvement for a small to medium cache size. This result does not disappoint me, however, because I expected it: the GC is designed to handle much more memory than 400 MB, so it would be strange to see an improvement at this scale. &lt;/p&gt;

&lt;p&gt;Now let’s see how the comparison looks for a much larger cache element size; let’s increase it to 100 KB. At the same time, because I am running the tests on a laptop with limited resources, I will reduce the number of iterations and the maximum number of cached elements. &lt;/p&gt;

&lt;p&gt;The configuration of the second test is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maximum number of cached elements: 5,000&lt;/li&gt;
&lt;li&gt;cached element size: 100,000 bytes&lt;/li&gt;
&lt;li&gt;10 threads querying for random offers in a loop of 1,000 iterations each&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s take a look at the results.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The GC profile of on-heap variant:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zcnflxb0fajprgo5znl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zcnflxb0fajprgo5znl.png" alt="on-heap GC chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Memory usage increases throughout the test; there are 40 GC cycles, which together last 0.212 s.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The GC profile of off-heap variant:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscw9et9hfv2e9mwjoy2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscw9et9hfv2e9mwjoy2m.png" alt="off-heap GC chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time the heap memory usage chart looks quite different: it is saw-shaped and peaks at about half of the previous value.&lt;br&gt;
Note also that this time there are only 13 GC cycles, with a total time of 0.108 s.&lt;/p&gt;

&lt;p&gt;The results of the GC profile comparison are therefore as expected, and what about the response times?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;jMeter metrics of on-heap variant:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1io9hvdrtbg1tprw28a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1io9hvdrtbg1tprw28a.png" alt="on-heap GC chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;jMeter metrics of off-heap variant:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3e13tks8u69cho6w3i6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3e13tks8u69cho6w3i6.png" alt="off-heap GC chart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The request time metrics are also in line with predictions: the off-heap variant proved to be slightly slower than the on-heap one.&lt;/p&gt;

&lt;p&gt;Now let’s see what effect increasing the data size has on the results. Let’s run tests for the following sizes: 100,000 B, 200,000 B and 300,000 B; the JMeter configuration stays unchanged: 10 threads with 1,000 iterations each. This time, for the sake of clarity, the results are summarized in a table: &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cached item size&lt;/th&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;GC cycles count&lt;/th&gt;
&lt;th&gt;GC time&lt;/th&gt;
&lt;th&gt;Request time (median)&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100,000 B&lt;/td&gt;
&lt;td&gt;on-heap&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;0.212 s&lt;/td&gt;
&lt;td&gt;171 ms&lt;/td&gt;
&lt;td&gt;83.2 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000 B&lt;/td&gt;
&lt;td&gt;off-heap&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;0.108 s&lt;/td&gt;
&lt;td&gt;179 ms&lt;/td&gt;
&lt;td&gt;78.1 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200,000 B&lt;/td&gt;
&lt;td&gt;on-heap&lt;/td&gt;
&lt;td&gt;84&lt;/td&gt;
&lt;td&gt;0.453 s&lt;/td&gt;
&lt;td&gt;396 ms&lt;/td&gt;
&lt;td&gt;38.2 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200,000 B&lt;/td&gt;
&lt;td&gt;off-heap&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;0.182 s&lt;/td&gt;
&lt;td&gt;355 ms&lt;/td&gt;
&lt;td&gt;40.2 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;300,000 B&lt;/td&gt;
&lt;td&gt;on-heap&lt;/td&gt;
&lt;td&gt;114&lt;/td&gt;
&lt;td&gt;0.6 s&lt;/td&gt;
&lt;td&gt;543 ms&lt;/td&gt;
&lt;td&gt;27.3 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;300,000 B&lt;/td&gt;
&lt;td&gt;off-heap&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;0.185 s&lt;/td&gt;
&lt;td&gt;528 ms&lt;/td&gt;
&lt;td&gt;27.9 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It turns out that as the size of a cache item increases, the benefits of using off-heap space grow – all metrics improve.&lt;/p&gt;

&lt;p&gt;What about the cache capacity? Let’s fix the item size at 200,000 B and check what happens as we increase the maximum number of cached elements, testing caches of 5,000, 10,000 and 15,000 elements: &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cache max elements&lt;/th&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;GC cycles count&lt;/th&gt;
&lt;th&gt;GC time&lt;/th&gt;
&lt;th&gt;Request time (median)&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;on-heap&lt;/td&gt;
&lt;td&gt;84&lt;/td&gt;
&lt;td&gt;0.453 s&lt;/td&gt;
&lt;td&gt;396 ms&lt;/td&gt;
&lt;td&gt;38.2 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;off-heap&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;0.182 s&lt;/td&gt;
&lt;td&gt;355 ms&lt;/td&gt;
&lt;td&gt;40.2 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;on-heap&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;0.46 s&lt;/td&gt;
&lt;td&gt;393 ms&lt;/td&gt;
&lt;td&gt;38.8 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;off-heap&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;0.173 s&lt;/td&gt;
&lt;td&gt;345 ms&lt;/td&gt;
&lt;td&gt;42.6 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15,000&lt;/td&gt;
&lt;td&gt;on-heap&lt;/td&gt;
&lt;td&gt;84&lt;/td&gt;
&lt;td&gt;0.462 s&lt;/td&gt;
&lt;td&gt;355 ms&lt;/td&gt;
&lt;td&gt;41.8 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15,000&lt;/td&gt;
&lt;td&gt;off-heap&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;0.167 s&lt;/td&gt;
&lt;td&gt;344 ms&lt;/td&gt;
&lt;td&gt;42.6 rps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No surprise here either: increasing the cache size has a positive impact on both variants. Of course, in the case of the on-heap cache, some of the benefits are offset by the need to clean a larger memory area. &lt;/p&gt;

&lt;p&gt;With the experiments done, we can conclude that the more data we store in memory, the greater the benefit of using the off-heap area may be. At the same time, it should be said that these benefits are not huge – just a few RPS more. In systems that store tremendous amounts of data, this method may bring some improvement in resource utilization; however, for most of our apps and services it is probably not the way to go, and a code audit is a better idea. &lt;/p&gt;

&lt;p&gt;This is probably a good moment to highlight how well implemented modern garbage collection algorithms are. Well done, GC!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Everyone has probably come across a case where an application froze as a result of the GC’s operation. As the above data show, there is a relationship between the amount of data stored in memory and the time the GC requires to clean it up – the more data we store on the heap, the longer it takes to free the memory. That is why cases where we process large amounts of data give us a potential benefit from using the off-heap area. There are some very specialised uses of this technique, such as Spark, which can store large amounts of data for subsequent processing steps and can do so using the off-heap space (you can read more about the Spark memory model &lt;a href="https://medium.com/walmartglobaltech/decoding-memory-in-spark-parameters-that-are-often-confused-c11be7488a24" rel="noopener noreferrer"&gt;here&lt;/a&gt;). Another example of the off-heap approach is the Apache Cassandra database; the OHC library used in this post grew out of that very project. &lt;/p&gt;

&lt;p&gt;There is a very narrow group of cases where storing data outside of GC control is justified. However, for the vast majority of applications, a much better approach is to take advantage of the ever-improving GC implementations. If you experience slow GC performance while developing a business service, you should definitely audit your code first and experiment with different heap size settings and GC algorithms. Only when all other methods fail should you give the off-heap area a try. &lt;/p&gt;

&lt;p&gt;However, if you are working on a server that processes massive amounts of data, it is worth considering off-heap storage earlier, as Spark and Cassandra do.&lt;/p&gt;

</description>
      <category>jvm</category>
      <category>gc</category>
      <category>memory</category>
      <category>java</category>
    </item>
    <item>
      <title>Probabilistic Data Structures and Algorithms in NoSQL databases</title>
      <dc:creator>Michał Knasiecki</dc:creator>
      <pubDate>Mon, 23 Oct 2023 19:44:37 +0000</pubDate>
      <link>https://dev.to/mknasiecki/probabilistic-data-structures-and-algorithms-in-nosql-databases-4mp7</link>
      <guid>https://dev.to/mknasiecki/probabilistic-data-structures-and-algorithms-in-nosql-databases-4mp7</guid>
      <description>&lt;p&gt;One of the &lt;a href="https://en.wikipedia.org/wiki/ACID"&gt;four fundamental&lt;/a&gt; features of transactional databases is durability. It says that once a transaction is committed, the stored data remains available even if the database crashes. If we upload some information into the database, we must be able to read it later, no matter what happens.&lt;/p&gt;

&lt;p&gt;It is so elementary that we frequently don’t even think about it: if we save a record with the value ’42’ in a database, we will get ’42’ every time we read that record, until the next modification. The concept of durability can be generalized somewhat by considering not only transactional databases but also those that do not provide transactions. After all, in each of them, after a successful write we can be sure that the stored information is in the database and that we have access to it.&lt;/p&gt;

&lt;p&gt;But it turns out that there are databases with features that make the concept of durability – even in this generalized form – no longer so obvious. What would you say if we stored 1,000 records in a database, and the database claimed that there were only 998 of them? Or if we created a database storing sets of values, and in some cases the database claimed that an element was in a set while in fact it was not? Seeing such behavior, many would probably start looking for an error. However, behavior like this is not necessarily an error, as long as we are using a database that implements probabilistic algorithms and data structures. Solutions based on these methods allow some inaccuracy in the results, but in return they can provide us with great savings in the resources used. More interestingly, there is a good chance that you are already using such a DB.&lt;/p&gt;

&lt;p&gt;In this post we will learn about two probability-based techniques, perform some experiments and consider when it is worth using a database that lies to us a bit.&lt;/p&gt;
&lt;h2&gt;
  
  
  Fast cardinality aggregation
&lt;/h2&gt;

&lt;p&gt;Some time ago I had the opportunity to work on a service based on Elasticsearch. This service collects huge amounts of data, which is later analyzed by our customer care specialists. One of the key elements to be analyzed is a simple aggregate — the number of unique occurrences of certain values. In mathematics, this quantity is called the power of the set or the cardinal number.&lt;/p&gt;

&lt;p&gt;The easiest way to understand this is to use an example: imagine that I take out all the banknotes from my wallet and it turns out that I have 10 of them, with the following nominal values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we arranged them by value, we would end up collecting these 10 banknotes in four piles with values: &lt;code&gt;[10, 20, 50, 100]&lt;/code&gt;, so the cardinal number of the set containing my 10 banknotes equals: 4.&lt;/p&gt;

&lt;p&gt;Elasticsearch has a special function: &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html"&gt;cardinality&lt;/a&gt;, which is used to determine the power of the set and we use this function specifically to count unique occurrences that I mentioned earlier.&lt;/p&gt;

&lt;p&gt;It may seem that counting unique occurrences of values is a trivial task.&lt;br&gt;
Let’s go back to our example with the banknotes. You can think of many ways to check how many unique values there are in this list, probably one of the simplest is to use the &lt;code&gt;HashSet&lt;/code&gt; class. One of its main features is that it de-duplicates the elements added to it, thus it stores only one occurrence of each.&lt;/p&gt;

&lt;p&gt;After adding 10 values of banknotes: &lt;code&gt;[10, 20, 50, 20, 50, 100, 50, 20, 10, 10]&lt;/code&gt; to an instance of the &lt;code&gt;HashSet&lt;/code&gt; class, it will ultimately only store the values &lt;code&gt;[10, 20, 50, 100]&lt;/code&gt; (not necessarily in that order, but it doesn’t matter in this case). So all we need to do is check the size of this set, and we have the result we were looking for: 4.&lt;/p&gt;
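
&lt;p&gt;The whole exact approach described above fits in a few lines:&lt;/p&gt;

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExactCardinality {
    public static void main(String[] args) {
        List<Integer> banknotes = List.of(10, 20, 50, 20, 50, 100, 50, 20, 10, 10);

        // HashSet keeps only one occurrence of each value...
        Set<Integer> distinct = new HashSet<>(banknotes);

        // ...so its size is the cardinality of the input.
        System.out.println(distinct.size()); // prints 4
    }
}
```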

&lt;p&gt;This solution is simple and looks tempting, yet it has a certain drawback: the more unique elements the set stores, the more memory our program needs. In an extreme case, when every added element differs from all the others, the memory complexity of this approach is linear. This is bad news when we want to operate on a large volume of data, because we will quickly exhaust all available memory.&lt;br&gt;
If, additionally, requests for the cardinal number arrive from clients at a high rate, and the input set contains billions of elements, it is easy to see that the approach described above has no chance of success.&lt;/p&gt;

&lt;p&gt;How can we address this issue? In such a situation we can switch to one of the ingenious probabilistic algorithms. Their main feature is that they give approximate rather than exact results; their huge advantage, on the other hand, is that they are much less resource-intensive.&lt;/p&gt;
&lt;h2&gt;
  
  
  Near-optimal cardinality estimator
&lt;/h2&gt;

&lt;p&gt;One such algorithm — HyperLogLog (HLL) — has been implemented in the aforementioned Elasticsearch to build the cardinality function. It is used to count the unique values of a given field of an indexed document, and it does so with a certain approximation, using very little memory.&lt;br&gt;
Interestingly, you can control the accuracy of this approximation with a special parameter. In addition to the field to be counted, the cardinality function also accepts a &lt;code&gt;precision_threshold&lt;/code&gt; argument, with which we can specify how much inaccuracy we accept in exchange for more or less memory usage.&lt;/p&gt;
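&lt;p&gt;For illustration, a cardinality aggregation request with an explicit &lt;code&gt;precision_threshold&lt;/code&gt; might look like the sketch below (the &lt;code&gt;unique_users&lt;/code&gt; aggregation name and the &lt;code&gt;user_id&lt;/code&gt; field are, of course, just example names):&lt;/p&gt;

```json
{
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 3000
      }
    }
  }
}
```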

&lt;p&gt;Obviously, in some cases even a small error is unacceptable. We must then abandon the probabilistic approach and look for another solution. However, for a sizable class of problems, a certain approximation is completely sufficient. Imagine a video clip uploaded to a popular streaming service.&lt;br&gt;
If the author of the clip has a bit of luck, the counter of unique views of his/her work starts spinning very quickly. In case of very high popularity, full accuracy of the displayed view count will not matter much; we can settle for displaying a value that differs from the actual one by a few percent. It is completely sufficient that the accurate data — e.g. for monetization purposes — is available the next day, computed exactly using, for example, Apache Spark.&lt;/p&gt;

&lt;p&gt;When implementing such a unique-visitor counter for a site operating on huge data sets, we could therefore consider using the HLL algorithm.&lt;/p&gt;

&lt;p&gt;Readers interested in a detailed description of the HLL algorithm are referred to a great post on the &lt;a href="http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation"&gt;Damn Cool Algorithms&lt;/a&gt; blog.&lt;br&gt;
However, its most important features are worth noting here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the results, although approximate, are deterministic,&lt;/li&gt;
&lt;li&gt;the maximum possible error is known,&lt;/li&gt;
&lt;li&gt;the amount of memory used is fixed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last two features are closely related and can be controlled: we can decrease the error level by increasing the available memory limit and vice versa.&lt;br&gt;
There are many ready-made implementations of the HLL algorithm available, so it’s worth reaching for one of them and doing some experiments. I will use &lt;a href="https://datasketches.apache.org/docs/HLL/HLL.html"&gt;datasketches&lt;/a&gt; and compare the memory consumption with the classic approach using the &lt;code&gt;HashSet&lt;/code&gt;. Moreover, I will add a third variant based on the &lt;code&gt;distinct&lt;/code&gt; method from the Kotlin language, which — like the &lt;code&gt;HashSet&lt;/code&gt; — de-duplicates the elements of a list.&lt;/p&gt;

&lt;p&gt;Below is a code snippet of a simple program that determines the cardinal number of a set of numbers using the &lt;code&gt;HashSet&lt;/code&gt; class from Java. In order to run some trials, I’ve introduced a couple of basic parameters. The input list consists of &lt;code&gt;n&lt;/code&gt; numbers, while with the &lt;code&gt;f&lt;/code&gt; parameter and the &lt;code&gt;modulo&lt;/code&gt; function I decide what fraction of the input list is unique. For example, for n=1 000 000 and f=0.1, the result will be a cardinal number equal to 100 000.&lt;/p&gt;

&lt;p&gt;Please note the &lt;code&gt;HashSet&lt;/code&gt; constructor parameter. By default, when the constructor is empty, this class is &lt;a href="https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/HashSet.html#%3Cinit%3E()"&gt;initialized with a capacity of 16&lt;/a&gt;, which means that as elements are added and the load factor threshold is crossed, the internal table must be reallocated and the elements rehashed, which takes time.&lt;br&gt;
To eliminate this extra time I allocate in advance as much capacity as needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;mod&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toLong&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;set&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HashSet&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;mod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toInt&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;elapsed&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;measureTimeMillis&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;until&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;%&lt;/span&gt; &lt;span class="n"&gt;mod&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;cardinality&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two other programs do exactly the same thing — determine the cardinal number of a set of numbers — but one uses the Kotlin&lt;br&gt;
&lt;code&gt;distinct&lt;/code&gt; method and the second one uses the HLL algorithm. You can find the full code of all three applications in this &lt;a href="https://github.com/mknasiecki/prob-alg-post"&gt;repository&lt;/a&gt;.&lt;/p&gt;
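&lt;p&gt;To build some intuition for how HLL gets away with so little memory, below is a toy, self-contained register-based estimator written in the spirit of the algorithm (an illustration only; the experiments here use the real datasketches implementation). A 64-bit hash of each element is split into a register index and a remainder; each register remembers the longest run of leading zero bits it has ever seen, and a harmonic mean over the registers yields the estimate:&lt;/p&gt;

```kotlin
import kotlin.math.pow

// Toy HyperLogLog: m = 2^p registers, each one byte.
class ToyHll(private val p: Int = 11) {
    private val m = 1 shl p
    private val registers = ByteArray(m)

    // SplitMix64 finalizer used as a cheap 64-bit hash.
    private fun mix(x: Long): Long {
        var z = x + 0x9E3779B97F4A7C15uL.toLong()
        z = (z xor (z ushr 30)) * 0xBF58476D1CE4E5B9uL.toLong()
        z = (z xor (z ushr 27)) * 0x94D049BB133111EBuL.toLong()
        return z xor (z ushr 31)
    }

    fun update(value: Long) {
        val h = mix(value)
        val idx = (h ushr (64 - p)).toInt()              // top p bits pick a register
        val rank = (h shl p).countLeadingZeroBits() + 1  // rank of the remaining bits
        if (rank > registers[idx]) registers[idx] = rank.toByte()
    }

    // Harmonic-mean estimate with the standard bias-correction constant.
    fun estimate(): Double {
        val alpha = 0.7213 / (1 + 1.079 / m)
        var sum = 0.0
        for (r in registers) sum += 2.0.pow(-r.toInt())
        return alpha * m * m / sum
    }
}
```

&lt;p&gt;Fed the same stream as the &lt;code&gt;HashSet&lt;/code&gt; snippet above, its state is a fixed 2 KB regardless of the input size, and the estimate lands within a few percent of the true cardinality.&lt;/p&gt;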

&lt;p&gt;All three programs, in addition to the result, also measure total execution time. Moreover, using &lt;a href="https://openjdk.java.net/tools/svc/jconsole/"&gt;JConsole&lt;/a&gt; I can also measure the amount of memory used. I decided to measure the total memory used by the programs, because measuring the size of the data structures themselves is not a trivial task.&lt;/p&gt;

&lt;p&gt;We start by checking the variant n=1 000 000/f=0.25, which should give a cardinality of 250 000. Let’s take a look at the results:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;n=1 000 000/f=0.25&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric\Variant&lt;/th&gt;
&lt;th&gt;HashSet&lt;/th&gt;
&lt;th&gt;distinct&lt;/th&gt;
&lt;th&gt;HLL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cardinality&lt;/td&gt;
&lt;td&gt;250 000&lt;/td&gt;
&lt;td&gt;250 000&lt;/td&gt;
&lt;td&gt;249 979.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;error [%]&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time [ms]&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;106&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;memory [MB]&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In case of such a small set, the deviation of the HLL result from the true value is far less than 1%, and you can already see the benefits of this method: the amount of memory used is half that of the &lt;code&gt;HashSet&lt;/code&gt; version and as much as 3 times less than the version using the Kotlin language function.&lt;/p&gt;

&lt;p&gt;It is worth pausing here for a moment to consider the reason for such a big difference in memory consumption.&lt;br&gt;
The first two programs are based on collections of objects, thus storing entire instances in memory along with their references.&lt;br&gt;
The HLL method, on the other hand, uses memory-efficient bit arrays that store data based on object hashes. This makes it insensitive to the original size of the processed data, meaning that the benefits of using HLL grow with the memory needed to store the objects being counted. The results presented above would be even more spectacular if we used, for example, email addresses or IP addresses instead of numbers.&lt;/p&gt;

&lt;p&gt;During the next attempt we increase the value of the &lt;code&gt;n&lt;/code&gt; parameter tenfold:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;n=10 000 000/f=0.25&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric\Variant&lt;/th&gt;
&lt;th&gt;HashSet&lt;/th&gt;
&lt;th&gt;distinct&lt;/th&gt;
&lt;th&gt;HLL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cardinality&lt;/td&gt;
&lt;td&gt;2 500 000&lt;/td&gt;
&lt;td&gt;2 500 000&lt;/td&gt;
&lt;td&gt;2 484 301.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;error [%]&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time [ms]&lt;/td&gt;
&lt;td&gt;483&lt;/td&gt;
&lt;td&gt;863&lt;/td&gt;
&lt;td&gt;189&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;memory [MB]&lt;/td&gt;
&lt;td&gt;233&lt;/td&gt;
&lt;td&gt;574&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The error value has increased slightly, while the difference in memory usage and execution time is even greater than before. Therefore, it is worthwhile to increase the size of the set again:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;n=100 000 000/f=0.25&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric\Variant&lt;/th&gt;
&lt;th&gt;HashSet&lt;/th&gt;
&lt;th&gt;distinct&lt;/th&gt;
&lt;th&gt;HLL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cardinality&lt;/td&gt;
&lt;td&gt;25 000 000&lt;/td&gt;
&lt;td&gt;25 000 000&lt;/td&gt;
&lt;td&gt;25 301 157.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;error [%]&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time [ms]&lt;/td&gt;
&lt;td&gt;3857&lt;/td&gt;
&lt;td&gt;7718&lt;/td&gt;
&lt;td&gt;1538&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;memory [MB]&lt;/td&gt;
&lt;td&gt;1800&lt;/td&gt;
&lt;td&gt;5300&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Deviation from the correct result exceeded 1%; the times also went up, although they are still several times shorter than those of the other variants. It’s worth noting that the amount of memory used has remained practically unchanged.&lt;/p&gt;

&lt;p&gt;Now let’s see what happens when we change the second parameter, which determines the number of unique elements in the input set:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;n=10 000 000/f=0.5&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric\Variant&lt;/th&gt;
&lt;th&gt;HashSet&lt;/th&gt;
&lt;th&gt;distinct&lt;/th&gt;
&lt;th&gt;HLL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cardinality&lt;/td&gt;
&lt;td&gt;5 000 000&lt;/td&gt;
&lt;td&gt;5 000 000&lt;/td&gt;
&lt;td&gt;5 067 045.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;error [%]&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1.34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time [ms]&lt;/td&gt;
&lt;td&gt;467&lt;/td&gt;
&lt;td&gt;914&lt;/td&gt;
&lt;td&gt;183&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;memory [MB]&lt;/td&gt;
&lt;td&gt;420&lt;/td&gt;
&lt;td&gt;753&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;n=10 000 000/f=0.75&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric\Variant&lt;/th&gt;
&lt;th&gt;HashSet&lt;/th&gt;
&lt;th&gt;distinct&lt;/th&gt;
&lt;th&gt;HLL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cardinality&lt;/td&gt;
&lt;td&gt;7 500 000&lt;/td&gt;
&lt;td&gt;7 500 000&lt;/td&gt;
&lt;td&gt;7 619 136.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;error [%]&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1.59&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time [ms]&lt;/td&gt;
&lt;td&gt;589&lt;/td&gt;
&lt;td&gt;1187&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;memory [MB]&lt;/td&gt;
&lt;td&gt;616&lt;/td&gt;
&lt;td&gt;843&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Again, the results clearly show the advantages of the HLL algorithm. With a relatively low error we significantly reduced the amount of memory used and the time required for calculations.&lt;br&gt;
As expected, the classical approach gives exact results but consumes a lot of memory, while the solution using HLL brings results with an error of approximately 1%, but in return uses much less memory. A certain surprise for me is the poor result of the Kotlin &lt;code&gt;distinct&lt;/code&gt; function; I expected results closer to the variant based on the &lt;code&gt;HashSet&lt;/code&gt;. Presumably the key difference is that it returns an instance of the &lt;code&gt;List&lt;/code&gt; class rather than a &lt;code&gt;HashSet&lt;/code&gt;. This requires further investigation, which is beyond the scope of these considerations.&lt;/p&gt;

&lt;p&gt;The HLL algorithm is implemented in several solutions, including the aforementioned Elasticsearch, as well as in e.g. &lt;a href="https://redis.com/redis-best-practices/counting/hyperloglog/"&gt;Redis&lt;/a&gt; and &lt;a href="https://prestodb.io/docs/current/functions/hyperloglog.html"&gt;Presto&lt;/a&gt;. The above experiments clearly show that when we need to process huge amounts of data, the approximate method is a good idea, provided that we can accept a result with a small error.&lt;/p&gt;
&lt;h2&gt;
  
  
  Memory-efficient presence test
&lt;/h2&gt;

&lt;p&gt;It turns out that HLL is not the only probabilistic algorithm available in popular databases — another example of this approach is the Bloom Filter. This is a memory-saving structure that is used in the so-called presence test. Let’s go back to our example with my cash: &lt;code&gt;[10, 20, 50, 20, 50, 100, 50, 20, 10, 10]&lt;/code&gt;.&lt;br&gt;
Imagine that we want to test whether there is a 100 banknote in my wallet. In this case the answer is positive, but the test for a 200 banknote should be negative, since there is no such banknote in the wallet.&lt;/p&gt;

&lt;p&gt;Of course, we could again solve this problem simply by using the properties of the &lt;code&gt;HashSet&lt;/code&gt; class and its &lt;code&gt;contains&lt;/code&gt; method. However, just as with determining cardinality, the memory requirement grows with the size of the dataset. Again, the solution to this problem may be an approximate method.&lt;/p&gt;
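&lt;p&gt;For reference, the exact-answer version of the wallet test is again trivial:&lt;/p&gt;

```kotlin
// Exact presence test: the HashSet stores each distinct value once,
// and the `in` operator (contains) answers membership queries exactly.
val wallet = hashSetOf(10, 20, 50, 20, 50, 100, 50, 20, 10, 10)
println(100 in wallet) // prints true
println(200 in wallet) // prints false
```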

&lt;p&gt;As with the HLL algorithm, the Bloom Filter allows for some inaccuracy, which in this case means false positive results: it can happen that the Bloom Filter finds that an element belongs to a given set while in fact it is not there. However, the opposite situation is not possible, so if the Bloom Filter states that an element is not part of the set, this is certainly true. Referring this to our example with the contents of my wallet, the Bloom Filter could assure me that there was a 200 banknote in it, while standing at the checkout in a store it would turn out that, unfortunately, there is not. What a pity...&lt;/p&gt;

&lt;p&gt;Before we move on to examine how this algorithm works, let’s consider where it could be useful. A typical example is a recommendation system. Imagine we are designing a system intended to suggest articles for users to read, a feature common on social media sites. Such a system needs to store a&lt;br&gt;
list of articles read by each user so that it does not suggest them again. It is easy to imagine that storing these articles for each user in the classic way would quickly exhaust memory resources; if we don’t use any data-removal mechanism, the database will grow indefinitely. The discussed Bloom&lt;br&gt;
Filter fits perfectly here, as it allows us to save a lot of memory, although one must consider the consequences of its limitations, namely possible false positives. It may happen that we get false information that a certain article has already been read by someone, while in fact this is not true.&lt;br&gt;
Consequently, we will not offer that user the material. On the other hand, the opposite situation is not possible: we will never show a user a recommendation of an article that he/she has already read.&lt;/p&gt;

&lt;p&gt;At this point it is worth checking how much we gain by accepting the inconvenience described above. I have prepared two implementations of a program that adds values to a set and then checks whether they are there.&lt;br&gt;
The first program uses the classic approach — the &lt;code&gt;HashSet&lt;/code&gt; class — while the second uses the Bloom Filter available in the popular &lt;a href="https://guava.dev/releases/20.0/api/docs/com/google/common/hash/BloomFilter.html"&gt;guava&lt;/a&gt; library. Again, using JConsole we record the amount of memory used by both programs, and additionally — for the version with the Bloom Filter — we also count the number of false positives. This value can be easily controlled, as the maximum allowed false-positive&lt;br&gt;
rate can be set in the API; for the needs of the following tests we will set it to 1%.&lt;/p&gt;
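&lt;p&gt;It is worth knowing roughly what such a 1% budget costs in memory. For an optimally sized Bloom Filter, the textbook formulas derive the number of bits from the expected number of insertions n and the target false-positive rate p; the sketch below merely evaluates them (a back-of-the-envelope calculation, not guava’s internal logic):&lt;/p&gt;

```kotlin
import kotlin.math.ceil
import kotlin.math.ln

// Optimal Bloom filter sizing:
//   bits:           m = -n * ln(p) / (ln 2)^2
//   hash functions: k = (m / n) * ln 2
fun bloomBits(n: Long, p: Double): Long =
    ceil(-n * ln(p) / (ln(2.0) * ln(2.0))).toLong()

val n = 1_000_000L
val bits = bloomBits(n, 0.01)
println(bits.toDouble() / n) // about 9.6 bits per element at a 1% FP rate
println(bits / 8 / 1024)     // about 1170 KB in total for a million elements
```

&lt;p&gt;Compare this with the &lt;code&gt;HashSet&lt;/code&gt; variant, which must keep every element object alive in memory.&lt;/p&gt;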

&lt;p&gt;Moreover, we will measure the total time of adding values to the set and the total time of querying whether there are values in the set.&lt;/p&gt;

&lt;p&gt;As before, we will perform a number of tests using the following parameters: &lt;code&gt;n&lt;/code&gt; — the size of the set of numbers, and &lt;code&gt;f&lt;/code&gt; — the fraction of it that should be added to the set. The configuration n=1 000 000 and f=0.1 means&lt;br&gt;
that the first 100 000 numbers out of 1 000 000 will be added to the set. So, in the first stage, the program will add 100 000 numbers to the set and then — in the second stage — it will perform a presence test by checking whether the numbers above 100 000 belong to the set. There is no point in checking the&lt;br&gt;
numbers added to the set beforehand, because we know that Bloom Filters do not give false negative results. On the other hand, if any number above 100 000 is, according to the Bloom Filter, found in the set, we will count it as a false positive.&lt;/p&gt;

&lt;p&gt;The following code snippet presents a fragment of the Bloom Filter variant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;insertions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toInt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Funnels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integerFunnel&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;insertions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;falsePositives&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;insertTime&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;measureTimeMillis&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;until&lt;/span&gt; &lt;span class="n"&gt;insertions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;queryTime&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;measureTimeMillis&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;insertions&lt;/span&gt; &lt;span class="n"&gt;until&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mightContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;falsePositives&lt;/span&gt;&lt;span class="p"&gt;++;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;fpRatio&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;falsePositives&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toDouble&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again — you can find the full code of both programs in the aforementioned &lt;a href="https://github.com/mknasiecki/prob-alg-post"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s start with the following configuration:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;n=10 000 000/f=0.1&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric\Variant&lt;/th&gt;
&lt;th&gt;HashSet&lt;/th&gt;
&lt;th&gt;Bloom filter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;error[%]&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;insert time [ms]&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;293&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query time [ms]&lt;/td&gt;
&lt;td&gt;82&lt;/td&gt;
&lt;td&gt;846&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;memory [MB]&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As you can see, the Bloom Filter returned less than 1% false results while — at the same time — using three times less memory than the &lt;code&gt;HashSet&lt;/code&gt; variant. Unfortunately, the times of the Bloom Filter version are significantly higher.&lt;br&gt;
Let’s check what happens when we increase the size of the input set:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;n=100 000 000/f=0.1&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric\Variant&lt;/th&gt;
&lt;th&gt;HashSet&lt;/th&gt;
&lt;th&gt;Bloom filter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;error[%]&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;insert time [ms]&lt;/td&gt;
&lt;td&gt;593&lt;/td&gt;
&lt;td&gt;318&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query time [ms]&lt;/td&gt;
&lt;td&gt;988&lt;/td&gt;
&lt;td&gt;944&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;memory [MB]&lt;/td&gt;
&lt;td&gt;876&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;n=500 000 000/f=0.1&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric\Variant&lt;/th&gt;
&lt;th&gt;HashSet&lt;/th&gt;
&lt;th&gt;Bloom filter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;error[%]&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;insert time [ms]&lt;/td&gt;
&lt;td&gt;1975&lt;/td&gt;
&lt;td&gt;1372&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query time [ms]&lt;/td&gt;
&lt;td&gt;4115&lt;/td&gt;
&lt;td&gt;4923&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;memory [MB]&lt;/td&gt;
&lt;td&gt;4400&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The number of false positives is still below the preset 1%, the amount of memory used is still far lower than in the classic implementation, and interestingly, the times of the probabilistic variant are now lower as well, at least for inserting. It is clear that the benefit of this method grows with the size of the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The above results clearly show that by accepting a small share of false answers, we can gain significant savings in memory usage.&lt;br&gt;
Similarly to the HLL algorithm, the Bloom-Filter-based structure is available in many popular databases, such as &lt;a href="https://redis.com/redis-best-practices/bloom-filter-pattern/"&gt;Redis&lt;/a&gt;, &lt;a href="https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/util/BloomFilter.html"&gt;HBase&lt;/a&gt; or &lt;a href="https://cassandra.apache.org/doc/latest/cassandra/operating/bloom_filters.html"&gt;Cassandra&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The simple experiments we conducted show that probabilistic algorithms can save a lot of memory, which is especially important when our database stores huge amounts of data. In such cases it is sometimes worth letting your database lie to you a little.&lt;/p&gt;

</description>
      <category>nosql</category>
      <category>kotlin</category>
      <category>java</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
