// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at http://mozilla.org/MPL/2.0/.

//! # Immutable Data Structures for Rust
//!
//! This library implements several of the more commonly useful immutable data
//! structures for Rust.
//!
//! ## What are immutable data structures?
//!
//! Immutable data structures are data structures which can be copied and
//! modified efficiently without altering the original. The most uncomplicated
//! example of this is the venerable [cons list][cons-list]. This crate offers a
//! selection of more modern and flexible data structures with similar
//! properties, tuned for the needs of Rust developers.
//!
//! Briefly, the following data structures are provided:
//!
//! * [Vectors][vector::Vector] based on [RRB trees][rrb-tree]
//! * [Hash maps][hashmap::HashMap]/[sets][hashset::HashSet] based on [hash
//!   array mapped tries][hamt]
//! * [Ordered maps][ordmap::OrdMap]/[sets][ordset::OrdSet] based on
//!   [B+trees][b+tree]
//!
//! ## Why Would I Want This?
//!
//! While immutable data structures can be a game changer for other
//! programming languages, the most obvious benefit - avoiding the
//! accidental mutation of data - is already handled so well by Rust's
//! type system that it's just not something a Rust programmer needs
//! to worry about, even when using data structures that would send a
//! conscientious Clojure programmer into a panic.
//!
//! Immutable data structures offer other benefits, though, some of
//! which are useful even in a language like Rust. The most prominent
//! is *structural sharing*, which means that if two data structures
//! are mostly copies of each other, most of the memory they take up
//! will be shared between them. This implies that making copies of an
//! immutable data structure is cheap: it's really only a matter of
//! copying a pointer and increasing a reference counter, whereas in the
//! case of [`Vec`] you have to allocate the same
//! amount of memory all over again and make a copy of every element
//! it contains. For immutable data structures, extra memory isn't
//! allocated until you modify either the copy or the original, and
//! then only the memory needed to record the difference.
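//!
//! The "copying a pointer and increasing a reference counter" part can be
//! illustrated with nothing but the standard library's [`Rc`][std::rc::Rc]
//! (in `imbl`, the same thing happens node by node inside the trees):
//!
//! ```
//! use std::rc::Rc;
//!
//! let original: Rc<Vec<i32>> = Rc::new(vec![1, 2, 3]);
//! let copy = Rc::clone(&original); // O(1): pointer copy plus refcount bump
//! assert!(Rc::ptr_eq(&original, &copy)); // both point at one allocation
//! assert_eq!(Rc::strong_count(&original), 2);
//! ```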
//!
//! Another goal of this library has been the idea that you shouldn't
//! even have to think about what data structure to use in any given
//! situation, until the point where you need to start worrying about
//! optimisation - which, in practice, often never comes. Beyond the
//! shape of your data (i.e. whether to use a list or a map), it should
//! be fine not to think too carefully about data structures - you can
//! just pick the one that has the right shape and it should have
//! acceptable performance characteristics for every operation you
//! might need. Specialised data structures will always be faster at
//! what they've been specialised for, but `imbl` aims to provide the
//! data structures that are least likely to go wrong when you
//! accidentally use them for the wrong thing.
//!
//! For instance, [`Vec`] beats everything at memory
//! usage, indexing and operations that happen at the back of the
//! list, but is terrible at insertion and removal, and gets worse the
//! closer to the front of the list you get.
//! [`VecDeque`](std::collections::VecDeque) adds a little bit of
//! complexity in order to make operations at the front as efficient
//! as operations at the back, but is still bad at insertion and
//! especially concatenation. [`Vector`] adds another
//! bit of complexity, and could never match [`Vec`] at
//! what it's best at, but in return every operation you can throw at
//! it can be completed in a reasonable amount of time - even normally
//! expensive operations like copying and especially concatenation are
//! reasonably cheap when using a [`Vector`].
//!
//! It should be noted, however, that because of its simplicity,
//! [`Vec`] actually beats [`Vector`] even at its
//! strongest operations at small sizes, just because modern CPUs are
//! hyperoptimised for things like copying small chunks of contiguous memory -
//! you need to go past a certain size (usually in the vicinity of
//! several hundred elements) before you get to the point where
//! [`Vec`] isn't always going to be the fastest choice.
//! [`Vector`] attempts to overcome this by simply being an inline
//! array at very small sizes, and switching efficiently to the
//! full data structure when it grows large enough. Thus,
//! [`Vector`] will be equivalent to
//! [`Vec`] until it grows past the size of a single chunk.
//!
//! The maps - [`HashMap`] and
//! [`OrdMap`] - generally perform similarly to their
//! equivalents in the standard library, but tend to run a bit slower
//! on the basic operations ([`HashMap`] is almost
//! neck and neck with its counterpart, while
//! [`OrdMap`] currently tends to run 2-3x slower). On
//! the other hand, they offer the cheap copies and structural sharing
//! between copies that you'd expect from immutable data structures.
//!
//! In conclusion, the aim of this library is to provide a safe
//! default choice for the most common kinds of data structures,
//! allowing you to defer careful thinking about the right data
//! structure for the job until you need to start looking for
//! optimisations - and you may find, especially for larger data sets,
//! that immutable data structures are still the right choice.
//!
//! ## Values
//!
//! Because we need to make copies of shared nodes in these data structures
//! before updating them, the values you store in them must implement
//! [`Clone`]. For primitive values that implement
//! [`Copy`], such as numbers, everything is fine: this is
//! the case for which the data structures are optimised, and performance is
//! going to be great.
//!
//! On the other hand, if you want to store values for which cloning is
//! expensive, or values that don't implement [`Clone`], you
//! need to wrap them in [`Rc`][std::rc::Rc] or [`Arc`][std::sync::Arc]. Thus,
//! if you have a complex structure `BigBlobOfData` and you want to store a list
//! of them as a `Vector<BigBlobOfData>`, you should instead use a
//! `Vector<Rc<BigBlobOfData>>`, which is going to save you not only the time
//! spent cloning the big blobs of data, but also the memory spent keeping
//! multiple copies of them around, as [`Rc`][std::rc::Rc] keeps a single
//! reference counted copy around instead.
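//!
//! A minimal sketch of why this helps, using only the standard library's
//! [`Rc`][std::rc::Rc] (`BigBlobOfData` here is the hypothetical type from the
//! paragraph above):
//!
//! ```
//! use std::rc::Rc;
//!
//! struct BigBlobOfData {
//!     payload: Vec<u8>,
//! }
//!
//! let blob = Rc::new(BigBlobOfData { payload: vec![0u8; 1024] });
//! // Cloning the Rc only bumps a reference count; the kilobyte of
//! // payload is never copied.
//! let another_handle = Rc::clone(&blob);
//! assert!(Rc::ptr_eq(&blob, &another_handle));
//! assert_eq!(another_handle.payload.len(), 1024);
//! ```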
//!
//! If you're storing smaller values that aren't
//! [`Copy`]able, you'll need to exercise judgement: if your
//! values are going to be very cheap to clone, as would be the case for short
//! [`String`]s or small [`Vec`]s, you're probably better off storing them directly
//! without wrapping them in an [`Rc`][std::rc::Rc], because, like the [`Rc`][std::rc::Rc],
//! they're just pointers to some data on the heap, and that data isn't expensive to clone -
//! you might actually lose more performance from the extra redirection of
//! wrapping them in an [`Rc`][std::rc::Rc] than you would from occasionally
//! cloning them.
//!
//! ### When does cloning happen?
//!
//! So when will your values actually be cloned? The easy answer is: only if you
//! [`clone`][Clone::clone] the data structure itself, and then only
//! lazily as you change it. Values are stored in tree nodes inside the data
//! structure, each node of which contains up to 64 values. When you
//! [`clone`][Clone::clone] a data structure, nothing is actually
//! copied - it's just the reference count on the root node that's incremented,
//! to indicate that it's shared between two data structures. It's only when you
//! actually modify one of the shared data structures that nodes are cloned:
//! when you make a change somewhere in the tree, the node containing the change
//! needs to be cloned, and then its parent nodes need to be updated to contain
//! the new child node instead of the old version, and so they're cloned as
//! well.
//!
//! We can call this "lazy" cloning - if you make two copies of a data structure
//! and you never change either of them, there's never any need to clone the
//! data they contain. It's only when you start making changes that cloning
//! starts to happen, and then only on the specific tree nodes that are part of
//! the change. Note that the implications of lazily cloning the data structure
//! extend to memory usage as well as the CPU workload of copying the data
//! around - cloning an immutable data structure means both copies share the
//! same allocated memory, until you start making changes.
//!
//! Most crucially, if you never clone the data structure, the data inside it is
//! also never cloned, and in this case it acts just like a mutable data
//! structure, with minimal performance differences (but still non-zero, as we
//! still have to check for shared nodes).
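//!
//! The same principle can be seen in miniature with the standard library's
//! [`Rc`][std::rc::Rc] (this is a sketch of the mechanism, not of `imbl`'s
//! internals):
//!
//! ```
//! use std::rc::Rc;
//!
//! let first = Rc::new(String::from("shared data"));
//! let mut second = Rc::clone(&first);
//! assert!(Rc::ptr_eq(&first, &second)); // the clone copied nothing
//!
//! // The data is only copied at the first mutation of a shared handle.
//! Rc::make_mut(&mut second).push_str("!");
//! assert!(!Rc::ptr_eq(&first, &second)); // now two separate allocations
//! assert_eq!(*first, "shared data");
//! assert_eq!(*second, "shared data!");
//! ```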
//!
//! ## Data Structures
//!
//! We'll attempt to provide a comprehensive guide to the available
//! data structures below.
//!
//! ### Performance Notes
//!
//! "Big O notation" is the standard way of talking about the time
//! complexity of data structure operations. If you're not familiar
//! with big O notation, here's a quick cheat sheet:
//!
//! *O(1)* means an operation runs in constant time: it will take the
//! same time to complete regardless of the size of the data
//! structure.
//!
//! *O(n)* means an operation runs in linear time: if you double the
//! size of your data structure, the operation will take twice as long
//! to complete; if you quadruple the size, it will take four times as
//! long, etc.
//!
//! *O(log n)* means an operation runs in logarithmic time: for
//! *log<sub>2</sub>*, if you double the size of your data structure,
//! the operation will take one step longer to complete; if you
//! quadruple the size, it will need two steps more; and so on.
//! However, the data structures in this library generally run in
//! *log<sub>64</sub>* time, meaning you have to make your data
//! structure 64 times bigger to need one extra step, and 4096 times
//! bigger to need two extra steps. This means that, while they still count
//! as O(log n), operations on all but really large data sets will run
//! close enough to O(1) that you won't usually notice.
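//!
//! To make the *log<sub>64</sub>* claim concrete, here's an illustrative
//! calculation (not part of the library) of how many levels a 64-way
//! branching tree needs for a given number of elements:
//!
//! ```
//! // Levels needed by a 64-way branching tree to hold n elements.
//! fn levels(mut n: u64) -> u32 {
//!     let mut levels = 0;
//!     while n > 1 {
//!         n = (n + 63) / 64; // one level of 64-way branching
//!         levels += 1;
//!     }
//!     levels
//! }
//!
//! assert_eq!(levels(64), 1);
//! assert_eq!(levels(64 * 64), 2); // 4096 elements: just one step more
//! assert_eq!(levels(1_000_000), 4); // a million elements: still only 4
//! ```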
//!
//! *O(n log n)* is the most expensive operation you'll see in this
//! library: it means that for every one of the *n* elements in your
//! data structure, you have to perform *log n* operations. In our
//! case, as noted above, this is often close enough to O(n) that it's
//! not usually as bad as it sounds, but even O(n) isn't cheap and the
//! cost still increases logarithmically, if slowly, as the size of
//! your data increases. O(n log n) basically means "are you sure you
//! need to do this?"
//!
//! *O(1)\** means 'amortised O(1),' which means that an operation
//! usually runs in constant time but will occasionally be more
//! expensive: for instance,
//! [`Vector::push_back`], if called in
//! sequence, will be O(1) most of the time but every 64th time it
//! will be O(log n), as it fills up its tail chunk and needs to
//! insert it into the tree. Please note that the O(1) with the
//! asterisk attached is not a common notation; it's just a convention
//! I've used in these docs to save myself from having to type
//! 'amortised' everywhere.
//!
//! ### Lists
//!
//! Lists are sequences of single elements which maintain the order in
//! which you inserted them. The only list in this library is
//! [`Vector`], which offers the best all round
//! performance characteristics: it's pretty good at everything, even
//! if there's always another kind of list that's better at something.
//!
//! | Type | Algorithm | Constraints | Order | Push | Pop | Split | Append | Lookup |
//! | --- | --- | --- | --- | --- | --- | --- | --- | --- |
//! | [`Vector<A>`] | [RRB tree][rrb-tree] | [`Clone`] | insertion | O(1)\* | O(1)\* | O(log n) | O(log n) | O(log n) |
//!
//! ### Maps
//!
//! Maps are mappings of keys to values, where the most common read
//! operation is to find the value associated with a given key. Maps
//! may or may not have a defined order. Any given key can only occur
//! once inside a map, and setting a key to a different value will
//! overwrite the previous value.
//!
//! | Type | Algorithm | Key Constraints | Order | Insert | Remove | Lookup |
//! | --- | --- | --- | --- | --- | --- | --- |
//! | [`HashMap<K, V>`] | [HAMT][hamt] | [`Clone`] + [`Hash`][std::hash::Hash] + [`Eq`] | undefined | O(log n) | O(log n) | O(log n) |
//! | [`OrdMap<K, V>`] | [B+tree][b+tree] | [`Clone`] + [`Ord`] | sorted | O(log n) | O(log n) | O(log n) |
//!
//! ### Sets
//!
//! Sets are collections of unique values, and may or may not have a
//! defined order. Their crucial property is that any given value can
//! only exist once in a given set.
//!
//! | Type | Algorithm | Constraints | Order | Insert | Remove | Lookup |
//! | --- | --- | --- | --- | --- | --- | --- |
//! | [`HashSet<A>`] | [HAMT][hamt] | [`Clone`] + [`Hash`][std::hash::Hash] + [`Eq`] | undefined | O(log n) | O(log n) | O(log n) |
//! | [`OrdSet<A>`] | [B+tree][b+tree] | [`Clone`] + [`Ord`] | sorted | O(log n) | O(log n) | O(log n) |
//!
//! ## In-place Mutation
//!
//! All of these data structures support in-place copy-on-write
//! mutation, which means that if you're the sole user of a data
//! structure, you can update it in place without taking the
//! performance hit of making a copy of the data structure before
//! modifying it (this is about an order of magnitude faster than
//! immutable operations, almost as fast as
//! [`std::collections`]'s mutable data structures).
//!
//! Thanks to [`Rc`][std::rc::Rc]'s reference counting, we are able to
//! determine whether a node in a data structure is being shared with
//! other data structures, or whether it's safe to mutate it in place.
//! When it's shared, we'll automatically make a copy of the node
//! before modifying it. The consequence of this is that cloning a
//! data structure becomes a lazy operation: the initial clone is
//! instant, and as you modify the cloned data structure it will clone
//! chunks only where you change them, so that if you change the
//! entire thing you will eventually have performed a full clone.
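//!
//! The standard library's [`Rc::make_mut`][std::rc::Rc::make_mut] follows the
//! same rule, and gives a compact illustration of the sole-owner case (again
//! a sketch of the idea, not of `imbl`'s internals):
//!
//! ```
//! use std::rc::Rc;
//!
//! let mut data = Rc::new(vec![1, 2, 3]);
//! let before = Rc::as_ptr(&data);
//! // Sole owner: the mutation happens in place, no copy is made.
//! Rc::make_mut(&mut data).push(4);
//! assert_eq!(Rc::as_ptr(&data), before); // same allocation as before
//! assert_eq!(*data, vec![1, 2, 3, 4]);
//! ```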
//!
//! This also gives us a couple of other optimisations for free:
//! implementations of immutable data structures in other languages
//! often have the idea of local mutation, like Clojure's transients
//! or Haskell's `ST` monad - a managed scope where you can treat an
//! immutable data structure like a mutable one, gaining a
//! considerable amount of performance because you no longer need to
//! copy your changed nodes for every operation, just the first time
//! you hit a node that's sharing structure. In Rust, we don't need to
//! think about this kind of managed scope; it's all taken care of
//! behind the scenes because of our low level access to the garbage
//! collector (which, in our case, is just a simple
//! [`Rc`](std::rc::Rc)).
//!
//! ## Thread Safety
//!
//! The data structures in `imbl` are thread safe by default, using
//! [`Arc`](std::sync::Arc). However, `imbl` also supports `Rc` as the pointer
//! type through the [`archery`] crate, just like `im-rc` in the original
//! `im` crate. If you prioritise speed over thread safety, you can use
//! [`GenericVector<T, shared_pointer::RcK>`](vector::GenericVector), which uses
//! the faster but non-thread-safe `Rc`, instead of the [`Vector`] type alias.
//! You can also define your own type aliases, and the same approach works for
//! the other data structures.
//!
//! ## Feature Flags
//!
//! `imbl` comes with optional support for the following crates through Cargo
//! feature flags. You can enable them in your `Cargo.toml` file like this:
//!
//! ```toml
//! [dependencies]
//! imbl = { version = "*", features = ["proptest", "serde", "bincode"] }
//! ```
//!
//! | Feature | Description |
//! | ------- | ----------- |
//! | [`proptest`](https://crates.io/crates/proptest) | Strategies for all `imbl` datatypes under a `proptest` namespace, e.g. `imbl::vector::proptest::vector()` |
//! | [`quickcheck`](https://crates.io/crates/quickcheck) | [`quickcheck::Arbitrary`](https://docs.rs/quickcheck/latest/quickcheck/trait.Arbitrary.html) implementations for all `imbl` datatypes |
//! | [`rayon`](https://crates.io/crates/rayon) | parallel iterator implementations for [`Vector`] |
//! | [`serde`](https://crates.io/crates/serde) | [`Serialize`](https://docs.rs/serde/latest/serde/trait.Serialize.html) and [`Deserialize`](https://docs.rs/serde/latest/serde/trait.Deserialize.html) implementations for all `imbl` datatypes |
//! | [`bincode`](https://crates.io/crates/bincode) | [`Encode`](https://docs.rs/bincode/latest/bincode/enc/trait.Encode.html) and [`Decode`](https://docs.rs/bincode/latest/bincode/de/trait.Decode.html) implementations for all `imbl` datatypes |
//! | [`arbitrary`](https://crates.io/crates/arbitrary/) | [`arbitrary::Arbitrary`](https://docs.rs/arbitrary/latest/arbitrary/trait.Arbitrary.html) implementations for all `imbl` datatypes |
//! | [`triomphe`](https://crates.io/crates/triomphe/) | Use [`triomphe::Arc`](https://docs.rs/triomphe/latest/triomphe/struct.Arc.html) for the default shared pointer. This is a drop-in replacement for [`std::sync::Arc`](https://doc.rust-lang.org/std/sync/struct.Arc.html) that is faster in some cases. |
//!
//! [rrb-tree]: https://infoscience.epfl.ch/record/213452/files/rrbvector.pdf
//! [hamt]: https://en.wikipedia.org/wiki/Hash_array_mapped_trie
//! [b+tree]: https://en.wikipedia.org/wiki/B%2B_tree
//! [cons-list]: https://en.wikipedia.org/wiki/Cons#Lists

#![forbid(rust_2018_idioms)]
#![deny(nonstandard_style)]
#![warn(unreachable_pub, missing_docs)]
#![deny(unsafe_code)]

#[cfg(test)]
#[macro_use]
extern crate pretty_assertions;

#[macro_use]
mod util;

mod config;
mod nodes;
mod sort;
mod sync;

#[macro_use]
mod ord;
pub use crate::ord::map as ordmap;
pub use crate::ord::set as ordset;

#[macro_use]
mod hash;
pub use crate::hash::map as hashmap;
pub use crate::hash::set as hashset;

#[macro_use]
pub mod vector;

pub mod iter;

#[cfg(any(test, feature = "proptest"))]
pub mod proptest;

#[cfg(any(test, feature = "serde"))]
#[doc(hidden)]
pub mod ser;

#[cfg(feature = "bincode")]
#[doc(hidden)]
pub mod bincode;

#[cfg(feature = "arbitrary")]
#[doc(hidden)]
pub mod arbitrary;

#[cfg(feature = "quickcheck")]
#[doc(hidden)]
pub mod quickcheck;

pub mod shared_ptr;

pub use crate::hashmap::{GenericHashMap, HashMap};
pub use crate::hashset::{GenericHashSet, HashSet};
pub use crate::ordmap::{GenericOrdMap, OrdMap};
pub use crate::ordset::{GenericOrdSet, OrdSet};
#[doc(inline)]
pub use crate::vector::{GenericVector, Vector};

#[cfg(test)]
mod test;

#[cfg(test)]
mod tests;
/// Update a value inside multiple levels of data structures.
///
/// This macro takes a [`Vector`], [`OrdMap`] or [`HashMap`],
/// a key or a series of keys, and a value, and returns the data structure with the
/// new value at the location described by the keys.
///
/// If one of the keys in the path doesn't exist, the macro will panic.
///
/// # Examples
///
/// ```
/// # #[macro_use] extern crate imbl;
/// # use std::sync::Arc;
/// # use imbl::Vector;
/// # fn main() {
/// let vec_inside_vec = vector![vector![1, 2, 3], vector![4, 5, 6]];
///
/// let expected = vector![vector![1, 2, 3], vector![4, 5, 1337]];
///
/// assert_eq!(expected, update_in![vec_inside_vec, 1 => 2, 1337]);
/// # }
/// ```
#[macro_export]
macro_rules! update_in {
    ($target:expr, $path:expr => $($tail:tt) => *, $value:expr ) => {{
        let inner = $target.get($path).expect("update_in! macro: key not found in target");
        $target.update($path, update_in!(inner, $($tail) => *, $value))
    }};

    ($target:expr, $path:expr, $value:expr) => {
        $target.update($path, $value)
    };
}

/// Get a value inside multiple levels of data structures.
///
/// This macro takes a [`Vector`], [`OrdMap`] or [`HashMap`],
/// along with a key or a series of keys, and returns the value at the location inside
/// the data structure described by the key sequence, or `None` if any of the keys didn't
/// exist.
///
/// # Examples
///
/// ```
/// # #[macro_use] extern crate imbl;
/// # use std::sync::Arc;
/// # use imbl::Vector;
/// # fn main() {
/// let vec_inside_vec: Vector<Vector<i64>> = vector![vector![1, 2, 3], vector![4, 5, 6]];
///
/// assert_eq!(Some(&6), get_in![vec_inside_vec, 1 => 2]);
/// # }
/// ```
#[macro_export]
macro_rules! get_in {
    ($target:expr, $path:expr => $($tail:tt) => * ) => {{
        $target.get($path).and_then(|v| get_in!(v, $($tail) => *))
    }};

    ($target:expr, $path:expr) => {
        $target.get($path)
    };
}

#[cfg(test)]
mod lib_test {
    #[test]
    fn update_in() {
        let vector = vector![1, 2, 3, 4, 5];
        assert_eq!(vector![1, 2, 23, 4, 5], update_in!(vector, 2, 23));
        let hashmap = hashmap![1 => 1, 2 => 2, 3 => 3];
        assert_eq!(
            hashmap![1 => 1, 2 => 23, 3 => 3],
            update_in!(hashmap, 2, 23)
        );
        let ordmap = ordmap![1 => 1, 2 => 2, 3 => 3];
        assert_eq!(ordmap![1 => 1, 2 => 23, 3 => 3], update_in!(ordmap, 2, 23));

        let vecs = vector![vector![1, 2, 3], vector![4, 5, 6], vector![7, 8, 9]];
        let vecs_target = vector![vector![1, 2, 3], vector![4, 5, 23], vector![7, 8, 9]];
        assert_eq!(vecs_target, update_in!(vecs, 1 => 2, 23));
    }

    #[test]
    fn get_in() {
        let vector = vector![1, 2, 3, 4, 5];
        assert_eq!(Some(&3), get_in!(vector, 2));
        let hashmap = hashmap![1 => 1, 2 => 2, 3 => 3];
        assert_eq!(Some(&2), get_in!(hashmap, &2));
        let ordmap = ordmap![1 => 1, 2 => 2, 3 => 3];
        assert_eq!(Some(&2), get_in!(ordmap, &2));

        let vecs = vector![vector![1, 2, 3], vector![4, 5, 6], vector![7, 8, 9]];
        assert_eq!(Some(&6), get_in!(vecs, 1 => 2));
    }
}