Sets and The Universal Type

Safely solving more problems with the same language tools

7 July 2015

Kevin Gillette

Sets in Go

Often implemented as map[T]struct{} or map[T]bool
The keys of the map are the elements of the set
Great for checking element presence

package main

import "fmt"

func main() {

    set := make(map[string]bool)
    set["a"] = true
    set["b"] = true

    // iterate over set elements (unordered)
    for elem := range set {
        fmt.Println(elem)
    }

    fmt.Println("c in set?", set["c"])

Sets (minimal theory)

A set is a collection of unique elements.

Membership testing: is element x in set A?
Other set operations (in terms of input sets A and B):

Union:                 elements in at least one of A or B
Intersection:          elements in both A and B
Symmetric Difference:  elements in either A or B but not both
Difference:            elements in A but not B

Except for Set Difference, all of these operations are associative and commutative.

Associativity: ((a + b) + c) == (a + (b + c))
Commutativity:       (a + b) == (b + a)

Convenient Implementation

package main

import (
	"fmt"
	"sort"
)

type StringSet map[string]bool

// Intersect returns a copy containing the intersection of both sets.
func (s StringSet) Intersect(t StringSet) StringSet {
    m := make(StringSet)
    for k := range s {
        if t[k] {
            m[k] = true
        }
    }
    return m
}

func main() {
    x := StringSet{"a": true, "b": true}
    y := StringSet{"b": true, "c": true}
    z := x.Intersect(y)
    fmt.Printf("%v & %v == %v\n", x, y, z)
}


func (s StringSet) String() string {
	lst := make(sort.StringSlice, 0, len(s))
	for k, ok := range s {
		if ok {
			lst = append(lst, k)
		}
	}
	sort.Sort(lst)
	return fmt.Sprint(lst)
}

Efficient/Flexible Implementation

package main

import (
	"fmt"
	"sort"
)

type StringSet map[string]bool

// Intersect updates s in-place to contain the intersection of both sets.
func (s StringSet) IntersectInPlace(t StringSet) {
    for k := range s {
        if !t[k] {
            delete(s, k)
        }
    }
}

func (s StringSet) Copy() StringSet {
    m := make(StringSet, len(s))
    for k := range s {
        m[k] = true
    }
    return m
}

func main() {
    x := StringSet{"a": true, "b": true} // try with {"c": false}
    y := StringSet{"b": true, "c": true}
    z := x.Copy()
    z.IntersectInPlace(y)
    fmt.Printf("%v & %v == %v\n", x, y, z)
}


func (s StringSet) String() string {
	lst := make(sort.StringSlice, 0, len(s))
	for k, ok := range s {
		if ok {
			lst = append(lst, k)
		}
	}
	sort.Sort(lst)
	return fmt.Sprint(lst)
}

Implementation Compromises (1/3)

Element type is theoretically inconsequential, but can practical consequences:

map[T]bool requires discipline in use (three possible element states)
map[T]struct{} is simpler, more compact in memory, but needs more syntax

Implementation Compromises (2/3)

We can choose to give the user convenience or control.

We can provide both at the expense of code/API volume.

We can alternatively provide both at the expense of API complexity:

func (s StringSet) Intersect(t StringSet, inplace bool) StringSet

Implementation Compromises (3/3)

Writing ad-hoc algorithms may be error prone.

It's very easy to under-test code.

Test coverage reports would not have helped find the previous issue.

There's no known technique for writing type-safe, polymorphic map-based set algorithms within the current Go language.

They may be impossible.

Solutions? (1/3)

We can settle on one or a small number of well-tested concrete implementations.

type Set map[string]struct{}

High overhead if your data isn't already in string-form.

API and binary bloat if you want to support all the builtin types, and...

Sets of structs, pointers, channels, and other non-builtin types can be very useful.

Solutions? (2/3)

We can provide a reusable library based on map[interface{}]struct{}

Avoids code/binary bloat
Not type-safe
Not convenient: type assertions are required

We can use various reflect package techniques to avoid the explicit type assertions.

Still not type-safe.
Merely shifts the inconvenience to other parts of the user code.

Solutions? (3/3)

We can use code-generation.

Type safe, adaptable, but many programmers will avoid codegen tools.
Binary bloat can be a problem.
Bug-fixes and improvements are difficult to distribute.

Performance?

map-based sets in Go are not competitive with other languages' "native" sets.

Go maps have excellent element-access performance, but the undefined iteration order greatly restricts the quality of the algorithms that can be used for the other set operations.

A custom ordered-map implementation could eliminate some performance barriers, but would likely have the same tradeoffs as regular maps.

Can we avoid these compromises and performance issues altogether?

To answer this question, we can look a Go stdlib package that faced a similar problem, and devised a simple, yet elegant solution.

Sorting in Go

A polymorphic, type-safe, single algorithm implementation. But how?

Interfaces, implemented on the collection, not the element
The methods' parameters and return values are all concrete types (important!)

type Interface interface {
    Len() int
    Less(i, j int) bool
    Swap(i, j int)
}

golang.org/pkg/sort/#Interface

The Universal Type (1/2)

Sorting is necessarily based around the concept of order:

Any fully-ordered collection can be enumerated
In Go, this is expressed using the int type
int represents an index; it is a proxy for the real value (brilliant!)

type IntSlice []int

func (p IntSlice) Len() int           { return len(p) }
func (p IntSlice) Less(i, j int) bool { return p[i] < p[j] }
func (p IntSlice) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }

golang.org/pkg/sort/#IntSlice

The Universal Type (2/2)

An ordering can be defined on sets too:

Nearly all practical sets have order defined on the element type anyway
Example: sets of ints, sets of strings, etc.

Sets in Go: a different approach (1/4)

github.com/xtgo/set

Every sort.Interface implementation is implicitly a set implementation.
sort.Sort and set.Uniq sanitize arbitrary inputs. None, either, or both may be used depending on how "prepared" the inputs already are.
set.Union and friends implement the set operations.
Absolutely no reflection, type assertions, empty interfaces, or code generation.

Sets in Go: a different approach (2/4)

func SymDiff(data sort.Interface, pivot int) (size int)

pivot delimits the two concatenated input sets.

package main

import (
	"fmt"
	"sort"

	"github.com/xtgo/set"
)

func main() {

    a, b := sort.IntSlice{2, 3, 4}, sort.IntSlice{1, 2, 4, 5}
    data := append(a, b...)

    size := set.SymDiff(data, len(a))
    c := data[:size]
    fmt.Println(c)

Set semantics are merely slice semantics.

Sets in Go: a different approach (3/4)

These op functions expect sorted, duplicate-free concatenated inputs...
... rearrange elements using efficient, in-place merge algorithms...
... return the size of the resulting set.
The resulting set will also be sorted and duplicate-free.

Sets in Go: a different approach (4/4)

type Op func(data sort.Interface, pivot int) (size int)

func Apply(op Op, data sort.Interface, pivots []int) (size int)

Concurrent, adaptive application of op to all of the input sets.

op must be mathematically associative; doc examples show set.Diff approach.

package main

import (
	"fmt"
	"sort"

	"github.com/xtgo/set"
)

func main() {

    a := sort.IntSlice{2, 3, 5, 7, 11, 13}   // primes
    b := sort.IntSlice{0, 1, 2, 3, 5, 8, 13} // Fibonacci numbers
    c := sort.IntSlice{1, 2, 5, 14, 42}      // Catalan numbers
    data := append(append(a, b...), c...)
    pivots := set.Pivots(len(a), len(b), len(c))

    size := set.Apply(set.Inter, data, pivots)
    data = data[:size]
    fmt.Println(data)

Good in theory, but in practice? (1/3)

Doesn't allocate internally (map implementations do)
The control and flexibility of slice semantics apply equally to slice-based sets
Much faster than map-based algorithms for set-to-set operations
Map-based algos get slower with set size; no so for the slice-based algorithms

Good in theory, but in practice? (2/3)

set.Union and set.SymDiff currently use multi-pass implementations with reasonable speed -- these will get replaced with much faster, single pass techniques like set.Inter and set.Diff already use
All of these algorithms will be made even faster by baking in binary search to detect runs

Good in theory, but in practice? (3/3)

BenchmarkInter32            199 ns/op     1 allocs/op
BenchmarkMapInter32        5530 ns/op     6 allocs/op
BenchmarkInter64K        214941 ns/op     1 allocs/op
BenchmarkMapInter64K   10522886 ns/op   370 allocs/op

BenchmarkMap* correspond to straightforward map[int]struct{} algorithms
1x alloc is from test, not library code. BenchmarkInter* correspond to algorithms based around sort.Interface

Additional Possibilities: shuffle algorithm

package main

import (
	"fmt"
	"math/rand"
	"sort"
)

func Shuffle(data sort.Interface) { sort.Sort(proxy{data, rand.Perm(data.Len())}) }

type proxy struct {
    data sort.Interface
    sort.IntSlice
}

func (s proxy) Swap(i, j int) { s.data.Swap(i, j); s.IntSlice.Swap(i, j) }

func main() {
    data := sort.StringSlice{"zero", "one", "two", "three", "four"}
    Shuffle(data)
    fmt.Println(data)
}

Additional Possibilities: container/heap

container/heap uses of interface{} could have been pivot and size ints
Pivots would allow batch pushes and pops -- potentially much more efficient
Type assertions and redundant methods go away
Thus, all sort.Interface implementations could implicitly be heap implementations

Additional Possibilities

Many data-structures with a definable order could have sort-style algorithms.

[]int is a good proxy data-structure that can be used to form more complex algorithms on top of "concrete method" interfaces.

ex: merge sort and specialized sorts API-compatible with sort.Sort

Thanks (in chronological order)

Cory LaNou
Luna Duclos
Dave Cheney