<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2025-10-04T23:51:24+00:00</updated><id>/feed.xml</id><title type="html">mohitd’s Blog</title><subtitle>I sometimes forget how things work so I write them down here</subtitle><entry><title type="html">Writing Extensible and Modular Frameworks in C++</title><link href="/cpp-extensible-frameworks.html" rel="alternate" type="text/html" title="Writing Extensible and Modular Frameworks in C++" /><published>2025-10-03T00:00:00+00:00</published><updated>2025-10-03T00:00:00+00:00</updated><id>/cpp-extensible-frameworks</id><content type="html" xml:base="/cpp-extensible-frameworks.html"><![CDATA[<p>When we’re working in industry, we often have to build frameworks to consolidate duplicated code for re-use and help extend our code to future use-cases so we spend less time on integration and more time on business logic. When I work on completely novel problems, it’s often faster to work on the business logic without worrying about building a framework first since my priority is to get something working. Trying to generalize too early in an unknown problem space often causes more problems than it solves. Eventually, we converge to a point where the cost of properly designing a framework is lower than the cost of continuing to move forward with bespoke solutions for each problem.</p>

<p>As an example, in robotics, at the lowest level we usually have motion controllers operating on feedback loops on the order of hundreds of Hertz, but there’s always some higher level that performs collision-aware local motion planning on top of those controllers. Going a layer above that, we often have planners or behaviors or actions (or whatever name they’re given) that compose different local behaviors together via an interface. Starting out, getting the robot up and running with some planners is more important than trying to build a generic framework for those not-yet-existent planners. But after some critical point, building the planning framework helps accelerate adding new planners or fixing bugs in the existing ones.</p>

<p>There’s never a right answer when building frameworks: it all depends on what the goal is and the state of the architecture. For example, do we know all of the plugins at compile time? Or do we not even know which plugins exist for a single invocation of our program? One example use-case for the latter would be graphics: if we’re running on different Linux systems, we might want to load one kind of graphics engine vs. another depending on the specifics of the system. If we’re building software for a robot, then we may want to load different libraries depending on the specific kinds of sensors.</p>

<p>The intention of this post is to explore some of those different kinds of approaches to extensible architectures by building a little Linux system monitor application as a motivating toy example. This system monitor will print out some useful information like CPU usage, CPU temperature, RAM usage, and uptime that will refresh at a fixed rate. We’ll start with building the system monitor by getting everything working and printing to the screen first. Then we’ll start adding in more considerations as we go and we’ll see a few different options on how we can modify the existing architecture to make it more extensible. For example, suppose we want to isolate the executable that runs the monitors from the monitors themselves but we can’t modify the monitor runner; how could we modify our architecture to support that use-case? (That’s a sneak peek towards the direction we’ll go!) We’ll also look at a few ways to accomplish the same kind of extensibility using purely compile-time constructs.</p>

<p>Keep in mind that the focus is on the extensibility, not the exact implementation details of the system monitor, so I’ll be a bit lax about the implementations of the individual monitors, e.g., skipping proper error handling, testing, and documentation.</p>

<p>Aside: my main motivation for writing this is my mild disappointment in the kinds of “extensible” architectures that I see in all kinds of code. The primary authors of these frameworks and libraries write in their READMEs or in threads that “of course the framework is extensible and modular!” but when I look at the effort it takes to create a new subcomponent, it requires things like “remember to add your component to this giant global registry!”. This is not extensible to me, and there are better options that we’ll explore!</p>

<p>All of the code will be available on my GitHub <a href="https://github.com/mohitd/monitr">here</a>!</p>

<h1 id="linux-system-monitor">Linux System Monitor</h1>

<p>Now let’s start building our system monitor! We’ll be monitoring CPU usage, RAM usage, CPU temperature, and uptime. When the monitor runs, we’ll print and format all of these out to the terminal and refresh those values at a fixed rate.</p>

<h1 id="mvp">MVP</h1>

<p>The first thing we’ll do is set up our minimum viable product (MVP). If we were developing a prototype or a small-scope throwaway project, then starting with a barebones MVP without a super strict design is often the fastest way to get something working. We’ll follow that same route to start and then generalize it. That being said, even for this MVP, we’ll still group logical blocks into functions. Let’s start with setting up our <code class="language-plaintext highlighter-rouge">main</code> function and refresh rate (1s):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">clear_screen</span><span class="p">()</span> <span class="p">{</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"</span><span class="se">\033</span><span class="s">[2J</span><span class="se">\033</span><span class="s">[1;1H"</span><span class="p">;</span> <span class="p">}</span>

<span class="kt">int</span> <span class="n">main</span><span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">char</span><span class="o">**</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono_literals</span><span class="p">;</span>

    <span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">clear_screen</span><span class="p">();</span>

        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"---- Linux System Monitor ----</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
        <span class="c1">// monitor stuff goes here!</span>

        <span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="mx">1s</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As the name implies, <code class="language-plaintext highlighter-rouge">clear_screen()</code> clears the screen using ANSI escape sequences: <code class="language-plaintext highlighter-rouge">\033[2J</code> clears the terminal and <code class="language-plaintext highlighter-rouge">\033[1;1H</code> moves the cursor back to the top-left, so each iteration overwrites the previous monitor values.</p>

<p>Let’s start with the CPU monitor! On most Linux systems, information about the CPU is in the <code class="language-plaintext highlighter-rouge">/proc/stat</code> file. We’re going to define a helper struct to match the format of this file so we can simply read it in using normal file stream operations. (On a Linux system, these aren’t actually “files” like files on disk but “virtual” files that live in memory so it’s completely fine to open and read from these at a “fast” rate since the OS isn’t actually reading from the disk. It fits into the Linux philosophy that “everything is a file”!)</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">CpuTimes</span> <span class="p">{</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="n">user</span><span class="p">{};</span>     <span class="c1">// Time spent in user mode.</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="n">nice</span><span class="p">{};</span>     <span class="c1">// Time spent in user mode with low priority (nice).</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="n">system</span><span class="p">{};</span>   <span class="c1">// Time spent in system mode.</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="n">idle</span><span class="p">{};</span>     <span class="c1">// Time spent in the idle task.</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="n">iowait</span><span class="p">{};</span>   <span class="c1">// Time waiting for I/O to complete.</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="n">irq</span><span class="p">{};</span>      <span class="c1">// Time servicing interrupts.</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="n">softirq</span><span class="p">{};</span>  <span class="c1">// Time servicing softirqs.</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="n">steal</span><span class="p">{};</span>    <span class="c1">// Stolen time</span>
<span class="p">};</span>

<span class="n">CpuTimes</span> <span class="n">read_cpu_times</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">CpuTimes</span> <span class="n">times</span><span class="p">{};</span>
    <span class="n">std</span><span class="o">::</span><span class="n">ifstream</span> <span class="n">stat_file</span><span class="p">(</span><span class="s">"/proc/stat"</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">stat_file</span><span class="p">.</span><span class="n">is_open</span><span class="p">())</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">times</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">line</span><span class="p">;</span>
    <span class="n">std</span><span class="o">::</span><span class="n">getline</span><span class="p">(</span><span class="n">stat_file</span><span class="p">,</span> <span class="n">line</span><span class="p">);</span>
    <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">cpu_label</span><span class="p">;</span>
    <span class="n">std</span><span class="o">::</span><span class="n">stringstream</span> <span class="n">ss</span><span class="p">(</span><span class="n">line</span><span class="p">);</span>
    <span class="n">ss</span> <span class="o">&gt;&gt;</span> <span class="n">cpu_label</span> <span class="o">&gt;&gt;</span> <span class="n">times</span><span class="p">.</span><span class="n">user</span> <span class="o">&gt;&gt;</span> <span class="n">times</span><span class="p">.</span><span class="n">nice</span> <span class="o">&gt;&gt;</span> <span class="n">times</span><span class="p">.</span><span class="n">system</span> <span class="o">&gt;&gt;</span> <span class="n">times</span><span class="p">.</span><span class="n">idle</span>
            <span class="o">&gt;&gt;</span> <span class="n">times</span><span class="p">.</span><span class="n">iowait</span> <span class="o">&gt;&gt;</span> <span class="n">times</span><span class="p">.</span><span class="n">irq</span> <span class="o">&gt;&gt;</span> <span class="n">times</span><span class="p">.</span><span class="n">softirq</span> <span class="o">&gt;&gt;</span> <span class="n">times</span><span class="p">.</span><span class="n">steal</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">times</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now we’ll need to compute the actual usage over an interval of time. One naïve way to calculate this is to look at the percentage of time that all CPUs are not idle (one minus the idle time divided by the total time). The raw values are in clock ticks (<code class="language-plaintext highlighter-rouge">USER_HZ</code>, typically 100 ticks per second), but since we take a ratio of tick deltas, the units cancel; we just multiply by 100 to express the fraction as a percentage. Since we’re already refreshing the screen at 1s, we’ll sample at that same rate. Let’s wrap this in a function to calculate the CPU usage given a previous and current set of CPU times.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span> <span class="nf">calculate_cpu_usage</span><span class="p">(</span><span class="k">const</span> <span class="n">CpuTimes</span><span class="o">&amp;</span> <span class="n">prev</span><span class="p">,</span> <span class="k">const</span> <span class="n">CpuTimes</span><span class="o">&amp;</span> <span class="n">current</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">const</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">prev_idle</span> <span class="o">=</span> <span class="n">prev</span><span class="p">.</span><span class="n">idle</span> <span class="o">+</span> <span class="n">prev</span><span class="p">.</span><span class="n">iowait</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">idle</span> <span class="o">=</span> <span class="n">current</span><span class="p">.</span><span class="n">idle</span> <span class="o">+</span> <span class="n">current</span><span class="p">.</span><span class="n">iowait</span><span class="p">;</span>

    <span class="k">const</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">prev_non_idle</span> <span class="o">=</span> <span class="n">prev</span><span class="p">.</span><span class="n">user</span> <span class="o">+</span> <span class="n">prev</span><span class="p">.</span><span class="n">nice</span> <span class="o">+</span> <span class="n">prev</span><span class="p">.</span><span class="n">system</span>
                                    <span class="o">+</span> <span class="n">prev</span><span class="p">.</span><span class="n">irq</span> <span class="o">+</span> <span class="n">prev</span><span class="p">.</span><span class="n">softirq</span> <span class="o">+</span> <span class="n">prev</span><span class="p">.</span><span class="n">steal</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">non_idle</span> <span class="o">=</span> <span class="n">current</span><span class="p">.</span><span class="n">user</span> <span class="o">+</span> <span class="n">current</span><span class="p">.</span><span class="n">nice</span> <span class="o">+</span> <span class="n">current</span><span class="p">.</span><span class="n">system</span>
                               <span class="o">+</span> <span class="n">current</span><span class="p">.</span><span class="n">irq</span> <span class="o">+</span> <span class="n">current</span><span class="p">.</span><span class="n">softirq</span> <span class="o">+</span> <span class="n">current</span><span class="p">.</span><span class="n">steal</span><span class="p">;</span>

    <span class="k">const</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">prev_total</span> <span class="o">=</span> <span class="n">prev_idle</span> <span class="o">+</span> <span class="n">prev_non_idle</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">total</span> <span class="o">=</span> <span class="n">idle</span> <span class="o">+</span> <span class="n">non_idle</span><span class="p">;</span>

    <span class="k">const</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">total_d</span> <span class="o">=</span> <span class="n">total</span> <span class="o">-</span> <span class="n">prev_total</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">idle_d</span> <span class="o">=</span> <span class="n">idle</span> <span class="o">-</span> <span class="n">prev_idle</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">total_d</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mf">0.0</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// convert the idle fraction into a percentage</span>
    <span class="k">return</span> <span class="mf">100.</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span> <span class="o">-</span> <span class="p">(</span><span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span><span class="p">(</span><span class="n">idle_d</span><span class="p">)</span> <span class="o">/</span> <span class="n">total_d</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Let’s incorporate this into the <code class="language-plaintext highlighter-rouge">main</code> function. We’ll need to keep track of the previous CPU times and detect when we’re executing the loop for the first time so we don’t print a meaningless usage value before we have two samples.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">char</span><span class="o">**</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono_literals</span><span class="p">;</span>

    <span class="n">CpuTimes</span> <span class="n">prev_times</span><span class="p">{};</span>
    <span class="kt">bool</span> <span class="n">is_first_loop</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
    <span class="c1">//...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now we’ll need some special logic inside of the refresh loop to keep track of the previous and current CPU times and set <code class="language-plaintext highlighter-rouge">is_first_loop</code>.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">clear_screen</span><span class="p">();</span>

    <span class="k">const</span> <span class="n">CpuTimes</span> <span class="n">current_times</span> <span class="o">=</span> <span class="n">read_cpu_times</span><span class="p">();</span>
    <span class="k">const</span> <span class="kt">double</span> <span class="n">cpu_usage</span> <span class="o">=</span> <span class="n">calculate_cpu_usage</span><span class="p">(</span><span class="n">prev_times</span><span class="p">,</span> <span class="n">current_times</span><span class="p">);</span>

    <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"---- Linux System Monitor ----</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">is_first_loop</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">format</span><span class="p">(</span><span class="s">"{:&lt;12} Calculating...</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">"CPU Usage:"</span><span class="p">);</span>
        <span class="n">is_first_loop</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">format</span><span class="p">(</span><span class="s">"{:&lt;12} {:&gt;3.1f} %</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
                                    <span class="s">"CPU Usage:"</span><span class="p">,</span> <span class="n">cpu_usage</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="n">prev_times</span> <span class="o">=</span> <span class="n">current_times</span><span class="p">;</span>
    <span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="mx">1s</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We fetch the current CPU times and use the previous one to calculate the CPU usage but if it’s the first time in the loop, <code class="language-plaintext highlighter-rouge">prev_times</code> will be empty so we skip over it. (This could be more efficient if we don’t bother calculating the CPU usage if we know it’s the first time in the loop, but we’ll make that minor optimization in the later sections.) Note we’re using the new <code class="language-plaintext highlighter-rouge">std::format</code> in C++20 but everything we’re doing can be accomplished by using stream operators or even <code class="language-plaintext highlighter-rouge">printf</code>. We’ll do some minor formatting like <code class="language-plaintext highlighter-rouge">{:&lt;12}</code> just to make the output line up and look a bit more tabular but it’s just for aesthetics.</p>

<p>The other monitors are even simpler since we <em>can</em> get instantaneous results. For example, let’s compute the RAM usage. Similar to the CPU usage, there’s a file <code class="language-plaintext highlighter-rouge">/proc/meminfo</code> with the RAM usage that we can read in a similar fashion as the CPU times: we’ll define a struct with the data we need and just read the relevant parts of that file into that struct.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">RamInfo</span> <span class="p">{</span>
    <span class="kt">double</span> <span class="n">total_gb</span><span class="p">{};</span>
    <span class="kt">double</span> <span class="n">used_gb</span><span class="p">{};</span>
    <span class="kt">double</span> <span class="n">percentage</span><span class="p">{};</span>
<span class="p">};</span>
</code></pre></div></div>

<p>This file is a bit more complicated than the one with CPU statistics but still easy to parse.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">RamInfo</span> <span class="nf">get_ram_usage</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">ifstream</span> <span class="n">meminfo_file</span><span class="p">(</span><span class="s">"/proc/meminfo"</span><span class="p">);</span>
    <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">line</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">mem_total</span><span class="p">{};</span>
    <span class="kt">long</span> <span class="n">mem_available</span><span class="p">{};</span>

    <span class="n">RamInfo</span> <span class="n">info</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">meminfo_file</span><span class="p">.</span><span class="n">is_open</span><span class="p">())</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">info</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">while</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">getline</span><span class="p">(</span><span class="n">meminfo_file</span><span class="p">,</span> <span class="n">line</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">std</span><span class="o">::</span><span class="n">stringstream</span> <span class="n">ss</span><span class="p">(</span><span class="n">line</span><span class="p">);</span>
        <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">key</span><span class="p">;</span>
        <span class="kt">long</span> <span class="n">value</span><span class="p">;</span>
        <span class="n">ss</span> <span class="o">&gt;&gt;</span> <span class="n">key</span> <span class="o">&gt;&gt;</span> <span class="n">value</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">key</span> <span class="o">==</span> <span class="s">"MemTotal:"</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">mem_total</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">key</span> <span class="o">==</span> <span class="s">"MemAvailable:"</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">mem_available</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">mem_total</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">mem_available</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">const</span> <span class="kt">long</span> <span class="n">mem_used</span> <span class="o">=</span> <span class="n">mem_total</span> <span class="o">-</span> <span class="n">mem_available</span><span class="p">;</span>
        <span class="k">static</span> <span class="k">constexpr</span> <span class="kt">double</span> <span class="n">KB_TO_GB</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1024.0</span> <span class="o">*</span> <span class="mf">1024.0</span><span class="p">);</span>
        <span class="n">info</span><span class="p">.</span><span class="n">total_gb</span> <span class="o">=</span> <span class="n">mem_total</span> <span class="o">*</span> <span class="n">KB_TO_GB</span><span class="p">;</span>
        <span class="n">info</span><span class="p">.</span><span class="n">used_gb</span> <span class="o">=</span> <span class="n">mem_used</span> <span class="o">*</span> <span class="n">KB_TO_GB</span><span class="p">;</span>
        <span class="n">info</span><span class="p">.</span><span class="n">percentage</span> <span class="o">=</span> <span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span><span class="p">(</span><span class="n">mem_used</span><span class="p">)</span> <span class="o">/</span> <span class="n">mem_total</span> <span class="o">*</span> <span class="mf">100.0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">info</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We read the file line-by-line until we find the two rows we’re looking for: <code class="language-plaintext highlighter-rouge">"MemTotal"</code> and <code class="language-plaintext highlighter-rouge">"MemAvailable"</code>. From those, we compute the memory used and convert it into gigabytes (the file reports values in kilobytes). Finally, we compute the percentage of used RAM and return the populated struct. It’s straightforward to incorporate that into the <code class="language-plaintext highlighter-rouge">main</code> function with some formatting to make it look pretty.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// after the CPU usage information</span>
<span class="k">const</span> <span class="n">RamInfo</span> <span class="n">ram_info</span> <span class="o">=</span> <span class="n">get_ram_usage</span><span class="p">();</span>

<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">format</span><span class="p">(</span><span class="s">"{:&lt;12} {:.1f} / {:.1f} GB ({:.1f} %)</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
                            <span class="s">"RAM Usage:"</span><span class="p">,</span> <span class="n">ram_info</span><span class="p">.</span><span class="n">used_gb</span><span class="p">,</span> <span class="n">ram_info</span><span class="p">.</span><span class="n">total_gb</span><span class="p">,</span>
                            <span class="n">ram_info</span><span class="p">.</span><span class="n">percentage</span><span class="p">);</span>
</code></pre></div></div>

<p>Next up is CPU temperature which is even easier since the corresponding file has only a single value: the CPU temperature (for a given thermal zone) in milli-Celsius.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span> <span class="nf">get_cpu_temperature</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">filesystem</span><span class="o">::</span><span class="n">path</span> <span class="n">path</span> <span class="o">=</span> <span class="s">"/sys/class/thermal/thermal_zone0/temp"</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">filesystem</span><span class="o">::</span><span class="n">exists</span><span class="p">(</span><span class="n">path</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">std</span><span class="o">::</span><span class="n">ifstream</span> <span class="n">temp_file</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">temp_file</span><span class="p">.</span><span class="n">is_open</span><span class="p">())</span> <span class="p">{</span>
            <span class="kt">double</span> <span class="n">temp</span><span class="p">;</span>
            <span class="n">temp_file</span> <span class="o">&gt;&gt;</span> <span class="n">temp</span><span class="p">;</span>
            <span class="c1">// The value is typically in millidegrees Celsius</span>
            <span class="k">return</span> <span class="n">temp</span> <span class="o">/</span> <span class="mf">1000.0</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="o">-</span><span class="mf">1.0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In this case, we’re using thermal zone 0, but we could pick other thermal zones or print all of them; for the sake of this example, we’ll just report the first one (falling back to a sentinel of -1.0 if the file isn’t available). Incorporating this into the <code class="language-plaintext highlighter-rouge">main</code> function is even more straightforward.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// after the CPU temp info</span>
<span class="k">const</span> <span class="kt">double</span> <span class="n">cpu_temp</span> <span class="o">=</span> <span class="n">get_cpu_temperature</span><span class="p">();</span>

<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">format</span><span class="p">(</span><span class="s">"{:&lt;12} {:.1f} °C</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">"CPU Temp:"</span><span class="p">,</span> <span class="n">cpu_temp</span><span class="p">);</span>
</code></pre></div></div>
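<p>As an aside, if we did want to report <em>every</em> thermal zone rather than just zone 0, a small sketch like the following would do it. This is my own illustrative helper, not part of the monitor above; the <code class="language-plaintext highlighter-rouge">read_all_thermal_zones</code> name and the overridable root parameter are assumptions, and the directory layout is the common sysfs one, which can vary across kernels and boards.</p>

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Sketch: enumerate every thermal zone under /sys/class/thermal instead of
// hard-coding thermal_zone0. Returns (zone name, temperature in Celsius) pairs.
// The root parameter exists mainly so the function is testable.
std::vector<std::pair<std::string, double>> read_all_thermal_zones(
        const std::filesystem::path& root = "/sys/class/thermal") {
    std::vector<std::pair<std::string, double>> zones;
    if (!std::filesystem::exists(root)) {
        return zones;
    }
    for (const auto& entry : std::filesystem::directory_iterator(root)) {
        const std::string name = entry.path().filename().string();
        // thermal zones are named thermal_zone0, thermal_zone1, ...
        if (name.rfind("thermal_zone", 0) != 0) {
            continue;
        }
        std::ifstream temp_file(entry.path() / "temp");
        double milli_celsius = 0.0;
        // the file stores millidegrees Celsius, same as before
        if (temp_file >> milli_celsius) {
            zones.emplace_back(name, milli_celsius / 1000.0);
        }
    }
    return zones;
}
```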

<p>Finally we want to print the uptime. Unsurprisingly, this is also in a file <code class="language-plaintext highlighter-rouge">/proc/uptime</code>!</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="nf">format_uptime</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">ifstream</span> <span class="n">uptime_file</span><span class="p">(</span><span class="s">"/proc/uptime"</span><span class="p">);</span>
    <span class="kt">double</span> <span class="n">uptime_seconds_val</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">uptime_file</span><span class="p">.</span><span class="n">is_open</span><span class="p">())</span> <span class="p">{</span>
        <span class="k">return</span> <span class="s">"N/A"</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">uptime_file</span> <span class="o">&gt;&gt;</span> <span class="n">uptime_seconds_val</span><span class="p">;</span>

    <span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="p">;</span>
    <span class="n">seconds</span> <span class="n">total_seconds</span><span class="p">(</span><span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">long</span><span class="o">&gt;</span><span class="p">(</span><span class="n">uptime_seconds_val</span><span class="p">));</span>

    <span class="k">const</span> <span class="k">auto</span> <span class="n">d</span> <span class="o">=</span> <span class="n">duration_cast</span><span class="o">&lt;</span><span class="n">days</span><span class="o">&gt;</span><span class="p">(</span><span class="n">total_seconds</span><span class="p">);</span>
    <span class="n">total_seconds</span> <span class="o">-=</span> <span class="n">d</span><span class="p">;</span>
    <span class="k">const</span> <span class="k">auto</span> <span class="n">h</span> <span class="o">=</span> <span class="n">duration_cast</span><span class="o">&lt;</span><span class="n">hours</span><span class="o">&gt;</span><span class="p">(</span><span class="n">total_seconds</span><span class="p">);</span>
    <span class="n">total_seconds</span> <span class="o">-=</span> <span class="n">h</span><span class="p">;</span>
    <span class="k">auto</span> <span class="n">m</span> <span class="o">=</span> <span class="n">duration_cast</span><span class="o">&lt;</span><span class="n">minutes</span><span class="o">&gt;</span><span class="p">(</span><span class="n">total_seconds</span><span class="p">);</span>
    <span class="n">total_seconds</span> <span class="o">-=</span> <span class="n">m</span><span class="p">;</span>
    <span class="k">auto</span> <span class="n">s</span> <span class="o">=</span> <span class="n">total_seconds</span><span class="p">;</span>

    <span class="n">std</span><span class="o">::</span><span class="n">stringstream</span> <span class="n">ss</span><span class="p">;</span>
    <span class="n">ss</span> <span class="o">&lt;&lt;</span> <span class="n">d</span><span class="p">.</span><span class="n">count</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="s">"d "</span> <span class="o">&lt;&lt;</span> <span class="n">h</span><span class="p">.</span><span class="n">count</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="s">"h "</span> <span class="o">&lt;&lt;</span> <span class="n">m</span><span class="p">.</span><span class="n">count</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="s">"m "</span>
       <span class="o">&lt;&lt;</span> <span class="n">s</span><span class="p">.</span><span class="n">count</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="s">"s"</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">ss</span><span class="p">.</span><span class="n">str</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The file’s first value is the floating-point uptime of the system in seconds, which is what we need (the second value is the idle time). We’ll read that in and then perform some arithmetic to convert it into a nice days/hours/minutes/seconds format using the chrono library. Since the result is already a formatted string, printing it in the <code class="language-plaintext highlighter-rouge">main</code> function is trivial.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// after the CPU temp info</span>
<span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">uptime</span> <span class="o">=</span> <span class="n">format_uptime</span><span class="p">();</span>
<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">format</span><span class="p">(</span><span class="s">"{:&lt;12} {}</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">"Uptime:"</span><span class="p">,</span> <span class="n">uptime</span><span class="p">);</span>
</code></pre></div></div>

<p>Running this, we’ll get an output like the following that’ll refresh every second.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---- Linux System Monitor ----
CPU Usage:   1.5 %
RAM Usage:   5.0 / 15.1 GB (33.1 %)
CPU Temp:    36.0 °C
Uptime:      0d 1h 13m 24s
------------------------------------
</code></pre></div></div>

<p>And that’s it! We’ve completed the MVP of our Linux system monitor. If this were a prototype or some intentional throwaway work, then what we’ve written is completely acceptable: we have some degree of modularity from abstracting the core logic of the monitor computations into functions and then ordering those calls in the loop itself (maybe with some block comments acting as separators). In the case of a prototype, intentional throwaway work, or even starting a project in an unfamiliar domain, over-generalization and over-abstraction tends to slow down progress. A prototype will be replaced with a better, production version given the learnings from the prototype, or scrapped entirely. With intentional throwaway work, the focus is on getting something working as soon as possible, so extra work only prolongs the timeline. Working in a novel and unfamiliar domain is a combination of the previous two cases: we don’t know exactly what we want or which interfaces are actually important to construct, so we want to move fast, gather learnings, and replace the system with something better as we learn more.</p>

<h1 id="runtime-polymorphic-interface">Runtime Polymorphic Interface</h1>

<p>Now that we have an initial MVP that works, suppose we want to take this Linux system monitor beyond the prototyping stage, so we now have a need to actually generalize. The primary criticism of the current implementation that makes it difficult to generalize is that the computation, printing, and state (for the CPU monitor) all live in the <code class="language-plaintext highlighter-rouge">main</code> function, which can easily get bloated with computation, printing, and state from other monitors, turning it into a tangled, overlapping mess.</p>

<p>Let’s create a class for each monitor. In fact, let’s go a step further and define a <code class="language-plaintext highlighter-rouge">virtual</code> interface that all of the monitors abide by so that the <code class="language-plaintext highlighter-rouge">main</code> function can just hold a polymorphic <code class="language-plaintext highlighter-rouge">std::vector&lt;std::unique_ptr&lt;Monitor&gt;&gt;</code> that we can iterate over and ask the monitors to do something uniformly via the <code class="language-plaintext highlighter-rouge">Monitor</code> interface.</p>

<p>Now we have to sit down and design what that <code class="language-plaintext highlighter-rouge">Monitor</code> interface looks like especially since we have monitors that operate on heterogeneous data. One candidate might be something like this:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Monitor</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="c1">// don't forget the virtual destructor!</span>
    <span class="k">virtual</span> <span class="o">~</span><span class="n">Monitor</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>

    <span class="c1">// fetches the value that the monitor is responsible for</span>
    <span class="k">virtual</span> <span class="kt">double</span> <span class="n">get</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="c1">// returns a formatted string to print</span>
    <span class="k">virtual</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">print</span><span class="p">(</span><span class="kt">double</span> <span class="n">val</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>We could even go a step further and generalize the <code class="language-plaintext highlighter-rouge">double</code> to a generic <code class="language-plaintext highlighter-rouge">T</code> type like this:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">class</span> <span class="nc">Monitor</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="k">virtual</span> <span class="o">~</span><span class="n">Monitor</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>

    <span class="c1">// fetches the value that the monitor is responsible for</span>
    <span class="k">virtual</span> <span class="n">T</span> <span class="n">get</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="c1">// returns a formatted string for the value</span>
    <span class="k">virtual</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">format</span><span class="p">(</span><span class="k">const</span> <span class="n">T</span><span class="o">&amp;</span> <span class="n">val</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The monitor classes then might be defined like this:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CpuUsageMonitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Monitor</span><span class="o">&lt;</span><span class="n">CpuTimes</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">};</span>

<span class="k">class</span> <span class="nc">RamMonitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Monitor</span><span class="o">&lt;</span><span class="n">RamInfo</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">};</span>

<span class="k">class</span> <span class="nc">CpuTempMonitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Monitor</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">};</span>
</code></pre></div></div>

<p>This particular interface is overly specific: why does the user need to implement a <code class="language-plaintext highlighter-rouge">get</code> when there’s already a <code class="language-plaintext highlighter-rouge">print</code> that consumes that value anyway? If we were going to do something else useful with the value returned by <code class="language-plaintext highlighter-rouge">get</code>, then that design might have some legitimacy, but not in this case. (The templated version has a further problem: <code class="language-plaintext highlighter-rouge">Monitor&lt;CpuTimes&gt;</code> and <code class="language-plaintext highlighter-rouge">Monitor&lt;double&gt;</code> are unrelated types, so they can’t even be stored together in a single polymorphic container.) Interfaces should be as minimal as possible while still getting the job done: no more, no less. An interface with many similar or unrelated functions usually indicates that it needs to be broken up into smaller interfaces or that the inputs and outputs of those functions need to be redesigned. Either way, an interface should define only what we want from the derived classes and nothing more.</p>

<p>I’ve seen some people over-design interfaces with many highly-specific functions, thinking that we need to add <em>more</em> required functions and constraints; but, inevitably, there will be derived classes that don’t need some of those functions and are forced to override them with empty bodies just because the interface requires them. And when new requirements come in, those same people think to add even more functions to satisfy them. Rather than adding functions with a high level of specificity, I’ve found that <em>removing</em> specific functions is often more generic. My rationale is that if the interface inhibits derived classes from implementing new requirements on their own, then the interface is probably too restrictive already! <em>Removing</em> parts of the interface and giving the derived classes <em>more</em> freedom and flexibility is the more generic way to go.</p>

<p>All of that being said, interface design is definitely more of an art than a science!</p>

<p>Tying this back to our specific use-case: for our monitors, we don’t actually care about the value that’s computed; we just want the monitors to print what they’re monitoring to the screen in whatever way they see fit. The cleaner interface we’ll go with directly captures this requirement:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Monitor</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="k">virtual</span> <span class="o">~</span><span class="n">Monitor</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>

    <span class="c1">// prints to the screen</span>
    <span class="k">virtual</span> <span class="kt">void</span> <span class="n">print</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>We’re completely delegating the responsibility of printing the monitored information to the derived classes. There are alternatives to this interface in the same vein: for example, we could keep a <code class="language-plaintext highlighter-rouge">std::string format()</code> function and have the <code class="language-plaintext highlighter-rouge">main</code> function stream the result of <code class="language-plaintext highlighter-rouge">format()</code> to the output. If we wanted to uniformly log the output of the system monitor, then having that <code class="language-plaintext highlighter-rouge">format</code> might even be better since the main executor could open a log file and write to it. But we’re not going to support that use-case for now.</p>
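<p>For illustration, a sketch of that alternative interface might look like the following. This is hypothetical and we won’t use it going forward; the <code class="language-plaintext highlighter-rouge">FormattingMonitor</code>, <code class="language-plaintext highlighter-rouge">FixedMessageMonitor</code>, and <code class="language-plaintext highlighter-rouge">print_all</code> names are mine.</p>

```cpp
#include <cassert>
#include <iostream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical alternative: monitors return formatted strings and the
// executor decides where they go (stdout, a log file, etc.).
class FormattingMonitor {
public:
    virtual ~FormattingMonitor() = default;
    virtual std::string format() = 0;
};

// Toy derived class just to exercise the interface.
class FixedMessageMonitor : public FormattingMonitor {
public:
    std::string format() override { return "CPU Temp:    36.0 C"; }
};

// The executor streams each monitor's output to any std::ostream,
// so swapping std::cout for an open log file is a one-line change.
void print_all(const std::vector<std::unique_ptr<FormattingMonitor>>& monitors,
               std::ostream& out) {
    for (const auto& monitor : monitors) {
        out << monitor->format() << '\n';
    }
}
```

<p>The nice property here is that the same <code class="language-plaintext highlighter-rouge">print_all</code> works unchanged whether <code class="language-plaintext highlighter-rouge">out</code> is the console or a file stream.</p>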

<p>With this new interface, let’s move all of the logic of the functions into classes. For example, the (abridged) CPU usage class would look like this:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CpuMonitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Monitor</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="kt">void</span> <span class="n">print</span><span class="p">()</span> <span class="k">override</span> <span class="p">{</span>
        <span class="k">const</span> <span class="n">CpuTimes</span> <span class="n">current_times</span> <span class="o">=</span> <span class="n">read_cpu_times</span><span class="p">();</span>
        <span class="k">const</span> <span class="kt">double</span> <span class="n">cpu_usage</span>
                <span class="o">=</span> <span class="n">calculate_cpu_usage</span><span class="p">(</span><span class="n">prev_times_</span><span class="p">,</span> <span class="n">current_times</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">is_first_loop_</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">format</span><span class="p">(</span><span class="s">"{:&lt;12} Calculating...</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">"CPU Usage:"</span><span class="p">);</span>
            <span class="n">is_first_loop_</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">format</span><span class="p">(</span><span class="s">"{:&lt;12} {:&gt;3.1f} %</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
                                     <span class="s">"CPU Usage:"</span><span class="p">,</span> <span class="n">cpu_usage</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="n">prev_times_</span> <span class="o">=</span> <span class="n">current_times</span><span class="p">;</span>
    <span class="p">}</span>

<span class="nl">private:</span>
    <span class="c1">// same as before</span>
    <span class="k">struct</span> <span class="nc">CpuTimes</span> <span class="p">{</span>
        <span class="c1">//...</span>
    <span class="p">};</span>

    <span class="n">CpuTimes</span> <span class="n">prev_times_</span><span class="p">{};</span>
    <span class="kt">bool</span> <span class="n">is_first_loop_</span><span class="p">{</span><span class="nb">true</span><span class="p">};</span>

    <span class="c1">// same as before</span>
    <span class="n">CpuTimes</span> <span class="n">read_cpu_times</span><span class="p">()</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>

    <span class="c1">// same as before</span>
    <span class="kt">double</span> <span class="n">calculate_cpu_usage</span><span class="p">(</span><span class="k">const</span> <span class="n">CpuTimes</span><span class="o">&amp;</span> <span class="n">prev</span><span class="p">,</span> <span class="k">const</span> <span class="n">CpuTimes</span><span class="o">&amp;</span> <span class="n">current</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
<span class="p">};</span>

</code></pre></div></div>

<p>The other classes follow similarly (I’ll leave them as exercises to the reader 😉). Given these derived classes, we can simplify our <code class="language-plaintext highlighter-rouge">main</code> function to leverage the new runtime polymorphic interface:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">char</span><span class="o">**</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">Monitor</span><span class="o">&gt;&gt;</span> <span class="n">monitors</span><span class="p">{};</span>
    <span class="n">monitors</span><span class="p">.</span><span class="n">emplace_back</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">make_unique</span><span class="o">&lt;</span><span class="n">CpuMonitor</span><span class="o">&gt;</span><span class="p">());</span>
    <span class="n">monitors</span><span class="p">.</span><span class="n">emplace_back</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">make_unique</span><span class="o">&lt;</span><span class="n">RamMonitor</span><span class="o">&gt;</span><span class="p">());</span>
    <span class="n">monitors</span><span class="p">.</span><span class="n">emplace_back</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">make_unique</span><span class="o">&lt;</span><span class="n">CpuTempMonitor</span><span class="o">&gt;</span><span class="p">());</span>
    <span class="n">monitors</span><span class="p">.</span><span class="n">emplace_back</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">make_unique</span><span class="o">&lt;</span><span class="n">UptimeMonitor</span><span class="o">&gt;</span><span class="p">());</span>

    <span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono_literals</span><span class="p">;</span>

    <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">stop_loop</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">clear_screen</span><span class="p">();</span>

        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"---- Linux System Monitor ----</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="k">const</span> <span class="k">auto</span><span class="o">&amp;</span> <span class="n">monitor</span> <span class="o">:</span> <span class="n">monitors</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">monitor</span><span class="o">-&gt;</span><span class="n">print</span><span class="p">();</span>
        <span class="p">}</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"------------------------------------</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Press Ctrl+C to exit."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

        <span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="mx">1s</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Much simpler since all of the logic has moved out! We store all of the monitors polymorphically in a <code class="language-plaintext highlighter-rouge">std::vector&lt;std::unique_ptr&lt;Monitor&gt;&gt;</code> that we populate once at the start with the derived classes and then use in the main loop. Since we’re creating concrete instances of the derived classes but storing them in a <code class="language-plaintext highlighter-rouge">std::vector</code> of the base class, we need the <code class="language-plaintext highlighter-rouge">std::unique_ptr</code> for runtime polymorphism. Note that when we iterate over <code class="language-plaintext highlighter-rouge">monitors</code>, we need to take a reference to each <code class="language-plaintext highlighter-rouge">std::unique_ptr</code> since a <code class="language-plaintext highlighter-rouge">std::unique_ptr</code> can’t be copied.</p>

<p>One very important thing to note, which we won’t address right now, concerns the memory allocated by the <code class="language-plaintext highlighter-rouge">std::vector&lt;std::unique_ptr&lt;Monitor&gt;&gt;</code>. Normally, each <code class="language-plaintext highlighter-rouge">std::unique_ptr</code> frees its memory automatically at the end of its scope, but since we’re running an infinite loop, pressing Ctrl-C raises a <code class="language-plaintext highlighter-rouge">SIGINT</code> that terminates the entire program without running that cleanup. This is technically fine here since the OS will reclaim the memory anyway, but it won’t be fine for some of the later patterns, so we’ll address the memory issue when we get there.</p>
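<p>For reference, here is a minimal sketch of how the <code class="language-plaintext highlighter-rouge">stop_loop</code> flag used by the main loop could be wired to Ctrl-C so the loop exits normally and destructors run; the handler and installer names are mine, and the actual wiring in a real executor may differ.</p>

```cpp
#include <cassert>
#include <csignal>

// A signal handler may only safely touch lock-free atomics or a
// volatile std::sig_atomic_t, so the flag is kept deliberately simple.
volatile std::sig_atomic_t stop_loop = 0;

void handle_sigint(int) {
    stop_loop = 1;
}

void install_sigint_handler() {
    // After this, Ctrl-C sets stop_loop instead of killing the process,
    // letting a while (!stop_loop) loop fall through and clean up.
    std::signal(SIGINT, handle_sigint);
}
```

<p>With this in place, the <code class="language-plaintext highlighter-rouge">std::unique_ptr</code>s in the vector get destroyed on the normal path out of <code class="language-plaintext highlighter-rouge">main</code>.</p>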

<p>Now we’re making more progress! Going beyond just functions and directly writing logic into the <code class="language-plaintext highlighter-rouge">main</code> function, we defined a runtime polymorphic interface and used it to help remove almost all of the business logic from the <code class="language-plaintext highlighter-rouge">main</code> function. The only monitor-specific code just creates the monitors themselves. This produces the same output as the MVP but is more extensible and maintainable.</p>

<h1 id="registry-pattern">Registry Pattern</h1>

<p>One blatant issue with the runtime polymorphic approach is that, if we want to add a new monitor, we still have to go into the <code class="language-plaintext highlighter-rouge">main</code> function and add it. This might not be possible if the code that runs the <code class="language-plaintext highlighter-rouge">main</code> function lives elsewhere and isn’t easily modifiable, e.g., the executor takes several hours to build or, in industry, it’s owned by a completely different team. Furthermore, this example is simple in that there’s just a single <code class="language-plaintext highlighter-rouge">main</code> function to modify to add new monitors, but, in real codebases, the right place might not be obvious (e.g., the monitor list is in some utility file or other software package far from the <code class="language-plaintext highlighter-rouge">main</code> function) or there might be multiple places where the monitor needs to be registered. Hopefully there’s documentation on all of the places a module needs to be registered, but, if not, we’ll have to hunt down all of those places!</p>

<p>The <strong>registry pattern</strong> is one technique to invert the place where the derived classes are constructed: rather than constructing them manually in the <code class="language-plaintext highlighter-rouge">main</code> function’s <code class="language-plaintext highlighter-rouge">std::vector</code>, the idea is to have each monitor <em>register itself</em> into a global registry that’s stored in the framework code but used by the executor.</p>

<p>As an added step, to decouple monitor registration from monitor construction, we’ll store the <em>factories</em> that construct the monitors in a global registry and then use those factories to instantiate the monitors at runtime in the <code class="language-plaintext highlighter-rouge">main</code> function. Let’s first define the factory function type and the global factory registry.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">MonitorFactory</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">function</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">Monitor</span><span class="o">&gt;</span><span class="p">()</span><span class="o">&gt;</span><span class="p">;</span>

<span class="kr">inline</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">MonitorFactory</span><span class="o">&gt;&amp;</span> <span class="n">getMonitorFactoryRegistry</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">static</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">MonitorFactory</span><span class="o">&gt;</span> <span class="n">monitor_factories</span><span class="p">{};</span>
    <span class="k">return</span> <span class="n">monitor_factories</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
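<p>On the executor side, consuming this registry might look like the following sketch. The <code class="language-plaintext highlighter-rouge">instantiate_registered_monitors</code> name is mine, and the <code class="language-plaintext highlighter-rouge">Monitor</code> and registry definitions are repeated as stand-ins so the snippet is self-contained.</p>

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Stand-ins for the interface and registry defined earlier in the post.
class Monitor {
public:
    virtual ~Monitor() = default;
    virtual void print() = 0;
};

using MonitorFactory = std::function<std::unique_ptr<Monitor>()>;

inline std::vector<MonitorFactory>& getMonitorFactoryRegistry() {
    static std::vector<MonitorFactory> monitor_factories{};
    return monitor_factories;
}

// Sketch: the executor invokes every registered factory at startup,
// building its monitor list without naming any concrete monitor class.
std::vector<std::unique_ptr<Monitor>> instantiate_registered_monitors() {
    std::vector<std::unique_ptr<Monitor>> monitors;
    for (const auto& factory : getMonitorFactoryRegistry()) {
        monitors.emplace_back(factory());
    }
    return monitors;
}
```

<p>The key point is that the executor only depends on the registry, not on any concrete monitor type.</p>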

<p>Now we need a way for monitors to add themselves to this factory registry. We can take advantage of static initialization and define a dummy static variable that, in its constructor, registers its factory function into the factory registry. For the CPU utilization struct, it might look like this:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">namespace</span> <span class="p">{</span>
<span class="k">struct</span> <span class="nc">RegistrarCpuMonitor</span> <span class="p">{</span>
    <span class="n">RegistrarCpuMonitor</span><span class="p">()</span> <span class="p">{</span>
        <span class="n">getMonitorFactoryRegistry</span><span class="p">().</span><span class="n">push_back</span><span class="p">([]</span> <span class="p">{</span> <span class="k">return</span> <span class="n">std</span><span class="o">::</span><span class="n">make_unique</span><span class="o">&lt;</span><span class="n">CpuMonitor</span><span class="o">&gt;</span><span class="p">();</span> <span class="p">});</span>
    <span class="p">}</span>
<span class="p">};</span>
<span class="n">RegistrarCpuMonitor</span> <span class="n">registrarCpuMonitor</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We’re using an unnamed/anonymous namespace so the struct and variable have internal linkage just to prevent it from leaking out of the translation unit/cpp file. In the constructor of our registrar, we access the monitor factory registry and add a factory function/lambda that constructs our <code class="language-plaintext highlighter-rouge">CpuMonitor</code>. Then we immediately create an instance of it to invoke that constructor as the first thing that happens when the program is executed. (Static initialization happens even before <code class="language-plaintext highlighter-rouge">main</code> is executed since those variables live in a different segment of the program.)</p>

<p>The logic is the same for each monitor but since we’re relying on creating a unique global variable in static storage, there’s unfortunately no way to directly write C++ to abstract this away. Our only option is to define a macro that does the same thing.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define REGISTER_MONITOR(MonitorClass) \
    namespace { \
    struct Registrar##MonitorClass { \
        Registrar##MonitorClass() { \
            getMonitorFactoryRegistry().push_back([] { return std::make_unique&lt;MonitorClass&gt;(); }); \
        } \
    }; \
    Registrar##MonitorClass registrar##MonitorClass; \
    }
</span></code></pre></div></div>

<p>Now we can add this to the end of each monitor class in the global scope.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CpuMonitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Monitor</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">};</span>
<span class="n">REGISTER_MONITOR</span><span class="p">(</span><span class="n">CpuMonitor</span><span class="p">)</span>

<span class="k">class</span> <span class="nc">RamMonitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Monitor</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">};</span>
<span class="n">REGISTER_MONITOR</span><span class="p">(</span><span class="n">RamMonitor</span><span class="p">)</span>
</code></pre></div></div>

<p>With all of the monitor factories registered, we need a monitor registry that reads the factory registry and invokes the factory functions to construct the monitors.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MonitorRegistry</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="k">explicit</span> <span class="n">MonitorRegistry</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">MonitorFactory</span><span class="o">&gt;&amp;</span> <span class="n">factories</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">monitors_</span><span class="p">.</span><span class="n">reserve</span><span class="p">(</span><span class="n">factories</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>
        <span class="n">std</span><span class="o">::</span><span class="n">ranges</span><span class="o">::</span><span class="n">transform</span><span class="p">(</span><span class="n">factories</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">back_inserter</span><span class="p">(</span><span class="n">monitors_</span><span class="p">),</span> <span class="p">[](</span><span class="k">const</span> <span class="k">auto</span><span class="o">&amp;</span> <span class="n">factory</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">factory</span><span class="p">();</span>
        <span class="p">});</span>
    <span class="p">}</span>

    <span class="o">~</span><span class="n">MonitorRegistry</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>

    <span class="n">MonitorRegistry</span><span class="p">(</span><span class="k">const</span> <span class="n">MonitorRegistry</span><span class="o">&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>
    <span class="n">MonitorRegistry</span><span class="p">(</span><span class="n">MonitorRegistry</span><span class="o">&amp;&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>

    <span class="n">MonitorRegistry</span><span class="o">&amp;</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="k">const</span> <span class="n">MonitorRegistry</span><span class="o">&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>
    <span class="n">MonitorRegistry</span><span class="o">&amp;</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="n">MonitorRegistry</span><span class="o">&amp;&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>

    <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">Monitor</span><span class="o">&gt;&gt;&amp;</span> <span class="n">getMonitors</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">monitors_</span><span class="p">;</span>
    <span class="p">}</span>

<span class="nl">private:</span>
    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">Monitor</span><span class="o">&gt;&gt;</span> <span class="n">monitors_</span><span class="p">{};</span>
<span class="p">};</span>
</code></pre></div></div>

<p>I’m using the new C++20 ranges library to do this but a normal <code class="language-plaintext highlighter-rouge">std::transform</code> will work as well. Now the <code class="language-plaintext highlighter-rouge">main</code> function gets simplified even further! We can create a <code class="language-plaintext highlighter-rouge">MonitorRegistry</code> and iterate over the monitors.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">char</span><span class="o">**</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono_literals</span><span class="p">;</span>

    <span class="n">MonitorRegistry</span> <span class="n">monitor_registry</span><span class="p">(</span><span class="n">getMonitorFactoryRegistry</span><span class="p">());</span>

    <span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">clear_screen</span><span class="p">();</span>

        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"---- Linux System Monitor ----n</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="k">const</span> <span class="k">auto</span><span class="o">&amp;</span> <span class="n">monitor</span> <span class="o">:</span> <span class="n">monitor_registry</span><span class="p">.</span><span class="n">getMonitors</span><span class="p">())</span> <span class="p">{</span>
            <span class="n">monitor</span><span class="o">-&gt;</span><span class="n">print</span><span class="p">();</span>
        <span class="p">}</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"------------------------------------</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Press Ctrl+C to exit."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

        <span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="mx">1s</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now the monitors have the responsibility of registering themselves! (One added benefit of this approach is that it’s easier to test as well.) With this change, we’re moving towards a better architecture where the monitors and the main executor can live completely independently of each other, even in separate libraries. At this point, we’ve hit a good milestone: the monitors are completely independent of each other and of the executor.</p>

<h1 id="dynamic-plugin-architecture">Dynamic Plugin Architecture</h1>

<p>Going a step further, even the registry pattern assumed we had the main executor code freely available to link against at build-time. This might not be the case in some scenarios: in the extreme case, the executor code is proprietary and we don’t have access to it; the vendor wants to hide their trade secrets, so they provide only the framework and an obfuscated executable. In another case, perhaps the executor code itself is difficult to link against directly or too expensive to rebuild each time. In these cases, we’d ideally want to define the plugin in a completely separate shared library and have the executor load and use that library dynamically at runtime.</p>

<p>Fortunately, there’s a solution: the dynamic loader. On most systems, a shared library can be loaded directly into a process’s memory space by that process, in code. Whenever we create a shared library, we can see all of the symbols it exports (try running <code class="language-plaintext highlighter-rouge">nm -D &lt;your favorite library&gt;</code>) and, when we load it into our process, we can get a pointer to any symbol in the library. Suppose we know that a particular symbol refers to a function with a specific signature: we can cast it to a function pointer and invoke it! This is the idea behind plugins in a plugin architecture: each plugin defines a function that constructs its monitor, and the executor grabs a function pointer to that function and uses it to construct the object! Since all of this logic happens at runtime, we could also add functionality to hot-reload any plugins, i.e., re-read the plugin library and re-initialize without ever restarting the executor process!</p>

<p>Let’s start by partitioning our classes into separate files and libraries. We’ll need a header file for the framework and new export macro that defines the function that will construct a given monitor.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// framework.hpp</span>

<span class="k">class</span> <span class="nc">Monitor</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="k">virtual</span> <span class="o">~</span><span class="n">Monitor</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>
    <span class="k">virtual</span> <span class="kt">void</span> <span class="n">print</span><span class="p">()</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>

<span class="cp">#define EXPORT_MONITOR(MonitorClass) \
    extern "C" Monitor* createMonitor() { return new MonitorClass{}; }
</span></code></pre></div></div>

<p>We’re using <code class="language-plaintext highlighter-rouge">extern "C"</code> to ensure the function abides by the C application binary interface (ABI). For C++, that means the name isn’t mangled: no namespace qualifiers or overload information get encoded into the symbol. So when we build a monitor library and list its symbols, we’d see a global function called <code class="language-plaintext highlighter-rouge">createMonitor</code> in the text section of the program! One important note: since we’re using the C ABI, we can’t return a proper <code class="language-plaintext highlighter-rouge">std::unique_ptr&lt;Monitor&gt;</code> (C has no notion of classes or templates!), so we’ll have to heap-allocate the monitor using a raw <code class="language-plaintext highlighter-rouge">new</code>, but, in the executor, we’ll immediately wrap it in a <code class="language-plaintext highlighter-rouge">std::unique_ptr&lt;Monitor&gt;</code> so there won’t be any issues with memory. It’s always good practice to use resource acquisition is initialization (RAII)/scoped memory management.</p>

<p>Now we can separate each of the monitors into their own separate files and change the macro from <code class="language-plaintext highlighter-rouge">REGISTER_MONITOR</code> to <code class="language-plaintext highlighter-rouge">EXPORT_MONITOR</code>.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// ram_monitor.cpp</span>
<span class="k">class</span> <span class="nc">RamMonitor</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Monitor</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">};</span>

<span class="n">EXPORT_MONITOR</span><span class="p">(</span><span class="n">RamMonitor</span><span class="p">);</span>
</code></pre></div></div>

<p>Then we can build each of these monitors into their own separate libraries that we’ll dynamically load when we run the executor.</p>

<p>Speaking of the executor, we’ll modify it yet again to support this. To show just how dynamic this plugin architecture is, we’re going to put the paths to the plugins into a file and have the executor read that file, load the plugins, and run them. Then we can edit the file to add or remove plugins and re-run the same executor, without making any code change to it, to see a different set of monitors on the screen.</p>
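<p>For example, the plugin list might look like this, one shared-library path per line (the names here are illustrative; they’d match whatever libraries our build actually produces):</p>

```text
./libcpu_monitor.so
./libram_monitor.so
```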

<p>Going back to the memory issues with the infinite loop and <code class="language-plaintext highlighter-rouge">SIGINT</code> we were discussing, I’ll demonstrate how to get the signal handler set up since it makes a difference if we’re hot-reloading: we don’t want to keep leaking memory every time we reload a plugin. It’s pretty simple to set up but uses some C conventions:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">std</span><span class="o">::</span><span class="n">atomic_bool</span> <span class="n">stop_loop</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">signal_handler</span><span class="p">(</span><span class="kt">int</span> <span class="n">signum</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">signum</span> <span class="o">==</span> <span class="n">SIGINT</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">stop_loop</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">signal</span><span class="p">(</span><span class="n">SIGINT</span><span class="p">,</span> <span class="n">signal_handler</span><span class="p">);</span>
    <span class="c1">// ...</span>
    <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">stop_loop</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We override the signal handler for Ctrl-C (<code class="language-plaintext highlighter-rouge">SIGINT</code>) to invoke <code class="language-plaintext highlighter-rouge">signal_handler</code>, which we use to cleanly break out of the loop and release any memory or close any resources.</p>

<p>For the dynamic loading, the only three functions we’ll need are <code class="language-plaintext highlighter-rouge">dlopen</code>, <code class="language-plaintext highlighter-rouge">dlsym</code>, and <code class="language-plaintext highlighter-rouge">dlclose</code> from the C library (<code class="language-plaintext highlighter-rouge">&lt;dlfcn.h&gt;</code>). The first opens the shared library and the last closes it; the middle one fetches a pointer to a particular symbol. Opening a shared library returns a <code class="language-plaintext highlighter-rouge">void*</code> handle that must be closed with <code class="language-plaintext highlighter-rouge">dlclose</code>. One trick we can use is to re-use <code class="language-plaintext highlighter-rouge">std::unique_ptr</code> but give it a custom deleter that doesn’t directly delete the <code class="language-plaintext highlighter-rouge">void*</code>, but just calls <code class="language-plaintext highlighter-rouge">dlclose</code> on it.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">DlCloser</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="k">operator</span><span class="p">()(</span><span class="kt">void</span><span class="o">*</span> <span class="n">handle</span><span class="p">)</span> <span class="k">const</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">handle</span><span class="p">)</span> <span class="p">{</span>
            <span class="o">::</span><span class="n">dlclose</span><span class="p">(</span><span class="n">handle</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">};</span>

<span class="c1">// Define a type alias for the unique_ptr with the custom deleter</span>
<span class="k">using</span> <span class="n">DlHandlePtr</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="kt">void</span><span class="p">,</span> <span class="n">DlCloser</span><span class="o">&gt;</span><span class="p">;</span>

<span class="kt">int</span> <span class="n">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">DlHandlePtr</span><span class="o">&gt;</span> <span class="n">handles</span><span class="p">{};</span>
    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">Monitor</span><span class="o">&gt;&gt;</span> <span class="n">monitors</span><span class="p">{};</span>

    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note I’m choosing to use the global scope resolution operator <code class="language-plaintext highlighter-rouge">::</code>, as in <code class="language-plaintext highlighter-rouge">::dlclose</code>: C doesn’t have namespaces, so all of its symbols live in the global namespace, and the operator ensures I’m referring to the C version of those functions and not some other free function called <code class="language-plaintext highlighter-rouge">dlclose</code> in a namespace. (C functions tend to have really generic names like <code class="language-plaintext highlighter-rouge">open</code> and <code class="language-plaintext highlighter-rouge">close</code>, so it’s possible to conflict with a namespace-level <code class="language-plaintext highlighter-rouge">open</code> or <code class="language-plaintext highlighter-rouge">close</code>, but using <code class="language-plaintext highlighter-rouge">::open</code> avoids that.) As a reader, it also signals to me that these are C functions. This isn’t required, of course; it’s just something I do to make the code easier to read and write. Now we need to do the work of loading each of these handles from the file. We’ll make the user provide the text file as an argument to the program.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">argc</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Usage: ./monitr &lt;path-to-monitors-txt&gt;</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">std</span><span class="o">::</span><span class="n">signal</span><span class="p">(</span><span class="n">SIGINT</span><span class="p">,</span> <span class="n">signal_handler</span><span class="p">);</span>

    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">DlHandlePtr</span><span class="o">&gt;</span> <span class="n">handles</span><span class="p">{};</span>
    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">Monitor</span><span class="o">&gt;&gt;</span> <span class="n">monitors</span><span class="p">{};</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We’ll read each line of the text file and create a handle from the file path.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span> <span class="p">{</span>
    <span class="c1">// ...</span>

    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">DlHandlePtr</span><span class="o">&gt;</span> <span class="n">handles</span><span class="p">{};</span>
    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">Monitor</span><span class="o">&gt;&gt;</span> <span class="n">monitors</span><span class="p">{};</span>

    <span class="n">std</span><span class="o">::</span><span class="n">ifstream</span> <span class="n">monitors_file</span><span class="p">{</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]};</span>
    <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">line</span><span class="p">;</span>
    <span class="c1">// read file line-by-line</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">getline</span><span class="p">(</span><span class="n">monitors_file</span><span class="p">,</span> <span class="n">line</span><span class="p">))</span> <span class="p">{</span>
        <span class="c1">// create handle for the shared library</span>
        <span class="n">DlHandlePtr</span> <span class="n">handle</span><span class="p">{</span><span class="o">::</span><span class="n">dlopen</span><span class="p">(</span><span class="n">line</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">RTLD_NOW</span><span class="p">),</span> <span class="n">DlCloser</span><span class="p">{}};</span>
        <span class="k">using</span> <span class="n">MonitorCreateFn</span> <span class="o">=</span> <span class="n">Monitor</span><span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="p">)();</span>
        <span class="c1">// get a reference to the createMonitor function and cast it as a function pointer</span>
        <span class="n">MonitorCreateFn</span> <span class="n">monitor_create_fn</span> <span class="o">=</span> <span class="p">(</span><span class="n">MonitorCreateFn</span><span class="p">)</span><span class="o">::</span><span class="n">dlsym</span><span class="p">(</span><span class="n">handle</span><span class="p">.</span><span class="n">get</span><span class="p">(),</span> <span class="s">"createMonitor"</span><span class="p">);</span>
        <span class="c1">// invoke the function pointer to create the monitor</span>
        <span class="n">std</span><span class="o">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">Monitor</span><span class="o">&gt;</span> <span class="n">monitor</span><span class="p">{</span><span class="n">monitor_create_fn</span><span class="p">()};</span>

        <span class="n">monitors</span><span class="p">.</span><span class="n">emplace_back</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">monitor</span><span class="p">));</span>
        <span class="n">handles</span><span class="p">.</span><span class="n">emplace_back</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">handle</span><span class="p">));</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The rest of the function is the same as before:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">stop_loop</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">clear_screen</span><span class="p">();</span>

        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"---- Linux System Monitor ----n</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="k">const</span> <span class="k">auto</span><span class="o">&amp;</span> <span class="n">monitor</span> <span class="o">:</span> <span class="n">monitors</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">monitor</span><span class="o">-&gt;</span><span class="n">print</span><span class="p">();</span>
        <span class="p">}</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"------------------------------------</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Press Ctrl+C to exit."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

        <span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="mx">1s</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now we have a dynamic plugin architecture! This is one of the most extensible and flexible kinds of modular architectures. If we edit the text file and re-run the executor, without having to recompile anything, we’ll get a different set of monitors printing to the screen!</p>

<h1 id="compile-time-polymorphic-interface">Compile-time Polymorphic Interface</h1>

<p>Taking a step back from runtime polymorphism, there’s another use-case where we already know at compile-time which monitors we want to run. Or perhaps we’re running in a very resource-constrained or performance-critical environment where we want to use as many compile-time constructs as we can to reduce heap memory allocation or improve performance. Of course, we should measure to verify that the runtime interface is actually a meaningful contributor to our performance problems before optimizing it away!</p>

<p>For runtime polymorphism, we stored the monitors in a <code class="language-plaintext highlighter-rouge">std::vector</code>, but that’s a runtime construct that dynamically allocates memory on the heap; furthermore, polymorphism adds an extra pointer indirection through the virtual function table, which may cost a tiny bit of performance. The compile-time analogue of a <code class="language-plaintext highlighter-rouge">std::vector</code> is a <code class="language-plaintext highlighter-rouge">std::tuple</code>: we define all of the monitors as elements of a <code class="language-plaintext highlighter-rouge">std::tuple</code> at compile-time in its template parameter pack.</p>

<p>While we don’t necessarily have to explicitly enforce the template, we’ll swap out our runtime polymorphic interface with a compile-time one enforced on each type in the <code class="language-plaintext highlighter-rouge">std::tuple</code>. Let’s use a C++20 concept!</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">concept</span> <span class="n">MonitorLike</span> <span class="o">=</span> <span class="k">requires</span><span class="p">(</span><span class="n">T</span> <span class="n">a</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// requires that any type that abides by MonitorLike&lt;T&gt; must have a</span>
    <span class="c1">// function-like object called print that takes no parameters and</span>
    <span class="c1">// returns void</span>
    <span class="p">{</span> <span class="n">a</span><span class="p">.</span><span class="n">print</span><span class="p">()</span> <span class="p">}</span> <span class="o">-&gt;</span> <span class="n">std</span><span class="o">::</span><span class="n">same_as</span><span class="o">&lt;</span><span class="kt">void</span><span class="o">&gt;</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Rather than directly using a <code class="language-plaintext highlighter-rouge">std::tuple</code>, we’ll create a <code class="language-plaintext highlighter-rouge">MonitorChain</code> that hides it and calls functions on all of the underlying types. The implementation is short but requires a bit of explanation.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// template parameter pack for all of the monitor types</span>
<span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span><span class="o">...</span> <span class="nc">Monitors</span><span class="p">&gt;</span>
    <span class="k">requires</span><span class="p">(</span><span class="n">MonitorLike</span><span class="o">&lt;</span><span class="n">Monitors</span><span class="o">&gt;</span> <span class="o">&amp;&amp;</span> <span class="p">...)</span> <span class="c1">// apply MonitorLike concept for all types in the chain</span>
<span class="k">class</span> <span class="nc">MonitorChain</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="kt">void</span> <span class="n">print</span><span class="p">()</span> <span class="p">{</span>
        <span class="n">std</span><span class="o">::</span><span class="n">apply</span><span class="p">([](</span><span class="k">auto</span><span class="o">&amp;</span><span class="p">...</span> <span class="n">monitor</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// for 3 monitors, expands to (monitor1.print(), monitor2.print(), monitor3.print());</span>
            <span class="p">(</span><span class="n">monitor</span><span class="p">.</span><span class="n">print</span><span class="p">(),</span> <span class="p">...);</span>
        <span class="p">},</span> <span class="n">monitors_</span><span class="p">);</span>
    <span class="p">}</span>

<span class="nl">private:</span>
    <span class="c1">// tuple to store our monitors</span>
    <span class="n">std</span><span class="o">::</span><span class="n">tuple</span><span class="o">&lt;</span><span class="n">Monitors</span><span class="p">...</span><span class="o">&gt;</span> <span class="n">monitors_</span><span class="p">{};</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Using this is straightforward:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MonitorChain</span><span class="o">&lt;</span><span class="n">CpuMonitor</span><span class="p">,</span> <span class="n">RamMonitor</span><span class="p">,</span> <span class="n">CpuTempMonitor</span><span class="p">,</span> <span class="n">UptimeMonitor</span><span class="o">&gt;</span>
        <span class="n">monitors</span><span class="p">{};</span>
<span class="c1">// invokes .print() for all above monitors</span>
<span class="n">monitors</span><span class="p">.</span><span class="n">print</span><span class="p">()</span>
</code></pre></div></div>

<p>Unlike runtime polymorphism, we do have to specify all of the types where the compiler can see them at compile-time, so they all need to end up in the same executable (although you could still split them into separate software packages and define them in headers). Using templates, we can achieve a similar kind of polymorphism that has a much greater chance of being almost entirely inlined during compilation.</p>

<h1 id="conclusion">Conclusion</h1>

<p>In this post, we explored a few different approaches to writing extensible C++ frameworks and architectures. Our motivating example was building a Linux system monitor with sub-monitors, each responsible for reading one specific aspect of the Linux subsystem like CPU usage or uptime. We started with building a functioning application and then looked at different ways to make it more extensible like the registry pattern and the plugin architecture. We also explored a way to do the same at compile-time.</p>

<p>I hope this exploration provides you with some different options on building some truly extensible architectures 🙂</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Great frameworks and libraries are the foundation of all C++ applications. In this post, I'll go over a few approaches to designing and implementing truly extensible and modular frameworks in C++!]]></summary></entry><entry><title type="html">Quantum Computing - Part 1: Basic Quantum Circuits</title><link href="/quantum-computing-part-1.html" rel="alternate" type="text/html" title="Quantum Computing - Part 1: Basic Quantum Circuits" /><published>2025-04-18T00:00:00+00:00</published><updated>2025-04-18T00:00:00+00:00</updated><id>/quantum-computing-part-1</id><content type="html" xml:base="/quantum-computing-part-1.html"><![CDATA[<p>Quantum Computing has started to enter the popsci news in the past few years as it matures from theoretical knowledge to practical application. For example, some years ago, major automobile manufacturer Volkswagen announced a partnership with D-Wave Quantum Inc. where it showcased a small-scale proof-of-concept <a href="https://www.vw.com/en/newsroom/future-of-mobility/quantum-computing.html">Quantum Routing</a> algorithm in Lisbon, Portugal to reduce the waiting times for passengers in the bus system. In more recent news, Google announced <a href="https://blog.google/technology/research/google-willow-quantum-chip/">Willow</a>, their new quantum chip with 105 superconducting qubits. Just a few months ago, Microsoft announced their <a href="https://azure.microsoft.com/en-us/blog/quantum/2025/02/19/microsoft-unveils-majorana-1-the-worlds-first-quantum-processor-powered-by-topological-qubits/">Majorana quantum chip</a>. I expect more major companies to start investing in and using this technology in the coming years and decades as it shows extraordinary speed-ups on algorithms core to many businesses! Now is the best time to start understanding how it all works and especially its limitations in the kinds of problems it can solve well.</p>

<p>Before you get concerned about the “quantum” in “quantum computing”, at least for the kinds of computer science use-cases we’ll be going into, we won’t need a very detailed understanding of quantum mechanics like knowing how to solve the Schrödinger equation, but we will need to at least understand and accept some core concepts like <em>state</em>, <em>superposition</em>, and <em>measurement</em>. While Quantum Computing is unsurprisingly also used for modeling quantum mechanical systems like interactions between molecules in quantum chemistry, we’re not going to get into those particular applications of quantum computing since they require substantial background knowledge. All this being said, if you want to really understand quantum computing, you’d benefit from learning more about quantum physics. (Even with that knowledge, there’s certainly an aspect of “we don’t know why but this is the empirical way the universe works!” We’ll even encounter some of that in this post itself!)</p>

<p>This is just the start of our quantum journey! We’ll need to understand the basics before delving into some more practical use-cases for quantum computing. For example, the <strong>vehicle routing problem</strong> of identifying the most efficient routes for a fleet of vehicles to perform deliveries can be formulated as a quantum computing problem that can be solved significantly faster than classical approaches. I’m sure you’ve heard of <strong>Shor’s Algorithm</strong> for factoring large numbers into primes, which could be used to break the RSA encryption that is ubiquitous in cybersecurity. There’s also <strong>Grover’s Algorithm</strong> for searching through an unstructured database faster than classical methods. All of these are quantum algorithms that have the potential to be much faster than their classical counterparts!</p>

<p>In this post, I’ll walk through some basics of quantum computing, starting with some basic quantum mechanics concepts that we’ll need to accept. Then I’ll define a qubit by making analogues to classical bits. We’ll see some ways to manipulate a single qubit before moving on to multi-qubit systems. Finally, we’ll apply everything we’ve learned to two interesting quantum circuits that can be used to transmit quantum state using classical bits and vice-versa with transmitting classical bits using quantum state.</p>

<p>I’ll assume you know some basic linear algebra with vectors and matrices and give a quick refresher on complex numbers but you don’t have to know any quantum physics!</p>

<h1 id="quantum-mechanics-concepts">Quantum Mechanics Concepts</h1>

<p>Quantum computing lives at the intersection of quantum mechanics and computer science; while we won’t have to cover the entirety of quantum mechanics, we’ll still need to accept some concepts that are the basis of quantum computing. We’ll use a few historical experiments to motivate the concepts but we won’t be getting too much into the underlying maths and physics. For a more rigorous treatment, any introductory quantum mechanics textbook will do (Griffiths is a good one).</p>

<p>In the early 20th century, physicists were preoccupied with experiments that classical physics couldn’t explain. One such experiment was the Stern-Gerlach experiment: vaporize some silver in an oven, send the beam of silver atoms through a magnetic field, and measure the deflection on a detector screen.</p>

<p><img src="/images/quantum-computing-part-1/stern-gerlach.svg" alt="The Stern-Gerlach Experiment" title="The Stern-Gerlach Experiment" /></p>

<p><small><i>Credit: Wikipedia.</i> The Stern-Gerlach experiment had silver atoms traveling through a magnetic field into a detector screen to measure their deflection. (1) The oven vaporizing the silver, (2) the beam of silver atoms, (3) the magnetic field, (4) the expected result using classical electrodynamics, (5) the actual result.</small></p>

<p>While the silver atom is neutral, the electron in the farthest shell will have a magnetic moment which behaves almost like the entire atom has a little magnet; this is an intrinsic, fundamental property of all particles called <strong>spin</strong>. From classical electrodynamics, if a charged object passes through a magnet, it experiences a force proportional to the “alignment” of its “north pole” with the north pole of the magnet. Since the spin of the outermost electron is a vector, we’d expect a Gaussian distribution with the mean being a straight line from the emitter, i.e., no deflection, and then some linear spread indicating some atoms that were slightly deflected along one axis.</p>

<p>However, this was not the observed result: physicists observed two distinct peaks! Instead of the spin being a continuous distribution, it was <em>quantized</em> into two values! Let’s give these outcomes symbols: $\ket{\uparrow}$ for silver atoms deflected upward and $\ket{\downarrow}$ for silver atoms deflected downward. For each atom, we’ll get one or the other outcome with some probability but not both and not something in between. We can describe the outcome of this system using an equation:</p>

\[\ket{\psi} = \alpha\ket{\uparrow} + \beta\ket{\downarrow}\]

<p>where $\ket{\uparrow}$ and $\ket{\downarrow}$ are the two possible outcomes, $\ket{\psi}$ represents the combination of all possible outcomes, and $\alpha^2 + \beta^2 = 1$ since we have a probability of being in one or the other outcome. (The $\ket{\cdot}$ is just a physics notation.) As it turns out, instead of these being real numbers in $\R$, in quantum mechanics, these are always generalized to complex numbers in $\C$ because there are many kinds of wave-like equations and other structures that are much more easily described using complex exponentials and complex numbers. There are some reformulations of quantum mechanics that use purely real numbers but they’re less canonical and much more difficult. (As a refresher, a complex number $z\in\C$ is a number $z = a + bi$ such that $i\equiv\sqrt{-1}$.) So with $\alpha,\beta\in\C$, we should modify the constraint in the above equation to take the norm like $\abs{\alpha}^2 + \abs{\beta}^2 = 1$ where $\abs{z} \equiv \sqrt{a^2+b^2}$. QM tells us the <em>statistics</em> of a particle so <em>on average</em> $\abs{\alpha}^2$ percent of the time, the system will be in $\ket{\uparrow}$ and the other $\abs{\beta}^2$ percent of the time, the system will be in $\ket{\downarrow}$.</p>

<p>Beforehand, we don’t know what the outcome is going to be. We say that the outcome is a <strong>superposition</strong> of the $\ket{\uparrow}$ and $\ket{\downarrow}$ states; this is just a fancy word for describing that the measured state could be one of many possible outcomes. After we take a <strong>measurement</strong>, we get a <em>single</em> outcome out of all the possibilities. It’s an open physics/meta-physics question as to why this happens and how to interpret it but the reality is that measuring a quantum mechanical system produces exactly one outcome from all possible outcomes.</p>

<h1 id="from-classical-computing-to-quantum-computing-the-qubit">From Classical Computing to Quantum Computing: the Qubit</h1>

<p>To motivate quantum bits, let’s start with classical bits. The most fundamental unit of computing is the <strong>bit</strong> which takes a value of exactly and only 0 or 1. Something that we might forget in the modern era of computing is that a bit is a <em>logical</em> object but the <em>physical</em> representation of a bit depends on the kind of hardware used to represent it. In modern computing, we use transistors, specifically a metal-oxide-semiconductor field-effect transistor (MOSFET), where the logical value of 0 means the transistor isn’t conducting any current while the value of 1 means that current is flowing through it. Going back over half a century, we were using magnetic tapes, disks, and other magnetic medium where we’d align a little region on the medium either “down” or “up” which represented 0 or 1 respectively. For most practical computer science, we generally don’t worry about the physical representation and assume it’s reliable; after all, that’s a job for the electrical engineers and physicists!</p>

<p>Now what if we stretch our notion of a “bit” from the classic sense into the quantum sense of superposition, probability, and possibilities. A <strong>quantum bit</strong> or <strong>qubit</strong> is the most fundamental unit of computing for quantum computing that also takes the value of either $\ket{0}$ or $\ket{1}$ but is only known at most with some probability before measuring. The most general kind of qubit is in some superposition of $\ket{0}$ or $\ket{1}$ with respective <strong>probability amplitudes</strong> $\alpha$ and $\beta$.</p>

\[\ket{\psi} = \alpha\ket{0} + \beta\ket{1}\]

<p>such that $\abs{\alpha}^2 + \abs{\beta}^2 = 1$. The actual probabilities are $\abs{\alpha}^2$ and $\abs{\beta}^2$ so we call $\alpha$ and $\beta$ probability amplitudes.</p>
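<p>To make the amplitude bookkeeping concrete, here’s a tiny standalone Python sketch (the amplitude values are my own, chosen for illustration, not from the discussion above) showing how measurement probabilities come from the squared norms of complex amplitudes:</p>

```python
# a generic qubit |psi> = alpha|0> + beta|1> with complex
# probability amplitudes (illustrative values)
alpha = (1 + 1j) / 2
beta = (1 - 1j) / 2

# the probabilities are the squared norms of the amplitudes
p0 = abs(alpha) ** 2  # probability of measuring 0
p1 = abs(beta) ** 2   # probability of measuring 1

# the state must be normalized: |alpha|^2 + |beta|^2 = 1
print(p0, p1, p0 + p1)
```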

<p>The simplest examples of qubits are $\ket{\psi}=\ket{0}$ and $\ket{\psi}=\ket{1}$. Let’s use IBM’s quantum computing library Qiskit to represent the first state as a quantum circuit and simulate it on our classical hardware.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Remember to install the following!:
# pip3 install qiskit qiskit-aer
</span>
<span class="kn">from</span> <span class="nn">qiskit</span> <span class="kn">import</span> <span class="n">QuantumCircuit</span><span class="p">,</span> <span class="n">transpile</span>
<span class="kn">from</span> <span class="nn">qiskit_aer</span> <span class="kn">import</span> <span class="n">AerSimulator</span>
<span class="kn">from</span> <span class="nn">qiskit.visualization</span> <span class="kn">import</span> <span class="n">plot_histogram</span>

<span class="c1"># define a quantum circuit with a single qubit
</span><span class="n">circuit</span> <span class="o">=</span> <span class="n">QuantumCircuit</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="c1"># measure all qubits
</span><span class="n">circuit</span><span class="p">.</span><span class="n">measure_all</span><span class="p">()</span>
<span class="c1"># print an ASCII-art version of the circuit
</span><span class="k">print</span><span class="p">(</span><span class="n">circuit</span><span class="p">)</span>

<span class="c1"># create simulator to run the circuit against
</span><span class="n">simulator</span> <span class="o">=</span> <span class="n">AerSimulator</span><span class="p">()</span>
<span class="c1"># transpile the circuit from the software representation
# to a version that's optimized for quantum computing hardware
# (in this case, we're just using our simulator on our classical hardware)
</span><span class="n">circuit</span> <span class="o">=</span> <span class="n">transpile</span><span class="p">(</span><span class="n">circuit</span><span class="p">,</span> <span class="n">simulator</span><span class="p">)</span>

<span class="c1"># simulate the circuit for 2^10 trials and get the results
</span><span class="n">result</span> <span class="o">=</span> <span class="n">simulator</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">circuit</span><span class="p">).</span><span class="n">result</span><span class="p">()</span>
<span class="c1"># fetch and print the counts of the distribution
</span><span class="n">counts</span> <span class="o">=</span> <span class="n">result</span><span class="p">.</span><span class="n">get_counts</span><span class="p">(</span><span class="n">circuit</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">counts</span><span class="p">)</span>
</code></pre></div></div>

<p>If we run this, unsurprisingly, we’ll see that all trials measure the same state: 0.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         ░ ┌─┐
     q: ─░─┤M├
         ░ └╥┘
meas: 1/════╩═
            0 
{'0': 1024}
</code></pre></div></div>

<p>Remember that quantum physics tells us the statistics of <em>distributions</em> not of a single individual particle so we need to run this quantum circuit for a number of trials. The necessity of running a number of trials instead of just a single one will become apparent in a little while. What about the other state where $\ket{\psi}=\ket{1}$? We can initialize the qubit to $\ket{1}$ just by adding a call to <code class="language-plaintext highlighter-rouge">circuit.initialize</code> before <code class="language-plaintext highlighter-rouge">circuit.measure_all()</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize qubit 0 to 0*|0&gt; + 1*|1&gt;
</span><span class="n">circuit</span><span class="p">.</span><span class="n">initialize</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>Note we use a vector to represent the coefficients of $\ket{0}$ and $\ket{1}$! An essential representation of a quantum state $\ket{\psi}$ is as a vector in a vector space (specifically a <strong>Hilbert space</strong>) of some <strong>basis states</strong>, for example $\ket{0}$ and $\ket{1}$. We’ll get more into this when we discuss quantum gates and operators. Let’s run this code!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        ┌─────────────────┐ ░ ┌─┐
     q: ┤ Initialize(0,1) ├─░─┤M├
        └─────────────────┘ ░ └╥┘
meas: 1/═══════════════════════╩═
                               0 
{'1': 1024}
</code></pre></div></div>

<p>Also unsurprisingly, we find that the result is always 1 (and there’s an <code class="language-plaintext highlighter-rouge">Initialize</code> block in the circuit).</p>

<p>A more interesting example is a uniform superposition of both.</p>

\[\ket{\psi} = \frac{1}{\sqrt{2}}\ket{0} + \frac{1}{\sqrt{2}}\ket{1}\]

<p>(Verify that the squared norms of the coefficients sum to 1!) This means that, if we prepare and measure this qubit, for about half the trials, the final outcome will be 0 and for the other half, the final outcome will be 1.</p>

<p>We can simulate this using the same <code class="language-plaintext highlighter-rouge">circuit.initialize</code> function, being careful with the normalization.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize qubit 0 to 1/sqrt(2)*|0&gt; + 1/sqrt(2)*|1&gt;
</span><span class="n">circuit</span><span class="p">.</span><span class="n">initialize</span><span class="p">([</span><span class="mf">1.</span><span class="o">/</span><span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="mf">1.</span><span class="o">/</span><span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="p">)],</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>Now our results are more interesting!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        ┌─────────────────────────────┐ ░ ┌─┐
     q: ┤ Initialize(0.70711,0.70711) ├─░─┤M├
        └─────────────────────────────┘ ░ └╥┘
meas: 1/═══════════════════════════════════╩═
                                           0 
{'0': 518, '1': 506}
</code></pre></div></div>

<p>Now when we measure, we have roughly even counts of 0 and 1 as expected! The counts aren’t exactly equal since we’re sampling a finite number of trials from a probability distribution, and on real hardware the overall system has noise on top of that! This is something we’ll have to get accustomed to: quantum computing is noisy! But now it’s clear why we need to run multiple trials: quantum computing is not deterministic so we need a statistically significant number of trials for each circuit to get meaningful results.</p>

<h1 id="quantum-logic-gates-for-single-qubits">Quantum Logic Gates for Single Qubits</h1>

<p>Now that we have some familiarity with a single qubit’s initialization and measurements, let’s see what kinds of operations we can perform on that qubit. Just like with classical computing and what I’ve alluded to when using Qiskit, quantum computing also has a notion of logic gates that qubits pass through and have their state changed as part of a quantum circuit.</p>

<p>The most critical and common single-qubit quantum logic gates are $X$, $Y$, $Z$, and $H$. The first three are sometimes called <strong>Pauli gates</strong> since they correspond to the Pauli matrices $\sigma_x$, $\sigma_y$, and $\sigma_z$ in quantum physics. An alternative geometric representation is that those operators represent rotations of $\pi$ (up to a global phase) about their respective axes on a kind of unit sphere called the <strong>Bloch sphere</strong> where we represent each qubit as a point on the sphere such that the basis states $\ket{0}$ and $\ket{1}$ are at $z=1$ and $z=-1$ respectively. I don’t find much insight in this geometric representation especially since it doesn’t generalize to multiple qubits well.</p>

<p>We can define these operators based on how they transform a generic qubit $\ket{\psi}=\alpha\ket{0}+\beta\ket{1}$.</p>

\[X\ket{\psi} \equiv \alpha\ket{1} + \beta\ket{0}\]

<p>So the $X$ gate effectively swaps the probability amplitudes of the two states! Equivalently, we could have defined the $X$ gate based on how it transformed the <em>basis states themselves</em>!</p>

\[\begin{align*}
X\ket{0} &amp;= \ket{1} \\
X\ket{1} &amp;= \ket{0}
\end{align*}\]

<p>This becomes clear if we substitute $\ket{\psi}=1\ket{0}+0\ket{1}$ and $\ket{\psi}=0\ket{0}+1\ket{1}$. We can think about the $X$ gate as being roughly like a <code class="language-plaintext highlighter-rouge">NOT</code> gate from classical computing! Remember when we were initializing the state of a qubit and we learned we could represent it as a vector of coefficients? Well if we have the “before” qubit and the “after” qubit, a <em>matrix</em> is how we can represent a linear transform between the two! In quantum mechanics, all operators can be represented as matrices since quantum mechanics is a <em>linear</em> framework. Therefore, we can represent all quantum logic gates as matrices too. Specifically, we’re looking for the matrix that maps the <strong>state vector</strong> $\begin{bmatrix}\alpha \\ \beta\end{bmatrix}$ to $\begin{bmatrix}\beta \\ \alpha\end{bmatrix}$. With some effort, we can figure this out:</p>

\[X =
\begin{bmatrix}
0 &amp; 1\\
1 &amp; 0
\end{bmatrix}\]

<p>And we can verify this matrix is correct:</p>

\[\begin{align*}
X\ket{\psi} &amp;=
\begin{bmatrix}
0 &amp; 1\\
1 &amp; 0
\end{bmatrix} \begin{bmatrix}\alpha \\ \beta\end{bmatrix}\\
&amp;= \begin{bmatrix}\beta \\ \alpha\end{bmatrix}
\end{align*}\]

<p>All quantum gates/operators must be <strong>unitary</strong>: their inverse must equal their own conjugate transpose, i.e., $U^\dagger U=UU^\dagger=I$. This property is a generalization of real orthogonal matrices where their transpose equals their inverse, i.e., $Q^TQ=QQ^T=I$. Recall that the conjugate of a complex number $z=a+bi$ is just $\bar{z}=a-bi$ so the entries of the conjugate transpose $U^\dagger$ of a matrix $U$ must obey $a_{ij} = \bar{a_{ji}}$ where $a_{ij}\in\C$ is the entry in the $i$th row and $j$th column of $U$. In other words, we transpose the matrix and then take the complex conjugate of each entry.</p>

<p>We can verify the $X$ gate is unitary:</p>

\[X^\dagger X = 
\begin{bmatrix}
0 &amp; 1\\
1 &amp; 0
\end{bmatrix}^\dagger
\begin{bmatrix}
0 &amp; 1\\
1 &amp; 0
\end{bmatrix}
=
\begin{bmatrix}
0 &amp; 1\\
1 &amp; 0
\end{bmatrix}
\begin{bmatrix}
0 &amp; 1\\
1 &amp; 0
\end{bmatrix}
=
\begin{bmatrix}
1 &amp; 0\\
0 &amp; 1
\end{bmatrix}
= I\]

<p>This is the most important property of all quantum logic gates/operators because it <em>preserves normalization</em>! Every quantum state must be normalized so this property ensures that, after we apply any quantum operator to any state, we always end up with a properly normalized state. Another consequence of unitary operators is that <em>all quantum gates are reversible</em>! This is generally not true for classical gates. Consider a classical AND gate: we can’t know what the two operands were from just the result of the AND gate. We’ll see a number of quantum circuits of the form “perform some operations to map the input into a different space, manipulate the state in that space, perform the inverse operations from the beginning to map the state back into the original space”.</p>
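<p>As a quick numerical sanity check of these two properties (a sketch using plain numpy rather than Qiskit; the test state is an arbitrary one I made up), we can verify that $X$ is unitary, that it preserves normalization, and that applying it twice undoes it:</p>

```python
import numpy as np

# the X gate as a matrix
X = np.array([[0, 1], [1, 0]], dtype=complex)
I2 = np.eye(2)

# unitary: the conjugate transpose is the inverse
assert np.allclose(X.conj().T @ X, I2)

# normalization is preserved for any normalized state
psi = np.array([0.6, 0.8j])  # an arbitrary normalized state (illustrative)
assert np.isclose(np.linalg.norm(X @ psi), 1.0)

# reversible: X is its own inverse, so applying it twice undoes it
assert np.allclose(X @ (X @ psi), psi)
```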

<p>But for now, let’s build a quantum circuit using the $X$ gate.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># replace the initialize call
# apply X gate to qubit 0
</span><span class="n">circuit</span><span class="p">.</span><span class="n">x</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>Applying this to qubit 0 in state $\ket{0}$ yields 1 always.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        ┌───┐ ░ ┌─┐
     q: ┤ X ├─░─┤M├
        └───┘ ░ └╥┘
meas: 1/═════════╩═
                 0 
{'1': 2048}
</code></pre></div></div>

<p>Applying it to the $\ket{1}$ state always yields 0. This is the quantum version of the NOT gate!</p>

<p>Moving on, before we get to the $Y$ gate, let’s first talk about the $Z$ gate. We can represent it matrix form as the following.</p>

\[Z = 
\begin{bmatrix}
1 &amp; 0\\
0 &amp; -1
\end{bmatrix}\]

<p>Applying the $Z$ gate to $\ket{0}$ leaves it unchanged, i.e., $Z\ket{0}=\ket{0}$, but applying it to $\ket{1}$ maps it to $-1\ket{1}$, i.e., $Z\ket{1}=-\ket{1}$. To a general qubit $\ket{\psi}=\alpha\ket{0}+\beta\ket{1}$, the $Z$ gate maps it to $Z\ket{\psi}=\alpha\ket{0}-\beta\ket{1}$. Does this affect the probabilities of observing either outcome? Nope! Recall that the likelihood of each state is a <em>norm</em> and $\abs{-\beta}=\abs{\beta}$ so the $Z$ gate doesn’t change the final observation likelihoods. This extra factor is called the <strong>phase</strong> (specifically <strong>relative phase</strong>) and the $Z$ gate is sometimes called the <em>phase flip</em> gate because it flips the sign of $\ket{1}$. On the surface, phase doesn’t <em>seem</em> to affect the final measurement of the qubit but, used in conjunction with other quantum gates and operators, it’s essential to all complex quantum algorithms like Grover’s Algorithm and the famous Shor’s Algorithm for factoring large numbers into primes.</p>

<p>But for now, let’s build a circuit with the $Z$ gate.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">circuit</span><span class="p">.</span><span class="n">z</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>And then run it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        ┌───┐ ░ ┌─┐
     q: ┤ Z ├─░─┤M├
        └───┘ ░ └╥┘
meas: 1/═════════╩═
                 0 
{'0': 2048}
</code></pre></div></div>

<p>As expected, this phase didn’t change the measured result! Phase is one of the unique facets of quantum computing with no classical analogue that provides an entirely new dimension to quantum algorithms.</p>
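<p>We can also see the “phase doesn’t change the probabilities” claim directly in a small numpy sketch (not Qiskit; the uniform superposition is the same one from earlier):</p>

```python
import numpy as np

# the Z gate as a matrix
Z = np.array([[1, 0], [0, -1]], dtype=complex)
# uniform superposition (1/sqrt(2))(|0> + |1>)
psi = np.array([1, 1], dtype=complex) / np.sqrt(2)

probs_before = np.abs(psi) ** 2
probs_after = np.abs(Z @ psi) ** 2

# the relative phase flips the sign of the |1> amplitude...
assert np.allclose(Z @ psi, [1 / np.sqrt(2), -1 / np.sqrt(2)])
# ...but the measurement probabilities are untouched
assert np.allclose(probs_before, probs_after)
```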

<p>Circling back, the $Y$ gate flips the qubit and adds a complex phase of $i$ and can also be represented by a matrix.</p>

\[Y = 
\begin{bmatrix}
0 &amp; -i\\
i &amp; 0
\end{bmatrix}\]

<p>So $Y\ket{0}=i\ket{1}$ and $Y\ket{1}=-i\ket{0}$. Note that we can represent the $Y$ gate as $Y=iXZ$! In fact, we can represent each of the Pauli gates in terms of the others! I’ve found the $Y$ gate to be less useful than the $X$ and $Z$ gates but it <em>does</em> correspond to a Pauli matrix so it’s worth mentioning it for completeness.</p>
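<p>The $Y=iXZ$ identity and the action on the basis states are easy to confirm numerically (again a plain numpy sketch, not Qiskit):</p>

```python
import numpy as np

# the Pauli gates as matrices
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])

# Y = iXZ, as claimed
assert np.allclose(Y, 1j * (X @ Z))

# Y|0> = i|1> and Y|1> = -i|0>
assert np.allclose(Y @ [1, 0], [0, 1j])
assert np.allclose(Y @ [0, 1], [-1j, 0])
```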

<p>Moving on to arguably the most important single-qubit gate, the <strong>Hadamard $H$ gate</strong> is used to create uniform superpositions of qubits. Remember $\ket{\psi} = \frac{1}{\sqrt{2}}\ket{0} + \frac{1}{\sqrt{2}}\ket{1}$? Well we can create it using the Hadamard gate applied to $\ket{0}$. We can represent it as a matrix:</p>

\[H =
\frac{1}{\sqrt{2}}
\begin{bmatrix}
1 &amp; 1\\
1 &amp; -1
\end{bmatrix}\]

<p>Applying this to the basis states, $H\ket{0} = \frac{1}{\sqrt{2}}\ket{0} + \frac{1}{\sqrt{2}}\ket{1}$ and $H\ket{1} = \frac{1}{\sqrt{2}}\ket{0} - \frac{1}{\sqrt{2}}\ket{1}$. So given either basis state, the Hadamard gate can be used to create a uniform superposition (sometimes with a phase)! Notationally, some resources define $\ket{+}\equiv H\ket{0} = \frac{1}{\sqrt{2}}\ket{0} + \frac{1}{\sqrt{2}}\ket{1}$ and $\ket{-}\equiv H\ket{1} = \frac{1}{\sqrt{2}}\ket{0} - \frac{1}{\sqrt{2}}\ket{1}$ as a shorthand.</p>
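<p>Before running it on a circuit, we can check the matrix arithmetic directly (a NumPy sketch; the variable names are mine):</p>

```python
import numpy as np

# Hadamard gate
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

plus = H @ ket0   # |+> = (|0> + |1>)/sqrt(2)
minus = H @ ket1  # |-> = (|0> - |1>)/sqrt(2)
print(plus, minus)

# H is its own inverse: applying it twice returns the original basis state
print(np.allclose(H @ plus, ket0))  # True
```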

<p>Let’s build a quantum circuit that uses it!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">circuit</span><span class="p">.</span><span class="n">h</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>And then run it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        ┌───┐ ░ ┌─┐
     q: ┤ H ├─░─┤M├
        └───┘ ░ └╥┘
meas: 1/═════════╩═
                 0 
{'1': 1034, '0': 1014}
</code></pre></div></div>

<p>As expected, it produces a roughly equal number of 0s and 1s! Now we have some of the building blocks to properly initialize our qubits. In practice, we don’t initialize qubits directly in anything beyond the basis states; instead, we use gates to put them into whatever state we want before invoking quantum algorithms.</p>

<p>Before wrapping up with single-qubit gates, I <em>did</em> want to mention that there are other single-qubit gates like the rotation gate $R_\theta$ that applies a phase of $e^{i\theta}$ to $\ket{1}$ while leaving $\ket{0}$ unchanged. Technically, we can represent any single-qubit gate as a combination of rotation gates about some axis by some amount, but the ones we’ve discussed are the most common. Of course, we can construct novel gates and, as long as we can show the gate is unitary, it’s a valid gate!</p>
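<p>To make the unitarity criterion concrete, here’s a sketch (assuming the $R_\theta$ form described above: $\ket{0}$ untouched, $e^{i\theta}$ applied to $\ket{1}$) that checks $R^\dagger R = I$, the test any candidate gate must pass:</p>

```python
import numpy as np

def r_theta(theta):
    """Rotation/phase gate: leaves |0> alone, applies e^{i*theta} to |1>."""
    return np.array([[1, 0],
                     [0, np.exp(1j * theta)]], dtype=complex)

# a gate is valid iff it's unitary: R†R = I
R = r_theta(np.pi / 3)  # arbitrary angle
print(np.allclose(R.conj().T @ R, np.eye(2)))  # True
```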

<h1 id="multi-qubit-systems">Multi-qubit Systems</h1>

<p>So far, we’ve discussed a single qubit but the real power of quantum computing comes from how multiple qubits are handled. We always represent multiple qubits by enumerating all possible outcomes.</p>

\[\ket{\psi} = a_{00}\ket{00} + a_{01}\ket{01} + a_{10}\ket{10} + a_{11}\ket{11}\]

<p>where $\sum_{ij}\abs{a_{ij}}^2 = \abs{a_{00}}^2 + \abs{a_{01}}^2 + \abs{a_{10}}^2 + \abs{a_{11}}^2 = 1$. This is particularly powerful in its ability to represent $2^n$ possible outcomes using only $n$ qubits: with 2 qubits, we’ve represented 4 possible outcomes.</p>

<p>Similar to the one qubit case, we can put two qubits in a uniform superposition across all possible outcomes.</p>

\[\ket{\psi} = \frac{1}{2}\ket{00} + \frac{1}{2}\ket{01} + \frac{1}{2}\ket{10} + \frac{1}{2}\ket{11}\]

<p>When we measure this state, we get all possible outcomes equally on average! We can construct such a state using two Hadamard operators but, when dealing with multi-qubit systems, we have to specify which qubit we’re applying the operator to. By convention, similar to classical computing, qubit 0 is the rightmost qubit. So to build this state, we need to apply a Hadamard operator to each of the two qubits.</p>

\[\begin{align*}
H_1 H_0\ket{00} &amp;= H_1\Bigg[\frac{1}{\sqrt{2}}(\ket{00} + \ket{01})\Bigg]\\
&amp;= \frac{1}{\sqrt{2}}\Bigg(H_1\ket{00} + H_1\ket{01}\Bigg)\\
&amp;= \frac{1}{\sqrt{2}}\Bigg[\frac{1}{\sqrt{2}}\Bigg(\ket{00} + \ket{10}\Bigg) + \frac{1}{\sqrt{2}}\Bigg(\ket{01}+\ket{11}\Bigg)\Bigg]\\
&amp;= \frac{1}{2}\ket{00} + \frac{1}{2}\ket{01} + \frac{1}{2}\ket{10} + \frac{1}{2}\ket{11}\\
\end{align*}\]

<p>Note that when we’re applying an operation to a single qubit of a multi-qubit system, we leave the unaffected qubits unchanged but still write them all out. Just like with single-qubit gates, we can represent this with a single $4\times 4$ matrix. To do this, we need to define a kind of product called the <strong>tensor product</strong> used to combine two independent qubits into a single multi-qubit system. The tensor product has a very specific mathematical definition but we can think of it as a way to combinatorially combine two qubit states into a single joint state. Suppose we have two qubits.</p>

\[\begin{align*}
\ket{\psi_0} &amp;= \alpha\ket{0} + \beta\ket{1}\\
\ket{\psi_1} &amp;= \gamma\ket{0} + \delta\ket{1}\\
\end{align*}\]

<p>Then we can define the tensor product.</p>

\[\ket{\psi_1}\otimes\ket{\psi_0} = \ket{\psi_1}\ket{\psi_0} = \ket{\psi_1\psi_0} = \gamma\alpha\ket{00} + \gamma\beta\ket{01} + \delta\alpha\ket{10} + \delta\beta\ket{11}\]

<p>We’re basically multiplying each state of $\psi_0$ with each state of $\psi_1$. In terms of the coefficients, we can do the same thing.</p>

\[\begin{align*}
\ket{\psi_1}\otimes\ket{\psi_0} &amp;= \begin{bmatrix}\gamma\\\delta\end{bmatrix}\otimes\begin{bmatrix}\alpha\\\beta\end{bmatrix}\\
&amp;= \begin{bmatrix}
\gamma\otimes\begin{bmatrix}\alpha\\\beta\end{bmatrix} \\
\delta\otimes\begin{bmatrix}\alpha\\\beta\end{bmatrix}
\end{bmatrix}\\
&amp;= \begin{bmatrix}
\gamma\alpha\\
\gamma\beta\\
\delta\alpha\\
\delta\beta\\
\end{bmatrix}
\end{align*}\]

<p>The coefficients of all of the states line up with the matrix representation! This is also sometimes called the <strong>Kronecker product</strong>. Now we can construct the matrix that represents $H_1\otimes H_0 = H^{\otimes 2}$.</p>
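<p>NumPy exposes the Kronecker product directly as <code class="language-plaintext highlighter-rouge">np.kron</code>, so we can check that the coefficients line up as derived above (the amplitudes here are arbitrary example values):</p>

```python
import numpy as np

# two example single-qubit states (amplitudes chosen arbitrarily)
psi0 = np.array([0.6, 0.8], dtype=complex)           # alpha, beta
psi1 = np.array([1, 1], dtype=complex) / np.sqrt(2)  # gamma, delta

# joint state |psi1>|psi0> via the Kronecker product
joint = np.kron(psi1, psi0)

# coefficients should be [gamma*alpha, gamma*beta, delta*alpha, delta*beta]
gamma, delta = psi1
alpha, beta = psi0
expected = np.array([gamma * alpha, gamma * beta, delta * alpha, delta * beta])
print(np.allclose(joint, expected))  # True
```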

\[\begin{align*}
H_1\otimes H_0 = H^{\otimes 2} &amp;= 
\frac{1}{\sqrt{2}}
\begin{bmatrix}
1 &amp; 1\\
1 &amp; -1
\end{bmatrix}
\otimes
\frac{1}{\sqrt{2}}
\begin{bmatrix}
1 &amp; 1\\
1 &amp; -1
\end{bmatrix}\\
&amp;= \frac{1}{2}
\begin{bmatrix}
\begin{bmatrix}
1 &amp; 1\\
1 &amp; -1
\end{bmatrix}
\otimes 1 &amp;
\begin{bmatrix}
1 &amp; 1\\
1 &amp; -1
\end{bmatrix}
\otimes 1 \\
\begin{bmatrix}
1 &amp; 1\\
1 &amp; -1
\end{bmatrix}
\otimes 1 &amp;
\begin{bmatrix}
1 &amp; 1\\
1 &amp; -1
\end{bmatrix}
\otimes -1
\end{bmatrix}\\
&amp;=
\frac{1}{2}
\begin{bmatrix}
1 &amp; 1 &amp; 1 &amp; 1\\
1 &amp; -1 &amp; 1 &amp; -1\\
1 &amp; 1 &amp; -1 &amp; -1\\
1 &amp; -1 &amp; -1 &amp; 1
\end{bmatrix}
\end{align*}\]

<p>Applying this to the vector representing $\ket{00}$, we get the expected result: a uniform superposition.</p>

\[\frac{1}{2}
\begin{bmatrix}
1 &amp; 1 &amp; 1 &amp; 1\\
1 &amp; -1 &amp; 1 &amp; -1\\
1 &amp; 1 &amp; -1 &amp; -1\\
1 &amp; -1 &amp; -1 &amp; 1
\end{bmatrix}
\begin{bmatrix}
1\\ 0\\ 0\\ 0
\end{bmatrix}
=
\frac{1}{2}
\begin{bmatrix}
1\\ 1\\ 1\\ 1
\end{bmatrix}\]
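<p>The same computation in a NumPy sketch: building $H^{\otimes 2}$ with <code class="language-plaintext highlighter-rouge">np.kron</code> and applying it to $\ket{00}$ produces the uniform superposition.</p>

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

# H applied to each of two qubits is the Kronecker product H (x) H
H2 = np.kron(H, H)

ket00 = np.array([1, 0, 0, 0], dtype=complex)
print(H2 @ ket00)  # uniform superposition: [0.5, 0.5, 0.5, 0.5]
```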

<p>Now let’s build the corresponding quantum circuit!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># need to specify 2 quantum bits this time!
</span><span class="n">circuit</span> <span class="o">=</span> <span class="n">QuantumCircuit</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="c1"># apply Hadamard to both qubits (equivalent to applying h to each one individually)
</span><span class="n">circuit</span><span class="p">.</span><span class="n">h</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
</code></pre></div></div>

<p>And then run it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        ┌───┐ ░ ┌─┐   
   q_0: ┤ H ├─░─┤M├───
        ├───┤ ░ └╥┘┌─┐
   q_1: ┤ H ├─░──╫─┤M├
        └───┘ ░  ║ └╥┘
meas: 2/═════════╩══╩═
                 0  1 
{'01': 519, '00': 512, '10': 508, '11': 509}
</code></pre></div></div>

<p>As expected, this measures roughly equal counts across all possible states!</p>

<h1 id="quantum-logic-gates-for-multi-qubit-systems">Quantum Logic Gates for Multi-qubit Systems</h1>

<p>In addition to single-qubit quantum gates, there are also quantum logic gates that work with more than one qubit. The most important of which is called the <strong>controlled NOT (CNOT)</strong> gate that takes a <strong>control qubit</strong> and flips the <strong>target qubit</strong> if the control qubit is 1; otherwise the target qubit is left unchanged: $\text{CNOT}: \ket{x_1,x_0} \mapsto \ket{x_1\oplus x_0,x_0}$ where $\oplus$ is a classical XOR operation. I’ll use $\text{CNOT}_{0,1}$ to refer to applying the CNOT gate with control qubit 0 and target qubit 1. The CNOT gate is like a quantum version of the XOR gate. The difference is that the CNOT gate requires a control qubit that’s passed through unchanged. This is because of the property that quantum gates are reversible: a classical XOR gate is not reversible unless we know the value of one of the operands, which is what the control qubit represents.</p>
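<p>The reversibility argument can be seen purely classically: keeping the control bit alongside the XOR result makes the map invertible. In fact, applying it twice recovers the original bits. A sketch:</p>

```python
def cnot_classical(x1, x0):
    """Classical analogue of CNOT: (x1, x0) -> (x1 XOR x0, x0)."""
    return (x1 ^ x0, x0)

# the map is its own inverse: applying it twice recovers the input bits,
# which a plain XOR (discarding one operand) could never do
for x1 in (0, 1):
    for x0 in (0, 1):
        assert cnot_classical(*cnot_classical(x1, x0)) == (x1, x0)
print("reversible on all inputs")
```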

<p>Let’s construct the quantum circuit that uses it and see for ourselves.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># CNOT gate with qubit 0 as the control qubit
# and qubit 1 as the target qubit
</span><span class="n">circuit</span><span class="p">.</span><span class="n">cx</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>Running this leaves the system unchanged since the control qubit is 0.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              ░ ┌─┐   
   q_0: ──■───░─┤M├───
        ┌─┴─┐ ░ └╥┘┌─┐
   q_1: ┤ X ├─░──╫─┤M├
        └───┘ ░  ║ └╥┘
meas: 2/═════════╩══╩═
                 0  1 
{'00': 2048}
</code></pre></div></div>

<p>But if we change the circuit so that the control qubit is 1 (add <code class="language-plaintext highlighter-rouge">circuit.x(0)</code> before the CNOT), then we always get 11. Just like with every quantum gate, we can represent it with a unitary matrix. With control qubit 0 in our qubit-0-rightmost convention, i.e., basis ordering $\ket{00},\ket{01},\ket{10},\ket{11}$, the matrix is:</p>

\[\text{CNOT}_{0,1}=
\begin{bmatrix}
1 &amp; 0 &amp; 0 &amp; 0\\
0 &amp; 0 &amp; 0 &amp; 1\\
0 &amp; 0 &amp; 1 &amp; 0\\
0 &amp; 1 &amp; 0 &amp; 0
\end{bmatrix}\]
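<p>We can build this matrix mechanically from the rule $\ket{x_1,x_0} \mapsto \ket{x_1\oplus x_0,x_0}$, using the qubit-0-is-rightmost ordering used throughout this post (a sketch, with index $2x_1 + x_0$ for $\ket{x_1 x_0}$):</p>

```python
import numpy as np

# Build CNOT_{0,1} from its rule |x1, x0> -> |x1 XOR x0, x0>,
# with qubit 0 rightmost, so |x1 x0> sits at index 2*x1 + x0
cnot = np.zeros((4, 4), dtype=complex)
for x1 in (0, 1):
    for x0 in (0, 1):
        cnot[2 * (x1 ^ x0) + x0, 2 * x1 + x0] = 1

ket01 = np.array([0, 1, 0, 0], dtype=complex)  # |01>: control qubit is 1
ket11 = np.array([0, 0, 0, 1], dtype=complex)
print(np.allclose(cnot @ ket01, ket11))  # True: the target qubit flips
```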

<p>These examples aren’t particularly interesting but what if we pass the control qubit into a Hadamard gate before applying the CNOT gate? This is far more interesting since now the control qubit is in a uniform superposition of 0 and 1 so the CNOT gate <em>might</em> flip the target qubit half the time depending on the state of the control qubit. Let’s figure out what would happen analytically.</p>

\[\begin{align*}
\text{CNOT}_{0,1}H_0\ket{00} &amp;= \text{CNOT}_{0,1}\Bigg[\frac{1}{\sqrt{2}}\Bigg(\ket{00}+\ket{01}\Bigg)\Bigg]\\
&amp;= \frac{1}{\sqrt{2}}\Bigg(\text{CNOT}_{0,1}\ket{00}+\text{CNOT}_{0,1}\ket{01}\Bigg)\\
&amp;= \frac{1}{\sqrt{2}}\Big(\ket{00}+\ket{11}\Big)\\
\end{align*}\]
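<p>The same derivation, checked numerically with the matrices from earlier (a NumPy sketch; $H$ on qubit 0 is $I\otimes H$ in the qubit-0-rightmost ordering):</p>

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
I = np.eye(2, dtype=complex)

# CNOT with control qubit 0 (rightmost), target qubit 1
CNOT01 = np.array([[1, 0, 0, 0],
                   [0, 0, 0, 1],
                   [0, 0, 1, 0],
                   [0, 1, 0, 0]], dtype=complex)

ket00 = np.array([1, 0, 0, 0], dtype=complex)

# H on qubit 0 is I (x) H in this ordering; then entangle with CNOT
bell = CNOT01 @ np.kron(I, H) @ ket00
print(bell)  # amplitudes 1/sqrt(2) on |00> and |11> only
```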

<p>Now this is a very interesting state! If we measure qubit 0 as 0, then qubit 1 will definitely be 0, and, if we measure qubit 0 as 1, then qubit 1 will definitely be 1. The qubits’ final measured values are coupled! This is called <strong>entanglement</strong> in quantum physics and these kinds of coupled states are called <strong>Bell states</strong> (after John Bell), <strong>EPR (Einstein-Podolsky-Rosen) pairs</strong> (after Albert Einstein, Boris Podolsky, and Nathan Rosen), or just <strong>entangled pairs</strong>. The scary part is that no one currently knows <em>how</em> entanglement works but just that it does. In fact, it even works across arbitrary distances! If we put the two qubits on opposite sides of the galaxy, much farther than the speed of light could transmit any information, measuring one of them immediately tells us what the other one is.</p>

<p>Let’s build a quantum circuit to show this empirically!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">circuit</span><span class="p">.</span><span class="n">h</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">circuit</span><span class="p">.</span><span class="n">cx</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>

<p>And run it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        ┌───┐      ░ ┌─┐   
   q_0: ┤ H ├──■───░─┤M├───
        └───┘┌─┴─┐ ░ └╥┘┌─┐
   q_1: ─────┤ X ├─░──╫─┤M├
             └───┘ ░  ║ └╥┘
meas: 2/══════════════╩══╩═
                      0  1 
{'11': 1015, '00': 1033}
</code></pre></div></div>

<p>As expected, with this multi-qubit system, the only two possible outcomes are 00 and 11 with roughly equal probability! Depending on the input qubits, we could get one of four possible Bell states (try to figure them out on your own!) They all share the same characteristic that knowing the result of one qubit determines the value of the other qubit. We can write the closed-form definition of a Bell state.</p>

\[\ket{B_{x,y}}\equiv\frac{\ket{0,y} + (-1)^x\ket{1,\bar{y}}}{\sqrt{2}}\]

<p>So the Bell state we created above was $\ket{B_{0,0}}$. The CNOT gate and single-qubit gates give us most of the foundation we need to construct more complex and practical quantum circuits that realize quantum algorithms. Let’s see a few examples!</p>
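<p>If you’d rather check the four Bell states than derive them, here’s a sketch that enumerates them straight from the closed-form definition (qubit 0 rightmost, so $\ket{x_1 x_0}$ sits at index $2x_1+x_0$):</p>

```python
import numpy as np

def bell_state(x, y):
    """|B_xy> = (|0,y> + (-1)^x |1, 1-y>) / sqrt(2), qubit 0 rightmost."""
    state = np.zeros(4, dtype=complex)
    state[y] = 1                          # |0, y> at index 2*0 + y
    state[2 + (1 - y)] = (-1) ** x        # (-1)^x |1, y-bar>
    return state / np.sqrt(2)

for x in (0, 1):
    for y in (0, 1):
        print(f"B_{x}{y}:", bell_state(x, y))
```

The four states form an orthonormal basis for the two-qubit state space, which is why measuring in the "Bell basis" is possible at all.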

<h1 id="quantum-teleportation">Quantum Teleportation</h1>

<p>One interesting application of the CNOT gate and entanglement is <strong>quantum teleportation</strong>. Suppose Alice wants to transmit some arbitrary quantum state $\ket{\psi}$ to Bob. On the surface, we might think to just copy $\ket{\psi}$ and send it to Bob, but quantum physics has a <strong>No-cloning Theorem</strong> that states that it is impossible to perfectly copy an unknown quantum state. This can be proven via proof-by-contradiction but, intuitively, if we could perfectly copy an unknown quantum state then we would be violating the Heisenberg Uncertainty Principle since we’d need to perfectly know all of the properties of that unknown quantum state in order to perfectly copy it. The tangible consequence is that we can’t just copy $\ket{\psi}$ and send it to Bob.</p>

<p>Instead, suppose Alice and Bob had shared two halves of an entangled pair ahead of time. She can interact her arbitrary state $\ket{\psi}$ with her half of the entangled pair, measure it, and then send Bob the result over a classical communication channel. Based on the result, Bob can apply an operator to his half of the entangled pair to recover Alice’s $\ket{\psi}$.</p>

<p>This time, let’s first build the quantum circuit and then analyze it after. We’ll take a slightly different approach than before to define the circuit just to showcase another way to use Qiskit.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">q</span> <span class="o">=</span> <span class="n">QuantumRegister</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'q'</span><span class="p">)</span>

<span class="n">bell_0</span> <span class="o">=</span> <span class="n">QuantumRegister</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'B_0'</span><span class="p">)</span>
<span class="n">bell_1</span> <span class="o">=</span> <span class="n">QuantumRegister</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'B_1'</span><span class="p">)</span>

<span class="n">c_0</span> <span class="o">=</span> <span class="n">ClassicalRegister</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'c_0'</span><span class="p">)</span>
<span class="n">c_1</span> <span class="o">=</span> <span class="n">ClassicalRegister</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'c_1'</span><span class="p">)</span>
<span class="n">c_2</span> <span class="o">=</span> <span class="n">ClassicalRegister</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'c_2'</span><span class="p">)</span>

<span class="n">qc</span> <span class="o">=</span> <span class="n">QuantumCircuit</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">bell_0</span><span class="p">,</span> <span class="n">bell_1</span><span class="p">,</span> <span class="n">c_0</span><span class="p">,</span> <span class="n">c_1</span><span class="p">,</span> <span class="n">c_2</span><span class="p">)</span>

<span class="c1"># prep bell state
</span><span class="n">qc</span><span class="p">.</span><span class="n">h</span><span class="p">(</span><span class="n">bell_0</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">cx</span><span class="p">(</span><span class="n">bell_0</span><span class="p">,</span> <span class="n">bell_1</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">barrier</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s">'ψ_0'</span><span class="p">)</span>

<span class="c1"># Alice entangles her qubit with her half of the Bell state
</span><span class="n">qc</span><span class="p">.</span><span class="n">cx</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">bell_0</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">h</span><span class="p">(</span><span class="n">q</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">barrier</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s">'ψ_1'</span><span class="p">)</span>

<span class="c1"># Alice measures to affect Bob's Bell state and sends him the classical bits
</span><span class="n">qc</span><span class="p">.</span><span class="n">measure</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">c_0</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">measure</span><span class="p">(</span><span class="n">bell_0</span><span class="p">,</span> <span class="n">c_1</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">barrier</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s">'ψ_2'</span><span class="p">)</span>

<span class="c1"># Bob applies the right operators to his Bell state based on the
# classical bits received from Alice
</span><span class="k">with</span> <span class="n">qc</span><span class="p">.</span><span class="n">if_test</span><span class="p">((</span><span class="n">c_1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)):</span>
    <span class="n">qc</span><span class="p">.</span><span class="n">x</span><span class="p">(</span><span class="n">bell_1</span><span class="p">)</span>
<span class="k">with</span> <span class="n">qc</span><span class="p">.</span><span class="n">if_test</span><span class="p">((</span><span class="n">c_0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)):</span>
    <span class="n">qc</span><span class="p">.</span><span class="n">z</span><span class="p">(</span><span class="n">bell_1</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">barrier</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s">'ψ_3'</span><span class="p">)</span>

<span class="c1"># Bob measures his Bell state
</span><span class="n">qc</span><span class="p">.</span><span class="n">measure</span><span class="p">(</span><span class="n">bell_1</span><span class="p">,</span> <span class="n">c_2</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">qc</span><span class="p">)</span>
</code></pre></div></div>

<p>Running this for $\ket{\psi}=\ket{0}$, we see that the last qubit (the leftmost qubit) is always measured to be 0!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                  ψ_0      ┌───┐ ψ_1 ┌─┐    ψ_2                                                ψ_3    
    q: ────────────░────■──┤ H ├──░──┤M├─────░──────────────────────────────────────────────────░─────
       ┌───┐       ░  ┌─┴─┐└───┘  ░  └╥┘┌─┐  ░                                                  ░     
  B_0: ┤ H ├──■────░──┤ X ├───────░───╫─┤M├──░──────────────────────────────────────────────────░─────
       └───┘┌─┴─┐  ░  └───┘       ░   ║ └╥┘  ░  ┌────── ┌───┐ ───────┐ ┌────── ┌───┐ ───────┐   ░  ┌─┐
  B_1: ─────┤ X ├──░──────────────░───╫──╫───░──┤ If-0  ┤ X ├  End-0 ├─┤ If-0  ┤ Z ├  End-0 ├───░──┤M├
            └───┘  ░              ░   ║  ║   ░  └──╥─── └───┘ ───────┘ └──╥─── └───┘ ───────┘   ░  └╥┘
                                      ║  ║         ║                   ┌──╨──┐                      ║ 
c_0: 1/═══════════════════════════════╩══╬═════════╬═══════════════════╡ 0x1 ╞══════════════════════╬═
                                      0  ║      ┌──╨──┐                └─────┘                      ║ 
c_1: 1/══════════════════════════════════╩══════╡ 0x1 ╞═════════════════════════════════════════════╬═
                                         0      └─────┘                                             ║ 
c_2: 1/═════════════════════════════════════════════════════════════════════════════════════════════╩═
                                                                                                    0 
{'0 0 1': 525, '0 1 1': 506, '0 1 0': 479, '0 0 0': 538}
</code></pre></div></div>

<p>If we initialized $\ket{\psi}=\ket{1}$, we’d see that the last qubit is always measured to be 1!</p>

<p>Let’s analyze this circuit for the general case where Alice has some arbitrary qubit $\ket{\psi}=\alpha\ket{0} + \beta\ket{1}$ that she wants to transmit to Bob. We first start by creating a Bell state with the two leftmost qubits $\ket{B_{0,0}}=\frac{1}{\sqrt{2}}(\ket{00} + \ket{11})$.</p>

\[\begin{align*}
\ket{\psi_0} &amp;= \ket{B_{0,0}}\ket{\psi}\\
&amp;= \frac{1}{\sqrt{2}}\Big(\ket{00} + \ket{11}\Big)\Big(\alpha\ket{0} + \beta\ket{1}\Big)\\
&amp;= \frac{1}{\sqrt{2}}\Bigg[\Big(\ket{00} + \ket{11}\Big)\alpha\ket{0} + \Big(\ket{00} + \ket{11}\Big)\beta\ket{1}\Bigg]\\
\end{align*}\]

<p>Now let’s apply the CNOT first with the control qubit being the rightmost qubit and the target qubit being the middle qubit.</p>

\[\ket{\psi'_1} = \frac{1}{\sqrt{2}}\Bigg[\Big(\ket{00} + \ket{11}\Big)\alpha\ket{0} + \Big(\ket{01} + \ket{10}\Big)\beta\ket{1}\Bigg]\]

<p>The $\alpha$ terms aren’t affected since the control bit is 0 but the $\beta$ terms have their middle (or rightmost in their Bell state) qubit flipped. Now let’s apply the Hadamard to the rightmost qubit, i.e., Alice’s original quantum state.</p>

\[\begin{align*}
\ket{\psi_1} &amp;= \frac{1}{\sqrt{2}}\Bigg[\Big(\ket{00} + \ket{11}\Big)\alpha\frac{1}{\sqrt{2}}(\ket{0}+\ket{1}) + \Big(\ket{01} + \ket{10}\Big)\beta\frac{1}{\sqrt{2}}(\ket{0}-\ket{1})\Bigg]\\
&amp;= \frac{1}{2}\Bigg[\Big(\ket{00} + \ket{11}\Big)\alpha(\ket{0}+\ket{1}) + \Big(\ket{01} + \ket{10}\Big)\beta(\ket{0}-\ket{1})\Bigg]\\
\end{align*}\]

<p>In the last step, we pulled out the common factor of $\frac{1}{\sqrt{2}}$. Let’s pull $\alpha$ and $\beta$ out to the front of their respective terms and expand out the two products.</p>

\[\begin{align*}
&amp;= \frac{1}{2}\Bigg[\alpha\Big(\ket{00} + \ket{11}\Big)(\ket{0}+\ket{1}) + \beta\Big(\ket{01} + \ket{10}\Big)(\ket{0}-\ket{1})\Bigg]\\
&amp;= \frac{1}{2}\Bigg[\alpha\Big(\ket{00}\ket{0} + \ket{11}\ket{0} + \ket{00}\ket{1} + \ket{11}\ket{1}\Big) + \beta\Big(\ket{01}\ket{0} + \ket{10}\ket{0} - \ket{01}\ket{1} - \ket{10}\ket{1}\Big)\Bigg]\\
\end{align*}\]

<p>Now let’s regroup the qubits from $\ket{B_1 B_0}\ket{q}$ to $\ket{B_1}\ket{B_0 q}$ since Alice is going to measure and send over the rightmost two qubits. (This is mathematically legal since the tensor product is associative.) Bob will read the two rightmost values to figure out which gates to apply to his half of the Bell state.</p>

\[= \frac{1}{2}\Bigg[\alpha\Bigg(\ket{0}\ket{00} + \ket{1}\ket{10} + \ket{0}\ket{01} + \ket{1}\ket{11}\Bigg) + \beta\Bigg(\ket{0}\ket{10} + \ket{1}\ket{00} - \ket{0}\ket{11} - \ket{1}\ket{01}\Bigg)\Bigg]\]

<p>Now let’s regroup the terms where the rightmost two qubits are $\ket{00}$, $\ket{01}$, $\ket{10}$, and $\ket{11}$.</p>

\[\begin{align*}
= \frac{1}{2}\Bigg[
&amp;\phantom{+}\Big(\alpha\ket{0} + \beta\ket{1}\Big)\ket{00}\\
&amp;+ \Big(\alpha\ket{0} - \beta\ket{1}\Big)\ket{01}\\
&amp;+ \Big(\alpha\ket{1} + \beta\ket{0}\Big)\ket{10}\\
&amp;+ \Big(\alpha\ket{1} - \beta\ket{0}\Big)\ket{11}
\Bigg]
\end{align*}\]

<p>Now this is interesting! If Bob receives $\ket{00}$ from Alice, he can recover the state that Alice originally sent $\ket{\psi}=\alpha\ket{0} + \beta\ket{1}$! But when Bob receives $\ket{01}$ from Alice, he needs to apply a Z gate to his half of the Bell state so that his qubit can be mapped from $\alpha\ket{0} - \beta\ket{1}$ to the original state $\alpha\ket{0} + \beta\ket{1}$. This is thanks to the corollary of the unitary property of quantum gates: they must be invertible! Similarly, if Bob receives $\ket{10}$, he needs to apply an X gate, and, if he receives a $\ket{11}$, then he needs to apply both an X and Z gate. This is exactly what the circuit does!</p>
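<p>Bob’s correction table is easy to verify directly: for each of the four classical outcomes, applying the listed gates to his conditional state recovers $\alpha\ket{0}+\beta\ket{1}$. A sketch (the amplitudes are arbitrary; the gate on the right of each product is applied first, matching the circuit’s X-then-Z order):</p>

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I = np.eye(2, dtype=complex)

alpha, beta = 0.6, 0.8  # arbitrary normalized amplitudes
psi = np.array([alpha, beta], dtype=complex)

# Bob's conditional state for each classical outcome, and his correction
cases = {
    "00": (np.array([alpha, beta]), I),       # already |psi>
    "01": (np.array([alpha, -beta]), Z),      # Z undoes the phase flip
    "10": (np.array([beta, alpha]), X),       # X undoes the bit flip
    "11": (np.array([-beta, alpha]), Z @ X),  # X first, then Z, undoes both
}
for bits, (state, correction) in cases.items():
    assert np.allclose(correction @ state, psi), bits
print("all four outcomes recover |psi>")
```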

<p>This is really phenomenal! Alice can create an arbitrary quantum state, share an entangled qubit with a receiver, interact her arbitrary quantum state with half of the entangled pair so that it produces an effect on the other half of the entangled pair, measure her two qubits and transmit the classical results to Bob, and Bob can recreate the original arbitrary quantum state! Note that we’re <em>not</em> violating the No-cloning Theorem since Alice’s measurement destroys her original copy of $\ket{\psi}$: at the end, there’s still only one copy of the state, now on Bob’s side.</p>

<p>The most important thing to note about quantum teleportation is that it <em>does not allow for faster-than-light communication!</em> We still need to transmit classical bits which are limited by the speed of light/causality. The term “speed of light” isn’t quite complete since other things travel at the speed of light (namely a particle called a gluon or gravitational waves). A better term would be the <strong>speed of causality</strong>. No information of any kind can be transmitted faster than the speed of causality. But with this circuit, we can transmit an arbitrary quantum state over classical channels and perfectly recover it on the other side!</p>

<h1 id="superdense-coding">Superdense Coding</h1>

<p>The inverse of quantum teleportation is called <strong>superdense coding</strong> where we take some <em>classical</em> bits, encode them into a <em>quantum</em> state, and send it over a quantum communication channel to recover the classical bits. The neat part is that we only need to transmit one qubit for every 2 classical bits! That’s why it’s called <em>superdense</em> coding!</p>

<p>Suppose Alice has 2 classical bits $d,c$ that she wants to transmit in a single qubit to Bob. Like with quantum teleportation, we’ll start with both Alice and Bob sharing an entangled pair and then Alice will perform some operations on her qubit and send it to Bob. Bob now has both Alice’s qubit and his half of the entangled pair to interact to recover the original two bits that Alice encoded. Alice transmitted only one qubit! Let’s look at the circuit first.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">qc</span> <span class="o">=</span> <span class="n">QuantumCircuit</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>

<span class="c1"># Classical bits that Alice wants to encode
</span><span class="n">d</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span>

<span class="c1"># Prep Bell state
</span><span class="n">qc</span><span class="p">.</span><span class="n">h</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">cx</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">barrier</span><span class="p">()</span>

<span class="c1"># Alice performs some operations on her half of the entangled pair
</span><span class="k">if</span> <span class="n">c</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
    <span class="n">qc</span><span class="p">.</span><span class="n">z</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="k">if</span> <span class="n">d</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
    <span class="n">qc</span><span class="p">.</span><span class="n">x</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">barrier</span><span class="p">()</span>

<span class="c1"># Bob receives Alice's qubit
# Bob interacts it with his half of the entangled pair and measures both
</span><span class="n">qc</span><span class="p">.</span><span class="n">cx</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">h</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">qc</span><span class="p">.</span><span class="n">measure_all</span><span class="p">()</span>

<span class="k">print</span><span class="p">(</span><span class="n">qc</span><span class="p">)</span>
</code></pre></div></div>

<p>Note that Alice’s operations change the circuit based on which two bits she wants to encode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        ┌───┐      ψ_0  ψ_1      ┌───┐ ψ_2  ░ ┌─┐   
   q_0: ┤ H ├──■────░────░────■──┤ H ├──░───░─┤M├───
        └───┘┌─┴─┐  ░    ░  ┌─┴─┐└───┘  ░   ░ └╥┘┌─┐
   q_1: ─────┤ X ├──░────░──┤ X ├───────░───░──╫─┤M├
             └───┘  ░    ░  └───┘       ░   ░  ║ └╥┘
meas: 2/═══════════════════════════════════════╩══╩═
                                               0  1 
{'00': 2048}
</code></pre></div></div>

<p>When Bob measures both his half of the entangled state as well as Alice’s transmitted qubit, he gets the encoded classical bits with 100% accuracy!</p>
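<p>Before working through the algebra case by case, here’s a small statevector sketch (NumPy, not Qiskit) checking all four encodings at once; it reuses the $\text{CNOT}_{0,1}$ matrix and the $I\otimes H$ trick from the qubit-0-rightmost convention above:</p>

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I = np.eye(2, dtype=complex)

# CNOT with control qubit 0 (rightmost), basis |q1 q0>
CNOT01 = np.array([[1, 0, 0, 0],
                   [0, 0, 0, 1],
                   [0, 0, 1, 0],
                   [0, 1, 0, 0]], dtype=complex)

# shared Bell state (|00> + |11>)/sqrt(2)
bell = CNOT01 @ np.kron(I, H) @ np.array([1, 0, 0, 0], dtype=complex)

for d in (0, 1):
    for c in (0, 1):
        # Alice encodes (d, c) on her qubit (qubit 0): Z if c == 1, then X if d == 1
        alice = np.linalg.matrix_power(X, d) @ np.linalg.matrix_power(Z, c)
        # Bob decodes: CNOT then H on qubit 0
        decoded = np.kron(I, H) @ CNOT01 @ np.kron(I, alice) @ bell
        # the outcome is deterministic (up to a global phase)
        assert np.argmax(np.abs(decoded)) == 2 * d + c
print("all four bit pairs decode correctly")
```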

<p>Let’s analyze this circuit. First, we create a Bell state $\ket{B_{0,0}}$.</p>

\[\ket{\psi_0} = \ket{B_{0,0}} = \frac{1}{\sqrt{2}}\Big(\ket{00} + \ket{11}\Big)\]

<p>Now depending on the classical bits, we either apply an X gate, Z gate, or both. Sound familiar from quantum teleportation? Let’s list out the scenarios, starting with trying to encode $00$. In this case, we don’t do anything and let the state pass to the next set of gates.</p>

\[\ket{\psi^{00}_1} = \ket{\psi_0} = \frac{1}{\sqrt{2}}\Big(\ket{00} + \ket{11}\Big)\]

<p>After the CNOT, we get the following state.</p>

\[\begin{align*}
\ket{\psi'^{00}_2} &amp;= \frac{1}{\sqrt{2}}\Big(\ket{00} + \ket{01}\Big)\\
&amp;= \ket{0}\frac{1}{\sqrt{2}}\Big(\ket{0} + \ket{1}\Big)\\
\end{align*}\]

<p>Applying a Hadamard to the rightmost qubit undoes the superposition, leaving a definite state.</p>

\[\ket{\psi^{00}_2} = \ket{0}\ket{0} = \ket{00}\]

<p>So when Bob measures, he’ll always get the bits $00$! An even quicker way to see this for $0,0$ is that the quantum circuit is mirrored about $\ket{\psi_0}$ and quantum gates are invertible so this entire circuit is effectively a no-op if the input state is $\ket{00}$ so of course we get $\ket{00}$ at the very end.</p>

<p>Let’s try sending $01$. We start with the Bell state but then we apply a Z gate to the first qubit, which flips the sign of the second term.</p>

\[\ket{\psi^{01}_1} = \frac{1}{\sqrt{2}}\Big(\ket{00} - \ket{11}\Big)\]

<p>Then we follow the same steps of applying the CNOT, regrouping the states, and applying a Hadamard.</p>

\[\begin{align*}
\ket{\psi'^{01}_2} &amp;= \frac{1}{\sqrt{2}}\Big(\ket{00} - \ket{01}\Big)\\
&amp;= \ket{0}\frac{1}{\sqrt{2}}\Big(\ket{0} - \ket{1}\Big)\\
\ket{\psi^{01}_2} &amp;= \ket{0}\ket{1}=\ket{01}\\
\end{align*}\]

<p>For transmitting $10$, we need to apply an X gate to the first qubit of the Bell state, which flips the rightmost qubit for both terms.</p>

\[\ket{\psi^{10}_1} = \frac{1}{\sqrt{2}}\Big(\ket{01} + \ket{10}\Big)\]

<p>Following the same steps, we get the right answer.</p>

\[\begin{align*}
\ket{\psi'^{10}_2} &amp;= \frac{1}{\sqrt{2}}\Big(\ket{11} + \ket{10}\Big)\\
&amp;= \ket{1}\frac{1}{\sqrt{2}}\Big(\ket{1} + \ket{0}\Big)\\
\ket{\psi^{10}_2} &amp;= \ket{1}\ket{0}=\ket{10}\\
\end{align*}\]

<p>Finally for transmitting $11$, we need to apply an X gate then a Z gate which will first flip the rightmost qubit and then flip the sign on the left term.</p>

\[\ket{\psi^{11}_1} = \frac{1}{\sqrt{2}}\Big(\ket{10} - \ket{01}\Big)\]

<p>Following the same steps, we get the right answer.</p>

\[\begin{align*}
\ket{\psi'^{11}_2} &amp;= \frac{1}{\sqrt{2}}\Big(\ket{10} - \ket{11}\Big)\\
&amp;= \ket{1}\frac{1}{\sqrt{2}}\Big(\ket{0} - \ket{1}\Big)\\
\ket{\psi^{11}_2} &amp;= \ket{1}\ket{1}=\ket{11}\\
\end{align*}\]

<p>Now we know how to send classical information encoded into a qubit and decode it perfectly on the other end! Similar to quantum teleportation, this relies on Alice and Bob originally sharing an entangled pair before moving away from each other and Alice needs to reliably send her qubit to Bob somehow (also not faster than the speed of causality). But with this circuit, we can now also transmit arbitrary classical bits over quantum channels and perfectly decode the classical bits on the other side!</p>
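<p>As a sanity check on the four cases we just worked through, here’s a small NumPy sketch (hand-rolled state vectors rather than the Qiskit circuit above; the helper names <code>on_q0</code> and <code>superdense</code> are mine) that encodes and decodes every 2-bit message using the same little-endian qubit ordering:</p>

```python
import numpy as np

# Single-qubit gates; q0 is the least-significant (rightmost) qubit,
# matching Qiskit's little-endian convention: state index = 2*q1 + q0
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])
I = np.eye(2)
# CNOT with control q0, target q1: swaps |01> and |11>
CX = np.array([[1, 0, 0, 0],
               [0, 0, 0, 1],
               [0, 0, 1, 0],
               [0, 1, 0, 0]])

def on_q0(gate):
    # a gate acting on q0 sits on the right side of the Kronecker product
    return np.kron(I, gate)

def superdense(bits):
    """Run the superdense coding circuit for a 2-bit string like '10'."""
    psi = np.array([1, 0, 0, 0], dtype=complex)  # |00>
    psi = CX @ on_q0(H) @ psi                    # shared Bell state |B_00>
    if bits[0] == '1':                           # Alice encodes with X...
        psi = on_q0(X) @ psi
    if bits[1] == '1':                           # ...and/or Z
        psi = on_q0(Z) @ psi
    psi = on_q0(H) @ CX @ psi                    # Bob decodes: CNOT then H
    return int(np.argmax(np.abs(psi) ** 2))      # deterministic measurement

for bits in ['00', '01', '10', '11']:
    assert superdense(bits) == int(bits, 2)      # decoded with 100% accuracy
```

<p>Each intermediate state in this simulation matches the $\ket{\psi_1}$ and $\ket{\psi_2}$ expressions derived above.</p>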

<h1 id="physical-representation-of-qubits">Physical Representation of Qubits</h1>

<p>So far, we’ve only been talking about qubits as mathematical entities, but I wanted to take a very brief aside to talk about how they’re actually physically realized. I’ve already mentioned the spinning hard disks used for classical bits some decades ago and the more modern solid-state disks (SSDs) consisting of arrays of transistors that can be electronically controlled.</p>

<p>The qubits we’ve seen so far are called <strong>logical qubits</strong> since we can regard them as being “perfect” for our computational needs. However, <strong>physical qubits</strong> are the actual hardware representation of these qubits, much like a transistor is the physical representation of a classical bit. The two most prevalent kinds of qubits are <strong>superconducting qubits</strong> and <strong>trapped-ion qubits</strong>.</p>

<p><strong>Superconducting quantum computers</strong>, used by IBM, Google, and Intel, utilize superconducting circuits (such as transmons built from Josephson junctions) cooled to almost absolute zero where the logical $\ket{0}$ and $\ket{1}$ states are physically represented as the ground state $\ket{g}$ and excited state $\ket{e}$ of the circuit. Superconducting qubits are manipulated by electrical signals on the order of nanoseconds, and they can scale up nicely since we can fabricate and connect arrays and lattices of these circuits on the same chip. The primary challenges with superconducting qubits are that (i) we need to cool them to almost absolute zero in a dilution refrigerator to get superconductivity and (ii) they are very susceptible to <strong>decoherence</strong> where the initialized qubits decay to random states after a short period of time due to noise in their environment.</p>

<p><strong>Trapped-ion quantum computers</strong>, used by IonQ, realize qubits as ions held in free space by electromagnetic fields where the logical states map to electronic states of the ions. Many different kinds of ions can be used (IonQ uses ytterbium). These ions are manipulated using lasers of various frequencies to modify their electronic states corresponding to the quantum gates of the given quantum circuit. Trapped-ion qubits tend to have longer decoherence times since they’re represented as stable ions from nature, and they can move among the lattice of traps to interact with any other arbitrary qubit; they’re not just limited by the pre-fabricated lattice structure of the chip. On the other hand, gate operations tend to be slower than with superconducting qubits, and it’s more challenging to scale up trapped-ion quantum computers since trapping and precisely addressing ever more ions is harder than fabricating a larger fixed lattice of superconducting qubits on a chip.</p>

<p>Neither physical realization of qubits is better than the other: they just have different trade-offs between them. Perhaps people will invent another physical realization of qubits in the future!</p>

<h1 id="conclusion">Conclusion</h1>

<p>Quantum Computing is starting to be set up as “the next big thing”! Many top companies like Google, Intel, Microsoft, and IBM have started to construct their own quantum computers and showcase their progress on applying their quantum computers to solve practical business problems. We started our quantum journey with defining the qubit and comparing it to classical bits, and then discussed the various kinds of single-qubit gates. We quickly moved on to multi-qubit systems and their operations. Finally, we applied our knowledge to quantum teleportation to transmit a quantum state using classical bits and superdense coding to transmit classical bits using a quantum state.</p>

<p>This is just the start in our journey towards better understanding quantum algorithms and how we can move towards using quantum computing to solve practical problems in the real world today 🙂</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Quantum Computing is a growing field of study that marries quantum physics with computer science which is already showing some promising results in speeding up specific kinds of computing problems. In this post, we'll begin our sojourn into the quantum realm!]]></summary></entry><entry><title type="html">Language Modeling - Part 4: Transformers</title><link href="/transformers.html" rel="alternate" type="text/html" title="Language Modeling - Part 4: Transformers" /><published>2024-08-07T00:00:00+00:00</published><updated>2024-08-07T00:00:00+00:00</updated><id>/transformers</id><content type="html" xml:base="/transformers.html"><![CDATA[<p>In the previous post, we discussed one of the first deep learning models built specifically for language modeling: recurrent neural networks, both with plain cells and with long short-term memory cells. For a period of time, they were the state-of-the-art for language modeling as well as cross-domain tasks like image captioning. In retrospect, they did a fairly decent job at these tasks, even though they had issues with generating longer texts. We pushed these models to the limit by adding more and more parameters, making them bidirectional, and stacking them until we hit dataset and computational limits. However, the beauty of research is that, every few years, a novel approach that’s a drastic departure from all previous work revolutionizes some task or subfield. The approach that did this for language modeling (and later other tangential fields) is the Transformer. This is the most widely-used neural language model underpinning large language models (LLMs) like OpenAI’s ChatGPT, Anthropic’s Claude, Meta’s Llama, and many others!</p>

<p>In this post, I’ll finally get to discussing the state-of-the-art neural language model: the transformer! First we’ll start by analyzing the issues with RNNs. Then we’ll introduce the transformer and deep dive into their constituent parts: position encoding, multihead self-attention, layer normalization, and position-wise feedforward networks. Finally, we’ll use an off-the-shelf pre-trained GPT2 model for language modeling!</p>

<p>For the direct application of transformers to language modeling, we’ll specifically be discussing decoder-only transformers, which comprise just the decoder from the original Transformers paper. The original work called the full encoder-decoder architecture a “Transformer” since it was first applied to sequence-to-sequence machine translation, but modern LLMs use only the decoder part of the architecture. In reality, both use 95% of the novelty of the original Transformers paper (except for cross-attention between the input language and output language) but it’s an unfortunate historical point.</p>

<h1 id="transformers">Transformers</h1>

<p>RNNs and the LSTM cells we discussed last time have a few issues that make them difficult to train. The largest issue with general RNNs and their recurrence relation is the <strong>bottleneck problem</strong>: we have a single, fixed-size hidden/cell state vector that represents the accumulation of <em>everything</em> the RNN has seen up to the current timestep, no matter how long the input sequence is. Furthermore, traditional RNNs require sequential processing: to compute an output at timestep $t$, we need to compute the hidden state $h_t$ which is a function of all previous hidden states and inputs. If we have a long sequence, this becomes expensive to do and limits us from training on larger corpora. As we showed last time, even with LSTM cells, RNNs still suffer from the vanishing gradient problem, albeit not as severe as in vanilla RNN cells; this makes it difficult to capture long-term semantic relationships.</p>

<p><img src="/images/transformers/bottleneck-problem.svg" alt="Bottleneck Problem" title="Bottleneck Problem" /></p>

<p><small>RNNs support arbitrary-length sequences but compress the prior history into a finite-size hidden/cell state as we progress through the input sequence. For longer sequences, we’re trying to compress a lot of information in that hidden/cell state for the next timestep to operate on. To add some numbers to this, suppose we have an input embedding size of $256$ and a hidden/cell state of size $512$. For a sequence of just $64$ tokens, the single $512$-dimensional hidden/cell state is expected to retain $64\cdot 256=16,384$ values of information, a compression factor of $\frac{16,384}{512}=32$ by the end of the sequence. (Not all of these data are important so this is a worst-case analysis.) In practice, we deal with much longer sequences so it becomes progressively more and more difficult to retain important information in that small hidden/cell state, which acts as a “bottleneck” for information propagation.</small></p>

<p>The <strong>Transformer</strong> architecture (Vaswani et al. 2017) was created to help remedy these issues by fundamentally changing how we process sequential text data. Since this new architecture is such a radical departure from previous neural language model architectures, let’s define it up-front and later discuss and motivate the different parts.</p>

<p><img src="/images/transformers/transformers.svg" alt="Transformers" title="Transformers" /></p>

<p><small>The Transformer architecture from Vaswani et al. 2017 featured some interesting components: (i) a positional encoding as an efficient way to understand the relative position of tokens in the input sequence; (ii) a multihead self-attention mechanism to help retain long-term semantic relationships between tokens; (iii) layer normalization to help with regularization; (iv) a point-wise feedforward neural network to add more parameters and non-linearity; and (v) residual connections to help propagate the unedited gradient backwards to all layers at all timesteps.</small></p>

<p>We’ll be diving into the details of this neural network architecture but I want to provide a short high-level description of the different pieces and their purposes. The transformer architecture consumes a fixed-length sequence of a particular size all at once (as opposed to purely sequentially like RNNs). The first step is to embed each token of the input sequence. Then we add the positional encoding to the embedding to help the model reason about the positions of tokens and their relative relations; this is a sort of substitute for the recurrence relation. Then we apply a multihead self-attention mechanism to allow the model to dynamically focus on certain parts of the entire previous input (to help with the bottleneck problem!). We have some residual connections to help propagate the unedited gradient and some regularization to prevent overfitting. Finally, we have position-wise feedforward neural networks to process each token in the sequence in the same way and to help increase the number of parameters and the non-linearity of the model.</p>

<p>Compare this to the RNN-based architecture to see how radically different it is! It’s a bit difficult to motivate directly but, similar to the motivation for LSTM cells, we can at least assess if we’re addressing the aforementioned issues with RNNs.</p>

<p>The first issue with RNN-based architectures was that we had to process the input data sequentially. With the Transformer, we process the input sequence (albeit batched) all at once. This makes it much easier to chunk, parallelize, and optimize the forward and backwards passes; in fact, when we discuss the multihead attention module, we’ll see how we can optimize the operation across the entire input sequence into a single large matrix multiplication (which GPUs love!). The other issue was maintaining long-term dependencies: we’ll see later how the multihead self-attention module helps the model learn these associations by giving the model an opportunity to “attend” to all previous timesteps rather than use a condensed hidden/cell state.</p>

<p>With that, let’s dive into each individual module in more detail.</p>

<h2 id="positional-encoding">Positional Encoding</h2>

<p>One of the major issues that makes RNNs difficult to train efficiently is that we have to process the input sequentially. Transformers do away with this by processing the input data in batches where the model sees the entire batch of input data at the same time. However, we lose information about the ordering of the input tokens so we need a way to bring that back.</p>

<p>There are two pieces to the puzzle we’ll have to solve: how to compute the embedding $\text{PE}$ and how to fold it into the input $x_t$. For the latter, we primarily have a few options: (i) add, (ii) element-wise multiply, and (iii) concatenate. Which one we choose depends on how we compute the embedding too but let’s start very simply with addition:</p>

\[y_t = x_t + \text{PE}_t\]

<p>To start, the embedding $\text{PE}$ can just be a vector with the absolute position.</p>

\[y_t = x_t + t\cdot\mathbb{1}\]

<p>where $\mathbb{1}$ is a vector of just $1$s so we effectively just create a vector of natural numbers like $\begin{bmatrix}1\cdots 1\end{bmatrix}^T$, $\begin{bmatrix}2\cdots 2\end{bmatrix}^T$, and so on for each position $t$ where the size of the vector is the same as the size of the encoding. This absolute linear positional encoding is the simplest thing to do but has significant drawbacks. First of all, absolute positions aren’t agnostic of the sequence size: longer sequences will have larger positional encodings which creates an asymmetry for shorter sequences. It would be better to have an encoding that is sequence-size-agnostic: different lengths of sequences are treated fairly. Furthermore, we use the same value across all dimensions: each $x_t$ has the dimensionality of the embedding and using $t\cdot\mathbb{1}$ means that each value in that dimension has the same value which doesn’t really provide that much distinguishing information to the model.</p>
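<p>To make that drawback concrete, here’s a tiny sketch (the dimension <code>d</code> and helper <code>naive_pe</code> are hypothetical, just for illustration):</p>

```python
import numpy as np

d = 4  # toy embedding dimension

def naive_pe(t):
    # absolute linear encoding: every dimension carries the same value t
    return t * np.ones(d)

# the encoding grows without bound with the position in the sequence...
assert naive_pe(1000).max() == 1000
# ...and every dimension is identical, so it adds no distinguishing info
assert len(set(naive_pe(7))) == 1
```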

<p>Positional encodings are an open area of research but let’s see what the original Transformer paper does. They had a novel idea about using <em>alternating sinusoids</em> as the values of the positional encoding. Let’s first see what their proposal is and then analyze it.</p>

\[\begin{align*}
\text{PE}_{(j, 2k)} &amp;= \sin\frac{j}{10000^{\frac{2k}{d}}}\\
\text{PE}_{(j, 2k+1)} &amp;= \cos\frac{j}{10000^{\frac{2k}{d}}}\\
\end{align*}\]

<p>The positional encoding $\text{PE}$ can be considered as a matrix where the row is the position $j$ in the sequence, and the column indexes the dimensions of the encoding. $k$ doesn’t represent the dimension directly: it’s just a counter so we can alternate sines and cosines. $10000$ is an arbitrary number that just needs to be significantly larger than $d$, which is the dimensionality of the encoding.</p>

<p><img src="/images/transformers/pe-matrix.png" alt="Positional Encoding Matrix" title="Positional Encoding Matrix" /></p>

<p><small>The positional encoding matrix shows how the different sinusoids blend together. In practice, we construct this for the largest foreseeable sequence size and then only apply it up to the size of the sequences encountered during training. Since this is a deterministic/non-learned matrix, we can easily adapt it for larger sequence sizes if we happen to encounter one during model evaluation.</small></p>

<p>Let’s see this in action concretely with a sequence length of 4 and dimensionality of 6. The embedding can be represented by a $4\times 6$ matrix.</p>

\[\begin{bmatrix}
\sin\frac{0}{10000^{\frac{0}{d}}} &amp; \cos\frac{0}{10000^{\frac{0}{d}}} &amp; \sin\frac{0}{10000^{\frac{2}{d}}} &amp; \cos\frac{0}{10000^{\frac{2}{d}}} &amp; \sin\frac{0}{10000^{\frac{4}{d}}} &amp; \cos\frac{0}{10000^{\frac{4}{d}}}\\
\sin\frac{1}{10000^{\frac{0}{d}}} &amp; \cos\frac{1}{10000^{\frac{0}{d}}} &amp; \sin\frac{1}{10000^{\frac{2}{d}}} &amp; \cos\frac{1}{10000^{\frac{2}{d}}} &amp; \sin\frac{1}{10000^{\frac{4}{d}}} &amp; \cos\frac{1}{10000^{\frac{4}{d}}}\\
\sin\frac{2}{10000^{\frac{0}{d}}} &amp; \cos\frac{2}{10000^{\frac{0}{d}}} &amp; \sin\frac{2}{10000^{\frac{2}{d}}} &amp; \cos\frac{2}{10000^{\frac{2}{d}}} &amp; \sin\frac{2}{10000^{\frac{4}{d}}} &amp; \cos\frac{2}{10000^{\frac{4}{d}}}\\
\sin\frac{3}{10000^{\frac{0}{d}}} &amp; \cos\frac{3}{10000^{\frac{0}{d}}} &amp; \sin\frac{3}{10000^{\frac{2}{d}}} &amp; \cos\frac{3}{10000^{\frac{2}{d}}} &amp; \sin\frac{3}{10000^{\frac{4}{d}}} &amp; \cos\frac{3}{10000^{\frac{4}{d}}}\\
\end{bmatrix}\]

<p>This seems like a fairly complicated formulation but there are several nice properties behind this. First of all, using bounded functions like sines and cosines means that this embedding is agnostic of the sequence length since it doesn’t monotonically grow (or shrink) with the length of the sequence. In the above example, we used a sequence length of 4 but, in practice, we set this to be the maximum desired sequence length and just take a slice of it for whatever sequence length we get as input. Another very nice property for the gradient is that these are mathematically smooth (continuous and infinitely differentiable) so we don’t have to worry about sparse or constant gradients.</p>
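<p>We can reproduce that $4\times 6$ example directly from the formula; this is a minimal NumPy sketch, not the paper’s reference implementation:</p>

```python
import numpy as np

seq_len, d = 4, 6  # the toy sizes from the example above

pe = np.zeros((seq_len, d))
for j in range(seq_len):          # position in the sequence (row)
    for k in range(d // 2):       # counter for each sine/cosine pair
        angle = j / 10000 ** (2 * k / d)
        pe[j, 2 * k] = np.sin(angle)      # even columns: sine
        pe[j, 2 * k + 1] = np.cos(angle)  # odd columns: cosine

# position 0 alternates sin(0)=0 and cos(0)=1
assert np.allclose(pe[0], [0, 1, 0, 1, 0, 1])
# bounded regardless of position, unlike the absolute encoding
assert np.all(np.abs(pe) <= 1)
```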

<p>One other important property is that using periodic functions like sines and cosines help us with learning <em>relative positions</em> at different scales. To explain this better, consider the most generic form of sine function:</p>

\[f(x) = A\sin(T(x+\phi)) + b\]

<p>where</p>
<ul>
  <li>$A$ is the amplitude</li>
  <li>$\frac{2\pi}{T}$ is the period</li>
  <li>$\phi$ is the phase/horizontal shift ($\phi &gt; 0$ means the plot shifts right; otherwise it shifts left)</li>
  <li>$b$ is the vertical shift ($b &gt; 0$ means the plot shifts up; otherwise it shifts down)</li>
</ul>

<p>Another useful trigonometric property is that $\sin(\theta)=\cos(\frac{\pi}{2}-\theta)$ and $\cos(\theta)=\sin(\frac{\pi}{2}-\theta)$ so we can think of sine and cosine as just being phase shifts of each other by $\frac{\pi}{2}$. Going back to the sinusoidal positional encoding, let’s just consider the even terms (knowing that the odd terms are just offset by a phase shift).</p>

\[\text{PE}_{(j, 2k)} = \sin\frac{j}{10000^{\frac{2k}{d}}}\\\]

<p>For this term, $A=1$ and $b=0$ so the amplitude is $1$ and there’s no vertical shift but what about the period and phase shift? Well, even if we expand the $\sin$ argument, there’s no additive term, only the factor on $j$. Let’s start by holding $k$ constant and varying only $j$. This simplifies the equation into $\sin\frac{j}{c}$ where $c$ is constant; the period of this function is $2\pi c$ which we can see if we rewrite it as $\sin\frac{1}{c}j$. As we progress through the sequence, the value of $j$ increases along the sinusoid.</p>

<p><img src="/images/transformers/pe-encoding.png" alt="Positional Encoding Plots" title="Positional Encoding Plots" /></p>

<p><small>To get a better idea of the shape of the positional encoding, we can plot the positional encoding with the position in the continuous domain and the dimension in the discrete domain. As the dimension increases in pairs, the frequency of the sinusoid across the position decreases which gives the model many different ways to correlate the relative positions of different words in the input sequence. (These plots are generated from <a href="https://gist.github.com/mohitd/b7b08462d4a41568a22c343c855648b9">here</a>.)</small></p>

<p>Now let’s try holding $j$ constant and varying $k$; in other words, for a particular timestep $j$, how do the values of the sinusoidal positional encoding change as the dimensionality of the embedding increases? This one is a bit trickier since $k$ is in an exponent in the denominator but we can reason about that term $10000^{\frac{2k}{d}}$. As we increase $k$, $10000^{\frac{2k}{d}}$, i.e., the denominator, increases which means the $\sin$ argument decreases. Specifically, the overall period $2\pi c$ increases as $k$ increases: in other words, as we move up the dimensions of the embedding, the period of the sinusoids increases (the frequency decreases) at a fixed timestep.</p>

<p>The periodicity of this encoding means that there’ll be tokens in the input that end up with the same positional encoding value at different intervals. Intuitively, this means that our model can learn <em>relative positions</em> of input tokens because of the repeating pattern of sinusoids. Since the period changes with the dimension, input tokens get multiple relative-position associations for different intervals or scales.</p>

<p>In concrete implementations of the positional encoding, rather than seeing that exact formula above, we tend to see this formulation:</p>

\[\text{PE}_{(j, 2k)} = \sin \Bigg[j\exp\Bigg(\frac{-2k}{d}\log{10000}\Bigg)\Bigg]\]

<p>This modified formulation is more numerically stable since we’re taking the log of a large number instead of raising a large number to a large power (and potentially overflowing). To get from the original formulation to the current one, take the exponential of the log of the quantity $10000^{-\frac{2k}{d}}$ (which is legal since $e^{\log x}=x$, much like cleverly adding and subtracting $1$) and simplify until it looks like the exponential term in the above equation.</p>
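<p>A vectorized NumPy sketch of this exp-log formulation might look like the following (the function name <code>positional_encoding</code> is my own; many implementations follow this shape):</p>

```python
import numpy as np

def positional_encoding(max_len, d):
    # 10000**(-2k/d) rewritten as exp(-(2k/d) * log(10000)) for stability
    pos = np.arange(max_len)[:, None]                        # (max_len, 1)
    div = np.exp(-np.arange(0, d, 2) / d * np.log(10000.0))  # (d/2,)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(pos * div)   # even dimensions
    pe[:, 1::2] = np.cos(pos * div)   # odd dimensions
    return pe

pe = positional_encoding(128, 64)
# matches computing the power directly: column 10 is the k=5 sine
assert np.isclose(pe[3, 10], np.sin(3 / 10000 ** (2 * 5 / 64)))
```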

<p>Interestingly, there have been a few recent works that show the positional encoding is optional and that transformers without such an encoding can perform as well as those with it. Positional encodings (or the potential lack thereof) are still a very active area of research!</p>

<h2 id="multihead-self-attention">Multihead Self-Attention</h2>

<p>The next module in the transformers architecture is multihead self-attention. This is a different flavor of an <strong>attention mechanism</strong>. To motivate it, consider an RNN language model with a long context window: as we move along the sequence, the only information we pass forward to the next timestep is the hidden/cell state. If we have a long sequence, this hidden/cell state has the huge responsibility of retaining all of the “important” information from all previous timesteps to provide to the current timestep. With an LSTM cell, the model can do a much better job of determining and propagating forward the “important” information but, for longer sequences, we’re still bound by the dimensionality of the cell state; we can try increasing the size of the cell state but that adds computational complexity.</p>

<p>The novelty behind the attention mechanism is to reject the premise that the current timestep can only see this condensed hidden/cell state from the previous timestep. Instead, the attention mechanism gives the current timestep access to <em>all</em> previous timesteps instead of just the previous hidden/cell state!</p>

<p>The trick becomes <em>how</em> to integrate all previous timesteps into the current one, especially since we have a variable number of previous timesteps as we progress through the sequence. The novel contribution of <strong>self-attention</strong> is to take an input sequence and, for each timestep, numerically compute how much we should consider each previous timestep into the current one.</p>

<p><img src="/images/transformers/scaled-dot-product-attention.svg" alt="Self-Attention" title="Self-Attention" /></p>

<p><small>For each timestep, we learn a key, query, and value. Then we compute how much a query aligns with each key. This alignment is passed through a softmax layer to normalize the raw values into attention scores. Then we can multiply these against the values to figure out how much we should weigh the values when computing the current state.</small></p>

<p>Suppose the input $X$ is a matrix of $n\times d$ where $n$ is the sequence length and $d$ is the dimensionality of the embedding space. Using fully-connected layers (omitting biases for brevity), we project $X$ into three different spaces: query, key, and value:</p>

\[\begin{align*}
Q &amp;= W_Q X\\
K &amp;= W_K X\\
V &amp;= W_V X\\
\end{align*}\]

<p>We’re applying these three projections to all vectors in the sequence simultaneously as a vectorized operation. The keys and queries have the same dimension of $d_k$ while the values have a dimension of $d_v$. Now we take the dot product of each query with each key, scale it by $\frac{1}{\sqrt{d_k}}$, and run it through a softmax to get attention scores. Intuitively, these scores tell us, for a particular timestep, how much we should consider the other timesteps. We use these attention scores by multiplying them with the learned values to get the final result of the self-attention mechanism.</p>

\[\text{Attention(Q, K, V)} = \text{softmax}\Bigg(\frac{QK^T}{\sqrt{d_k}}\Bigg)V\]

<p>Let’s double-check the dimensions. $Q$ is $n\times d_k$ (a query vector for each input token) and $K$ is $n\times d_k$ (a key vector for each input token) so $QK^T$ is $n\times n$, and, after scaling by $\frac{1}{\sqrt{d_k}}$ and softmaxing across each row, we get the attention scores that measure, for each timestep, how much we should focus on another term in the same sequence. This particular flavor is sometimes called <strong>scaled dot-product attention</strong>. We multiply these by the learned values of size $n\times d_v$ so the output will be $n\times d_v$ as well.</p>

<p>The reason for the $\sqrt{d_k}$ is that the authors of the original Transformers paper mentioned that the dot products will get larger as the dimensionality of the space gets larger since we’re adding up element-wise products and we’ll have more terms in the sum with a larger embedding key-query dimension. (Just like with absolute positional encodings!)</p>

<p>To summarize, given a particular input, we map it to a key, query, and value using a fully-connected layer. Then we take the dot product of the query with each key; an intuitive way to interpret the dot product is measuring the “alignment” of two vectors. Then we can run that result through a softmax to get a probability distribution over all previous keys to multiply by the learned values to get the result of the attention module.</p>
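<p>These steps fit in a few lines; here’s an illustrative NumPy sketch of scaled dot-product attention (I put the projection weights on the right as <code>X @ W</code>, and all of the names are mine):</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Q, K: (n, d_k); V: (n, d_v) -> output: (n, d_v)
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n) attention weights
    return scores @ V

rng = np.random.default_rng(0)
n, d, d_k, d_v = 5, 8, 4, 6
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=s) for s in [(d, d_k), (d, d_k), (d, d_v)])
out = attention(X @ W_Q, X @ W_K, X @ W_V)
assert out.shape == (n, d_v)  # same sequence length as the input
```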

<p>Instead of using a single key, query, value set, we can use multiple different ones so that the model can learn to attend to different kinds of characteristics in the input sequence. The idea is that we can copy-and-paste the same self-attention into several different <em>attention heads</em> using another projection, concatenate the results of all of the attention heads, then finally run the concatenation through a final fully-connected layer. This is called <strong>multi-head attention</strong>.</p>

<p><img src="/images/transformers/multihead-attention.svg" alt="Multi-head Self-Attention" title="Multi-head Self-Attention" /></p>

<p><small>For multi-head self-attention we take the same self-attention mechanism with the same input but use a different set of key, query, and value weights for each head.</small></p>

<p>Mathematically, for each head, we can take a query $Q$, key $K$, and value $V$, project them again (again ignoring biases for brevity), and compute attention.</p>

\[\text{head}^{(i)} = \text{Attention}(W_Q^{(i)}Q, W_K^{(i)}K, W_V^{(i)}V)\]

<p>Then we can concatenate all of the heads together and project again to get the result of multihead self-attention.</p>

\[\text{Multihead}(Q, K, V) = \Big(\text{head}^{(1)} \oplus \cdots \oplus \text{head}^{(h)}\Big)W_O\]

<p>where $h$ represents the number of heads. We usually set $d_k=d_v=\frac{d_m}{h}$ so that we can cleanly split and rejoin the different heads without having to worry about fractional dimensions.</p>
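<p>An explicit per-head loop makes the split-and-concatenate structure clear; this is an illustrative NumPy sketch (real implementations fuse the heads into one big matrix multiplication):</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multihead(X, head_weights, W_O):
    # head_weights: per-head (W_Q, W_K, W_V), each (d_m, d_m // h)
    heads = []
    for W_Q, W_K, W_V in head_weights:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        d_k = Q.shape[-1]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    # concatenate all heads, then apply the final output projection
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d_m, h = 5, 16, 4  # d_k = d_v = d_m / h = 4
head_weights = [tuple(rng.normal(size=(d_m, d_m // h)) for _ in range(3))
                for _ in range(h)]
W_O = rng.normal(size=(d_m, d_m))
out = multihead(rng.normal(size=(n, d_m)), head_weights, W_O)
assert out.shape == (n, d_m)
```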

<p>Now we finally have the full multi-head self-attention in the paper!</p>

<p>One important aspect when <em>training</em> this module is the <strong>causal mask</strong> $M$: the module sees the entire sequence at once and computes attention scores across the whole sequence. However, this isn’t entirely accurate since, at a timestep $t$, the model should only have seen the tokens at timesteps $(1,\cdots,t)$, not the entire sequence. So in the attention score matrix, we need to <em>mask out</em> all future timesteps using the causal mask. Since a sequence is already ordered, the mask is a strictly upper-triangular matrix with $-\infty$ above the diagonal and $0$ everywhere else.</p>

\[M=\begin{bmatrix}
0 &amp; -\infty &amp; -\infty &amp; \cdots &amp; -\infty\\
0 &amp; 0 &amp; -\infty &amp; \cdots &amp; -\infty\\
0 &amp; 0 &amp; 0 &amp; \cdots &amp; -\infty\\
\vdots &amp; \vdots &amp; \vdots &amp; \ddots &amp; \vdots\\
0 &amp; 0 &amp; 0 &amp; \cdots &amp; 0\\
\end{bmatrix}\]

<p>When we add this mask to the attention score matrix before the softmax, the $-\infty$ entries push the probabilities of future tokens to $0$ at each timestep so the model doesn’t cheat by “seeing into the future”.</p>
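<p>The mask is applied additively to the raw scores before the softmax; here’s a small NumPy sketch (using $-\infty$ above the diagonal so the masked scores become exactly $0$ after the softmax):</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
# additive causal mask: -inf strictly above the diagonal, 0 elsewhere
M = np.triu(np.full((n, n), -np.inf), k=1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # toy raw attention scores
weights = softmax(scores + M)

# each position attends only to itself and earlier positions...
assert np.allclose(np.triu(weights, k=1), 0)
# ...and each row is still a valid probability distribution
assert np.allclose(weights.sum(axis=1), 1)
```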

<h2 id="layer-normalization">Layer Normalization</h2>

<p>The next step in the transformer architecture is a module called Layer Normalization. In the context of neural networks, <strong>normalization</strong> is the act of perturbing the inputs to a layer with the intent to help the model generalize better and learn faster. As we train a deep neural network, the intermediate activations go through layer after layer, and each layer can have drastically different weights; if we think about the activations as a distribution, they go through many different distributional changes that the model has to exert effort in learning. This problem is called the <strong>internal covariate shift</strong>. Wouldn’t it be easier if we naively standardized the activations before each layer? That’s exactly what normalization does! This means the model can handle different kinds of input “distributions” and doesn’t have to waste effort in learning each of the distributional shifts across layers. There are a number of other reasons why we use normalization in neural networks but this is one of the most important reasons.</p>

<p>There are a few different kinds of normalization but the one that’s used by the transformers paper is <strong>Layer Normalization</strong>. The idea is to normalize the values <em>across the feature dimension</em>. So for each example in a batch of inputs, we take the mean and variance of the features per example to get 2 scalars and then standardize each component of the input by that mean and variance, i.e., we’re shifting the “location” of the input distribution. Additionally, we also learn a scale and bias factor per feature to alter the shape of the distribution.</p>

<p><img src="/images/transformers/layer-norm.svg" alt="Layer Normalization" title="Layer Normalization" /></p>

<p><small>Given a sequence of training examples, per example, we compute a mean and variance and offset the values for that particular example. Another normalization technique popular for non-sequential data is <strong>batch normalization</strong> where we do effectively the same thing, but across a particular feature dimension in the batch instead of across the training example itself.</small></p>

<p>Suppose we have an example $x$ in the batch, a $d$-dimensional vector whose $j$th component is $x_j$. First, we compute the mean $\mu$ and variance $\sigma^2$ of $x$ over its features.</p>

\[\begin{align*}
\mu &amp;= \frac{1}{d}\sum_{j} x_{j}\\
\sigma^2 &amp;= \frac{1}{d}\sum_j (x_{j} - \mu)^2
\end{align*}\]

<p>Now, for each example, we standardize it using its mean and standard deviation.</p>

\[\hat{x_j} = \frac{x_j-\mu}{\sqrt{\sigma^2 + \epsilon}}\]

<p>where $\epsilon$ is a small value for numerical stability to avoid divide by zero issues. Finally, we apply learnable scale $\gamma_j$ and shift $\beta_j$ parameters for each feature/component.</p>

\[y_j = \gamma_j \hat{x_j} + \beta_j\]

<p>Layer normalization keeps the activations well-scaled so training is faster and more stable. We apply these layers after every major component, specifically the multihead self-attention module and the position-wise feedforward network that we’ll discuss shortly!</p>
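<p>The equations above translate directly into a few lines of NumPy. This is a from-scratch sketch (the function name and shapes are illustrative, not a framework API):</p>

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example over its feature dimension, then scale and shift.

    x: (batch, d) activations; gamma, beta: (d,) learnable parameters.
    """
    mu = x.mean(axis=-1, keepdims=True)      # per-example mean
    var = x.var(axis=-1, keepdims=True)      # per-example variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # standardize each component
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(2, 8) * 5 + 3
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# with gamma=1 and beta=0, each row of y has ~zero mean and ~unit variance
```

Note that the statistics are computed per example (over `axis=-1`), not per feature over the batch; that axis choice is exactly what distinguishes layer normalization from batch normalization.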

<h2 id="position-wise-feedforward-network">Position-wise Feedforward Network</h2>

<p>To help add more parameters and nonlinearity to increase the expressive power of the transformer, we add a position-wise feedforward network. It’s a little two-layer neural network that sends a vector at a single position in the sequence to a latent space and then back to the same dimension using a ReLU nonlinearity in the middle.</p>

\[\text{FFN}(x) = W_2\cdot\text{ReLU}(W_1x + b_1) + b_2\]

<p>We apply this at each timestep (or position) in the sequence, hence the “position-wise” part, and it’s the same network operating on every position independently, so each position is transformed consistently.</p>
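<p>Concretely, the whole sublayer is two matrix multiplies with a ReLU in between, applied row-by-row to the sequence. A NumPy sketch (dimensions and names are illustrative):</p>

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward: the same weights applied at every position.

    x: (seq_len, d_model); W1: (d_ff, d_model); W2: (d_model, d_ff).
    """
    hidden = np.maximum(0.0, x @ W1.T + b1)  # ReLU into the larger latent space
    return hidden @ W2.T + b2                # project back down to d_model

d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_ff, d_model)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_model, d_ff)), np.zeros(d_model)
out = ffn(rng.standard_normal((seq_len, d_model)), W1, b1, W2, b2)
# out.shape == (5, 8): same shape as the input, computed independently per position
```

Because the matrix multiply treats each row of `x` independently, running the network on one position in isolation gives the same result as that position’s row in the full output, which is the “position-wise” property.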

<h2 id="training">Training</h2>

<p>Training a transformer for language modeling is identical to training any other kind of neural model for language modeling. We take the raw input text and try to get the transformer to predict the next token at each timestep and use the cross-entropy loss. Remember to apply the causal mask!</p>

<h1 id="gpt2">GPT2</h1>

<p>Now that we’ve grasped the basics of the core transformer architecture, let’s talk about some specifics of the OpenAI GPT2 model since the research, model weights, and tokenizer are all public artifacts! Besides the transformer architecture itself, on either end of it are the <strong>encoding</strong>, turning raw text into a sequence of tokens for the input layer of the transformer, and <strong>decoding</strong>, producing a sequence of tokens for the tokenizer to convert back into raw text.</p>

<h2 id="byte-pair-encoding-bpe">Byte-pair Encoding (BPE)</h2>

<p>So far, we’ve skirted around the topic of <strong>tokenization</strong> by splitting our corpus into individual characters, but that character-based representation is too local to meaningfully represent the English language. When we think about words in English, the smallest unit of meaning is called a <strong>morpheme</strong> in linguistics, and it’s often composed of multiple characters. For example, the word “understanding” in the previous sentence is made up of two morphemes: (i) the root “understand” and (ii) the present participle suffix “ing” meaning “in the process of”. In both cases, each morpheme is built from several characters (each written character is called a <strong>grapheme</strong>), so our character-level representation is not quite the right level of abstraction.</p>

<p>This might seem like the “magical” solution to all of our tokenization woes! Instead of coming up with any kind of tokenization scheme, let’s just take each morpheme in the English language, assign it a unique ID, and split words based on these morphemes! Unfortunately, there are several reasons this won’t just work. First of all, there are too many English morphemes! As a rough calculation, we can multiply the number of English roots by the number of affixes (like “un” and “ing”), the number of participles, and so on to arrive at about 100,000 morphemes, which is a huge embedding space! Even cutting that in half to 50,000 is still a pretty large embedding! And that’s just English; other languages may have even more! Furthermore, language is an ever-evolving structure, so the current set of morphemes might not be sufficient for new words; traditionally, we’d just reserve a token like <code class="language-plaintext highlighter-rouge">&lt;UNK&gt;</code> to represent unknown words in the vocabulary, but that’s a blunt fallback. New words usually don’t come out of nowhere: their constituent parts are usually from existing words.</p>

<p>One kind of encoding that GPT2 and others use is <strong>Byte-pair encoding (BPE)</strong>: a middle-ground that tries to balance grouping graphemes into morphemes while also trying to bound the vocabulary size. Conceptually, BPE computes all character-level bigrams in the corpus and finds the most common pair, e.g., (<code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">B</code>); then it replaces that pair with a unique token, e.g., <code class="language-plaintext highlighter-rouge">AB</code>. Then we repeat the process until we reach the desired vocabulary size or there are no more character-level bigrams to merge. The trick is that we don’t use Unicode characters but actually use bytes (hence the “byte” in byte-pair encoding!).</p>

<p>Let’s use a dummy corpus as an example.</p>

<ul>
  <li>m o r p h e m e</li>
  <li>e m u l a t o r</li>
  <li>l a t e r</li>
</ul>

<p>In the first step, let’s compress <code class="language-plaintext highlighter-rouge">e</code> and <code class="language-plaintext highlighter-rouge">m</code> into a single token <code class="language-plaintext highlighter-rouge">em</code>.</p>

<ul>
  <li>m o r p h em e</li>
  <li>em u l a t o r</li>
  <li>l a t e r</li>
</ul>

<p>Now we can merge <code class="language-plaintext highlighter-rouge">l</code> and <code class="language-plaintext highlighter-rouge">a</code> into <code class="language-plaintext highlighter-rouge">la</code>.</p>

<ul>
  <li>m o r p h em e</li>
  <li>em u la t o r</li>
  <li>la t e r</li>
</ul>

<p>We can merge <code class="language-plaintext highlighter-rouge">o</code> and <code class="language-plaintext highlighter-rouge">r</code> into <code class="language-plaintext highlighter-rouge">or</code>.</p>

<ul>
  <li>m or p h em e</li>
  <li>em u la t or</li>
  <li>la t e r</li>
</ul>

<p>Then we can merge <code class="language-plaintext highlighter-rouge">la</code> and <code class="language-plaintext highlighter-rouge">t</code> into <code class="language-plaintext highlighter-rouge">lat</code>.</p>

<ul>
  <li>m or p h em e</li>
  <li>em u lat or</li>
  <li>lat e r</li>
</ul>

<p>Now no bigram appears more than once, so we stop merging; the vocabulary is <code class="language-plaintext highlighter-rouge">m</code>, <code class="language-plaintext highlighter-rouge">or</code>, <code class="language-plaintext highlighter-rouge">p</code>, <code class="language-plaintext highlighter-rouge">h</code>, <code class="language-plaintext highlighter-rouge">em</code>, <code class="language-plaintext highlighter-rouge">e</code>, <code class="language-plaintext highlighter-rouge">u</code>, <code class="language-plaintext highlighter-rouge">lat</code>, and <code class="language-plaintext highlighter-rouge">r</code>. Now suppose we encounter a new word like <code class="language-plaintext highlighter-rouge">grapheme</code>; we can partially tokenize it into <code class="language-plaintext highlighter-rouge">[UNK r UNK p h em e]</code> which is better than just replacing the whole word with <code class="language-plaintext highlighter-rouge">&lt;UNK&gt;</code>. This is a very simple example, but it illustrates how BPE gives us a robust representation somewhere between characters and words. In reality, the vocabularies are large enough that we’d rarely have unknown sub-morphemes.</p>
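<p>The count-and-merge loop can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not GPT2’s actual tokenizer (which operates on bytes and uses a learned merge table); ties between equally frequent pairs are broken arbitrarily here, so the intermediate merge order may differ from the walkthrough, though the corpus converges to the same final state:</p>

```python
from collections import Counter

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

# start from individual characters, as in the walkthrough above
corpus = [list('morpheme'), list('emulator'), list('later')]
while True:
    pairs = Counter((a, b) for word in corpus for a, b in zip(word, word[1:]))
    if not pairs or pairs.most_common(1)[0][1] < 2:
        break  # no bigram repeats anymore, so stop merging
    corpus = merge_pair(corpus, pairs.most_common(1)[0][0])
```

A real BPE implementation would also record the sequence of merges so that new text can be tokenized by replaying them in order.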

<p>Let’s code an example using the off-the-shelf GPT2 tokenizer. (Make sure you have the <code class="language-plaintext highlighter-rouge">transformers</code> Python package installed!)</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">GPT2TokenizerFast</span>

<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">GPT2TokenizerFast</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">'gpt2'</span><span class="p">)</span>
<span class="n">prompt</span> <span class="o">=</span> <span class="s">'Tell me what the color of the sky is.'</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s">'pt'</span><span class="p">).</span><span class="n">input_ids</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Input: </span><span class="si">{</span><span class="n">prompt</span><span class="si">}</span><span class="se">\n</span><span class="s">Tokens: </span><span class="si">{</span><span class="n">tokens</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>

<span class="c1"># Input: Tell me what the color of the sky is.
# Tokens: [[24446, 502, 644, 262, 3124, 286, 262, 6766, 318, 13]]
</span></code></pre></div></div>

<p>Our input prompt features words frequent enough that each entire word (and the punctuation) is represented as a whole token! Try using more complicated words or made-up words to see what the tokens would look like!</p>

<h2 id="decoding">Decoding</h2>

<p>The last stage before de-tokenizing is decoding. Recall that language models are probabilistic and compute the likelihood of a sequence of tokens. When we discussed n-gram and RNN models, we generated text using a random sampling approach where we sample the next word according to the next word’s probability distribution conditioned on the entire previous sequence, i.e., $w_t\sim p(w_t\vert w_1,\dots,w_{t-1})$.</p>

<h3 id="top-k-sampling">Top-$k$ Sampling</h3>

<p>The issue with random sampling is that we give some tiny weight to words that wouldn’t create a cohesive sentence. For example, if we had a sentence like “The cat sat on the “, there would be a nonzero likelihood of sampling “adversity”.</p>

<p>Rather than using the entire distribution, we should disqualify these low-likelihood words and only select the most likely ones. One way to accomplish this is to select only the $k$ most likely words, renormalize that back into a probability distribution, and sample from that distribution instead. This is called <strong>top-$k$ sampling</strong>! The intent is to remove any low-likelihood words so that it’s impossible to sample them.</p>

<p><img src="/images/transformers/top-k-good.png" alt="Top-k Distribution" title="Top-k Distribution" /></p>

<p><small>Visually, we can conceptualize it as taking the distribution $p(w_t\vert w_1,\dots,w_{t-1})$, sorting it by probability, selecting the top $k$ most likely words, renormalizing that back into a probability distribution, and sampling from that top-$k$ distribution.</small></p>
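<p>The select-renormalize-sample procedure can be sketched from scratch in NumPy (the function name and toy distribution are illustrative; this is not how the <code class="language-plaintext highlighter-rouge">transformers</code> library implements it internally):</p>

```python
import numpy as np

def top_k_sample(probs, k, rng=np.random.default_rng()):
    """Keep the k most likely tokens, renormalize, and sample one index."""
    top = np.argsort(probs)[-k:]             # indices of the k largest probabilities
    renorm = probs[top] / probs[top].sum()   # renormalize into a distribution
    return rng.choice(top, p=renorm)

probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
token = top_k_sample(probs, k=2)
# token is always 0 or 1: the low-likelihood tail can never be sampled
```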

<p>When $k=1$, we recover greedy sampling where we always select the most likely word next. We can use the same <code class="language-plaintext highlighter-rouge">transformers</code> library to pull the GPT2 language model weights, sample using top-$k$ sampling, and de-tokenize!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">GPT2TokenizerFast</span><span class="p">,</span> <span class="n">GPT2LMHeadModel</span>

<span class="n">model_name</span> <span class="o">=</span> <span class="s">'gpt2'</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">GPT2TokenizerFast</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">GPT2LMHeadModel</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>

<span class="n">prompt</span> <span class="o">=</span> <span class="s">'Some example colors are red, blue, '</span>
<span class="n">model_input</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s">'pt'</span><span class="p">)</span>

<span class="c1"># do_sample samples from the model
# no_repeat_ngram_size puts a penalty on repeating ngrams
# early_stopping means to stop after we see an end-of-sequence token
</span><span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="o">**</span><span class="n">model_input</span><span class="p">,</span> <span class="n">do_sample</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">no_repeat_ngram_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">early_stopping</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">decoded_output</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">batch_decode</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">skip_special_tokens</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">decoded_output</span><span class="p">)</span>
</code></pre></div></div>

<p>Try it out with different values of $k$!</p>

<h3 id="top-p-sampling--nucleus-sampling">Top-$p$ Sampling / Nucleus Sampling</h3>

<p>The issue with top-$k$ sampling is that $k$ is fixed across all contexts: in some contexts, the probability distribution is very flat and the top $k$ words are about as likely as the words just outside the cutoff. But what if we have a probability distribution that’s heavily peaked? In that case, a fixed $k$ might still include very unlikely words.</p>

<p><img src="/images/transformers/top-k-peaked.png" alt="Top-k Skewed" title="Top-k Skewed" /></p>

<p><small>Consider this peaked distribution. Setting the wrong value of $k$ would still select unlikely words. The root of the issue is that the value of $k$ is fixed: for one word, we might get a “good” distribution but for the immediate next word, we might get this peaked distribution!</small></p>

<p><em>The Curious Case of Neural Text Degeneration</em> by Holtzman et al. took a different approach: rather than selecting and sorting in the word space, they do something similar in the cumulative probability space. The idea is similar to top-$k$ in that we sort the probability distribution, but then, instead of selecting a fixed $k$ words, we select words, from the most likely to the least likely, until their cumulative probability exceeds a certain probability $p$. Think of it as having a “valid” set of words that we populate based on the cumulative probability of the set: we add the most likely word, then the next likely word, and keep going until the total probability of the set exceeds $p$. Then we stop, renormalize, and sample from that distribution.</p>

<p><img src="/images/transformers/top-p.png" alt="Top-p" title="Top-p" /></p>

<p><small>Visually, we can conceptualize it as taking the distribution $p(w_t\vert w_1,\dots,w_{t-1})$, sorting it by probability, selecting the set of words from the most likely to the least likely until the sum of their probabilities meets the threshold. Then we renormalize that back into a probability distribution and sample from that.</small></p>
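<p>The grow-until-threshold idea can also be sketched from scratch in NumPy (again an illustrative implementation, not the <code class="language-plaintext highlighter-rouge">transformers</code> library’s internals):</p>

```python
import numpy as np

def top_p_sample(probs, p, rng=np.random.default_rng()):
    """Grow the 'valid' set from most to least likely until it covers mass p."""
    order = np.argsort(probs)[::-1]              # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix with mass >= p
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum()     # renormalize the valid set
    return rng.choice(keep, p=renorm)

probs = np.array([0.6, 0.25, 0.09, 0.04, 0.02])
token = top_p_sample(probs, p=0.8)
# cumulative mass is 0.6, then 0.85, so the valid set here is {0, 1}
```

Notice that the size of the valid set falls out of the distribution itself: a peaked distribution keeps very few words while a flat one keeps many, which is exactly what fixed-$k$ sampling cannot do.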

<p>This overcomes the challenge of selecting the “right” $k$ value in top-$k$ sampling because we’re dynamically choosing how many words we put into the “valid” set based on what the distribution over the next word looks like. When $p$ is small, the “valid” set tends to be smaller since it takes fewer words to reach the $p$ value; this tends to produce more predictable and less diverse output. When $p$ is larger, we need more words in the “valid” set to reach the $p$ value; this tends to produce less predictable, more diverse outputs. Similar to top-$k$ sampling, we can use the same <code class="language-plaintext highlighter-rouge">transformers</code> library to pull the GPT2 language model weights, sample using top-$p$ sampling, and de-tokenize!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># use top_p instead
</span><span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="o">**</span><span class="n">model_input</span><span class="p">,</span> <span class="n">do_sample</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">top_p</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">no_repeat_ngram_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">early_stopping</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<p>Try it out with different $p$ values!</p>

<h3 id="temperature-sampling">Temperature Sampling</h3>

<p>One more approach to decoding is called <strong>temperature sampling</strong> because it’s inspired by thermodynamics: a system of particles at a high temperature will behave more unpredictably than one at a lower temperature. Temperature sampling mimics that behavior through a temperature parameter $\tau$. The idea is that we divide the raw logit activations $a$ by the temperature before taking the softmax: $\text{softmax}(\frac{a}{\tau})$. To understand the effect of $\tau$, recall that the softmax exaggerates differences between logits: dividing by a small $\tau$ magnifies those differences, pushing probability toward the higher-likelihood words, while dividing by a large $\tau$ shrinks them, spreading probability toward the lower-likelihood words.</p>

<p><img src="/images/transformers/temp-sampling.png" alt="Temperature Sampling" title="Temperature Sampling" /></p>

<p><small>Given a distribution, lower temperatures tend to cause the distribution to be more sharply peaked towards just the few high-likelihood words; this reduces variability in the output. As the temperature gets higher, the distribution gets flatter.</small></p>

<p>Now let’s consider the role of $\tau$: if $\tau=1$, then we don’t change the distribution at all. For a low temperature $\tau\in(0, 1)$, dividing scales the logits up, sharpening the softmax distribution around the high-likelihood words so the sampling is more predictable (just like with lower temperature in thermodynamics!). For a high temperature $\tau &gt; 1$, we’re making each of the logits smaller, which “flattens” the distribution so it’s more likely that previously low-likelihood words are selected, giving us more variability in the output.</p>
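<p>We can verify this behavior numerically with a small NumPy sketch (an illustration, with a made-up set of logits):</p>

```python
import numpy as np

def temperature_softmax(logits, tau):
    """Scale logits by 1/tau before the softmax; tau<1 sharpens, tau>1 flattens."""
    scaled = logits / tau
    exps = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])
sharp = temperature_softmax(logits, tau=0.5)  # peaked: mass concentrates on the argmax
flat = temperature_softmax(logits, tau=2.0)   # flatter: closer to uniform
# sharp[0] > temperature_softmax(logits, 1.0)[0] > flat[0]
```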

<p>Just like with the previous sampling, let’s try it out!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="o">**</span><span class="n">model_input</span><span class="p">,</span> <span class="n">do_sample</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">1.5</span><span class="p">,</span> <span class="n">no_repeat_ngram_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">early_stopping</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<p>Try playing around with different temperatures!</p>

<h1 id="conclusion">Conclusion</h1>

<p>In this post, we finally arrived at the state-of-the-art neural language model: the transformer! To better motivate the pieces, we first discussed the pitfalls of RNNs. Then we discussed the pieces of the transformer starting with the positional encoding, then moving on to multihead self-attention, through the layer norm, and finally to the position-wise feedforward networks. Finally, we used a pre-trained GPT2 language model for language modeling and discussed the encoding and decoding on either end of the transformer to fully construct a language model for text generation!</p>

<p>That’s the conclusion (?) on our tour of language modeling! We started learning about n-gram models all the way through transformers used in state-of-the-art large language models. Most of what I’ve seen now is people <em>using</em> LLMs to build really cool creative things, but I hope this tour has helped peel back the curtain on how they work 🙂</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Going beyond recurrent neural networks, the transformer revolutionized the language modeling field and became the core of many state-of-the-art large language models (LLMs) like ChatGPT, Claude, and Llama. In this post, we'll demystify that core that is the transformer.]]></summary></entry><entry><title type="html">Language Modeling - Part 3: Recurrent Neural Networks</title><link href="/rnns.html" rel="alternate" type="text/html" title="Language Modeling - Part 3: Recurrent Neural Networks" /><published>2024-07-28T00:00:00+00:00</published><updated>2024-07-28T00:00:00+00:00</updated><id>/rnns</id><content type="html" xml:base="/rnns.html"><![CDATA[<p>In the previous post, we upgraded from representing words as strings to representing words as vectors with word embeddings. Around the same time, GPUs and strategies to train deep neural networks took off with the AlexNet paper and started the whole deep learning craze. Neural networks showed plenty of promise in many different fields, so, naturally, the question arises: can we use neural networks to improve the quality of our language models? Most of the neural network technologies had already existed for decades, but only then did it become feasible to train them with more parameters and on larger sets of data. Furthermore, neural networks operate on vectors, and we now know how to map words to vectors using embeddings. Let’s try to use the advancements in neural networks to see if we can create a better language model!</p>

<p>In this post, I’ll start to marry neural networks and language modeling with neural language models. Then we’ll build on those with a simple recurrence relation to construct plain/vanilla recurrent neural networks (RNNs). Then we’ll upgrade <em>those</em> into a modern version that’s more widely used by replacing the vanilla RNN cells with long short-term memory (LSTM) cells. Finally, we’ll train a language model on some public domain plays by Shakespeare and see how well they work!</p>

<p>Going forward, I’ll assume that you have a basic understanding of neural networks and how to train them.</p>

<h1 id="neural-language-models">Neural Language Models</h1>

<p>Recall that for n-gram models, we represented the likelihood of the next word with the conditional distribution of the previous $n$ words. Take the bigram model for example: given the previous word, what’s the likelihood of the next word? Given a sequence of words, we can decompose it into a product of a single unigram/marginal and a product of conditionals.</p>

\[p(w_1,\dots, w_N)\approx p(w_1) p(w_2|w_1)p(w_3|w_2)\cdots p(w_N|w_{N-1})\]

<p>We computed n-grams by going through our corpus and counting and dividing. Instead, we can try to use a neural network to consume that information and predict the next word. Since neural networks operate on numerical data, we embed each word using either learned or pre-trained word embeddings, concatenate them all together, then pass those through a neural network. Similar to the classification task, the output is a probability distribution over the vocabulary. This training is almost exactly the same as normal categorical training over class labels except our “class labels” are the vocabulary of the corpus.</p>

<p><img src="/images/rnns/lm-ann.svg" alt="Neural Language Model" title="Neural Language Model" /></p>

<p><small>A simple neural language model consumes a <strong>context window</strong> of input embeddings, concatenates them together, and runs them through some fully-connected layers to get an output of logits. Then we can take the output, softmax it, and compute the categorical cross-entropy loss against the target output, i.e., the true next word. This trains the model to try to predict the next word given the context window.</small></p>

<p>This has similar properties to n-grams in that we have a <strong>context window</strong> of several words, but, instead of using count-and-divide explicit probabilities, we use the weights of the model to learn the next word. Naturally, we could use a huge context window to get better results (on average) but the trade-off is that it’s more computationally expensive. Even if we could afford that extra computation, the fact remains that we’re not really treating the data as sequential: we’re just concatenating the embeddings together but there’s nothing to tell the model that the second word in the context window comes after the first one. This is missing the critical factor of n-grams: the sequential modeling that word $w_i$ is a function of the words that came before it. Can we change the construction of the neural network to better encapsulate the sequentiality of the input data?</p>

<h1 id="recurrent-neural-networks">Recurrent Neural Networks</h1>

<p>The core of the issue of language modeling with regular neural networks is that they don’t factor in the previous embedding $x_{t-1}$ when computing the next one $x_t$ sequentially. Embedding the whole sequence, we have vectors $x_1,\dots,x_N$, but how do we relate them in sequence? To start, the embeddings themselves have limited expressivity, so we shouldn’t use those vectors directly but rather map them to a latent/hidden space using a single fully-connected layer, just like what we do with plain neural networks and non-text-based data. We now have an $h_t = Wx_t + b^{(x)}$ for each input vector: $h_1,\dots,h_N$. How do we relate $h_i$ to $h_{i-1}$ and $h_{i-2}$ and so on? The simplest thing to start with would be another fully-connected layer: $h_t = Uh_{t-1} + b^{(h)}$. Combining these into a single equation and merging the biases, we get a <strong>recurrence relation</strong>:</p>

\[h_t = \tanh(Wx_t + Uh_{t-1} + b^{(h)})\]

<p>where</p>
<ul>
  <li>$W$ is the weight matrix from the input to the hidden layer</li>
  <li>$x_t$ is the input embedding</li>
  <li>$U$ is the weight matrix from the hidden layer back to the hidden layer</li>
  <li>$b^{(h)}$ is the combined bias for the hidden layer</li>
  <li>$h_{t-1}$ is the previous hidden state</li>
  <li>$h_t$ is the current hidden state</li>
</ul>

<p>(We’ll get to the choice of hyperbolic tangent $\tanh$ over the sigmoid $\sigma$ activation function in just a little bit.)</p>

<p>(Alternatively, we could concatenate the input and previous hidden state into a single vector and pass the concatenated state $[x_t;h_{t-1}]$ through a fully-connected layer like $[W\; U][x_t;h_{t-1}] + b$. It’s an equivalent formulation since all operations are linear and we can define $b = b^{(x)} + b^{(h)}$.)</p>

<p>With this recurrence relation, we can now handle arbitrary-length sequences! Beyond the hidden layer, we compute the output logits $\hat{y_t}$ at each timestep by running the hidden layer through another fully-connected layer. Then we can normalize them using a softmax operation to produce a probability distribution over the vocabulary.</p>

\[\begin{aligned}
h_t &amp;= \tanh(Wx_t + Uh_{t-1} + b^{(h)})\\
\hat{y_t} &amp;= Vh_t + b^{(y)}\\
\end{aligned}\]

<p>where</p>
<ul>
  <li>$V$ is the weight matrix from the hidden to the output layer</li>
  <li>$b^{(y)}$ is the bias for the output layer</li>
  <li>$\hat{y_t}$ is the output logits</li>
</ul>
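<p>Putting the two equations together, a forward pass of a vanilla RNN can be sketched in NumPy (the shapes and variable names are illustrative, not from any framework):</p>

```python
import numpy as np

def rnn_forward(xs, W, U, V, b_h, b_y):
    """Run a vanilla RNN over a sequence of embeddings.

    xs: (T, d_in); returns hidden states (T, d_h) and output logits (T, vocab).
    """
    h = np.zeros(b_h.shape[0])  # h_0 is the zero vector
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h + b_h)  # the recurrence relation
        hs.append(h)
        ys.append(V @ h + b_y)              # output logits for this timestep
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
d_in, d_h, vocab, T = 4, 8, 10, 5
W = rng.standard_normal((d_h, d_in))
U = rng.standard_normal((d_h, d_h))
V = rng.standard_normal((vocab, d_h))
hs, ys = rnn_forward(rng.standard_normal((T, d_in)), W, U, V, np.zeros(d_h), np.zeros(vocab))
# hs.shape == (5, 8), ys.shape == (5, 10)
```

Note how the same three weight matrices are reused at every timestep; only the hidden state changes, which is what lets the network handle sequences of any length.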

<p>Since we’re using trainable fully-connected layers and a recurrence relation, we call this a <strong>Recurrent Neural Network (RNN)</strong>! The component that computes and propagates forward the hidden state is called an <strong>RNN Cell</strong>.</p>

<p><img src="/images/rnns/lm-vanilla-rnn.svg" alt="Vanilla Recurrent Neural Network" title="Vanilla Recurrent Neural Network" /></p>

<p><small>A recurrent neural network (RNN) has an input, recurrent hidden state, and output. The hidden state is fed back into itself across the entire input sequence. We can represent it “folded” or “unfolded” for a few timesteps.</small></p>

<p>Going back to the choice of activation function, a few intuitive reasons we use $\tanh$ instead of a sigmoid: its range of $(-1,1)$ lets the hidden state represent both positive and negative values, as opposed to the sigmoid’s range of $(0,1)$. Also, we don’t have any requirement to normalize the hidden state to $(0,1)$, unlike for binary classification or probabilities. Both $\tanh$ and $\sigma$ are bounded, which is what we want since we’re accumulating the hidden state over potentially a large number of timesteps and we don’t want it to go to infinity.</p>

<p>Practically speaking, for the task of language modeling, the output $\hat{y_t}$ size is the same as the vocabulary, while the input $x_t$ size is the embedding dimension (often set to the same value, though it need not be). The hidden state $h_t$ size is a hyperparameter that we can set.</p>

<h2 id="backpropagation-with-rnns">Backpropagation with RNNs</h2>

<p>Now that we’ve defined an RNN, let’s see how we can train one for language modeling. The first thing we need is a corpus of text. Recall that language modeling predicts the next word given the previous history, so all we need is a single corpus to construct a supervised training set. Given the corpus and a vocabulary size, we can tokenize the text and run it through an embedding matrix to get an embedded vector $x_t$.</p>

<p>At each timestep, we run each embedded vector $x_t$ through the RNN to compute the next hidden state $h_t$ and the output $\hat{y_t}$. The very first hidden state $h_0$ is normally set to the zero vector. As we progress through the sequence, we keep folding in the previous hidden state $h_{t-1}$ into the current one $h_t$ so that the current hidden state is a “summarization”/representation of all of the previous history that the RNN has seen so far.</p>

<p>The output of the RNN $\hat{y_t}$ is computed by passing the hidden state through a single-layer neural network to get an output vector the same size as the vocabulary. The target $y_t$ is a one-hot embedding of the input offset by one word into the future since we’re trying to predict the next word. We take the predicted output, run it through a softmax layer, and then compute the categorical cross-entropy between the softmax’ed predicted output and the one-hot embedding of the target output: $\sum_t L_{CE}(\hat{y_t}, y_t)$. We do this for each timestep and sum up all of the timestep losses into a single, global loss over the entire corpus. This flavor of backpropagation is sometimes called <strong>backpropagation through time (BPTT)</strong>.</p>
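<p>A sketch of the summed loss, assuming logits of shape <code class="language-plaintext highlighter-rouge">(seq_len, vocab_size)</code> and integer next-word targets; note that Pytorch’s <code class="language-plaintext highlighter-rouge">F.cross_entropy</code> applies the softmax internally, so we feed it raw outputs:</p>

```python
import torch
import torch.nn.functional as F

seq_len, vocab_size = 5, 10  # illustrative sizes
logits = torch.randn(seq_len, vocab_size)           # \hat{y}_t for each timestep
targets = torch.randint(0, vocab_size, (seq_len,))  # next-word indices (input offset by one)

# sum the per-timestep cross-entropy losses into one global loss
loss = sum(F.cross_entropy(logits[t].unsqueeze(0), targets[t].unsqueeze(0))
           for t in range(seq_len))

# equivalent vectorized form over all timesteps at once
loss_vec = F.cross_entropy(logits, targets, reduction='sum')
```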

<p><img src="/images/rnns/bptt.svg" alt="Backpropagation Through Time" title="Backpropagation Through Time" /></p>

<p><small>Backpropagation through time “unrolls” the RNN across all of the timesteps and computes a loss between each output and each target at each timestep. The total loss is summed up from the individual losses at each timestep.</small></p>

<p>We accumulate this loss throughout the entire sequence and then backpropagate by “unrolling” the RNN through time. One issue with this is that the computational complexity increases with the length of the sequence; to help bound this, we chop up the full input sequence into chunks of a fixed size and just unroll and backpropagate for those chunks. This technique is sometimes called <strong>truncated backpropagation through time (tBPTT)</strong>.</p>

<p><img src="/images/rnns/tbptt.svg" alt="Truncated Backpropagation Through Time" title="Truncated Backpropagation Through Time" /></p>

<p><small>Truncated backpropagation through time “unrolls” the RNN only for a fixed-length subsequence and backpropagates for that subset of the sequence. However, we always propagate the hidden state forward and never reset it until we’ve gone through the entire sequence.</small></p>

<p>Even though we’re only accumulating gradients for the size of the chunks, we still accumulate the hidden state over the entire sequence; we just unroll and backpropagate over each time window. This is computationally fine since the hidden state itself is a finite size.</p>
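<p>A toy sketch of the chunking pattern (the single recurrent weight matrix and the loss here are stand-ins, not a real language model): the hidden state’s <em>value</em> is carried across chunks, but <code class="language-plaintext highlighter-rouge">detach()</code> cuts the autograd graph so gradients only flow within a chunk.</p>

```python
import torch

hidden_size, chunk_size, seq_len = 4, 3, 9
U = torch.randn(hidden_size, hidden_size, requires_grad=True)  # stand-in recurrent weights

def step(h):
    return torch.tanh(U @ h)

h = torch.randn(hidden_size) * 0.1  # stand-in initial hidden state
losses = []
for start in range(0, seq_len, chunk_size):
    # carry the hidden state's value forward, but cut the graph at the chunk boundary
    h = h.detach()
    loss = torch.zeros(())
    for _ in range(chunk_size):
        h = step(h)
        loss = loss + h.sum()  # stand-in loss for this chunk
    loss.backward()  # only unrolls back through this chunk
    losses.append(float(loss))
```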

<h2 id="sampling-rnn-language-models">Sampling RNN Language Models</h2>

<p>Given a trained RNN language model, we can sample from it to generate text. We’ll need a starting word or token, which we run through the RNN to get an output vector. We run the output vector through a softmax and sample from the resulting probability distribution to get the next word.</p>

<p><img src="/images/rnns/sampling-rnns.svg" alt="Sampling RNNs" title="Sampling RNNs" /></p>

<p><small>When sampling RNNs, we need some initial seed word (or alternatively we can use a dedicated <code class="language-plaintext highlighter-rouge">SOS</code> token). After running that first one through an RNN, we take the output, normalize it into a distribution over the vocabulary using the softmax operator, sample from that distribution, and then use that sampled word as the next input.</small></p>

<p>Given the next word, we treat it as input into the next time step and repeat the process until we produce a sequence of words of the desired length. An alternative is to use greedy sampling where we always take the highest likelihood word but that tends to restrict the output variability. There are even better (and more complex) sampling approaches such as top-k sampling, nucleus sampling, and beam search.</p>
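<p>A sketch of the sampling loop, with a randomly-weighted stand-in for a trained RNN step (in a real model, <code class="language-plaintext highlighter-rouge">rnn_step</code> would be the trained network):</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, hidden_size = 8, 16  # illustrative sizes

# stand-in for a trained RNN: fixed random weights
W = torch.randn(hidden_size, vocab_size)
U = torch.randn(hidden_size, hidden_size)
V = torch.randn(vocab_size, hidden_size)

def rnn_step(token, h):
    x = F.one_hot(torch.tensor(token), vocab_size).float()
    h = torch.tanh(W @ x + U @ h)
    return V @ h, h

def sample(seed_token, length):
    h = torch.zeros(hidden_size)
    tokens = [seed_token]
    for _ in range(length):
        logits, h = rnn_step(tokens[-1], h)
        probs = F.softmax(logits, dim=0)                   # normalize into a distribution
        tokens.append(torch.multinomial(probs, 1).item())  # sample, don't argmax
    return tokens

seq = sample(seed_token=0, length=10)
```

<p>Swapping <code class="language-plaintext highlighter-rouge">torch.multinomial</code> for <code class="language-plaintext highlighter-rouge">argmax</code> gives the greedy variant described above.</p>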

<h2 id="rnn-flavors">RNN Flavors</h2>

<p>So far, we’ve discussed the most basic kind of vanilla RNN, but there are a number of different improvements on this that have been made over the years so we’ll discuss a few common ones briefly.</p>

<h3 id="bidirectional-rnns">Bidirectional RNNs</h3>

<p>When we’re training the RNN, we always run the sequence through the RNN sequentially forward in time. However, we can give the model more information if it could also “see into the future” by also running the sequence backwards in time and combining the two before passing to the output layer. This gives the RNN information about the previous history as well as the future at each timestep. Specifically, at each timestep $t$, we could create a joint hidden state $[h^{(f)}_t;h^{(b)}_t]$ of the forward hidden state $h^{(f)}_t$ from propagating the sequence forward and the backward hidden state $h^{(b)}_t$ from running the sequence in reverse. This flavor of RNN is called a <strong>bidirectional RNN</strong>.</p>

<p><img src="/images/rnns/bidirectional-rnn.svg" alt="Bidirectional RNN" title="Bidirectional RNN" /></p>

<p><small>A <strong>bidirectional RNN</strong> passes the sequence forward to compute a forward hidden state $h^{(f)}_t$ and runs the sequence in reverse to compute a backward hidden state $h^{(b)}_t$. Both are concatenated together to get a joint hidden state $[h^{(f)}_t;h^{(b)}_t]$ at a particular timestep $t$. Intuitively, this gives the model more information (both the past and the future) to compute a better output in the present.</small></p>

<p>Computing the output is the same as regular RNNs: we run the concatenated hidden state through a fully-connected layer to get an output.</p>
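<p>A small sketch of the bidirectional pattern with two stand-in RNN cells (the sizes and weights are illustrative): run one pass forward, one pass in reverse, re-align the backward states, and concatenate.</p>

```python
import torch

torch.manual_seed(0)
seq_len, in_size, hidden_size = 5, 3, 4  # illustrative sizes

x = torch.randn(seq_len, in_size)
W_f, U_f = torch.randn(hidden_size, in_size), torch.randn(hidden_size, hidden_size)
W_b, U_b = torch.randn(hidden_size, in_size), torch.randn(hidden_size, hidden_size)

def run(inputs, W, U):
    """Run a simple tanh RNN cell over the inputs, collecting every hidden state."""
    h, hs = torch.zeros(hidden_size), []
    for x_t in inputs:
        h = torch.tanh(W @ x_t + U @ h)
        hs.append(h)
    return hs

h_fwd = run(x, W_f, U_f)                        # h^{(f)}_t for t = 0..T-1
h_bwd = run(reversed(list(x)), W_b, U_b)[::-1]  # h^{(b)}_t, re-aligned to forward time

# joint hidden state [h^{(f)}_t ; h^{(b)}_t] at each timestep
joint = [torch.cat([f, b]) for f, b in zip(h_fwd, h_bwd)]
```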

<h3 id="stacked-rnns">Stacked RNNs</h3>

<p>Similar to how deep neural networks provide more expressive power when we add more hidden layers between the input and output, we can do the same thing with RNNs and stack RNN cells on top of each other. We do this by feeding a hidden state at a particular layer $h_t^{(l)}$ as the input to the next layer’s RNN cell at the same timestep. We keep doing this through the layers until we get to the last one, then we compute the output as usual. This flavor of RNN is sometimes called a <strong>stacked RNN</strong>.</p>

<p><img src="/images/rnns/stacked-rnns.svg" alt="Stacked RNN" title="Stacked RNN" /></p>

<p><small>A <strong>stacked RNN</strong> layers the hidden cells so that the output of a hidden cell at one layer $h_t^{(l)}$ is propagated as input to a hidden cell at a higher level $h_t^{(l+1)}$.</small></p>

<p>These are also trained and sampled in the exact same way as regular RNNs. However, just like with deep neural networks, these have more parameters as a function of how many hidden layers we have so they’ll perform better (on average) but take longer to train!</p>

<h1 id="long-short-term-memory-lstm-cells">Long Short-term Memory (LSTM) Cells</h1>

<p>RNNs seem really great at capturing the sequence nature of text and language, but the vanilla RNNs we’ve seen so far suffer from two major issues: (i) <strong>exploding gradient</strong> and (ii) <strong>vanishing gradient</strong>. Both of these issues arise from backpropagating the gradient at the current timestep through all of the previous timesteps to the start of the sequence. Recall that when we’re backpropagating through hidden layers of a neural network, we use the chain rule of calculus to multiply the gradient by a factor for each layer we backpropagate through. Backpropagation through time does a similar thing except the gradient is backpropagated through the timesteps back to the start of the sequence. When we’re moving backwards through the timesteps, we’re multiplying the gradient by some factor, let’s call it $\alpha$, each time. So at a timestep $t$, going all the way back to the first timestep, we have a long product of those factors like $\alpha_1\cdots\alpha_{t}$. If each factor is exactly $1$, then the product is also $1$. If most of the factors are greater than $1$, then the product is going to go off to infinity. On the other hand, if most of the factors are less than $1$, then the product is going to go towards zero. The former problem is the exploding gradient problem and the latter is the vanishing gradient problem.</p>

<p><img src="/images/rnns/bp-single.svg" alt="Backpropagating a Single Output" title="Backpropagating a Single Output" /></p>

<p><small>For a single output, we have to backpropagate to the start of the sequence through the hidden layer at each timestep since all of the weights and biases for the hidden state are re-used at every timestep. This means multiplying the gradient by some factor for each timestep we backpropagate through.</small></p>

<p>For the exploding gradient problem, a crude but very effective and direct solution is to clip all of the gradients into a finite range before updating the model parameters. A common range to clip the gradients to is $(-5, 5)$. While it might seem that, since we’re intentionally clipping the gradients, the RNN will train slower, it actually means that the training is going to be far more stable and overall take less time since the parameters won’t be jumping around everywhere.</p>
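<p>Pytorch ships a utility for exactly this; a minimal sketch (the linear layer is just a stand-in for an RNN’s parameters):</p>

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)  # stand-in for an RNN's parameters
loss = model(torch.randn(8, 4) * 100).sum()
loss.backward()  # the large inputs produce large gradients

# clip every gradient component into [-5, 5] before the optimizer step
nn.utils.clip_grad_value_(model.parameters(), clip_value=5.0)
```

<p>An alternative is <code class="language-plaintext highlighter-rouge">clip_grad_norm_</code>, which rescales the whole gradient vector to a maximum norm instead of clipping each component independently.</p>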

<p>Unfortunately, the vanishing gradient problem is more challenging to resolve. It isn’t a novel problem since we see it in regular neural networks: as we add more and more layers, the gradient gets smaller and smaller until it approaches zero and the earlier layers get no gradient signal so their weights and biases don’t update.</p>

<p>With RNNs, we have a similar problem with the gradient vanishing, but not in space, in time! For longer sequences, when we unroll the RNN, the gradient vanishes by the time we get to the earlier part of the sequence. This prevents our model from learning long-term relationships between our words.</p>

<p>For a more mathematical treatment of both of these issues, check the Appendix!</p>

<p>The vanishing gradient problem is fundamental to the RNN recurrence relation itself, so instead of trying to shoehorn “solutions” onto the issue, it would be better to redesign the entire RNN cell. Remember that the root of the issue is that, when the gradient backpropagates backwards through the hidden layers, we end up multiplying by a factor at each timestep. Instead of multiplicative operations, additive operations are a bit easier for the gradient since addition acts as a “gradient copier” and preserves, not attenuates, the gradient. So it seems like we need an alternative or additional mechanism that allows the gradient to more easily flow, unedited, to earlier timesteps.</p>

<p>(As an aside, I’m going to need to make some hand-wavy justifications and sequence of steps to get us to where we’re trying to go since it’s difficult to directly motivate a solution. This is a somewhat common theme in machine learning, but I think that’s fine. Research sometimes requires us to take a leap using our intuition and evaluate our solution to see how well our intuition works out.)</p>

<p>The first thing we can try is to define a new kind of state called the <strong>cell state</strong> $C_t$ that we can propagate forward. Ideally we want to avoid multiplicative operations on this state so that the gradient can flow, but how do we populate it? The most straightforward and simplest thing to do is to carry the previous cell state forward, $C_t=C_{t-1}$, but this doesn’t provide any input into it. Similar to what we did with the hidden state, we can run the input and previous hidden state through a fully-connected layer and $\tanh$ activation and <em>add</em> the result to the cell state.</p>

\[\begin{align*}
g_t &amp;= \tanh(W^{(g)}x_t + U^{(g)}h_{t-1} + b^{(g)})\\
C_t &amp;= C_{t-1} + g_t\\
\end{align*}\]

<p>This new $g_t$ is a <strong>candidate gate</strong> that <em>gates</em> the values of what we’d like to put into the cell state. However, the cell state isn’t bounded in any way: if we keep adding to it, even with gradient clipping, it could blow up to infinity! Furthermore, we’re assuming that we want to pass forward the <em>entire</em> previous cell state and the <em>entire</em> candidate input. In both scenarios, rather than hard-coding what information we should preserve and what information we should take into the cell state, we can have the model <em>learn</em> what to do. For the former, we want to <em>learn</em> which information we remember and which information we forget; for the latter, we want to <em>learn</em> which information to put into the cell state. We can use two more fully-connected layers to gate the information we pass forward to the current cell state from the previous one as well as the information we take from the input into the current cell state.</p>

\[\begin{align*}
f_t &amp;= \sigma(W^{(f)}x_t + U^{(f)}h_{t-1} + b^{(f)})\\
g_t &amp;= \tanh(W^{(g)}x_t + U^{(g)}h_{t-1} + b^{(g)})\\
i_t &amp;= \sigma(W^{(i)}x_t + U^{(i)}h_{t-1} + b^{(i)})\\
C_t &amp;= f_t\odot C_{t-1} + i_t\odot g_t\\
\end{align*}\]

<p>These are called the <strong>forget gates</strong> $f_t$ and <strong>input gates</strong> $i_t$. $\odot$ is called the Hadamard product which is a fancy name for an element-wise product. $\sigma$ is the usual sigmoid function. Note that the forget and input gates use the sigmoid so components of those gates with a value close to $0$ mean we’ll “forget” or “ignore” those components of the previous cell state and candidate gate as well. For values close to $1$, we’ll “remember” or “retain” those components.</p>

<p>What about the hidden state? Do we even need it or can we just use the new cell state that we’ve developed? As it turns out, we do need it because the two serve the same purpose but at different time scales. The intent of the cell state is to act as a long-term memory and retain long-term information (we’re being careful and intentional about gradient propagation) while the hidden state acts as a short-term memory or “working memory”.</p>

<p>Now that we have both hidden and cell states, how do we relate them to complete the loop? Since the cell state represents long-term memory, we don’t want to just copy it into the hidden state since that defeats the purpose of having two states. Instead, we can follow a similar pattern to the cell state and <em>learn</em> which parts of the cell state, i.e., long-term memory, to apply to the hidden state, i.e., working memory, using a new gate called the <strong>output gate</strong> $o_t$ that we element-wise multiply with the (squashed) current cell state.</p>

\[\begin{align*}
f_t &amp;= \sigma(W^{(f)}x_t + U^{(f)}h_{t-1} + b^{(f)})\\
g_t &amp;= \tanh(W^{(g)}x_t + U^{(g)}h_{t-1} + b^{(g)})\\
i_t &amp;= \sigma(W^{(i)}x_t + U^{(i)}h_{t-1} + b^{(i)})\\
o_t &amp;= \sigma(W^{(o)}x_t + U^{(o)}h_{t-1} + b^{(o)})\\\\
C_t &amp;= f_t\odot C_{t-1} + i_t\odot g_t\\
h_t &amp;= o_t\odot\tanh(C_t)\\
\end{align*}\]

<p>Intuitively, the model will <em>learn</em> which parts of the long-term memory to put into the working memory! Note that we pass the current cell state through a $\tanh$ layer so that the hidden state is still bounded in the same range of $(-1, 1)$ except we use the new output gate to determine which components of the cell state to write to the hidden state.</p>

<p>With some hand-waving and intuition, we’ve created the <strong>Long Short-term Memory (LSTM)</strong> cell!</p>

<p><img src="/images/rnns/lstm.svg" alt="Long Short-term Memory" title="Long Short-term Memory" /></p>

<p><small>A <strong>Long Short-term Memory</strong> cell has a cell state that’s propagated forward and written to in a more intentional way than in vanilla RNNs. It features four gates: forget, candidate, input, and output. The forget gate determines which parts of the cell state to forget; the candidate and input gates determine what to write to the cell state and which parts to write to; finally, the output gate determines which parts of the cell state are written to the hidden state.</small></p>

<p>In addition to the LSTM cell equations, we also have an output, so the full set of equations becomes the following.</p>

\[\begin{align*}
f_t &amp;= \sigma(W^{(f)}x_t + U^{(f)}h_{t-1} + b^{(f)})\\
g_t &amp;= \tanh(W^{(g)}x_t + U^{(g)}h_{t-1} + b^{(g)})\\
i_t &amp;= \sigma(W^{(i)}x_t + U^{(i)}h_{t-1} + b^{(i)})\\
o_t &amp;= \sigma(W^{(o)}x_t + U^{(o)}h_{t-1} + b^{(o)})\\\\
C_t &amp;= f_t\odot C_{t-1} + i_t\odot g_t\\
h_t &amp;= o_t\odot\tanh(C_t)\\
\hat{y_t} &amp;= Vh_t + b^{(y)}
\end{align*}\]
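<p>The full set of equations translates almost line-for-line into a single-step sketch (the sizes and random weights are illustrative placeholders):</p>

```python
import torch

torch.manual_seed(0)
in_size, hidden_size = 3, 4  # illustrative sizes

# one (W, U, b) triple per gate: forget, candidate, input, output
params = {name: (torch.randn(hidden_size, in_size),
                 torch.randn(hidden_size, hidden_size),
                 torch.zeros(hidden_size)) for name in 'fgio'}

def lstm_step(x_t, h_prev, C_prev):
    def gate(name, act):
        W, U, b = params[name]
        return act(W @ x_t + U @ h_prev + b)
    f = gate('f', torch.sigmoid)  # which parts of the old cell state to keep
    g = gate('g', torch.tanh)     # candidate values to write
    i = gate('i', torch.sigmoid)  # which candidate values to write
    o = gate('o', torch.sigmoid)  # which parts of the cell state reach the hidden state
    C_t = f * C_prev + i * g
    h_t = o * torch.tanh(C_t)
    return h_t, C_t

h_t, C_t = lstm_step(torch.randn(in_size),
                     torch.zeros(hidden_size), torch.zeros(hidden_size))
```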

<p>Now that we’ve intuited an LSTM cell with all of its gates and structure, let’s look a bit more closely at the individual gates and their intents.</p>

<p>The <strong>forget gate</strong> $f$ selectively removes/forgets information in the cell state/long-term memory as a function of the current input. The activation function we use for $f_t$ is the sigmoid, to get a value in $(0, 1)$, and then we take the element-wise product of it with the cell state. A value of 0 for a component $j$ means we’ll forget the $j$th component of the cell state, and a value of 1 means we’ll remember that component of the cell state.</p>

<p>The <strong>input gate</strong>, similar to the forget gate, determines which parts of the cell state will take on new values. The <strong>candidate gate</strong> actually produces those values (specifically in the $(-1,1)$ range). We take the element-wise product of those two gate vectors as the combined result of values to add into the current cell state (after applying the forget gate).</p>

<p>The <strong>output gate</strong> $o$ is used to determine which parts of the new, updated cell state/long-term memory make it to the hidden state/short-term memory.</p>

<p>Recall that the whole purpose of redesigning the vanilla RNN cell was to avoid/mitigate the vanishing gradient problem. Do we accomplish this with the cell state? With the hidden state for plain RNNs, we were <em>multiplying</em> by a variable factor at each timestep. With the current cell state, we’re <em>adding</em> the gated candidate values to a scaled version of the previous cell state. Addition means that the gradient is copied and preserved across the addition! The only complication is the forget gate’s sigmoid activation: this will still technically attenuate the gradient, but at a much smaller rate than with plain RNNs!</p>

<p>Again, for a more mathematical treatment, check the Appendix!</p>

<p>In terms of training and sampling LSTM cells, they’re exactly the same as plain RNN cells! Everything we’ve seen about training and sampling is exactly the same! In fact, we can construct training frameworks that are agnostic to the kind of RNN that we’re training since all RNNs fundamentally operate on the same kinds of inputs and outputs even if their internal cell representations are different.</p>

<p>There are so many different flavors of RNNs and LSTMs and some of them work better than others for certain kinds of tasks. For example, there’s a different flavor called <strong>Peephole LSTMs</strong> where the gates can peek at the previous cell state in addition to the previous hidden state. There’s another very popular kind of cell called a <strong>Gated Recurrent Unit (GRU)</strong> that is functionally similar to an LSTM cell but computationally much cheaper with only two gates: (i) update and (ii) reset.</p>
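<p>For reference, the GRU’s two gates combine the candidate and hidden-state update into one step; in the convention Pytorch uses, the equations are:</p>

\[\begin{align*}
z_t &amp;= \sigma(W^{(z)}x_t + U^{(z)}h_{t-1} + b^{(z)})\\
r_t &amp;= \sigma(W^{(r)}x_t + U^{(r)}h_{t-1} + b^{(r)})\\
\tilde{h}_t &amp;= \tanh(W^{(h)}x_t + U^{(h)}(r_t\odot h_{t-1}) + b^{(h)})\\
h_t &amp;= z_t\odot h_{t-1} + (1-z_t)\odot\tilde{h}_t
\end{align*}\]

<p>The update gate $z_t$ interpolates between keeping the old hidden state and writing the candidate $\tilde{h}_t$, while the reset gate $r_t$ controls how much of the previous hidden state feeds into the candidate.</p>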

<p>Lots of people have tried a number of different things to help RNNs and I’d encourage you to experiment as well!</p>

<h1 id="training-an-rnn-language-model">Training an RNN Language Model</h1>

<p>So far we’ve done a lot of theory and maths behind RNNs and LSTMs but now it’s time to train one! Specifically, let’s train a character-based RNN and LSTM on Shakespeare’s plays and see if we can get them to generate some dialogue. There’s going to be a lot of boilerplate code to load the dataset, set up logging, and whatnot, so see the full (thoroughly-commented!) code on <a href="https://github.com/mohitd/rnns">my GitHub</a>!</p>

<p>The first thing we can do is define the vanilla RNN model in Pytorch (check out the Pytorch documentation for the APIs! They’re pretty straightforward but I’ll try to explain the more complicated bits). To do this, we’ll need to know the input embedding size, the hidden state size, and the output size. We need to define an embedding layer that maps each index into a full one-hot vector and then into the embedding space (same as the size of the vocabulary). Fortunately Pytorch has a submodule called <code class="language-plaintext highlighter-rouge">nn.Embedding</code> that does that for us! Then we’ll need three fully-connected layers: input to hidden state, hidden state to hidden state, and hidden state to output! Pytorch also has <code class="language-plaintext highlighter-rouge">nn.Linear</code> that defines a fully-connected layer with weights and a bias.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RNN</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span>
                 <span class="n">input_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
                 <span class="n">hidden_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
                 <span class="n">output_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
                 <span class="n">_num_layers</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">input_size</span> <span class="o">=</span> <span class="n">input_size</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">hidden_size</span> <span class="o">=</span> <span class="n">hidden_size</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">output_size</span> <span class="o">=</span> <span class="n">output_size</span>

        <span class="c1"># define an embedding layer to map index inputs to a learned dense vector
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">input_size</span><span class="p">,</span> <span class="n">input_size</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">i2h</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">input_size</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">h2h</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">h2o</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="n">output_size</span><span class="p">)</span>
</code></pre></div></div>

<p>We’re not going to use <code class="language-plaintext highlighter-rouge">_num_layers</code> for the vanilla RNN; we just need it to have a uniform constructor, but you’re welcome to try to implement a stacked RNN! Afterwards, we can define the forward pass function as taking in an input sequence of size <code class="language-plaintext highlighter-rouge">(seq_size, 1)</code> (the first dimension is the chunk length for truncated backpropagation through time and each entry is an index into the vocabulary) and a hidden state of size <code class="language-plaintext highlighter-rouge">(1, hidden_size)</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">h</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="c1"># initialize hidden state if none was provided
</span>        <span class="k">if</span> <span class="n">h</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">h</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">hidden_size</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">device</span><span class="p">)</span>

        <span class="n">seq_size</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">()</span>
        <span class="n">out</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="c1"># run each token through the RNN and collect the outputs
</span>        <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">seq_size</span><span class="p">):</span>
            <span class="n">embedding</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">t</span><span class="p">])</span>
            <span class="n">h</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">i2h</span><span class="p">(</span><span class="n">embedding</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">h2h</span><span class="p">(</span><span class="n">h</span><span class="p">))</span>
            <span class="n">o</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">h2o</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
            <span class="n">out</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">o</span><span class="p">)</span>
        <span class="n">out</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>

        <span class="c1"># detach hidden state so we can optimize over it over the sequence
</span>        <span class="k">return</span> <span class="n">out</span><span class="p">,</span> <span class="n">h</span><span class="p">.</span><span class="n">detach</span><span class="p">()</span>
</code></pre></div></div>

<p>For each timestep, we run the input through the embedding layer, compute the hidden state (re-using the variable so we propagate it forward!), and finally compute the output. Since we have a sequence, we keep a list of outputs and stack them into a Pytorch tensor. Finally, we return the sequence of outputs as well as the accumulated hidden state!</p>

<p>The LSTM variant is also fairly straightforward. The only nuance is that the “hidden state” is actually the hidden state concatenated with the cell state. We do this to abide by Pytorch conventions but there’s nothing stopping us from accepting multiple inputs and producing multiple outputs. As it turns out, Pytorch also has a (much better) implementation of RNNs and LSTMs so we can use that as well! Check the GitHub for implementation details!</p>
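<p>As a quick sketch of the built-in module (the sizes here are illustrative), <code class="language-plaintext highlighter-rouge">nn.RNN</code> consumes an entire sequence at once and returns both the per-timestep hidden states and the final one:</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_size, seq_len = 10, 16, 5  # illustrative sizes

rnn = nn.RNN(input_size=vocab_size, hidden_size=hidden_size, num_layers=1)
x = torch.randn(seq_len, 1, vocab_size)  # (seq, batch, feature)
out, h_n = rnn(x)  # out: hidden state at every timestep; h_n: just the last one
```

<p><code class="language-plaintext highlighter-rouge">nn.LSTM</code> has the same interface except the hidden state is an <code class="language-plaintext highlighter-rouge">(h_n, c_n)</code> tuple, which is the convention alluded to above.</p>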

<p>Putting aside the model implementation, now let’s see how to prepare our text corpus and the main training loop. (I’m going to omit any boilerplate or additional logic for the sake of brevity.) First thing we need to do is load the corpus and create a “vocabulary” of characters. Then we can convert each character in the corpus into an index in the vocabulary and turn it into a Pytorch tensor.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">args</span><span class="p">.</span><span class="n">corpus</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">corpus</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>

<span class="n">unique_chars</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">corpus</span><span class="p">))</span>
<span class="n">vocab_size</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">unique_chars</span><span class="p">)</span>

<span class="c1"># create mappings between chars and indices
</span><span class="n">ch_to_ix</span> <span class="o">=</span> <span class="p">{</span><span class="n">ch</span><span class="p">:</span> <span class="n">ix</span> <span class="k">for</span> <span class="n">ix</span><span class="p">,</span> <span class="n">ch</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">unique_chars</span><span class="p">)}</span>
<span class="n">ix_to_ch</span> <span class="o">=</span> <span class="p">{</span><span class="n">ix</span><span class="p">:</span> <span class="n">ch</span> <span class="k">for</span> <span class="n">ix</span><span class="p">,</span> <span class="n">ch</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">unique_chars</span><span class="p">)}</span>

<span class="c1"># convert string corpus into Pytorch tensors
</span><span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="n">ch_to_ix</span><span class="p">[</span><span class="n">ch</span><span class="p">]</span> <span class="k">for</span> <span class="n">ch</span> <span class="ow">in</span> <span class="n">corpus</span><span class="p">]</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">data</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>

<span class="c1"># reshape into tensor format: num_chars x 1
</span><span class="n">data</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>

<p>In practice, for very large datasets, we generally can’t load everything into memory at once; instead, we stream data to the model and keep only a buffer in memory. It’s a bit slower than loading everything into memory, but it means we can train on very large datasets! Now we can create our model and define our optimizer (Adam) and loss function (categorical cross-entropy).</p>
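<p>(As an aside, here’s a minimal sketch of that streaming idea — a hypothetical helper, not code from the repo: read the corpus in fixed-size chunks with a generator so that only one buffer lives in memory at a time.)</p>

```python
def stream_chunks(f, chunk_size=1024):
    """Yield fixed-size text chunks from a file-like object.

    Only one chunk is held in memory at a time, so this works
    even when the corpus is far larger than RAM.
    """
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        yield chunk
```

<p>Each yielded chunk can then be encoded to indices and fed to the model before the next chunk is read.</p>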

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create model
</span><span class="n">model_init</span> <span class="o">=</span> <span class="n">get_model_type</span><span class="p">(</span><span class="n">args</span><span class="p">.</span><span class="n">model_arch</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model_init</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">args</span><span class="p">.</span><span class="n">hidden_size</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">,</span> <span class="n">args</span><span class="p">.</span><span class="n">num_layers</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>

<span class="c1"># create loss function and optimizer
</span><span class="n">criterion</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">learning_rate</span><span class="p">)</span>
</code></pre></div></div>

<p>Since the model architecture is an input argument, we do a quick mapping of the string to a Python class (<code class="language-plaintext highlighter-rouge">get_model_type</code>) and instantiate it via <code class="language-plaintext highlighter-rouge">model_init</code> (note that all model classes share the same constructor signature for this reason). Pytorch’s <code class="language-plaintext highlighter-rouge">nn.CrossEntropyLoss</code> handles the normalization for us, so we don’t need an explicit softmax operation.</p>
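<p>The mapping itself can be as simple as a dictionary from architecture name to class. Here’s a minimal sketch of that idea — the class names and registry contents below are illustrative assumptions, not the repo’s actual definitions:</p>

```python
# hypothetical model classes standing in for the real ones;
# all share the same constructor signature so they're interchangeable
class VanillaRNN:
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        self.input_size = input_size

class LSTMModel:
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        self.input_size = input_size

_MODEL_REGISTRY = {'rnn': VanillaRNN, 'lstm': LSTMModel}

def get_model_type(model_arch: str):
    """Map an architecture string (e.g., from argparse) to its model class."""
    try:
        return _MODEL_REGISTRY[model_arch.lower()]
    except KeyError:
        valid = ', '.join(sorted(_MODEL_REGISTRY))
        raise ValueError(f"unknown model arch '{model_arch}'; expected one of: {valid}")
```

<p>Failing loudly on an unknown name keeps a typo in the command-line argument from surfacing later as a confusing error.</p>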

<p>Now we can get into the main training loop over the number of epochs. Because of backpropagation through time, we backpropagate over a fixed sequence size rather than the entire sequence, so we iterate over chunks of that size and extract the source and target sequences for each chunk. Remember that the target sequence is the source sequence offset by one character.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">args</span><span class="p">.</span><span class="n">num_epochs</span><span class="p">):</span>
    <span class="n">epoch_loss</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">hidden_state</span> <span class="o">=</span> <span class="bp">None</span>

    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">),</span> <span class="n">args</span><span class="p">.</span><span class="n">sequence_size</span><span class="p">):</span>
        <span class="c1"># extract source and target sequences of len sequence_size
</span>        <span class="n">source</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="n">args</span><span class="p">.</span><span class="n">sequence_size</span><span class="p">]</span>
        <span class="c1"># target sequence is offset by 1 char
</span>        <span class="n">target</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="n">args</span><span class="p">.</span><span class="n">sequence_size</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>

</code></pre></div></div>
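<p>To make the one-character offset concrete, here’s what the source/target split looks like on a toy string (plain Python strings instead of tensors, to keep it self-contained):</p>

```python
corpus = "hello world"
sequence_size = 4

# chunk starting at i = 0
i = 0
source = corpus[i:i + sequence_size]          # 'hell'
target = corpus[i + 1:i + sequence_size + 1]  # 'ello'

# each source character's label is the character that follows it
pairs = list(zip(source, target))  # [('h','e'), ('e','l'), ('l','l'), ('l','o')]
```

<p>So at every position the model is trained to predict the very next character.</p>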

<p>Now it’s as simple as running the source (along with the hidden state) through our model, computing the loss against the target, and backpropagating! Remember to clip the gradients <em>after</em> the backward pass but before the optimizer step to keep them from exploding!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># run source (and hidden state) through model and compute loss of target set
</span><span class="n">output</span><span class="p">,</span> <span class="n">hidden_state</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">source</span><span class="p">,</span> <span class="n">hidden_state</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span><span class="n">output</span><span class="p">),</span> <span class="n">torch</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span><span class="n">target</span><span class="p">))</span>

<span class="c1"># compute gradients
</span><span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>

<span class="c1"># clip the gradient to prevent exploding gradient!
</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">args</span><span class="p">.</span><span class="n">clip_grad</span><span class="p">)</span>

<span class="c1"># update parameters
</span><span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
</code></pre></div></div>
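<p>One subtlety worth flagging (my addition, not something the snippet above handles): because the hidden state is carried across chunks, autograd’s graph would otherwise keep growing across the whole epoch, and backpropagating through an already-freed part of the graph raises an error. A common fix is to detach the recurrent state between chunks, something like:</p>

```python
import torch

def detach_hidden(hidden):
    """Detach the recurrent state from the autograd graph so that
    backpropagation stops at the chunk boundary (truncated BPTT)."""
    if hidden is None:
        return None
    if isinstance(hidden, tuple):  # LSTMs carry (h, c)
        return tuple(h.detach() for h in hidden)
    return hidden.detach()

# inside the chunk loop, before the forward pass:
# hidden_state = detach_hidden(hidden_state)
```

<p>The detached tensors keep their values (so the recurrent context survives) but stop gradient flow at the chunk boundary.</p>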

<p>At each epoch, we can also sample from our model to see how it improves as it trains.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># sample output every epoch
</span><span class="n">sampled_output</span> <span class="o">=</span> <span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">ix_to_ch</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">sample</span><span class="p">(</span><span class="n">device</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">args</span><span class="p">.</span><span class="n">output_sequence_size</span><span class="p">))</span>
</code></pre></div></div>

<p>Sampling is also fairly straightforward: we start by picking a random character to start the sequence (or we could select a seed input). Then we run the input and hidden state through the model, normalize the output into a probability distribution over the characters, and sample from that distribution. Remember to set the input to the sampled output so it feeds into the next iteration!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">sample</span><span class="p">(</span><span class="n">device</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">,</span>
           <span class="n">model</span><span class="p">:</span> <span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">,</span>
           <span class="n">output_seq_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]:</span>
    <span class="n">hidden_state</span> <span class="o">=</span> <span class="bp">None</span>

    <span class="c1"># store output as list of indices
</span>    <span class="n">sampled_output</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="c1"># create an input tensor from a random index/character in the input set
</span>    <span class="n">random_idx</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">input_size</span><span class="p">)</span>
    <span class="n">seq</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">random_idx</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">output_seq_size</span><span class="p">):</span>
        <span class="n">output</span><span class="p">,</span> <span class="n">hidden_state</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span> <span class="n">hidden_state</span><span class="p">)</span>

        <span class="c1"># normalize output into probability distribution over all characters
</span>        <span class="n">probs</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span><span class="n">output</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">dist</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">categorical</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">probs</span><span class="p">)</span>

        <span class="c1"># sample from the distribution and append to list
</span>        <span class="n">sampled_idx</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">sample</span><span class="p">()</span>
        <span class="n">sampled_output</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">sampled_idx</span><span class="p">.</span><span class="n">item</span><span class="p">())</span>

        <span class="c1"># reset sequence to sampled char for next loop iteration
</span>        <span class="n">seq</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">sampled_idx</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">sampled_output</span>
</code></pre></div></div>
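<p>A common tweak to this sampling loop — not in the code above — is a <em>temperature</em> parameter that sharpens or flattens the distribution before sampling. Here’s a sketch using plain Python lists instead of tensors to keep it self-contained:</p>

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from raw logits after temperature scaling.

    temperature < 1 sharpens the distribution (more greedy);
    temperature > 1 flattens it (more random).
    """
    scaled = [l / temperature for l in logits]
    # numerically stable softmax: subtract the max before exponentiating
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling from the categorical distribution
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i
    return len(probs) - 1
```

<p>Low temperatures make the generated text more repetitive but coherent; high temperatures make it more varied but noisier.</p>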

<p>In the GitHub repo, I’ve also put some logs and pre-trained models for our custom vanilla RNN as well as the Pytorch LSTM, both trained on the Shakespeare corpus for 32 epochs. We can see that, even for the vanilla RNN, within the first epoch, it starts to learn a bit about the structure of how plays are written and even starts to get names right!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hatbetts,
Well by you shokseecing.

ANTONIO:
Son, wrworn, speak your fore them.

SEBASTIAN:
Witholers ndvery backs.

ASTONSON:
W
</code></pre></div></div>

<p>The LSTM at the first epoch does a bit better since it also remembers words.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ry a poor tinis
Would tisle bechosh attein, and I,
My father, and risun.

ANTONIO:
You boumon manicable.

ANTONIO:
All old thou 
</code></pre></div></div>

<p>In the later epochs, we start to get some better results from the vanilla RNN.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ALONSO:
Of the so speanfelty,
I do my should shipt yould your and ateal,--
I pind, adve yeny youbt sones in you so.

SEBASTIAN:
</code></pre></div></div>

<p>The LSTM does even better.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GONZALO:
'Tis incapress to us in actions.

ANTONO:
Therefore I will not
Some pillaria, but what again, woul
</code></pre></div></div>

<p>See the full code, logs, and pre-trained models on <a href="https://github.com/mohitd/rnns">my GitHub</a>! Try training your own RNN on your own corpus or generating text using the pre-trained models!</p>

<h1 id="conclusion">Conclusion</h1>

<p>In this post, we graduated from n-grams and plain neural networks to recurrent neural networks (RNNs) that can handle arbitrary-length sequences and more faithfully model the sequential nature of language! We discussed how to train them and how to sample from them. We also saw a variant that runs the sequence in reverse and combines the hidden states (bidirectional RNNs) as well as one that stacks layers deep like plain neural networks (stacked RNNs). Vanishing and exploding gradients are the primary issue with RNNs, and that motivated the long short-term memory (LSTM) cell to help address the vanishing gradient problem. Finally, we saw how to train RNNs using Pytorch and looked at some example outputs during and near the end of training!</p>

<p>In the next post, we’ll finally get to a state-of-the-art language model called a Transformer, the very same ones used by many different Large Language Models (LLMs) such as OpenAI’s ChatGPT and Anthropic’s Claude! 🙂</p>

<h1 id="appendix">Appendix</h1>
<h2 id="vanishingexploding-gradient-in-plain-and-lstm-cells">Vanishing/Exploding Gradient in Plain and LSTM Cells</h2>

<p>To better mathematically see how vanishing and exploding gradient appear, we have to derive the backpropagation equations for the gradient from the RNN equations. Specifically, we have to compute the derivative of the loss function with respect to the hidden state weight matrix $\frac{\p L}{\p U}$ since it’s the main parameter used by the hidden states.</p>

<p>We’ll be accumulating the total gradient as we move backward so we’ll start with $\frac{\p L}{\p U}$ and then expand into smaller pieces using the chain rule of calculus.</p>

<p>For any RNN, the total loss is the sum of the individual losses at each timestep.</p>

\[L = \sum_t L_t\]

<p>Since this simply sums over all of the individual losses, the gradient is just copied; to help focus, let’s ignore the top-level sum and just worry about a particular $\frac{\p L_t}{\p U}$, knowing that we can sum over $t$ to get the total gradient. For each individual loss, we’re using categorical cross-entropy with the true “next word” $y_t$ and the predicted one from the model $\hat{y_t}$.</p>

\[L_t = L_{\text{CE}}(y_t, \hat{y_t})\]

<p>Instead of getting right into a generic solution, let’s try to compute it by hand for a small sequence of size three.</p>

<p><img src="/images/rnns/rnn-bp-weights.svg" alt="BPTT for an RNN" title="BPTT for an RNN" /></p>

<p><small>In a toy example of an RNN with three timesteps, the green arrows show the gradient moving backwards; on the arrows are the local gradients! To compute a derivative with respect to a parameter, we multiply the local gradients along all paths behind the current timesteps to the target parameter and sum them up. For example, $\frac{\p L_1}{\p U}=\frac{\p L_1}{\p \hat{y_1}}\frac{\p \hat{y_1}}{\p h_1}\frac{\p h_1}{\p U}$.</small></p>

<p>Following the local gradient, the loss at the first timestep is the easiest since there are no previous timesteps to apply to.</p>

\[\frac{\p L_1}{\p U}=\frac{\p L_1}{\p \hat{y_1}}\frac{\p \hat{y_1}}{\p h_1}\bigg( \frac{\p h_1}{\p U} \bigg)\]

<p>Pretty straightforward! Now let’s look at the loss at the second timestep where we have the loss at the second timestep as well as the one from the first timestep.</p>

\[\frac{\p L_2}{\p U}=\frac{\p L_2}{\p \hat{y_2}}\frac{\p \hat{y_2}}{\p h_2}\bigg( \frac{\p h_2}{\p U} + \frac{\p h_2}{\p h_1}\frac{\p h_1}{\p U} \bigg)\]

<p>Notice the first term in the parentheses is the same, but the second term arises since we have to backpropagate to the first timestep using $\frac{\p h_2}{\p h_1}$. Now let’s do the same for the third timestep.</p>

\[\frac{\p L_3}{\p U}=\frac{\p L_3}{\p \hat{y_3}}\frac{\p \hat{y_3}}{\p h_3}\bigg( \frac{\p h_3}{\p U} + \frac{\p h_3}{\p h_2}\frac{\p h_2}{\p U} + \frac{\p h_3}{\p h_2}\frac{\p h_2}{\p h_1}\frac{\p h_1}{\p U} \bigg)\]

<p>See the pattern? At a particular timestep $t$, we backpropagate to each earlier timestep using the hidden states and then, after we get to a timestep, we backpropagate a little bit into the weight matrix.</p>

<p>Now that we’ve seen an example, let’s go back and try to formulate this more generically across an arbitrary number of timesteps.</p>

\[\frac{\p L_t}{\p U} = \frac{\p L_t}{\p \hat{y_t}}\frac{\p \hat{y_t}}{\p U}\]

<p>(I’m abusing some notation since taking the derivative with respect to a matrix is technically undefined.) To compute $\hat{y_t}$, we use the output weight matrix and bias applied to the hidden state at timestep $t$.</p>

\[\hat{y_t} = Vh_t + b^{(y)}\]

<p>To get to the hidden state, we have to backpropagate through the output weight matrix.</p>

\[\frac{\p L_t}{\p U}=\frac{\p L_t}{\p \hat{y_t}}\frac{\p \hat{y_t}}{\p h_t}\frac{\p h_t}{\p U}\]

<p>We have to break down $\frac{\p h_t}{\p U}$ carefully since we’re applying the hidden state weight matrix $U$ at each of the previous timesteps. Consider the diagram: we have multiple gradients going into $U$ so we have to sum over them. Getting the gradient at $t$ is straightforward, but what about the earlier timesteps? They also depend on $U$ since it’s the same one we use for all timesteps! We can move the gradient backwards through the hidden states since $h_t$ depends on $h_{t-1}$ and $h_{t-1}$ depends on $h_{t-2}$ and so on. So $\frac{\p h_t}{\p U}$  really expands into a sum over all of the previous timesteps.</p>

\[\frac{\p L_t}{\p U}=\frac{\p L_t}{\p \hat{y_t}}\frac{\p \hat{y_t}}{\p h_t}\sum_{k=1}^t \frac{\p h_t}{\p h_k}\frac{\p h_k}{\p U}\]

<p>The first term in the sum moves the gradient back to earlier timesteps while the second term backpropagates into the hidden state weight matrix $U$. And we sum over all of the previous timesteps in the sequence up to timestep $t$.</p>

<p>However, each term $\frac{\p h_t}{\p h_k}$ can be expanded out using the chain rule into a product! For example, if we’re at timestep $t$ trying to go back to some timestep $t-3$, then we need to go back through the hidden states at $t-1$, $t-2$, and finally $t-3$ so the product looks like $\frac{\p h_t}{\p h_{t-1}}\frac{\p h_{t-1}}{\p h_{t-2}}\frac{\p h_{t-2}}{\p h_{t-3}}$. So we can expand $\frac{\p h_t}{\p h_k}$ into a product.</p>

\[\frac{\p L_t}{\p U}=\frac{\p L_t}{\p \hat{y_t}}\frac{\p \hat{y_t}}{\p h_t}\sum_{k=1}^t \bigg(\prod_{j=k+1}^{t} \frac{\p h_j}{\p h_{j-1}}\bigg) \frac{\p h_k}{\p U}\]

<p>This is the full gradient $\frac{\p L_t}{\p U}$!</p>

<p>Now that we have the gradient of the hidden state weight matrix, we can finally investigate the vanishing and exploding gradient problems! Since both of these problems occur with the gradient moving backwards to the earlier timesteps, the core of the issue lies in the product term, specifically the partial derivative of the next hidden state with the previous one.</p>

\[\prod_{j=k+1}^{t} \frac{\p h_j}{\p h_{j-1}}\]

<p>Recall what we said earlier, since this is a product, if most of these terms are less than 1, then we get vanishing gradient issue. If most of these terms are greater than 1, then we get the exploding gradient issue. Let’s expand this term and investigate!</p>

\[\begin{align*}
\frac{\p h_j}{\p h_{j-1}} &amp;= \frac{\p}{\p h_{j-1}}\tanh(Wx_j + Uh_{j-1} + b^{(h)})\\
&amp;= \tanh'(Wx_j + Uh_{j-1} + b^{(h)})\frac{\p}{\p h_{j-1}}\bigg[ Wx_j + Uh_{j-1} + b^{(h)}\bigg]\\
&amp;= \tanh'(Wx_j + Uh_{j-1} + b^{(h)})U\\
\end{align*}\]

<p>Between the first two steps, we backpropagate through the $\tanh$ non-linearity and then directly take the derivative. So the culprit is $U$! We’re compounding $U$ at each timestep, which will cause our gradients to either vanish or explode. $U$ is a matrix, so it’s more difficult to reason about what kind of $U$ causes vanishing or exploding gradients. Fortunately, that part of the work has already been done for us in the Appendix of <a href="https://arxiv.org/pdf/1211.5063">“On the Difficulty of Training Recurrent Neural Networks”</a> by Pascanu, Mikolov, and Bengio: if the magnitude of the largest eigenvalue of $U$ is greater than $1$, the gradient grows exponentially fast; if it’s smaller than $1$, the gradient approaches $0$ in the limit.</p>

<p>This is the mathematical reason that we get vanishing and exploding gradients in RNNs!</p>
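<p>A quick numerical illustration of this effect — a toy scalar stand-in for the per-timestep gradient factors, not the actual matrix case:</p>

```python
def gradient_magnitude(factor, num_timesteps):
    """Multiply a per-timestep gradient factor across a sequence,
    mimicking the product term in the BPTT gradient."""
    product = 1.0
    for _ in range(num_timesteps):
        product *= factor
    return product

print(gradient_magnitude(0.9, 50))  # ~0.005: the gradient vanishes
print(gradient_magnitude(1.1, 50))  # ~117: the gradient explodes
```

<p>Even factors close to $1$ shrink or blow up exponentially over 50 timesteps, which is exactly why long sequences are so hard for plain RNNs.</p>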

<p>How do LSTMs fare? Recall that with the LSTM, we propagate the cell state forward in a similar way to the hidden state, so we can look at the partial derivative of the current cell state with respect to the previous one.</p>

\[\frac{\p c_k}{\p c_{k-1}} = \frac{\p}{\p c_{k-1}}\bigg[ f_k\odot c_{k-1} + i_k\odot g_k \bigg]\]

<p>How far back in the equation do we go? We have to keep unraveling it until all of the $c_{k-1}$s are found. Remember that $f_k$, $i_k$, and $g_k$ are all functions of $h_{k-1}$ which is a function of $c_{k-1}$! Let’s start by applying the chain rule.</p>

\[\begin{align*}
\frac{\p c_k}{\p c_{k-1}} &amp;= \frac{\p}{\p c_{k-1}}f_k\odot c_{k-1} + f_k\odot \frac{\p}{\p c_{k-1}} c_{k-1} + \frac{\p}{\p c_{k-1}}i_k\odot g_k + i_k\odot \frac{\p}{\p c_{k-1}}g_k\\
&amp;= \frac{\p f_k}{\p c_{k-1}}\odot c_{k-1} + f_k + \frac{\p i_k}{\p c_{k-1}}\odot g_k + i_k\odot \frac{\p g_k}{\p c_{k-1}}\\
&amp;= c_{k-1}\frac{\p f_k}{\p c_{k-1}} + f_k + g_k\frac{\p i_k}{\p c_{k-1}} + i_k\frac{\p g_k}{\p c_{k-1}}\\
\end{align*}\]

<p>So we have three other partial derivatives that we have to compute, one for each gate except the output gate (which we’ll encounter later). Let’s compute each one in turn.</p>

\[\begin{align*}
\frac{\p f_k}{\p c_{k-1}} &amp;= \frac{\p}{\p c_{k-1}}\sigma(W_f x_k + U_f h_{k-1} + b_f)\\
&amp;= \frac{\p}{\p c_{k-1}}\sigma(z_f)\\
&amp;= \sigma'(z_f)\frac{\p}{\p c_{k-1}}\bigg[W_f x_k + U_f h_{k-1} + b_f\bigg]\\
&amp;= \sigma'(z_f)U_f\frac{\p}{\p c_{k-1}}\bigg[h_{k-1}\bigg]\\
&amp;= \sigma'(z_f)U_f\frac{\p}{\p c_{k-1}}\bigg[o_{k-1}\odot\tanh(c_{k-1})\bigg]\\
&amp;= \sigma'(z_f)U_f o_{k-1}\frac{\p}{\p c_{k-1}}\bigg[\tanh(c_{k-1})\bigg]\\
&amp;= \sigma'(z_f)U_f o_{k-1}\tanh'(c_{k-1})\\
\end{align*}\]

<p>As it turns out, the other partial derivatives are basically the same with some constants being different so I’ll skip the derivations.</p>

\[\begin{align*}
\frac{\p i_k}{\p c_{k-1}} &amp;= \sigma'(z_i)U_i o_{k-1}\tanh'(c_{k-1})\\
\frac{\p g_k}{\p c_{k-1}} &amp;= \tanh'(z_g)U_g o_{k-1}\tanh'(c_{k-1})\\
\end{align*}\]

<p>Combining all of these together, we have the partial derivative $\frac{\p c_k}{\p c_{k-1}}$.</p>

\[\begin{align*}
\frac{\p c_k}{\p c_{k-1}} = &amp;c_{k-1}\sigma'(z_f)U_f o_{k-1}\tanh'(c_{k-1})\\
&amp;+ f_k\\
&amp;+ g_k\sigma'(z_i)U_i o_{k-1}\tanh'(c_{k-1})\\
&amp;+ i_k\tanh'(z_g)U_g o_{k-1}\tanh'(c_{k-1})
\end{align*}\]

<p>Let’s compare the product of this with the vanilla RNN side-by-side.</p>

\[\begin{align*}
\prod_{j=k+1}^{t} \frac{\p h_j}{\p h_{j-1}} &amp;= \prod_{j=k+1}^{t}\tanh'(Wx_j + Uh_{j-1} + b^{(h)})U\\
\prod_{j=k+1}^{t} \frac{\p c_j}{\p c_{j-1}} &amp;= \prod_{j=k+1}^{t}\bigg[ c_{j-1}\sigma'(z_f)U_f o_{j-1}\tanh'(c_{j-1})
+ f_j
+ g_j\sigma'(z_i)U_i o_{j-1}\tanh'(c_{j-1})
+ i_j\tanh'(z_g)U_g o_{j-1}\tanh'(c_{j-1})\bigg]
\end{align*}\]

<p>The LSTM product is very different from (and more complex than) the corresponding one for plain RNNs! The most important property is that it’s <em>additive</em>, whereas the plain RNN term was <em>multiplicative</em>. So when we multiply everything together, the vanilla RNN gives us a giant product that can explode or vanish; the LSTM, on the other hand, still contains a sum! This means there’s a much lower chance of the gradient vanishing since addition copies the gradient. Note also that the forget gate appears directly in the equation and can help adjust the gradient to prevent it from vanishing; the gate’s values are learned, so the LSTM can decide when it should prevent the gradient from vanishing.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[The advent of deep neural networks and GPUs changed the language modeling landscape. In this post, we'll exploit those same deep neural networks for the task of language modeling!]]></summary></entry><entry><title type="html">Anatomy of a Good-enough Modern CMake Project for C++ Libraries</title><link href="/cpp-cmake-proj.html" rel="alternate" type="text/html" title="Anatomy of a Good-enough Modern CMake Project for C++ Libraries" /><published>2024-04-27T00:00:00+00:00</published><updated>2024-04-27T00:00:00+00:00</updated><id>/cpp-cmake-proj</id><content type="html" xml:base="/cpp-cmake-proj.html"><![CDATA[<p>C++ is one of the most widely used programming languages in the world, powering everything from mobile apps to games to robots. Personally, I’ve used it for at least these things, but there are countless other uses of the language. Invariably, the majority of those uses will, at some point, involve writing C++ libraries and executables.</p>

<p>CMake is one of the most commonly used ways to create a set of build files to construct a library or executable. It’s a meta-build system since it does not build anything itself: it creates the files that we <em>then</em> use to build, e.g., generating Makefiles to run <code class="language-plaintext highlighter-rouge">make</code>. Learning CMake is challenging since tutorials, the official CMake documentation, and public projects tend to range from constructing the very basic “Hello World” to constructing multi-platform, multi-compiler submodular libraries. In other words, the complexity is often binary, from “let’s build this one C++ file!” to “let’s build something like Boost!” The majority of the time, I’ve found that a CMake structure somewhere in between tends to be good enough for most projects.</p>

<p>In this post, I’ll describe a good-enough C++ library project structure and CMake file that accomplishes enough to build a fairly flexible library for a client to build from scratch and use (or some automated build system to generate binaries). To concretely demonstrate this, I’ve started on a catch-all miscellaneous C++ library called <a href="https://github.com/mohitd/bagel">bagel</a>, named after an “everything bagel” that I had for breakfast that day 😄, that I’m going to be using as a C++ playground going forward.</p>

<p>I don’t intend for this to be a CMake tutorial for complete beginners; I’ll assume you have enough CMake knowledge where I won’t have to explain syntax or basic commands like <code class="language-plaintext highlighter-rouge">set</code> or <code class="language-plaintext highlighter-rouge">project</code>. The purpose of this post is to talk more about how to use that CMake knowledge to create a project structure that makes building easy and flexible.</p>

<h1 id="a-good-enough-project-structure">A Good-enough Project Structure</h1>

<p>Before getting into the CMake file, let’s describe a good-enough directory structure for a mid-sized project:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
├── CMakeLists.txt
├── LICENSE
├── Readme.md
├── cmake
│   └── Config.cmake.in
├── examples
│   ├── CMakeLists.txt
│   └── timer.cpp
├── include
│   └── bagel
│       ├── chrono
│       │   └── timer.hpp
│       └── export.hpp
├── src
│   └── chrono
│       └── timer.cpp
└── tests
    ├── CMakeLists.txt
    └── chrono
        └── test_timer.cpp

10 directories, 11 files
</code></pre></div></div>

<p>In this directory, we have a few “required” files like <code class="language-plaintext highlighter-rouge">Readme.md</code> and <code class="language-plaintext highlighter-rouge">LICENSE</code> that provide an overall description of the library (among many other things) as well as the legal software license it falls under. Oftentimes, open-source libraries have more files like <code class="language-plaintext highlighter-rouge">CONTRIBUTING.md</code> and <code class="language-plaintext highlighter-rouge">AUTHORS.md</code> that explain how to contribute to the library and list the core authors of the library, respectively.</p>

<p>The crux of building the library is in the <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> which is the CMake file that’s used by the CMake executable to write the Makefiles used to actually build this library; it contains the actual library definition including things like compile options and where to install the headers and whatnot. When we run the CMake command in a directory like <code class="language-plaintext highlighter-rouge">cmake .</code>, it will search for <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> in that directory and parse and execute it. A related directory we’ll cover in the later sections is <code class="language-plaintext highlighter-rouge">cmake</code>, which tends to store auxiliary CMake files used by the root <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>.</p>

<p>The next directory <code class="language-plaintext highlighter-rouge">examples</code> contains example usage of the library with its own <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> that builds just the examples. This lets the builder control whether or not to build the examples. In this case, <code class="language-plaintext highlighter-rouge">examples</code> is flat, but it could be more hierarchical for a larger library. We’ll get to this definition later as well. The <code class="language-plaintext highlighter-rouge">tests</code> directory contains our tests for the library and its own <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>, for the same reason as the <code class="language-plaintext highlighter-rouge">examples</code> directory. We use GoogleTest to validate our library, but any testing framework will do. I’d highly recommend writing tests for your libraries since they give users credibility and confidence that your library actually does what it intends to do.</p>

<p>The next two directories, <code class="language-plaintext highlighter-rouge">include</code> and <code class="language-plaintext highlighter-rouge">src</code>, contain the actual content of our library. In the case of <code class="language-plaintext highlighter-rouge">include</code>, we have some subdirectories, the main one being the name of the library, <code class="language-plaintext highlighter-rouge">bagel</code>. Then we have subdirectories for subcomponents like <code class="language-plaintext highlighter-rouge">chrono</code>. The reason we use a subdirectory <code class="language-plaintext highlighter-rouge">bagel</code> with the same name as the project is so that, when we install the header files (e.g., to a place like <code class="language-plaintext highlighter-rouge">/usr/local/include</code> on a Linux system), headers like <code class="language-plaintext highlighter-rouge">timer.hpp</code> are prefixed by the library folder and can’t overwrite some other library’s file that happens to also be named <code class="language-plaintext highlighter-rouge">timer.hpp</code>.</p>
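<p>Concretely, the layout described above looks roughly like the following tree (showing only the files and directories mentioned so far):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bagel/
├── CMakeLists.txt
├── LICENSE
├── Readme.md
├── cmake/
├── examples/
│   └── CMakeLists.txt
├── include/
│   └── bagel/
│       └── chrono/
│           └── timer.hpp
├── src/
│   └── chrono/
│       └── timer.cpp
└── tests/
    └── CMakeLists.txt
</code></pre></div></div>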

<p>We’ll see most of these directories play a part in the project-level <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>. The focus of this post is on the CMake required to build our library, not on what the library itself actually does, so we won’t talk much about <em>what</em> <code class="language-plaintext highlighter-rouge">timer.hpp</code>/<code class="language-plaintext highlighter-rouge">timer.cpp</code> contain. The contents aren’t as important as how we <em>build</em> them into a library.</p>

<h1 id="anatomy-of-a-good-enough-cmakeliststxt">Anatomy of a Good-enough CMakeLists.txt</h1>

<p>Building a project starts with the <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> file that defines the project, build artifacts, and other options. I like to divide the CMake into several larger sections:</p>

<ol>
  <li><strong>Preamble</strong>: define the entire CMake project as a whole.</li>
  <li><strong>Configuration</strong>: check any project-level variables and configure building examples and tests</li>
  <li><strong>Build</strong>: define the library and its associated source files, compile options, versions, and other properties</li>
  <li><strong>Install</strong>: configure where to install the library and headers</li>
  <li><strong>Extra stuff</strong>: recurse into directories for tests and examples as well as build documentation</li>
</ol>

<h2 id="preamble">Preamble</h2>

<p>The preamble defines the minimum CMake binary version as well as defines the project.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cmake_minimum_required</span><span class="p">(</span>VERSION 3.14<span class="p">)</span>
<span class="nb">project</span><span class="p">(</span>bagel
    VERSION 0.1.0
    DESCRIPTION <span class="s2">"An everything bagel of C++"</span>
    LANGUAGES CXX<span class="p">)</span>
</code></pre></div></div>

<p>In general, using a too-recent version of CMake can make it difficult for developers to use your library since not everyone might be able to use the latest version of CMake, especially in industry where upgrades to newer build tools can be very slow. For the versioning, <a href="https://semver.org/">semantic versioning</a> is usually a popular choice.</p>

<h2 id="configuration">Configuration</h2>

<p>After defining the root CMake project, we define some project-level configurations and check some variables. One of the first choices we give builders is whether to build our code into a shared or a static library. A shared library (also called a shared object, hence the <code class="language-plaintext highlighter-rouge">.so</code> file extension) is dynamically loaded into an executable at runtime; this keeps the executable smaller, but, since the library is loaded at runtime, the executable requires the shared library to be in the right place in the filesystem, otherwise the executable fails when you run it. On the other hand, a static library (file extension <code class="language-plaintext highlighter-rouge">.a</code>, for archive) is built <em>into</em> an executable at compile-time; this makes the executable larger but also self-sufficient.</p>

<p>CMake lets the builder specify which kind of library to build via a built-in variable called <code class="language-plaintext highlighter-rouge">BUILD_SHARED_LIBS</code>. However, since this variable applies to all CMake libraries in a build and is coupled to other CMake behavior, oftentimes we provide a project-specific override usually called something like <code class="language-plaintext highlighter-rouge">${PROJECT_NAME}_SHARED_LIBS</code>. If that is defined, we use it; otherwise, we default to whatever <code class="language-plaintext highlighter-rouge">BUILD_SHARED_LIBS</code> decides. The overall default is to build static libraries.</p>

<p>One nuance is that we want the variable to be named <code class="language-plaintext highlighter-rouge">BAGEL_SHARED_LIBS</code>, not <code class="language-plaintext highlighter-rouge">bagel_SHARED_LIBS</code>, for consistency, so we’ll define an <code class="language-plaintext highlighter-rouge">${UPPER_PROJECT_NAME}</code> variable that’s just <code class="language-plaintext highlighter-rouge">${PROJECT_NAME}</code> uppercased.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">set</span><span class="p">(</span>namespace <span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="p">)</span>
<span class="nb">string</span><span class="p">(</span>TOUPPER <span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> UPPER_PROJECT_NAME<span class="p">)</span>

<span class="nb">message</span><span class="p">(</span>CHECK_START <span class="s2">"Checking </span><span class="si">${</span><span class="nv">UPPER_PROJECT_NAME</span><span class="si">}</span><span class="s2">_SHARED_LIBS"</span><span class="p">)</span>
<span class="nb">if</span><span class="p">(</span>DEFINED <span class="si">${</span><span class="nv">UPPER_PROJECT_NAME</span><span class="si">}</span>_SHARED_LIBS<span class="p">)</span>
    <span class="nb">set</span><span class="p">(</span>BUILD_SHARED_LIBS <span class="si">${${</span><span class="nv">UPPER_PROJECT_NAME</span><span class="si">}</span><span class="nv">_SHARED_LIBS</span><span class="si">}</span><span class="p">)</span>
    <span class="nb">message</span><span class="p">(</span>CHECK_PASS <span class="s2">"</span><span class="si">${${</span><span class="nv">UPPER_PROJECT_NAME</span><span class="si">}</span><span class="nv">_SHARED_LIBS</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">else</span><span class="p">()</span>
    <span class="nb">message</span><span class="p">(</span>CHECK_FAIL <span class="s2">"</span><span class="si">${</span><span class="nv">BUILD_SHARED_LIBS</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">endif</span><span class="p">()</span>

<span class="nb">message</span><span class="p">(</span>CHECK_START <span class="s2">"Building shared libraries"</span><span class="p">)</span>
<span class="nb">if</span><span class="p">(</span>BUILD_SHARED_LIBS<span class="p">)</span>
    <span class="nb">message</span><span class="p">(</span>CHECK_PASS <span class="s2">"yes"</span><span class="p">)</span>
<span class="nb">else</span><span class="p">()</span>
    <span class="nb">message</span><span class="p">(</span>CHECK_FAIL <span class="s2">"no"</span><span class="p">)</span>
<span class="nb">endif</span><span class="p">()</span>
</code></pre></div></div>

<p>We’re also defining a <code class="language-plaintext highlighter-rouge">${namespace}</code> variable that we’ll use later. To write things to the screen, we use the <code class="language-plaintext highlighter-rouge">message</code> command with the <code class="language-plaintext highlighter-rouge">CHECK_START</code>, <code class="language-plaintext highlighter-rouge">CHECK_PASS</code>, and <code class="language-plaintext highlighter-rouge">CHECK_FAIL</code> modes so that CMake formats our messages nicely, like the following.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[cmake] -- Checking BAGEL_SHARED_LIBS
[cmake] -- Checking BAGEL_SHARED_LIBS - ON
[cmake] -- Building shared libraries
[cmake] -- Building shared libraries - yes
</code></pre></div></div>

<p>In CMake, as in bash, there’s a difference between a variable being defined and a variable holding a truthy value. We first check if the variable <code class="language-plaintext highlighter-rouge">${UPPER_PROJECT_NAME}_SHARED_LIBS</code> is defined. Note that we don’t wrap the entire expression in <code class="language-plaintext highlighter-rouge">${}</code> since we’re not checking whether the variable’s contents name another variable that exists; we want to check if the variable itself exists. If it is defined, we override the value of <code class="language-plaintext highlighter-rouge">BUILD_SHARED_LIBS</code> with it; otherwise, we fall back to <code class="language-plaintext highlighter-rouge">BUILD_SHARED_LIBS</code>. If that also isn’t defined, CMake’s default applies (building a static library).</p>
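<p>As a minimal sketch of that distinction (not part of our project’s CMake, just an illustration):</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code># FOO exists after this, but its value is empty ("false-y")
set(FOO "")
if(DEFINED FOO)
    # taken: the variable exists, regardless of its value
endif()
if(FOO)
    # not taken: the value is empty
endif()
</code></pre></div></div>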

<p>There are several ways to set these variables. The most common is on the <code class="language-plaintext highlighter-rouge">cmake</code> command line using the <code class="language-plaintext highlighter-rouge">-D</code> flag, e.g., <code class="language-plaintext highlighter-rouge">cmake -DMY_VAR=ON</code> to define <code class="language-plaintext highlighter-rouge">MY_VAR</code>.</p>
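<p>For our project, the invocations would look something like this (flag values are illustrative):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># project-specific override
cmake -DBAGEL_SHARED_LIBS=ON ..
# generic CMake-wide variable
cmake -DBUILD_SHARED_LIBS=ON ..
</code></pre></div></div>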

<p>Another common CMake configuration is the build type, which mostly controls compiler optimizations and options such as debug symbols. The most commonly used types are <code class="language-plaintext highlighter-rouge">Debug</code>, <code class="language-plaintext highlighter-rouge">Release</code>, and <code class="language-plaintext highlighter-rouge">RelWithDebInfo</code>. <code class="language-plaintext highlighter-rouge">Debug</code> applies minimal optimizations but retains debug symbols; <code class="language-plaintext highlighter-rouge">Release</code> applies the strongest optimizations but omits debug symbols, which makes stepping through the code in a debugger like gdb painful; <code class="language-plaintext highlighter-rouge">RelWithDebInfo</code> has the optimizations of <code class="language-plaintext highlighter-rouge">Release</code> but still contains debug symbols. Similar to <code class="language-plaintext highlighter-rouge">BUILD_SHARED_LIBS</code>, if <code class="language-plaintext highlighter-rouge">CMAKE_BUILD_TYPE</code> isn’t defined, we’ll default to <code class="language-plaintext highlighter-rouge">Release</code> since that’s what builders of our library will tend to use.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">if</span><span class="p">(</span>NOT DEFINED CMAKE_BUILD_TYPE<span class="p">)</span>
    <span class="nb">set</span><span class="p">(</span>CMAKE_BUILD_TYPE Release CACHE STRING <span class="s2">"Build type"</span> FORCE<span class="p">)</span>
<span class="nb">endif</span><span class="p">()</span>
<span class="nb">message</span><span class="p">(</span>STATUS <span class="s2">"Setting build type: </span><span class="si">${</span><span class="nv">CMAKE_BUILD_TYPE</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Using the <code class="language-plaintext highlighter-rouge">CACHE</code> and <code class="language-plaintext highlighter-rouge">FORCE</code> options, we write this value into the cache, overriding any stale cached value; this is fine since the user didn’t specify a <code class="language-plaintext highlighter-rouge">CMAKE_BUILD_TYPE</code> in the first place. The <code class="language-plaintext highlighter-rouge">STRING "Build type"</code> part tells CMake that <code class="language-plaintext highlighter-rouge">CMAKE_BUILD_TYPE</code> holds a string and gives it a help text.</p>

<p>Next, we set some variables for later use and define some other custom build options for building examples, tests, and documentation.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">set</span><span class="p">(</span>export_header_name <span class="s2">"export.hpp"</span><span class="p">)</span>
<span class="nb">set</span><span class="p">(</span>export_file_name <span class="s2">"</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span><span class="s2">/include/</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="s2">/</span><span class="si">${</span><span class="nv">export_header_name</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>

<span class="nb">include</span><span class="p">(</span>GNUInstallDirs<span class="p">)</span>
<span class="nb">set</span><span class="p">(</span>cmake_config_dir <span class="si">${</span><span class="nv">CMAKE_INSTALL_LIBDIR</span><span class="si">}</span>/cmake/<span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="p">)</span>
<span class="nb">set</span><span class="p">(</span>build_tests <span class="si">${</span><span class="nv">UPPER_PROJECT_NAME</span><span class="si">}</span>_BUILD_TESTS<span class="p">)</span>
<span class="nb">set</span><span class="p">(</span>build_examples <span class="si">${</span><span class="nv">UPPER_PROJECT_NAME</span><span class="si">}</span>_BUILD_EXAMPLES<span class="p">)</span>
<span class="nb">set</span><span class="p">(</span>build_docs <span class="si">${</span><span class="nv">UPPER_PROJECT_NAME</span><span class="si">}</span>_BUILD_DOCS<span class="p">)</span>

<span class="nb">option</span><span class="p">(</span><span class="si">${</span><span class="nv">build_tests</span><span class="si">}</span> <span class="s2">"Builds tests"</span> OFF<span class="p">)</span>
<span class="nb">option</span><span class="p">(</span><span class="si">${</span><span class="nv">build_examples</span><span class="si">}</span> <span class="s2">"Builds examples"</span> OFF<span class="p">)</span>
<span class="nb">option</span><span class="p">(</span><span class="si">${</span><span class="nv">build_docs</span><span class="si">}</span> <span class="s2">"Builds docs"</span> OFF<span class="p">)</span>
</code></pre></div></div>

<p>We use a few CMake variables:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">${CMAKE_CURRENT_SOURCE_DIR}</code>: the directory currently being processed by CMake; since our library is itself the top-level CMake project, this refers to the root of the project (as it usually does for single-project CMake setups).</li>
  <li><code class="language-plaintext highlighter-rouge">${CMAKE_INSTALL_LIBDIR}</code>: the install directory for libraries; in Linux systems, this is usually called <code class="language-plaintext highlighter-rouge">lib</code> (or sometimes <code class="language-plaintext highlighter-rouge">lib32</code> and <code class="language-plaintext highlighter-rouge">lib64</code>). Note that the install prefix is prepended to this folder. Since we used <code class="language-plaintext highlighter-rouge">include(GNUInstallDirs)</code> earlier, it will set this folder correctly for us.</li>
</ul>

<p>We’ll discuss the export header and config directory later.</p>

<h2 id="build">Build</h2>

<p>Now we’re getting into actually building the library. First thing we’ll do is define the library itself and an alias.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">add_library</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="p">)</span>
<span class="nb">add_library</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span>::<span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> ALIAS <span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="p">)</span>
</code></pre></div></div>

<p>The alias exists so that, if someone builds our library from source as part of their own project (e.g., via <code class="language-plaintext highlighter-rouge">add_subdirectory</code>), their <code class="language-plaintext highlighter-rouge">target_link_libraries</code> call looks the same as if they had linked against the installed package. We’re not adding any sources yet, just declaring the library’s existence. After we define the library, we also set the minimum C++ version and provide some compile-time options.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">target_compile_features</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> PUBLIC cxx_std_17<span class="p">)</span>
<span class="nb">target_compile_options</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> PRIVATE -Wall -Wextra<span class="p">)</span>
</code></pre></div></div>

<p>We use <code class="language-plaintext highlighter-rouge">PUBLIC</code> for the minimum C++ version so that it propagates to users when they link against our library. The compile options are <code class="language-plaintext highlighter-rouge">PRIVATE</code> since they’re only applicable to our library; we don’t want our decisions on warnings and errors to be propagated to all of our users!</p>
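<p>To see what this propagation means in practice, here’s a sketch of a hypothetical consumer pulling our library in via <code class="language-plaintext highlighter-rouge">add_subdirectory</code>: the C++17 requirement (<code class="language-plaintext highlighter-rouge">PUBLIC</code>) carries over to the consumer, but the warning flags (<code class="language-plaintext highlighter-rouge">PRIVATE</code>) don’t.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code># hypothetical consumer CMakeLists.txt, not part of bagel itself
add_subdirectory(bagel)   # defines bagel and the bagel::bagel alias
add_executable(app main.cpp)
target_link_libraries(app PRIVATE bagel::bagel)
# app now builds with at least C++17 (the PUBLIC compile feature),
# but -Wall/-Wextra (PRIVATE options) don't leak into app's flags
</code></pre></div></div>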

<p>C++ provides access specifiers like <code class="language-plaintext highlighter-rouge">public</code> and <code class="language-plaintext highlighter-rouge">private</code>, but when building a shared library, there’s also a notion of <em>symbol</em> visibility. For shared libraries, each class and function defines a symbol in the library’s symbol table. When you link the shared library into an executable (or another library), the linker resolves those symbols to actual memory addresses; think of the exported symbols as the binary interface your library provides (sometimes called its ABI, or Application Binary Interface). By default, <em>all</em> defined symbols (except the defined inline ones) are exported by our shared library. However, we often have internal classes or functions that we don’t want to export as part of the shared library’s interface, so it’s better to explicitly mark which symbols belong to the interface and hide all other symbols by default. The asymmetry is that static libraries have no such distinction: a static library is built into the executable in its entirety, and the linker doesn’t apply symbol visibility to it. So we have a few criteria we need to satisfy:</p>

<ol>
  <li>By default, hide all symbols</li>
  <li>Provide a mechanism to manually export symbols</li>
  <li>Ignore the export symbol mechanism for static libraries</li>
</ol>

<p>CMake handles this by generating an export header that defines a macro like <code class="language-plaintext highlighter-rouge">BAGEL_EXPORT</code>, which exports a symbol for shared libraries but becomes a no-op for static libraries.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">if</span><span class="p">(</span>NOT BUILD_SHARED_LIBS<span class="p">)</span>
    <span class="nb">target_compile_definitions</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> PUBLIC <span class="si">${</span><span class="nv">UPPER_PROJECT_NAME</span><span class="si">}</span>_STATIC_DEFINE<span class="p">)</span>
<span class="nb">endif</span><span class="p">()</span>

<span class="nb">include</span><span class="p">(</span>GenerateExportHeader<span class="p">)</span>
<span class="nf">generate_export_header</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span>
    EXPORT_FILE_NAME <span class="si">${</span><span class="nv">export_file_name</span><span class="si">}</span>
<span class="p">)</span>
</code></pre></div></div>

<p>The first part adds a macro definition <code class="language-plaintext highlighter-rouge">BAGEL_STATIC_DEFINE</code> that turns <code class="language-plaintext highlighter-rouge">BAGEL_EXPORT</code> into a no-op. The <code class="language-plaintext highlighter-rouge">generate_export_header</code> call auto-generates a header file at <code class="language-plaintext highlighter-rouge">${export_file_name}</code> that defines macros to change the visibility of a symbol. To export certain classes or functions, we include that header and place <code class="language-plaintext highlighter-rouge">BAGEL_EXPORT</code> right before the symbol name, like the following.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BAGEL_EXPORT</span> <span class="n">MyClass</span> <span class="p">{</span>
    <span class="p">...</span>
<span class="p">};</span>

<span class="kt">void</span> <span class="n">BAGEL_EXPORT</span> <span class="n">myFunc</span><span class="p">()</span> <span class="p">{</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If we inspect the symbol table of the shared library, we’ll see only those symbols exported, while the others won’t be. For a class, exporting the class exports all of its members, but the export header also defines a <code class="language-plaintext highlighter-rouge">BAGEL_NO_EXPORT</code> macro that “un-exports” an individual symbol again.</p>
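<p>As a rough sketch of what the generated <code class="language-plaintext highlighter-rouge">export.hpp</code> boils down to on GCC/Clang (the real generated header handles more compilers and edge cases, and <code class="language-plaintext highlighter-rouge">Widget</code> here is just a hypothetical class):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// simplified sketch of the header GenerateExportHeader produces
#if defined(BAGEL_STATIC_DEFINE)
#  define BAGEL_EXPORT     // static build: both macros are no-ops
#  define BAGEL_NO_EXPORT
#else
#  define BAGEL_EXPORT    __attribute__((visibility("default")))
#  define BAGEL_NO_EXPORT __attribute__((visibility("hidden")))
#endif

// export the class as a whole, but hide one internal helper again
class BAGEL_EXPORT Widget {
public:
    void publicApi();
private:
    BAGEL_NO_EXPORT void internalHelper();
};
</code></pre></div></div>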

<p>The last thing we need to do is to disable exporting all symbols by default.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">if</span><span class="p">(</span>NOT DEFINED CMAKE_CXX_VISIBILITY_PRESET<span class="p">)</span>
    <span class="nb">set_target_properties</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> PROPERTIES
        CXX_VISIBILITY_PRESET hidden
    <span class="p">)</span>
<span class="nb">endif</span><span class="p">()</span>
<span class="nb">if</span><span class="p">(</span>NOT DEFINED CMAKE_VISIBILITY_INLINES_HIDDEN<span class="p">)</span>
    <span class="nb">set_target_properties</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> PROPERTIES
        VISIBILITY_INLINES_HIDDEN ON
    <span class="p">)</span>
<span class="nb">endif</span><span class="p">()</span>
</code></pre></div></div>
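<p>One way to sanity-check the visibility setup is to inspect the resulting symbol tables, e.g., with <code class="language-plaintext highlighter-rouge">nm</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># demangled (-C) dynamic symbols (-D): only the exported interface
nm -C -D libbagel.so
# a static archive keeps all symbols regardless of visibility
nm -C libbagel.a
</code></pre></div></div>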

<p>That finishes our symbol exporting stuff. Moving on, one minor thing we’ll do is set our library’s version based on what we passed to the <code class="language-plaintext highlighter-rouge">project()</code> command.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">set_target_properties</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> PROPERTIES
    SOVERSION <span class="si">${</span><span class="nv">PROJECT_VERSION_MAJOR</span><span class="si">}</span>
    VERSION <span class="si">${</span><span class="nv">PROJECT_VERSION</span><span class="si">}</span>
<span class="p">)</span>
</code></pre></div></div>

<p>After all of that, we’re finally ready to actually add header and source files.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">target_include_directories</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span>
    PRIVATE
        <span class="s2">"</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span><span class="s2">/src"</span>
    PUBLIC
        <span class="s2">"$&lt;BUILD_INTERFACE:</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span><span class="s2">/include&gt;"</span>
        <span class="s2">"$&lt;INSTALL_INTERFACE:</span><span class="si">${</span><span class="nv">CMAKE_INSTALL_INCLUDEDIR</span><span class="si">}</span><span class="s2">&gt;"</span>
<span class="p">)</span>
<span class="nb">target_sources</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> PRIVATE
    src/chrono/timer.cpp<span class="p">)</span>
</code></pre></div></div>

<p>We use <code class="language-plaintext highlighter-rouge">target_include_directories</code> to add headers to our library. The <code class="language-plaintext highlighter-rouge">PRIVATE</code> part means that only our source files in <code class="language-plaintext highlighter-rouge">src</code> can access the headers in the <code class="language-plaintext highlighter-rouge">src</code> directory; external users can’t (since those headers are meant for internal library use only). For the <code class="language-plaintext highlighter-rouge">PUBLIC</code> part, we use CMake generator expressions to specify a build and an install interface: when building the library, we use the headers in the <code class="language-plaintext highlighter-rouge">include</code> directory directly; users instead get the headers from wherever we’ve installed them as part of the install stage. Recall that <code class="language-plaintext highlighter-rouge">${CMAKE_INSTALL_INCLUDEDIR}</code> is just like <code class="language-plaintext highlighter-rouge">${CMAKE_INSTALL_LIBDIR}</code> but for headers instead of libraries (set to <code class="language-plaintext highlighter-rouge">include</code> by <code class="language-plaintext highlighter-rouge">GNUInstallDirs</code>).</p>

<p><code class="language-plaintext highlighter-rouge">target_sources</code> adds sources to our library and <code class="language-plaintext highlighter-rouge">PRIVATE</code> is really the only thing that makes sense here. We could also glob all source files under the <code class="language-plaintext highlighter-rouge">src</code> directory but I like to be more explicit about which source files are added to the library.</p>

<h2 id="install">Install</h2>

<p>At this point, we have our library and header files ready; we just need to install them so that users can find the library and link against it. Ideally, the user experience is as simple as possible.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">find_package</span><span class="p">(</span>bagel REQUIRED<span class="p">)</span>
<span class="nb">target_link_libraries</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> bagel::bagel<span class="p">)</span>
</code></pre></div></div>

<p>These two lines should be all that’s required to link against the installed library. So how can we accomplish this? The first thing we need to do is install the headers. There’s a <code class="language-plaintext highlighter-rouge">PUBLIC_HEADER</code> target property, but it doesn’t work so nicely for nested directory structures. I’ve found it easier to just install the entire <code class="language-plaintext highlighter-rouge">include</code> directory into the right place.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">install</span><span class="p">(</span>DIRECTORY <span class="s2">"</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span><span class="s2">/include/</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="s2">"</span>
    DESTINATION <span class="si">${</span><span class="nv">CMAKE_INSTALL_INCLUDEDIR</span><span class="si">}</span>
<span class="p">)</span>
</code></pre></div></div>

<p>This does go against my previous sentiment about being explicit about which files go into the library, but since we’ve already configured our project so that only our own sources see the headers under <code class="language-plaintext highlighter-rouge">src</code>, we still have a mechanism to keep some headers private. The next thing we need to install is the library itself, associating the headers with it.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">install</span><span class="p">(</span>TARGETS <span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span>
    EXPORT <span class="s2">"</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="s2">Targets"</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Installing the library isn’t enough: we need to create an export target for our library that describes how to find the header files and library file from the target itself. We’ll use the export target we just created and generate a corresponding <code class="language-plaintext highlighter-rouge">*Targets.cmake</code> file for it. We’ll also give it a namespace; the <code class="language-plaintext highlighter-rouge">::</code> is how modern CMake knows that a particular name refers to a target rather than, say, a file on disk.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">install</span><span class="p">(</span>EXPORT <span class="s2">"</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="s2">Targets"</span>
    FILE <span class="s2">"</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="s2">Targets.cmake"</span>
    NAMESPACE <span class="si">${</span><span class="nv">namespace</span><span class="si">}</span>::
    DESTINATION <span class="si">${</span><span class="nv">cmake_config_dir</span><span class="si">}</span>
<span class="p">)</span>
</code></pre></div></div>

<p>We’ll get to why we’re installing this into <code class="language-plaintext highlighter-rouge">${cmake_config_dir}</code> in just a second.</p>

<p>The last thing we need is to write a package config so that <code class="language-plaintext highlighter-rouge">find_package</code> in a client <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> can actually find our library and import the build target. The first thing we’ll do is write a version config file; there’s a helper we can use.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">include</span><span class="p">(</span>CMakePackageConfigHelpers<span class="p">)</span>
<span class="nf">write_basic_package_version_file</span><span class="p">(</span>
    <span class="s2">"</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_BINARY_DIR</span><span class="si">}</span><span class="s2">/</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="s2">ConfigVersion.cmake"</span>
    VERSION <span class="s2">"</span><span class="si">${</span><span class="nv">PROJECT_VERSION</span><span class="si">}</span><span class="s2">"</span>
    COMPATIBILITY SameMajorVersion
<span class="p">)</span>
</code></pre></div></div>
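<p>This version file is what lets a client request a minimum version. As a sketch, assuming a <code class="language-plaintext highlighter-rouge">bagel</code> 0.1.0 install, the following succeeds for any 0.x install at or above 0.1 but rejects a hypothetical 1.x install as incompatible:</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code># hypothetical client: same major version and at least 0.1 required
find_package(bagel 0.1 REQUIRED)
</code></pre></div></div>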

<p>Recall that <code class="language-plaintext highlighter-rouge">${CMAKE_CURRENT_BINARY_DIR}</code> is the build directory; this is fine since we’ll be installing these generated files immediately anyways. We set the compatibility to <code class="language-plaintext highlighter-rouge">SameMajorVersion</code> since, under our semantic versioning scheme, there are no breaking changes within the same major version. The next thing we need to create is a config file that includes our previously-created targets file. For that, we first create a separate <code class="language-plaintext highlighter-rouge">Config.cmake.in</code>.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@PACKAGE_INIT@

<span class="nb">include</span><span class="p">(</span><span class="s2">"</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_LIST_DIR</span><span class="si">}</span><span class="s2">/@PROJECT_NAME@Targets.cmake"</span><span class="p">)</span>

<span class="nf">check_required_components</span><span class="p">(</span>@PROJECT_NAME@<span class="p">)</span>
</code></pre></div></div>

<p>Some of this is a bit esoteric, but the documentation says to ensure <code class="language-plaintext highlighter-rouge">@PACKAGE_INIT@</code> is at the start and <code class="language-plaintext highlighter-rouge">check_required_components(@PROJECT_NAME@)</code> is at the bottom. In the middle, all we have to do is include our targets file. Finally, we install both of these to the right location.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">install</span><span class="p">(</span>FILES
    <span class="s2">"</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_BINARY_DIR</span><span class="si">}</span><span class="s2">/</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="s2">Config.cmake"</span>
    <span class="s2">"</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_BINARY_DIR</span><span class="si">}</span><span class="s2">/</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="s2">ConfigVersion.cmake"</span>
    DESTINATION <span class="si">${</span><span class="nv">cmake_config_dir</span><span class="si">}</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Note that we install the package config files and the targets file to the <code class="language-plaintext highlighter-rouge">${cmake_config_dir}</code> we defined earlier. This effectively installs to a filepath like <code class="language-plaintext highlighter-rouge">lib/bagel/cmake</code> on a Linux system. This is one of the locations CMake searches when you write <code class="language-plaintext highlighter-rouge">find_package(bagel)</code>: it’ll go through the folder of each library stored in <code class="language-plaintext highlighter-rouge">lib</code> and look for a <code class="language-plaintext highlighter-rouge">cmake</code> directory. If we were to put it somewhere else, we’d get an error like the following.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CMake Error at CMakeLists.txt:6 (find_package):
  Could not find a package configuration file provided by "bagel" with any of
  the following names:

    bagelConfig.cmake
    bagel-config.cmake

  Add the installation prefix of "bagel" to CMAKE_PREFIX_PATH or set
  "bagel_DIR" to a directory containing one of the above files.  If "bagel"
  provides a separate development package or SDK, be sure it has been
  installed.
</code></pre></div></div>

<p>Alternatively, we could install this anywhere and append to the <code class="language-plaintext highlighter-rouge">CMAKE_PREFIX_PATH</code> or define a <code class="language-plaintext highlighter-rouge">bagel_DIR</code>, but it’s convenient to have the right suffix location by default so clients don’t have to do that extra step. Of course, a client could install to any prefix they like, but then it’s on them to set one of the two variables above.</p>
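<p>For instance, a client consuming the library from a custom install prefix (the path below is hypothetical) could use either approach in its <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>:</p>

```cmake
# Option 1: add the custom install prefix to CMake's package search path.
list(APPEND CMAKE_PREFIX_PATH "/opt/custom-prefix")

# Option 2: point directly at the directory containing bagelConfig.cmake.
# set(bagel_DIR "/opt/custom-prefix/lib/bagel/cmake")

find_package(bagel CONFIG REQUIRED)
```

<p>Either way, CMake ends up finding the same <code class="language-plaintext highlighter-rouge">bagelConfig.cmake</code> we installed above.</p>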

<h2 id="extra-stuff">Extra stuff</h2>

<p>At this point, we technically have everything we need for our library, but let’s also provide a way to build examples, tests, and documentation. In the project-level <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>, we just need to recurse into the lower-level <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">message</span><span class="p">(</span>CHECK_START <span class="s2">"Building tests"</span><span class="p">)</span>
<span class="nb">if</span><span class="p">(</span><span class="si">${</span><span class="nv">build_tests</span><span class="si">}</span><span class="p">)</span>
    <span class="nb">message</span><span class="p">(</span>CHECK_PASS <span class="s2">"yes"</span><span class="p">)</span>

    <span class="nb">add_subdirectory</span><span class="p">(</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span>/tests<span class="p">)</span>
<span class="nb">else</span><span class="p">()</span>
    <span class="nb">message</span><span class="p">(</span>CHECK_FAIL <span class="s2">"no"</span><span class="p">)</span>
<span class="nb">endif</span><span class="p">()</span>

<span class="nb">message</span><span class="p">(</span>CHECK_START <span class="s2">"Building examples"</span><span class="p">)</span>
<span class="nb">if</span><span class="p">(</span><span class="si">${</span><span class="nv">build_examples</span><span class="si">}</span><span class="p">)</span>
    <span class="nb">message</span><span class="p">(</span>CHECK_PASS <span class="s2">"yes"</span><span class="p">)</span>

    <span class="nb">add_subdirectory</span><span class="p">(</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span>/examples<span class="p">)</span>
<span class="nb">else</span><span class="p">()</span>
    <span class="nb">message</span><span class="p">(</span>CHECK_FAIL <span class="s2">"no"</span><span class="p">)</span>
<span class="nb">endif</span><span class="p">()</span>
</code></pre></div></div>

<p>We’ll get into those in a minute, but building documentation relies on Doxygen: we set a few CMake variables and then the <code class="language-plaintext highlighter-rouge">doxygen_add_docs</code> command generates a docs target. One additional thing we can do is create a dependency from our project to the <code class="language-plaintext highlighter-rouge">generate_docs</code> target so that, whenever we rebuild the library due to a code change, the documentation is automatically re-generated too!</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">message</span><span class="p">(</span>CHECK_START <span class="s2">"Building docs"</span><span class="p">)</span>
<span class="nb">if</span><span class="p">(</span><span class="si">${</span><span class="nv">build_docs</span><span class="si">}</span><span class="p">)</span>
    <span class="nb">message</span><span class="p">(</span>CHECK_PASS <span class="s2">"yes"</span><span class="p">)</span>

    <span class="nb">find_package</span><span class="p">(</span>Doxygen REQUIRED<span class="p">)</span>
    
    <span class="nb">set</span><span class="p">(</span>README_PATH <span class="s2">"</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span><span class="s2">/Readme.md"</span><span class="p">)</span>
    <span class="nb">set</span><span class="p">(</span>DOXYGEN_PROJECT_NAME <span class="s2">"</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="nb">set</span><span class="p">(</span>DOXYGEN_PROJECT_BRIEF <span class="s2">"</span><span class="si">${</span><span class="nv">PROJECT_DESCRIPTION</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="nb">set</span><span class="p">(</span>DOXYGEN_USE_MDFILE_AS_MAINPAGE <span class="s2">"</span><span class="si">${</span><span class="nv">README_PATH</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="nf">doxygen_add_docs</span><span class="p">(</span>generate_docs include <span class="s2">"</span><span class="si">${</span><span class="nv">README_PATH</span><span class="si">}</span><span class="s2">"</span>
        COMMENT <span class="s2">"Generating docs"</span><span class="p">)</span>
    <span class="nb">add_dependencies</span><span class="p">(</span><span class="si">${</span><span class="nv">PROJECT_NAME</span><span class="si">}</span> generate_docs<span class="p">)</span>
<span class="nb">else</span><span class="p">()</span>
    <span class="nb">message</span><span class="p">(</span>CHECK_FAIL <span class="s2">"no"</span><span class="p">)</span>
<span class="nb">endif</span><span class="p">()</span>
</code></pre></div></div>

<p>Alternatively, we could fall back to another documentation generator instead of marking Doxygen as required, but that’s a design choice.</p>

<p>The <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> in the examples folder is fairly straightforward.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cmake_minimum_required</span><span class="p">(</span>VERSION 3.16<span class="p">)</span>
<span class="nb">project</span><span class="p">(</span>bagel-examples<span class="p">)</span>

<span class="nb">add_executable</span><span class="p">(</span>timer timer.cpp<span class="p">)</span>
<span class="nb">target_link_libraries</span><span class="p">(</span>timer PRIVATE bagel::bagel<span class="p">)</span>
</code></pre></div></div>

<p>Notice how we link our example executable to our library with <code class="language-plaintext highlighter-rouge">bagel::bagel</code> using <code class="language-plaintext highlighter-rouge">PRIVATE</code>: since nothing links against an executable, there’s no need to propagate usage requirements to downstream targets.</p>

<p>Tests are slightly more complicated because of downloading and using GoogleTest, but still readable.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cmake_minimum_required</span><span class="p">(</span>VERSION 3.16<span class="p">)</span>
<span class="nb">project</span><span class="p">(</span>bagel-tests<span class="p">)</span>

<span class="nb">set</span><span class="p">(</span>INSTALL_GTEST OFF<span class="p">)</span>

<span class="nb">enable_testing</span><span class="p">()</span>

<span class="nb">include</span><span class="p">(</span>FetchContent<span class="p">)</span>
<span class="nf">FetchContent_Declare</span><span class="p">(</span>
    googletest
    URL https://github.com/google/googletest/archive/refs/tags/v1.14.0.zip
<span class="p">)</span>
<span class="nf">FetchContent_MakeAvailable</span><span class="p">(</span>googletest<span class="p">)</span>

<span class="nb">include</span><span class="p">(</span>GoogleTest<span class="p">)</span>

<span class="nb">add_executable</span><span class="p">(</span>test_timer chrono/test_timer.cpp<span class="p">)</span>
<span class="nb">target_link_libraries</span><span class="p">(</span>test_timer
    PRIVATE
        bagel::bagel
        GTest::gtest_main
<span class="p">)</span>
<span class="nf">gtest_discover_tests</span><span class="p">(</span>test_timer<span class="p">)</span>
</code></pre></div></div>

<p>Again notice how we link our library to a test binary (along with GoogleTest’s <code class="language-plaintext highlighter-rouge">gtest_main</code> library).</p>

<p>To evaluate whether we did everything correctly, I created a dummy C++ executable for testing purposes. The <code class="language-plaintext highlighter-rouge">main.cpp</code> simply includes the header and does some work.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;chrono&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;thread&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;bagel/chrono/timer.hpp&gt;</span><span class="cp">
</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono_literals</span><span class="p">;</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span><span class="o">**</span> <span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">bagel</span><span class="o">::</span><span class="n">WallTimer</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">t</span><span class="p">.</span><span class="n">start</span><span class="p">();</span>
    <span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="mx">10ms</span><span class="p">);</span>
    <span class="k">auto</span> <span class="n">elapsed</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">stop</span><span class="p">();</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">elapsed</span><span class="p">.</span><span class="n">count</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="s">"s</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We create a timer, intentionally pause the main thread for about 10ms, stop the timer, and print the elapsed time reported by the timer.</p>

<p>The <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> simply defines an executable and links against our library. Since I’ve installed the library to a custom location for development purposes, I’m manually appending the location to the <code class="language-plaintext highlighter-rouge">CMAKE_PREFIX_PATH</code>.</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cmake_minimum_required</span><span class="p">(</span>VERSION 3.14<span class="p">)</span>
<span class="nb">project</span><span class="p">(</span>hungry<span class="p">)</span>

<span class="nb">list</span><span class="p">(</span>APPEND CMAKE_PREFIX_PATH <span class="s2">"/Users/mohit/Developer/bagel/install/"</span><span class="p">)</span>

<span class="nb">find_package</span><span class="p">(</span>bagel CONFIG REQUIRED<span class="p">)</span>

<span class="nb">add_executable</span><span class="p">(</span>hungry main.cpp<span class="p">)</span>

<span class="nb">target_link_libraries</span><span class="p">(</span>hungry PRIVATE bagel::bagel<span class="p">)</span>
</code></pre></div></div>

<p>Now we can create a build directory, run cmake, build our executable, and run it!</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>build <span class="o">&amp;&amp;</span> <span class="nb">cd </span>build
cmake ..
make
./hungry
</code></pre></div></div>

<p>The output is what we expect: a value close to 10ms (a little off depending on your scheduler).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.012527s
</code></pre></div></div>

<h1 id="conclusion">Conclusion</h1>

<p>CMake is the most popular meta-build system for building C++ libraries and executables, but it’s also one of the most challenging ones to learn well. In this post, we went over a project structure and <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> for a medium-sized project with multiple subcomponents. We broke the <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> down into parts: (i) preamble, (ii) configuration, (iii) building, (iv) installing, and (v) extra stuff. In (i), we simply define the project. In (ii), we define some variables that clients can use to configure how they build our library. (iii) is where we actually build the library and set things like include directories. After building the library, (iv) is where we install it and its headers to a location where clients can easily find and link against them. Finally, (v) is where we build optional things like examples, tests, and documentation.</p>

<p>CMake can be pretty complicated to “get right” and there’s a lot of variability in how developers use CMake to write libraries and executables. Hopefully this little tutorial provides some guidance on how to structure your <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> while abiding by some best practices. If you’re working on C++ stuff, try to crystallize some of this guidance into your team or project’s standards and let me know how it goes 🙂</p>]]></content><author><name></name></author><summary type="html"><![CDATA[CMake is the most common meta-build system used for building C/C++ libraries and applications. In this post, I'll describe the anatomy of a good-enough C++ library project structure and a good-enough way to build it using CMake.]]></summary></entry><entry><title type="html">Neural Nets - Part 3: Artificial Neural Networks and Backpropagation</title><link href="/neural-nets-part-3.html" rel="alternate" type="text/html" title="Neural Nets - Part 3: Artificial Neural Networks and Backpropagation" /><published>2023-12-10T00:00:00+00:00</published><updated>2023-12-10T00:00:00+00:00</updated><id>/neural-nets-part-3</id><content type="html" xml:base="/neural-nets-part-3.html"><![CDATA[<p>In the previous post, we extended our perceptron model to a more modern artificial neuron model and learned how to train it using stochastic gradient descent (SGD) on the Iris dataset. However, we only did this for a single neuron. Practically, we’ve already shown that we can compose perceptrons together into multilayer perceptrons for much more expressive power, so can we do the same thing with our new modern artificial neurons and train them using SGD?</p>

<p>In this post, we’ll compose our modern artificial neurons together into an actual artificial neural network (ANN) and discuss how to train an ANN using the most important and fundamental algorithm in all of deep learning: backpropagation of errors. In fact, we’re going to derive this algorithm for ANNs of any width, depth, and activation and cost function! Similar to previous posts, we’ll implement a generic ANN and the backpropagation algorithm in Python code using numpy. However, this time we’ll use a more complex dataset to highlight the expressive power of a full ANN.</p>

<p><em>Disclaimer</em>: this part is going to have a lot of maths and equations since I want to properly motivate backpropagation and dispel any myths and misconceptions about backpropagation being this magical thing known only to machine learning library implementers and to combat those saying “ah just let X library take care of it; it’ll ‘just work’”. To make this understanding more accessible, I’ll have sections that summarize the high-level ideas as well as intuitive explanations for each of the core equations.</p>

<h1 id="neural-network-architecture">Neural Network Architecture</h1>

<p><img src="/images/neural-nets-part-3/2-layer-net.svg" alt="2-layer Network Architecture" title="2-layer Network Architecture" /></p>

<p><small>A two-layer network is composed by taking the output of the previous layer’s neurons and feeding them as input to each of the next layer’s neurons to create an all-to-all connection across layers. We don’t currently consider self-connections although there are network architectures, such as recurrent neural networks (RNNs), that do.</small></p>

<p>From the previous post in the series, we’re already familiar with building a perceptron network that has an intermediate hidden layer to solve the XOR gate problem. We saw that adding this hidden layer gave our model far more expressive power than a single layer. This structure or architecture, however, is general enough we can call it an <strong>artificial neural network</strong>: we have an input layer, any number of hidden layers, and an output layer. Layer to layer, we connect each neuron of the previous layer to each neuron of the next layer, forming a many-to-many connection.</p>

<p><img src="/images/neural-nets-part-3/neuron-anatomy.svg" alt="Neuron Anatomy" title="Neuron Anatomy" /></p>

<p><small>Zooming into a single neuron, we take the weighted sum of its inputs and add a bias to form the pre-activation. One way to represent a bias is a “weight” whose input is always $+1$. The weights and bias are learning parameters of the network through some learning algorithm such as gradient descent. The activation function is applied to the pre-activation to produce the neuron output. This output is fed into each of the next layer’s neurons.</small></p>

<p>To compute a value for each neuron, we take the weighted sum of its inputs plus the bias to compute a pre-activation, and then run the pre-activation through an activation function to get the actual activation/value of the neuron. Then that activation becomes an input into the next layer’s neurons. Finally, we have an output layer that computes some value that’s useful for evaluation, e.g., a number for linear regression or a class label.</p>
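<p>As a tiny numpy sketch of this per-neuron computation (all values here are made up for illustration, not taken from any figure):</p>

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example values: three inputs feeding one neuron.
x = np.array([0.5, -1.0, 2.0])   # activations from the previous layer
w = np.array([0.1, 0.4, -0.3])   # one weight per input
b = 0.2                          # bias

z = w @ x + b        # pre-activation: weighted sum of inputs plus bias
a = sigmoid(z)       # activation: the neuron's output, fed to the next layer
```

<p>The output <code class="language-plaintext highlighter-rouge">a</code> is what the next layer’s neurons would receive as one of their inputs.</p>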

<p>The intuition behind having multiple hidden layers is that it gives the network more expressive power. We saw this with the XOR gate problem: the input space wasn’t linearly separable but the hidden space was. One interpretation of these hidden layers is that they transform the input space <em>nonlinearly</em> until the output space <em>is</em> linearly separable. A complementary interpretation is that the hidden layers iteratively build complexity from the earlier layers to the later layers. It’s easier to see this with neural networks that operate on images, i.e., convolutional neural networks: the weights of the earlier layers activate on simple lines and edges while the weights of the later layers compose these to activate on shapes and more complex geometry.</p>

<p>Let’s introduce/re-introduce some notation to make talking about pre-activations, activations, weights, biases, layers, and other neural network stuff easier. I’m a fan of Michael Nielsen’s <a href="http://neuralnetworksanddeeplearning.com/chap2.html">notation</a> since I think it makes the subsequent analysis easier to understand and pattern-match against so we’ll use that going forward.</p>

<p>Let’s consider neural networks with $L$ total layers, indexed by $l\in [1, L]$. We define a <strong>bias vector</strong> $b^l$ with components $b_j^l$ for each layer $l$. The superscripts here mean layer, not exponents! Between layers, we collect all of the individual weights into a <strong>weight matrix</strong> $W^l$ with elements $W_{jk}^l$ that represent the value of the individual weight from neuron $k$ in layer $(l-1)$ to neuron $j$ in layer $l$, i.e., $W_{jk}^l$ is the weight from neuron $k\to j$ from layer $(l-1)$ to $l$. Notice the first index represents the neuron in layer $l$ and the second is the neuron in layer $l-1$; Michael Nielsen defines the weight matrix this way since it simplifies some of the equations and makes the intuition easier to understand.</p>

<p><img src="/images/neural-nets-part-3/layer-neuron.svg" alt="Layer Neuron" title="Layer Neuron" /></p>

<p><small>An entry in the weight matrix $W_{jk}^l$ is the value connecting the $k$-th neuron in the $(l-1)$-th layer with the $j$-th neuron in the $l$-th layer. When we compute the input for any particular neuron, we sum over all of the output activations of the previous layer, hence the $\sum_k W_{jk}^l a_k^{l-1}$ part of the pre-activation.</small></p>

<p>For the input layer, we can define the <strong>pre-activation</strong> as the weighted sum of the inputs and weights plus the bias $z_j^1=\sum_k W_{jk}^1 x_k + b_j^1$; we can write it in a vectorized form like $z^1= W^1 x + b^1$. The <strong>activation</strong> just runs the pre-activation through an activation function $\sigma(\cdot)$ like $a_j^1=\sigma(z_j^1)$ or $a^1=\sigma(z^1)$ for the vectorized version (assuming the activation function is applied element-wise). For the next layer, we use the activations of the previous layer all the way until we get to the activations for the last layer $a^L$, also called the output layer. For simplicity, we can define the zeroth set of activations as the input $a^0 = x$ so we can write the entire set of equations in a general form for each layer.</p>

\[\begin{align*}
z_j^l &amp;= \displaystyle\sum_k W_{jk}^l a_k^{l-1} + b_j^l &amp; z^l &amp;= W^l a^{l-1} + b^l\\
a_j^l &amp;= \sigma(z_j^l) &amp; a^l &amp;= \sigma(z^l)\\
\end{align*}\]

<p>Performing a forward pass/inference is just computing $z^l$ and $a^l$ all the way to the final output layer $L$. During training, that final output $a^L$ goes into the cost function to determine how well the current set of weights and biases help produce the desired output. For tasks like classification, we can express the cost function in terms of the output layer activations $a^L$ and the desired class $y$ like $C(a^L, y)$. Note that if we were to “unroll” $a^L$ and all activations back to the input, $a^L$ would expand into a huge equation that would be a function of all of the weights and biases in the network so putting in the cost function is really evaluating all of the weights and biases.</p>
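<p>The forward pass described above can be sketched in a few lines of numpy. The layer sizes and random initialization below are arbitrary assumptions just for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    """Element-wise logistic activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up architecture: 3 inputs -> 4 hidden neurons -> 2 outputs.
sizes = [3, 4, 2]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

def forward(x):
    """Apply z^l = W^l a^{l-1} + b^l and a^l = sigmoid(z^l) layer by layer."""
    a = x                  # a^0 = x
    for W, b in zip(Ws, bs):
        z = W @ a + b      # pre-activation
        a = sigmoid(z)     # activation
    return a               # a^L, the output layer's activations

x = np.array([1.0, -0.5, 2.0])
y = np.array([1.0, 0.0])              # desired output
a_L = forward(x)
cost = 0.5 * np.sum((y - a_L) ** 2)   # quadratic cost C(a^L, y)
```

<p>Unrolling <code class="language-plaintext highlighter-rouge">forward</code> makes it clear that <code class="language-plaintext highlighter-rouge">cost</code> is really a function of every weight matrix and bias vector in the network.</p>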

<p>This is a lot of notation but take a second to understand the placement of indices and what they represent. As an example, suppose we wanted to compute $z_1^l$; substituting $j=1$ into the pre-activation equation, we get $\sum_k W_{1k}^l a_k^{l-1} + b_1^l$. Intuitively, this means we take the $k$ neurons from the $(l-1)$-th layer as a vector, multiply by the 1st row of the weight matrix, and add the 1st component of the bias vector to get the 1st vector component of the pre-activation. Make sure the indices match up and make sense, i.e., there should be the same number of free lower indices on both sides of any equation! Try other kinds of substitutions to make sure you understand how the index placement works.</p>
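<p>To make the index bookkeeping concrete, here’s a quick numerical check (shapes and values are made up) that the $j=1$ component of $z^l$ computed via the summation matches the vectorized form:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))   # W^l: 4 neurons in layer l-1, 3 neurons in layer l
a_prev = rng.standard_normal(4)   # a^{l-1}
b = rng.standard_normal(3)        # b^l

z = W @ a_prev + b                # vectorized form: z^l = W^l a^{l-1} + b^l

# j = 1 component (index 0 in numpy), written out with the sum over k:
z_1 = sum(W[0, k] * a_prev[k] for k in range(4)) + b[0]
```

<p>Both computations should agree, confirming that $z_j^l$ picks out row $j$ of $W^l$.</p>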

<h1 id="backpropagation-of-errors">Backpropagation of Errors</h1>

<p>In the previous post, we demonstrated how to train a single neuron using gradient descent by computing the partial derivatives of the cost function with respect to the weights and bias. That is actually still the exact same principle and idea that we’ll be going forward with; it’s just that in the general case, the maths gets a bit more complicated since we have multiple sets of parameters across multiple layers written as functions of each other. Rather than computing individual partial derivatives for each weight and bias, we can come up with a general set of equations that tell us how to do so for any width and depth of neural network.</p>

<p>Instead of jumping right into the maths, let’s go through a numerical example of backpropagation to get our feet wet first. I actually wrote <a href="/backpropagation">a post many years ago</a> on this that I’ll steal from and take this opportunity to update the writing and narrative. Since we already somewhat used backpropagation in the previous post, let’s analyze that in a bit more detail.</p>

<h2 id="computation-graphs">Computation Graphs</h2>

<p>One useful visual representation for a computation is a <strong>computation graph</strong>. Each node in the graph represents an operation and each edge represents a value that is the output of the previous operation. Let’s draw a computation graph for our little artificial neuron from the previous post and substitute some random values for the weights and bias.</p>

<p><img src="/images/neural-nets-part-3/comp-graph-1.svg" alt="Computation Graph Part 1" title="Computation Graph Part 1" /></p>

<p><small>This computation graph represents a single neuron with two inputs and corresponding weights and a bias term. Example values have been substituted and a forward pass has been computed. The $y$ value is the target/true value fed into the cost function.</small></p>

<p>In this very simple example, we have a few operations: multiplication, addition, sigmoid activation, and cost function evaluation. We’ve done a forward pass and recorded the output of each operation above its edge. At the very last step, we have a sigmoid output of 0.73 but a desired output of 1. So the goal is to adjust our weights and biases such that, the next time we perform a forward pass, the output of the model is closer to 1. What we did last time was to compute the partial derivatives of the cost function with respect to each parameter by expanding out the entire cost function and analytically computing derivatives. One of the things we saw was that all of the learnable parameters had similar terms in their derivatives, namely $\frac{\p C}{\p a}$ and $\frac{\p C}{\p z}$. Was this coincidental or a byproduct of how we compute the output of a neuron?</p>

<p>To answer this question, we’re going to take a slightly different, but equivalent, approach to computing the partial derivatives by using the graph as a visual guide for which derivatives compose. For each node, we’re going to take the derivative of the operation with respect to each of the inputs and accumulate the overall gradient, starting at the end, through the graph until we get all the way back to the parameters of the model at the very left of the graph. We’ll start with the first derivative $\frac{\p C}{\p a}$ and keep tacking on factors as we go backwards through the graph. For example, the next factor we’ll tack on is $\frac{\p a}{\p z}$ to get $\frac{\p C}{\p a}\frac{\p a}{\p z}=\frac{\p C}{\p z}$. By multiplying through the partial derivatives this way, propagating the gradient signal backwards through the graph is equivalent to applying the chain rule. By the time we get to the model parameters, we will have computed something like $\frac{\p C}{\p w_1}$ and we can simply read this off the graph.</p>

<p>Let’s start with the output layer and the cost function. We’re using the quadratic cost function that looks like this for a single output: $C(a, y) = \frac{1}{2}(y - a)^2$. There are technically two possible partial derivatives of this function, $\frac{\p C}{\p a}$ and $\frac{\p C}{\p y}$, but the latter doesn’t make sense since $y$ is given and not a function of the parameters of the model, so let’s compute the former. We’ve already done so in the previous post so we’ll lift the derivative from there.</p>

\[\begin{align*}
\frac{\p C}{\p a} &amp;= -(y-a)\\
&amp;= -(1 - 0.73)\\
&amp;= -0.27
\end{align*}\]

<p>Computing the derivative and substituting our values, we get $-0.27$ for the start of the gradient signal.</p>
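<p>A quick way to sanity-check a derivative like this is a finite-difference estimate: nudge $a$ slightly in both directions and see whether the slope matches the closed form $-(y-a)$. A small sketch with the example values:</p>

```python
# Quadratic cost for a single output: C(a, y) = 0.5 * (y - a)^2
C = lambda a, y: 0.5 * (y - a) ** 2

a, y = 0.73, 1.0
analytic = -(y - a)   # closed-form dC/da = -(y - a) = -0.27

# Central finite-difference estimate of dC/da around a = 0.73.
eps = 1e-6
numeric = (C(a + eps, y) - C(a - eps, y)) / (2 * eps)
```

<p>The two values agree, which is good evidence the closed-form derivative is right.</p>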

<p><img src="/images/neural-nets-part-3/comp-graph-2.svg" alt="Computation Graph Part 2" title="Computation Graph Part 2" /></p>

<p><small>We’ve computed the gradient of the cost function with respect to its inputs and placed it below the corresponding edge in green. Since $y$ is given, we don’t compute a gradient to it.</small></p>

<p>We’re going to write the gradient values under the edges and track them as we move backward through the graph. Now the next operation we encounter is the sigmoid activation function $\sigma(z) = \frac{1}{1+e^{-z}}$. Let’s compute the derivative of the sigmoid with respect to input $z$. Similar to the above example, we already know a closed-form of $\sigma’(z)$ from the previous post so we’ll lift the derivative from there.</p>

\[\begin{align*}
\frac{\p a}{\p z} &amp;= \sigma(z)\big[1-\sigma(z)\big]\\
&amp;= a(1-a)\\
&amp;= 0.73(1-0.73)\\
&amp;= 0.1971
\end{align*}\]

<p>Computing the derivative and substituting values, we get $0.1971$. Now do we add this number underneath the corresponding edge of the graph? Not quite. We could call this value a <em>local gradient</em> since we’re just computing the gradient of a single node with respect to its inputs. But remember what we said above: propagating the gradient is equivalent to applying the chain rule so we actually need to multiply this by $-0.27$ to get the <em>total gradient</em> $\frac{\p C}{\p a}\frac{\p a}{\p z}=\frac{\p C}{\p z}=-0.27(0.1971)=-0.053$ which we can put underneath the corresponding edge.</p>
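<p>We can also check the sigmoid identity $\sigma’(z) = \sigma(z)\big[1-\sigma(z)\big]$ numerically. The pre-activation value below is an assumption chosen so that $\sigma(z)\approx 0.73$, matching the worked example:</p>

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = 0.9946          # hypothetical pre-activation with sigmoid(z) ~ 0.73
a = sigmoid(z)

# Closed-form local gradient vs. a central finite-difference estimate.
closed_form = a * (1.0 - a)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
```

<p>Both come out near $0.1971$, the local gradient we just computed by hand.</p>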

<p><img src="/images/neural-nets-part-3/comp-graph-3.svg" alt="Computation Graph Part 3" title="Computation Graph Part 3" /></p>

<p><small>We’ve computed the gradient of the activation function with respect to its inputs. To get the actual gradient, we multiply it with the previous gradient from the cost function so that we have a full global gradient.</small></p>

<p>Now we’ve reached our first parameter, the bias $b$! Same as before, we’ll compute the local gradient and multiply by the thus-far accumulated gradient. To make things a bit easier, let’s define $\Omega \equiv w_1 x_1 + w_2 x_2$ so the operation can be written as $z = \Omega + b$. We have two local gradients to compute: $\frac{\p z}{\p \Omega}$ and $\frac{\p z}{\p b}$. Fortunately, this is easy since the derivative of a sum with respect to either term is 1, so $\frac{\p z}{\p \Omega}=\frac{\p z}{\p b}=1$ and we just “copy” the gradient along both input paths of the addition node. We’ve successfully computed the gradient of the cost function with respect to our bias parameter!</p>

<p><img src="/images/neural-nets-part-3/comp-graph-4.svg" alt="Computation Graph Part 4" title="Computation Graph Part 4" /></p>

<p><small>We’ve computed the gradient across the weighted sum and bias. Notice that the gradient is “copied” across addition nodes because the derivative of a sum with respect to the terms is always $+1$.</small></p>

<p>We have two more parameters to go. The next node we encounter on our way to the weights is another addition node. Similar to what we just did, we can “copy” the gradient along both paths.</p>

<p>Let’s first consider $w_1$; the next node we encounter is a multiplication node. Similarly, we can define $\omega_1 = w_1 x_1$ and compute just the local gradient $\frac{\p \omega_1}{\p w_1}$ since $x_1$ is given, just like $y$ at the output, so we don’t need $\frac{\p \omega_1}{\p x_1}$.</p>

\[\begin{align*}
\frac{\p \omega_1}{\p w_1} &amp;= x_1\\
&amp;= -1\\
\end{align*}\]

<p>Multiplying this with the incoming gradient we get the total gradient of $\frac{\p C}{\p a}\frac{\p a}{\p z}\frac{\p z}{\p\Omega}\frac{\p \Omega}{\p \omega_1}\frac{\p \omega_1}{\p w_1} = \frac{\p C}{\p w_1} = 0.053$. Collapsing the identity terms, a more meaningful application of the chain rule would be $\frac{\p C}{\p a}\frac{\p a}{\p z}\frac{\p z}{\p w_1} = \frac{\p C}{\p w_1} = 0.053$. We can easily figure out the other derivative $\frac{\p C}{\p a}\frac{\p a}{\p z}\frac{\p z}{\p w_2} = \frac{\p C}{\p w_2} = -0.053(-2)=0.106$ by noting that for a multiplication node, the local gradient of one of the inputs is the other input so $\frac{\p z}{\p w_2}=x_2$.</p>

<p><img src="/images/neural-nets-part-3/comp-graph-5.svg" alt="Computation Graph Part 5" title="Computation Graph Part 5" /></p>

<p><small>We’ve computed all of the gradients in the computation graph, including the weights. For a multiplication gate, the gradient of a particular term is the product of the other terms. For example, $\frac{\p}{\p a}abc=bc$ and the other derivatives follow. For a product like this, we multiply by the incoming gradient.</small></p>

<p>Now we’ve computed the gradient of the cost function for every parameter so we’re ready for a gradient descent update!</p>

\[\begin{align*}
w_1&amp;\gets w_1 - \eta\frac{\p C}{\p w_1}\\
w_2&amp;\gets w_2 - \eta\frac{\p C}{\p w_2}\\
b&amp;\gets b - \eta\frac{\p C}{\p b}
\end{align*}\]

<p>Let’s set the learning rate to $\eta=1$ for simplicity and perform a single update to get new values for our parameters.</p>

\[\begin{align*}
w_1 &amp;\gets 2 - (0.053) &amp;= 1.947\\
w_2 &amp;\gets -3 - (0.106) &amp;= -3.106\\
b &amp;\gets -3 - (-0.053) &amp;= -2.947
\end{align*}\]

<p>If we run another forward pass with these new parameters, we get $a=0.79$ which is closer to our target value of $y=1$! We’ve successfully performed gradient descent numerically by hand and saw that it does, in fact, adjust the model parameters to get us closer to the desired output!</p>
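<p>The whole walkthrough is easy to check in a few lines of Python. Here’s a short sketch where the variable names mirror the symbols above; the values are the ones from our example.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# inputs, target, and initial parameters from the example
x1, x2, y = -1.0, -2.0, 1.0
w1, w2, b = 2.0, -3.0, -3.0

# forward pass
z = w1 * x1 + w2 * x2 + b      # 1.0
a = sigmoid(z)                 # ~0.73

# backward pass: multiply local gradients along the graph (chain rule)
dC_da = -(y - a)               # ~ -0.27
dC_dz = dC_da * a * (1.0 - a)  # ~ -0.053
dC_dw1, dC_dw2, dC_db = dC_dz * x1, dC_dz * x2, dC_dz

# gradient descent step with eta = 1
eta = 1.0
w1, w2, b = w1 - eta * dC_dw1, w2 - eta * dC_dw2, b - eta * dC_db

# second forward pass: the output moves toward y = 1
a_new = sigmoid(w1 * x1 + w2 * x2 + b)
print(round(a, 2), round(a_new, 2))  # 0.73 0.79
```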

<p>To summarize, a computation graph is a useful tool for visualizing a larger computation in terms of its constituent operations, represented as nodes in the graph. To perform backpropagation on this graph, we start with the final output and work our way backwards to each parameter, accumulating the global gradient as we go by successively multiplying it by the local gradient at each node. The local gradient at each node is just the derivative of the node with respect to its inputs. If we keep doing this, we’ll eventually arrive at the global gradient for each parameter, which is equivalent to the derivative of the cost function with respect to that parameter. We can directly use this gradient in a gradient descent update to get our model closer to the target value.</p>
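<p>To make the “nodes with local gradients” picture concrete, here’s a minimal scalar computation-graph sketch (the <code class="language-plaintext highlighter-rouge">Value</code> class and its method names are my own illustrative choices, not any particular library’s API). Each operation records its parents along with their local gradients, and <code class="language-plaintext highlighter-rouge">backward</code> multiplies the upstream gradient by each local gradient as it walks back through the graph.</p>

```python
class Value:
    """A scalar node in a computation graph that records local gradients."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents  # pairs of (parent_node, local_gradient)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a+b)/da = d(a+b)/db = 1: addition "copies" the gradient
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(ab)/da = b and d(ab)/db = a
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self, upstream=1.0):
        # accumulate the global gradient: upstream times each local gradient
        self.grad += upstream
        for parent, local in self._parents:
            parent.backward(upstream * local)

# the weighted sum from the running example: z = w1*x1 + w2*x2 + b
w1, w2, b = Value(2.0), Value(-3.0), Value(-3.0)
z = w1 * -1.0 + w2 * -2.0 + b
z.backward()
print(w1.grad, w2.grad, b.grad)  # -1.0 -2.0 1.0
```

Notice $\frac{\p z}{\p w_1} = x_1 = -1$, matching the multiplication-node rule from the walkthrough.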

<h2 id="backpropagation-equations">Backpropagation Equations</h2>

<p>Now that we’ve seen backpropagation work in a few different cases, e.g., single neuron and computation graph, we’re ready to actually derive the general backpropagation equations for any ANN. This is where the maths is going to start getting a little heavy so feel free to skip to the last paragraph of this section. I’ll be loosely following Michael Nielsen’s general approach <a href="http://neuralnetworksanddeeplearning.com/chap2.html">here</a> since I like the high-level way he’s structured the derivation. We’re going to start with computing the gradient of the cost function with respect to the output of the model, then come up with an equation for propagating a convenient intermediate quantity (he calls this the “error”) from layer to layer, and finally two more equations that express the partial derivatives of the cost with respect to the weights and bias of a particular layer in terms of that intermediate quantity at the layer.</p>

<p>From the previous section, we started with computing the gradient of the cost function with respect to the entire model output first so that sounds like a sensible thing to compute first: $\frac{\p C}{\p a_j^L}$ or $\nabla_{a^L}C$ in vector form. We’re making an implicit assumption that the cost function is a function of the output of the network but that’s most often the case. There are more complex models that account for other things in the cost function, but it’s a reasonable assumption to make. Note that this gradient is entirely dependent on the cost function we use, e.g. mean absolute error, mean squared error, or something more interesting like Huber loss, so we’ll leave it written symbolically.</p>

<p>Going a step further, we want to compute the derivative of the cost function with respect to the weights and biases of the very last layer, i.e., $\frac{\p C}{\p W_{jk}^L}$ and $\frac{\p C}{\p b_j^L}$. To do this, we’ll have to go backwards through the activation function first $\frac{\p C}{\p a_j^L}\frac{\p a_j^L}{\p z_j^L}=\frac{\p C}{\p z_j^L}$. One thing to note is that, for every layer, the pre-activation is always a function of the weights and biases at the same layer. By that logic, if we could compute $\frac{\p C}{\p z_j^l}$ for each layer, the gradients of the weights and biases would just be another factor tacked on to this. For convenience purposes, it seems like a good idea to define a variable and name for this quantity so let’s directly call this the <em>error</em> in neuron $j$ in layer $l$.</p>

\[\begin{equation}
\delta_j^l \equiv \frac{\p C}{\p z_j^l}
\end{equation}\]

<p>Note that we could have defined the error in terms of the activation rather than the pre-activation like $\frac{\p C}{\p a_j^l}$ but then there would be an extra step to go through the activation into the pre-activation anyways (for each weight matrix and bias vector) so it’s a bit simpler to define it in terms of the pre-activation. But everything we do past this point could be done using $\frac{\p C}{\p a_j^l}$ as the definition of the error without loss of generality.</p>

<p><img src="/images/neural-nets-part-3/error-def.svg" alt="Definition of Error" title="Definition of Error" /></p>

<p><small>A visual way to think about the error is taking the green gradient path from the cost function to the pre-activation $z_j^l$ (across its activation $a_j^l$) of a particular neuron.</small></p>

<p>Intuitively, $\delta_j^l$ represents how a change in the pre-activation in a neuron $j$ in a layer $l$ affects the entire cost function. This little wiggle in the pre-activation occurs from a change in the weights or bias but since the pre-activation is a function of both, we use it to represent both kinds of wiggles. It’s really just a helpful intermediate quantity that simplifies some of the work of propagating the gradient backwards.</p>

<p>Now that we have this quantity, the first step is to compute this error at the output layer $L$. Let’s substitute $l=L$ into the definition of $\delta_j^l$</p>

\[\begin{align*}
\delta_j^L &amp;= \frac{\p C}{\p z_j^L}\\
&amp;= \sum_k\frac{\p C}{\p a_k^L}\frac{\p a_k^L}{\p z_j^L}\\
&amp;= \frac{\p C}{\p a_j^L}\frac{\p a_j^L}{\p z_j^L}\\
&amp;= \frac{\p C}{\p a_j^L}\sigma'(z_j^L)
\end{align*}\]

<p>Between the first and second steps, we have to sum over all of the activations of the output layer since the cost function depends on all of them. Between the second and third steps, we used the fact that the pre-activation $z_j^L$ is only used in the corresponding activation $a_j^L$: any other $a_k^L$ <em>is not</em> a function of $z_j^L$, so the only activation that <em>is</em> a function of $z_j^L$ is $a_j^L$ and all of the other terms in the sum disappear. So now we have an equation telling us the error in the last layer.</p>

\[\begin{equation}
\delta_j^L = \frac{\p C}{\p a_j^L}\sigma'(z_j^L)
\end{equation}\]

<p>and its vectorized counterpart</p>

\[\begin{equation}
\delta^L = \nabla_{a^L}C \odot \sigma'(z^L)
\end{equation}\]

<p>where $\odot$ is the Hadamard product or element-wise multiplication. Intuitively, this equation follows from the derivation: to get to the pre-activation at the last layer, we have to move the gradient backwards through the cost function and then again backwards through the activation of the last layer.</p>

<p><img src="/images/neural-nets-part-3/bp-equation-1.svg" alt="Backpropagation Equation 1" title="Backpropagation Equation 1" /></p>

<p><small>For the first backpropagation equation, we apply the definition of the error, but move back only to the output layer. To get to the pre-activation $z_j^L$, we start at the cost function $\frac{\p C}{\p a_j^L}$ and through the corresponding activation $\frac{\p a_j^L}{\p z_j^L}$ to get the total gradient $\frac{\p C}{\p a_j^L}\frac{\p a_j^L}{\p z_j^L}=\frac{\p C}{\p a_j^L}\sigma’(z_j^L)=\delta_j^L$.</small></p>

<p>Now we could go right into computing the weights and biases from here, but let’s first figure out a way to propagate this error from layer to layer and then come up with a way to compute the derivative of the cost function with respect to the weights and biases of any layer, including the last one. So we’re looking to propagate the error $\delta^{l+1}$ from a particular layer $(l+1)$ to a previous layer $l$. Specifically, we want to write the error in the previous layer $\delta^l$ in terms of the error of the next layer $\delta^{l+1}$. As we did before, we can start with the definition of $\delta^l$ and judiciously apply the chain rule.</p>

\[\begin{align*}
\delta_k^l &amp;= \frac{\p C}{\p z_k^l}\\
&amp;= \sum_j \frac{\p C}{\p z_j^{l+1}}\frac{\p z_j^{l+1}}{\p z_k^l}\\
&amp;= \sum_j \delta_j^{l+1}\frac{\p z_j^{l+1}}{\p z_k^l}
\end{align*}\]

<p>Between the second and third steps, we substituted back the definition of $\delta_j^{l+1}=\frac{\p C}{\p z_j^{l+1}}$ just using $k\to j$ and $l\to (l+1)$ from the original definition (both are free indices). Now we have $\delta^l$ in terms of $\delta^{l+1}$! The last remaining thing to expand is $\frac{\p z_j^{l+1}}{\p z_k^l}$.</p>

\[\begin{align*}
\frac{\p z_j^{l+1}}{\p z_k^l} &amp;= \frac{\p}{\p z_k^l}z_j^{l+1}\\
&amp;= \frac{\p}{\p z_k^l}\bigg[\sum_p W_{jp}^{l+1}a_p^l + b_j^{l+1}\bigg]\\
&amp;= \frac{\p}{\p z_k^l}\bigg[\sum_p W_{jp}^{l+1}\sigma(z_p^l) + b_j^{l+1}\bigg]\\
&amp;= \frac{\p}{\p z_k^l}\sum_p W_{jp}^{l+1}\sigma(z_p^l)\\
&amp;= \frac{\p}{\p z_k^l} W_{jk}^{l+1}\sigma(z_k^l)\\
&amp;= W_{jk}^{l+1}\frac{\p}{\p z_k^l} \sigma(z_k^l)\\
&amp;= W_{jk}^{l+1}\sigma'(z_k^l)\\
\end{align*}\]

<p>This derivation is more involved. In the second line, we expand out $z_j^{l+1}$ using its definition; note that we use $p$ as the dummy index to avoid any confusion. In the fourth line, we drop $b_j^{l+1}$ since it’s not a function of $z_k^l$. Going to the fifth line, similar to the reasoning earlier, the only term in the sum that is a function of $z_k^l$ is the one where $p=k$, so we cancel all of the other terms. Then we differentiate as usual. We can take this result and plug it back into the original equation.</p>

\[\begin{equation}
\delta_k^l = \sum_j W_{jk}^{l+1}\delta_j^{l+1}\sigma'(z_k^l)
\end{equation}\]

<p>To get the vectorized form, note that we have to transpose the weight matrix since we’re summing over the rows instead of the columns; also note that the last term is not a function of $k$ so we can take the Hadamard product.</p>

\[\begin{equation}
\delta^l = (W^{l+1})^{T}\delta^{l+1}\odot\sigma'(z^l)
\end{equation}\]

<p>This is why we intentionally ordered the terms in the multiplication this way: to better show how it translates into a matrix product and why we use the transpose of the weight matrix.</p>

<p><img src="/images/neural-nets-part-3/bp-equation-2.svg" alt="Backpropagation Equation 2" title="Backpropagation Equation 2" /></p>

<p><small>For the second backpropagation equation, we assume we’ve already computed the error at some layer $(l + 1)$ and try to propagate it back to layer $l$. We can always apply this to the last and second-to-last layer anyways. Starting from $\delta_j^{l+1}$, to get to $\delta_k^l$, we need to move backwards through the weight matrix and through the activation. In the forward pass, since we compute the pre-activation of a neuron using the weighted sum of all previous activations, to compute the gradient, we need the sum of all of the next-layer errors, weighted by the transpose of the weight matrix (consider the dimensions), which explains the $\sum_j W_{jk}^{l+1}\delta_j^{l+1}$ part. Then we move backwards through the activation function, which explains the $\sigma’(z_k^l)$ term.</small></p>

<p>This has an incredibly intuitive explanation: since the weight matrix propagates inputs forward, the transpose of the weight matrix propagates errors backwards, specifically the error in the next layer $\delta^{l+1}$ to the current layer. Another way to think about it is in terms of the dimensions of the matrix: the weight matrix multiplies against the number of neurons of the previous layer to produce the number of neurons in the next layer, so the transpose of the weight matrix multiplies against the number of neurons in the next layer and produces the number of neurons in the previous layer. After the weight matrix multiplication, we have to Hadamard with the derivative of the activation function to move the error backward through the activation to the pre-activation.</p>
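<p>The dimension argument is quick to verify with NumPy shapes (the layer sizes here are arbitrary, chosen just for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_l, n_next = 4, 3  # neurons in layer l and layer l+1

W = rng.standard_normal((n_next, n_l))       # forward: maps layer l to layer l+1
a_l = rng.standard_normal((n_l, 1))
delta_next = rng.standard_normal((n_next, 1))
z_l = rng.standard_normal((n_l, 1))

def sigma_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

print((W @ a_l).shape)   # (3, 1): the forward pass produces layer l+1's shape

# backward: the transpose carries the error back to layer l's shape
delta_l = (W.T @ delta_next) * sigma_prime(z_l)
print(delta_l.shape)     # (4, 1)
```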

<p>We’re almost done! The last two things we need are the actual derivatives of the cost function with respect to the weights and biases. Fortunately, they can be easily expressed in terms of the error $\delta_j^l$. Let’s start with the bias since it’s easier. This time, we can start with what we’re aiming for and then decompose in terms of the error.</p>

\[\begin{align*}
\frac{\p C}{\p b_j^l} &amp;= \sum_k\frac{\p C}{\p z_k^l}\frac{\p z_k^l}{\p b_j^l}\\
&amp;= \frac{\p C}{\p z_j^l}\frac{\p z_j^l}{\p b_j^l}\\
&amp;= \delta_j^l\frac{\p z_j^l}{\p b_j^l}\\
&amp;= \delta_j^l\frac{\p}{\p b_j^l}\Big(\sum_k W_{jk}^l a_k^{l-1} + b_j^l\Big)\\
&amp;= \delta_j^l
\end{align*}\]

<p>In the first step, we use the chain rule to expand the left-hand side. Similar to the previous derivations, all but one term in the sum cancel. Then we plug in the definition of the error and differentiate.</p>

\[\begin{equation}
\frac{\p C}{\p b_j^l} = \delta_j^l
\end{equation}\]

<p>The vectorized version looks almost identical!</p>

\[\begin{equation}
\nabla_{b^l}C = \delta^l
\end{equation}\]

<p>Note that if we had defined the error as the gradient of the cost function with respect to the activation, we’d have to take an extra term moving it across the pre-activation.</p>

<p><img src="/images/neural-nets-part-3/bp-equation-3.svg" alt="Backpropagation Equation 3" title="Backpropagation Equation 3" /></p>

<p><small>Remember that one way to interpret the bias is being a “weight” whose input is always $+1$. Similar to the second backpropagation equation, we’ll assume we’ve computed $\delta_j^l$. To get to the bias $b_j^l$, we don’t have to do anything extra since the input term is simply $+1$.</small></p>

<p>Turns out the derivative of the cost function with respect to the bias is exactly equal to the error! Convenient that it worked out this way!</p>

<p>Now we just need the corresponding derivative for the weights. It’ll follow almost the same pattern.</p>

\[\begin{align*}
\frac{\p C}{\p W_{jk}^l} &amp;= \sum_q\frac{\p C}{\p z_q^l}\frac{\p z_q^l}{\p W_{jk}^l}\\
&amp;= \frac{\p C}{\p z_j^l}\frac{\p z_j^l}{\p W_{jk}^l}\\
&amp;= \delta_j^l\frac{\p z_j^l}{\p W_{jk}^l}\\
&amp;= \delta_j^l\frac{\p}{\p W_{jk}^l}\Big(\sum_p W_{jp}^l a_p^{l-1} + b_j^l\Big)\\
&amp;= \delta_j^l\frac{\p}{\p W_{jk}^l}W_{jk}^l a_k^{l-1}\\
&amp;= \delta_j^l a_k^{l-1}
\end{align*}\]

<p>Be careful with the indices! In the first step, we use a dummy index $q$ to avoid confusing indices. The only term in the sum that is nonzero is $z_j^l$; remember that the second index in the weight matrix is summed over, so only the first one allows us to cancel the other terms. Then we can expand out using a dummy index again and apply the same reasoning to cancel out the other terms in the sum. Then we differentiate.</p>

\[\begin{equation}
\frac{\p C}{\p W_{jk}^l} = \delta_j^l a_k^{l-1}
\end{equation}\]

<p>Note that all indices are balanced on both sides of the equation so we haven’t made any obvious mistake in the calculation.</p>

<p><img src="/images/neural-nets-part-3/bp-equation-4.svg" alt="Backpropagation Equation 4" title="Backpropagation Equation 4" /></p>

<p><small>Like the previous two backpropagation equations, we’ll assume we’ve computed $\delta_j^l$. To get to the weight between two arbitrary neurons $W_{jk}^l$, the two terms involved are the error $\delta_j^l$ which is the error at the $j$th neuron and the activation of the $k$th neuron that it connects to.</small></p>

<p>The intuitive explanation for this is that $a_k^{l-1}$ is the “input” to a neuron through a weight and $\delta_j^l$ is the “output” error; this says the change in the cost function as a result of the change in the weight is the product of the activation going “into” the weight times the resulting error “output”. The vectorized version uses the outer product since, for a matrix, $M_{ij}=x_i y_j \leftrightarrow M=xy^T$.</p>

\[\begin{equation}
\nabla_{W^l}C = \delta^l (a^{l-1})^{T}
\end{equation}\]

<p>That’s the last equation we need for a full backpropagation solution! Let’s see them all in one place here, both in element and vectorized form!</p>

\[\begin{align*}
\delta_j^l &amp;\equiv \frac{\p C}{\p z_j^l} &amp; \delta^l &amp;\equiv \nabla_{z^l} C\\
\delta_j^L &amp;= \frac{\p C}{\p a_j^L}\sigma'(z_j^L) &amp; \delta^L &amp;= \nabla_{a^L}C \odot \sigma'(z^L)\\
\delta_k^l &amp;= \sum_j W_{jk}^{l+1}\delta_j^{l+1}\sigma'(z_k^l) &amp; \delta^l &amp;= (W^{l+1})^{T}\delta^{l+1}\odot\sigma'(z^l)\\
\frac{\p C}{\p b_j^l} &amp;= \delta_j^l &amp; \nabla_{b^l}C &amp;= \delta^l\\
\frac{\p C}{\p W_{jk}^l} &amp;= \delta_j^l a_k^{l-1} &amp; \nabla_{W^l}C &amp;= \delta^l (a^{l-1})^{T}\\
\end{align*}\]

<p>With this set of equations, we can train any artificial neural network on any set of data! Take a second to prod at these equations and consider what happens for various values, e.g., when $\sigma’(\cdot)\approx 0$. This should help give some insight on how quickly or efficiently training can happen, for example. There are some other insights we can gain from analyzing these equations further but that’s a bit tangential to this current discussion and best saved for when we encounter problems (“seeing is believing”).</p>
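<p>A good way to convince yourself the equations are right is a numerical gradient check: build a tiny network with the quadratic cost, compute the analytic gradient with the four equations, and compare against a centered finite difference of the cost. This is just a sketch with arbitrary sizes and names.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [3, 4, 2]  # a tiny network: 3 inputs, 4 hidden neurons, 2 outputs
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x = rng.standard_normal((sizes[0], 1))
y = rng.standard_normal((sizes[-1], 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cost(Ws, bs):
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((y - a) ** 2)

def backprop(Ws, bs):
    # forward pass, caching activations and pre-activations
    acts, zs = [x], []
    for W, b in zip(Ws, bs):
        zs.append(W @ acts[-1] + b)
        acts.append(sigmoid(zs[-1]))
    sp = lambda z: sigmoid(z) * (1 - sigmoid(z))
    delta = (acts[-1] - y) * sp(zs[-1])                 # error at the last layer
    gWs, gbs = [None] * len(Ws), [None] * len(bs)
    for l in reversed(range(len(Ws))):
        gbs[l] = delta                                  # bias equation
        gWs[l] = delta @ acts[l].T                      # weight equation
        if l > 0:
            delta = (Ws[l].T @ delta) * sp(zs[l - 1])   # propagation equation
    return gWs, gbs

gWs, gbs = backprop(Ws, bs)

# centered finite difference for a single weight entry
eps = 1e-6
Wp = [W.copy() for W in Ws]; Wm = [W.copy() for W in Ws]
Wp[0][1, 2] += eps; Wm[0][1, 2] -= eps
numeric = (cost(Wp, bs) - cost(Wm, bs)) / (2 * eps)
print(abs(numeric - gWs[0][1, 2]))  # tiny, on the order of floating-point noise
```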

<h2 id="backpropagation-algorithm">Backpropagation Algorithm</h2>

<p>Now we can describe the entire backpropagation algorithm in the context of stochastic gradient descent (SGD).</p>

<ol>
  <li>Initialize the weights $W^l$ and biases $b^l$ for each layer $l=1,\dots,L$</li>
  <li>For each epoch
    <ol>
      <li>Sample a minibatch $\{x^{(i)}, y^{(i)}\}$ of size $m$</li>
      <li>For each example $(x^{(i)}, y^{(i)})$ in the minibatch
        <ol>
          <li>Forward pass to compute each $z^l = W^l a^{l-1} + b^l$ and $a^l=\sigma(z^l)$ for $l=1,\dots,L$</li>
          <li>Compute the error in the last layer $\delta^L = \nabla_{a^L}C \odot \sigma’(z^L)$</li>
          <li>Backward pass to compute error for each layer $\delta^l = (W^{l+1})^{T}\delta^{l+1}\odot\sigma’(z^l)$</li>
        </ol>
      </li>
      <li>Update all weights using $W^l\gets W^l-\eta\frac{1}{m}\sum_x\delta^l (a^{l-1})^{T}$ and all biases using $b^l\gets b^l-\eta\frac{1}{m}\sum_x\delta^l$, respectively. Average the gradient over all of the training examples in the minibatch and apply the learning rate.</li>
    </ol>
  </li>
</ol>

<p>This algorithm follows the same pattern as the previous SGD training loop we wrote, except now we’re computing an intermediate quantity (the error $\delta^l$) and have more complicated update equations.</p>
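<p>Stripped of the data plumbing, the loop above fits in a short NumPy sketch. This is only an illustrative toy run (random initialization, a made-up XOR-style dataset, and full-batch updates instead of sampled minibatches); whether it fully solves the toy problem depends on the initialization, but the averaged update should drive the cost down.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# toy dataset: XOR, with a [2, 4, 1] network
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).reshape(4, 2, 1)
Y = np.array([0, 1, 1, 0], dtype=float).reshape(4, 1, 1)

sizes = [2, 4, 1]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((m, 1)) for m in sizes[1:]]
eta, epochs = 1.0, 2000

def total_cost():
    c = 0.0
    for x, y in zip(X, Y):
        a = x
        for W, b in zip(Ws, bs):
            a = sigmoid(W @ a + b)
        c += 0.5 * np.sum((y - a) ** 2)
    return c

cost_before = total_cost()
for _ in range(epochs):
    gWs = [np.zeros_like(W) for W in Ws]
    gbs = [np.zeros_like(b) for b in bs]
    for x, y in zip(X, Y):
        # forward pass, caching activations
        acts, zs = [x], []
        for W, b in zip(Ws, bs):
            zs.append(W @ acts[-1] + b)
            acts.append(sigmoid(zs[-1]))
        # backward pass; for the sigmoid, sigma'(z) = a(1 - a)
        delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
        for l in reversed(range(len(Ws))):
            gWs[l] += delta @ acts[l].T
            gbs[l] += delta
            if l > 0:
                delta = (Ws[l].T @ delta) * acts[l] * (1 - acts[l])
    # update, averaging the gradient over the batch
    m = len(X)
    Ws = [W - eta * g / m for W, g in zip(Ws, gWs)]
    bs = [b - eta * g / m for b, g in zip(bs, gbs)]

cost_after = total_cost()
print(cost_before, cost_after)
```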

<h1 id="neural-network-implementation">Neural Network Implementation</h1>

<p>We’ve derived the equations for backpropagation so we’re ready to implement and train a general artificial neural network in Python! But before we dive into the code, our dataset is going to be different than the Iris dataset. I want to highlight how general ANNs can solve more complex problems than single neurons, so the dataset is going to be more complicated.</p>

<p>We’ll be training on a famous dataset called the <strong>MNIST Handwritten Digits</strong> dataset. As the name implies, it’s a dataset of handwritten digits 0-9 represented as grayscale images. Each image is $28\times 28$ pixels and the true label is a digit 0-9. It’s always a good idea to look at the raw data of a dataset that we’re not familiar with so that we understand what the inputs correspond to in the real world.</p>

<p><img src="/images/neural-nets-part-3/mnist.png" alt="MNIST Handwritten Digits Dataset" title="MNIST Handwritten Digits Dataset" /></p>

<p><small>MNIST Handwritten Digits Dataset contains tens of thousands of handwritten digits from 0-9. We can plot some example data from the training set in a grid.</small></p>

<p>Now that we’ve seen some data, we can start writing the data pre-processing step. In practice, this data pipeline is often more important than the exact model or network architecture. Running poorly-processed data through even the state-of-the-art model will produce poor results. To start, we’re going to use the Pytorch machine learning Python framework to load the training and testing data. For a particular grayscale image pixel, there are a lot of data representations, but the most common are (i) an integer value in $[0, 255]$ or (ii) a floating-point value in $[0, 1]$. We’re going to use the latter since it plays more nicely, numerically, with the floating-point parameters of our model (and the sigmoid activation).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">torchvision</span> <span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="c1"># load MNIST dataset
</span><span class="n">train_dataset</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">MNIST</span><span class="p">(</span><span class="s">'./data'</span><span class="p">,</span> <span class="n">train</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">download</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">test_dataset</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">MNIST</span><span class="p">(</span><span class="s">'./data'</span><span class="p">,</span> <span class="n">train</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">download</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">train_dataset</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">test_dataset</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">numpy</span><span class="p">()</span>
<span class="c1"># normalize training data to [0, 1]
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span> <span class="o">=</span> <span class="n">X_train</span> <span class="o">/</span> <span class="mf">255.</span><span class="p">,</span> <span class="n">X_test</span> <span class="o">/</span> <span class="mf">255.</span>
</code></pre></div></div>

<p>We can print the “shape” of this data with <code class="language-plaintext highlighter-rouge">X_train.shape</code>. The first dimension represents the number of examples (either training or test) and the remaining dimensions represent the data. In this case, for the MNIST training set, we have 60,000 examples and the images are all $28\times 28$ pixels so the shape of our training data is a multidimensional array of shape $(60000, 28, 28)$. The test set contains 10,000 examples for evaluation. But our neural network accepts a flat list of input neurons, not a 2D image. An easy way to reconcile this is to flatten each $28\times 28$ image into a single vector of $28\times 28=784$ numbers. This will change the shape of the training data to $(60000, 784)$, but we’ll need to add an extra trailing dimension to make the matrix maths work out, so we want the resulting shape to be $(60000, 784, 1)$ where the last dimension just means that one set of 784 numbers corresponds to 1 input example.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># flatten image into 1d array 
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="o">-</span><span class="mi">1</span><span class="p">),</span> <span class="n">X_test</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">X_test</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># add extra trailing dimension for proper matrix/vector sizes
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[...,</span> <span class="n">np</span><span class="p">.</span><span class="n">newaxis</span><span class="p">],</span> <span class="n">X_test</span><span class="p">[...,</span> <span class="n">np</span><span class="p">.</span><span class="n">newaxis</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Training set size: </span><span class="si">{</span><span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Testing set size: </span><span class="si">{</span><span class="n">X_test</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>So that handles the input data, but what about the output data? Remember the output is a label from 0-9. We could just leave the label alone but there are problems with this numbering. For example, if we were to take an average across a set of output data, we’d end up with a value corresponding to a different output: the average of 0 and 4 is 2. This relation doesn’t really make sense and arises from the fact that our output data is ordinal: an integer between 0-9. We’d rather have each possible output “stretch” out into its own dimension so we can operate on a particular output or set of outputs independently without inadvertently considering all outputs. One way to do this is to literally put each output into its own dimension. This is called a <strong>one-hot encoding</strong> where we create an $n$-dimensional vector where $n$ represents the number of possible output <em>classes</em>. In our specific case, it maps a numerical output to a binary vector with a 1 in the index corresponding to the class: so the digit 2 would be mapped to the vector $\begin{bmatrix}0 &amp; 0 &amp; 1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0 &amp; 0\end{bmatrix}^T$. As with the input data, we’ll expand the last dimension for the same reasons.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">to_onehot</span><span class="p">(</span><span class="n">y</span><span class="p">):</span>
    <span class="s">"""
    Convert index to one-hot representation
    """</span>
    <span class="n">one_hot</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">y</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">10</span><span class="p">))</span>
    <span class="n">one_hot</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">y</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">y</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">one_hot</span>

<span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_dataset</span><span class="p">.</span><span class="n">targets</span><span class="p">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">test_dataset</span><span class="p">.</span><span class="n">targets</span><span class="p">.</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">to_onehot</span><span class="p">(</span><span class="n">y_train</span><span class="p">),</span> <span class="n">to_onehot</span><span class="p">(</span><span class="n">y_test</span><span class="p">)</span>
<span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">y_train</span><span class="p">[...,</span> <span class="n">np</span><span class="p">.</span><span class="n">newaxis</span><span class="p">],</span> <span class="n">y_test</span><span class="p">[...,</span> <span class="n">np</span><span class="p">.</span><span class="n">newaxis</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Training target size: </span><span class="si">{</span><span class="n">y_train</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Test target size: </span><span class="si">{</span><span class="n">y_test</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

</code></pre></div></div>
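<p>As a quick sanity check, a standalone snippet (hypothetical, re-declaring <code class="language-plaintext highlighter-rouge">to_onehot</code> so it runs on its own) confirms the encoding puts the 1 where we expect:</p>

```python
import numpy as np

def to_onehot(y):
    """Convert index to one-hot representation"""
    one_hot = np.zeros((y.shape[0], 10))
    one_hot[np.arange(y.shape[0]), y] = 1
    return one_hot

# the digit 2 gets a 1 at index 2; the digit 0 gets a 1 at index 0
encoded = to_onehot(np.array([2, 0]))
print(encoded[0])  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
print(encoded[1])  # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
```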

<p>Now we’re ready to instantiate our neural network class with a list of neurons per layer and train it!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ann</span> <span class="o">=</span> <span class="n">ArtificialNeuralNetwork</span><span class="p">(</span><span class="n">layer_sizes</span><span class="o">=</span><span class="p">[</span><span class="mi">784</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>

<span class="n">training_params</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'num_epochs'</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span>
    <span class="s">'minibatch_size'</span><span class="p">:</span> <span class="mi">16</span><span class="p">,</span>
    <span class="s">'cost'</span><span class="p">:</span> <span class="n">QuadraticCost</span><span class="p">,</span>
    <span class="s">'learning_rate'</span><span class="p">:</span> <span class="mf">3.0</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Training params: </span><span class="si">{</span><span class="n">training_params</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="n">ann</span><span class="p">.</span><span class="n">train</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="o">**</span><span class="n">training_params</span><span class="p">)</span>
</code></pre></div></div>

<p>There are a few parameters that haven’t been explained yet, but we’ll get to them. Even before the class definition, let’s define the activation and cost functions and their derivatives.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Sigmoid</span><span class="p">:</span>
    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
        <span class="k">return</span> <span class="mf">1.</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">Sigmoid</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">z</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">Sigmoid</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>

<span class="k">class</span> <span class="nc">QuadraticCost</span><span class="p">:</span>
    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="k">return</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">a</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">a</span> <span class="o">-</span> <span class="n">y</span> 

</code></pre></div></div>
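<p>Before moving on, we can sanity-check that <code class="language-plaintext highlighter-rouge">backward</code> really is the derivative of <code class="language-plaintext highlighter-rouge">forward</code> by comparing it against a central finite difference. This is a standalone sketch (it re-declares <code class="language-plaintext highlighter-rouge">Sigmoid</code> so the snippet runs on its own):</p>

```python
import numpy as np

class Sigmoid:
    @staticmethod
    def forward(z):
        return 1. / (1. + np.exp(-z))

    @staticmethod
    def backward(z):
        return Sigmoid.forward(z) * (1 - Sigmoid.forward(z))

# central finite difference: f'(z) ~ (f(z + h) - f(z - h)) / 2h
z = np.linspace(-3, 3, 7)
h = 1e-5
numeric = (Sigmoid.forward(z + h) - Sigmoid.forward(z - h)) / (2 * h)
max_err = np.max(np.abs(numeric - Sigmoid.backward(z)))
print(max_err)  # vanishingly small
```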

<p>The forward pass computes the output based on the input and the backward pass computes the gradient. Note that the forward pass of the quadratic cost computes a vector norm since the inputs are 10-dimensional vectors and the cost function generally outputs a scalar. Now we can define the class and constructor. For the most part, we’ll just copy over the input parameters as well as initialize the weights and biases.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ArtificialNeuralNetwork</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">layer_sizes</span><span class="p">:</span> <span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">Sigmoid</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">layer_sizes</span> <span class="o">=</span> <span class="n">layer_sizes</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_layers</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">layer_sizes</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">activation_fn</span> <span class="o">=</span> <span class="n">activation_fn</span>
        <span class="c1"># use a unit normal distribution to initialize weights and biases
</span>        <span class="c1"># performs better in practice than initializing to zeros
</span>        <span class="c1"># note that weights are j in layer [i] to k in layer [i-1]
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
                <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">layer_sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">layer_sizes</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])]</span>
        <span class="c1"># since the first layer is an input layer, we don't have biases for it
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">layer_sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:]]</span>
</code></pre></div></div>
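<p>Concretely, the zip over adjacent layer sizes produces one weight matrix and one bias vector per non-input layer. A quick shape check (a hypothetical snippet mirroring the constructor above, for <code class="language-plaintext highlighter-rouge">layer_sizes=[784, 32, 10]</code>):</p>

```python
import numpy as np

layer_sizes = [784, 32, 10]
# weight matrix for layer i has shape (neurons in layer i, neurons in layer i-1)
weights = [np.random.randn(j, k) for j, k in zip(layer_sizes[1:], layer_sizes[:-1])]
# one bias per neuron in each non-input layer
biases = [np.random.randn(j, 1) for j in layer_sizes[1:]]

print([W.shape for W in weights])  # [(32, 784), (10, 32)]
print([b.shape for b in biases])   # [(32, 1), (10, 1)]
```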

<p>Notice that we’re initializing the weights and biases with a standard normal distribution rather than with zeros. This is to intentionally create asymmetry in the neurons so that they learn independently! The next function to implement is the training function. This follows from the previous ones we’ve written where we iterate over the number of epochs and then create minibatches and iterate over those.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="n">num_epochs</span> <span class="o">=</span> <span class="n">kwargs</span><span class="p">[</span><span class="s">'num_epochs'</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span> <span class="o">=</span> <span class="n">kwargs</span><span class="p">[</span><span class="s">'minibatch_size'</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cost</span> <span class="o">=</span> <span class="n">kwargs</span><span class="p">[</span><span class="s">'cost'</span><span class="p">]</span> 
        <span class="bp">self</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">=</span> <span class="n">kwargs</span><span class="p">[</span><span class="s">'learning_rate'</span><span class="p">]</span>

        <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">):</span>
            <span class="c1"># shuffle data each epoch
</span>            <span class="n">permute_idxes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">permutation</span><span class="p">(</span><span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
            <span class="n">X_train</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">permute_idxes</span><span class="p">]</span>
            <span class="n">y_train</span> <span class="o">=</span> <span class="n">y_train</span><span class="p">[</span><span class="n">permute_idxes</span><span class="p">]</span>
            <span class="n">epoch_cost</span> <span class="o">=</span> <span class="mi">0</span>

            <span class="k">for</span> <span class="n">start</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span><span class="p">):</span>
                <span class="n">minibatch_cost</span> <span class="o">=</span> <span class="mi">0</span>
                <span class="c1"># partition dataset into minibatches
</span>                <span class="n">Xs</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">start</span><span class="o">+</span><span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span><span class="p">]</span>
                <span class="n">ys</span> <span class="o">=</span> <span class="n">y_train</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">start</span><span class="o">+</span><span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span><span class="p">]</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_zero_grad</span><span class="p">()</span>
                <span class="k">for</span> <span class="n">x_i</span><span class="p">,</span> <span class="n">y_i</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">Xs</span><span class="p">,</span> <span class="n">ys</span><span class="p">):</span>
                    <span class="n">a</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x_i</span><span class="p">)</span>
                    <span class="n">d_nabla_W</span><span class="p">,</span> <span class="n">d_nabla_b</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_backward</span><span class="p">(</span><span class="n">y_i</span><span class="p">)</span>
                    <span class="bp">self</span><span class="p">.</span><span class="n">_accumulate_grad</span><span class="p">(</span><span class="n">d_nabla_W</span><span class="p">,</span> <span class="n">d_nabla_b</span><span class="p">)</span>
                    <span class="n">minibatch_cost</span> <span class="o">+=</span> <span class="bp">self</span><span class="p">.</span><span class="n">cost</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y_i</span><span class="p">)</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_step</span><span class="p">()</span>
                <span class="n">minibatch_cost</span> <span class="o">=</span> <span class="n">minibatch_cost</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span>
                <span class="n">epoch_cost</span> <span class="o">+=</span> <span class="n">minibatch_cost</span>

            <span class="n">test_set_num_correct</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_correct</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
            <span class="n">test_set_accuracy</span> <span class="o">=</span> <span class="n">test_set_num_correct</span> <span class="o">/</span> <span class="n">X_test</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">: </span><span class="se">\
</span><span class="s">                </span><span class="se">\t</span><span class="s">Loss: </span><span class="si">{</span><span class="n">epoch_cost</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> </span><span class="se">\
</span><span class="s">                </span><span class="se">\t</span><span class="s">test set acc: </span><span class="si">{</span><span class="n">test_set_accuracy</span><span class="o">*</span><span class="mi">100</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">% </span><span class="se">\
</span><span class="s">                        (</span><span class="si">{</span><span class="n">test_set_num_correct</span><span class="si">}</span><span class="s"> / </span><span class="si">{</span><span class="n">X_test</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s">)"</span><span class="p">)</span>
</code></pre></div></div>

<p>There are a lot of functions that we haven’t defined yet. The first loop defines the outer loop for the epochs, then we create minibatches and iterate over those. At the start of each minibatch, we zero out any accumulated gradient since we’ll be performing a gradient descent update for each minibatch. In the innermost loop for each individual training example, notice that we do a forward pass and a backward pass that computes the weights and biases gradients. We accumulate these gradients over the minibatch. Then we call this <code class="language-plaintext highlighter-rouge">self._step()</code> function to perform one step of gradient descent optimization to update all of the model parameters. At the end of each epoch, we compute the accuracy on the test set. (There’s a better way to compute incremental progress using something called a <strong>validation set</strong>.)</p>

<p>Going from top to bottom, the first function we encounter is <code class="language-plaintext highlighter-rouge">self._zero_grad()</code>, called at the beginning of the minibatch loop since, for stochastic gradient descent, we accumulate the gradient over the minibatch and perform a single parameter update with it. This function zeros out the accumulated gradient so each minibatch starts fresh.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">_zero_grad</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">nabla_W</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">W</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">W</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">biases</span><span class="p">]</span>
</code></pre></div></div>

<p>We’re going to skip over the forward and backward passes to the <code class="language-plaintext highlighter-rouge">self._accumulate_grad(d_nabla_W, d_nabla_b)</code>. This folds in the gradient for a single training example into the total accumulated gradient across the minibatch.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">_accumulate_grad</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">d_nabla_W</span><span class="p">,</span> <span class="n">d_nabla_b</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">nabla_W</span> <span class="o">=</span> <span class="p">[</span><span class="n">nw</span> <span class="o">+</span> <span class="n">dnw</span> <span class="k">for</span> <span class="n">nw</span><span class="p">,</span> <span class="n">dnw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">nabla_W</span><span class="p">,</span> <span class="n">d_nabla_W</span><span class="p">)]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">nb</span> <span class="o">+</span> <span class="n">dnb</span> <span class="k">for</span> <span class="n">nb</span><span class="p">,</span> <span class="n">dnb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">d_nabla_b</span><span class="p">)]</span>
</code></pre></div></div>

<p>The last function <code class="language-plaintext highlighter-rouge">self._step()</code> applies one step of gradient descent optimization and updates all of the weights and biases from the averaged accumulated gradient.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">_step</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="o">-</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span><span class="p">)</span> <span class="o">*</span> <span class="n">nw</span> 
                <span class="k">for</span> <span class="n">w</span><span class="p">,</span> <span class="n">nw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">weights</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">nabla_W</span><span class="p">)]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">b</span> <span class="o">-</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span><span class="p">)</span> <span class="o">*</span> <span class="n">nb</span>
                <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">nb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">biases</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">nabla_b</span><span class="p">)]</span>
</code></pre></div></div>
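<p>As a toy illustration of the update rule (hypothetical numbers, not from the MNIST run): a single weight whose gradient summed to 8.0 over a minibatch of size 16, with a learning rate of 3.0, moves by $\frac{3.0}{16} \times 8.0 = 1.5$:</p>

```python
learning_rate = 3.0
minibatch_size = 16
w = 0.5
accumulated_grad = 8.0  # sum of per-example gradients over the minibatch

# same update as _step: subtract the averaged gradient scaled by the learning rate
w = w - (learning_rate / minibatch_size) * accumulated_grad
print(w)  # -1.0
```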

<p>Those are all functions that operate on the gradient and weights and biases and perform simpler calculations. The crux of this class lies in the forward and backward pass functions. For the forward pass, we define the first activation as the input and iterate through the layers, applying the corresponding weights, biases, and activation functions. Along the way, we cache the activations and pre-activations since the backward pass will need them.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">a</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">activations</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">zs</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">W</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">weights</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">biases</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">a</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">zs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">a</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">activation_fn</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">activations</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">a</span>
</code></pre></div></div>

<p>The backward pass simply implements the backpropagation equations we derived earlier. The only consideration is that we start at the output layer, applying the derivatives of the cost and activation functions there, and then move backwards through the layers. One thing we do is exploit Python’s negative indexing so that index -1 refers to the last layer, -2 to the second-to-last layer, and so on.</p>
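<p>As a quick refresher on that convention (a standalone illustration, not part of the network class), negative indices count from the end of a list, which is exactly how the loop below walks backwards through the layers:</p>

```python
# stand-in names for the layers of a 3-layer network
layer_names = ['input', 'hidden', 'output']
num_layers = len(layer_names)

print(layer_names[-1])  # output (the last layer)
for l in range(2, num_layers):
    # -l walks backwards starting from the second-to-last layer
    print(l, layer_names[-l])
```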

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">_backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="n">nabla_W</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">W</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">W</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">biases</span><span class="p">]</span>

        <span class="n">z</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">a_L</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">delta</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">cost</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">a_L</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">activation_fn</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
        <span class="n">a</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">nabla_W</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">a</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
        <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>

        <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_layers</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span>
            <span class="n">W</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">weights</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
            <span class="n">delta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W</span><span class="p">.</span><span class="n">T</span><span class="p">,</span> <span class="n">delta</span><span class="p">)</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">activation_fn</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>

            <span class="n">a</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
            <span class="n">nabla_W</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">a</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
            <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
        <span class="k">return</span> <span class="n">nabla_W</span><span class="p">,</span> <span class="n">nabla_b</span>
</code></pre></div></div>

<p>Finally, we have an evaluation function that computes the number of correctly classified examples. We run each input through the network, take the index of the largest activation in the output layer, and compare it against the index of the 1 in the one-hot-encoded label vector.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">num_correct</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">):</span>
        <span class="n">results</span> <span class="o">=</span> <span class="p">[(</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)),</span> <span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">y</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)]</span>
        <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="n">results</span><span class="p">)</span>
</code></pre></div></div>

<p>And that’s it! We can run the code and train our neural network and see the output! Even with our simple neural network we can get to &gt;95% accuracy on the test set! Try messing around with the other input parameters!</p>

<p>The full code listing can be found <a href="https://gist.github.com/mohitd/609bba8838ff1a473dab74e829d31792">here</a>.</p>

<h1 id="conclusion">Conclusion</h1>

<p>We did a lot in this article! We started off with our modern neuron model and extended it into layers to support multi-layer neural networks. We defined a bunch of notation to perform a forward pass to propagate the inputs all the way to the last layer. Then we learned how to automatically compute the gradient across all weights and biases using the backpropagation algorithm. We demonstrated the concept with a computation graph, derived the necessary equations to backpropagate the gradient, and then coded a neural network in Python and numpy and trained it on the MNIST handwritten digits dataset.</p>

<p>We have a functioning neural network written in Numpy now! We’re able to get pretty good accuracy on the MNIST data as well. However, this dataset has been around for decades; is that really the best we can do? This is a good start, but we’re going to learn how to make our neural networks even better with some modern training techniques 🙂</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Organizing multiple artificial neurons, I'll describe how to construct and train neural networks using the most fundamental and important algorithm in all of deep learning: backpropagation of errors.]]></summary></entry><entry><title type="html">Neural Nets - Part 2: From Perceptrons to Modern Artificial Neurons</title><link href="/neural-nets-part-2.html" rel="alternate" type="text/html" title="Neural Nets - Part 2: From Perceptrons to Modern Artificial Neurons" /><published>2023-12-04T00:00:00+00:00</published><updated>2023-12-04T00:00:00+00:00</updated><id>/neural-nets-part-2</id><content type="html" xml:base="/neural-nets-part-2.html"><![CDATA[<p>In the previous post, we motivated perceptrons as the smallest and simplest form of artificial neuron inspired by biological neurons. A perceptron consumed some inputs, applied a threshold function, and produced a binary output. While it did seem to model biology, it wasn’t quite as useful for machines. We applied some changes to make it more machine-friendly and applied it to learning logic gates. We even stacked perceptrons to form a 2-layer network to solve the XOR gate problem! However, we “solved” for the weights and biases just by inspection and logical reasoning. In practice, neural networks can have millions of parameters, so this strategy is not feasible.</p>

<p>In this post, we’ll learn how to automatically solve for the parameters of a simple neural model. In doing so, we’ll make a number of modifications that will evolve the perceptron into a more modern artificial neuron that we can use as a building block for wider and deeper neural networks. Similar to last time, we’ll implement this new artificial neuron using Python and numpy.</p>

<h1 id="gradient-descent">Gradient Descent</h1>

<p>In the past, we solved for the weights and biases by inspection. This was feasible since logic gates are human-interpretable and the number of parameters was small. Now consider trying to come up with a network for detecting a mug in an image. This is a much more complicated task that requires understanding what a “mug” even is. What do the weights correspond to, and how would we set them manually? In practice, we have very wide and deep networks with hundreds of thousands, millions, or even billions of parameters, and we need a way to find appropriate values for these to solve our objective, e.g., emulating logic gates or detecting mugs in images.</p>

<p>Fortunately for us, there already exists a field of mathematics that’s been around for a long time that specializes in solving for these parameters: <strong>numerical optimization</strong> (or just optimization for short). In an optimization problem, we have a mathematical <em>model</em>, often represented as a function like $f(x; \theta)$ with some inputs $x$ and parameters $\theta$, and the goal is to find the values of the parameters such that some objective function $C$ satisfies some criteria. In most cases, we’re minimizing or maximizing it; in the former case, this objective function is sometimes called a cost/loss/error function, and in the latter case a utility/reward function (the names within each group are interchangeable). There’s already a vast literature of numerical optimization techniques to draw from, so we should try to leverage these rather than building something from scratch.</p>

<p>More specifically, in the case of our problem, we have a neural network represented by a function $f(x;W,b)$ that accepts input $x$ and is parameterized by its weights $W$ and biases $b$. But to use the framework of numerical optimization and its techniques, we need an objective function. In other words, how do we quantify what a “good” model is? In our classification task, we want to ensure that the output of the model is the same as the desired training label for all training examples so we can intuitively think of this as trying to minimize the mistakes of our model output from the true training label.</p>

\[C = \displaystyle\sum_i \vert f(x_i;W,b) - y_i\vert\]

<p>Notice we used the absolute value since any difference will increase our cost; verify for yourself (for a single training example) that when the output of the model differs from $y_i$, $C &gt; 0$, and when the output is the same as $y_i$, $C = 0$. This cost function is also sometimes called <strong>mean absolute error (MAE)</strong>. (Sometimes we’ll see a $\frac{1}{N}$, where $N$ is the number of training examples, in front of the sum, but this is just a constant factor that makes the maths easier so we can omit it without any issue.) We only get $C=0$ if, for every training example, $f(x_i;W,b) = y_i$, i.e., our model always classifies correctly. Now we have our model and our cost function, so we can try to figure out which optimization approach is well-suited to our problem.</p>
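<p>As a quick sanity check (a toy sketch with made-up predictions and labels, not part of the original post), we can evaluate this cost on a few hypothetical training examples:</p>

```python
import numpy as np

# hypothetical model outputs and binary labels (illustrative values only)
preds = np.array([1.0, 0.0, 1.0])
labels = np.array([1.0, 1.0, 1.0])

# sum of absolute differences: zero only when every prediction matches its label
C = np.sum(np.abs(preds - labels))
print(C)  # 1.0: exactly one mismatched example contributes to the cost
```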

<p>One bifurcation of the numerical optimization field is between gradient-free and gradient-based methods. Recall from calculus that a gradient measures the rate-of-change of a function with respect to all of its inputs. This is extra information, beyond the objective function itself, that will have to be computed and maintained if we choose to use it. So the former set of methods describes approaches where we don’t need this extra information and rely on just the values of the objective function itself. The latter describes the set of methods where we do use this extra information. In practice, gradient-based methods tend to work better for neural networks since they tend to converge to a better solution, i.e., they more quickly find a set of parameters with lower cost, but it should be noted there are techniques that optimize neural networks using gradient-free approaches as well.</p>

<p>The idea behind gradient-based methods is to compute a partial derivative of the cost function with respect to each parameter $\frac{\p C}{\p \theta_i}$ of the model. From calculus, we can arrange these partial derivatives in a vector called the <strong>gradient</strong> $\nabla_\theta C$. This quantity tells us how changes in a parameter $\theta_i$ correspond to changes in the cost function $C$; specifically, it tells us how to change $\theta_i$ to increase the value of $C$. Mathematically speaking, if we have a function of a single variable $C(\theta)$ and a little change in its inputs $\Delta\theta$, then $C(\theta + \Delta\theta)\approx C(\theta)+\frac{\p C}{\p\theta}\Delta\theta$; in other words, a little change in the input is mapped to a little change in the output, but proportional to how the cost function changes with respect to that little input: $\frac{\p C}{\p \theta}$. This is very useful because it can tell us in which direction to move $\theta_i$ such that the value of $C$ decreases, i.e., $-\frac{\p C}{\p \theta_i}$. Remember that in our ideal case, we want $C=0$ (we minimize cost functions and maximize reward functions), and the negative of the partial derivatives tell us exactly how to accomplish this. With this information, we can nudge the parameters $\theta$ using the gradient of the cost function.</p>

\[\theta_i\gets\theta_i - \eta\displaystyle\frac{\p C}{\p \theta_i}\]

<p>or in vectorized form</p>

\[\theta\gets\theta - \eta\nabla_\theta C\]

<p>Just like with perceptrons, we’ll have a <strong>learning rate</strong> $\eta$ that is a tuning parameter that tells us how much to adjust the current $\theta$ by. If we do this, we can find the values of the parameters such that the cost function is minimized! This optimization technique is called <strong>gradient descent</strong>.</p>

<p>(Note that I’ll be a bit sloppy with my nomenclature and interchangeably say “partial derivative” and “gradient”, but just remember the definition of the gradient of a function: the vector of all partial derivatives of the function with respect to each parameter.)</p>

<p>One intuitive way to visualize gradient descent is to think about $C$ as an “elevation”, like on a topographic map, where the objective is to find the single lowest valley. Mathematically, we’re trying to find the <strong>global minima</strong> of the cost function. If we could analytically and tractably compute $C$ exactly with respect to all parameters and the entire dataset, then we could just use calculus to solve for the global minima directly and be finished! However, the complexity of neural networks along with the size of the datasets they’re often trained on makes this approach infeasible.</p>

<p><img src="/images/neural-nets-part-2/gd.png" alt="Gradient Descent" title="Gradient Descent" /></p>

<p><small>Suppose we have a very well-behaved cost function $C(x, y) = x^2+y^2$ with a single global minima. The idea behind gradient descent is to start at some random point $(x, y)$, e.g., $(5, 5)$ in this example, on this cost surface and incrementally move in a way such that we ultimately arrive at the lowest-possible point. The left figure shows the 3D mesh of the cost function (z axis is the value of the cost function for its x and y axis inputs) as well as the path that gradient descent will take us from the starting point $(5, 5)$ to the global minima at the origin. The right figure shows the same, but a top-down view where the colors represent the value of the cost function.</small></p>

<p>Instead, imagine we’re at some starting point on the cost surface. The negative of the gradient tells us how to move the parameters from where we currently are to get to a slightly lower point on the cost surface. If the cost function is well-behaved, this should decrease our overall cost. We repeatedly do this until we’re at a point on the cost surface where, no matter which direction we nudge our parameters, the cost always increases. This is a minima! Depending on the cost function, we might have multiple <strong>local minima</strong>, which are locally optimal within some bounds of the cost function but not optimal across the <em>entire</em> cost function; that would be the global minima, which is the best solution.</p>
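<p>We can make this concrete with the toy cost surface from the figure above, $C(x, y) = x^2 + y^2$, starting from $(5, 5)$ (a minimal sketch; the learning rate here is an arbitrary choice):</p>

```python
import numpy as np

def grad_C(theta):
    # gradient of C(x, y) = x^2 + y^2 is (2x, 2y)
    return 2 * theta

theta = np.array([5.0, 5.0])  # starting point on the cost surface
eta = 0.1                     # learning rate (arbitrary)

for _ in range(100):
    theta = theta - eta * grad_C(theta)  # theta <- theta - eta * grad C

print(theta)  # both components shrink toward the global minima at the origin
```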

<p>Another intuitive way to think about this is suppose someone took us hiking and we got lost. All we know is that there is a town in the valley of the mountain but there’s a thick fog so we’re unable to see far out. Rather than picking a random direction to walk in, we can look around (within the visibility of the fog) to see if the elevation goes downhill from where we currently are, and then move in that direction. While we’re moving, we’re constantly evaluating which direction would bring us downhill. We repeat until, no matter which direction we look, we’re always going uphill.</p>

<p>Let’s apply what we’ve learned so far to the same Iris dataset example we did last time! Let’s try to train our perceptron using gradient descent. We’ll use the cost function above and analytically compute the gradients to update the weights. However, we’ll run into an immediate problem: the Heaviside step function we’re using as an activation function. Recall its definition:</p>

\[f(\theta)=\begin{cases}
1 &amp; \theta \geq 0 \\
0 &amp; \theta &lt; 0 \\
\end{cases}\]

<p>We’ll be computing a gradient, and this step function has a discontinuity at 0. That alone isn’t a huge issue; the larger issue is that the gradient will be 0 everywhere else, since the step function is piecewise constant and the derivative of a constant is always 0. We’ll get no parameter update from gradient descent, and our model won’t learn a thing! So this choice of activation function isn’t going to work; we need an activation function that actually has a useful gradient.</p>
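<p>We can see the problem numerically (a quick toy check): a central finite-difference estimate of the step function’s derivative is 0 at any point away from the discontinuity, so gradient descent gets no signal:</p>

```python
import numpy as np

def heaviside(z):
    # Heaviside step function: 1 if z >= 0, else 0
    return np.where(z >= 0, 1.0, 0.0)

# central finite-difference derivative estimates away from the step at 0
eps = 1e-4
points = [-2.0, -0.5, 0.5, 2.0]
derivs = [float((heaviside(z + eps) - heaviside(z - eps)) / (2 * eps)) for z in points]
print(derivs)  # [0.0, 0.0, 0.0, 0.0]: no gradient signal anywhere off the step
```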

<p>Rather than picking the step function, we can try to pick a differentiable function that looks just like a step function. Fortunately for us, there exists a whole class of functions called <strong>logistic functions</strong> that closely resemble this step function. One specific logistic curve is called the <strong>sigmoid</strong>.</p>

\[\sigma(z) \equiv \frac{1}{1+e^{-z}}\]

<p>The function itself actually even looks like a smooth version of the step function!</p>

<p><img src="/images/neural-nets-part-2/sigmoid.png" alt="Sigmoid" title="Sigmoid" /></p>

<p><small>The sigmoid (right) can be considered a smooth version of the Heaviside step function (left) so it can be differentiated an infinite amount of times. Both map their unbounded input to a bounded output, but the nuance is that the step function bound is inclusive $[0, 1]$ while the sigmoid bound is exclusive $(0, 1)$ because of the asymptotes.</small></p>

<p>Note that if the input $z$ is very large and positive, then the sigmoid function asymptotes/<strong>saturates</strong> to $1$ and if the input is very large and negative, the sigmoid function asymptotes to $0$. In other words, it maps the unbounded real number line $(-\infty, \infty)$ to the bounded interval $(0, 1)$. The sigmoid is smooth in that we can take a derivative, and that little changes in the input will map to little changes in the output. In fact, the derivative of the sigmoid can be expressed in terms of the sigmoid itself (thanks to the properties of the $e$ in its definition!)</p>

\[\sigma'(z) = \sigma(z)(1 - \sigma(z))\]

<p>It’s a good exercise to verify this for yourself! (Hint: rewrite $\sigma(z) = (1+e^{-z})^{-1}$ and use the power rule.)</p>
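<p>Alternatively (a quick numerical sketch), we can check the identity by comparing $\sigma'(z) = \sigma(z)(1-\sigma(z))$ against a central finite-difference estimate of the derivative:</p>

```python
import numpy as np

def sigmoid(z):
    return 1. / (1 + np.exp(-z))

def dsigmoid(z):
    # analytic derivative, expressed in terms of the sigmoid itself
    return sigmoid(z) * (1 - sigmoid(z))

z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# maximum discrepancy between the analytic and numeric derivatives is tiny
print(np.max(np.abs(numeric - dsigmoid(z))))
```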

<p>Now let’s replace our step function with the sigmoid so we do end up with nonzero derivatives. Remember we’re trying to compute the gradient of the cost function with respect to the two weights $w_1$ and $w_2$ and the bias $b$. Substituting and expanding the cost function for a single training example, we get the following.</p>

\[C = \vert \sigma(w_1 x_1 + w_2 x_2 + b) - y\vert\]

<p>Let’s start with computing $\frac{\p C}{\p w_1}$ and the other derivatives will follow. We’ll need to make liberal use of the chain rule; the way I remember it is “derivative of the outside with respect to the inside, times the derivative of the inside”: $\frac{\d}{\d x}f(g(x)) = f’(g(x))g’(x)$. We’ll also need to know that the derivative of the absolute value function $f(x)=\vert x\vert$ is the sign function $\sgn(x)$, which returns 1 if the input is positive, -1 if the input is negative, and is mathematically undefined if the input is 0; practically, in this specific example, we can let $\sgn(0) = 0$. (Similar to the Heaviside step function, we can see this by plotting the absolute value function, which looks like a ‘V’, and noting that both sides of the ‘V’ have a constant slope of $\pm 1$ depending on the side of the ‘V’.)</p>

\[\begin{align*}
\displaystyle\frac{\p C}{\p w_1} &amp;= \frac{\p}{\p w_1} \vert \sigma(w_1 x_1 + w_2 x_2 + b) - y\vert \\
&amp;= \sgn\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big] \frac{\p}{\p w_1}\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big]\\
&amp;= \sgn\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big] \frac{\p}{\p w_1}\sigma(w_1 x_1 + w_2 x_2 + b)\\
&amp;= \sgn\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big] \sigma'(w_1 x_1 + w_2 x_2 + b)\frac{\p}{\p w_1}\big[w_1 x_1 + w_2 x_2 + b\big]\\
&amp;= \sgn\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big] \sigma'(w_1 x_1 + w_2 x_2 + b)\frac{\p}{\p w_1}\big[w_1 x_1\big]\\
&amp;= \sgn\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big] \sigma'(w_1 x_1 + w_2 x_2 + b)x_1\\
&amp;= \sgn(a - y) \sigma'(z)x_1\\
\end{align*}\]

<p>In the last step, we simplified by substituting back $z=w_1 x_1 + w_2 x_2 + b$ and $a=\sigma(z)$. Similarly, the other derivatives follow from this one with only minor changes in the last few steps so we can compute them all.</p>

\[\begin{align*}
\displaystyle\frac{\p C}{\p w_1} &amp;= \sgn(a - y) \sigma'(z)x_1 \\
\displaystyle\frac{\p C}{\p w_2} &amp;= \sgn(a - y) \sigma'(z)x_2 \\
\displaystyle\frac{\p C}{\p b} &amp;= \sgn(a - y) \sigma'(z) \\
\end{align*}\]

<p>Another way to think about these derivatives that will be useful for implementation in code is expanding out the partials in accordance with the chain rule.</p>

\[\begin{align*}
\displaystyle\frac{\p C}{\p w_1} &amp;= \displaystyle\frac{\p C}{\p a} \displaystyle\frac{\p a}{\p z}\displaystyle\frac{\p z}{\p w_1} \\
\displaystyle\frac{\p C}{\p w_2} &amp;= \displaystyle\frac{\p C}{\p a} \displaystyle\frac{\p a}{\p z}\displaystyle\frac{\p z}{\p w_2} \\
\displaystyle\frac{\p C}{\p b} &amp;= \displaystyle\frac{\p C}{\p a} \displaystyle\frac{\p a}{\p z}\displaystyle\frac{\p z}{\p b} \\
\end{align*}\]

<p>So the first two terms in each of these are the same, and it’s only the last term that we have to actually change. Now that we have these gradients computed analytically, we can get around to writing code!</p>
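<p>Before turning these into code, it’s worth sanity-checking the derivation itself. The following sketch (with arbitrary illustrative values for the weights, bias, and training example) compares the analytic $\frac{\p C}{\p w_1} = \sgn(a - y)\sigma'(z)x_1$ against a central finite-difference estimate of the cost:</p>

```python
import numpy as np

def sigmoid(z):
    return 1. / (1 + np.exp(-z))

def cost(w1, w2, b, x1, x2, y):
    # single-example absolute-error cost from the derivation above
    return np.abs(sigmoid(w1 * x1 + w2 * x2 + b) - y)

# arbitrary test point (illustrative values only)
w1, w2, b = 0.3, -0.2, 0.1
x1, x2, y = 1.5, -0.7, 1.0

# analytic gradient: sgn(a - y) * sigmoid'(z) * x1
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)
analytic = np.sign(a - y) * sigmoid(z) * (1 - sigmoid(z)) * x1

# central finite-difference estimate of dC/dw1
eps = 1e-6
numeric = (cost(w1 + eps, w2, b, x1, x2, y) - cost(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
print(abs(analytic - numeric))  # tiny: the derivation checks out
```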

<p>A sketch of the general training algorithm is going to look like this.</p>
<ol>
  <li>For each epoch
    <ol>
      <li>For each training example $(x, y)$
        <ol>
          <li>Perform a forward pass through the model $y = f(x)$</li>
          <li>Perform a backward pass to compute weight $\frac{\p C}{\p W_i}$ and bias $\frac{\p C}{\p b}$ gradients</li>
        </ol>
      </li>
      <li>Update the weights and biases</li>
    </ol>
  </li>
</ol>

<p>We refer to passing an input through the network to get an output as a <strong>forward pass</strong> and computing gradients as a <strong>backward pass</strong> because of the nature of how we perform both computations: starting from the input, through the parameters of the model, to the output; and from the cost function back through the model parameters toward the input. We’ll see the same nomenclature in the literature and in neural network libraries such as Tensorflow and Pytorch.</p>

<p>Let’s first start by defining our cost and activation functions and their derivatives.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">cost</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">true</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">pred</span> <span class="o">-</span> <span class="n">true</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">dcost</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">true</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">sign</span><span class="p">(</span><span class="n">pred</span> <span class="o">-</span> <span class="n">true</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="k">return</span> <span class="mf">1.</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">dsigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>
</code></pre></div></div>

<p>Now we can define an <code class="language-plaintext highlighter-rouge">ArtificialNeuron</code> class that trains its weights and biases using the rough sketch of the algorithm above</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ArtificialNeuron</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_size</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">num_epochs</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">=</span> <span class="n">learning_rate</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_epochs</span> <span class="o">=</span> <span class="n">num_epochs</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_W</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">input_size</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_b</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">costs_</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">num_examples</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">num_epochs</span><span class="p">):</span>
            <span class="n">costs</span> <span class="o">=</span> <span class="mi">0</span>
            <span class="n">dW</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_W</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
            <span class="n">db</span> <span class="o">=</span> <span class="mi">0</span>
            <span class="k">for</span> <span class="n">x_i</span><span class="p">,</span> <span class="n">y_i</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
                <span class="c1"># forward pass
</span>                <span class="n">a_i</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_forward</span><span class="p">(</span><span class="n">x_i</span><span class="p">)</span>

                <span class="c1"># backward pass
</span>                <span class="n">dW_i</span><span class="p">,</span> <span class="n">db_i</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_backward</span><span class="p">(</span><span class="n">x_i</span><span class="p">,</span> <span class="n">y_i</span><span class="p">)</span>

                <span class="c1"># accumulate cost and gradient
</span>                <span class="n">costs</span> <span class="o">+=</span> <span class="n">cost</span><span class="p">(</span><span class="n">a_i</span><span class="p">,</span> <span class="n">y_i</span><span class="p">)</span>
                <span class="n">dW</span> <span class="o">+=</span> <span class="n">dW_i</span>
                <span class="n">db</span> <span class="o">+=</span> <span class="n">db_i</span>

            <span class="c1"># average cost and gradients across number of examples
</span>            <span class="n">dW</span> <span class="o">=</span> <span class="n">dW</span> <span class="o">/</span> <span class="n">num_examples</span>
            <span class="n">db</span> <span class="o">=</span> <span class="n">db</span> <span class="o">/</span> <span class="n">num_examples</span>
            <span class="n">costs</span> <span class="o">=</span> <span class="n">costs</span> <span class="o">/</span> <span class="n">num_examples</span>

            <span class="c1"># update weights
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">_W</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_W</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">*</span> <span class="n">dW</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_b</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_b</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">*</span> <span class="n">db</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">costs_</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">costs</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">_forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="c1"># compute and cache intermediate values for backwards pass
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">_W</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">_b</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">a</span> <span class="o">=</span> <span class="n">sigmoid</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">z</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">a</span>

    <span class="k">def</span> <span class="nf">_backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="c1"># compute gradients
</span>        <span class="n">dW</span> <span class="o">=</span> <span class="n">dcost</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">dsigmoid</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">z</span><span class="p">)</span> <span class="o">*</span> <span class="n">x</span>
        <span class="n">db</span> <span class="o">=</span> <span class="n">dcost</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">dsigmoid</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">z</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">dW</span><span class="p">,</span> <span class="n">db</span>
</code></pre></div></div>

<p>That’s it! Now we can load the data and use this new neuron model to train a classifier!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Load the Iris dataset
</span><span class="n">iris</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_iris</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">data</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">target</span>

<span class="c1"># Select only the Setosa and Versicolor classes (classes 0 and 1)
</span><span class="n">setosa_versicolor_mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">target</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">target</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">setosa_versicolor_mask</span><span class="p">]</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">target</span><span class="p">[</span><span class="n">setosa_versicolor_mask</span><span class="p">]</span>

<span class="c1"># Extract the sepal length and petal length features into a dataset
</span><span class="n">sepal_length</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">petal_length</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">sepal_length</span><span class="p">,</span> <span class="n">petal_length</span><span class="p">]).</span><span class="n">T</span>

<span class="c1"># Train the artificial neuron 
</span><span class="n">an</span> <span class="o">=</span> <span class="n">ArtificialNeuron</span><span class="p">(</span><span class="n">input_size</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">an</span><span class="p">.</span><span class="n">train</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span>

<span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>

<span class="c1"># Create a scatter plot of values
</span><span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">sepal_length</span><span class="p">[</span><span class="n">target</span> <span class="o">==</span> <span class="mi">0</span><span class="p">],</span> <span class="n">petal_length</span><span class="p">[</span><span class="n">target</span> <span class="o">==</span> <span class="mi">0</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">"Setosa"</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">sepal_length</span><span class="p">[</span><span class="n">target</span> <span class="o">==</span> <span class="mi">1</span><span class="p">],</span> <span class="n">petal_length</span><span class="p">[</span><span class="n">target</span> <span class="o">==</span> <span class="mi">1</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">"Versicolor"</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">'x'</span><span class="p">)</span>

<span class="c1"># Plot separating line
</span><span class="n">w1</span><span class="p">,</span> <span class="n">w2</span> <span class="o">=</span> <span class="n">an</span><span class="p">.</span><span class="n">W_</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">an</span><span class="p">.</span><span class="n">W_</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">an</span><span class="p">.</span><span class="n">b_</span>
<span class="n">x_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">sepal_length</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="n">sepal_length</span><span class="p">),</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">y_values</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="n">w1</span> <span class="o">*</span> <span class="n">x_values</span> <span class="o">-</span> <span class="n">b</span><span class="p">)</span> <span class="o">/</span> <span class="n">w2</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">y_values</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Separating Line"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"k"</span><span class="p">)</span>

<span class="c1"># Set plot labels and legend
</span><span class="n">ax1</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Sepal Length (cm)"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Petal Length (cm)"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'upper right'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Artificial Neuron Output'</span><span class="p">)</span>

<span class="c1"># Plot neuron cost
</span><span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">an</span><span class="p">.</span><span class="n">costs_</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Error"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"r"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Epoch"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Cost"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'upper left'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Artificial Neuron Cost'</span><span class="p">)</span>


<span class="c1"># Show the plot
</span><span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/images/neural-nets-part-2/an-mae.png" alt="Artificial Neuron using MAE" title="Artificial Neuron using MAE" /></p>

<p><small>Using this new activation function and gradient descent, we’re still able to create a line separating the two classes exactly (left). The cost function on the right shows the overall cost for each iteration of gradient descent and is monotonically decreasing until it approaches 0. Note that there are many solutions to this specific separation problem (the gap between the two classes is large) so any solution that achieves the lowest cost will work. Intuitively, we can think of this learning problem as having a non-unique global minimum, or a “basin” of optimal solutions. This is often not the case with more complex problems.</small></p>

<p>One criticism is in the cost function itself. Remember that we replaced the step function with the sigmoid because it wasn’t continuous everywhere and its gradient was 0 everywhere else. The absolute value function has a similar problem: its derivative is also not continuous everywhere, and, where the gradient does exist, it’s a constant value that carries no information about how far off we are. Can we come up with a better cost function that provides a more well-behaved, helpful gradient? Similar to what we did with the step function, we can replace the mean absolute error cost function with a smoothed variant whose gradient not only exists everywhere, but provides a better signal on which direction to update the weights. Fortunately for us, such a function exists as the <strong>mean squared error (MSE)</strong>, which just replaces the absolute value with a square.</p>

\[C = \displaystyle\frac{1}{2}\displaystyle\sum_i \big(y_i - a_i\big)^2\]

<p>where $a_i$ is the neuron’s output for input $x_i$. (Similar to mean absolute error, we’re adding the $\frac{1}{2}$ in front purely for mathematical convenience when computing the gradient; it’s just a constant that we could omit.) This cost function is smooth and has a derivative everywhere.</p>

\[\begin{align*}
\displaystyle\frac{\p C}{\p a_i} &amp;= -\big(y_i - a_i\big)\\
\displaystyle\frac{\p C}{\p a} &amp;= -(y - a)
\end{align*}\]

<p>Practically, smooth cost functions tend to work better since the gradient contains more information to guide gradient descent to the optimal solution. Compare this cost function gradient to the previous one that just returned $\pm 1$. In the code, we can replace the cost function and derivative with MSE instead of MAE.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cost</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">true</span><span class="p">):</span>
    <span class="k">return</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="n">true</span> <span class="o">-</span> <span class="n">pred</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span>

<span class="k">def</span> <span class="nf">dcost</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">true</span><span class="p">):</span>
    <span class="k">return</span> <span class="o">-</span><span class="p">(</span><span class="n">true</span> <span class="o">-</span> <span class="n">pred</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/images/neural-nets-part-2/an-mse.png" alt="Artificial Neuron using MSE" title="Artificial Neuron using MSE" /></p>

<p><small>Using the new MSE cost function, we can achieve the same optimal result but notice the y-axis scale on the cost plot: it has much smaller values than that of the same plot using MAE as the cost function. This is because MSE produces very small values when true and predicted values are close but very large values when they’re farther apart. In other words, they scale with the magnitude of the difference and are not constant like with MAE.</small></p>
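<p>To see this scaling concretely, we can compare the two gradients side by side on a few predictions. This is just an illustration; the function names below are made up for the comparison and aren’t from the post’s code:</p>

```python
import numpy as np

def dcost_mae(pred, true):
    # MAE gradient w.r.t. the prediction: only the sign of the error survives
    return np.sign(pred - true)

def dcost_mse(pred, true):
    # MSE gradient w.r.t. the prediction: scales with how far off we are
    return -(true - pred)

true = 1.0
for pred in [0.99, 0.5, 0.0]:
    # MAE reports -1 for every one of these; MSE shrinks as the prediction improves
    print(pred, dcost_mae(pred, true), dcost_mse(pred, true))
```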

<p>Another point we should address early on is the efficiency of gradient descent: it requires us to average gradients over all training examples. This might be fine for a few hundred or even a few thousand training examples (depending on your compute) but quickly becomes intractable for any dataset larger than that. Rather than averaging over the entire set of training examples, we can perform gradient descent on a <em>mini-batch</em>: a smaller, sampled set of data intended to be representative of the entire training set. This is called <strong>stochastic gradient descent (SGD)</strong>. We take the training data, divide it up into minibatches, and run gradient descent with a parameter update per minibatch. An epoch still elapses after all minibatches have been seen; in other words, the union of the minibatches forms the entire training set. While the cost curve on the way to convergence is a bit noisier than with full gradient descent, each iteration is often far more efficient since the minibatch size is much smaller. We can update the corresponding code to shuffle and partition our training data into minibatches, iterate over them, and perform a gradient descent update over the current minibatch instead of the entire training set.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ArtificialNeuron</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_size</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">num_epochs</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">minibatch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">=</span> <span class="n">learning_rate</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_epochs</span> <span class="o">=</span> <span class="n">num_epochs</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span> <span class="o">=</span> <span class="n">minibatch_size</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">input_size</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">b_</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">costs_</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">num_epochs</span><span class="p">):</span>
            <span class="n">epoch_cost</span> <span class="o">=</span> <span class="mi">0</span>

            <span class="c1"># shuffle data each epoch
</span>            <span class="n">permute_idxes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">permutation</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
            <span class="n">X</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">permute_idxes</span><span class="p">]</span>
            <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">[</span><span class="n">permute_idxes</span><span class="p">]</span>

            <span class="k">for</span> <span class="n">start</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span><span class="p">):</span>
                <span class="n">minibatch_cost</span> <span class="o">=</span> <span class="mi">0</span>
                <span class="n">dW</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
                <span class="n">db</span> <span class="o">=</span> <span class="mi">0</span>
                <span class="c1"># partition dataset into minibatches
</span>                <span class="n">Xs</span><span class="p">,</span> <span class="n">ys</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">start</span><span class="o">+</span><span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span><span class="p">],</span> <span class="n">y</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">start</span><span class="o">+</span><span class="bp">self</span><span class="p">.</span><span class="n">minibatch_size</span><span class="p">]</span>
                <span class="k">for</span> <span class="n">x_i</span><span class="p">,</span> <span class="n">y_i</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">Xs</span><span class="p">,</span> <span class="n">ys</span><span class="p">):</span>
                    <span class="c1"># forward pass
</span>                    <span class="n">a_i</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_forward</span><span class="p">(</span><span class="n">x_i</span><span class="p">)</span>

                    <span class="c1"># backward pass
</span>                    <span class="n">dW_i</span><span class="p">,</span> <span class="n">db_i</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_backward</span><span class="p">(</span><span class="n">x_i</span><span class="p">,</span> <span class="n">y_i</span><span class="p">)</span>

                    <span class="c1"># accumulate cost and gradient
</span>                    <span class="n">minibatch_cost</span> <span class="o">+=</span> <span class="n">cost</span><span class="p">(</span><span class="n">a_i</span><span class="p">,</span> <span class="n">y_i</span><span class="p">)</span>
                    <span class="n">dW</span> <span class="o">+=</span> <span class="n">dW_i</span>
                    <span class="n">db</span> <span class="o">+=</span> <span class="n">db_i</span>
                <span class="c1"># average cost and gradients across the actual minibatch size (the last one may be smaller)
</span>                <span class="n">dW</span> <span class="o">=</span> <span class="n">dW</span> <span class="o">/</span> <span class="n">Xs</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
                <span class="n">db</span> <span class="o">=</span> <span class="n">db</span> <span class="o">/</span> <span class="n">Xs</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
                <span class="c1"># accumulate cost over the epoch
</span>                <span class="n">minibatch_cost</span> <span class="o">=</span> <span class="n">minibatch_cost</span> <span class="o">/</span> <span class="n">Xs</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
                <span class="n">epoch_cost</span> <span class="o">+=</span> <span class="n">minibatch_cost</span>

                <span class="c1"># update weights
</span>                <span class="bp">self</span><span class="p">.</span><span class="n">W_</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">*</span> <span class="n">dW</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">b_</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">b_</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">*</span> <span class="n">db</span>
            <span class="c1"># record cost at end of each epoch
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">costs_</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">epoch_cost</span><span class="p">)</span>

<span class="c1"># rest is the same
</span></code></pre></div></div>

<p>Note that we shuffle the training data each epoch so we have different minibatches to compute gradients with and update our parameters. In fact, if we see any cyclical patterns in the cost function plot, it’s usually indicative of the same minibatches of data being seen over and over again.</p>
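<p>The shuffle-then-slice pattern is worth seeing in isolation; a minimal sketch with toy data (not from the post’s listing):</p>

```python
import numpy as np

X = np.arange(10).reshape(5, 2)   # 5 toy examples, 2 features each
y = np.arange(5)
minibatch_size = 2

# shuffle X and y together with one permutation so example/label pairs stay aligned
idx = np.random.permutation(X.shape[0])
X, y = X[idx], y[idx]

# slice off consecutive minibatches; the last one may be smaller than minibatch_size
batches = [(X[s:s + minibatch_size], y[s:s + minibatch_size])
           for s in range(0, X.shape[0], minibatch_size)]
print(len(batches))  # 3 minibatches of sizes 2, 2, 1
```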

<p><img src="/images/neural-nets-part-2/an-sgd.png" alt="Artificial Neuron using SGD" title="Artificial Neuron using SGD" /></p>

<p><small>Using SGD instead of full GD also gives an optimal solution to this problem. Note that the loss curve in the right plot is noisier than with full GD since we’re randomly sampling minibatches from the training input rather than evaluating the entire training set for each iteration. In fact, in some iterations, the cost actually goes up a little bit! But the overall trend goes to 0 and that long-term trend is more important.</small></p>

<p>Now when we run the code, our loss curve looks a bit noisier but each iteration by itself is faster since we’re only using a fraction of the entire training input, yet we can still converge to a similar solution. Computing gradients over minibatches rather than the entire dataset is essential for any practical training on real-world data!</p>
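<p>As an aside, the per-example loop inside each minibatch can itself be vectorized with numpy so an entire minibatch goes through one matrix multiply. This is only a sketch of that idea for the sigmoid/MSE neuron, not the exact class above:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gradients(W, b, Xs, ys):
    """Averaged MSE gradients over a whole minibatch in one shot."""
    z = Xs @ W + b                   # (batch,) pre-activations at once
    a = sigmoid(z)                   # (batch,) activations
    delta = -(ys - a) * a * (1 - a)  # dC/da * dsigmoid(z) per example
    dW = Xs.T @ delta / len(ys)      # average gradient w.r.t. weights
    db = delta.mean()                # average gradient w.r.t. bias
    return dW, db
```

<p>With zero weights every example lands at $a = 0.5$, so the per-example contributions are easy to check against the loop version.</p>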

<p>The full code listing is <a href="https://gist.github.com/mohitd/eb73c9635dc6e99b56694b4c24175585">here</a>.</p>

<h1 id="conclusion">Conclusion</h1>

<p>In this post, we learned about numerical optimization and how we could automatically solve for the parameters of our perceptron and artificial neuron (as well as any other mathematical model, in fact!) using gradient descent. Along the way, we discovered some issues with our perceptron model, such as our step function activation, and evolved our perceptron into something more modern using the sigmoid activation. We also covered a few improvements for gradient descent, e.g., a better choice of cost function as well as minibatching, that can help it achieve better performance in terms of speed and quality of result.</p>

<p>In the next post, we’ll use this neuron model as a building block to construct deep neural networks and discuss how we actually train them when they have millions of parameters. I had originally planned to cover true artificial neural networks and backpropagation in this post as well but felt like it was already big enough to stand alone. Also, backpropagation takes a lot of time and explanation and I think it deserves its own dedicated article. Hopefully I turn out to be correct next time 🙂</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Perceptrons are a useful pedagogical tool but have a number of limitations, particularly in training them. In this post, I'll address several issues with perceptrons and promote them into more modern artificial neurons.]]></summary></entry><entry><title type="html">Neural Nets - Part 1: Perceptrons</title><link href="/neural-nets-part-1.html" rel="alternate" type="text/html" title="Neural Nets - Part 1: Perceptrons" /><published>2023-11-03T00:00:00+00:00</published><updated>2023-11-03T00:00:00+00:00</updated><id>/neural-nets-part-1</id><content type="html" xml:base="/neural-nets-part-1.html"><![CDATA[<p>In the past decade, neural networks have shown a huge spike in interest across industry and academia. Neural nets have actually existed for decades but only recently have we been able to efficiently train and use them due to advancements in software and hardware, especially GPUs and their architectures. They’ve shown astounding performance across a multitude of tasks like image classification, object segmentation, text generation, speech synthesis, and more! While the exact details of these networks may be different, they are built from the same foundation, and many concepts and structures are common across all different kinds of neural nets.</p>

<p>In this post, we’ll start with the first and simplest kind of neural network: a perceptron. It models a single biological neuron and can actually be sufficient to solve certain problems, even with its simplicity! We’ll use perceptrons to learn how to separate a dataset and represent logic gates. Then we’ll extend perceptrons by feeding them into each other to create multilayer perceptrons that can separate even nonlinear datasets!</p>

<h1 id="biological-neurons">Biological Neurons</h1>

<p>In the mid 20th century, there was a lot of work on trying to further artificial intelligence, and the general idea was to create <em>artificial</em> intelligence by trying to model actual intelligence, i.e., the brain. Looking towards nature and how it creates intelligence makes a lot of intuitive sense (e.g., dialysis treatment was the result of studying what kidneys do and replicating their function). We know that the brain is composed of biological neurons like this.</p>

<p><img src="/images/neural-nets-part-1/biological-neuron.png" alt="Biological Neuron" title="Biological Neuron" /></p>

<p><small><a href="https://en.wikipedia.org/wiki/Neuron">Source</a>. A biological neuron has a cell body that accumulates neurotransmitters from its dendrites. If the neuron has enough of a charge, then it emits an electrical signal called an action potential through the axon. However, this signal is only produced if there is “enough” of a charge; if not, then no signal is produced.</small></p>

<p>(This wouldn’t be an “Intro to Neural Nets” explanation without a picture of an actual biological neuron!)</p>

<p>A few key conceptual notions of artificial neurons arose from this exact model and simplistic understanding. Specifically, a biological neuron has dendrites that collect neurotransmitters and send them to the cell body (soma). If there is enough accumulated input, then the neuron fires an electrical signal through its axon, which connects to the dendrites of other neurons across little gaps called synapses. There exists an <strong>All or Nothing Law</strong> in physiology where, when a nerve fiber such as a neuron fires, it produces the maximum-output response rather than any partial one (interestingly, this was first shown with the electrical signals across heart muscles that keep the heart beating!); in other words, the output is effectively binary: either the neuron fires or it does not based on a threshold of accumulated neurotransmitters.</p>

<p>Trying to model this mathematically, it seems like we have some inputs $x_i$ (dendrites) that are accumulated and determine if a binary output $y$ fires if the combined input is above a threshold $\theta$. Since we have multiple inputs, we have to combine them somehow; the simplest thing to do would be to add them all together. This is called the <strong>pre-activation</strong>. Finally, we threshold on the pre-activation to get a binary output, i.e., apply the activation function to the pre-activation.</p>

\[y=\begin{cases}
1, &amp; \displaystyle\sum_i x_i \geq \theta \\
0, &amp; \displaystyle\sum_i x_i &lt; \theta \\
\end{cases}\]

<p>What are the kinds of things we can model with this? For the simplest case, let’s consider binary inputs and start with binary models. For example, consider logic gates like AND and OR. If we choose the right value of $\theta$, we can recreate these gates using this neuron model. For an AND gate, the output is $1$ if and only if both inputs are $1$. $\theta = 2$ seems to be the right value to recreate an AND gate. For an OR gate, the output is $1$ if either of the inputs is $1$. $\theta=1$ seems to be the right value to recreate an OR gate. What about an XOR gate? This gate returns $1$ if <em>exactly one</em> of the inputs is $1$. We can try a bunch of different values but it turns out that no value of $\theta$ allows us to recreate the XOR gate under this particular mathematical model. One other way to see this is visually.</p>

<p><img src="/images/neural-nets-part-1/logic-gates.svg" alt="Logic gates" title="Logic gates" /></p>

<p><small>We can plot the inputs along two axes representing the two inputs and color them based on what the result should be, i.e., white is an output of 1 and black is an output of 0. Note that the neuron model is a linear model which means we can only represent gates whose outputs are separable by a line. This is true for the AND and OR gates, but not for the XOR gate. However, two lines could be used to recreate the XOR gate so it seems like we’ll need a more expressive model.</small></p>

<p>We’ll see later what model we need to also be able to recreate the XOR gate, but it’s important to know that this simple model has limitations on its representative power so we’re going to need a more complicated model in the future.</p>
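<p>We can also confirm the XOR claim by brute force: sweep $\theta$ over a grid of values and check each gate’s truth table against the thresholded sum. This is a small illustrative script, not from the original post:</p>

```python
def gate_matches(theta, truth_table):
    # truth_table maps (x1, x2) -> expected gate output
    return all(int(x1 + x2 >= theta) == out
               for (x1, x2), out in truth_table.items())

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR  = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

thetas = [t / 2 for t in range(-4, 9)]  # sweep theta from -2.0 to 4.0 in steps of 0.5
print(any(gate_matches(t, AND) for t in thetas))  # True (theta = 2 works)
print(any(gate_matches(t, OR) for t in thetas))   # True (theta = 1 works)
print(any(gate_matches(t, XOR) for t in thetas))  # False: no theta works
```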

<h1 id="perceptrons">Perceptrons</h1>

<p>This seems like a good start but there’s no “learning” happening here. Even before neural networks, we had learning-based approaches that sought to solve for (or optimize) some parameters given an objective/cost/loss function and a set of input data. For example, consider fitting a least-squares line to a set of points. Given some parameters of our model (specifically the slope and y-intercept) and a set of data (the set of points to fit a line to), we want to find the optimal values of the parameters such that they “best” fit the data (according to the cost). In our example, we do have a single parameter $\theta$, but we’ve been guessing the value that works, which clearly won’t work for more complex examples.</p>

<p>One thing we can do to improve the expressive power is to add more parameters to the model and figure out how to solve/optimize for them given a set of input data rather than having to guess their values by inspection. There are a number of different ways to do this but one effective way is to introduce a set of <strong>weights</strong> $w_i$, one for each input, and a <strong>bias</strong> $b$ across all inputs. Since we have a single bias that can shift the values of the inputs, we can also simplify the activation function to fix $\theta=0$ and let the learned bias shift the input to the right place.</p>

\[y=\begin{cases}
1 &amp; \displaystyle\sum_i w_i x_i + b \geq 0 \\
0 &amp; \displaystyle\sum_i w_i x_i + b &lt; 0 \\
\end{cases}\]

<p>This thresholding function is also called the <strong>Heaviside step function</strong>. A simpler notation is to collect the weights and inputs into vectors and use the dot product.</p>

\[y=\begin{cases}
1 &amp; w\cdot x + b \geq 0 \\
0 &amp; w\cdot x + b &lt; 0 \\
\end{cases}\]

<p>Furthermore, we can absorb the bias into the weights by adding a dimension to both the weight and input vectors and fixing the first value of every input to $1$. We can think of the bias as being a weight $w_0 = b$ whose input is always $1$, i.e., $\sum_{i\neq 0} w_i\cdot x_i + b\cdot x_0$ where $x_0=1$.</p>

\[y=\begin{cases}
1 &amp; w\cdot x \geq 0 \\
0 &amp; w\cdot x &lt; 0 \\
\end{cases}\]

<p>We’ll also sometimes use $y=f(w\cdot x)$ as a shorthand where $f$ represents the step function.</p>
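<p>As a quick sketch of this notation in numpy (the variable names here are my own), absorbing the bias means prepending a constant $1$ to every input:</p>

```python
import numpy as np

def step(z):
    # Heaviside step function: 1 if z >= 0 else 0
    return int(z >= 0)

# weights with the bias absorbed as w[0]; every input gets a leading 1
w = np.array([-1.5, 1.0, 1.0])   # b = -1.5, w1 = 1, w2 = 1
x = np.array([1.0, 1.0, 1.0])    # x0 = 1 (bias input), x1 = 1, x2 = 1

y = step(np.dot(w, x))           # y = f(w . x)
print(y)  # 1, since -1.5 + 1 + 1 = 0.5 >= 0
```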

<p>This very first neural model is called the <strong>perceptron</strong>: a linear binary classifier whose weights we can learn by providing it with a dataset of training examples and using the perceptron training algorithm to update the weights. Supposing we have the already-trained values of the weights, we can take any input, dot it with the learned weights, and run it through the step function to see which of the two classes the input belongs in.</p>

<p>One illustrative example to see how this is more general than the fixed-threshold neuron is to recreate our logic gates using this model instead. Again, let’s try to recreate the AND and OR gates. Both of these take two inputs $x_1$ and $x_2$ so we’ll have $w_1$, $w_2$, and $b$ that we need to find appropriate values for.</p>

<p><img src="/images/neural-nets-part-1/perceptron-logic-gates.svg" alt="Perceptron logic gates" title="Perceptron logic gates" /></p>

<p><small>Similar to the previous examples, we’ll recreate the AND and OR logic gates but use the weights of the perceptron rather than the threshold. The values of the weights are on the edges while the value of the bias term is inside of the neuron. Note that the perceptron model is <em>still</em> a linear model so we still can’t represent the XOR gate just yet.</small></p>

<p>With some inspection and experimentation, we can figure out the values for the weights and bias. For the AND gate, if we set $w_1=1$, $w_2=1$, $b=-2$, then for the positive case, we get the input $x_1+x_2-2$. Only when both $x_1=x_2=1$ would the pre-activation be $0$ and hence produce $y=1$ after running through the step function. For the OR gate, the parameters are $w_1=1$, $w_2=1$, $b=-1$ and the input is $x_1+x_2-1$. There are also other gates we can represent, e.g., NOT and NAND, but still not XOR since perceptrons are still linear models. Note that the values of these parameters aren’t the only values that satisfy the criteria; this will become important much later on when we talk about regularization.</p>
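<p>We can verify these hand-picked parameters over all four input combinations with a quick sketch (plain Python; the <code>neuron</code> helper is my own):</p>

```python
def neuron(x1, x2, w1, w2, b):
    # a single neuron: apply the step function to w1*x1 + w2*x2 + b
    return int(w1 * x1 + w2 * x2 + b >= 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [neuron(x1, x2, 1, 1, -2) for x1, x2 in inputs]
OR = [neuron(x1, x2, 1, 1, -1) for x1, x2 in inputs]
print(AND)  # [0, 0, 0, 1]
print(OR)   # [0, 1, 1, 1]
```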

<p>Similar to the previous cases, we’ve manually solved for the values of the parameters since there were only three of them and our “dataset” was just a handful of examples, but what if we wanted to separate a dataset like this?</p>

<p><img src="/images/neural-nets-part-1/iris-flower-dataset.png" alt="Iris flower dataset" title="Iris flower dataset" /></p>

<p><small>These data are taken from a famous dataset called the Iris Flower dataset that measured the petal and sepal length and width of 3 species of <em>iris</em> flowers: <em>iris setosa</em>, <em>iris versicolor</em>, and <em>iris virginica</em>. Here, we plot only the sepal and petal lengths of <em>iris setosa</em> and <em>iris versicolor</em>. Notice that we can draw a line that separates these two species. Interestingly this dataset was collected by the God of statistics: Ronald Fisher.</small></p>

<p>Trying to figure out the weights and bias by inspection now becomes a bit more difficult! Imagine doing the same for a 100-dimensional dataset: it’d be nigh impossible! Cases like these make up the majority of practical problems we’ll encounter in the real world, so we need an algorithm for automatically solving for the weights and bias of a perceptron. Let’s set up the problem: we have a bunch of <em>linearly separable</em> pairs of inputs $x_i$ and binary class labels $y_i\in\{0,1\}$ that we group into a dataset $\mathcal{D} = \Big\{ (x_1, y_1), \cdots, (x_N, y_N) \Big\}$, and we want to solve for the set of weights such that the predicted class value $\hat{y} = f(w\cdot x)$ matches the correct class value $y$ for the examples in our dataset.</p>

<p>In other words, we want to update our weights using some rule such that we eventually correctly classify every example. The most general kind of weight update rule is of the form.</p>

\[w_i\gets w_i + \Delta w_i\]

<p>For each example in the dataset, we can apply this rule to move the weights a little bit in the right direction. But what should $\Delta w_i$ be? We can define a few desiderata for this rule and try to put something together. First, if the target and predicted outputs are the same, then we don’t want to update the weight, i.e., $\Delta w_i = 0$, since the model is already correct! However, if the target and predicted outputs are different, we want to move the weights towards the correct class of that misclassified example. One last important thing is to be able to scale the weight update so that we don’t make too large of an update and overshoot. Putting all of these together, we can come up with an update scheme like the following.</p>

\[\Delta w_i  = \alpha(y-\hat{y})x_i\]

<p>where $\alpha$ is the <strong>learning rate</strong> that controls the magnitude of the update. Note that when the target and predicted class are the same, $\Delta w_i = 0$ since we’re already correct. However, if they disagree, then we move the weights in the direction of the correct class for that misclassified example.</p>
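<p>To make the update rule concrete, here’s a single misclassified example worked through in numpy (the numbers are purely illustrative):</p>

```python
import numpy as np

alpha = 0.1                   # learning rate
w = np.array([0.0, 0.0])      # current weights (bias omitted for brevity)
x = np.array([1.0, 2.0])      # misclassified input
y, y_hat = 1, 0               # target is 1 but the model predicted 0

dw = alpha * (y - y_hat) * x  # update is proportional to the input
w = w + dw
print(w)  # [0.1 0.2]
```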

<p>Putting everything together, we have the Perceptron Training Algorithm!</p>

<p>Given a learning rate $\alpha$, set of weights $w_i$, and dataset $\mathcal{D} = \Big\{ (x_1, y_1), \cdots, (x_N, y_N) \Big\}$,</p>
<ol>
  <li>Randomly initialize the weights somehow, e.g., $w_i\sim\mathcal{N}(0, \sigma^2_w)$ with some variance $\sigma^2_w$</li>
  <li>For each epoch
    <ol>
      <li>For each training example $(x_j, y_j)$ in the dataset $\mathcal{D}$
        <ol>
          <li>Run the input through the model to get a predicted class $\hat{y}_j = f(w\cdot x_j)$</li>
          <li>Update all weights using $w_i\gets w_i + \alpha(y_j-\hat{y}_j)x_{j,i}$, where $x_{j,i}$ is the $i$th component of $x_j$</li>
        </ol>
      </li>
    </ol>
  </li>
</ol>

<p>An <strong>epoch</strong> is a full iteration where the network sees all of the training data exactly once; it’s used to control the high-level loop in case the perceptron or network doesn’t converge perfectly. That being said, this update algorithm is actually guaranteed to converge in a finite amount of time by the <strong>Perceptron Convergence Theorem</strong>. The proof itself isn’t particularly insightful but the existence of the proof is: with a linearly separable dataset, we’re guaranteed to converge after a finite number of mistakes.</p>

<p>Perceptrons are really easy to code up so let’s go ahead and write one really quickly in Python using numpy.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">class</span> <span class="nc">Perceptron</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">num_epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lr</span> <span class="o">=</span> <span class="n">lr</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_epochs</span> <span class="o">=</span> <span class="n">num_epochs</span>

    <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="c1"># initialize x_0 to be bias
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">w_</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">losses_</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">num_epochs</span><span class="p">):</span>
            <span class="n">errors</span> <span class="o">=</span> <span class="mi">0</span>
            <span class="k">for</span> <span class="n">x_i</span><span class="p">,</span> <span class="n">y_i</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
                <span class="n">dw</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lr</span> <span class="o">*</span> <span class="p">(</span><span class="n">y_i</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x_i</span><span class="p">))</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">w_</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="o">+=</span> <span class="n">dw</span> <span class="o">*</span> <span class="n">x_i</span>
                <span class="c1"># bias update; recall x_0 = 1
</span>                <span class="bp">self</span><span class="p">.</span><span class="n">w_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="n">dw</span> 
                <span class="n">errors</span> <span class="o">+=</span> <span class="nb">int</span><span class="p">(</span><span class="n">dw</span> <span class="o">!=</span> <span class="mf">0.0</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">losses_</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">errors</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">_forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">w_</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">w_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_forward</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mf">0.</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>Let’s train this on the above linearly separable dataset and see the results!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># Load the Iris dataset
</span><span class="n">iris</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_iris</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">data</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">target</span>

<span class="c1"># Select only the Setosa and Versicolor classes (classes 0 and 1)
</span><span class="n">setosa_versicolor_mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">target</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">target</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">setosa_versicolor_mask</span><span class="p">]</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">target</span><span class="p">[</span><span class="n">setosa_versicolor_mask</span><span class="p">]</span>

<span class="c1"># Extract the sepal length and petal length features into a dataset
</span><span class="n">sepal_length</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">petal_length</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">sepal_length</span><span class="p">,</span> <span class="n">petal_length</span><span class="p">]).</span><span class="n">T</span>

<span class="c1"># Train the Perceptron
</span><span class="n">p</span> <span class="o">=</span> <span class="n">Perceptron</span><span class="p">()</span>
<span class="n">p</span><span class="p">.</span><span class="n">train</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span>

<span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>

<span class="c1"># Create a scatter plot of values
</span><span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">sepal_length</span><span class="p">[</span><span class="n">target</span> <span class="o">==</span> <span class="mi">0</span><span class="p">],</span> <span class="n">petal_length</span><span class="p">[</span><span class="n">target</span> <span class="o">==</span> <span class="mi">0</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">"Setosa"</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">sepal_length</span><span class="p">[</span><span class="n">target</span> <span class="o">==</span> <span class="mi">1</span><span class="p">],</span> <span class="n">petal_length</span><span class="p">[</span><span class="n">target</span> <span class="o">==</span> <span class="mi">1</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">"Versicolor"</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">'x'</span><span class="p">)</span>

<span class="c1"># Plot separating line
</span><span class="n">w1</span><span class="p">,</span> <span class="n">w2</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">w_</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">p</span><span class="p">.</span><span class="n">w_</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">w_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">x_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">sepal_length</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="n">sepal_length</span><span class="p">),</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">y_values</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="n">w1</span> <span class="o">*</span> <span class="n">x_values</span> <span class="o">-</span> <span class="n">b</span><span class="p">)</span> <span class="o">/</span> <span class="n">w2</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">y_values</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Separating Line"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"k"</span><span class="p">)</span>

<span class="c1"># Set plot labels and legend
</span><span class="n">ax1</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Sepal Length (cm)"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Petal Length (cm)"</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'upper right'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Perceptron Output'</span><span class="p">)</span>

<span class="c1"># Plot perceptron loss
</span><span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">losses_</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Error"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"r"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Epoch"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Error"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'upper left'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Perceptron Errors'</span><span class="p">)</span>


<span class="c1"># Show the plot
</span><span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/images/neural-nets-part-1/perceptron-iris-dataset.png" alt="Perceptron Iris dataset" title="Perceptron Iris dataset" /></p>

<p><small>After training the perceptron on the dataset, we get a line in 2D that separates the two classes. In the general case, for a dataset where the inputs are $d$-dimensional, we’d get a $(d-1)$-dimensional hyperplane. The right plot shows the number of errors the perceptron makes as we train on the dataset; if the dataset is linearly separable, the perceptron is guaranteed to converge to <em>some</em> solution after a finite number of updates.</small></p>

<p>Since our dataset was linearly separable, we were able to converge to a solution in just a few iterations! Note that the result is a valid separator, but not necessarily an optimal one: the algorithm stops at <em>any</em> line that separates the classes, not the one with the widest margin. Feel free to experiment with different kinds of weight initialization and learning rates!</p>
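<p>As one such experiment, here’s a self-contained sketch that adds a configurable random initialization to the class above and tries a few learning rates on synthetic linearly separable data (the <code>init_std</code> parameter and the synthetic blobs are my own additions):</p>

```python
import numpy as np

class Perceptron:
    # mirrors the Perceptron above, plus optional Gaussian weight initialization
    def __init__(self, lr=0.01, num_epochs=20, init_std=0.0):
        self.lr = lr
        self.num_epochs = num_epochs
        self.init_std = init_std  # 0.0 reproduces the zero initialization above

    def train(self, X, y):
        rng = np.random.default_rng(0)
        self.w_ = (rng.normal(0, self.init_std, 1 + X.shape[1])
                   if self.init_std > 0 else np.zeros(1 + X.shape[1]))
        self.losses_ = []
        for _ in range(self.num_epochs):
            errors = 0
            for x_i, y_i in zip(X, y):
                dw = self.lr * (y_i - self.predict(x_i))
                self.w_[1:] += dw * x_i
                self.w_[0] += dw
                errors += int(dw != 0.0)
            self.losses_.append(errors)
        return self

    def predict(self, X):
        return np.where(np.dot(X, self.w_[1:]) + self.w_[0] >= 0., 1, 0)

# two well-separated Gaussian blobs as a stand-in for the Iris data
rng = np.random.default_rng(42)
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)), rng.normal([3, 3], 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for lr, std in [(0.01, 0.0), (0.1, 0.0), (0.01, 1.0)]:
    p = Perceptron(lr=lr, init_std=std).train(X, y)
    print(f"lr={lr}, init_std={std}: errors per epoch = {p.losses_}")
```

<p>Since the perceptron update is scale-invariant when starting from zero weights, changing only the learning rate mostly rescales the weights; the random initialization is what actually changes the trajectory.</p>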

<h1 id="multilayer-perceptrons-mlps">Multilayer Perceptrons (MLPs)</h1>

<p>Even with the perceptron’s improvements over the simpler artificial neuron model, we still can’t solve the XOR problem since perceptrons only work for linearly separable data. But recall back to when we were talking about biological neurons. After consuming input from the dendrites, if we’ve accumulated enough inputs to fire the neuron, it’ll fire along the output axon <em>which in turn is used as the input to other neurons</em>. So it seems, at least biologically, that neurons feed into other neurons.</p>

<p>We can also feed our artificial neurons into other neurons and create connections between them. There are a number of different choices for how we connect them; we could even connect neurons recurrently to themselves! But the simplest thing to try is to feed the outputs of two neurons into a third neuron that produces the final output.</p>

<p><img src="/images/neural-nets-part-1/mlp.svg" alt="Multilayer Perceptron" title="Multilayer Perceptron" /></p>

<p><small>A multilayer perceptron (MLP) takes the output of one perceptron and feeds it into another perceptron. The edges represent the weights and the circles represent the biases. Here is a 2-layer perceptron with a hidden layer of 2 neurons and output layer of 1 neuron.</small></p>

<p>This structure is called a <strong>multilayer perceptron (MLP)</strong> and the intermediate layer is called a <strong>hidden layer</strong>: the network maps an observable input to an observable output, but the hidden layer in between might not directly have an observable result or interpretation. In this particular example, we have 9 learnable parameters $w_1$, $w_2$, $w_3$, $w_4$, $b_1$, $b_2$, $w_5$, $w_6$, and $b_3$. Solving for these parameters via inspection is still possible by making one key observation: we can redefine an XOR gate as a combination of other gates: $a \tt{~XOR~} b = (a\tt{~OR~}b) \tt{~AND~} (a\tt{~NAND~}b)$. We’ve already seen the AND and OR gates so we need to figure out the right weights and bias for the NAND gate. Test this yourself, but one set of values that satisfies the NAND gate is $w_1=-1$, $w_2=-1$, $b=1$. Because of this decomposition of the XOR gate, we can recreate it using those same weights and biases.</p>

<p><img src="/images/neural-nets-part-1/mlp-xor-gate.svg" alt="MLP XOR gate" title="MLP XOR gate" /></p>

<p><small>One way to interpret this solution to the XOR gate problem is that the top hidden neuron represents $h_1 = x_1\tt{~OR~}x_2$ and the bottom one represents $h_2 = x_1\tt{~NAND~}x_2$. Then the final one represents $h_1\tt{~AND~}h_2 = (x_1\tt{~OR~}x_2) \tt{~AND~} (x_1\tt{~NAND~}x_2)=x_1 \tt{~XOR~} x_2$. Now we have a solution to classify even nonlinear data!</small></p>

<p>In theory this seems to work, but let’s try to plug in some values and run them through this MLP to see if it produces the right outputs. We’ll call the hidden layer outputs $h_1=f(x_1+x_2-1)$ and $h_2=f(-x_1-x_2+1)$. The final output is then $y=f(h_1+h_2-2)$. Here’s a truth table showing the inputs and outputs.</p>

<table>
  <thead>
    <tr>
      <th>$x_1$</th>
      <th>$x_2$</th>
      <th>$h_1$</th>
      <th>$h_2$</th>
      <th>$y$</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>$0$</td>
      <td>$0$</td>
      <td>$0$</td>
      <td>$1$</td>
      <td>$0$</td>
    </tr>
    <tr>
      <td>$0$</td>
      <td>$1$</td>
      <td>$1$</td>
      <td>$1$</td>
      <td>$1$</td>
    </tr>
    <tr>
      <td>$1$</td>
      <td>$0$</td>
      <td>$1$</td>
      <td>$1$</td>
      <td>$1$</td>
    </tr>
    <tr>
      <td>$1$</td>
      <td>$1$</td>
      <td>$1$</td>
      <td>$0$</td>
      <td>$0$</td>
    </tr>
  </tbody>
</table>
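<p>The truth table above is easy to check programmatically; here’s a minimal sketch using the step function and the weights we derived:</p>

```python
def f(z):
    # Heaviside step function
    return int(z >= 0)

def xor_mlp(x1, x2):
    h1 = f(x1 + x2 - 1)    # hidden neuron 1: OR gate
    h2 = f(-x1 - x2 + 1)   # hidden neuron 2: NAND gate
    return f(h1 + h2 - 2)  # output neuron: AND of the hidden outputs

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_mlp(x1, x2))  # matches the truth table: 0, 1, 1, 0
```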

<p>Seems like this MLP correctly produces the right outputs for the XOR gate! This is pretty interesting because a single perceptron couldn’t solve the XOR gate problem since the XOR gate isn’t linearly separable. But it seems that by layering perceptrons, we can correctly classify even nonlinearly separable data! To understand why, let’s try plotting the values of the hidden layer using the truth table above.</p>

<p><img src="/images/neural-nets-part-1/mlp-hidden-layer.svg" alt="MLP hidden layer" title="MLP hidden layer" /></p>

<p><small>We can plot the hidden state values in the 2D plane in the same way as plotting the logic gates. Notice that in the latent space, the XOR gate is indeed linearly separable so we only need one additional perceptron on this hidden state to complete our MLP representation of the XOR gate!</small></p>

<p>This is particularly insightful: in the input space, the XOR gate <em>is not</em> linearly separable but in the <em>hidden/latent space it is</em>! This is a general observation about neural networks: they perform a series of transforms until the final data are linearly separable, then we just need a single perceptron to separate them. Layering perceptrons gives the MLP more expressive power to separate nonlinear datasets by passing them through multiple transforms. But solving for the parameters by inspection breaks down as we scale up to many hundreds, thousands, millions, and billions of parameters! We’ll still need to come up with a way to automatically learn the parameters of these kinds of very large neural networks but we’ll save that for next time!</p>

<h1 id="conclusion">Conclusion</h1>

<p>Neural networks have gained immense traction in the past decade for their exceptional performance across a wide variety of different tasks. Historically, they arose from trying to model biological neurons in an effort to create artificial intelligence. From these simple biological models, we derived a few parametrized mathematical models. We moved on to perceptrons as a start and learned what their parameters were and how to train them using the Perceptron Training Algorithm. We showed how they can successfully classify real-world, linearly separable data. However, we found limitations in them, particularly with nonlinear datasets, even in the simplest cases such as recreating an XOR gate. But we found that by layering these together into multilayer perceptrons (MLPs), we could separate even some nonlinear datasets!</p>

<p>We solved for the parameters of the MLP by inspection but this isn’t possible for very large neural networks so we’ll need an algorithm to automatically learn these parameters given the dataset. Furthermore, there have been a number of advancements in neural networks to improve their efficiency and robustness, and we’ll discuss the training algorithm and some of these advancements in the next post 🙂</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Over the past decade or so, neural networks have shown amazing success across a wide variety of tasks. In this post, I'll introduce the grandfather of modern neural networks: the perceptron.]]></summary></entry><entry><title type="html">Language Modeling - Part 2: Embeddings</title><link href="/embeddings.html" rel="alternate" type="text/html" title="Language Modeling - Part 2: Embeddings" /><published>2023-10-21T00:00:00+00:00</published><updated>2023-10-21T00:00:00+00:00</updated><id>/embeddings</id><content type="html" xml:base="/embeddings.html"><![CDATA[<p>In the previous post, we discussed n-gram models and how to use them for language modeling and text generation. If we use large-enough n-grams (say $n=6$), we could get pretty decent generated text. However, the caveat with n-gram models is in the representation: these models represent words as strings. This is not ideal since strings don’t capture anything about the actual meaning of the word. For example, suppose we were generating text with the sequence “The delicious world-class coffee was”. The n-gram model might output both $p(\text{great}\vert\text{The delicious world-class coffee was})$ and $p(\text{terrible}\vert\text{The delicious world-class coffee was})$ with high probabilities depending on the training set and value of $n$, even though the words “great” and “terrible” have opposite meanings! The n-gram model can’t understand that these are opposite words because it simply represents words as strings, which don’t actually capture meaning or relations.</p>

<p>Can we come up with a better word representation that actually models word meanings and relations? In this post, I’ll go over how to compare words and how to quantify word similarity. As always, we’ll start with some background in linguistics. Then, before getting into word similarity, we’ll actually talk about document similarity, since it’s a bit easier to understand, and use those concepts to finally talk about embeddings, which are vector representations of words that capture meaning. Finally, we’ll see some tangible examples with code on how to load pretrained embeddings, perform analogy tasks, and visualize them.</p>

<h1 id="semantics-of-words">Semantics of Words</h1>
<p>Similar to the previous discussion on n-grams, since we’re talking about representing meanings of words, we have to understand what that entails linguistically first. n-gram models represent words as strings which doesn’t capture the meaning or connotation of the word in question. For example, some words are synonyms or antonyms of other words; some words have a positive or negative connotation; some words are related, but not synonymous to other words. A good representation should capture all of that.</p>

<p>Let’s start with synonyms as an example: one way to say two words are <strong>synonyms</strong> is that we can substitute them in a sentence and still have the “truth conditions” of the sentence hold. However, just because words are synonyms doesn’t mean they’re interchangeable in all contexts. For example, “$H_2 O$” is a synonym for “water”, but it’s used in scientific contexts and sounds strange in most others. Furthermore, words can be similar without being synonyms: tea and biscuits are not synonyms but are <strong>related</strong> because we often serve biscuits with tea.</p>

<p>One methodology linguists came up with to quantify word meaning in the 50s (Osgood et al. <em>The Measurement of Meaning</em>. 1957) is quantifying them along three dimensions: <strong>valence</strong> (pleasantness of stimulus, e.g., happy/unhappy), <strong>arousal</strong> (intensity of emotion of the stimulus, e.g., excited/calm), and <strong>dominance</strong> (degree of control of the stimulus, e.g., controlling/influenced). Linguists asked groups of humans to quantify different words based on those dimensions. From the results, they could numerically measure similarity and relationships of words using those three dimensions. With this representation, we could map a word to a vector in a 3D space and perform arithmetic operations to compare two words along those hand-crafted features. This was a good start but depended on surveying humans to come up with these values when large corpora of human text already exists.</p>

<h1 id="comparing-documents">Comparing Documents</h1>
<p>Gathering large groups of people to quantify words along those dimensions isn’t a practical way of doing things, but it provides some insight: we can try to come up with an automated mechanism to map a word to a vector that we can perform mathematical operations on. The key insight lies in what linguists call the <strong>distributional hypothesis</strong>: words that occur in similar contexts have similar meanings. So the idea is to construct this vector representation for a particular word based on the context that word appears in.</p>

<p>Counter-intuitively, figuring this out for documents of words is a bit easier than individual words themselves so let’s sojourn into the world of information retrieval (IR). Given a query $q$ and a set of documents $\mathcal{D}$ (also called a <strong>corpus</strong>), the problem of information retrieval is to find a document $d\in\mathcal{D}$ such that it “best matches” the query $q$. Based on the distributional hypothesis, one way to compare two documents would be to look at how many words <em>co-occur</em> across documents. For each word in each document of the corpus, we can create a <strong>word-document matrix</strong> where the rows are words, the columns are documents, and an entry in the matrix represents the number of times a particular word appeared in a particular document.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>As you like it</th>
      <th>Julius Caesar</th>
      <th>Henry V</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>battle</td>
      <td>1</td>
      <td>7</td>
      <td>13</td>
    </tr>
    <tr>
      <td>good</td>
      <td>114</td>
      <td>62</td>
      <td>89</td>
    </tr>
    <tr>
      <td>fool</td>
      <td>20</td>
      <td>1</td>
      <td>4</td>
    </tr>
  </tbody>
</table>

<p><em>Source: <a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a> by Dan Jurafsky and James H. Martin</em></p>

<p>In this example, we can see that <em>Julius Caesar</em> has more in common with <em>Henry V</em> than it does with <em>As you like it</em> because the counts are more similar to each other. Quantitatively, we can represent each document as a vector of size $\vert V\vert$ where $\vert V\vert$ represents the size of the vocabulary (all words across all documents). So we can represent <em>Julius Caesar</em> and <em>Henry V</em> as two $\vert V\vert$-dimensional vectors, but how do we compare them?</p>

<p>One straightforward comparison is the dot product between the vectors:</p>

\[d(v, w) = v_1 w_1+\cdots+v_{N}w_{N}\]

<p>However, that would disproportionately give a higher weight to vectors of greater magnitude. Normalizing by both vector magnitudes gives a measure that depends only on direction.</p>

\[d(v, w) = \frac{v_1 w_1+\cdots+v_{N}w_{N}}{|v||w|} = \frac{v\cdot w}{|v||w|}\]

<p>This measure of similarity is called <strong>cosine similarity</strong> or also the <strong>normalized dot product</strong> because of the equation $a\cdot b = \vert a\vert\vert b\vert\cos\theta$. This distance metric is bounded to be in $[-1, 1]$ where a similarity of 1 means the vectors are maximally similar (pointed in the same direction), a similarity of 0 means the vectors are unrelated (orthogonal), and a similarity of -1 means the vectors are maximally dissimilar (pointing in opposite directions). Using this measure, we can compare two documents against each other and quantify their similarity!</p>
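<p>As a quick numpy check, we can compute cosine similarities directly from the raw counts in the table above:</p>

```python
import numpy as np

# (battle, good, fool) counts from the word-document matrix above
as_you_like_it = np.array([1, 114, 20])
julius_caesar = np.array([7, 62, 1])
henry_v = np.array([13, 89, 4])

def cosine_similarity(v, w):
    """Normalized dot product of two vectors."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Julius Caesar's counts are closer in direction to Henry V's
# than to As you like it's
sim_jc_hv = cosine_similarity(julius_caesar, henry_v)
sim_jc_ayli = cosine_similarity(julius_caesar, as_you_like_it)
```

<p>Since raw counts are non-negative, these similarities all land in $[0, 1]$; the negative end of the $[-1, 1]$ range only comes into play for vectors that can have negative components, like the embeddings later in this post.</p>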

<h2 id="tf-idf">tf-idf</h2>
<p>One large issue with directly using the word-document matrix is articles: words like “the”, “a”, and “an” occur frequently across all documents. They don’t have any discriminative power when it comes to comparing two documents since they occur too frequently. So we need to balance words that are frequent against words that are <em>too</em> frequent. To quantify this, we define <strong>term frequency</strong> as the frequency of a word $t$ in a document $d$ using the raw count.</p>

\[\text{tf}_{t,d} = \text{count}(t, d)\]

<p>Word frequencies can become very large numbers so we want to squash the raw counts since they don’t linearly equate to relevance. But what if a word doesn’t occur in a document at all? Its count would be 0, which ends up becoming a problem when we take a log. We can simply offset the raw count by 1 to avoid numerical issues and use the log-space instead of the raw counts.</p>

\[\text{tf}_{t,d} = \log\Big(\text{count}(t, d) + 1\Big)\]

<p>Similarly, we can define <strong>document frequency</strong> as the number of documents that a word occurs in: $\text{df}_t$. <strong>Inverse document frequency (idf)</strong> is simply the inverse using $N$ as the number of documents in the corpus: $\text{idf}_t=\frac{N}{\text{df}_t}$. Similar to the above rationale, we also use the log-space.</p>

\[\text{idf}_t = \log\frac{N}{\text{df}_t}\]

<p>The intuition is that a word that is frequent within a document is more important to that document, but a word that occurs in fewer documents should have a higher weight since it has more discriminative power, i.e., it more uniquely identifies the documents it appears in. Combining these two constraints, we get the full tf-idf weight</p>

\[w_{t,d} = \text{tf}_{t,d}\text{idf}_t\]

<p>Note that this ensures that really common words would have $w\approx 0$ since their idf score would be close to 0. With the earlier table, let’s replace the raw counts with the tf-idf score for each entry in the word-document matrix.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>As you like it</th>
      <th>Julius Caesar</th>
      <th>Henry V</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>battle</td>
      <td>0.074</td>
      <td>0.22</td>
      <td>0.28</td>
    </tr>
    <tr>
      <td>good</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>fool</td>
      <td>0.019</td>
      <td>0.0036</td>
      <td>0.022</td>
    </tr>
  </tbody>
</table>

<p><em>Source: <a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a> by Dan Jurafsky and James H. Martin</em></p>

<p>Notice that since “good” is a very common word, its tf-idf score becomes 0 since it has no discriminative power. Using tf-idf provides a better way to compare documents by more accurately representing their word contents.</p>
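<p>We can sketch the tf and idf computations in numpy. Note that the document frequencies and corpus size below are assumed values: with only the three documents shown, every word appears in all of them and every idf would be 0, so the table’s nonzero values require df computed over a full collection of Shakespeare’s plays rather than just these three columns.</p>

```python
import numpy as np

# raw counts from the first word-document matrix (rows: battle, good, fool)
counts = np.array([[1, 7, 13],
                   [114, 62, 89],
                   [20, 1, 4]])

# assumed values: df over a hypothetical 37-document corpus in which
# "good" appears in every document (so its idf, and tf-idf, will be 0)
N = 37
df = np.array([21, 37, 36])

tf = np.log10(counts + 1)     # tf = log(count + 1)
idf = np.log10(N / df)        # idf = log(N / df)
tfidf = tf * idf[:, None]     # w_{t,d} = tf * idf
```

<p>With these assumed document frequencies, the “battle” row comes out to roughly $(0.074, 0.22, 0.28)$, matching the table, and the “good” row is exactly 0.</p>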

<h1 id="embeddings">Embeddings</h1>
<p>We’ve seen how to represent documents as large, sparse vectors of word counts/frequencies and how to compare them against each other. Let’s now see how to compare individual words using embeddings. An <strong>embedding</strong> is a short, dense vector representation of a word that holds particular semantics about that word. In practice, these dense vectors tend to work better than sparse vectors in most language tasks since they capture the complexity of the semantic space more efficiently.</p>

<h2 id="word2vec">word2vec</h2>
<p>One way to construct an embedding for a word is to go back to the distributional hypothesis: words that occur in similar contexts have similar meanings. This is the principle behind <strong>word2vec</strong>: we want to train a model that tells us if a word is likely to be near another word. Through training a word2vec model, the weights of the model become the embeddings, and we learn them for each word in the vocabulary in a self-supervised fashion with no explicit training labels.</p>

<p>There are two flavors of word2vec: continuous bag of words (CBOW) and skip-gram; we’ll use the skip-gram model. The intuition is that we select a target word and define a context window of a few words before and after the target word. We construct a tuple of the target word and each of the words in the context window, and these become our training examples. We learn a set of weights to maximize the likelihood that a context word appears in the context window of a target word and use the learned weights as the embedding itself.</p>

<p>Let’s start with constructing the training tuples. Suppose we have the following sentence and the target word was “cup” and the context window was $\pm 2$:</p>

\[\text{[The coffee }\color{red}{\text{cup}}\text{ was half] empty.}\]

<p>The training examples would be tuples of the target word and the context words: (cup, the), (cup, coffee), (cup, was), (cup, half). We want to train a model such that, given a target word and another word, it returns the likelihood that the other word is a context word of the target word.</p>
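<p>Constructing these tuples is simple enough to sketch directly (the function name here is just for illustration):</p>

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training tuples for skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        # context indices within +/- window of the target, excluding the target
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the coffee cup was half empty".split()
cup_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "cup"]
# cup_pairs == [("cup", "the"), ("cup", "coffee"), ("cup", "was"), ("cup", "half")]
```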

<p><img src="/images/embeddings/word2vec.svg" alt="Word2vec Model" title="Word2vec Model" />
<small><em>word2vec model architecture and training example.</em> We map a word to its one-hot embedding and then use $E$ to map into the embedding itself. Then we remap into the vocabulary, normalize over all words, and try to maximize the likelihood that a particular context word is seen in the context window of the target.</small></p>

<p>For the input, we represent words as sparse one-hot embeddings where the size of the vector is the size of the vocabulary and we assign a unique dimension/index in the vector to each word.</p>

\[\text{cup}\mapsto\begin{bmatrix}0\\\vdots\\ 0\\ 1\\ 0\\\vdots\\ 0\end{bmatrix} = w\\\]

<p>Then we have a weight matrix $E$ that maps this one-hot vector to its embedding vector of some dimensionality $H$, so the dimensions of the matrix must be $H\times\vert V\vert$. We can get the embedding for a word in its one-hot representation by multiplying $Ew$ to get an output embedding vector of size $H\times 1$ that corresponds to one column of the matrix. Note that this is equivalent to “selecting” the column at the index where the one-hot vector is $1$. For this reason, we also call $E$ the embedding matrix itself.</p>
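<p>A tiny numpy sketch (with arbitrary sizes) shows that multiplying by a one-hot vector is just a column lookup:</p>

```python
import numpy as np

V, H = 6, 4
rng = np.random.default_rng(0)
E = rng.normal(size=(H, V))   # embedding matrix: one column per vocab word

w = np.zeros(V)
w[2] = 1.0                    # one-hot vector for the word at index 2

# the matrix-vector product selects the corresponding column of E,
# so in practice we index directly instead of multiplying
assert np.allclose(E @ w, E[:, 2])
```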

<p>Recall that to train the model, we want it to produce a high likelihood if a context word is indeed in the context of the target word. To do that, we need another matrix $E’$ of dimension $\vert V\vert\times H$ mapping the embedding space back into the vocab space. Since we want a probability, we need to normalize the output so we get a probability distribution across the vocabulary words. To do this, we apply the softmax operator:</p>

\[\text{softmax}(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}\]

<p>Intuitively, this takes a particular element $z_i$ and divides it by the total sum of all elements in the exponential space. This gives us a valid probability distribution as the output. For the context word, we use the one-hot embedding. Another way to interpret the one-hot embedding probabilistically is that it represents a distribution with a single peak at a single index/word.</p>

<p>Now that we have the normalized output distribution and a one-hot embedding (thought of as a peaked distribution), the intuition behind the loss function is that we want to push the output distribution to be peaked in the same index as the desired embedding. One loss function that has this property is called the <strong>cross-entropy (CE) loss</strong> between a target $y$ and predicted $\hat{y}$.</p>

\[\mathcal{L}(\hat{y}, y) = -\sum_i y_i\log\hat{y}_i\]

<p>Note that because the target vector $y$ is a one-hot embedding, almost every term in the sum will be $0$ <em>except</em> the one where $i=c$, where $c$ is the index of the context word in the target vector and the element value is simply $1$. So we can simplify this into a single expression.</p>

\[\mathcal{L}(\hat{y}, y) = -\log\hat{y}_c\]

<p>Does this loss function do the right thing? What happens if $\hat{y}_c$ is very close to $0$? Intuitively, this means the model is not doing a good job since it estimates the context word with a low probability of being in the actual context. In this case, we’re taking the log of a very small number which is a very large negative number. After we negate it, we get a very large loss. This makes sense since our model is saying that it doesn’t think the context word has a high likelihood of being in the context window even though it actually is (because that’s how we constructed the dataset). Note that since $\hat{y}_c$ is the output of a softmax, it’s bounded to be in $[0, 1]$. Since we can’t take the log of $0$, we often add a little epsilon $\varepsilon$ inside the log like $\log(\hat{y}_c+\varepsilon )$ for numerical stability.</p>

<p>Now what happens if $\hat{y}_c$ is close to $1$? Intuitively, this means our model is doing great because it’s very confidently estimating that the context word is in the context window. In this case, the log of $1$ is $0$ so we have a loss of $0$. This makes sense since our model is accurately predicting the high likelihood of the context word being in the context window.</p>

<p>Overall, this loss function seems to do what we want: move the predicted distribution of the model to be peaked at the context word. Putting all of this together, the training process looks like the following.</p>

<ol>
  <li>Given a target word $w$ and context word, run the target word through the matrices $E’Ew$.</li>
  <li>Take the softmax of the output layer $\text{softmax}(E’Ew)$ to get a distribution over the vocab.</li>
  <li>Compute the cross-entropy loss using the one-hot embedding of the context word.</li>
  <li>Update the weights of the matrix according to the loss.</li>
</ol>
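<p>The steps above can be sketched in plain numpy on a toy vocabulary. This is a toy sketch: the sizes, learning rate, and iteration count are arbitrary choices, and real implementations rely on an autograd framework plus tricks like negative sampling rather than a full softmax over the vocabulary.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "coffee", "cup", "was", "half", "empty"]
idx = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 8

E = rng.normal(scale=0.1, size=(H, V))       # embedding matrix (H x |V|)
E_out = rng.normal(scale=0.1, size=(V, H))   # output matrix E' (|V| x H)

# (target, context) tuples for "cup" with a +/-2 window
pairs = [("cup", "the"), ("cup", "coffee"), ("cup", "was"), ("cup", "half")]

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

lr = 0.1
for _ in range(500):
    for target, context in pairs:
        t, c = idx[target], idx[context]
        h = E[:, t]                      # Ew: select the target's column
        y_hat = softmax(E_out @ h)       # softmax(E'Ew)
        grad = y_hat.copy()
        grad[c] -= 1.0                   # d(CE loss)/d(logits) = y_hat - y
        grad_h = E_out.T @ grad          # backprop into the embedding
        E_out -= lr * np.outer(grad, h)  # update output matrix
        E[:, t] -= lr * grad_h           # update the target's embedding
```

<p>After training, the predicted distribution for “cup” should put most of its mass on the four words that actually appeared in its context window.</p>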

<p>Practically, we’d use a framework such as PyTorch or TensorFlow and their automatic differentiation (also called autograd for automatic gradient) to compute the gradients for us. After training, we have an embedding matrix $E$ where each column is an embedding vector that we can look up for a particular word in our vocabulary.</p>

<h2 id="glove">GloVe</h2>
<p>word2vec is a good start in providing us with a word representation that holds some semantics about the word but it has one major problem: the context is always local. When we create training examples, we always use a context window around the word. While this gives us good <em>local</em> co-occurrences, we could more accurately represent the word if we also looked at <em>global</em> co-occurrences of words. Rather than trying to learn the raw probabilities like what word2vec does, GloVe aims to learn a <em>ratio</em> of probabilities representing <em>how much more likely</em> is it that a particular word appears in a context of one word compared to another word.</p>

<p>To start with some notation, we define a <strong>word-word co-occurrence matrix</strong> with $X$ and let $X_{ij}$ represent the number of times word $j$ appeared in the context of word $i$. With that definition, let $X_i = \sum_j X_{ij}$ as the number of times any word appears in the context of word $i$; we can also define $p_{ij}=p(j\vert i)=\frac{X_{ij}}{X_i}$ as the probability that word $j$ appears in the context of word $i$. As an example, consider $i=\text{ice}$ and $j=\text{steam}$. With probe words $k$, we can consider the ratio $\frac{p_{ik}}{p_{jk}}$ that tells us how much more likely is word $k$ to appear in the context of word $i$ than word $j$. For words like $k=\text{solid}$ that are more closely related to $i=\text{ice}$, the ratio will be large; for words more closely related to $j=\text{steam}$ like $k=\text{gas}$, the ratio will be small. For words that are closely related to both such as $k=\text{water}$, the ratio will be close to 1. This ratio has more discriminative power in identifying which words are relevant or irrelevant than using the raw probabilities.</p>

<p>Rather than learning raw probabilities, the authors construct a model to learn the co-occurrence ratios and train it using a novel weighted least squares regression model.</p>

\[J = \sum_{i,j} f(X_{ij})\Big(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \Big)^2\]

<p>where</p>
<ul>
  <li>$w_i$ is a learned word vector</li>
  <li>$\tilde{w}_j$ is a learned context vector</li>
  <li>$b_i$ is the learned bias of word $i$</li>
  <li>$\tilde{b}_j$ is the learned bias of context word $j$</li>
  <li>$f(x) = \Big(\frac{x}{x_\text{max}}\Big)^\alpha$ if $x &lt; x_\text{max}$ or $1$ otherwise is a weighting function.</li>
</ul>

<p>There are a few nice properties about this weighting function that carry over from tf-idf: $f(x)$ is non-decreasing so that more frequent words are weighted correctly, but it has an upper bound of $1$ so that very frequent words are not overweighted. The additional numerical property required is that $f(0) = 0$; otherwise a co-occurrence entry of 0 would make the $\log X_{ij}$ term ill-defined. The hyperparameters are $x_\text{max}$ and $\alpha$, and the authors found that the former doesn’t impact the quality as much as the latter; $\alpha=0.75$ tended to work better than a linear weighting, empirically. Solving for the weights, we get GloVe embeddings that can be used just like word2vec embeddings, but they tend to perform better since we’re also considering global word co-occurrences in addition to local context windows. We’ll see an example later where we load pretrained GloVe embeddings and use them to solve word analogies.</p>
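<p>The weighting function is simple enough to sketch directly, using the paper’s suggested $x_\text{max}=100$ and $\alpha=0.75$ as defaults:</p>

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x) for a co-occurrence count x."""
    # sublinear ramp below x_max, capped at 1 for very frequent pairs
    return (x / x_max) ** alpha if x < x_max else 1.0
```

<p>Note that $f(0) = 0$ falls out naturally, so zero co-occurrence entries contribute nothing to the loss.</p>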

<p>Read the <a href="https://nlp.stanford.edu/pubs/glove.pdf">GloVe paper</a> for more details!</p>

<h2 id="embedding-layer">Embedding Layer</h2>
<p>Both word2vec and GloVe train embeddings that can be used across a number of different language modeling tasks. However, the cost we pay for that generality is that they may not perform as well on very specific applications. In other words, since the embeddings are taken off-the-shelf, we’ll have to fine-tune them for a specific language modeling task. We can use the pretrained embeddings to start and then treat them as optimizable variables forming a smaller part of our language model. This gives us a good starting point but also allows us to fine-tune the pretrained embeddings for our particular language modeling task.</p>

<p>In some cases, it may be beneficial to actually train an embedding layer from scratch end-to-end as part of whatever the language modeling task-of-interest is. The training procedure is similar to word2vec in that we start with one-hot embeddings of the words and then map them into an embedding space with an embedding matrix, but then the output directly goes into the next layer or stage in the language model. When we train the language model, the gradients automatically update the embedding matrix based on the overall loss of the language modeling task. While this method does tend to produce more accurate results for the end-to-end task, it does require a large corpus to train since we’re training the embeddings from scratch along with the rest of the language model rather than pulling the embeddings off-the-shelf.</p>
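<p>In PyTorch, for example, both approaches can be sketched with <code>nn.Embedding</code>; the vectors below are random stand-ins for real pretrained embeddings:</p>

```python
import torch
import torch.nn as nn

# stand-in for loaded pretrained vectors: a 4-word vocab, 100-d embeddings
pretrained = torch.randn(4, 100)

# freeze=False keeps the embedding weights trainable, so gradients from the
# downstream language-modeling loss fine-tune the pretrained vectors;
# nn.Embedding(4, 100) alone would instead train an embedding from scratch
emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

tokens = torch.tensor([0, 2, 3])   # word indices
vectors = emb(tokens)              # (3, 100) embedding lookups
```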

<h2 id="semantic-properties">Semantic Properties</h2>
<p>After we’ve trained embeddings, we can see how well they model word semantics. One canonical task that demonstrates semantic analysis is completing a word analogy. For example, “man is to king as woman is to X”. The correct answer is “queen”. If our embeddings are truly learning correct semantic relationships, then they should be able to solve these kinds of analogies. We can represent these in the embedding space with vector arithmetic (since vector spaces are linear) and look at which other embeddings lie close to the result.</p>

\[\overrightarrow{\text{king}} - \overrightarrow{\text{man}} + \overrightarrow{\text{woman}} \approx \overrightarrow{\text{queen}}\]

<p>In other words, the embedding for “king” minus the embedding for “man” plus the embedding for “woman” should be close to the embedding for “queen”. This turns out to be true for word2vec and GloVe embeddings! So it seems like they are actually capturing certain kinds of semantic relations. Let’s actually write some code to load some pre-trained GloVe embeddings and show this!</p>

<p>First, we’ll need to go to the official <a href="https://nlp.stanford.edu/projects/glove/">GloVe website</a> and download the pre-trained embedding and unzip them. For this example, we’ll use the <a href="https://nlp.stanford.edu/data/glove.6B.zip">glove.6B.zip</a> with 100-dimensional GloVe embeddings. Each line in the text file is the word followed by the values of the embeddings so we can load that into a dictionary for easy lookup. Let’s try computing the similarity of $\overrightarrow{\text{king}} - \overrightarrow{\text{man}} + \overrightarrow{\text{woman}}$ and $\overrightarrow{\text{queen}}$ and also an unrelated word like $\overrightarrow{\text{egg}}$ and see if the embeddings correctly note similarities.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">dot</span>
<span class="kn">from</span> <span class="nn">numpy.linalg</span> <span class="kn">import</span> <span class="n">norm</span>

<span class="n">embedding_dim</span> <span class="o">=</span> <span class="mi">100</span>

<span class="c1"># Define the local path to save the downloaded embeddings
</span><span class="n">glove_filename</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"glove.6B/glove.6B.</span><span class="si">{</span><span class="n">embedding_dim</span><span class="si">}</span><span class="s">d.txt"</span>

<span class="c1"># Load the GloVe embeddings into a dictionary
</span><span class="n">e</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">glove_filename</span><span class="p">,</span> <span class="s">'r'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">values</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">split</span><span class="p">()</span>
        <span class="n">word</span> <span class="o">=</span> <span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">embedding</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">values</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'float32'</span><span class="p">)</span>
        <span class="n">e</span><span class="p">[</span><span class="n">word</span><span class="p">]</span> <span class="o">=</span> <span class="n">embedding</span>

<span class="c1"># compute analogy
</span><span class="n">result</span> <span class="o">=</span> <span class="n">e</span><span class="p">[</span><span class="s">'king'</span><span class="p">]</span> <span class="o">-</span> <span class="n">e</span><span class="p">[</span><span class="s">'man'</span><span class="p">]</span> <span class="o">+</span> <span class="n">e</span><span class="p">[</span><span class="s">'woman'</span><span class="p">]</span>

<span class="c1"># cosine similarity of the result and the embedding for queen
</span><span class="n">cos_sim</span> <span class="o">=</span> <span class="n">dot</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">e</span><span class="p">[</span><span class="s">'queen'</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="n">norm</span><span class="p">(</span><span class="n">result</span><span class="p">)</span> <span class="o">*</span> <span class="n">norm</span><span class="p">(</span><span class="n">e</span><span class="p">[</span><span class="s">'queen'</span><span class="p">]))</span>
<span class="k">print</span><span class="p">(</span><span class="n">cos_sim</span><span class="p">)</span>

<span class="c1"># cosine similarity of the result and the embedding for egg
</span><span class="n">cos_sim</span> <span class="o">=</span> <span class="n">dot</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">e</span><span class="p">[</span><span class="s">'egg'</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="n">norm</span><span class="p">(</span><span class="n">result</span><span class="p">)</span> <span class="o">*</span> <span class="n">norm</span><span class="p">(</span><span class="n">e</span><span class="p">[</span><span class="s">'egg'</span><span class="p">]))</span>
<span class="k">print</span><span class="p">(</span><span class="n">cos_sim</span><span class="p">)</span>
</code></pre></div></div>

<p>The cosine similarity for the result and queen is $0.7834413$ while the cosine similarity for the result and egg is only $0.19395089$. As expected, “queen” is a far more appropriate solution to the word analogy than “egg”!</p>

<p>It would be nice to visualize the embeddings of different words relative to each other. However embeddings tend to be higher-dimensional vectors so how can we meaningfully visualize them? There are two common dimensionality-reduction techniques: (i) principal component analysis (PCA) and (ii) t-distributed Stochastic Neighbor Embedding (t-SNE). The intuition behind PCA is to repeatedly project along the dimension with the highest variance (since it has higher discriminative power) using a linear algebra technique such as singular value decomposition (SVD) until we hit the target dimension. t-SNE solves an optimization problem that tries to project the data such that the distances in the higher-dimensional space are similar to distances in the lower-dimensional space, thus locally preserving the structure of the higher-dimensional space in the lower-dimensional space. Both are good techniques to lower the dimensionality of the embedding so we can visualize words as points on a plane (while still preserving their semantics).</p>

<p>Fortunately, scikit-learn provides implementations for both so we can plot them side-by-side and see the differences.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>
<span class="kn">from</span> <span class="nn">sklearn.manifold</span> <span class="kn">import</span> <span class="n">TSNE</span>

<span class="n">words_to_plot</span> <span class="o">=</span> <span class="p">[</span><span class="s">"king"</span><span class="p">,</span> <span class="s">"man"</span><span class="p">,</span> <span class="s">"queen"</span><span class="p">,</span> <span class="s">"woman"</span><span class="p">,</span> <span class="s">"egg"</span><span class="p">,</span> <span class="s">"chicken"</span><span class="p">,</span> <span class="s">"frog"</span><span class="p">,</span> <span class="s">"snake"</span><span class="p">]</span>
<span class="n">embeddings_to_plot</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">e</span><span class="p">[</span><span class="n">word</span><span class="p">]</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">words_to_plot</span><span class="p">])</span>

<span class="n">pca</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">reduced_embeddings_pca</span> <span class="o">=</span> <span class="n">pca</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">embeddings_to_plot</span><span class="p">)</span>

<span class="n">tsne</span> <span class="o">=</span> <span class="n">TSNE</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span> <span class="n">perplexity</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">reduced_embeddings_tsne</span> <span class="o">=</span> <span class="n">tsne</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">embeddings_to_plot</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>

<span class="c1"># PCA on the left
</span><span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">reduced_embeddings_pca</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">reduced_embeddings_pca</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">])</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">word</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">words_to_plot</span><span class="p">):</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">annotate</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="p">(</span><span class="n">reduced_embeddings_pca</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">reduced_embeddings_pca</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">1</span><span class="p">]))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'PCA Projection'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Principal Component 1'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Principal Component 2'</span><span class="p">)</span>

<span class="c1"># t-SNE on the right
</span><span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">reduced_embeddings_tsne</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">reduced_embeddings_tsne</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">])</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">word</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">words_to_plot</span><span class="p">):</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">annotate</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="p">(</span><span class="n">reduced_embeddings_tsne</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">reduced_embeddings_tsne</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">1</span><span class="p">]))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'t-SNE Projection'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'t-SNE Component 1'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'t-SNE Component 2'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p>The resulting plot shows that the vector difference between “king” and “man” is roughly the same as that of “queen” and “woman” in both plots! In the t-SNE plot, however, we see that the vectors are a bit closer in terms of magnitude and direction.</p>

<p><img src="/images/embeddings/embedding-projection.png" alt="Projecting embeddings into 2D using PCA and t-SNE" title="Projecting embeddings into 2D using PCA and t-SNE" /></p>

<p>There are some other interesting observations with the remaining words: “snake” and “frog” are closer together than, say, “man” and “egg” because, while they’re not synonyms, they’re still related words (both being animals that lay eggs). Try plotting other words to see how they cluster together in the lower-dimensional space!</p>
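<p>As a quick sanity check of this idea of “relatedness”, we can compare vectors with cosine similarity. The vectors below are made-up 2D toys purely for illustration (real embeddings have hundreds of dimensions); the point is only that related words point in similar directions:</p>

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 2D vectors purely for illustration, not real embeddings
toy = {
    'snake': np.array([0.9, 0.1]),
    'frog':  np.array([0.8, 0.2]),
    'man':   np.array([0.1, 0.9]),
    'egg':   np.array([0.5, 0.5]),
}

print(cosine_similarity(toy['snake'], toy['frog']))  # high: related words
print(cosine_similarity(toy['man'], toy['egg']))     # lower: less related
```

<p>With real embeddings, the same comparison is what drives nearest-neighbor lookups and the word-analogy arithmetic above.</p>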

<h1 id="conclusion">Conclusion</h1>

<p>Embeddings are a word representation that preserves semantic properties of words, such as relations to other words and connotation, in a much better way than representing words as strings of characters. Representing documents as vectors is counter-intuitively more straightforward so we started with learning about term frequency and document frequency; that also helped illustrate some interesting concepts like how words that occur <em>too</em> frequently should be downweighted since they don’t have discriminative power. To transition to representing individual words as embeddings, we learned about the distributional hypothesis that stated the meaning of a word depends on the context around it. Our first word embedding model word2vec trained embeddings with that in mind: train a model to predict if a word lies in the context window of a target word. Our next embedding model did a bit better by also looking at global word-word co-occurrences in addition to the local context window approach that word2vec uses. The final embedding model we discussed was a more recent type of model where we learn the embeddings from scratch as part of the language modeling task in an end-to-end fashion. Finally, we used embeddings to show how they can model semantic relations using word analogies as an example semantic understanding task.</p>

<p>Now that we have a vectorized format for embeddings, we can use them for different kinds of language models, the most popular and accurate ones being neural network language models, which we’ll cover next time 🙂</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Moving beyond n-grams, embeddings let us better represent the meaning of words and quantify their relationships to other words.]]></summary></entry><entry><title type="html">Language Modeling - Part 1: n-gram Models</title><link href="/n-grams.html" rel="alternate" type="text/html" title="Language Modeling - Part 1: n-gram Models" /><published>2023-07-14T00:00:00+00:00</published><updated>2023-07-14T00:00:00+00:00</updated><id>/n-grams</id><content type="html" xml:base="/n-grams.html"><![CDATA[<p>Over the past several months, Large Language Models (LLMs) such as ChatGPT, GPT-4, and AutoGPT, have flooded the Internet with all kinds of different applications and use-cases. These are regarded as language models that can remember context as well as understand their own capabilities. They’re often treated as black-boxes where the majority of the implementation details are left to the researchers. However, having some understanding of how they work can also help people more clearly and concisely instruct the model to get the desired output.</p>

<p>Rather than jumping straight to how LLMs work, I think it’s helpful to cover some prerequisite knowledge to help us demystify LLMs. In this post, we’ll go back in time before neural networks and talk about language, language modeling, and n-gram language models since they’re simple to understand and we can do an example by hand.</p>

<h1 id="language">Language</h1>
<p>Before we start with n-gram models, we need to understand the kind of data we’re working with. If we were going to delve into convolutional neural networks (CNNs), we’d start our discussion with images and image data. Since we’re talking about language modeling, let’s talk about language so we can better motivate why language modeling is very hard. One definition of <strong>language</strong> that’s particularly relevant to language modeling is a <em>structured system of communication with a grammar and vocabulary</em> (note this applies to spoken, written, and sign language). Given you’re reading this post in the English language, you’re probably already familiar with vocabulary and grammar, so let me present to you a sentence.</p>

<blockquote>
  <p>The quick brown fox jumps over the lazy dog.</p>
</blockquote>

<p>You might recognize this sentence as being one that uses each letter of the English/Latin alphabet at least once. Immediately we see the words belonging to the vocabulary and their part-of-speech: nouns like “fox” and “dog”; adjectives like “quick”, “brown”, “lazy”; articles like “the”; verbs like “jumps”; and prepositions like “over”.</p>

<p><strong>Grammar</strong> is what dictates the ordering of the words in the vocabulary: the subject “fox” comes before the verb “jumps” and the direct object “dog”. This ordering depends on the language however. For example, if I translated the above sentence into Japanese, it would read: 素早い茶色のキツネが怠惰な犬を飛び越えます。A literal translation would go like “Quick brown fox lazy dog jumped over”. Notice how the verb came at the end rather than between the subject and direct object.</p>

<p>These problems help illustrate why we can’t simply have a model that performs a one-to-one mapping when we try to model languages. We might end up with more words, e.g., if the target language uses particle words, or fewer words, e.g., if the target language doesn’t have articles. Even if we did have the same number of words, the ordering might change. For example, in English, we’d say “red car” but in Spanish we’d say “carro rojo”, which literally translates to “car red”: the adjective comes after the noun it describes.</p>

<p>To summarize, language is very difficult! Even for humans! So it’s going to be a challenge for computers too.</p>

<h1 id="applications-of-language-modeling">Applications of Language Modeling</h1>

<p>With that little aside on languages, before we formally define language modeling, let’s look at a few applications that use some kind of language modeling under-the-hood.</p>

<p><b>Sentiment Analysis</b>. When reading an Amazon review, as humans, we can tell if they’re positive or negative. We want to have a language model that can do the same kind of thing. Given a sequence of text, we want to see if the sentiment is good or bad. Cases like “It’s hard not to hate this movie” are particularly challenging and need to be handled correctly. This particular application of language modeling is used in “Voice of the Customer” style analysis to gauge perceptions about a company or their products.</p>

<p><b>Automatic Speech Recognition</b>. Language modeling can be useful for speech recognition by being able to correctly model sentences, especially for words that sound the same but are written differently like “tear” and “tier”.</p>

<p><b>Neural Machine Translation</b>. Google Translate is a great example of this! If we have language models of different languages, implicitly or explicitly, we can translate between the languages that they model!</p>

<p><b>Text Generation</b>. This is what ChatGPT has grown famous for: generating text! This application of language modeling can be used for question answering, code generation, summarization, and a lot more applications.</p>

<h1 id="language-modeling">Language Modeling</h1>

<p>Now that we’ve seen a few applications, what do all of these have in common? One point of commonality is that we want to understand and analyze text against the training corpus to ensure that we’re consistent with it. If our model was trained on a dataset of English sentences, we don’t want it generating grammatically incorrect sentences. In other words, we want to ensure that the outputs “conform” to the dataset.</p>

<p>One way to measure this is to compute a probability of “belonging”. For some given input sequence, if the probability is high, then we expect that sequence to be close to what we’ve seen in the dataset. If that probability is low, then that sequence is likely something that doesn’t make sense in the dataset. For example, a good language model would score something like $p(\texttt{The quick brown fox jumps over the lazy dog})$ high and something like $p(\texttt{The fox brown jumped dog laziness over lazy})$ low because the former has proper grammar and uses known words in the vocabulary.</p>

<p>This is what a language model does: given an input sequence $x_1,\cdots,x_N$, it assigns a probability $p(x_1,\cdots,x_N)$ that represents how likely it is to appear in the dataset. That seems a little strange given we’ve just discussed the above applications. What does something like generating text have to do with assigning probabilities to sequences? Well we want the generated text to match well with the dataset, don’t we? In other words, we don’t want text with poor grammar or broken sentences. This also explains why those phenomenal LLMs are trained on billions of examples: they need diversity in order to assign high probabilities to sentences that encode facts and data of the dataset.</p>

<p>So how do we actually compute this probability? Well the most basic definition of probability is “number of events that happened” / “number of all possible events” so we can try to do the same thing with this sequence of words.</p>

\[p(w_1,\dots, w_N)=\displaystyle\frac{C(w_1,\dots, w_N)}{\displaystyle\sum_{w_1,\dots,w_N} C(w_1,\dots, w_N)}\]

<p>So for a word sequence $w_1,\dots, w_N$, we count how many times we find that sequence in our corpus and divide by the count of all possible word sequences of length $N$. There are several problems with this. To compute the numerator, we need to count a particular sequence in the dataset, but notice that this gets harder to do the longer the sequence is. For example, finding the sequence “the cat” is far easier than finding the sequence “the cat sat on the mat wearing a burgundy hat”. To compute the denominator, we need every ordered sequence of English words of length $N$, where words can repeat. To give a sense of scale, Merriam-Webster estimates there are about ~1 million words, so the number of possible sequences is</p>

\[\underbrace{1\mathrm{e}6\times 1\mathrm{e}6\times\cdots\times 1\mathrm{e}6}_{N\text{ times}} = (1\mathrm{e}6)^N\]

<p>In other words, for each of the $N$ positions in the sequence, there are about a million possibilities to account for, so the count grows exponentially with the sequence length: even for $N=3$, that’s $10^{18}$ sequences! These reasons make it difficult to compute language model probabilities in that form, so we have to try something else. If we remember some probability theory, we can try to rearrange the terms using the chain rule of probability.</p>

\[\begin{align*}
    p(w_1,\dots, w_N) &amp;= p(w_N|w_1,\dots,w_{N-1})p(w_1,\dots,w_{N-1})\\
    &amp;= p(w_N|w_1,\dots,w_{N-1})p(w_{N-1}|w_1,\dots,w_{N-2})p(w_1,\dots,w_{N-2})\\
    &amp;= \displaystyle\prod_{i=1}^N p(w_i|w_1,\dots,w_{i-1})\\
\end{align*}\]

<p>So we’ve decomposed the joint distribution of the language model into a product of conditionals $p(w_i\vert w_1,\dots,w_{i-1})$. Intuitively, this measures the probability that word $w_i$ follows the previous sequence $w_1,\dots,w_{i-1}$: each word depends on all of the words before it. So let’s see if this form is any easier to compute by counting up sequences.</p>

\[p(w_i|w_1,\dots,w_{i-1})=\displaystyle\frac{C(w_1,\dots,w_i)}{C(w_1,\dots,w_{i-1})}\]

<p>This looks a little better! Intuitively, for the numerator, we count a particular sequence up to $i$: $w_1,\dots,w_i$ in the corpus. But for the denominator, we only count up to the previous word: $w_1,\dots,w_{i-1}$. This is a bit better than going up to the entire sequence length $N$, but it’s still a problem. In particular, the biggest problem is the history $w_1,\dots,w_{i-1}$. How do we deal with it?</p>
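<p>This counting ratio can be sketched on a toy corpus. For simplicity, the sketch below only counts sequences that start a sentence, which is enough to illustrate dividing one count by the other:</p>

```python
corpus = [
    ['the', 'cat', 'sat'],
    ['the', 'cat', 'ran'],
    ['the', 'dog', 'sat'],
]

def count(seq):
    """C(w_1,...,w_i): how many sentences start with this exact prefix."""
    return sum(1 for s in corpus if s[:len(seq)] == seq)

def cond_prob(word, history):
    """p(w_i | w_1,...,w_{i-1}) = C(history + word) / C(history)."""
    return count(history + [word]) / count(history)

print(cond_prob('cat', ['the']))         # 2 of the 3 'the ...' sentences
print(cond_prob('sat', ['the', 'cat']))  # 1 of the 2 'the cat ...' sentences
```

<p>Even here you can see the problem: the longer the history, the fewer matches we find, until the counts hit zero.</p>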

<h1 id="n-gram-model">n-gram Model</h1>

<p>Rather than dealing with the entire history up to a certain word, we can approximate it using only the past few words! This is the premise behind <strong>n-gram models</strong>: we approximate the entire past history using the past $n$ words.</p>

\[p(w_i|w_1,\dots,w_{i-1})\approx p(w_i|w_{i-(n-1)},\dots,w_{i-1})\]

<p>A <strong>unigram</strong> model looks like $p(w_i)$; a <strong>bigram</strong> model looks like $p(w_i\vert w_{i-1})$; a <strong>trigram</strong> model looks like $p(w_i\vert w_{i-1},w_{i-2})$. Intuitively, a unigram model looks at no prior words; a bigram model looks only at the previous word; a trigram model looks only at the past two words. Now let’s see if it’s easier to compute these conditional distributions using the same counting equation.</p>

\[\begin{align*}
    p(w_i|w_{i-1})&amp;=\displaystyle\frac{C(w_{i-1}, w_i)}{\displaystyle\sum_w C(w_{i-1}, w)}\\
    &amp;\to\displaystyle\frac{C(w_{i-1}, w_i)}{C(w_{i-1})}
\end{align*}\]

<p>We go to the second line because summing $C(w_{i-1}, w)$ over every possible next word $w$ just counts $w_{i-1}$ itself; this counting ratio is the maximum likelihood estimate. Computing these counts is much easier! To see this, let’s actually compute an n-gram model by hand using a very small corpus.</p>

\[\texttt{&lt;SOS&gt;}\text{I am Sam}\texttt{&lt;EOS&gt;}\]

\[\texttt{&lt;SOS&gt;}\text{Sam I am}\texttt{&lt;EOS&gt;}\]

<p>Practically, we use special tokens that denote the start of the sequence (<small>&lt;SOS&gt;</small>) and end of sequence (<small>&lt;EOS&gt;</small>). The <small>&lt;EOS&gt;</small> token is required to normalize the conditional distribution into a true probability distribution. The <small>&lt;SOS&gt;</small> token is optional but it becomes useful for sampling the language model later so we’ll add it. Treating these as two special tokens, let’s compute the bigram word counts and probabilities by hand.</p>

<table>
  <thead>
    <tr>
      <th>$w_i$</th>
      <th>$w_{i-1}$</th>
      <th>$p(w_i\vert w_{i-1})$</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>I</td>
      <td><small>&lt;SOS&gt;</small></td>
      <td>$\frac{1}{2}$</td>
    </tr>
    <tr>
      <td>Sam</td>
      <td><small>&lt;SOS&gt;</small></td>
      <td>$\frac{1}{2}$</td>
    </tr>
    <tr>
      <td><small>&lt;EOS&gt;</small></td>
      <td>Sam</td>
      <td>$\frac{1}{2}$</td>
    </tr>
    <tr>
      <td>I</td>
      <td>Sam</td>
      <td>$\frac{1}{2}$</td>
    </tr>
    <tr>
      <td>Sam</td>
      <td>am</td>
      <td>$\frac{1}{2}$</td>
    </tr>
    <tr>
      <td><small>&lt;EOS&gt;</small></td>
      <td>am</td>
      <td>$\frac{1}{2}$</td>
    </tr>
    <tr>
      <td>am</td>
      <td>I</td>
      <td>$1$</td>
    </tr>
  </tbody>
</table>

<p>Concretely, let’s see how to compute $p(\text{I}\vert\text{Sam})$. Intuitively, this is asking for the likelihood that “I” follows “Sam”. In our corpus, we have two instances of “Sam” and the words after are “<small>&lt;EOS&gt;</small>” and “I”. So overall, the likelihood is $\frac{1}{2}$. Notice how the conditionals form a valid probability distribution, e.g., $\sum_w p(w\vert\text{Sam}) = 1$.</p>
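<p>The table above can be reproduced with a few lines of counting; here’s a minimal sketch using the same two-sentence corpus:</p>

```python
from collections import Counter

sentences = [['<SOS>', 'I', 'am', 'Sam', '<EOS>'],
             ['<SOS>', 'Sam', 'I', 'am', '<EOS>']]

bigram_counts = Counter()
context_counts = Counter()
for s in sentences:
    for prev, cur in zip(s, s[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p(word, prev):
    """Maximum likelihood estimate p(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev]

print(p('I', 'Sam'))  # 1/2: 'Sam' is followed by '<EOS>' once and 'I' once
print(p('am', 'I'))   # 1.0: 'I' is always followed by 'am'
```

<p>Each row of the hand-computed table falls out of these two counters.</p>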

<p>With this model, we can approximate the full language model with a product of n-grams. Consider bigrams:</p>

\[\begin{align*}
    p(w_1,\dots, w_N)&amp;\approx p(w_2|w_1)p(w_3|w_2)\cdots p(w_N|w_{N-1})\\
    p(\text{the cat sat on the mat}) &amp;\approx p(\text{the}|\texttt{&lt;SOS&gt;})p(\text{cat}|\text{the})\cdots p(\texttt{&lt;EOS&gt;}|\text{mat})
\end{align*}\]

<p>This is a lot more tractable! So now we have an approximation of the language model! What other kinds of things can we do? We can sample from language models: we start with the <small>&lt;SOS&gt;</small> token and then use the conditionals to sample each next word. We can either keep sampling until we hit an <small>&lt;EOS&gt;</small> or keep sampling for a fixed number of words. This is also why we have the <small>&lt;SOS&gt;</small> token: without it, we’d need to specify a start word; with it, we have a uniform starting point.</p>
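<p>Sampling can be sketched with the same toy corpus: start at <small>&lt;SOS&gt;</small> and repeatedly draw the next word in proportion to the bigram counts (the 20-token cap is just a safety guard for this sketch):</p>

```python
import random
from collections import Counter

sentences = [['<SOS>', 'I', 'am', 'Sam', '<EOS>'],
             ['<SOS>', 'Sam', 'I', 'am', '<EOS>']]

bigrams = Counter()
for s in sentences:
    for prev, cur in zip(s, s[1:]):
        bigrams[(prev, cur)] += 1

def sample_next(prev):
    """Draw the next word in proportion to the bigram counts for `prev`."""
    candidates = [(w, c) for (p, w), c in bigrams.items() if p == prev]
    words, weights = zip(*candidates)
    return random.choices(words, weights=weights)[0]

random.seed(0)
tokens = ['<SOS>']
while tokens[-1] != '<EOS>' and len(tokens) < 20:
    tokens.append(sample_next(tokens[-1]))
print(' '.join(tokens))
```

<p>Every adjacent pair in the output is a bigram the model has seen, which is exactly why such samples “conform” to the corpus.</p>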

<h1 id="practical-language-modeling">Practical Language Modeling</h1>

<p>Now that we’ve covered the maths, let’s talk about some practical aspects of language modeling. The first problem we can address is what we just talked about: approximating a full language model with the product of n-grams.</p>

\[p(w_1,\dots, w_N)\approx p(w_2|w_1)p(w_3|w_2)\cdots p(w_N|w_{N-1})\]

<p>What’s the problem with this? Numerically, when we multiply a bunch of probabilities together, we’re multiplying numbers in $[0, 1]$, so the product gets smaller and smaller and risks underflowing to 0. To avoid this, we work in log-space, summing log-probabilities instead of multiplying probabilities:</p>

\[\exp\Big[\log p(w_2|w_1)+\log p(w_3|w_2)+\cdots+\log p(w_N|w_{N-1})\Big]\]

<p>In log-space, multiplying becomes adding, so the number just gets increasingly negative rather than vanishingly small. If we ever need the probability itself, we can take the exponential to “undo” the log-space, though in practice we usually compare sequences directly in log-space.</p>
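<p>A quick sketch shows why this matters numerically: a naive product of many small probabilities underflows to exactly 0, while the sum of logs stays finite:</p>

```python
import math

probs = [1e-5] * 100  # a long sequence of small conditional probabilities

naive = 1.0
for p in probs:
    naive *= p
print(naive)  # 0.0: the product underflowed

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # roughly -1151.3: finite and comparable across sequences
```

<p>Two sequences can still be ranked by comparing their log-probabilities directly, with no exponentiation needed.</p>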

<p>Going beyond the numerical aspects, practically, language models need to be trained on a large corpus because of sparsity. After we train, two major problems we encounter in the field are unknown words not in the training corpus and words that are known but used in an unknown context.</p>

<p>For the former, when we train language models, we often construct a vocabulary during training. This can either be an open vocabulary where we add words as we see them or a closed vocabulary where we agree on the words ahead of time (perhaps the most common $k$ words for example). In either case, during inference, we’ll encounter out-of-vocabulary (OOV) words. One solution to this is to create a special token called <small>&lt;UNK&gt;</small> that represents unknown words. For any OOV word, we map it to the <small>&lt;UNK&gt;</small> token and treat it like any other token in our vocabulary.</p>
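<p>A minimal sketch of the <small>&lt;UNK&gt;</small> mapping (the tiny closed vocabulary here is just for illustration):</p>

```python
vocab = {'<UNK>', 'the', 'cat', 'sat', 'on', 'mat'}

def tokenize(sentence):
    """Map any out-of-vocabulary word to the <UNK> token."""
    return [w if w in vocab else '<UNK>' for w in sentence.split()]

print(tokenize('the cat sat on the quantum mat'))
# ['the', 'cat', 'sat', 'on', 'the', '<UNK>', 'mat']
```

<p>Since <small>&lt;UNK&gt;</small> is in the vocabulary, it gets counted and scored like any other token during training and inference.</p>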

<h2 id="smoothing">Smoothing</h2>

<p>What about known words in an unknown context? Let’s consider how we compute bigrams.</p>

\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)}{C(w_{i-1})}\]

<p>Mathematically, the problem is that the numerator can be zero. So the simplest solution is to make it not zero by adding $1$. But we can’t simply add $1$ without correcting the denominator since we want a valid probability distribution. So we also need to add something to the denominator. Since we’re adding $1$ to each count for each word, we need to add a count for the total number of words in the vocabulary $V$.</p>

\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)+1}{C(w_{i-1})+V}\]

<p>With this, we’re guaranteed not to have zero counts! This is called <strong>Laplace Smoothing</strong>. The issue with this kind of smoothing is that blindly adding $1$ shifts too much probability mass from observed events to unseen ones. We can generalize it to add some smaller $k$ (and normalize by $kV$) so that the probability mass eases toward the unseen contexts more gently.</p>

\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)+k}{C(w_{i-1})+kV}\]

<p>This is called <strong>Add-$k$ Smoothing</strong>. With an appropriately tuned $k$, it can perform better than Laplace Smoothing in most cases.</p>
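<p>Here’s a sketch of Add-$k$ smoothing on the same toy corpus from before: unseen bigrams now get a small nonzero probability, and each conditional still sums to $1$ over the vocabulary:</p>

```python
from collections import Counter

sentences = [['<SOS>', 'I', 'am', 'Sam', '<EOS>'],
             ['<SOS>', 'Sam', 'I', 'am', '<EOS>']]

bigrams, contexts = Counter(), Counter()
for s in sentences:
    for prev, cur in zip(s, s[1:]):
        bigrams[(prev, cur)] += 1
        contexts[prev] += 1

vocab = {w for s in sentences for w in s}
V = len(vocab)  # 5 token types in this toy corpus

def p_smoothed(word, prev, k=1.0):
    """Add-k estimate: every bigram count gets +k, normalized by +kV."""
    return (bigrams[(prev, word)] + k) / (contexts[prev] + k * V)

print(p_smoothed('Sam', 'am'))  # seen bigram: (1 + 1) / (2 + 5)
print(p_smoothed('I', 'am'))    # unseen bigram, but nonzero: 1 / 7
```

<p>Setting <code class="language-plaintext highlighter-rouge">k</code> below 1 shifts less mass to the unseen events, which is the generalization described above.</p>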

<h2 id="backoff-and-interpolation">Backoff and Interpolation</h2>

<p>One alternative to smoothing is to use less context when the full context isn’t available. The intuition is that if we have no counts for the bigram $p(w_i\vert w_{i-1})$, we can fall back to the unigram $p(w_i)$ in its place. This technique is called <strong>backoff</strong> because we back off to a smaller n-gram.</p>

<p>Going a step further, we don’t necessarily have to back off to just the $(n-1)$-gram. We can always consider all of the lower-order n-grams and take a linear combination of them.</p>

\[\begin{align*}
    p(w_i|w_{i-2},w_{i-1})&amp;=\lambda_1 p(w_i)+\lambda_2 p(w_i|w_{i-1})+\lambda_3 p(w_i|w_{i-2},w_{i-1})\\
    \displaystyle\sum_i \lambda_i &amp;= 1
\end{align*}\]

<p>Here the $\lambda_i$s are the interpolation coefficients and they have to sum to $1$ to create a valid probability distribution. This allows us to consider all previous n-grams in the absence of data. Backoff with interpolation works pretty well in practice.</p>
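<p>The interpolation itself is just a weighted sum; the $\lambda$ values below are arbitrary picks for illustration (in practice they’re tuned, typically by maximizing likelihood on held-out data):</p>

```python
def interpolated_p(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linearly interpolate unigram, bigram, and trigram estimates."""
    assert abs(sum(lambdas) - 1.0) < 1e-9  # must remain a valid distribution
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# The trigram estimate is 0 (unseen context), but interpolation keeps p > 0
print(interpolated_p(p_uni=0.01, p_bi=0.2, p_tri=0.0))  # 0.061
```

<p>Because the $\lambda_i$s sum to $1$, the interpolated conditionals still form a valid probability distribution.</p>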

<h1 id="code">Code</h1>

<p>We’ve been talking about the theory of language models and n-gram models for a while but let’s actually try training one on a dataset and use it to generate text! Fortunately since they’ve been around for a while, training them is very simple with existing libraries.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">torchtext.datasets</span> <span class="kn">import</span> <span class="n">AG_NEWS</span>
<span class="kn">import</span> <span class="nn">re</span>

<span class="kn">from</span> <span class="nn">nltk.lm</span> <span class="kn">import</span> <span class="n">MLE</span>
<span class="kn">from</span> <span class="nn">nltk.lm.preprocessing</span> <span class="kn">import</span> <span class="n">padded_everygram_pipeline</span>

<span class="n">N</span> <span class="o">=</span> <span class="mi">6</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">AG_NEWS</span><span class="p">(</span><span class="n">root</span><span class="o">=</span><span class="s">'.'</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">'train'</span><span class="p">)</span>
<span class="n">train</span><span class="p">,</span> <span class="n">vocab</span> <span class="o">=</span> <span class="n">padded_everygram_pipeline</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> 
    <span class="p">[</span><span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'[^A-Za-z0-9 ]+'</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]).</span><span class="n">split</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">data</span><span class="p">])</span>
<span class="n">lm</span> <span class="o">=</span> <span class="n">MLE</span><span class="p">(</span><span class="n">N</span><span class="p">)</span>
<span class="n">lm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">' '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">lm</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">4</span><span class="p">)))</span>
</code></pre></div></div>

<p>We’re using the <code class="language-plaintext highlighter-rouge">AG_NEWS</code> dataset that contains 120,000 training examples of news articles across World, Sports, Business, and Science/Tech. The <code class="language-plaintext highlighter-rouge">padded_everygram_pipeline</code> adds the <small>&lt;SOS&gt;</small> and <small>&lt;EOS&gt;</small> tokens and creates n-grams and backoff n-grams; we’re using 6-grams which tend to work well in practice. For simplicity, we ignore any non-alphanumeric character besides spaces. Then we use a maximum likelihood estimator (similar to the conditional distribution tables we created above) to train our model. Finally, we can generate some examples of length 20.</p>

<p>I tried a bunch of different seeds and here are a few cherry-picked examples (I’ve truncated them after the <small>&lt;EOS&gt;</small> token):</p>
<ul>
  <li>Belgian cancer patient made infertile by chemotherapy has given birth following revolutionary treatment</li>
  <li>Two US citizens were killed when a truck bomb exploded in downtown Kabul in the second deadly blast to strike</li>
  <li>This year the White House had rejected a similar request made by 130 Republican and Democratic members of Congress</li>
  <li>Greatly enlarged museum is expected to turn into a cacophony on Saturday</li>
</ul>

<p>These look pretty good for just an n-gram model! Notice they retain some information, probabilistically, across the sequence. For example, in the first one, the word “infertile” comes before “birth” since, when generating “birth”, we could see “infertile” in our previous history.</p>

<p>But I also found scenarios where the generated text didn’t really make any sense. Here are some of those lemon-picked examples:</p>
<ul>
  <li>For small to medium businesses</li>
  <li>Can close the gap with SAP the world 39s biggest software company after buying US rival PeopleSoft Oracle 39s Chairman</li>
  <li></li>
  <li>British athletics appoint psychologist for 2008 Olympics British athletics chiefs have appointed sports psychologist David Collins</li>
</ul>

<p>These are sometimes short phrases or nonsensical with random digits. In one case, the language model just generated a bunch of <small>&lt;EOS&gt;</small> tokens! These examples also help show why neural language models tend to outperform simplistic n-gram models in general. Feel free to change the dataset and generate your own sentences!</p>

<h1 id="conclusion">Conclusion</h1>

<p>Large Language Models (LLMs) are gaining traction online for being able to perform complex and sequential reasoning tasks. They’re often treated as black-box models, but understanding a bit about how they work can make it easier to interact with them. Starting from the beginning, we learned a bit about language itself and why this problem is so difficult that it wasn’t solved decades ago. We introduced language modeling as the task of assigning a probability to a sequence of words based on how likely it is to appear in the dataset. Then we learned how $n$-gram models approximate the full previous history of a particular word using only the past $n$ words. We can use these models for scoring sequences and for sampling new ones. We finally discussed some practical considerations when training language models, including handling unknown words, smoothing, and backoff with interpolation.</p>

<p>There’s still a lot more to cover! This is just the start of our journey to the precipice of language modeling 🙂</p>]]></content><author><name></name></author><summary type="html"><![CDATA[We'll start our language modeling journey starting at classical language modeling using n-gram language models.]]></summary></entry></feed>