Functional, Declarative Audio Applications

This is an article that has been a long time coming, and one that I'm really excited to finally write. Today I want to introduce a project that I've been thinking about and working on for years: Elementary Audio.

Elementary is a JavaScript runtime for writing native audio applications, as well as a library and framework for composing audio signal processes. This is a tool heavily inspired by React.js with which you can write a complete audio DSP application, whether you're targeting web, desktop, mobile, embedded Linux, or any of the various native plugin formats for integrating into a modern DAW.

I recently shared a short presentation on YouTube (which I'll show below) that describes some of the problems that I feel we commonly face when writing native audio software, and how Elementary aims to solve those problems. Rather than repeat myself here I want to share part of the story that's not covered as thoroughly in the presentation: the story of how I arrived at the idea of a declarative, functional, reactive model for writing audio DSP, and why I think it's worth betting on.

Before I started working in audio software, I spent many years working primarily in frontend web development, and primarily in JavaScript. I worked on various projects over several years in every flavor of frontend JavaScript application framework before eventually joining the Instagram team at Facebook shortly after their acquisition. There, I had the opportunity to deeply learn, invest in, and even contribute to React.js.

Working with React.js prompted a drastic shift in the way I thought about writing software for one primary reason: it allowed me to reason about my application itself as a pure function of my application state. From there, the actual development follows a remarkably simple perspective: "Given a predefined application state X, I want my app to look like Y, and to behave like Z." That's it; I can think strictly in terms of what my application should be, and defer all of the how to React itself.
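
To make that a bit more concrete, here's a minimal sketch of that perspective in React itself. The Greeting component and its name prop are purely illustrative, not from any real codebase, and the sketch assumes React and ReactDOM are already loaded with a root element on the page:

// Given its props (the state), this component is a pure function describing
// what the UI should be: same input, same output, every time.
function Greeting(props) {
  return React.createElement('h1', null, 'Hello, ' + props.name);
}

// Rendering is then just "here's the state"; React handles the how.
ReactDOM.render(
  React.createElement(Greeting, {name: 'Ada'}),
  document.getElementById('root')
);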

Not long after that experience I immersed myself in audio software development with C++ and JUCE, writing creative plugins for computer musicians in their favorite DAWs. To say that this was a complete change of pace is an understatement. Of course, there's quite a learning curve when making such a stark transition, made steeper by the fact that I was teaching myself digital signal processing at the same time. But even after I had become comfortable with the domain and with the tools, I was still moving slowly. It dawned on me, especially after having the opportunity to work with a rock-solid team of audio software developers, that it wasn't just me: it seems to me now that the way our industry tends to write audio software is just slow. Not because we're taking our time to get things right, but because we're using the wrong tools.

In audio software, we strongly prioritize performance, and we have to: missing a deadline on the realtime audio rendering thread has drastic consequences. For that reason, of course it makes sense to start with a highly optimized, compiled language like C/C++ (or, nowadays, maybe Rust) with vector intrinsics and all of that good stuff. Unfortunately, working in C/C++ brings along many challenges that, in other languages, we don't even have to think about: memory management, object lifetimes, thread safety, etc. Now, this is a tradeoff that we simply must make to deliver the types of audio rendering that our industry delivers, and that's fine. Rust and future languages will come along and make this tradeoff more favorable, and I look forward to those developments.

Taking that tradeoff means engaging a whole series of problem domains inherent to writing C++ applications, as we try to build the audio software we set out to ship. The problems that then arise from those domains take real time, and complicate the development process. This is still just part of the tradeoff that we must make when working at the layer of realtime audio rendering. But where I speak of the wrong tooling, I'm really thinking of what happens in the domains where we don't need to take such a tradeoff. It seems to me that we have something of an "everything is a nail when you have a hammer" problem in audio software: once we've committed to writing our audio rendering in C/C++ we carry on to every other edge of our application with the same tools and the same inherent complications. We spend inordinate amounts of time trying to address problems of memory management where we should really just be using a garbage collector. We invite race conditions and synchronization headaches where we should really just be using an asynchronous event-driven engine.

The fact that so many audio software user interfaces are written in C++ drives this point home for me. The types of constraints and requirements that necessitate a language like C++ at the level of realtime audio rendering simply don't exist in the user interface piece of our application. The tradeoff isn't worth it anymore without those requirements, and in my experience, our industry wastes a vast amount of time and resources developing major pieces of our applications with tools that don't make sense. This realization was a major impetus for kicking off my React-JUCE project (formerly named Blueprint; an update on it is coming soon!).

When it comes to writing audio DSP, though, this way of thinking didn't start to set in for me until I found myself trying to design the right abstraction for my own library of easily composed audio processing blocks. The further I got into that problem, the more I realized that an object oriented approach is fundamentally incompatible with the type of abstraction I was looking for, and that managing object lifetimes compounded the difficulty of the problem drastically. Eventually it dawned on me that the task of composing audio processing blocks is itself free of those realtime performance constraints mentioned above: it's only the actual rendering step that must meet such requirements.

Once this idea took hold, the freedom to explore new abstractions for composing audio signal processes opened up immediately, and I knew right then that I wanted to arrive at an abstraction that brought with it that same perspective that I had grown to love in React.js: "Given a predefined application state X, I want the signal flow through my application to look like S." And I knew that to get there, I wanted to embrace the following ideas:

  • JavaScript, a language that's widely accessible, garbage collected, and fast
  • Pure functions. Anyone who has studied any of the various LISPs in the world knows that pure functions compose in a way that other structures don't, and it's this type of composition that enables assembling complex functions with ease
  • A declarative API for expressing signal flow as a composition of those pure functions, so that here too, in audio DSP, we can think in terms of the what and leave the how to the framework (a minimal sketch follows just after this list).
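
In its simplest form, that kind of composition looks something like the following sketch, which just renders an attenuated sine tone. It assumes an already-initialized Elementary core and the standard el.cycle and el.mul nodes:

// el.cycle and el.mul are pure functions: each returns a description of a
// signal, not a running processor. Composing the functions composes the signal.
core.render(el.mul(0.5, el.cycle(440)));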

Let me share an example that I think demonstrates this approach quite nicely. Imagine that we want to build a 3-band EQ with variable filter shapes (peaking, lowpass, highpass, bandpass, shelves, notch), cutoff, and resonance. To describe the state of our application at any point in time we can use plain objects:

let appState = {
    filters: [
        {type: 'lowshelf', cutoff: 200, res: 0.717, gain: 2.0},
        {type: 'peak', cutoff: 482, res: 2.331, gain: 2.0},
        {type: 'peak', cutoff: 2020, res: 1.351, gain: 3.0},
    ]
};

Given such a starting point, we can then use Elementary's functional, declarative API to write the signal flow of our application as a pure function of this state:

function eq3(state, input) {
    return state.filters.reduce(function(acc, next) {
        let {type, cutoff, res, gain} = next;
        
        switch (type) {
            case 'lowshelf': return el.lowshelf(cutoff, res, gain, acc);
            case 'highshelf': return el.highshelf(cutoff, res, gain, acc);
            case 'lowpass': return el.lowpass(cutoff, res, acc);
            case 'highpass': return el.highpass(cutoff, res, acc);
            case 'bandpass': return el.bandpass(cutoff, res, acc);
            case 'peak': return el.peak(cutoff, res, gain, acc);
            case 'notch': return el.notch(cutoff, res, acc);
        }
        
        return acc;
    }, input);
}

Now we have a pure function which describes the signal flow of our application as a function of the application state, and we have a chunk of application state that we can use to invoke said function. The rest, we leave up to Elementary:

core.on('load', function() {
  core.render(...el.inputs().map(function(input) {
    return eq3(appState, input);
  }));
});

To me this is already a significant improvement over any form of signal composition I've seen in the audio DSP domain, but it's just the tip of the iceberg in Elementary. For example, if we wanted to extend our 3-band EQ into an 8-band EQ, the only thing that needs to change is the state: describe 8 filters rather than 3, as sketched below. If we then wanted to disable 3 of those filters in response to some hypothetical user gesture, we could simply remove them from the state.
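
For example, a sketch of that 8-band state might look like the following, where the center frequencies and Q values are arbitrary and purely for illustration:

// Eight peaking bands spread across the spectrum. The eq3 function above works
// unchanged (its name notwithstanding): it simply folds over however many
// filters the state describes.
appState.filters = [100, 200, 400, 800, 1600, 3200, 6400, 12800].map(function(cutoff) {
  return {type: 'peak', cutoff: cutoff, res: 0.717, gain: 0.0};
});

// Disabling a band later is nothing more than dropping its entry from this array.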

Now, to really complete this model we have to address the way that applications change over time. It's convenient to write a single static description of your audio processing pipeline and then never have to change it, and surely we already have tools that make that process nicer for the developer. But what happens when we need to address the dynamic behavior of our application? Consider again the above example, where perhaps our app initializes with 8 active filters in our EQ and then the user invokes an action that should disable one of them. What should we do? What should our tools help us do? For example, in a "modern" audio software approach we would likely set the appropriate filter node here to bypass. Personally, I wonder sometimes if the whole idea of bypassing audio processing blocks comes out of the fact that we don't have a good answer to dynamic behavior in audio DSP.

Addressing this situation is perhaps the bread and butter of both React.js and Elementary, and the answer is simple: update your app state, invoke the same function that you've already written, and render it. Under the hood, Elementary will consider both what you've already asked it to render and what you are now asking it to render, and it will modify the realtime processing operation to effect a change from the current state to the new one. The idea of bypass fundamentally doesn't exist: if you no longer need to process a node, you reflect that in your app state, omit the node from your resulting signal flow description, and Elementary will smoothly remove it from the realtime processing.

lastFilterDisableButton.on('click', function(e) {
  // Remove the last filter from the app state
  appState.filters.pop();
  
  // Invoke the same render function with our new state.
  // Elementary will understand the change and apply it
  // dynamically
  core.render(...el.inputs().map(function(input) {
    return eq3(appState, input);
  }));
});

To me, this point drives home the value of working with such a functional, declarative abstraction. Whether I'm trying to write a static audio signal process that never changes or a flexible, dynamic process that needs to adapt to the user's intentions, the approach is the same. We can think only in terms of what our application should be, and defer all of the how to Elementary.