Writing a Bundler - Part 1

Introduction

In today's web development landscape, web applications require a bundler to transpile, combine, and optimize the source code of a project so that end users receive distributable code that is optimally packaged for downloading and executing within their browser.

Bundlers themselves have been around for a while, gaining more features as web applications have become more complex.

Many bundlers have appeared over the years, each learning from their predecessors and brining with them distinct features and advancements - such as;

Many web tools are written in JavaScript or TypeScript, like the TypeScript compiler itself. Recently, a popular trend has been to create build tooling using a language better suited to handling the demanding workloads of those tools.

A common choice is Rust - rspack, turbopack. Another common choice is Go - esbuild.

Both languages offer fantastic features to enable building high performance and aggressively multi-threaded bundlers.

An interesting case study for writing a multi-threaded bundler in JavaScript is the Parcel bundler.

Parcel brought to the bundling world many novel features; like incremental bundling, a cache system, a plugin system that is both ergonomic and sensible, and an opinionated zero-configuration (or minimal-configuration) build environment.

This allowed for drastically simplified development environments that had a reduced burden of dependency maintenance.

Which Allows This:
{
  "devDependencies": {
    "copy-webpack-plugin": "*",
    "css-loader": "*",
    "html-webpack-plugin": "*",
    "mini-css-extract-plugin": "*",
    "sass": "*",
    "sass-loader": "*",
    "style-loader": "*",
    "ts-loader": "*",
    "typescript": "*",
    "webpack": "*",
    "webpack-bundle-analyzer": "*",
    "webpack-cli": "*",
    "webpack-dev-server": "*",
    "worker-loader": "*"
  }
}
To Become This:
{
  "devDependencies": {
    "parcel": "*"
  }
}

Parcel's plugin system allowed for extensibility where required, however by default Parcel ships everything you need to build a modern web application.

In addition - Parcel leverages Node.js worker threads, scheduling jobs across threads, to improve bundling performance.

This series will embark on a journey to create a simple bundler, written in Rust, that is heavily inspired by the philosophy that Parcel embodies.

Disclaimer:

Throughout the series, I will analyze and compare Parcel with our creation - this isn't intended as a criticism of Parcel or aimed at insinuating that my spin off is better than Parcel.

There is simply no way that I, alone, could create something as comprehensive as Parcel. Parcel is a project with many brilliant and dedicated contributors and ultimately a lot of what I will be creating here will be taken directly from Parcel - borrowing learnings from their work that I never would have come to myself.

Objective

We are going to write our own bundler in Rust, one that targets building applications which exist at "hyper" scale.

I define hyper scale as an application that has tens of thousands of source files - presumably operating with millions of users (aspirational I know).

I picked this target because this is a context where conventional bundler tooling tends to struggle and the biggest gains from threading are to be had.

I would like to preserve the plugin model offered by Parcel as best as possible as I find it to be extremely ergonomic, supporting both dynamically loaded plugins written in Rust and plugins written in JavaScript.

This first article will center around the creation of the first phase in the bundling process - transformation.

Bundler Phases At A High Level

Parcel's bundling pipeline follows roughly the following flow:



You can also view a more descriptive flow diagram on Parcel's website

The Transformation Pipeline

The transformation pipeline is a simple recursive loop (in behavior, not implementation) that traverses over the source files of a project; where imports are extracted and used to trigger the next loop.

The loop exits when there are no more imports to resolve and all the files have been processed.

The transformation pipeline is broken down into 2 major steps

To start off, an "entry" file is passed into the pipeline, the import declarations are extracted resulting in the pipeline discovering new files to transform - continuing the loop.

The new found files have their import declarations extracted, again continuing the loop, stopping once all the imports in a project have been extracted.

During transformation, the file is read, potentially converted from TypeScript to JavaScript then stored in memory within a "store".

Parcel actually writes these transform results to disk but for the sake of simplicity, we will keep them in memory unless it becomes a problem.

Resolution

The first step in the transformation loop is dependency resolution. To understand resolution we must first understand JavaScript imports.

Simply, JavaScript imports adhere to the following syntax:

import something from "specifier";
import "specifier"

The string used as the target of an import is known as the "specifier".

The specifier could be a relative path, absolute path, or an alias with rules described by the runtime or the bundler.

import { fromRelative } from "./relative"
import { fromAbsolute } from "/probably/dont/do/this"

// Rules specific to Node.js that obtains an import from 
// the node_modules folder. This is known as "node module resolution"
import { fromNodeModules } from "npm:lodash/get"

// Rules specific to Node.js that obtains an import from 
// the internal Node.js standard library
import { exists } from "node:fs"

// Special custom rules the developer can specify in the bundler
// to allow path aliases. These are arbitrary, I am using "~root/*"
// but it could be anything from "asd" to "&%$hello/*" 
import { fromAlias } from "~root/some-path"

A bundler written in JavaScript can inherit some of these resolution properties, however a bundler written in Rust must reimplement them.

Thankfully, Parcel has a Rust crate that implements the Node.js resolution algorithm - which we will use as is.

To resolve an import, the bundler needs two pieces of information:

The resulting output of an import resolution is an absolute path to the target file of the import statement.

Example

// Imported by: "/absolute/path/to/index.js"
import "./a.js" // Will resolve to: "/absolute/path/to/index.js"

Transforming the file contents and extracting new specifiers

After the specifier has been resolved to an absolute path of the dependency, the absolute path is passed to the transformer.

The transformer will read the file contents from the disk then, using a JavaScript parser like SWC, will convert the source code into an AST.

Source code:
import foo from 'foo'
Resulting AST:
{
  "type": "Module",
  "body": [
    {
      "type": "ImportDeclaration",
      "specifiers": [
        {
          "type": "ImportDefaultSpecifier",
          "local": {
            "type": "Identifier",
            "value": "foo",
            "optional": false
          }
        }
      ],
      "source": {
        "type": "StringLiteral",
        "value": "foo",
        "raw": "\"foo\""
      }
    }
  ]
}

We can then walk the AST, identify the imports, extract the specifiers and give them to the next iteration of the loop.

Once again, Parcel has a Rust crate for its transformation step that we will use in our bundler. We will actually reimplement this later on in the series, but it will be helpful in getting us off the ground.

Example

A file with the contents:

// Current file path: "/absolute/path/to/index.js"
import "./a.js"
import "./b.js"
import "node:fs"
import "npm:lodash"

Will see the transformer return following information to be used in the next iteration of the transformation pipeline loop:

{
    "from_path": "/absolute/path/to/index.js",
    "specifiers": [
        "./a.js",
        "./b.js",
        "node:fs",
        "npm:lodash"
    ]
}

The Benchmark

So let's get started.

The first thing we need is something to bundle. For this I have taken the familiar three-js benchmark used by esbuild, Parcel and other bundlers.

Three JS is a suitable benchmark because it has a large dependency graph, non trivial transformations, is written using valid module syntax and has no external dependencies.

To scale the benchmark up; we take the three-js source code and copy it 250 times. This results in a project that contain approximately 90,000 source files.

We then need to generate an entry point that imports each copy and re-exports it to avoid the bundler optimizing away unused imports.

import * as copy_1 from './copy_1/Three.js';
import * as copy_2 from './copy_2/Three.js';
import * as copy_3 from './copy_3/Three.js';
// ... to 250

// To avoid tree shaking:
window.copy_1 = copy_1;
window.copy_2 = copy_2;
window.copy_3 = copy_3;
// ... to 250

Test Hardware

I will be using my personal desktop computer that has the following hardware:

Baseline

To obtain a baseline for performance, I compiled the benchmark project using Parcel and recorded the resource usage with procmon.

The number of worker threads Parcel spawns is controlled with the PARCEL_WORKERS environment variable, by default it has a maximum value of 4.

Node.js has a quirk where the process has an explicit heap memory ceiling which applies to the main process as well as to worker threads. By default node grants something like 1gb per thread. For the build to pass I had to increase Node.js's worker memory ceiling with the node --max-old-space-size=xxxx command line argument to 6gb per thread.

Unfortunately, the project size was too great and Parcel was unable to compile the entire project, however I was able to complete the transformation pipeline and profile that successfully - the build fails during the later optimization stage.

Parcel was able to compile a project with 50 copies of three-js, I have yet to determine the point at which it becomes too large

env \
    PARCEL_WORKERS=16 \
    PARCEL_EXIT_TRANSFORM=true \
    PM_MEM_UNITS=mb \
    PM_TIME_UNITS=s \
    PM_TRACK=cpu,memory \
    PM_REPORT=parcel.csv \
        procmon \
            node \
                --max-old-space-size=6000 \
                ./node_modules/parcel/lib/bin.js build ./entry/entry.js \
                    --no-cache --dist-dir ./dist --cache-dir .cache

Results

Note: Keep in mind this is not a full build, this is just the transformation phase of the build

Running the build we see the following results:

The transformation took 60 seconds where the CPU usage peaked at about 11 cores and averaged 5 cores. The process consumed 7gb of RAM.

Let's inspect the relationship between the number of threads against the transformation time.

To do this I am using the following bash:

for THREADS in $(seq 0 15); do
    rm -rf dist .cache
    env \
        PARCEL_WORKERS=$THREADS \
        PARCEL_EXIT_TRANSFORM=true \
        PM_MEM_UNITS=mb \
        PM_TIME_UNITS=s \
        PM_TRACK=cpu,memory \
        PM_REPORT=parcel_$THREADS.csv \
            procmon \
                node \
                    --max-old-space-size=5000 \
                    ./node_modules/parcel/lib/bin.js build ./entry/entry.js \
                        --no-cache --dist-dir ./dist --cache-dir .cache
done

When the PARCEL_WORKER variable is set to 0, Parcel will run the compilation on the main thread. So that means main thread + n workers = total threads.

What we can observe is that we stop seeing improvements beyond 4 threads however there is a continued escalation of maximum CPU usage, while average CPU usage remains the same.

Why does Parcel behave like this?

Parcel.js is largely written in JavaScript and runs in a Node.js environment. While hot paths within Parcel are written as native Rust modules - the orchestration logic and data structures are held in JavaScript.

In most languages, state can be shared between threads where race conditions can be mitigated through the use of mutexes or other such forms of synchronization.

Node.js, by contrast, completely isolates each thread - making shared mutable access impossible (or impractical).

Instead, multi-threaded Node.js applications rely on the postMessage API to send messages between threads - where actions can be triggered independently of each other thereby eliminating the possibility of race conditions.

Data sent via postMessage is cloned using a lightweight deep clone implementation present in the structuredClone() function.

Node.js does not allow complex types like classes, functions or object containing functions to be sent across threads, however basic data types are permitted:

// Non exhaustive but you get the point
type BasicType = string | number | bool
type BasicArray = Array<BasicType | BasicArray | BasicObject>
type BasicObject = Record<string | number, BasicType | BasicArray | BasicObject>

Example

An example of how communication across threads occurs.

// This is analogous to starting a sub process with a separate entry point
const worker = new Worker('./worker.js')

// Listen to messages from the worker
worker.addEventListener('message', (data) => {
    console.log(data)
})

// Send a message to the worker
worker.postMessage('Hello from parent')

There is one type that Node.js allows sending across workers which does facilitate shared mutable state - the SharedArrayBuffer.

A SharedArrayBuffer is essentially a fixed sized array with 0|1 values within it.

Mutations to the SharedArrayBuffer are available instantly through threads as it's a shared mutable reference - however you need to cast normal data types (like string, etc) into the SharedArrayBuffer and read them out of the buffer.

This necessitates complex wrappers and, should you need more size than the SharedArrayBuffer contains, you must create a new SharedArrayBuffer instance and manually stitch together pages.

This makes using a SharedArrayBuffer highly impractical for most use cases.

Parcel stores the state of the bundle graph within pages of SharedArrayBuffer instances, however that is not used within the transformation pipeline.

During transformation, we are simply concerned with resolving and converting files - during this phase Parcel simply clones and sends many internal data structures between threads using postMessage.

The main thread acts as an orchestrator for this communication however in a project that is large enough, the main thread will be unable to cope with managing the messaging for more than a few CPUs at a time - forcing cores to sit idle waiting for work while the orchestrator is overwhelmed with the side quest of copying/serializing stuff.

That said, what the Parcel team have accomplished within the context of Node.js is truly remarkable. Parcel is a case study that demonstrates just how far you can push Node.js in terms of performance and, in the arena of bundlers written in JavaScript, Parcel outperforms every single one!

Side note: Parcel currently has an experimental implementation of a shared structured memory utility currently in RC phase known as ParcelDB. The Parcel team has the ambition to migrate Parcel to Rust in the future, however this must be an incremental process. ParcelDB aims to facilitate this migration while simultaneously improving the performance of the existing bundler. It's super cool.

Bundlers and other similar build tools must efficiently scale vertically on client machines to make maximal use of the available hardware and deliver an outstanding experience and it appears that, at the time of writing, Node.js presents a poor substrate for high performance multi-threaded applications.

So let's try figure out how to do it in Rust!

Let's Build A Bundler (Transformer)

TypeScript

Reviewing the circuit for the transformation pipeline, it's possible to create a simple, single threaded implementation in TypeScript that serves as a familiar environment to describe the flow.

import * as fs from 'node:fs'
import { resolve } from '@parcel/node-resolver-core'
import { transform } from '@parcel/transformer-js'

// Parse CLI arguments
let config = new Config();

// A queue that contains the files to resolve
let todo: Array<{ from_path: string, specifier: string }> = []

// A store that maps the absolute file path to the contents of an asset
let assets = new Map<string, string>()

// A set with the absolute file path of a parent concatenated with the
// absolute file path of its dependency
let asset_graph = new Set<string>()

// Add entry asset to queue
todo.unshift({ from_path: process.cwd(), specifier: config.entry_point })

// Loop until the queue is empty
while (todo.length !== 0) {
    // Take an item from the front of the queue
    const { from_path, specifier } = todo.pop()

    // Find the absolute path of the dependency
    let absolute_path = resolve(from_path, specifier)

    // Add that dependency as a child of the parent 
    asset_graph.set(`${from_path}->${absolute_path}`)

    // Continue if the child has already been transformed
    if (assets.has(absolute_path)) {
        continue
    }
    
    // Read the contents of the dependency
    let contents = fs.readFileSync(absolute_path, 'utf8')

    // Transform the contents and extract any import specifiers
    let result = transform(contents, absolute_path)

    // Add the specifiers to the back of the queue
    for (const specifier in result.specifiers) {
        todo.unshift({ from_path: absolute_path, specifier })
    }

    // Add the contents of the new asset to the asset store
    assets.set(absolute_path, result.content_transformed)
}

// .....
// Begin the next phase

Now in Rust

This is a good starting point and is not that different when ported to Rust (link to source):

use std::fs;
// ...

fn main() {
    // Queue of files to transform
    let mut queue = VecDeque::<(PathBuf, String)>::new();
    // Map of path -> asset contents
    let mut assets = HashMap::<PathBuf, String>::new();
    // Graph as tuples rather than strings
    let mut asset_graph = HashSet::<(PathBuf, PathBuf)>::new();

    // Add entry asset to queue
    queue.push_back((config.project_root.clone(), config.entry_point.clone().to_str().unwrap().to_string()));

    // Loop until the queue is empty
    while let Some((from_path, specifier)) = queue.pop_front() {
        // Resolve and ignore errors
        let absolute_path = resolve(&from_path, &specifier).unwrap();

        // Add to graph
        asset_graph.insert((from_path, absolute_path.clone()));

        // Skip if transformed previously
        if assets.contains_key(&absolute_path) {
            continue;
        }

        // Read file and ignore errors
        let contents = fs::read_to_string(&absolute_path).unwrap();

        // Transform file
        let (content_transformed, specifiers) = transform(&contents, &absolute_path);

        // Add new specifiers to queue
        for specifier in specifiers.iter() {
            queue.push_back((absolute_path.clone(), specifier.clone()));
        }

        // Add asset to the asset store
        assets.insert(absolute_path, content_transformed);
    }
}

// .....
// Begin the next phase

That's pretty sweet... how does the single threaded Rust transformer perform?

cargo build --release

env \
    PM_MEM_UNITS=mb \
    PM_TIME_UNITS=s \
    PM_TRACK=cpu,memory \
        procmon \
            ./target/release/single-threaded-transformer "../parcel/entry/entry.js"

Results

Looking at the chart, on a single thread, our home baked transformer circuit was able to transform the 90k source file three-js benchmark in 40 seconds and consumed 600mb of ram while doing it!

As expected, the CPU was locked at 1 core being used at 100% and the memory usage grew linearly as the contents of the files were loaded into memory.

To be fair, Parcel stores a lot more information about assets and the environment the bundler is running in - however this is a pretty impressive result.

Multithreading

Let's try to make this circuit multi-threaded.

The simplest approach is to spawn a number of OS threads and protect the shared data structures behind mutexes.

In Rust this is done by wrapping them in an Arc<Mutex<T>> like this:

let queue = Arc::new(Mutex::new(VecDeque::<(PathBuf, String)>::new()));
let assets = Arc::new(Mutex::new(HashMap::<PathBuf, String>::new()));
let asset_graph = Arc::new(Mutex::new(HashSet::<(PathBuf, PathBuf)>::new()));

An Arc is a "smart pointer", which you can think about as a clone-able reference to the value contained within. In effect an Arc<String> can be thought of as a *String.

So here we go (link to source):

fn main() {
    // Parse CLI arguments
    let config = Config::new();
    println!("Threads: {}", config.threads);

    // Queue of files to transform
    let queue = Arc::new(Mutex::new(VecDeque::<(PathBuf, String)>::new()));
    // Channel to notify parked threads when new work comes in
    let (notify_queue_push, on_queue_push) = crossbeam_channel::unbounded::<()>();
    // Counter of the tasks in flight
    let in_flight = Arc::new(AtomicUsize::new(0));

    // Map of path -> asset contents
    let assets = Arc::new(Mutex::new(HashMap::<PathBuf, Option<String>>::new()));
    // Graph as tuples rather than strings
    let asset_graph = Arc::new(Mutex::new(HashSet::<(PathBuf, PathBuf)>::new()));

    // Add the entry asset to queue
    in_flight.fetch_add(1, Ordering::Relaxed);
    queue.lock().unwrap().push_back((
        config.project_root.clone(),
        config.entry_point.clone().to_str().unwrap().to_string()
    ));

    // Hold the threads so we know when they all exit
    // In JavaScript terms, you can sort of think of this as an "Array<Promise<void>>"
    let mut thread_handles = Vec::<JoinHandle<()>>::new();

    // Spawn a bunch of threads, this loop will exit immediately and the main thread will block
    // at the end where we have the "handle.join()" code
    for _ in 0..config.threads {
        // Clone all of the Arcs - this clones the pointer, not the value
        // Rust will move the cloned pointers into the thread
        // You can think of this as a sort of "dependency injection" for each thread
        let queue = queue.clone();
        let assets = assets.clone();
        let asset_graph = asset_graph.clone();
        let in_flight = in_flight.clone();
        let notify_queue_push = notify_queue_push.clone();
        let on_queue_push = on_queue_push.clone();

        // In JavaScript terms, you can think of "thread::spawn()" as "setTimeout(cb)" if it returned a Promise
        thread_handles.push(thread::spawn(move || {
            // Keep looping while there are active jobs
            while in_flight.load(Ordering::Relaxed) > 0 {
                let Some((from_path, specifier)) = queue.lock().unwrap().pop_front() else {
                    // If there are no items in the queue wait to get a message saying there are
                    on_queue_push.recv().unwrap();
                    continue;
                };

                // Resolve and ignore errors
                let absolute_path = resolve(&from_path, &specifier).unwrap();

                // Add to graph
                asset_graph.lock().unwrap().insert((from_path, absolute_path.clone()));

                // In the case where multiple different files import the same target file, it's  
                // possible that the same file will be resolved from different threads. To prevent  
                // that file from being transformed multiple times, we hold the lock to the
                // the shared data store, check if the file has already been transformed and
                // if it hasn't, we insert a placeholder preventing others from trying.
                let mut assets_unlocked = assets.lock().unwrap();

                // Skip if transformed previously
                if assets_unlocked.contains_key(&absolute_path) {
                    // This job is done
                    in_flight.fetch_sub(1, Ordering::Relaxed);
                    continue;
                }

                // Insert a placeholder so other threads will skip it when checking if it exists
                assets_unlocked.insert(absolute_path.clone(), None);
                
                // Unlock the mutex manually
                drop(assets_unlocked);

                // Read file and ignore errors
                let contents = fs::read_to_string(&absolute_path).unwrap();

                // Transform file
                let (content_transformed, specifiers) = transform(&contents, &absolute_path);

                for specifier in specifiers.iter() {
                    // Add new specifiers to queue
                    in_flight.fetch_add(1, Ordering::Relaxed);
                    queue.lock().unwrap().push_back((absolute_path.clone(), specifier.clone()));
                    // Tell other parked threads that new work is available 
                    notify_queue_push.send(()).unwrap();
                }

                // Add asset to the asset store
                assets.lock().unwrap().insert(absolute_path, Some(content_transformed));

                // This job is done
                in_flight.fetch_sub(1, Ordering::Relaxed);
            }
        }));
    }

    // Wait for all the threads to finish
    // Kinda like "Promise.all()"
    for handle in thread_handles {
        handle.join().unwrap();
    }

    println!("Assets: {}", assets.lock().unwrap().len());
}

Results

Running this circuit against the 250x three-js project using 16 threads:

🤯🤯🤯🤯🤯🤯🤯🤯🤯🤯 3 seconds!?

Doing some napkin math, given the single threaded run took 40 seconds; 40 seconds divided by 16 cores is 2.5 seconds - so a transformation time of 3 seconds does make sense and indicates there is the potential to improve this further at some point.

In reality you never see gains that match dividing the time a single threaded task took by the number of threads, but we did get pretty close!

Another interesting thing to note is that the library I am using to track CPU usage started to lose it with the higher CPU utilization.

Let's have a look at the build time vs thread count looks like:

cargo build --release
for THREADS in $(seq 1 16); do
    env \
        HS_THREADS=$THREADS \
        PM_MEM_UNITS=mb \
        PM_TIME_UNITS=s \
        PM_TRACK=cpu,memory \
        PM_POLL_INTERVAL=250 \
        PM_REPORT=hypersonic_$THREADS.csv \
            procmon \
                ./target/release/multi-threaded-transformer "./parcel/entry/entry.js"
done

The blue line tracks the actual build time.

The green line tracks the "theoretical perfect" build time, calculated as build time of 1 thread / number of threads

Red and yellow track average and maximum CPU usage.

We can see that we are able to track very closely to the theoretical perfect time and it's likely not worth optimizing further to achieve a better time.

A benefit of this approach is that a time of 6 seconds is still very acceptable, limiting the thread count to 7 leaves us with 9 threads to continue working on other tasks while the build runs - like checking emails or watching YouTube videos.

Comparison to Parcel

While this transformer circuit attempts to mimic Parcel's transformation circuit - it does not completely encapsulate the work that Parcel is doing.

For example, Parcel is loading plugins dynamically, it's checking assets to determine which plugins to use on them, detecting configuration and supplying it to plugins.

It also support many more language features, automatically detecting if the file uses TypeScript or JSX. It supports Node's require() syntax as well as JavaScript's import syntax.

It's also generating a much more complex graph of the asset relationships to use for advanced optimizations like dead code elimination.

While we do see a 20x improvement in the time it takes for this circuit to run - this isn't indicative of the real world expectations of how a bundler like Parcel would perform if migrated to Rust.

That said, it does make me excited to see how the Parcel project performs once migrated to Rust - there certainly seems a lot to be gained in the way of performance.

Next Article

In the next article we will dive into bundling. We will walk through analyzing the asset graph to determine when to combine source files and generate a "bundle graph".

FIN 🌴