Broken Machine Windows - Debugging a custom TensorFlow.js DCGAN model

| Philipp Gross

A story of debugging and fixing a DCGAN model, trained with PyTorch and deployed with TensorFlow.js, that gives inconsistent results across different browsers.

During our collaboration with machine learning researcher Claartje Barkhof on Machine Windows, we trained a GAN on a custom dataset of colored shapes following the PyTorch DCGAN tutorial and converted the generator network to a TensorFlow.js model for deployment in the browser.

As the generator network is compute-intensive, we didn't execute it on the main thread but ran it in a service worker instead.

For example, sampling 16 random points from the 100-dimensional input space and passing them through the generator gives 16 images:

[Figure: 4x4 grid of generated images using the WebGL backend]
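In code this roughly looks as follows (a sketch: the model URL and the renderTileGrid helper are placeholders, and the latent batch uses the channelsFirst shape [16, 100, 1, 1] expected by the converted PyTorch generator):

import * as tf from '@tensorflow/tfjs';

// Load the converted generator model (placeholder URL).
const model = await tf.loadLayersModel('/models/generator/model.json');

// Sample 16 random points from the 100-dimensional input space.
const z = tf.randomNormal([16, 100, 1, 1]);

// Run the generator; the output batch has shape [16, 3, 64, 64].
const images = model.predict(z);

// Hypothetical helper that draws the batch as a 4x4 tile grid.
renderTileGrid(images);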

In this example I used Chrome as the browser, but when running the same experiment in Safari on a different machine, things turned messy:

[Figure: 4x4 grid of generated images using the CPU backend]

What changed? Consulting the documentation reveals that TensorFlow.js offloads the computation to a global backend that is determined by the environment. Here, Chrome uses the WebGL backend, but for some reason that didn't work in Safari and was silently replaced with the slower CPU backend.
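You can check which backend ended up being active; a minimal sketch:

// Wait until TensorFlow.js has initialized a backend, then inspect it.
await tf.ready();
console.log(tf.getBackend()); // 'webgl' in Chrome, 'cpu' in the Safari case above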

This was disturbing. I can't be the first one to stumble over this.

Since TensorFlow.js is widely used and I basically just used standard layers, I must have been doing something wrong. By dabbling with the backend configuration, I was able to recreate the "bug" in Chrome by just executing

await tf.setBackend('cpu');

Ok, at least we can rule out hardware-specific quirks. Something must be going on with the CPU backend. Probably I am one of the few noobs trying to run a GAN on the CPU. Of course I tried hard and begged TensorFlow to just accept the WebGL backend, but gave up. Much later I learned that WebGL is currently just not supported in Safari when running in a service worker, only on the main thread. Premature optimization strikes again. Yikes.
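In hindsight, a more defensive setup would request WebGL explicitly and fall back to the CPU backend when that fails, since tf.setBackend resolves to false if the requested backend cannot be initialized. A sketch:

// Prefer WebGL, fall back to the CPU backend if it is not available.
const ok = await tf.setBackend('webgl');
if (!ok) {
  console.warn('WebGL backend unavailable, falling back to CPU');
  await tf.setBackend('cpu');
}
console.log('Using backend:', tf.getBackend());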

Anyway, why do the results look so different when using the CPU backend?

Let's dismantle the neural network and check where things start to fall apart. After all, the network is just a sequence of layers:

| Layer | Type | Output Shape |
| --- | --- | --- |
| 26 | Conv2DTranspose | (1, 512, 4, 4) |
| 27 | BatchNormalization | (1, 512, 4, 4) |
| 28 | Activation | (1, 512, 4, 4) |
| 29 | Conv2DTranspose | (1, 256, 10, 10) |
| 29_crop | Cropping2D | (1, 256, 8, 8) |
| 30 | BatchNormalization | (1, 256, 8, 8) |
| 31 | Activation | (1, 256, 8, 8) |
| 32 | Conv2DTranspose | (1, 128, 18, 18) |
| 32_crop | Cropping2D | (1, 128, 16, 16) |
| 33 | BatchNormalization | (1, 128, 16, 16) |
| 34 | Activation | (1, 128, 16, 16) |
| 35 | Conv2DTranspose | (1, 64, 34, 34) |
| 35_crop | Cropping2D | (1, 64, 32, 32) |
| 36 | BatchNormalization | (1, 64, 32, 32) |
| 37 | Activation | (1, 64, 32, 32) |
| 38 | Conv2DTranspose | (1, 3, 66, 66) |
| 38_crop | Cropping2D | (1, 3, 64, 64) |
| image_batch | Activation | (1, 3, 64, 64) |
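The table above can be reproduced by iterating over the loaded model's layers; a quick sketch (assuming the generator has been loaded as model):

// Print each layer's name, type, and output shape.
model.layers.forEach(layer => {
  console.log(layer.name, layer.getClassName(), layer.outputShape);
});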

So by looping through the layers and visualizing the intermediate outputs, we can hope to find the culprit.

We start with the 100-dimensional input vector, represented as a grayscale image of shape [1, 100, 1, 1]:

const z = tf.randomNormal([1, 100, 1, 1]);

[Figure: the input vector rendered as a grayscale image]

Now we pass it manually through the layers:

let temp = z;
// omit the input layer (index 0)
for (let i = 1; i < model.getConfig().layers.length; i++) {
  const layer = model.getLayer(null, i);
  temp = layer.apply(temp);
  renderOutput(temp);
}
const result = temp;
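The renderOutput helper isn't shown above; a hypothetical sketch that averages the channels of a channelsFirst activation and draws the result as a normalized grayscale image could look like this:

// Hypothetical helper: visualize an intermediate activation [1, C, H, W]
// as a grayscale image by averaging over the channel axis.
async function renderOutput(tensor) {
  const gray = tf.tidy(() => {
    const [, , h, w] = tensor.shape;
    const mean = tensor.mean(1).reshape([h, w]);
    // normalize to [0, 1] for display
    const min = mean.min();
    const max = mean.max();
    return mean.sub(min).div(max.sub(min).add(1e-8));
  });
  const canvas = document.createElement('canvas');
  document.body.appendChild(canvas);
  await tf.browser.toPixels(gray, canvas);
  gray.dispose();
}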

The following shows the intermediate outputs for both the WebGL and CPU backends:

| Layer | WebGL | CPU |
| --- | --- | --- |
| 26 | [webgl_layer_26] | [cpu_layer_26] |
| 27 | [webgl_layer_27] | [cpu_layer_27] |
| 28 | [webgl_layer_28] | [cpu_layer_28] |
| 29 | [webgl_layer_29] | [cpu_layer_29] |
| 29_crop | [webgl_layer_29_crop] | [cpu_layer_29_crop] |
| 30 | [webgl_layer_30] | [cpu_layer_30] |
| 31 | [webgl_layer_31] | [cpu_layer_31] |
| 32 | [webgl_layer_32] | [cpu_layer_32] |
| 32_crop | [webgl_layer_32_crop] | [cpu_layer_32_crop] |
| 33 | [webgl_layer_33] | [cpu_layer_33] |
| 34 | [webgl_layer_34] | [cpu_layer_34] |
| 35 | [webgl_layer_35] | [cpu_layer_35] |
| 35_crop | [webgl_layer_35_crop] | [cpu_layer_35_crop] |
| 36 | [webgl_layer_36] | [cpu_layer_36] |
| 37 | [webgl_layer_37] | [cpu_layer_37] |
| 38 | [webgl_layer_38] | [cpu_layer_38] |
| 38_crop | [webgl_layer_38_crop] | [cpu_layer_38_crop] |
| image_batch | [webgl_layer_image_batch] | [cpu_layer_image_batch] |

Clearly, layer 30 is where we see the first difference. This is a batch normalization layer, and a bit of internet searching shows that we are on the right track: there is the issue "Batchnormalization is incorrect with cpu backend and channelsFirst data" (#1106). But it was automatically closed due to lack of interest (ha). Apparently, the majority of TensorFlow.js users either use the WebGL backend or work with channelsLast data, which is TensorFlow's default configuration. But we converted our model from PyTorch, where channelsFirst is the standard.
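For reference, at inference time a batch normalization layer normalizes each channel c with its stored moving statistics and learned scale and offset:

y_c = \gamma_c \cdot \frac{x_c - \mu_c}{\sqrt{\sigma_c^2 + \varepsilon}} + \beta_c

Here \gamma_c, \beta_c, \mu_c and \sigma_c^2 correspond to gamma, beta, movingMean and movingVar in the code below. If the backend assumes channelsLast while the data is actually channelsFirst, these per-channel statistics are presumably applied along the wrong axis, which fits the garbled outputs we see from layer 30 onwards.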

Since we didn't have the time to change the model and retrain it, we decided to fix the channel data format on the fly and do the batch normalization manually:

let temp = z;
// omit the input layer (index 0)
for (let i = 1; i < model.getConfig().layers.length; i++) {
  const layer = model.getLayer(null, i);
  if (layer.getClassName() === 'BatchNormalization') {
    // BatchNormalization weights are stored as [gamma, beta, movingMean, movingVariance]
    const [gamma, beta, movingMean, movingVar] = layer.weights.map(w =>
      w.read(),
    );
    // to channels last
    const x = temp.transpose([0, 2, 3, 1]);
    // run batchNorm4d on channels last data format
    const y = tf.batchNorm4d(
      x,
      movingMean,
      movingVar,
      beta,
      gamma,
      layer.epsilon,
    );
    // and convert back to channels first
    temp = y.transpose([0, 3, 1, 2]);
  } else {
    temp = layer.apply(temp);
  }
  renderOutput(temp);
}
const result = temp;

And it worked!
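One caveat of this manual forward pass: every layer.apply and transpose allocates a new tensor, so for longer sessions it pays to free the intermediates. A hypothetical variant with explicit cleanup (applyFixedLayer stands for the loop body above, including the BatchNormalization workaround):

// Dispose each intermediate tensor once the next layer's output exists.
let temp = z;
for (let i = 1; i < model.getConfig().layers.length; i++) {
  const layer = model.getLayer(null, i);
  const next = applyFixedLayer(layer, temp); // hypothetical wrapper around the fix above
  if (temp !== z) temp.dispose();
  temp = next;
}
const result = temp;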

While this allows us to use the image generator network with the CPU backend, it turned out to be way too slow for our application. In the end we decided to run it on the main thread, where the WebGL backend is available on all platforms, and just live with the consequences of having a blocked UI once in a while.