pix2pix: A General Framework for Image Translation with cGANs

Image-to-image translation is a classic class of problems: given an input image, generate a corresponding output image. Examples include turning a semantic segmentation map into a photorealistic scene, colorizing a line drawing, and converting a daytime photo into a nighttime one. Such tasks have traditionally required a custom network architecture and loss function for each scenario. pix2pix instead proposes a conditional GAN (cGAN) as a general-purpose framework: the same architecture and objective, trained on a different dataset, handle a different task.

Why Not Use L1/L2 Loss Directly

The most direct supervisory signal is the pixel-level L1 or L2 distance, which pushes the generated image toward the target at every pixel. The fundamental problem is that these losses are minimized by a compromise over all plausible outputs: when the model is uncertain about a pixel, the L2-optimal prediction is the mean of the plausible values (for L1, the median), and the resulting image is blurry. Put differently, L1/L2 capture low-frequency information well (the large-scale color distribution and overall contours), but fine textures (high-frequency information) are washed out.
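A minimal numeric sketch of this effect, assuming a single uncertain pixel whose plausible target intensities (the hypothetical `plausible_values` below) fall into a dark cluster and a bright cluster. It brute-forces the constant prediction that minimizes each loss; it is an illustration of the argument, not code from the paper.

```python
# Hypothetical plausible intensities (0-255) for one uncertain pixel:
# three dark outcomes and two bright ones.
plausible_values = [30, 40, 50, 210, 220]

def l2_loss(prediction, targets):
    """Sum of squared errors of one prediction against all plausible targets."""
    return sum((prediction - t) ** 2 for t in targets)

def l1_loss(prediction, targets):
    """Sum of absolute errors of one prediction against all plausible targets."""
    return sum(abs(prediction - t) for t in targets)

# Try every integer intensity and keep the one with the lowest loss.
candidates = range(256)
best_l2 = min(candidates, key=lambda v: l2_loss(v, plausible_values))
best_l1 = min(candidates, key=lambda v: l1_loss(v, plausible_values))

print(best_l2)  # 110 (the mean: a gray that belongs to neither cluster)
print(best_l1)  # 50  (the median)
```

The L2 optimum lands between the two clusters, which is the per-pixel mechanism behind blurry outputs. The L1 optimum is the median, which is one reason L1 tends to blur less than L2, although it still does not encourage sharp texture.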

The core insight of the paper is that a GAN effectively learns a loss function, and a structured one. Traditional L1/L2 is an unstructured loss: each output pixel is treated as conditionally independent of all other output pixels given the input image. The loss learned by a GAN's discriminator, by contrast, can penalize the joint configuration of pixels, which makes it better suited to image generation.

Modeling High-Frequency Information with PatchGAN

Since the low frequencies can be handled by L1, how do we deal with the high-frequency information? High-frequency information is inherently local—the contrast among pixels within a small region and the repeating patterns of texture. Judging whether an image is realistic is, to a large extent, a matter of looking at whether the local textures match the statistical properties of real images.

The paper's discriminator therefore does not make a single global judgment over the entire image. Instead, it classifies each N×N patch of the image as real or fake; it is run convolutionally across the image, and the responses for all patches are averaged to give the score for the whole image. This is the PatchGAN discriminator.
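The patch-and-average idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's network: the real discriminator is a convolutional network whose output map plays the role of the per-patch scores, whereas here the sliding window is explicit and `toy_score_patch` is a hypothetical stand-in for the learned classifier.

```python
def patch_scores(image, patch_size, score_patch, stride=1):
    """Score every patch_size x patch_size window of a 2D image (list of rows)."""
    height, width = len(image), len(image[0])
    scores = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            scores.append(score_patch(patch))
    return scores

def image_score(image, patch_size, score_patch, stride=1):
    """Average the per-patch scores into one score for the whole image."""
    scores = patch_scores(image, patch_size, score_patch, stride)
    return sum(scores) / len(scores)

def toy_score_patch(patch):
    """Hypothetical stand-in for a learned discriminator:
    1.0 ("real") if the patch has any contrast, 0.0 ("fake") if it is flat."""
    pixels = [p for row in patch for p in row]
    return 1.0 if max(pixels) != min(pixels) else 0.0

image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 9],
    [0, 0, 9, 5],
]
scores = patch_scores(image, 2, toy_score_patch)
overall = image_score(image, 2, toy_score_patch)
print(len(scores))  # 9 overlapping 2x2 patches in a 4x4 image
print(sum(scores))  # 4.0 patches contain texture
```

Because each score depends only on its own window, the same scorer applies to images of any size, and its cost grows with the patch size rather than with the image size.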

Experiments show that N can be much smaller than the full image and still give high-quality results, because the realism of local texture can be judged locally. This approach effectively models the image as a Markov random field: pixels separated by more than a patch diameter are assumed independent, so dependencies are modeled only within a patch. This assumption is common in texture and style modeling, so the PatchGAN loss can be understood as a form of texture loss. The paper combines L1 with PatchGAN: L1 enforces low-frequency structure, while PatchGAN enforces high-frequency texture.
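The combination of the two terms can be written as a single objective. The notation below is an assumption layered on this summary: x is the conditional input, y the target image, z the noise, G the generator, D the (PatchGAN) discriminator, and λ a weight that balances the two terms.

```latex
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big]
                         + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big]

G^{*} = \arg\min_{G} \max_{D} \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G)
```

Read this way, the adversarial term supplies the texture loss and the L1 term keeps the output anchored to the target's overall structure.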

The Problem of Noise Input and Stochasticity

In a standard GAN, the generator’s stochasticity comes from the noise vector. Adding Gaussian noise in a cGAN can, in theory, let the output cover greater diversity. But this paper finds that when the conditional input x is already sufficiently strong, the generator simply ignores the noise and fails to produce any substantial stochasticity.

The authors therefore introduce stochasticity only through dropout, applied to the generator at both training and test time, but the conclusion is similar: the outputs show very little diversity. This matters because a generator without any source of randomness is deterministic and can only fit a delta function rather than the true conditional distribution. The authors candidly note that capturing the full entropy of the conditional distribution remains a difficult problem and a direction for future work.
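A toy sketch of what "dropout at test time" means, assuming a one-layer stand-in for the generator (`toy_generator` is hypothetical, and the drop probability of 0.5 is an illustrative value). With dropout disabled the same input always gives the same output; with dropout left on, repeated calls on the same input differ.

```python
import random

def dropout(activations, drop_prob, rng, enabled=True):
    """Inverted dropout: zero each unit with probability drop_prob and
    rescale the survivors so the expected activation is unchanged."""
    if not enabled or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

def toy_generator(features, rng, use_dropout):
    """Hypothetical stand-in for a generator: dropout, then an average."""
    hidden = dropout(features, drop_prob=0.5, rng=rng, enabled=use_dropout)
    return sum(hidden) / len(hidden)

features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # fixed conditional input
rng = random.Random(0)

# Same input, 20 forward passes each way.
deterministic = [toy_generator(features, rng, use_dropout=False) for _ in range(20)]
stochastic = [toy_generator(features, rng, use_dropout=True) for _ in range(20)]

print(len(set(deterministic)))   # 1: a single output, i.e. a delta function
print(len(set(stochastic)) > 1)  # dropout makes repeated outputs vary
```

In this toy the variation is large because the "network" is a single averaging layer; the finding summarized above is that in the trained model the resulting variation is small.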

The Use of Batch Normalization

At inference time, batch normalization in the generator uses the statistics of the test batch itself rather than the running statistics accumulated during training. With a batch size of 1, this is equivalent to instance normalization: each image is normalized independently, an approach that has been shown to be effective in image generation tasks.
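The equivalence can be checked on a toy example. The sketch below assumes a single channel, stores each image as a flat list of that channel's pixels, and omits the learned scale and shift; the function names are illustrative.

```python
def normalize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation of `values`."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def batch_norm_with_batch_stats(batch):
    """Normalize with statistics pooled over every pixel of every image
    in the current batch (no running averages from training)."""
    pooled = [p for image in batch for p in image]
    flat = normalize(pooled)
    size = len(batch[0])
    return [flat[i * size:(i + 1) * size] for i in range(len(batch))]

def instance_norm(batch):
    """Normalize each image with its own statistics only."""
    return [normalize(image) for image in batch]

image_a = [0.0, 2.0, 4.0, 6.0]
image_b = [10.0, 10.0, 30.0, 30.0]

# Batch size 1: the pooled statistics are the image's own statistics.
same = batch_norm_with_batch_stats([image_a]) == instance_norm([image_a])
# Batch size 2: pooled statistics mix both images, so the results differ.
differ = batch_norm_with_batch_stats([image_a, image_b]) != instance_norm([image_a, image_b])

print(same)    # True
print(differ)  # True
```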

Experimental Results

The paper validates the framework on a wide range of image-translation tasks and datasets:

  • Semantic labels ↔ photo, trained on the Cityscapes dataset [12].
  • Architectural labels → photo, trained on CMP Facades [45].
  • Map ↔ aerial photo, trained on data scraped from Google Maps.
  • BW → color photos, trained on [51].
  • Edges → photo, trained on data from [65] and [60]; binary edges generated using the HED edge detector [58] plus postprocessing.
  • Sketch → photo: tests edges → photo models on human-drawn sketches from [19].
  • Day → night, trained on [33].
  • Thermal → color photos, trained on data from [27].
  • Photo with missing pixels → inpainted photo, trained on Paris StreetView from [14].