A Closer Look at Fourier Spectrum Discrepancies for
CNN-generated Images Detection

Keshigeyan Chandrasegaran / Ngoc‑Trung Tran / Ngai‑Man Cheung
Singapore University of Technology and Design (SUTD)
CVPR 2021 (Oral)

Abstract

CNN-based generative modelling has evolved to produce synthetic images indistinguishable from real images in the RGB pixel space. Recent works have observed that CNN-generated images share a systematic shortcoming in replicating high frequency Fourier spectrum decay attributes. Furthermore, these works have successfully exploited this systematic shortcoming to detect CNN-generated images, reporting up to 99% accuracy across multiple state-of-the-art GAN models.

In this work, we investigate the validity of assertions claiming that CNN-generated images are unable to achieve high frequency spectral decay consistency. We meticulously construct a counterexample space of high frequency spectral decay consistent CNN-generated images emerging from our handcrafted experiments using DCGAN, LSGAN, WGAN-GP and StarGAN, where we empirically show that this frequency discrepancy can be avoided by a minor architecture change in the last upsampling operation. We subsequently use images from this counterexample space to successfully bypass the recently proposed forensics detector, which relies on high frequency Fourier spectrum decay attributes for CNN-generated image detection.

Through this study, we show that high frequency Fourier spectrum decay discrepancies are not inherent characteristics of existing CNN-based generative models, contrary to the belief of some existing work, and that such features are not robust enough for synthetic image detection. Our results prompt a rethinking of the use of high frequency Fourier spectrum decay attributes for CNN-generated image detection.

Discussion

In this study, we investigated the validity of contemporary beliefs that CNN-based generative models are unable to reproduce high frequency decay attributes of real images. We employed a systematic study to design counterexamples that challenge these beliefs. With maximum frequency bounded by the spatial resolution, and Fourier discrepancies reported at the highest frequencies, we hypothesized that the last upsampling operation is mostly related to this shortcoming. With carefully designed experiments spanning multiple GAN architectures, loss functions, datasets and resolutions, we observe that high frequency spectral decay discrepancies can be avoided by replacing the zero insertion based scaling used by transpose convolutions with nearest or bilinear interpolation at the last step. Note that we do not claim that modifying the last feature map scaling method will always fix spectral decay discrepancies in every situation; rather, the goal of our study is to provide counterexamples to argue that high frequency spectral decay discrepancies are not inherent characteristics of CNN-generated images. Further, we easily bypass the recently proposed synthetic image detector that exploits this discrepancy information to detect CNN-generated images, indicating that such features are not robust for the purposes of synthetic image detection.
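
To illustrate why the choice of scaling method can matter, the following is a minimal 1D sketch, not the authors' code. It compares zero insertion (the scaling step inside a transpose convolution) with nearest-neighbour repetition on a smooth signal and measures how much spectral power lands in the upper half of the frequency range. The helper names, the test signal and the band split are all assumptions made for illustration; real generators apply learned kernels after scaling 2D feature maps, which this sketch omits.

```python
import numpy as np


def upsample_zero_insertion(x, factor=2):
    """Place each sample every `factor` positions and fill the gaps with zeros."""
    x = np.asarray(x, dtype=float)
    out = np.zeros(len(x) * factor, dtype=float)
    out[::factor] = x
    return out


def upsample_nearest(x, factor=2):
    """Repeat every sample `factor` times (nearest-neighbour scaling)."""
    return np.repeat(np.asarray(x, dtype=float), factor)


def high_band_energy_fraction(signal):
    """Fraction of spectral power in the upper half of the one-sided spectrum."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    half = len(power) // 2
    return power[half:].sum() / power.sum()


# Hypothetical smooth input: 3 cycles of a cosine over 64 samples.
n = 64
t = np.arange(n)
x = np.cos(2 * np.pi * 3 * t / n)

zero_up = upsample_zero_insertion(x)
near_up = upsample_nearest(x)

frac_zero = high_band_energy_fraction(zero_up)
frac_near = high_band_energy_fraction(near_up)

print(f"high-band power fraction, zero insertion: {frac_zero:.4f}")
print(f"high-band power fraction, nearest:        {frac_near:.4f}")
```

Under these assumptions, zero insertion leaves a mirror image of the low frequency peak near the top of the spectrum, while nearest-neighbour repetition strongly attenuates it. Linear interpolation also acts as a low-pass filter. Whether the learned convolution that follows the scaling step removes that mirror image in practice is what the experiments described above probe.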

In the Supplementary material, we provide more GAN models with no high frequency decay discrepancies. We also investigate whether such high frequency decay discrepancies are found in other types of computational image synthesis methods (synthesis using the Unity game engine). To conclude, through this work we hope to help image forensics research move in more plausible directions to combat CNN-synthesized visual disinformation.

Figure 1: The curves show the average azimuthal integration over the power spectrum. The top row / left column shows the evaluation on DCGAN, LSGAN and WGAN-GP. Note the discrepancies at the highest frequencies, the same as reported in recent works. Note also that these models use transpose convolutions. The bottom row / right column shows the evaluation after replacing the last feature map scaling operation with nearest and bilinear interpolation methods. Refer to Table 1 in the main paper for experiment codes. All evaluations are done using CelebA (128x128). We observe that spectrally consistent GANs are obtained when using nearest and bilinear interpolation methods for the last feature map scaling operation in these setups.
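
The azimuthal integration behind the curves in Figure 1 can be sketched roughly as follows. This is an illustrative NumPy version, not the authors' implementation: it assumes a single-channel (grayscale) image, bins by integer radius from the centre of the shifted spectrum, and applies no normalization or log scaling. The function name `azimuthal_average` and the random test image are hypothetical.

```python
import numpy as np


def azimuthal_average(image):
    """Average the 2D power spectrum over rings of equal integer radius."""
    image = np.asarray(image, dtype=float)
    # Shift the zero-frequency component to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    power = np.abs(spectrum) ** 2

    h, w = power.shape
    y, x = np.indices((h, w))
    # Integer radius of every frequency bin, measured from the DC component.
    radius = np.hypot(y - h // 2, x - w // 2).astype(int)

    ring_sum = np.bincount(radius.ravel(), weights=power.ravel())
    ring_count = np.bincount(radius.ravel())
    # Guard against empty rings, which can occur for unusual image shapes.
    return ring_sum / np.maximum(ring_count, 1)


# Hypothetical stand-in for a 128x128 image; real use would load actual samples.
rng = np.random.default_rng(0)
image = rng.random((128, 128))
profile = azimuthal_average(image)
print(profile.shape)
```

Averaging such profiles over many real and many generated images, then comparing the tails of the two curves, is one way to look for the high frequency decay discrepancy discussed here.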

GAN Samples

Figure 2: DCGAN samples for Baseline (Left), N.1.5 (Middle) and B.1.5 (Right) for CelebA. We observe that the visual quality is comparable when replacing the last transpose convolutions with nearest and bilinear methods. FID scores for Baseline (Left), N.1.5 (Middle) and B.1.5 (Right) setups are 88.6, 87.52 and 84.65 respectively. (Measured using 50k real and generated samples.)

Figure 3: LSGAN samples for Baseline (Left), N.1.5 (Middle) and B.1.5 (Right) for CelebA. We observe that the visual quality is comparable when replacing the last transpose convolutions with nearest and bilinear methods. FID scores for Baseline (Left), N.1.5 (Middle) and B.1.5 (Right) setups are 73.26, 70.69 and 78.66 respectively. (Measured using 50k real and generated samples.)

Figure 4: WGAN-GP samples for Baseline (Left), N.1.5 (Middle) and B.1.5 (Right) for CelebA. We observe that the visual quality is comparable when replacing the last transpose convolutions with nearest and bilinear methods. FID scores for Baseline (Left), N.1.5 (Middle) and B.1.5 (Right) setups are 60.6, 48.69 and 52.18 respectively. (Measured using 50k real and generated samples.)

Image-to-Image Translation Samples

Figure 5: Image Translation results using StarGAN for CelebA at 256 x 256 resolution. Original Image (leftmost), Baseline (column 2), N.1.5 (column 3), B.1.5 (rightmost) for attribute Blonde hair.

Citation

@InProceedings{Chandrasegaran_2021_CVPR,
author = {Chandrasegaran, Keshigeyan and Tran, Ngoc-Trung and Cheung, Ngai-Man},
title = {A Closer Look at Fourier Spectrum Discrepancies for CNN-Generated Images Detection},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021},
pages = {7200-7209}
}

Acknowledgements

This project was supported by SUTD project PIE-SGP-AI-2018-01. This research was also supported by the National Research Foundation Singapore under its AI Singapore Programme [Award Number: AISG-100E2018-005]. This work was also supported by ST Electronics and the National Research Foundation (NRF), Prime Minister’s Office, Singapore under Corporate Laboratory @ University Scheme (Programme Title: STEE Infosec - SUTD Corporate Laboratory).

We also gratefully acknowledge the support of NVIDIA AI Technology Center (NVAITC) for our research.
