
Diverse and High-Fidelity Image Synthesis for Unsupervised Image-to-Image Translation with Limited Data

Advisor: 許永真 (Jane Yung-jen Hsu)


Abstract


Unsupervised Image-to-Image Translation (Unsupervised I2I) has emerged as a significant area of interest and has recently seen substantial advances, owing to its wide range of applications and its reduced need for data annotation. However, in scenarios with limited data, ensuring training stability and generating diverse, realistic images remains a difficult research problem. To address these challenges, we propose two simple, plug-and-play methods: the Masked AutoEncoder Generative Adversarial Network (MAE-GAN) and the Style Embedding Adaptive Normalization (SEAN) block. MAE-GAN, a pre-training method for Unsupervised I2I tasks, integrates the architectures and strengths of both the MAE and the GAN, and learns domain-specific style information during pre-training, leading to more stable training and improved image quality in downstream tasks. The SEAN block is a novel normalization block that leverages a large-scale pre-trained feature extractor and learns a separate style feature space for each domain at each layer of the model. It also allows a trade-off between diversity and fidelity, enabling the generation of either more diverse or more realistic images. Our methods achieve strong results on the challenging and less common COncrete DEfect BRidge IMage dataset (CODEBRIM). Moreover, trained on just 10% of the Animal Faces HQ dataset (AFHQ), they match the image quality of models trained on the full dataset while achieving greater image diversity, demonstrating their real-world applicability and potential.
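
The abstract describes MAE-GAN only at a high level: a pre-training objective that combines masked reconstruction with an adversarial term. The PyTorch sketch below is illustrative only, not the thesis's implementation; the `encoder`, `decoder`, and `discriminator` modules, the simplified pixel-space masking (the original MAE operates on patch tokens), and the 0.1 loss weight are all assumptions.

```python
# Hypothetical sketch of an MAE + GAN pre-training objective.
# Assumes user-supplied encoder/decoder (the masked autoencoder)
# and a discriminator; loss weighting is an assumption.
import torch
import torch.nn.functional as F

def mae_gan_losses(encoder, decoder, discriminator, images, mask_ratio=0.75):
    b, c, h, w = images.shape
    # Random patch mask at 16x16 granularity: 1 = visible, 0 = masked.
    grid = torch.rand(b, 1, h // 16, w // 16, device=images.device)
    mask = (grid > mask_ratio).float()
    mask = F.interpolate(mask, size=(h, w), mode="nearest")

    # Reconstruct the full image from the visible regions only.
    recon = decoder(encoder(images * mask))

    # MAE-style loss: penalize only the masked (unseen) regions.
    rec_loss = (F.l1_loss(recon, images, reduction="none") * (1 - mask)).mean()

    # Non-saturating logistic GAN losses on the reconstruction.
    g_adv = F.softplus(-discriminator(recon)).mean()
    d_loss = (F.softplus(-discriminator(images)) +
              F.softplus(discriminator(recon.detach()))).mean()

    g_loss = rec_loss + 0.1 * g_adv  # adversarial weight is an assumption
    return g_loss, d_loss
```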
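
Similarly, the SEAN block is described as learning a separate style space per domain at every layer and exposing a diversity/fidelity trade-off. The sketch below shows one plausible form of such a block; for brevity it substitutes a learned embedding table for the large-scale pre-trained feature extractor the abstract mentions, and the `noise_scale` mixing rule is an assumption.

```python
# Conceptual sketch of a per-domain style-modulated normalization block
# (names and the diversity/fidelity mechanism are assumptions, not the
# thesis's exact SEAN block).
import torch
import torch.nn as nn

class StyleEmbedNorm(nn.Module):
    def __init__(self, channels, num_domains, style_dim=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        # One learned style embedding per domain, per layer.
        self.styles = nn.Embedding(num_domains, style_dim)
        self.to_gamma = nn.Linear(style_dim, channels)
        self.to_beta = nn.Linear(style_dim, channels)

    def forward(self, x, domain, noise_scale=0.0):
        """noise_scale trades fidelity (0.0) for diversity (> 0)."""
        s = self.styles(domain)                    # (B, style_dim)
        if noise_scale > 0:
            s = s + noise_scale * torch.randn_like(s)
        gamma = self.to_gamma(s)[..., None, None]  # (B, C, 1, 1)
        beta = self.to_beta(s)[..., None, None]
        return (1 + gamma) * self.norm(x) + beta

# Usage: modulate generator features toward each sample's target domain.
block = StyleEmbedNorm(channels=256, num_domains=3)
feats = torch.randn(4, 256, 32, 32)
out = block(feats, domain=torch.tensor([1, 1, 2, 0]), noise_scale=0.5)
```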

