Advancing Image Synthesis: Methods and Applications of Latent Diffusion Models

Authors: Soumika Chakraborty, Bikram Sarkar, Jeet Kumar Dash, Shatayu Bhowmick, Tuhin Akuli, Dr. Sangita Roy

DOI Link: https://doi.org/10.22214/ijraset.2024.65751

Abstract

Diffusion models (DMs) have revolutionized the field of generative modelling, delivering exceptional results in tasks such as image synthesis, inpainting, and super-resolution. Despite their success, the reliance on pixel-space processing in these models has imposed substantial computational challenges, requiring hundreds of GPU days for training and significant resources for inference. In this work, we introduce Latent Diffusion Models (LDMs), an innovative framework that addresses these limitations by operating within a perceptually compressed latent space derived from a pretrained autoencoder. This paradigm shift significantly reduces the computational complexity of both training and inference while maintaining the high fidelity and diversity of the generated outputs. LDMs leverage a two-stage approach: a pretrained autoencoder for efficient latent-space representation and a diffusion model trained directly within this space. By introducing cross-attention layers into the architecture, LDMs also support flexible conditioning on various modalities such as text descriptions, semantic maps, or low-resolution images. This versatility enables the model to perform a range of tasks, including text-to-image generation, class-conditional image synthesis, and high-resolution super-resolution. Our experiments demonstrate that LDMs achieve competitive or state-of-the-art performance across multiple benchmarks, including CelebA-HQ, ImageNet, and MS-COCO, while requiring significantly fewer computational resources than pixel-space diffusion models. For instance, LDMs reduce training time by up to 2.7× and inference memory requirements by 50%, all while improving sample quality. This work highlights the potential of latent-space generative models to democratize access to advanced generative AI technologies, making them feasible for researchers and practitioners with limited computational resources. At the same time, we discuss the ethical considerations of using such models, including their potential misuse for creating manipulated content. Latent Diffusion Models pave the way for efficient, scalable, and high-quality image synthesis, providing a robust foundation for future advancements in generative modelling.

Introduction

I. INTRODUCTION

The field of generative modelling has witnessed tremendous progress in recent years, with methods like Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and autoregressive models producing high-quality images. However, these methods face limitations: GANs often suffer from mode collapse, VAEs struggle with sample quality, and autoregressive models are computationally prohibitive for high-resolution images. Diffusion models overcome many of these challenges by modelling data distributions through iterative denoising processes. Despite their robustness and superior sample diversity, diffusion models operating in pixel space demand significant computational resources for both training and inference, often requiring hundreds of GPU days. This limits their accessibility and environmental sustainability.

We propose Latent Diffusion Models (LDMs) as a solution. By transitioning the generative process to a lower-dimensional latent space, we achieve a favourable balance between computational efficiency and output quality. Key contributions include:

A two-stage architecture combining autoencoders for latent compression and diffusion models for generative tasks.
Cross-attention mechanisms for multimodal conditioning, enabling diverse applications such as text-to-image synthesis.
Significant reductions in computational costs while maintaining or surpassing state-of-the-art performance in key benchmarks.

Figure 1. Boosting the upper bound on achievable quality with less aggressive down sampling.

II. FOUNDATIONS AND PROGRESS IN GENERATIVE MODELLING

A. Generative Modelling Approaches

Generative models can be broadly categorized into three classes:

GANs: Known for producing high-quality images but suffer from training instabilities and limited mode coverage.
VAEs: Efficient but often generate blurry samples due to their reliance on maximum-likelihood objectives.
Autoregressive Models: Strong in density estimation but computationally infeasible for large images due to their sequential nature.

B. Diffusion Models

Diffusion models have emerged as a powerful class of generative models due to their ability to learn complex data distributions through iterative denoising processes. These models simulate a Markov chain where noise is incrementally added to data in the forward process, and the reverse process, learned by the model, reconstructs the data from pure noise. Unlike GANs, which often suffer from mode collapse and training instability, diffusion models exhibit robust training dynamics and achieve excellent mode coverage. Their effectiveness is particularly evident in tasks like image synthesis, inpainting, and super-resolution, where they consistently set new benchmarks for quality. However, operating directly in pixel space comes with significant challenges. Processing high-dimensional image data, such as megapixel-level resolutions, results in substantial computational overhead for both training and inference. Training powerful diffusion models can require hundreds of GPU days, and their sequential nature further complicates efficient sampling, making them resource-intensive and less accessible. While advancements like cascaded and hierarchical diffusion models address some of these inefficiencies, they often require additional architectural complexity and computational resources. The computational intensity of pixel-space diffusion models not only limits their adoption but also raises concerns about their environmental impact. These challenges highlight the need for approaches like Latent Diffusion Models (LDMs), which retain the generative strengths of diffusion models while significantly reducing computational demands by operating in a compressed latent space.

C. Latent Space Models

Latent space models represent a crucial evolution in generative modeling, offering a method for efficiently compressing high-dimensional data while retaining the essential features necessary for generation. These models, such as VQ-VAE and VQGAN, rely on encoding input data into a discrete latent space where each image is represented by a set of learned latent variables. The primary benefit of working in latent space is the drastic reduction in computational complexity, as the latent representation typically has a much lower dimensionality than the raw pixel space. This compression allows models to operate more efficiently, requiring fewer parameters and less computational power to train. However, despite these advantages, earlier latent space models often faced trade-offs in terms of quality, as aggressive compression could lead to the loss of fine-grained details, resulting in blurry or less sharp outputs. Additionally, many of these models utilized discrete latent spaces, which restricted the model's ability to capture continuous variations in the data, limiting their flexibility. In contrast, continuous latent space models, which combine the benefits of both generative compression and fine-grained detail preservation, have the potential to overcome these challenges. Recent advancements in latent diffusion, particularly with Latent Diffusion Models (LDMs), take advantage of these continuous latent representations to strike a balance between computational efficiency and high-quality generation. By operating in a latent space that still retains rich semantic information but reduces the dimensionality of the data, LDMs offer a more scalable solution for high-resolution image synthesis while mitigating the quality loss associated with aggressive compression. This approach not only improves efficiency but also enhances the fidelity of the generated samples, making latent space models a promising direction for future generative AI research.

III. INNOVATIVE FRAMEWORK AND TECHNIQUES

A. Perceptual Compression via Autoencoders

We begin with an autoencoder trained to map high-dimensional image data into a compact latent space. The encoder compresses the image into a latent representation z, and the decoder reconstructs x from z.

Objective

……………..(1)

Advantages

Reduces computational complexity by processing in a lower-dimensional space.
Preserves essential details for downstream generative tasks.

B. Diffusion in Latent Space

The core innovation of Latent Diffusion Models (LDMs) lies in their ability to perform the generative process within a compressed latent space rather than directly in high-dimensional pixel space. Traditional diffusion models operate by adding Gaussian noise to the input image over multiple steps, followed by a reverse process where a neural network is trained to denoise and recover the image. While this method is highly effective for generating high-quality samples, the computational cost is significant when applied to pixel-level data, especially for high-resolution images.

In contrast, LDMs shift this process to a lower-dimensional latent space by first encoding the input images into a more compact representation using a pretrained autoencoder. The encoder compresses the image into a latent vector, which captures the essential features of the image while ignoring unnecessary high-frequency details. The diffusion model is then applied to this latent representation rather than the raw pixel data, significantly reducing the dimensionality of the problem. This lower-dimensional latent space is easier to process, requiring far less computational power for both training and inference.

During the forward process, noise is added to the latent vector instead of the pixel space, and the diffusion model learns to iteratively denoise the latent representation during the reverse process. The denoising process in latent space allows the model to capture and generate high-level semantic structures, preserving crucial features while filtering out noise. The decoder of the autoencoder is then used to map the generated latent code back to the image space, reconstructing the final image with high visual fidelity.

This approach effectively decouples the image synthesis task into two phases:

Latent-space diffusion, where the model operates efficiently with a compressed version of the data.
Decoding from the latent space back into pixel space, which allows for high-quality, high-resolution output.

The key advantage of this method is that it retains the inductive biases of diffusion models, such as the ability to model complex distributions, while dramatically reducing the computational burden. Furthermore, by working in latent space, LDMs avoid the need for aggressive compression or excessive down sampling, maintaining fine-grained details and high-resolution features in the generated images. This enables LDMs to achieve state-of-the-art performance across various tasks, including image synthesis, inpainting, and super-resolution, without the extensive computational cost typically associated with pixel-based diffusion models.

Figure 2. We condition LDMs either via concatenation or by a more general cross-attention mechanism.

C. Conditioning Mechanisms

One of the significant advancements in Latent Diffusion Models (LDMs) is their ability to handle various types of conditioning inputs, which allows for more flexible and powerful image generation. Conditioning is the process of guiding the generative model to produce outputs based on specific input information, such as text prompts, class labels, or spatial maps. LDMs incorporate cross-attention mechanisms, which enable them to seamlessly integrate multiple types of conditioning data. The key conditioning mechanisms in LDMs are as follows:

1) Cross-Attention Layers for Multimodal Conditioning

LDMs utilize cross-attention mechanisms, which are a form of attention mechanism that allows the model to focus on relevant parts of different input modalities simultaneously. This approach enables the model to combine information from different sources, such as a textual description or a semantic map, alongside the image generation process. The model is able to "attend" to these different inputs during each stage of the generation, ensuring that the output is consistent with the desired conditioning.

Text Conditioning: For tasks like text-to-image synthesis, cross-attention enables the model to condition the image generation process based on a given textual description. The text is encoded using a transformer-based model (such as BERT or GPT) to obtain a rich embedding, which is then used to guide the latent diffusion process, ensuring the generated image matches the semantics of the input text.
Image-to-Image Conditioning: In cases like image translation or inpainting, the model can condition the diffusion process on other images or semantic maps. For example, in semantic segmentation tasks, the model can generate a realistic image based on a low-resolution or incomplete input by focusing on the semantic layout provided by the conditioning.

Fig 3: Cross attention mechanisms

2) Flexible Input Modalities

Cross-attention is flexible, allowing LDMs to work with a wide variety of conditioning inputs, including:

Textual Descriptions: As in text-to-image generation, where the model learns to generate an image based on an arbitrary textual description. For example, a prompt like "A red apple on a wooden table" would guide the model to generate an image of a red apple in a specific context.
Bounding Boxes or Object Layouts: The model can also accept spatial information in the form of bounding boxes or other layouts. This can be useful for tasks like object localization or scene synthesis, where specific objects need to be placed in predefined regions of the image.
Semantic Maps: LDMs can generate images conditioned on semantic maps, which represent different objects or regions in an image as labelled segments. This is particularly useful in tasks like semantic image synthesis, where the goal is to generate images with specific content and structure, such as landscapes or cityscapes with defined features.

Figure 4. Illustrating perceptual and semantic compression: Most bits of a digital image correspond to imperceptible details.

3) Conditional Denoising Process

During the denoising process in the reverse diffusion phase, the model integrates the conditioning information at each timestep. In traditional diffusion models, the process is focused solely on restoring the pixel or latent representation of the image. In LDMs, the conditioning information is incorporated at each step to guide the denoising in a way that aligns with the given input, whether it's a text prompt or an image structure. This results in highly controlled and guided image generation, ensuring that the final output conforms to the desired characteristics.

4) General-Purpose Conditioning Architecture

The design of the cross-attention mechanism in LDMs allows for scalability across different conditioning types. Whether the task involves text, images, or other forms of conditioning data, the same underlying architecture can be adapted to various input modalities. This flexibility is a significant advantage, enabling LDMs to be applied to a wide range of tasks beyond just image synthesis, such as image editing, style transfer, and interactive design tools.

5) Increased Control and Diversity in Generation

Conditioning in LDMs enhances the model's ability to generate images that are not only high in quality but also diverse and contextually accurate. By integrating cross-attention, the model can condition the output on specific aspects of the input, such as object placement, stylistic elements, or contextual details. This allows for a higher degree of control over the image generation process, enabling users to fine-tune outputs based on their requirements.

IV. PERFORMANCE EVALUATION AND CASE STUDIES

A. Efficiency Analysis

We evaluate LDMs with varying compression levels (e.g., down sampling factors). Results show that LDMs achieve a 2.7x speedup in training and reduce memory requirements by up to 50% compared to pixel-space models.

B. Application Benchmarks

Unconditional Image Generation: On datasets like CelebA-HQ and ImageNet, LDMs surpass GANs and autoregressive models in FID and precision-recall metrics.
Text-to-Image Synthesis: Using LAION-400M and MS-COCO datasets, LDMs generate diverse and high-quality images from user-defined prompts.

Figure 5. Samples for user-defined text prompts from our model for text-to-image synthesis, LDM-8 (KL), which was trained on theLAION [78] database. Samples generated with 200 DDIM steps and _ = 1:0. We use unconditional guidance [32] with s = 10:0.

3) Inpainting and Super-Resolution: LDMs excel in completing missing image regions and upscaling low-resolution images. On ImageNet 4x super-resolution, LDMs achieve a new state-of-the-art FID of 2.8.

Figure 6. Samples from LDMs trained on CelebAHQ [39], FFHQ [41], LSUN-Churches [102], LSUN-Bedrooms [102] and class conditional

ImageNet [12], each with a resolution of 256 _ 256. Best viewed when zoomed in. For more samples cf . the supplement.

V. KEY ACHIEVEMENTS AND INSIGHTS

A. Competitive Performance Across Benchmarks

Latent Diffusion Models (LDMs) achieve state-of-the-art or competitive results across multiple standard benchmarks, including CelebA-HQ, ImageNet, and MS-COCO. In terms of FID (Fréchet Inception Distance) scores, LDMs outperform traditional pixel-based diffusion models, GANs, and other generative models, demonstrating their ability to generate high-quality images with minimal computational resources.

B. Significant Reduction in Computational Requirements

LDMs offer substantial improvements in computational efficiency, reducing training time by up to 2.7x and memory usage by 50% compared to pixel-space diffusion models. This reduction in resource demand makes LDMs more accessible, enabling faster experimentation and deployment of generative models, even for users with limited computational power.

C. High-Resolution Image Synthesis with Detail Preservation

LDMs excel at high-resolution image synthesis by operating in a compressed latent space, which maintains fine-grained details while enabling the generation of images with resolutions of up to 1024×1024 px. This high fidelity is achieved without the need for excessive down sampling, ensuring that both global structures and fine textures are captured effectively.

Figure 7. A LDM trained on 2562 resolution can generalize to larger resolution (here: 512_1024) for spatially conditioned tasks such as semantic synthesis of landscape images.

D. Superior Text-to-Image Generation

In text-to-image synthesis tasks, LDMs exhibit exceptional performance in generating coherent and visually accurate images based on textual descriptions. For example, prompts such as “A futuristic cityscape at sunset” or “A red apple on a wooden table” are translated into detailed, contextually accurate images, showcasing LDMs' ability to integrate multimodal conditioning inputs effectively.

Fig 8: Generated images from text inputs

E. Enhanced Flexibility in Image Editing and Conditional Generation

The cross-attention mechanism enables LDMs to handle complex image-to-image tasks such as inpainting, semantic image synthesis, and super-resolution. LDMs can seamlessly generate images from partial inputs (e.g., incomplete images or semantic maps) and refine them according to specific conditions, offering high flexibility in content creation and image manipulation.

Figure 9. Layout-to-image synthesis with an LDM on COCO [4], Quantitative evaluation in the supplement D.3.

VI. CHALLENGES AND ETHICAL CONSIDERATIONS

A. Limitations

While LDMs improve efficiency, their sequential sampling remains slower than GANs. Additionally, latent-space compression may lose fine-grained details critical for some applications.

B. Societal Considerations

The accessibility of generative models raises ethical concerns, including misuse for misinformation and privacy violations. Efforts must be made to mitigate these risks, such as implementing watermarking systems or robust training data anonymization.

Figure 10. ImageNet 64!256 super-resolution on ImageNet-Val. LDM-SR has advantages at rendering realistic textures but SR3 can synthesize more coherent fine structures. See appendix for additional samples and cropouts. SR3 results from.

Conclusion

The development of Latent Diffusion Models (LDMs) represents a significant advancement in the field of generative modelling, addressing the computational inefficiencies and scalability challenges inherent in traditional pixel-space diffusion models. By leveraging a compressed latent space derived from a pretrained autoencoder, LDMs achieve a unique balance between computational efficiency and high-quality image synthesis. This innovation drastically reduces training time and memory requirements without compromising on fidelity, thus making state-of-the-art generative modeling more accessible and sustainable. A key strength of LDMs lies in their versatility, enabled by the integration of cross-attention mechanisms that support multimodal conditioning. This feature allows LDMs to excel in a diverse range of applications, including: 1) Text-to-Image Synthesis: Generating high-quality images guided by textual descriptions, with applications in creative design and content creation. 2) Super-Resolution: Enhancing low-resolution images to high fidelity, critical for applications in medical imaging, surveillance, and media. 3) Image Inpainting: Seamlessly filling missing or corrupted regions in images, offering solutions for restoration and editing tasks. In quantitative and qualitative evaluations, LDMs consistently outperform or match existing state-of-the-art models across benchmarks like CelebA-HQ, ImageNet, and MS-COCO. Moreover, the reduction in computational demands—such as a 2.7× faster training time and 50% lower inference costs—highlights the potential of LDMs to democratize access to generative AI technologies. This makes advanced image synthesis feasible even for users with limited resources, expanding the reach of these tools to smaller research teams, startups, and creative industries. Despite their strengths, LDMs are not without limitations. While the latent space compression approach reduces computational overhead, it can occasionally compromise fine-grained details, especially in tasks requiring pixel-level precision. Additionally, the sequential sampling process, though optimized compared to previous models, remains slower than GAN-based approaches. These challenges offer directions for future research, such as further optimizing latent-space representations and improving sampling speed. From a broader perspective, the adoption of LDMs must be accompanied by responsible use. Generative models, including LDMs, have the potential for misuse, such as creating deceptive or harmful content. Researchers and developers should focus on implementing safeguards like watermarking systems and ethical guidelines to ensure these technologies are used responsibly. In conclusion, Latent Diffusion Models represent a transformative step forward in generative modeling, providing a robust framework for efficient and high-quality image synthesis. By addressing critical bottlenecks in training and inference, LDMs open up new possibilities for research and practical applications. As the field continues to evolve, LDMs offer a solid foundation for future advancements, contributing to the broader goal of accessible and ethical generative AI. Figures: Placeholder links for images can be replaced with actual diagrams or visualizations derived from the original paper.

References

[1] Goodfellow, I., et al. (2014). Generative Adversarial Networks. arXiv preprint arXiv:1406.2661. [2] Dhariwal, P., & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. Advances in Neural Information Processing Systems. [3] Ramesh, A., et al. (2021). Zero-Shot Text-to-Image Generation. Proceedings of the 38th International Conference on Machine Learning (ICML). [4] Ho, J., et al. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems. [5] Kingma, D.P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114. [6] Sohl-Dickstein, J., et al. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Proceedings of the 32nd International Conference on Machine Learning (ICML). [7] Dosovitskiy, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR) [8] Vaswani, A., et al. (2017). Attention is All You Need. Advances in Neural Information Processing Systems [9] Radford, A., et al. (2021). Learning Transferable Visual Models from Natural Language Supervision. arXiv preprint arXiv:2103.00020 [10] Karras, T., et al. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) [11] Brock, A., Donahue, J., & Simonyan, K. (2018). Large Scale GAN Training for High Fidelity Natural Image Synthesis. International Conference on Learning Representations (ICLR) [12] Chen, T.Q., et al. (2018). PixelSNAIL: An Improved Autoregressive Generative Model. arXiv preprint arXiv:1712.09763. [13] He, K., et al. (2016). Deep Residual Learning for Image Recognition. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) [14] Radford, A., et al. (2021). CLIP: Connecting Text and Images. Proceedings of the 38th International Conference on Machine Learning (ICML) [15] Jaitly, N., et al. (2015). A Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv preprint arXiv:1711.10433. [16] Bengio, Y., et al. (2006). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. [17] Oord, A.V., et al. (2016). Pixel Recurrent Neural Networks. Proceedings of the 33rd International Conference on Machine Learning (ICML). [18] Salimans, T., et al. (2016). Improved Techniques for Training GANs. Advances in Neural Information Processing Systems [19] Shazeer, N., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv preprint arXiv:1701.06538. [20] Isola, P., et al. (2017). Image-to-Image Translation with Conditional Adversarial Networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [21] Wang, X., et al. (2018). High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. IEEE/CVF International Conference on Computer Vision (ICCV). [22] Zhang, R., et al. (2017). Real-Time Image Super-Resolution with Conditional GANs. IEEE/CVF International Conference on Computer Vision (ICCV). [23] Mirza, M., & Osindero, S. (2014). Conditional Generative Adversarial Nets. arXiv preprint arXiv:1411.1784. [24] Karras, T., et al. (2020). Analyzing and Improving the Image Quality of StyleGAN. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [25] Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview. Neural Networks. [26] Wu, Z., et al. (2016). Multimodal Variational Autoencoders for Image Captioning. IEEE/CVF International Conference on Computer Vision (ICCV). [27] Batra, D., et al. (2019). Image Synthesis with Conditional Variational Autoencoders. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [28] Hinton, G.E., & Salakhutdinov, R.R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science. [29] Xie, L., et al. (2019). Generative Image Inpainting with Contextual Attention. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [30] Chen, J., et al. (2020). Towards High-Quality GANs: A Survey. Computers and Graphics. [31] Zhu, J.Y., et al. (2017). Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. IEEE/CVF International Conference on Computer Vision (ICCV). [32] Kumar, R., & Bhattacharya, S. (2020). Image-to-Image Translation with LSTM Networks. IEEE Transactions on Image Processing. [33] Tan, M., & Le, Q.V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Proceedings of the 36th International Conference on Machine Learning (ICML). [34] Kaiser, ?., et al. (2017). One Model to Learn Them All. arXiv preprint arXiv:1707.06121. [35] Gao, S., et al. (2018). A Survey on Generative Adversarial Networks. IEEE Access. [36] Bojanowski, P., et al. (2016). Generating Word Vectors from Sentence Descriptions. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). [37] Radford, A., et al. (2021). Learning Transferable Visual Models from Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML). [38] Odena, A., et al. (2016). Conditional Image Synthesis with Auxiliary Classifier GANs. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [39] Zhu, Y., & Liu, X. (2019). Face Generation from Textual Descriptions via Deep Generative Models. IEEE Transactions on Image Processing. [40] Xie, L., et al. (2021). Generative Models for Image Inpainting: A Review. IEEE Transactions on Neural Networks and Learning Systems. [41] Papernot, N., et al. (2016). The Limitations of Deep Learning in Adversarial Settings. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. [42] Huang, X., et al. (2018). Multimodal Cycle-Consistent Adversarial Networks for Image Synthesis. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [43] Zhu, J.Y., et al. (2018). Domain Transfer for Image Synthesis. IEEE/CVF International Conference on Computer Vision (ICCV). [44] Liu, M.Y., & Tuzel, O. (2016). Coupled Generative Adversarial Networks. Advances in Neural Information Processing Systems. [45] Odena, A., et al. (2017). Conditional Image Synthesis with Auxiliary Classifier GANs. arXiv preprint arXiv:1610.09585. [46] Chia, D., et al. (2020). A Survey of Generative Adversarial Networks. IEEE Access. [47] Ren, Z., et al. (2018). Generative Models for Image Inpainting. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [48] Kim, B., et al. (2020). GAN-based Deep Learning for Image S. [49] Zhou, X., & Zhang, X. (2020). Fine-grained Image Generation with Variational Autoencoders. IEEE Transactions on Image Processing. [50] Xu, T., et al. (2018). AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [51] Liu, Z., et al. (2019). Semi-Supervised Image Synthesis with Semantic Consistency. IEEE Transactions on Neural Networks and Learning Systems. [52] Xie, S., et al. (2019). Diverse Image Generation via Latent Space Smoothing. IEEE/CVF International Conference on Computer Vision (ICCV). [53] Saito, S., et al. (2017). Coupled GANs for Image Generation with Semantic Constraints. International Journal of Computer Vision. [54] Choi, Y., et al. (2018). StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [55] Lee, H., et al. (2018). Diverse Image Synthesis with Generative Adversarial Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. [56] Denton, E., & Birodkar, V. (2017). Unsupervised Learning of Probabilistic Models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [57] Huang, L., et al. (2017). Semantic Image Synthesis with Generative Adversarial Networks. IEEE Transactions on Image Processing. [58] Yang, J., et al. (2019). Image Synthesis from Sketch using Deep Convolutional Generative Adversarial Networks. IEEE/CVF International Conference on Computer Vision (ICCV). [59] Liu, S., & Zeng, Y. (2020). Enhancing Image Synthesis with Pixel-wise Consistency and Discriminative Models. IEEE Transactions on Neural Networks and Learning Systems. [60] Yang, M., et al. (2020). Multi-modal Image Synthesis using Conditional Variational Autoencoders. IEEE Transactions on Image Processing. [61] Zhang, Y., et al. (2018). Deep Visual Domain Adaptation for Image Synthesis. IEEE Transactions on Image Processing. [62] Wang, L., et al. (2020). Generative Adversarial Networks for Visual Data Generation: A Survey. Neurocomputing. [63] Zhu, Y., et al. (2020). Multi-modal Image-to-Image Translation with Auxiliary Classifiers. IEEE Transactions on Neural Networks and Learning Systems. [64] Fang, H., et al. (2017). Image Generation by Conditional Generative Adversarial Networks. IEEE Transactions on Image Processing. [65] Zhu, J.Y., et al. (2018). Inpainting for Images with Missing Content using Generative Models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [66] Nguyen, H., et al. (2020). Adversarial Networks for Generating Realistic Images. Neural Networks [67] Liu, Y., et al. (2019). Robust Image Generation and Restoration with Generative Models. IEEE Transactions on Computational Imaging. [68] Zhu, H., et al. (2020). Super-Resolution with Generative Adversarial Networks. IEEE Transactions on Image Processing [69] Odena, A., et al. (2018). Conditional Image Generation with Deep Convolutional Generative Models. IEEE Transactions on Pattern Analysis and Machine Intelligence. [70] Pumarola, A., et al. (2019). Dm-GAN: Dynamic Memory Generative Adversarial Networks for Image-to-Image Translation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) [71] Zhu, X., et al. (2018). Cycle-Consistent Adversarial Networks for Image Translation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) [72] Zhang, H., et al. (2019). Multi-Scale Adversarial Image Synthesis using Conditional Variational Networks. IEEE Transactions on Neural Networks and Learning Systems [73] Gulrajani, I., et al. (2017). Improved Training of Wasserstein GANs. Advances in Neural Information Processing Systems [74] Dumoulin, V., et al. (2017). Learning to Generate with Conditional Generative Adversarial Networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) [75] Fader, R., et al. (2018). Improving Image Synthesis using Generative Adversarial Networks with Semantic Constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence [76] Choi, Y., et al. (2019). One-Shot Image-to-Image Translation using Generative Adversarial Networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV) [77] Wu, Q., et al. (2020). Realistic Image Generation with Contextual GANs. IEEE Transactions on Neural Networks and Learning Systems [78] Zhang, X., et al. (2021). Text-to-Image Generation using Generative Adversarial Networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) [79] Denton, E., et al. (2015). Deep Generative Models with Variational Inference. Proceedings of the International Conference on Learning Representations (ICLR) [80] Li, W., et al. (2018). Large Scale Image Generation with GANs. IEEE Transactions on Pattern Analysis and Machine Intelligence [81] Peng, Y., et al. (2017). Image Synthesis with Conditional Generative Models. IEEE Transactions on Neural Networks and Learning Systems [82] Tang, Y., et al. (2021). High-Resolution Image Synthesis using Generative Models. IEEE Transactions on Image Processing [83] Bouchacourt, D., et al. (2017). Generative Image Modeling with Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems. [84] Yi, S., et al. (2020). Image Generation with Conditional Variational Autoencoders. IEEE Transactions on Neural Networks and Learning Systems. [85] Zhao, H., et al. (2019). Multi-modal Image Generation with Conditional GANs. Proceedings of the IEEE International Conference on Computer Vision (ICCV). [86] Yu, J., et al. (2019). Multi-Scale Generative Image Synthesis. IEEE Transactions on Image Processing. [87] Liu, X., et al. (2021). Cross-Modal Generative Adversarial Networks for Image-to-Image Translation. IEEE Transactions on Pattern Analysis and Machine Intelligence [88] Berthelot, D., et al. (2017). BEGAN: Boundary Equilibrium Generative Adversarial Networks. International Conference on Learning Representations (ICLR) [89] Zhao, Y., et al. (2020). Image Synthesis with Variational Autoencoders and GANs. IEEE Transactions on Neural Networks and Learning Systems [90] Wu, C., et al. (2018). Image Synthesis and Image-to-Image Translation with Generative Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [91] Zhu, X., et al. (2019). Conditional Image Generation using Generative Adversarial Networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) [92] Zhang, Y., et al. (2020). Image-to-Image Translation with Conditional GANs. IEEE Transactions on Computational Imaging [93] Song, L., et al. (2020). High-Resolution Image Generation with Conditional GANs. IEEE Transactions on Pattern Analysis and Machine Intelligence. [94] Wu, X., et al. (2017). Conditional Generative Adversarial Networks for Image Synthesis. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) [95] Li, P., et al. (2019). Generating Realistic Images with GANs and Variational Autoencoders. IEEE Transactions on Neural Networks and Learning Systems [96] Yan, Z., et al. (2020). Fine-Grained Image Synthesis with Generative Models. IEEE Transactions on Pattern Analysis and Machine Intelligence [97] Zhou, X., et al. (2020). Text-Driven Image Generation with GANs. IEEE Transactions on Image Processing [98] Chen, Y., et al. (2019). Image Synthesis from Text using Generative Networks. *IEEE/CVF International Conference on Computer [99] P. S. Chavez (Jr.), An improved dark-object subtraction technique for atmospheric scattering correction of multispectral data, Remote Sensing and Environment, Elsevier, vol-24(3), pp-459-479,198 [100] P. Oakley and B. L. Satherley, “Improving image quality in poor visibility conditions using a physical model for contrast degradation,” IEEE Trans. Image Process., vol. 7, no. 2, pp. 167-179, Feb. 1998 [101] Simonyan, K., and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Computational and Biological Learning Society, 2015, pp. 1–14 [102] J. Kim, J. K. Lee and K. M. Lee, \"Accurate Image Super-Resolution Using Very Deep Convolutional Networks,\" 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646-1654, doi: 10.1109/CVPR.2016.182 [103] W. Yang, X. Zhang, Y. Tian, W. Wang, J. Xue and Q. Liao, \"Deep Learning for Single Image Super-Resolution: A Brief Review,\" in IEEE Transactions on Multimedia, vol. 21, no. 12, pp. 3106-3121, Dec. 2019, doi: 10.1109/TMM.2019.2919431 [104] D. Das, S. S. Chaudhuri and S. Roy, \"Dehazing technique based on dark channel prior model with sky masking and its quantitative analysis,\" 2016 2nd International Conference on Control, Instrumentation, Energy & Communication (CIEC), 2016, pp. 207-210, doi: 10.1109/CIEC.2016.7513741. [105] Sangita Roy and Sheli Sinha Chaudhuri, “Fast Single Image Haze Removal Scheme Using Self-Adjusting: Haziness Factor Evaluation”, International Journal of Virtual and Augmented Reality (IJVAR), 3 (1), 2019, pp. 42-57 [106] Sangita Roy and Sheli Sinha Chaudhuri, “WLMS-based Transmission Refined Self-Adjusted No Reference Weather Independent Image Visibility Improvement”, IETE Journal of Research, September 2020. https://doi.org/10.1080/03772063.2019.16623 [107] S Roy, S S Chaudhuri, Low Complexity Single Color Image Dehazing Technique, Intelligent Multidimensional Data and Image Processing, IGI Global,2018(special session) [108] H. Koschmieder, Theorie der horizontalen sichtweite, Beitr.Phys. Freien Atm., vol. 12, 1924, pp. 171–181 [109] E J McCartney, Optics of the Atmosphere: Scattering by Molecules and Particles, New York, NY, USA:Wiley, 1976 [110] He, Kaiming, et al. \"Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.\" Proceedings of the IEEE international conference on computer vision. 2015 [111] Zhou Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, \"Image quality assessment: from error visibility to structural similarity,\" in IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, April 2004, doi: 10.1109/TIP.2003.819861 [112] Sara, U., Akter, M. and Uddin, M. (2019) Image Quality Assessment through FSIM, SSIM, MSE and PSNR—A Comparative Study. Journal of Computer and Communications, 7, 8-18. doi: 10.4236/jcc.2019.73002. [113] A. Mittal, A. K. Moorthy and A. C. Bovik, \"No-Reference Image Quality Assessment in the Spatial Domain,\" in IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695-4708, Dec. 2012, doi: 10.1109/TIP.2012.2214050 [114] Mittal, A., R. Soundararajan, and A. C. Bovik. \"Making a Completely Blind Image Quality Analyzer.\" IEEE Signal Processing Letters. Vol. 22, Number 3, March 2013, pp. 209–212 [115] C. O. Ancuti, C. Ancuti, R. Timofte and C. De Vleeschouwer, \"O-HAZE: A Dehazing Benchmark with Real Hazy and Haze-Free Outdoor Images,\" 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 867-8678, doi: 10.1109/CVPRW.2018.00119. [116] K. He, J. Sun and X. Tang, \"Single Image Haze Removal Using Dark Channel Prior,\" in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2341-2353, Dec. 2011, doi: 10.1109/TPAMI.2010.168 [117] G. Meng, Y. Wang, J. Duan, S. Xiang, and C. Pan. Efficient image dehazing with boundary constraint and contextual regularization. In IEEE Int. Conf. on Computer Vision, 2013 [118] R. Fattal. Dehazing using color-lines. ACM Trans. on Graph., 2014. [119] D. Berman, T. Treibitz, and S. Avidan. Non-local image dehazing. IEEE Intl. Conf. Comp. Vision, and Pattern Recog, 2016 [120] Ren W., Liu S., Zhang H., Pan J., Cao X., Yang MH. (2016) Single Image Dehazing via Multi-scale Convolutional Neural Networks. In: Leibe B., Matas J., Sebe N., Welling M. (eds) Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol 9906. Springer, Cham. https://doi.org/10.1007/978-3-319-46475-6_1 [121] W. Yang, X. Zhang, Y. Tian, W. Wang, J. Xue and Q. Liao, \"Deep Learning for Single Image Super-Resolution: A Brief Review,\" in IEEE Transactions on Multimedia, vol. 21, no. 12, pp. 3106-3121, Dec. 2019, doi: 10.1109/TMM.2019.2919431 [122] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 184–199. [123] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016 [124] A. S. Parihar, Y. K. Gupta, Y. Singodia, V. Singh and K. Singh, \"A Comparative Study of Image Dehazing Algorithms,\" 2020 5th International Conference on Communication and Electronics Systems (ICCES), 2020, pp. 766-771, doi: 10.1109/ICCES48766.2020.9138037. [125] W. Yang et al., \"Advancing Image Understanding in Poor Visibility Environments: A Collective Benchmark Study,\" in IEEE Transactions on Image Processing, vol. 29, pp. 5737-5752, 2020, doi: 10.1109/TIP.2020.2981922 [126] C. Chengtao, Z. Qiuyu and L. Yanhua, \"A survey of image dehazing approaches,\" The 27th Chinese Control and Decision Conference (2015 CCDC), 2015, pp. 3964-3969, doi: 10.1109/CCDC.2015.7162616 [127] B. Cai, X. Xu, K. Jia, C. Qing and D. Tao, \"DehazeNet: An End-to-End System for Single Image Haze Removal,\" in IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5187-5198, Nov. 2016, doi: 10.1109/TIP.2016.2598681 [128] Ph.D. Thesis, Sangita Roy, Development of Improved Visibility Restoration Techniques using Various Intensity Parameter Tuning, ETCE Department, Jadavpur University, July 2021. [129] J. -B. Huang, A. Singh and N. Ahuja, \"Single image super-resolution from transformed self-exemplars,\" 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5197-5206, doi: 10.1109/CVPR.2015.7299156. [130] J. Yang, J. Wright, T. S. Huang and Y. Ma, \"Image Super-Resolution Via Sparse Representation,\" in IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861-2873, Nov. 2010, doi: 10.1109/TIP.2010.2050625. [131] G. Sharma, W. Wu, and E. Dalal. The ciede2000 color-difference formula: Implementation notes, sup- plementary test data, and mathematical observations. Color Research and Applications, 2005, https://doi.org/10.1002/col.20070. [132] S Westland, C Ripamonte, Computational Colour Science using MATLAB, Wiley-IS&T Series in Imaging Science and Technology. [133] Y. Y. Schechner, S. G. Narasimhan and S. K. Nayar, \"Instant dehazing of images using polarization,\" Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, 2001, pp. I-I, doi: 10.1109/CVPR.2001.990493 [134] S. G. Narasimhan and S. K. Nayar, \"Chromatic framework for vision in bad weather,\" Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662), 2000, pp. 598-605 vol.1, doi: 10.1109/CVPR.2000.855874 [135] CIE. Improvement to industrial colour-difference evaluation. Vienna:CIE Publication No. 142-2001, Central Bureau of the CIE; 200 [136] R. T. Tan, \"Visibility in bad weather from a single image,\" 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1-8, doi: 10.1109/CVPR.2008.4587643 [137] Kopf, Johannes, et al. \"Deep photo: Model-based photograph enhancement and viewing.\" ACM transactions on graphics (TOG) 27.5 (2008): 1-10 [138] S Roy, S S Chaudhuri, Single Image Very Deep Super Resolution (SIVDSR) Dehazing, BECITHCON 2021, 4-5 December 2021 [139] S Roy, Single Image DnCNN Visibility Improvement (SImDnCNNVI), Scientific Visualization, volume 14, number 3, pages 92 – 106, September 19th 2022, https://doi.org/10.26583/sv.14.3.07. [140] Roy, Sangita, and S. S. Chaudhuri. \"SIVDSR-Dhaze: Single Image Dehazing with Very Deep Super Resolution Framework and Its Analysis.\" Scientific Visualization 14.5 (2022).

Copyright

Copyright © 2024 Soumika Chakraborty, Bikram Sarkar, Jeet Kumar Dash, Shatayu Bhowmick, Tuhin Akuli, Dr. Sangita Roy. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download Paper

Paper Id : IJRASET65751

Publish Date : 2024-12-04

ISSN : 2321-9653

Publisher Name : IJRASET

DOI Link : Click Here