Chapter 3: Deep Dive into Generative Adversarial Networks (GANs)

3.4 Evaluating GANs

Critical to the process of understanding the effectiveness of Generative Adversarial Networks (GANs) is the essential step of their evaluation. This evaluation process ensures that the data generated by these networks meets the standards of quality that were originally envisaged.

This is not a straightforward task because, in stark contrast to more traditional machine learning models, GANs do not have a direct and uncomplicated evaluation metric. This is largely due to their overarching goal, which is to generate data that is, in its complexity and detail, as realistic as possible.

In this section, we will embark on a comprehensive exploration of the various methods available for the evaluation of GANs. We will delve into both quantitative and qualitative approaches, examining their respective merits and potential drawbacks. Furthermore, we will explore some of the most commonly used metrics in this area of study. To complement this theoretical discussion, we will also provide practical examples to further illuminate the concepts and techniques under discussion.

3.4.1 Quantitative Evaluation Metrics

Quantitative evaluation metrics offer a range of objective measures that are crucial for assessing the performance of Generative Adversarial Networks (GANs). These metrics serve to provide a clear, definitive, and unbiased evaluation of the effectiveness of these networks, and are therefore essential in understanding the overall performance and potential improvements that could enhance the operation of GANs.

Some commonly used metrics include:

1. Inception Score (IS):

The Inception Score (IS) is a significant quantitative metric used to evaluate the performance of Generative Adversarial Networks (GANs), particularly in the quality of the images they generate. It was introduced as a means to both quantify and qualify the generated images based on two main factors: diversity and quality.

Diversity refers to the range of different images the GAN can produce. A model that generates a variety of images, rather than repeatedly producing similar or identical ones, would be considered as having high diversity. A higher score in diversity reflects the GAN's ability to capture a wide representation of the dataset it was trained on.

Quality, on the other hand, pertains to how 'real' the generated images are or how close they are to the real images in the training dataset. High-quality images should be indistinguishable from actual photos, demonstrating that the GAN has accurately learned the data distribution of the training set.

The Inception Score uses a pre-trained Inception v3 network to compute these factors. Each generated image is passed through the Inception network, which produces a conditional label distribution. The score is then calculated using these distributions, with the assumption that good models would produce diverse images (high entropy of marginal distribution) but also be confident in their predictions for individual images (low entropy of conditional distribution).

A high Inception Score generally indicates that the GAN is producing diverse, high-quality images that are similar to the real data. However, it's important to note that while the Inception Score can be a useful tool for evaluating and comparing GANs, it's not perfect and has its limitations. For instance, it relies heavily on the Inception model for its calculations, meaning its accuracy is bounded by how well the Inception model was trained.

The Inception Score evaluates the quality and diversity of generated images. It uses a pre-trained Inception v3 network to compute the conditional label distribution p(y|x) for each generated image x and the marginal label distribution p(y). The score is given by:

IS(G)=exp(Ex[DKL(p(y∣x)∣∣p(y))])

A high Inception Score indicates that the generated images are both diverse and of high quality.

2. Fréchet Inception Distance (FID):

The FID metric calculates the similarity between two datasets of images. In this case, it is used to compare the distribution of generated images against the distribution of real images.

The FID calculation involves using a pre-trained Inception v3 model, a model that was originally designed and trained for image classification tasks. This model is used to extract features from both the real and generated images. The extracted features are then represented as a multivariate Gaussian distribution, characterised by a mean and a covariance.

The Fréchet distance is then calculated between these two Gaussians. This distance gives a measure of the similarity between the two sets of images. The lower the FID score, the closer the generated images are to the real images in terms of the distributions of features. Therefore, a lower FID indicates that the generative model has performed better in producing images that are more realistic.

In the context of GANs, the FID score is often used as an evaluation measure to compare the performance of different models or different configurations of the same model. It provides a more reliable and robust evaluation than some other metrics, such as the Inception Score, as it takes into account the full multi-dimensional distribution of features, rather than just looking at marginal and conditional distributions.

The FID score measures the distance between the distributions of real and generated images in the feature space of a pre-trained Inception v3 network. Lower FID scores indicate that the generated images are more similar to the real images.

FID=∣∣μr−μg∣∣2+Tr(Σr+Σg−2(ΣrΣg)1/2)

where (μr,Σr) and (μg,Σg) are the mean and covariance of the real and generated images' feature vectors, respectively.

3. Precision and Recall for Distributions:

Precision and Recall for Distributions are statistical measures used to evaluate the performance of Generative Adversarial Networks (GANs), particularly in terms of the quality and diversity of the data they generate. These metrics are borrowed from the field of information retrieval and are also commonly used to evaluate classification tasks in machine learning.

Precision measures the quality of the generated samples. In the context of GANs, it evaluates how many of the generated samples are 'real' or close to the real data distribution. A high precision score implies that most of the generated samples are of high quality, resembling the real data closely. It indicates that the GAN is doing a good job in generating samples that are almost indistinguishable from the real samples.

Recall, on the other hand, measures the coverage of the real data distribution by the generated samples. It evaluates whether the GAN is able to generate samples that cover the whole range of the real data distribution. A high recall score implies that the GAN has a good understanding of the real data distribution and is able to generate diverse samples that cover different aspects of the real data.

Together, precision and recall provide a comprehensive evaluation of GANs. High precision and recall values indicate that the GAN is generating high-quality samples that cover the diversity of the real data. However, there is often a trade-off between precision and recall. A model that is too focused on getting high-quality samples might miss out on the diversity of the data (high precision, low recall), while a model that focuses on covering the whole data distribution might generate more low-quality samples (low precision, high recall).

To get a balanced view of the model's performance, it's common to combine precision and recall into a single metric called the F1 score. The F1 score is the harmonic mean of precision and recall, and gives equal weight to both measures. A high F1 score indicates that the GAN is performing well in both aspects, generating diverse and high-quality samples.

Precision measures the quality of generated samples, while recall measures the coverage of the real data distribution by the generated samples. High precision and recall values indicate that the GAN is generating high-quality samples that cover the diversity of the real data.

Example: Calculating Inception Score and FID

Here’s how you can calculate the Inception Score and FID using TensorFlow and pre-trained models:

import tensorflow as tf
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from scipy.linalg import sqrtm

# Function to calculate Inception Score
def calculate_inception_score(images, num_splits=10):
    model = InceptionV3(include_top=False, pooling='avg', input_shape=(299, 299, 3))
    images = tf.image.resize(images, (299, 299))
    images = preprocess_input(images)
    preds = model.predict(images)

    scores = []
    for i in range(num_splits):
        part = preds[i * len(preds) // num_splits: (i + 1) * len(preds) // num_splits]
        py = np.mean(part, axis=0)
        scores.append(np.exp(np.mean([np.sum(p * np.log(p / py)) for p in part])))
    return np.mean(scores), np.std(scores)

# Function to calculate FID score
def calculate_fid(real_images, generated_images):
    model = InceptionV3(include_top=False, pooling='avg', input_shape=(299, 299, 3))
    real_images = tf.image.resize(real_images, (299, 299))
    real_images = preprocess_input(real_images)
    gen_images = tf.image.resize(generated_images, (299, 299))
    gen_images = preprocess_input(gen_images)

    act1 = model.predict(real_images)
    act2 = model.predict(gen_images)

    mu1, sigma1 = act1.mean(axis=0), np.cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), np.cov(act2, rowvar=False)

    ssdiff = np.sum((mu1 - mu2) ** 2.0)
    covmean = sqrtm(sigma1.dot(sigma2))

    if np.iscomplexobj(covmean):
        covmean = covmean.real

    fid = ssdiff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
    return fid

# Generate some fake images using the trained GAN generator
noise = np.random.normal(0, 1, (1000, 100))
generated_images = generator.predict(noise)

# Calculate Inception Score
is_mean, is_std = calculate_inception_score(generated_images)
print(f"Inception Score: {is_mean} ± {is_std}")

# Calculate FID Score
real_images = x_train[np.random.choice(x_train.shape[0], 1000, replace=False)]
fid_score = calculate_fid(real_images, generated_images)
print(f"FID Score: {fid_score}")

This example uses the TensorFlow library to calculate two key metrics for evaluating Generative Adversarial Networks (GANs): the Inception Score (IS) and the Fréchet Inception Distance (FID). These metrics are essential for assessing the quality and diversity of the synthetic images produced by GANs.

The first function, calculate_inception_score, is designed to compute the Inception Score. The Inception Score is a metric that gauges the quality and diversity of images produced by a GAN. It does this by using a pre-trained model, specifically the InceptionV3 model, to make predictions on the generated images. The function resizes the images to match the input shape expected by the InceptionV3 model, preprocesses them to match the format the model expects, and then passes them to the model to get predictions. It then calculates the score based on these predictions.

The score calculation involves splitting the predictions into several subsets (the number of which is determined by the num_splits parameter), calculating the mean of each subset, and then using these means to compute the KL divergence between the distribution of predicted labels and the uniform distribution. The KL divergence measures how much one probability distribution differs from a second, expected distribution. The final Inception Score is the exponential of the mean KL divergence across all subsets.

The second function, calculate_fid, is used to compute the Fréchet Inception Distance. The Fréchet Inception Distance is another metric for evaluating GANs, but it specifically measures the similarity between two sets of images. In the context of GANs, these two sets are typically the real images from the training set and the synthetic images generated by the GAN.

The FID calculation involves using the same InceptionV3 model to extract features from both the real and generated images. These features are then used to create a multivariate Gaussian distribution, characterized by a mean and a covariance. The Fréchet distance between these two Gaussian distributions is then computed. The Fréchet distance is a measure of similarity between two distributions, so a lower FID score indicates that the generated images are more similar to the real images.

After defining these two functions, the code goes on to generate some fake images using a GAN generator. The generator is fed with random noise, following a normal distribution, to generate these synthetic images. The Inception Score and FID for these generated images are then calculated using the previously defined functions. Finally, the results of these calculations are printed out.

In summary, this example provides a practical demonstration of how to evaluate the performance of a Generative Adversarial Network (GAN) using two commonly used metrics: the Inception Score and the Fréchet Inception Distance. Both these metrics provide valuable insights into the quality and diversity of the images generated by the GAN, which are crucial for assessing the effectiveness of the GAN.

Example of Fréchet Inception Distance (FID) for GAN Evaluation with TensorFlow

Here's a comprehensive example of calculating FID for GAN evaluation using TensorFlow:

1. Dependencies:

import tensorflow as tf
from tensorflow.keras.applications import inception_v3
from tensorflow.keras.preprocessing import image
from scipy import linalg
import numpy as np

2. InceptionV3 Model for Feature Extraction:

def inception_model():
  """
  Loads the pre-trained InceptionV3 model for feature extraction.
  Removes the final classification layer.
  """
  model = inception_v3.InceptionV3(include_top=False, weights='imagenet')
  model.output = model.layers[-1].output
  return model

This function defines inception_model which loads the pre-trained InceptionV3 model excluding the final classification layer. This layer is not needed for FID calculation, and we only want the feature representation learned by the model.

3. Preprocessing Function:

def preprocess_image(img_path):
  """
  Preprocesses an image for InceptionV3 input.
  """
  target_size = (299, 299)
  img = image.load_img(img_path, target_size=target_size)
  img = image.img_to_array(img)
  img = img / 255.0
  img = np.expand_dims(img, axis=0)
  return img

This function defines preprocess_image which takes an image path and preprocesses it for InceptionV3 input. This includes resizing the image to the target size (299x299 for InceptionV3) and normalization.

4. Feature Extraction Function:

def extract_features(model, img_paths):
  """
  Extracts features from a list of images using the InceptionV3 model.
  """
  features = []
  for img_path in img_paths:
    img = preprocess_image(img_path)
    feature = model.predict(img)
    features.append(feature)
  return np.array(features)

This function defines extract_features which takes the InceptionV3 model and a list of image paths. It iterates through each path, preprocesses the image, feeds it to the model, and stores the extracted features in a NumPy array.

5. FID Calculation Function:

def calculate_fid(real_imgs, generated_imgs):
  """
  Calculates the Fréchet Inception Distance (FID) between two sets of images.
  """
  # Load InceptionV3 model
  model = inception_model()

  # Extract features for real and generated images
  real_features = extract_features(model, real_imgs)
  generated_features = extract_features(model, generated_imgs)

  # Calculate statistics for real and generated features
  real_mean = np.mean(real_features, axis=0)
  real_cov = np.cov(real_features.reshape(real_features.shape[0], -1), rowvar=False)
  generated_mean = np.mean(generated_features, axis=0)
  generated_cov = np.cov(generated_features.reshape(generated_features.shape[0], -1), rowvar=False)

  # Calculate squared mean difference
  ssdiff = np.sum((real_mean - generated_mean)**2)

  # Calculate FID
  covmean = linalg.sqrtm(np.dot(real_cov, generated_cov))
  if np.iscomplexobj(covmean):
    covmean = covmean.real
  fid = ssdiff + np.trace(real_cov + generated_cov - 2.0 * covmean)
  return fid

This function defines calculate_fid which takes two lists of image paths (real and generated). It utilizes the previously defined functions to extract features from both sets and then calculates the FID. Here's a breakdown of the key steps:

Extracts features for real and generated images using the InceptionV3 model.
Calculates the mean and covariance matrix for both feature sets.
Computes the squared mean difference between real and generated means.
Calculates the square root of the product of covariances.
Handles potential complex number issues arising from square root of product.
FID is defined as the sum of squared mean differences.

Example of Precision and Recall for Generative Models with TensorFlow

While there's no standard implementation using TensorFlow for Precision and Recall (PR) specifically designed for generative models, we can explore a similar approach leveraging Inception features as proposed in the paper "Assessing Generative Models via Precision and Recall". Here's a breakdown of the concept and an example implementation:

1. Understanding PR for Generative Models:

Precision: Measures the quality of samples generated by the model. High precision indicates a higher percentage of generated samples resemble the real data distribution.
Recall: Measures the model's ability to capture the diversity of the real data distribution. High recall indicates the generated samples cover a broader range of variations present in the real data.

2. Inception Feature Matching:

This approach utilizes the pre-trained InceptionV3 model to extract features from both real and generated data. The idea is to compare these features to assess how well the generated data aligns with the real data distribution.

3. Implementation Example:

import tensorflow as tf
from tensorflow.keras.applications import inception_v3
from tensorflow.keras.preprocessing import image
import numpy as np

def inception_model():
  """
  Loads the pre-trained InceptionV3 model for feature extraction.
  """
  model = inception_v3.InceptionV3(include_top=False, weights='imagenet')
  model.output = model.layers[-1].output
  return model

def preprocess_image(img_path):
  """
  Preprocesses an image for InceptionV3 input.
  """
  target_size = (299, 299)
  img = image.load_img(img_path, target_size=target_size)
  img = image.img_to_array(img)
  img = img / 255.0
  img = np.expand_dims(img, axis=0)
  return img

def extract_features(model, img_paths):
  """
  Extracts features from a list of images using the InceptionV3 model.
  """
  features = []
  for img_path in img_paths:
    img = preprocess_image(img_path)
    feature = model.predict(img)
    features.append(feature)
  return np.array(features)

def compute_pr(real_features, generated_features):
  """
  Estimates precision and recall based on Inception feature distances.
  **Note:** This is a simplified approach and may not capture the full 
  complexity of PR for generative models.

  Parameters:
      real_features: NumPy array of features from real data.
      generated_features: NumPy array of features from generated data.

  Returns:
      precision: Estimated precision value.
      recall: Estimated recall value.
  """
  # Calculate pairwise distances between real and generated features
  real_distances = np.linalg.norm(real_features[:, np.newaxis] - generated_features, axis=2)

  # Threshold for considering a generated sample close to real data (hyperparameter)
  threshold = 0.5

  # Count samples within the threshold distance
  close_samples = np.sum(real_distances < threshold, axis=1)

  # Precision: Ratio of close generated samples to total generated samples
  precision = np.mean(close_samples / generated_features.shape[0])

  # Recall: Ratio of generated samples close to at least one real sample
  recall = np.mean(close_samples > 0)

  return precision, recall

# Example usage
model = inception_model()
real_imgs = ["path/to/real/image1.jpg", "path/to/real/image2.png"]
generated_imgs = ["path/to/generated/image1.jpg", "path/to/generated/image2.png"]

real_features = extract_features(model, real_imgs)
generated_features = extract_features(model, generated_imgs)

precision, recall = compute_pr(real_features, generated_features)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")

4. Important Note:

This example provides a simplified approach to estimating PR for generative models using Inception features. It utilizes a distance threshold to categorize generated samples as "close" to real data. For a more comprehensive evaluation, consider techniques from the original paper that involve kernel density estimation and calculating PR curves over a range of thresholds.

Assessing Generative Models via Precision and Recall: https://proceedings.neurips.cc/paper_files/paper/2018/file/f7696a9b362ac5a51c3dc8f098b73923-Paper.pdf

3.4.2 Qualitative Evaluation

Qualitative evaluation involves visually inspecting the generated samples to assess their quality. This approach is subjective but provides valuable insights into the realism and diversity of the generated data.

In the context of Generative Adversarial Networks (GANs), qualitative evaluation involves closely examining the generated samples to assess their level of realism and the diversity they present. This might involve looking for any visual artifacts, evaluating the clarity or blurriness of the samples, and checking how well the generated samples represent the diversity of the real data.

For example, if the GAN is designed to generate images of faces, a qualitative evaluation might involve looking at the generated faces to see how well they resemble real human faces and how diverse the faces are in terms of age, gender, ethnicity, and other features.

Although qualitative evaluation does not provide a concrete, numerical metric for evaluating performance like quantitative evaluation does, it provides valuable insights that can help improve the model. For instance, if the observer notices that the generated images are mostly blurry, this might indicate that the GAN's generator is not powerful enough and needs to be adjusted.

In addition to the visual inspection, qualitative evaluation might also involve comparison with real data. This involves comparing the generated samples side by side with real data samples to evaluate how similar they are. This method, while still subjective, might provide a more objective comparison than visual inspection alone.

Overall, qualitative evaluation plays an essential role in assessing the performance of Generative Adversarial Networks. While it should ideally be used alongside quantitative methods for a more comprehensive evaluation, it can provide valuable insights that can guide the fine-tuning of the model.

Example: Visual Inspection of Generated Images

import matplotlib.pyplot as plt

# Generate new samples
noise = np.random.normal(0, 1, (10, 100))
generated_images = generator.predict(noise)

# Plot generated images
fig, axs = plt.subplots(1, 10, figsize=(20, 2))
for i, img in enumerate(generated_images):
    axs[i].imshow(img.squeeze(), cmap='gray')
    axs[i].axis('off')
plt.show()

In this example:

This example code snippet focuses on the visualization of data generated by the GAN. It begins by importing the Matplotlib library, which is used extensively in Python for creating static, animated, and interactive visualizations in Python.

The first part of the analysis involves the generation of new data samples. This is done by creating 'noise' - random numbers that follow a normal distribution (in this case, centered around 0 with a standard deviation of 1). An array of size 10x100 is created, where each row is a separate noise sample. These noise samples serve as inputs to the GAN's generator, which uses them to create new data samples. In this case, the generator is expected to return images, hence the name 'generated_images'.

The second part of the analysis involves the visualization of these generated images. A figure is created with a grid of 1 row and 10 columns, and a size of 20x2. Each cell in this grid will contain one of the generated images. The images are plotted one by one in each cell. The images are squeezed (to remove single-dimensional entries from their shapes) and converted to grayscale for visual clarity. The axes are also turned off for each image, to make the images themselves the focus of the visualization.

Once all the images have been plotted, the figure is displayed using plt.show(). This command reveals the entire figure and its subplots as a single output. This visualization would provide a look at the diversity and quality of the images generated by the GAN, based on the initial random noise inputs.

This type of visualization is extremely helpful in assessing how well the GAN has learned to generate new data. By visually inspecting the generated images, we can get a sense of how realistic they look and how well they mimic the training data. This qualitative evaluation, while subjective, is an important part of assessing the effectiveness of GANs.

3.4.3 User Studies

The process of conducting user studies is an essential part of assessing the quality of the generated data. This method involves obtaining valuable feedback from human participants who interact with the data. The primary purpose of these studies is to gauge the perceived quality and realism of the images generated by the system.

Participants in these studies are typically asked to provide their ratings on a variety of criteria. Some of these criteria may include aspects such as the realism of the images, the diversity of the images produced, and the overall visual appeal of the generated outputs. By soliciting feedback on these specific aspects, researchers can gain a comprehensive understanding of how well the system performs in terms of data generation.

It's important to note that user studies offer a significant advantage over other forms of assessment. Unlike relying solely on visual inspection, where the assessment can be somewhat subjective and prone to bias, user studies provide a more objective and robust evaluation of the system's performance.

This is due to the fact that they incorporate a broad range of perspectives from multiple participants, thus enhancing the reliability and credibility of the evaluation results.

Example: Conducting a User Study

# Generate new samples for the user study
noise = np.random.normal(0, 1, (20, 100))
generated_images = generator.predict(noise)

# Save generated images to disk for user study
for i, img in enumerate(generated_images):
    plt.imsave(f'generated_image_{i}.png', img.squeeze(), cmap='gray')

# Instructions for the user study:
# 1. Show participants the saved generated images.
# 2. Ask participants to rate each image on a scale of 1 to 5 for realism and visual appeal.
# 3. Collect the ratings and analyze the results to assess the quality of the GAN.

This example code is generating new image samples for a user study. It creates random noise and uses it as input for a generative model (the generator) to produce images. These images are then saved to the disk. The rest of the comments outline instructions for a user study.

Users are to be shown the generated images and asked to rate them on a scale of 1 to 5 for their realism and visual appeal. The collected ratings are then analyzed to assess the quality of the Generative Adversarial Network (GAN) that produced the images.

3.4.4 Evaluating Specific Applications

The criteria for evaluating Generative Adversarial Networks (GANs) can differ significantly based on the particular application for which they are being used. It's essential to adapt the evaluation metrics to suit the specific purpose and demands of the application at hand. Here are a few examples:

Image Super-Resolution: In this case, the key is to assess the quality of images that have been upsampled. The evaluation should focus on determining the sharpness and clarity of the enhanced images, for which metrics like the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) are typically employed. These are quantitative measures that provide a clear indication of the success of the super-resolution process.
Text Generation: When GANs are used for text generation, the focus shifts to assessing the fluency and coherence of the text that has been generated. This can be a somewhat subjective process, but there are some established metrics, such as BLEU or ROUGE scores, that provide an objective measure of the quality of the generated text.
Style Transfer: For applications involving style transfer, the evaluation should center on the consistency and artistic quality of the styles that have been transferred onto target images. This involves comparing the output images with reference images to determine how well the style has been captured and transferred. The quality of the style transfer can often be a more subjective measure, as it can depend on individual perceptions of artistic quality.

Example: Evaluating Image Super-Resolution

from skimage.metrics import peak_signal_noise_ratio as psnr
from skimage.metrics import structural_similarity as ssim

# Low-resolution and high-resolution images
low_res_images = ...  # Load low-resolution images
high_res_images = ...  # Load corresponding high-resolution images

# Generate super-resolved images using the GAN generator
super_res_images = generator.predict(low_res_images)

# Calculate PSNR and SSIM for each image
psnr_values = [psnr(hr, sr) for hr, sr in zip(high_res_images, super_res_images)]
ssim_values = [ssim(hr, sr, multichannel=True) for hr, sr in zip(high_res_images, super_res_images)]

# Print average PSNR and SSIM
print(f"Average PSNR: {np.mean(psnr_values)}")
print(f"Average SSIM: {np.mean(ssim_values)}")

This example code effectively calculates the PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) between high-resolution (HR) images and their corresponding super-resolved (SR) images generated by a GAN.

Here's a breakdown of the steps:

Import Metrics:

peak_signal_noise_ratio (psnr) and structural_similarity (ssim) are imported from skimage.metrics. These functions are used to calculate the respective metrics.

Load Images:

low_res_images: This variable likely holds the pre-loaded low-resolution images you want to use for super-resolution.
high_res_images: This variable holds the corresponding high-resolution ground truth images for comparison.

Generate Super-Resolved Images:

super_res_images = generator.predict(low_res_images): This line assumes you have a trained GAN model with a generator function that takes low-resolution images as input and predicts super-resolved images.

Calculate PSNR and SSIM:

The code iterates through corresponding HR and SR image pairs using zip.
psnr_values: For each pair, it calculates the PSNR between the HR and SR images using the psnr function and appends the value to a list named psnr_values.
ssim_values: Similarly, it calculates the SSIM between each HR and SR image pair using the ssim function with multichannel=True (assuming RGB images) and appends the value to a list named ssim_values.

Print Average Values:

np.mean(psnr_values) calculates the average PSNR across all image pairs.
np.mean(ssim_values) calculates the average SSIM across all image pairs.
Finally, the code prints the average PSNR and SSIM values.

Overall, this code example effectively evaluates the quality of the generated super-resolved images by comparing them to the ground truth high-resolution images using PSNR and SSIM metrics.

3.4 Evaluating GANs

Critical to the process of understanding the effectiveness of Generative Adversarial Networks (GANs) is the essential step of their evaluation. This evaluation process ensures that the data generated by these networks meets the standards of quality that were originally envisaged.

This is not a straightforward task because, in stark contrast to more traditional machine learning models, GANs do not have a direct and uncomplicated evaluation metric. This is largely due to their overarching goal, which is to generate data that is, in its complexity and detail, as realistic as possible.

In this section, we will embark on a comprehensive exploration of the various methods available for the evaluation of GANs. We will delve into both quantitative and qualitative approaches, examining their respective merits and potential drawbacks. Furthermore, we will explore some of the most commonly used metrics in this area of study. To complement this theoretical discussion, we will also provide practical examples to further illuminate the concepts and techniques under discussion.

3.4.1 Quantitative Evaluation Metrics

Quantitative evaluation metrics offer a range of objective measures that are crucial for assessing the performance of Generative Adversarial Networks (GANs). These metrics serve to provide a clear, definitive, and unbiased evaluation of the effectiveness of these networks, and are therefore essential in understanding the overall performance and potential improvements that could enhance the operation of GANs.

Some commonly used metrics include:

1. Inception Score (IS):

The Inception Score (IS) is a significant quantitative metric used to evaluate the performance of Generative Adversarial Networks (GANs), particularly in the quality of the images they generate. It was introduced as a means to both quantify and qualify the generated images based on two main factors: diversity and quality.

Diversity refers to the range of different images the GAN can produce. A model that generates a variety of images, rather than repeatedly producing similar or identical ones, would be considered as having high diversity. A higher score in diversity reflects the GAN's ability to capture a wide representation of the dataset it was trained on.

Quality, on the other hand, pertains to how 'real' the generated images are or how close they are to the real images in the training dataset. High-quality images should be indistinguishable from actual photos, demonstrating that the GAN has accurately learned the data distribution of the training set.

The Inception Score uses a pre-trained Inception v3 network to compute these factors. Each generated image is passed through the Inception network, which produces a conditional label distribution. The score is then calculated using these distributions, with the assumption that good models would produce diverse images (high entropy of marginal distribution) but also be confident in their predictions for individual images (low entropy of conditional distribution).

A high Inception Score generally indicates that the GAN is producing diverse, high-quality images that are similar to the real data. However, it's important to note that while the Inception Score can be a useful tool for evaluating and comparing GANs, it's not perfect and has its limitations. For instance, it relies heavily on the Inception model for its calculations, meaning its accuracy is bounded by how well the Inception model was trained.

The Inception Score evaluates the quality and diversity of generated images. It uses a pre-trained Inception v3 network to compute the conditional label distribution p(y|x) for each generated image x and the marginal label distribution p(y). The score is given by:

IS(G)=exp(Ex[DKL(p(y∣x)∣∣p(y))])

A high Inception Score indicates that the generated images are both diverse and of high quality.

2. Fréchet Inception Distance (FID):

The FID metric calculates the similarity between two datasets of images. In this case, it is used to compare the distribution of generated images against the distribution of real images.

The FID calculation involves using a pre-trained Inception v3 model, a model that was originally designed and trained for image classification tasks. This model is used to extract features from both the real and generated images. The extracted features are then represented as a multivariate Gaussian distribution, characterised by a mean and a covariance.

The Fréchet distance is then calculated between these two Gaussians. This distance gives a measure of the similarity between the two sets of images. The lower the FID score, the closer the generated images are to the real images in terms of the distributions of features. Therefore, a lower FID indicates that the generative model has performed better in producing images that are more realistic.

In the context of GANs, the FID score is often used as an evaluation measure to compare the performance of different models or different configurations of the same model. It provides a more reliable and robust evaluation than some other metrics, such as the Inception Score, as it takes into account the full multi-dimensional distribution of features, rather than just looking at marginal and conditional distributions.

The FID score measures the distance between the distributions of real and generated images in the feature space of a pre-trained Inception v3 network. Lower FID scores indicate that the generated images are more similar to the real images.

FID=∣∣μr−μg∣∣2+Tr(Σr+Σg−2(ΣrΣg)1/2)

where (μr,Σr) and (μg,Σg) are the mean and covariance of the real and generated images' feature vectors, respectively.

3. Precision and Recall for Distributions:

Precision and Recall for Distributions are statistical measures used to evaluate the performance of Generative Adversarial Networks (GANs), particularly in terms of the quality and diversity of the data they generate. These metrics are borrowed from the field of information retrieval and are also commonly used to evaluate classification tasks in machine learning.

Precision measures the quality of the generated samples. In the context of GANs, it evaluates how many of the generated samples are 'real' or close to the real data distribution. A high precision score implies that most of the generated samples are of high quality, resembling the real data closely. It indicates that the GAN is doing a good job in generating samples that are almost indistinguishable from the real samples.

Recall, on the other hand, measures the coverage of the real data distribution by the generated samples. It evaluates whether the GAN is able to generate samples that cover the whole range of the real data distribution. A high recall score implies that the GAN has a good understanding of the real data distribution and is able to generate diverse samples that cover different aspects of the real data.

Together, precision and recall provide a comprehensive evaluation of GANs. High precision and recall values indicate that the GAN is generating high-quality samples that cover the diversity of the real data. However, there is often a trade-off between precision and recall. A model that is too focused on getting high-quality samples might miss out on the diversity of the data (high precision, low recall), while a model that focuses on covering the whole data distribution might generate more low-quality samples (low precision, high recall).

To get a balanced view of the model's performance, it's common to combine precision and recall into a single metric called the F1 score. The F1 score is the harmonic mean of precision and recall, and gives equal weight to both measures. A high F1 score indicates that the GAN is performing well in both aspects, generating diverse and high-quality samples.

Precision measures the quality of generated samples, while recall measures the coverage of the real data distribution by the generated samples. High precision and recall values indicate that the GAN is generating high-quality samples that cover the diversity of the real data.

Example: Calculating Inception Score and FID

Here’s how you can calculate the Inception Score and FID using TensorFlow and pre-trained models:

import tensorflow as tf
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from scipy.linalg import sqrtm

# Function to calculate Inception Score
def calculate_inception_score(images, num_splits=10):
    model = InceptionV3(include_top=False, pooling='avg', input_shape=(299, 299, 3))
    images = tf.image.resize(images, (299, 299))
    images = preprocess_input(images)
    preds = model.predict(images)

    scores = []
    for i in range(num_splits):
        part = preds[i * len(preds) // num_splits: (i + 1) * len(preds) // num_splits]
        py = np.mean(part, axis=0)
        scores.append(np.exp(np.mean([np.sum(p * np.log(p / py)) for p in part])))
    return np.mean(scores), np.std(scores)

# Function to calculate FID score
def calculate_fid(real_images, generated_images):
    model = InceptionV3(include_top=False, pooling='avg', input_shape=(299, 299, 3))
    real_images = tf.image.resize(real_images, (299, 299))
    real_images = preprocess_input(real_images)
    gen_images = tf.image.resize(generated_images, (299, 299))
    gen_images = preprocess_input(gen_images)

    act1 = model.predict(real_images)
    act2 = model.predict(gen_images)

    mu1, sigma1 = act1.mean(axis=0), np.cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), np.cov(act2, rowvar=False)

    ssdiff = np.sum((mu1 - mu2) ** 2.0)
    covmean = sqrtm(sigma1.dot(sigma2))

    if np.iscomplexobj(covmean):
        covmean = covmean.real

    fid = ssdiff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
    return fid

# Generate some fake images using the trained GAN generator
noise = np.random.normal(0, 1, (1000, 100))
generated_images = generator.predict(noise)

# Calculate Inception Score
is_mean, is_std = calculate_inception_score(generated_images)
print(f"Inception Score: {is_mean} ± {is_std}")

# Calculate FID Score
real_images = x_train[np.random.choice(x_train.shape[0], 1000, replace=False)]
fid_score = calculate_fid(real_images, generated_images)
print(f"FID Score: {fid_score}")