Chapter 5: Exploring Variational Autoencoders (VAEs)
5.4 Evaluating VAEs
The evaluation process of VAEs is crucial to ensure that the model has learned meaningful latent representations of the data and can generate high-quality samples. This evaluation is done through a combination of quantitative and qualitative methods.
On the quantitative side, evaluation metrics like Reconstruction Loss, Kullback-Leibler (KL) Divergence, Inception Score (IS), and Fréchet Inception Distance (FID) are used. Reconstruction Loss measures how well the VAE's decoder can recreate the original input data. KL Divergence measures the difference between the learned latent distribution and a prior distribution, usually a standard normal distribution. The Inception Score (IS) evaluates the quality and diversity of generated images, while the Fréchet Inception Distance (FID) measures the distance between the distributions of real and generated images.
On the qualitative side, methods such as visual inspection and latent space traversal are used. Visual inspection involves generating a set of images and examining them for realism and diversity. Latent space traversal involves interpolating between points in the latent space and generating images at each step. This can reveal the structure of the latent space and show how smoothly the VAE transitions between different data points.
The evaluation process is crucial in fine-tuning the model and identifying areas for improvement, ultimately leading to better generative performance. By thoroughly evaluating the VAE using these methods, you can ensure that the model has learned meaningful latent representations and can generate high-quality samples.
This section covers both quantitative and qualitative approaches. By the end of this section, you will have a comprehensive understanding of how to assess the performance of VAEs and interpret the results.
5.4.1 Quantitative Evaluation Metrics
Quantitative evaluation metrics are essential tools that provide objective measures for assessing the performance of Variational Autoencoders (VAEs). These metrics offer a robust way to quantify how well a model is performing at its task.
Among the most commonly used metrics are Reconstruction Loss, Kullback-Leibler (KL) Divergence, Inception Score (IS), and Fréchet Inception Distance (FID). Each of these plays a different role in evaluating the model.
Reconstruction Loss measures how well the model can reconstruct the input data; KL Divergence quantifies the difference between the learned latent distribution and the prior distribution; Inception Score (IS) evaluates the quality and diversity of generated samples; and Fréchet Inception Distance (FID) compares the distribution of generated samples to the distribution of real samples.
Understanding Reconstruction Loss
Reconstruction loss is a critical component in the evaluation of Variational Autoencoders (VAEs). It essentially measures the effectiveness of the decoder in reconstructing the original input data starting from the latent variables. These latent variables are a set of representations that capture useful, simplified information about the original data.
In the context of a VAE, the reconstruction loss serves as a means of quantifying the quality of the data generated by the decoder. It is calculated by comparing the generated data to the original input data. The idea here is that a well-performing VAE should be able to recreate data that very closely matches the original input.
Therefore, a lower reconstruction loss is a positive indicator of performance: the closer the reconstructed data is to the original input, the lower the loss. It is a key measurement for understanding the efficacy of the VAE and its ability to produce faithful, accurate reconstructions.
Formula:
$$\text{Reconstruction Loss} = \mathbb{E}_{q(z \mid x)}\left[-\log p(x \mid z)\right]$$
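For the binary image data used in the examples below, the decoder output is typically interpreted as a Bernoulli distribution over pixels. In that case the expectation is estimated with a single sample of z, and the negative log-likelihood reduces to the per-pixel binary cross-entropy. The identity below is a standard reference sketch, with $x_i$ denoting pixel values in $[0, 1]$ and $\hat{x}_i$ the decoder's outputs:
$$-\log p(x \mid z) = -\sum_{i=1}^{D}\left[x_i \log \hat{x}_i + (1 - x_i)\log\left(1 - \hat{x}_i\right)\right], \qquad \hat{x} = \text{decoder}(z)$$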
Kullback-Leibler Divergence (KL)
The Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure that quantifies the difference between two probability distributions. In the context of machine learning, the KL divergence is often used to evaluate the disparity between the learned latent distribution and the prior distribution.
The latent distribution is learned from the data during the training process, while the prior distribution is a predefined distribution that we wish the latent distribution to resemble. The KL divergence provides a numerical measure of how much the learned latent distribution deviates from the prior.
A lower KL divergence value indicates that the learned latent distribution is closer to the desired prior distribution. In essence, the smaller the KL divergence, the better the learned model is at approximating the desired distribution. Therefore, minimizing the KL divergence is often a goal in machine learning tasks.
Formula:
$$\text{KL Divergence} = D_{\mathrm{KL}}\left(q(z \mid x)\,\|\,p(z)\right)$$
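For the common case of a diagonal-Gaussian encoder $q(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$, this divergence has a closed form. This is the standard identity that the code below evaluates from z_mean and z_log_var (where $\log \sigma_j^2$ corresponds to z_log_var):
$$D_{\mathrm{KL}}\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\,\|\,\mathcal{N}(0, I)\right) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$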
Example: Calculating Reconstruction Loss and KL Divergence
import numpy as np
import tensorflow as tf

# Calculate Reconstruction Loss and KL Divergence
def calculate_losses(vae, x_test):
    # The encoder returns the mean, log-variance and a sample of the latent variable
    z_mean, z_log_var, z = vae.get_layer('encoder').predict(x_test)
    x_decoded = vae.predict(x_test)

    # Reconstruction Loss: per-sample binary cross-entropy summed over pixels
    # (assumes x_test is flattened, e.g. shape (n_samples, 784))
    reconstruction_loss = tf.keras.losses.binary_crossentropy(x_test, x_decoded)
    reconstruction_loss = np.mean(reconstruction_loss * x_test.shape[1])

    # KL Divergence between N(z_mean, exp(z_log_var)) and the standard normal prior
    kl_loss = 1 + z_log_var - np.square(z_mean) - np.exp(z_log_var)
    kl_loss = np.mean(-0.5 * np.sum(kl_loss, axis=-1))
    return reconstruction_loss, kl_loss

# Calculate losses on test data
reconstruction_loss, kl_loss = calculate_losses(vae, x_test)
print(f"Reconstruction Loss: {reconstruction_loss}")
print(f"KL Divergence: {kl_loss}")
This example code defines a function to calculate two types of losses, Reconstruction Loss and KL Divergence, in a variational autoencoder (VAE).
The function 'calculate_losses' takes the VAE model and test data as inputs. It first uses the encoder part of the VAE to predict the latent vector 'z' from the test data and then uses the complete VAE to generate the reconstructed data.
The Reconstruction Loss is the mean binary cross-entropy loss between the original test data and the reconstructed data, scaled by the number of features in the data.
The Kullback-Leibler (KL) Divergence loss is computed from the mean and log variance of the latent vector 'z'. It measures the divergence of the learned distribution of 'z' from the standard normal distribution.
Lastly, the function returns both losses. The last part of the code uses this function to compute the losses on the test data and print them.
Inception Score (IS)
The Inception Score is a popular metric used to evaluate the quality and diversity of images generated by generative models, primarily Generative Adversarial Networks (GANs). It acts as a quantitative measure that reflects how good the generated images are.
The Inception Score uses a pre-trained Inception network, a type of deep convolutional neural network designed for image classification. This pre-trained network is used to predict the class labels of the generated images. The class labels, in this case, could be any predefined categories that the images could potentially fall into.
Once these predictions are obtained, the Inception Score calculates the Kullback-Leibler (KL) divergence between each image's predicted class distribution and the marginal class distribution over all generated images. The KL divergence measures how much one probability distribution diverges from another. In this context, the average divergence is high when each image receives a confident, peaked prediction (a sign of quality) while the marginal distribution is spread across many categories (a sign of diversity), so a higher score indicates better quality and diversity of the generated images.
Formula:
$$\mathrm{IS} = \exp\left(\mathbb{E}_{x}\left[D_{\mathrm{KL}}\left(p(y \mid x)\,\|\,p(y)\right)\right]\right)$$
Where:
- p(y∣x) is the conditional probability of label y given image x, as predicted by the Inception network.
- p(y) is the marginal distribution of labels, calculated as the mean of p(y∣x) over the generated images.
This formula computes the KL divergence between each image's conditional label distribution and the marginal label distribution, averages it over all generated images, and exponentiates the result. The Inception Score therefore rewards both quality (confident, low-entropy predictions for individual images) and diversity (a marginal label distribution spread across many classes).
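To make the formula concrete before the full example, here is a minimal sketch that applies it directly to a small matrix of made-up class probabilities (hypothetical numbers, not real Inception predictions):

import numpy as np

# Hypothetical predicted class probabilities for 4 images over 3 classes (each row sums to 1);
# in practice these rows would come from the Inception network.
p_yx = np.array([
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
    [0.80, 0.10, 0.10],
])

p_y = p_yx.mean(axis=0)  # marginal label distribution p(y)
kl_per_image = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)  # D_KL(p(y|x) || p(y)) per image
inception_score = np.exp(kl_per_image.mean())  # IS = exp of the average divergence
print(f"Toy Inception Score: {inception_score:.3f}")

Confident rows that differ from the broad marginal distribution give a larger average divergence, and therefore a larger score.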
Example: Calculating Inception Score
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from scipy.stats import entropy

# Function to calculate Inception Score
def calculate_inception_score(images, n_split=10, eps=1E-16):
    # Full InceptionV3 (with the classification head), so predictions are
    # probabilities over the 1,000 ImageNet classes
    model = InceptionV3(weights='imagenet')
    images = tf.convert_to_tensor(images, dtype=tf.float32)
    if images.shape[-1] == 1:
        # InceptionV3 expects 3-channel inputs; replicate the grayscale channel
        images = tf.image.grayscale_to_rgb(images)
    images_resized = tf.image.resize(images, (299, 299))
    # preprocess_input expects pixel values in [0, 255]; assumes the decoder outputs values in [0, 1]
    images_preprocessed = preprocess_input(images_resized * 255.0)
    preds = model.predict(images_preprocessed)
    split_scores = []
    for i in range(n_split):
        part = preds[i * preds.shape[0] // n_split: (i + 1) * preds.shape[0] // n_split]
        py = np.mean(part, axis=0)  # marginal class distribution for this split
        scores = []
        for p in part:
            # KL(p(y|x) || p(y)); eps guards against log(0)
            scores.append(entropy(p + eps, py + eps))
        split_scores.append(np.exp(np.mean(scores)))
    return np.mean(split_scores), np.std(split_scores)

# Generate images for evaluation
n_samples = 1000
random_latent_vectors = np.random.normal(size=(n_samples, latent_dim))
generated_images = decoder.predict(random_latent_vectors)
generated_images = generated_images.reshape((n_samples, 28, 28, 1))

# Calculate Inception Score
is_mean, is_std = calculate_inception_score(generated_images)
print(f"Inception Score: {is_mean} ± {is_std}")
In this example:
The function calculate_inception_score takes three parameters: a set of images, the number of parts to split these images into (n_split), and a small constant (eps) to prevent numerical issues such as taking the logarithm of zero. The function starts by loading a pre-trained InceptionV3 model from the Keras applications module. This model is a deep convolutional neural network that has been trained on over a million images from the ImageNet database and can classify images into 1,000 object categories.
Next, the function converts the single-channel images to three channels, resizes them to the 299x299 input size expected by InceptionV3, and applies the model's standard pre-processing. It then uses the InceptionV3 model to predict class probabilities for the pre-processed images: for each image, the prediction is a probability for each of the 1,000 object categories.
Following this, the function calculates the Inception Score for each split of the predictions. For each split, it computes the average prediction, which serves as an estimate of the marginal class distribution. Then, for each image in the split, it computes the KL divergence between the image's predicted class distribution and that marginal distribution (scipy's entropy function with two arguments returns exactly this divergence). Larger divergences indicate confident predictions that differ from a broad marginal; the function exponentiates the mean of these divergences to obtain the Inception Score for the split.
This process is repeated for all parts, and the function finally returns the mean and standard deviation of all the Inception Scores. These two values give an overall measure of the quality and diversity of the generated images, with higher mean values indicating better quality and diversity, and lower standard deviation values indicating more consistent results across different parts.
Finally, the code generates a number of images using a decoder. This is a part of a generative model (such as a GAN or VAE) that transforms points in a latent space to images. The latent space is a lower-dimensional space that the model has learned to represent the input data.
The code generates random points in this latent space, using the standard normal distribution, and applies the decoder to these points to generate images. It then reshapes the images to the desired shape and calculates their Inception Score using the function defined earlier. The resulting Inception Score gives a quantitative measure of the quality and diversity of the images that the generative model is capable of producing.
Fréchet Inception Distance (FID)
The Fréchet Inception Distance, often abbreviated as FID, is a metric that quantifies the difference between the distribution of images that are generated by a model and the distribution of real-life images. This measurement is based on the concept of the Fréchet distance, which can be understood as a measure of similarity between two statistical distributions.
In the context of FID, these two distributions are derived from features extracted from an intermediate layer of the Inception network. One distribution is obtained from genuine, real-life images while the other is derived from images generated by a model.
The central principle underpinning the FID score is that if the generated images are of high quality, the two distributions should be similar, thus resulting in a lower FID score. Conversely, if the generated images are less like the real images, the FID score will be higher. Therefore, a lower FID score is indicative of better performance as it signifies that the model-generated images are more similar to the distribution of real images.
Formula:
$$\mathrm{FID} = \left\lVert \mu_r - \mu_g \right\rVert^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the means and covariances of the real and generated image feature distributions, respectively.
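Before applying the formula with real Inception features, the following minimal sketch evaluates it on two small made-up feature matrices (hypothetical 2-dimensional features, standing in for the 2048-dimensional Inception activations used in the full example below):

import numpy as np
from scipy.linalg import sqrtm

# Hypothetical feature vectors for a handful of real and generated images
real_feats = np.array([[0.2, 1.1], [0.4, 0.9], [0.1, 1.3], [0.3, 1.0]])
gen_feats = np.array([[0.5, 0.7], [0.6, 0.8], [0.4, 0.6], [0.7, 0.9]])

# Means and covariances of each feature distribution
mu_r, sigma_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
mu_g, sigma_g = gen_feats.mean(axis=0), np.cov(gen_feats, rowvar=False)

diff = np.sum((mu_r - mu_g) ** 2)            # ||mu_r - mu_g||^2
covmean = sqrtm(sigma_r.dot(sigma_g)).real   # (Sigma_r Sigma_g)^(1/2), real part only
fid = diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)
print(f"Toy FID: {fid:.4f}")

The same mechanics, applied to Inception activations of real and generated images, produce the FID score computed in the full example.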
Example: Calculating FID
import numpy as np
import tensorflow as tf
from numpy import cov, trace, iscomplexobj
from scipy.linalg import sqrtm
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Function to calculate FID
def calculate_fid(real_images, generated_images):
    # Headless InceptionV3 with average pooling: outputs 2048-dimensional feature vectors
    model = InceptionV3(include_top=False, pooling='avg', input_shape=(299, 299, 3))
    # Convert the single-channel MNIST images to 3 channels, as InceptionV3 expects RGB inputs
    real_images = tf.image.grayscale_to_rgb(tf.convert_to_tensor(real_images, dtype=tf.float32))
    generated_images = tf.image.grayscale_to_rgb(tf.convert_to_tensor(generated_images, dtype=tf.float32))
    real_images_resized = tf.image.resize(real_images, (299, 299))
    generated_images_resized = tf.image.resize(generated_images, (299, 299))
    # preprocess_input expects pixel values in [0, 255]; assumes inputs are in [0, 1]
    real_images_preprocessed = preprocess_input(real_images_resized * 255.0)
    generated_images_preprocessed = preprocess_input(generated_images_resized * 255.0)
    act1 = model.predict(real_images_preprocessed)
    act2 = model.predict(generated_images_preprocessed)
    # Mean and covariance of the activations for each set of images
    mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False)
    # Squared difference between the means
    ssdiff = np.sum((mu1 - mu2) ** 2.0)
    # Matrix square root of the product of the covariances
    covmean = sqrtm(sigma1.dot(sigma2))
    if iscomplexobj(covmean):
        covmean = covmean.real
    fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean)
    return fid

# Sample real images
real_images = x_test[:n_samples].reshape((n_samples, 28, 28, 1))

# Calculate FID
fid_score = calculate_fid(real_images, generated_images)
print(f"FID Score: {fid_score}")
In this example:
The code begins by importing the necessary libraries and defining a function, calculate_fid(), which takes as input the two sets of images to be compared.
Next, the script loads the InceptionV3 model. This model is a pre-trained convolutional neural network that has been trained on a large image dataset and can classify images into a thousand different categories. It is highly effective at extracting useful features from images and is often used in tasks that require understanding the content of images.
The code then converts the single-channel images to three channels, resizes them to the 299x299 input size expected by InceptionV3, and applies the pre-processing the model expects.
The preprocessed and resized images are then passed through the InceptionV3 model to extract the activations. These activations serve as a kind of 'summary' of the image content, capturing important features but discarding redundant information.
Following this, the script calculates the mean and covariance of the activations for both the real and generated images. These statistical properties capture important characteristics of the distributions of the images in the latent (feature) space.
The FID score is then computed using a formula that takes into account both the difference in means and the covariances of the real and generated images. The square root of the product of the covariances is calculated using the sqrtm() function from the scipy.linalg library. If the result is a complex matrix, only the real part is kept.
The final FID score is calculated by adding the sum of squared differences between the means of the real and generated images and the trace of the sum of the covariances of the real and generated images minus twice the square root of the product of the covariances.
The function calculate_fid() returns this FID score. The lower the score, the more similar the two sets of images are in terms of their distributions in the feature space, so it serves as an effective measure of the quality of the images produced by the VAE or similar generative models.
A sample of real images is then selected from the test set and reshaped to suit the model's requirements.
Finally, the FID score is calculated for the real and generated images, and the result is printed to the console. This score provides a quantifiable measure of how well the model is performing at generating new images that resemble the real ones.
5.4.2 Qualitative Evaluation
Qualitative evaluation is a critical step in the process of assessing the output of any generative model. This non-numerical method involves a detailed visual inspection of the images that the model produces. The primary purpose of this visual inspection is to evaluate the quality and diversity of the generated images.
Although this method might seem subjective due to the reliance on visual assessment, it actually offers valuable insights into the model's performance that quantitative methods might not capture.
By evaluating the images visually, we can get a sense of the model's ability to produce diverse outputs and to capture the essential characteristics of the input data. This, in turn, helps us to understand the strengths and weaknesses of the model and to make informed decisions about possible improvements or adjustments.
Visual Inspection Process
The process of visual inspection involves the creation of a diverse set of images that are then carefully examined to evaluate their level of realism and the variety they exhibit. This hands-on approach is crucial in identifying any glaring issues that may be present.
Some of these potential problems could include a lack of sharpness resulting in blurriness, unwanted elements or irregularities that are referred to as artifacts, or a phenomenon known as mode collapse. The latter is a situation where the model, instead of generating a wide variety of outputs, repeatedly produces the same or very similar images.
Through this detailed visual inspection, we can ensure that the generated images not only appear realistic but also display a wide array of different characteristics, thus enhancing the overall performance and practical application of the model.
Example: Visualizing Generated Images
import numpy as np
import matplotlib.pyplot as plt

# Function to visualize generated images
def visualize_generated_images(decoder, latent_dim, n_samples=10):
    # Sample random points in the latent space and decode them into images
    random_latent_vectors = np.random.normal(size=(n_samples, latent_dim))
    generated_images = decoder.predict(random_latent_vectors)
    generated_images = generated_images.reshape((n_samples, 28, 28))
    plt.figure(figsize=(10, 2))
    for i in range(n_samples):
        plt.subplot(1, n_samples, i + 1)
        plt.imshow(generated_images[i], cmap='gray')
        plt.axis('off')
    plt.show()

# Visualize generated images
visualize_generated_images(decoder, latent_dim)
In this example:
The Python code provided is used to visualize images that are generated by a decoder, a component of the VAE. The function visualize_generated_images takes three parameters: decoder, latent_dim, and n_samples. The decoder is a trained model that can generate images from points in the latent space, latent_dim is the dimension of the latent space, and n_samples is the number of images to be generated.
The function begins by generating random latent vectors. These vectors are points in the latent space from which the images will be generated, and they are drawn from a standard normal distribution with shape (n_samples, latent_dim).
These random latent vectors are then passed to the decoder using the predict method, which generates an image for each vector. The generated images are reshaped to a 2D format suitable for plotting.
The function then creates a figure using matplotlib.pyplot and plots each generated image in a subplot. The images are displayed in grayscale, and axis('off') turns off the axes for each subplot. Finally, the function displays the plot using plt.show().
The last line of code in the snippet calls this function, passing the decoder and latent_dim as arguments, to visualize images generated by the decoder from the latent space. This visualization is useful in the qualitative evaluation of the VAE, where the quality and diversity of the generated images are assessed.
Latent Space Traversal
Latent space traversal is a powerful technique that is primarily concerned with the interpolation between distinct points within the latent space, which is a compressed representation of our data. At each step of this process, images are generated which provide a visual representation of these points within the latent space.
This method serves as an important tool for the visualization of the smooth transitions that the Variational Autoencoder (VAE) makes between different data points. By observing these transitions, we can gain valuable insights into how the VAE processes and interprets data.
Furthermore, latent space traversal can be used to reveal the inherent structure of the latent space. By understanding this structure, we can better comprehend how the VAE learns to encode and decode data, and how it identifies and leverages the key features of the data to create a robust and efficient representation.
Example: Latent Space Traversal
import numpy as np
import matplotlib.pyplot as plt

# Function to perform latent space traversal
def latent_space_traversal(decoder, latent_dim, n_steps=10):
    # Pick two random points in the latent space
    start_point = np.random.normal(size=(latent_dim,))
    end_point = np.random.normal(size=(latent_dim,))
    # Linearly interpolate between them, giving an (n_steps, latent_dim) batch of latent vectors
    interpolation = np.linspace(start_point, end_point, n_steps)
    generated_images = decoder.predict(interpolation)
    generated_images = generated_images.reshape((n_steps, 28, 28))
    plt.figure(figsize=(15, 2))
    for i in range(n_steps):
        plt.subplot(1, n_steps, i + 1)
        plt.imshow(generated_images[i], cmap='gray')
        plt.axis('off')
    plt.show()

# Perform latent space traversal
latent_space_traversal(decoder, latent_dim)
This example code defines and executes a function called latent_space_traversal, which is used to explore the latent space of generative models such as VAEs and GANs.
In this function, a start point and an end point are randomly selected in the latent space. Then, a linear interpolation between these two points is created. The decoder is used to generate images from these interpolated points.
The generated images are then reshaped and displayed in a row, providing a visual representation of the traversal through the latent space from the start point to the end point.
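A related way to probe the structure of the latent space is to vary one latent dimension at a time while holding the others fixed. The sketch below follows this idea under the same assumptions as the previous example (a trained decoder and a known latent_dim); the function name and its parameters are illustrative rather than part of any library:

# Sketch: traverse a single latent dimension while holding the others fixed
def single_dimension_traversal(decoder, latent_dim, dim=0, n_steps=10, value_range=3.0):
    base = np.random.normal(size=(latent_dim,))            # fixed anchor point in the latent space
    values = np.linspace(-value_range, value_range, n_steps)
    latent_vectors = np.tile(base, (n_steps, 1))
    latent_vectors[:, dim] = values                        # sweep only the chosen dimension
    images = decoder.predict(latent_vectors).reshape((n_steps, 28, 28))
    plt.figure(figsize=(15, 2))
    for i in range(n_steps):
        plt.subplot(1, n_steps, i + 1)
        plt.imshow(images[i], cmap='gray')
        plt.axis('off')
    plt.show()

# Inspect how the first latent dimension affects the generated images
single_dimension_traversal(decoder, latent_dim, dim=0)

If a single dimension controls a recognizable property of the output, such as stroke thickness or digit shape, it will show up clearly in the resulting row of images.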
Summary
Evaluating Variational Autoencoders (VAEs) involves a combination of quantitative and qualitative methods. Quantitative metrics such as Reconstruction Loss, KL Divergence, Inception Score (IS), and Fréchet Inception Distance (FID) provide objective measures of the model's performance.
Qualitative evaluation through visual inspection and latent space traversal offers insights into the quality and diversity of the generated images. By thoroughly evaluating the VAE using these methods, you can ensure that the model has learned meaningful latent representations and can generate high-quality samples. This comprehensive evaluation process helps in fine-tuning the model and identifying areas for improvement, ultimately leading to better generative performance.
5.4 Evaluating VAEs
The evaluation process of VAEs is crucial to ensure that the model has learned meaningful latent representations of the data and can generate high-quality samples. This evaluation is done through a combination of quantitative and qualitative methods.
On the quantitative side, evaluation metrics like Reconstruction Loss, Kullback-Leibler (KL) Divergence, Inception Score (IS), and Fréchet Inception Distance (FID) are used. Reconstruction Loss measures how well the VAE's decoder can recreate the original input data. KL Divergence measures the difference between the learned latent distribution and a prior distribution, usually a standard normal distribution. The Inception Score (IS) evaluates the quality and diversity of generated images, while the Fréchet Inception Distance (FID) measures the distance between the distributions of real and generated images.
On the qualitative side, methods such as visual inspection and latent space traversal are used. Visual inspection involves generating a set of images and examining them for realism and diversity. Latent space traversal involves interpolating between points in the latent space and generating images at each step. This can reveal the structure of the latent space and show how smoothly the VAE transitions between different data points.
The evaluation process is crucial in fine-tuning the model and identifying areas for improvement, ultimately leading to better generative performance. By thoroughly evaluating the VAE using these methods, you can ensure that the model has learned meaningful latent representations and can generate high-quality samples.
This section covers both quantitative and qualitative approaches. By the end of this section, you will have a comprehensive understanding of how to assess the performance of VAEs and interpret the results.
5.4.1 Quantitative Evaluation Metrics
Quantitative evaluation metrics are essential tools that provide objective measures to assess the performance of Variational Autoencoders (VAEs), a particular type of machine learning models. These metrics offer a robust way to quantify how well the models are performing in their tasks.
Among the most commonly used metrics in this field are Reconstruction Loss, Kullback-Leibler (KL) Divergence, Inception Score (IS), and Fréchet Inception Distance (FID). Each of these play a different role in evaluating the model.
Reconstruction Loss measures how well the model can reconstruct the input data, KL Divergence quantifies the difference between the model's learned distribution and the true data distribution, Inception Score (IS) evaluates the quality and diversity of generated samples, and Fréchet Inception Distance (FID) compares the distribution of generated samples to real samples.
Understanding Reconstruction Loss
Reconstruction loss is a critical component in the evaluation of Variational Autoencoders (VAEs). It essentially measures the effectiveness of the decoder in reconstructing the original input data starting from the latent variables. These latent variables are a set of representations that capture useful, simplified information about the original data.
In the context of a VAE, the reconstruction loss serves as a means of quantifying the quality of the data generated by the decoder. It is calculated by comparing the generated data to the original input data. The idea here is that a well-performing VAE should be able to recreate data that very closely matches the original input.
Therefore, a lower reconstruction loss is a positive indicator of performance. It suggests that the VAE is able to generate data that is highly similar to the original input. The closer the generated data is to the original, the lower the reconstruction loss. It's a key measurement in understanding the efficacy of the VAE and its ability to generate believable, accurate results.
Formula:
Reconstruction Loss=E
q(z∣x)
[−logp(x∣z)]
Kullback-Leibler Divergence (KL)
The Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure that quantifies the difference between two probability distributions. In the context of machine learning, the KL divergence is often used to evaluate the disparity between the learned latent distribution and the prior distribution.
The latent distribution is learned from the data during the training process, while the prior distribution is a predefined distribution that we wish the latent distribution to resemble. The KL divergence provides a numerical measure of how much the learned latent distribution deviates from the prior.
A lower KL divergence value indicates that the learned latent distribution is closer to the desired prior distribution. In essence, the smaller the KL divergence, the better the learned model is at approximating the desired distribution. Therefore, minimizing the KL divergence is often a goal in machine learning tasks.
Formula:
KL Divergence=D
KL
(q(z∣x)∥p(z))
Example: Calculating Reconstruction Loss and KL Divergence
import numpy as np
# Calculate Reconstruction Loss and KL Divergence
def calculate_losses(vae, x_test):
z_mean, z_log_var, z = vae.get_layer('encoder').predict(x_test)
x_decoded = vae.predict(x_test)
# Reconstruction Loss
reconstruction_loss = tf.keras.losses.binary_crossentropy(x_test, x_decoded)
reconstruction_loss = np.mean(reconstruction_loss * x_test.shape[1])
# KL Divergence
kl_loss = 1 + z_log_var - np.square(z_mean) - np.exp(z_log_var)
kl_loss = np.mean(-0.5 * np.sum(kl_loss, axis=-1))
return reconstruction_loss, kl_loss
# Calculate losses on test data
reconstruction_loss, kl_loss = calculate_losses(vae, x_test)
print(f"Reconstruction Loss: {reconstruction_loss}")
print(f"KL Divergence: {kl_loss}")
This example code defines a function to calculate two types of losses, Reconstruction Loss and KL Divergence, in a variational autoencoder (VAE).
The function 'calculate_losses' takes the VAE model and test data as inputs. It first uses the encoder part of the VAE to predict the latent vector 'z' from the test data and then uses the complete VAE to generate the reconstructed data.
The Reconstruction Loss is the mean binary cross-entropy loss between the original test data and the reconstructed data, scaled by the number of features in the data.
The Kullback-Leibler (KL) Divergence loss is computed from the mean and log variance of the latent vector 'z'. It measures the divergence of the learned distribution of 'z' from the standard normal distribution.
Lastly, the function returns both losses. The last part of the code uses this function to compute the losses on the test data and print them.
Inception Score (IS)
The Inception Score is a popular metric used to evaluate the quality and diversity of images generated by generative models, primarily Generative Adversarial Networks (GANs). It acts as a quantitative measure that reflects how good the generated images are.
The Inception Score uses a pre-trained Inception network, a type of deep convolutional neural network designed for image classification. This pre-trained network is used to predict the class labels of the generated images. The class labels, in this case, could be any predefined categories that the images could potentially fall into.
Once these class labels are predicted, the Inception Score then calculates the Kullback-Leibler (KL) divergence between the predicted class distribution and the marginal class distribution. The KL divergence essentially measures how one probability distribution diverges from a second, expected probability distribution. In this context, a higher KL divergence means that the generated images cover a broader range of categories, indicating both good quality and diversity of the images produced by the generative model.
Formula:
IS=exp(Ex[DKL(p(y∣x)∥p(y))])
Where:
- p(y∣x) is the conditional probability of label y given image x, as predicted by the Inception network.
- p(y) is the marginal distribution of labels, calculated as the mean of p(y∣x) over the generated images.
This formula calculates the KL divergence between the conditional label distribution for each image and the marginal label distribution, averaged over all generated images, and then exponentiated. The Inception Score thus measures both the quality (high confidence predictions) and diversity (similar distribution to real images) of the generated images.
Example: Calculating Inception Score
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from scipy.stats import entropy
# Function to calculate Inception Score
def calculate_inception_score(images, n_split=10, eps=1E-16):
model = InceptionV3(include_top=False, pooling='avg', input_shape=(299, 299, 3))
images_resized = tf.image.resize(images, (299, 299))
images_preprocessed = preprocess_input(images_resized)
preds = model.predict(images_preprocessed)
split_scores = []
for i in range(n_split):
part = preds[i * preds.shape[0] // n_split: (i + 1) * preds.shape[0] // n_split]
py = np.mean(part, axis=0)
scores = []
for p in part:
scores.append(entropy(p, py))
split_scores.append(np.exp(np.mean(scores)))
return np.mean(split_scores), np.std(split_scores)
# Generate images for evaluation
n_samples = 1000
random_latent_vectors = np.random.normal(size=(n_samples, latent_dim))
generated_images = decoder.predict(random_latent_vectors)
generated_images = generated_images.reshape((n_samples, 28, 28, 1))
# Calculate Inception Score
is_mean, is_std = calculate_inception_score(generated_images)
print(f"Inception Score: {is_mean} ± {is_std}")
In this example:
The function calculate_inception_score
in the code takes three parameters: a set of images, the number of parts to split these images into (n_split
), and a small constant (eps
) to prevent division by zero errors or taking the logarithm of zero. The function starts by loading a pre-trained InceptionV3 model from the Keras applications module. This model is a deep convolutional neural network that has been trained on over a million images from the ImageNet database, and it is capable of classifying images into 1000 object categories.
Next, the function resizes the images to match the input shape expected by the InceptionV3 model (299x299 pixels), and applies the necessary pre-processing steps. It then uses the InceptionV3 model to predict the class labels for the pre-processed images. The resulting predictions are probabilities for each of the 1000 object categories, for each image.
Following this, the function calculates the Inception Score for each part of the split images. It does so by dividing the predictions into parts, and for each part, it calculates the average prediction (which serves as an estimate of the marginal class distribution). Then, for each image in the part, it calculates the entropy between the image's predicted class distribution and the average class distribution. The entropy measures the similarity between these two distributions, with smaller values indicating more similar distributions. The function then calculates the exponential of the mean of these entropies, to yield the Inception Score for the part.
This process is repeated for all parts, and the function finally returns the mean and standard deviation of all the Inception Scores. These two values give an overall measure of the quality and diversity of the generated images, with higher mean values indicating better quality and diversity, and lower standard deviation values indicating more consistent results across different parts.
Finally, the code generates a number of images using a decoder. This is a part of a generative model (such as a GAN or VAE) that transforms points in a latent space to images. The latent space is a lower-dimensional space that the model has learned to represent the input data.
The code generates random points in this latent space, using the standard normal distribution, and applies the decoder to these points to generate images. It then reshapes the images to the desired shape and calculates their Inception Score using the function defined earlier. The resulting Inception Score gives a quantitative measure of the quality and diversity of the images that the generative model is capable of producing.
Fréchet Inception Distance (FID)
The Fréchet Inception Distance, often abbreviated as FID, is a metric that quantifies the difference between the distribution of images that are generated by a model and the distribution of real-life images. This measurement is based on the concept of the Fréchet distance, which can be understood as a measure of similarity between two statistical distributions.
In the context of FID, these two distributions are derived from features extracted from an intermediate layer of the Inception network. One distribution is obtained from genuine, real-life images while the other is derived from images generated by a model.
The central principle underpinning the FID score is that if the generated images are of high quality, the two distributions should be similar, thus resulting in a lower FID score. Conversely, if the generated images are less like the real images, the FID score will be higher. Therefore, a lower FID score is indicative of better performance as it signifies that the model-generated images are more similar to the distribution of real images.
Formula:
FID=∣∣μr−μg∣∣2+Tr(Σr+Σg−2(ΣrΣg)1/2)
where μr,Σr and μg,Σg are the means and covariances of the real and generated image distributions, respectively.
Example: Calculating FID
from numpy import cov, trace, iscomplexobj
from scipy.linalg import sqrtm
# Function to calculate FID
def calculate_fid(real_images, generated_images):
model = InceptionV3(include_top=False, pooling='avg', input_shape=(299, 299, 3))
real_images_resized = tf.image.resize(real_images, (299, 299))
generated_images_resized = tf.image.resize(generated_images, (299, 299))
real_images_preprocessed = preprocess_input(real_images_resized)
generated_images_preprocessed = preprocess_input(generated_images_resized)
act1 = model.predict(real_images_preprocessed)
act2 = model.predict(generated_images_preprocessed)
mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False)
mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False)
ssdiff = np.sum((mu1 - mu2) ** 2.0)
covmean = sqrtm(sigma1.dot(sigma2))
if iscomplexobj(covmean):
covmean = covmean.real
fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean)
return fid
# Sample real images
real_images = x_test[:n_samples].reshape((n_samples, 28, 28, 1))
# Calculate FID
fid_score = calculate_fid(real_images, generated_images)
print(f"FID Score: {fid_score}")
In this example:
The code begins by importing the necessary libraries and defining a function, calculate_fid()
, which takes as input the two sets of images to be compared.
Next, the script loads the InceptionV3 model. This model is a pre-trained convolutional neural network that has been trained on a large image dataset and can classify images into a thousand different categories. It is highly effective at extracting useful features from images and is often used in tasks that require understanding the content of images.
The code then resizes the input images to fit the InceptionV3 model's expected input size of 299x299 pixels. Images are also preprocessed to match the format expected by the model.
The preprocessed and resized images are then passed through the InceptionV3 model to extract the activations. These activations serve as a kind of 'summary' of the image content, capturing important features but discarding redundant information.
Following this, the script calculates the mean and covariance of the activations for both the real and generated images. These statistical properties capture important characteristics of the distributions of the images in the latent (feature) space.
The FID score is then computed using a formula that takes into account both the difference in means and covariances of the real and generated images. The square root of the product of the covariances is calculated using the sqrtm()
function from the scipy.linalg
library. If the result is a complex number, only the real part is kept.
The final FID score is calculated by adding the sum of squared differences between the means of the real and generated images and the trace of the sum of the covariances of the real and generated images minus twice the square root of the product of the covariances.
The function calculate_fid()
returns this calculated FID score. The lower the FID score, the more similar the two sets of images are in terms of their distributions in the latent space. Hence, this score serves as an effective measure of the quality of the images generated by the GAN or similar models.
A sample of real images is then selected from the test set and reshaped to suit the model's requirements.
Finally, the FID score is calculated for the real and generated images, and the result is printed to the console. This score provides a quantifiable measure of how well the model is performing at generating new images that resemble the real ones.
5.4.2 Qualitative Evaluation
Qualitative evaluation is a critical step in the process of assessing the output of any generative model. This non-numerical method involves a detailed visual inspection of the images that the model produces. The primary purpose of this visual inspection is to evaluate the quality and diversity of the generated images.
Although this method might seem subjective due to the reliance on visual assessment, it actually offers valuable insights into the model's performance that quantitative methods might not capture.
By evaluating the images visually, we can get a sense of the model's ability to produce diverse outputs and to capture the essential characteristics of the input data. This, in turn, helps us to understand the strengths and weaknesses of the model and to make informed decisions about possible improvements or adjustments.
Visual Inspection Process
The process of visual inspection involves the creation of a diverse set of images that are then carefully examined to evaluate their level of realism and the variety they exhibit. This hands-on approach is crucial in identifying any glaring issues that may be present.
Some of these potential problems could include a lack of sharpness resulting in blurriness, unwanted elements or irregularities that are referred to as artifacts, or a phenomenon known as mode collapse. The latter is a situation where the model, instead of generating a wide variety of outputs, repeatedly produces the same or very similar images.
Through this detailed visual inspection, we can ensure that the generated images not only appear realistic but also display a wide array of different characteristics, thus enhancing the overall performance and practical application of the model.
Example: Visualizing Generated Images
import matplotlib.pyplot as plt
# Function to visualize generated images
def visualize_generated_images(decoder, latent_dim, n_samples=10):
random_latent_vectors = np.random.normal(size=(n_samples, latent_dim))
generated_images = decoder.predict(random_latent_vectors)
generated_images = generated_images.reshape((n_samples, 28, 28))
plt.figure(figsize=(10, 2))
for i in range(n_samples):
plt.subplot(1, n_samples, i + 1)
plt.imshow(generated_images[i], cmap='gray')
plt.axis('off')
plt.show()
# Visualize generated images
visualize_generated_images(decoder, latent_dim)
In this example:
The Python code provided is used to visualize images that are generated by a decoder, a component of the VAE. The function visualize_generated_images
takes three parameters: decoder
, latent_dim
, and n_samples
. The decoder
is a trained model that can generate images from points in the latent space. The latent_dim
is the dimension of the latent space, and n_samples
is the number of images to be generated.
The function begins by generating random latent vectors. These vectors are points in the latent space from which the images will be generated. The latent vectors are generated from a standard normal distribution with a size of (n_samples, latent_dim)
.
These random latent vectors are then passed to the decoder
using the predict
function. The decoder
generates the images from these latent vectors. The generated images are then reshaped to a 2D format suitable for plotting.
The function then creates a figure using matplotlib.pyplot
and plots each generated image in a subplot. The images are displayed in grayscale. The axis('off')
function is used to turn off the axis for each subplot.
Finally, the function displays the plot using plt.show()
.
The last line of code in the snippet calls this function, passing the decoder
and latent_dim
as arguments, to visualize the images generated by the decoder from the latent space. This visualization is useful in qualitative evaluations of the VAE model, where the quality and diversity of the images generated by the model are assessed.
Latent Space Traversal
Latent space traversal is a powerful technique that is primarily concerned with the interpolation between distinct points within the latent space, which is a compressed representation of our data. At each step of this process, images are generated which provide a visual representation of these points within the latent space.
This method serves as an important tool for the visualization of the smooth transitions that the Variational Autoencoder (VAE) makes between different data points. By observing these transitions, we can gain valuable insights into how the VAE processes and interprets data.
Furthermore, latent space traversal can be used to reveal the inherent structure of the latent space. By understanding this structure, we can better comprehend how the VAE learns to encode and decode data, and how it identifies and leverages the key features of the data to create a robust and efficient representation.
Example: Latent Space Traversal
# Function to perform latent space traversal
def latent_space_traversal(decoder, latent_dim, n_steps=10):
start_point = np.random.normal(size=(1, latent_dim))
end_point = np.random.normal(size=(1, latent_dim))
interpolation = np.linspace(start_point, end_point, n_steps)
generated_images = decoder.predict(interpolation)
generated_images = generated_images.reshape((n_steps, 28, 28))
plt.figure(figsize=(15, 2))
for i in range(n_steps):
plt.subplot(1, n_steps, i + 1)
plt.imshow(generated_images[i], cmap='gray')
plt.axis('off')
plt.show()
# Perform latent space traversal
latent_space_traversal(decoder, latent_dim)
This example code defines and executes a function called latent_space_traversal
. This function is used to explore the latent space in generative models, such as autoencoders or GANs.
In this function, a start point and an end point are randomly selected in the latent space. Then, a linear interpolation between these two points is created. The decoder is used to generate images from these interpolated points.
The generated images are then reshaped and displayed in a row, providing a visual representation of the traversal through the latent space from the start point to the end point.
Summary
Evaluating Variational Autoencoders (VAEs) involves a combination of quantitative and qualitative methods. Quantitative metrics such as Reconstruction Loss, KL Divergence, Inception Score (IS), and Fréchet Inception Distance (FID) provide objective measures of the model's performance.
Qualitative evaluation through visual inspection and latent space traversal offers insights into the quality and diversity of the generated images. By thoroughly evaluating the VAE using these methods, you can ensure that the model has learned meaningful latent representations and can generate high-quality samples. This comprehensive evaluation process helps in fine-tuning the model and identifying areas for improvement, ultimately leading to better generative performance.
5.4 Evaluating VAEs
The evaluation process of VAEs is crucial to ensure that the model has learned meaningful latent representations of the data and can generate high-quality samples. This evaluation is done through a combination of quantitative and qualitative methods.
On the quantitative side, evaluation metrics like Reconstruction Loss, Kullback-Leibler (KL) Divergence, Inception Score (IS), and Fréchet Inception Distance (FID) are used. Reconstruction Loss measures how well the VAE's decoder can recreate the original input data. KL Divergence measures the difference between the learned latent distribution and a prior distribution, usually a standard normal distribution. The Inception Score (IS) evaluates the quality and diversity of generated images, while the Fréchet Inception Distance (FID) measures the distance between the distributions of real and generated images.
On the qualitative side, methods such as visual inspection and latent space traversal are used. Visual inspection involves generating a set of images and examining them for realism and diversity. Latent space traversal involves interpolating between points in the latent space and generating images at each step. This can reveal the structure of the latent space and show how smoothly the VAE transitions between different data points.
The evaluation process is crucial in fine-tuning the model and identifying areas for improvement, ultimately leading to better generative performance. By thoroughly evaluating the VAE using these methods, you can ensure that the model has learned meaningful latent representations and can generate high-quality samples.
This section covers both quantitative and qualitative approaches. By the end of this section, you will have a comprehensive understanding of how to assess the performance of VAEs and interpret the results.
5.4.1 Quantitative Evaluation Metrics
Quantitative evaluation metrics are essential tools that provide objective measures to assess the performance of Variational Autoencoders (VAEs), a particular type of machine learning models. These metrics offer a robust way to quantify how well the models are performing in their tasks.
Among the most commonly used metrics in this field are Reconstruction Loss, Kullback-Leibler (KL) Divergence, Inception Score (IS), and Fréchet Inception Distance (FID). Each of these play a different role in evaluating the model.
Reconstruction Loss measures how well the model can reconstruct the input data, KL Divergence quantifies the difference between the model's learned distribution and the true data distribution, Inception Score (IS) evaluates the quality and diversity of generated samples, and Fréchet Inception Distance (FID) compares the distribution of generated samples to real samples.
Understanding Reconstruction Loss
Reconstruction loss is a critical component in the evaluation of Variational Autoencoders (VAEs). It essentially measures the effectiveness of the decoder in reconstructing the original input data starting from the latent variables. These latent variables are a set of representations that capture useful, simplified information about the original data.
In the context of a VAE, the reconstruction loss serves as a means of quantifying the quality of the data generated by the decoder. It is calculated by comparing the generated data to the original input data. The idea here is that a well-performing VAE should be able to recreate data that very closely matches the original input.
Therefore, a lower reconstruction loss is a positive indicator of performance. It suggests that the VAE is able to generate data that is highly similar to the original input. The closer the generated data is to the original, the lower the reconstruction loss. It's a key measurement in understanding the efficacy of the VAE and its ability to generate believable, accurate results.
Formula:
Reconstruction Loss=E
q(z∣x)
[−logp(x∣z)]
Kullback-Leibler Divergence (KL)
The Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure that quantifies the difference between two probability distributions. In the context of machine learning, the KL divergence is often used to evaluate the disparity between the learned latent distribution and the prior distribution.
The latent distribution is learned from the data during the training process, while the prior distribution is a predefined distribution that we wish the latent distribution to resemble. The KL divergence provides a numerical measure of how much the learned latent distribution deviates from the prior.
A lower KL divergence value indicates that the learned latent distribution is closer to the desired prior distribution. In essence, the smaller the KL divergence, the better the learned model is at approximating the desired distribution. Therefore, minimizing the KL divergence is often a goal in machine learning tasks.
Formula:
KL Divergence=D
KL
(q(z∣x)∥p(z))
Example: Calculating Reconstruction Loss and KL Divergence
import numpy as np
# Calculate Reconstruction Loss and KL Divergence
def calculate_losses(vae, x_test):
z_mean, z_log_var, z = vae.get_layer('encoder').predict(x_test)
x_decoded = vae.predict(x_test)
# Reconstruction Loss
reconstruction_loss = tf.keras.losses.binary_crossentropy(x_test, x_decoded)
reconstruction_loss = np.mean(reconstruction_loss * x_test.shape[1])
# KL Divergence
kl_loss = 1 + z_log_var - np.square(z_mean) - np.exp(z_log_var)
kl_loss = np.mean(-0.5 * np.sum(kl_loss, axis=-1))
return reconstruction_loss, kl_loss
# Calculate losses on test data
reconstruction_loss, kl_loss = calculate_losses(vae, x_test)
print(f"Reconstruction Loss: {reconstruction_loss}")
print(f"KL Divergence: {kl_loss}")
This example code defines a function to calculate two types of losses, Reconstruction Loss and KL Divergence, in a variational autoencoder (VAE).
The function 'calculate_losses' takes the VAE model and test data as inputs. It first uses the encoder part of the VAE to predict the latent vector 'z' from the test data and then uses the complete VAE to generate the reconstructed data.
The Reconstruction Loss is the mean binary cross-entropy loss between the original test data and the reconstructed data, scaled by the number of features in the data.
The Kullback-Leibler (KL) Divergence loss is computed from the mean and log variance of the latent vector 'z'. It measures the divergence of the learned distribution of 'z' from the standard normal distribution.
Lastly, the function returns both losses. The last part of the code uses this function to compute the losses on the test data and print them.
Inception Score (IS)
The Inception Score is a popular metric used to evaluate the quality and diversity of images generated by generative models, primarily Generative Adversarial Networks (GANs). It acts as a quantitative measure that reflects how good the generated images are.
The Inception Score uses a pre-trained Inception network, a type of deep convolutional neural network designed for image classification. This pre-trained network is used to predict the class labels of the generated images. The class labels, in this case, could be any predefined categories that the images could potentially fall into.
Once these class labels are predicted, the Inception Score then calculates the Kullback-Leibler (KL) divergence between the predicted class distribution and the marginal class distribution. The KL divergence essentially measures how one probability distribution diverges from a second, expected probability distribution. In this context, a higher KL divergence means that the generated images cover a broader range of categories, indicating both good quality and diversity of the images produced by the generative model.
Formula:
IS=exp(Ex[DKL(p(y∣x)∥p(y))])
Where:
- p(y∣x) is the conditional probability of label y given image x, as predicted by the Inception network.
- p(y) is the marginal distribution of labels, calculated as the mean of p(y∣x) over the generated images.
This formula calculates the KL divergence between the conditional label distribution for each image and the marginal label distribution, averaged over all generated images, and then exponentiated. The Inception Score thus measures both the quality (high confidence predictions) and diversity (similar distribution to real images) of the generated images.
Example: Calculating Inception Score
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from scipy.stats import entropy

# Function to calculate Inception Score
def calculate_inception_score(images, n_split=10, eps=1E-16):
    # Load InceptionV3 with its classification head so it outputs class probabilities
    model = InceptionV3(include_top=True, weights='imagenet')
    # Convert grayscale images to 3 channels and resize to the 299x299 input expected by InceptionV3
    images_rgb = tf.image.grayscale_to_rgb(tf.convert_to_tensor(images, dtype=tf.float32))
    images_resized = tf.image.resize(images_rgb, (299, 299))
    # Scale from [0, 1] (sigmoid decoder outputs) to [0, 255] before the Inception preprocessing
    images_preprocessed = preprocess_input(images_resized * 255.0)
    preds = model.predict(images_preprocessed)
    split_scores = []
    for i in range(n_split):
        # Take one split of the predictions
        part = preds[i * preds.shape[0] // n_split: (i + 1) * preds.shape[0] // n_split]
        # Marginal class distribution p(y) estimated from this split
        py = np.mean(part, axis=0)
        scores = []
        for p in part:
            # KL divergence between p(y|x) and p(y)
            scores.append(entropy(p, py + eps))
        split_scores.append(np.exp(np.mean(scores)))
    return np.mean(split_scores), np.std(split_scores)

# Generate images for evaluation
n_samples = 1000
random_latent_vectors = np.random.normal(size=(n_samples, latent_dim))
generated_images = decoder.predict(random_latent_vectors)
generated_images = generated_images.reshape((n_samples, 28, 28, 1))

# Calculate Inception Score
is_mean, is_std = calculate_inception_score(generated_images)
print(f"Inception Score: {is_mean} ± {is_std}")
In this example:
The function calculate_inception_score takes three parameters: a set of images, the number of parts to split these images into (n_split), and a small constant (eps) to prevent division by zero or taking the logarithm of zero. The function starts by loading a pre-trained InceptionV3 model from the Keras applications module. This model is a deep convolutional neural network that has been trained on over a million images from the ImageNet database, and it is capable of classifying images into 1,000 object categories.
Next, the function converts the grayscale images to three channels, resizes them to the 299x299 input expected by the InceptionV3 model, and applies the standard Inception pre-processing. It then uses the model to predict the class probabilities for each image across the 1,000 object categories.
Following this, the function calculates the Inception Score for each part of the split predictions. For each part, it computes the average prediction, which serves as an estimate of the marginal class distribution. Then, for each image in the part, it computes the KL divergence between the image's predicted class distribution and this average distribution; smaller values indicate that the two distributions are similar, while larger values indicate confident, distinctive predictions. The function then takes the exponential of the mean of these KL divergences to yield the Inception Score for the part.
This process is repeated for all parts, and the function finally returns the mean and standard deviation of all the Inception Scores. These two values give an overall measure of the quality and diversity of the generated images, with higher mean values indicating better quality and diversity, and lower standard deviation values indicating more consistent results across different parts.
Finally, the code generates a number of images using a decoder. This is a part of a generative model (such as a GAN or VAE) that transforms points in a latent space to images. The latent space is a lower-dimensional space that the model has learned to represent the input data.
The code generates random points in this latent space, using the standard normal distribution, and applies the decoder to these points to generate images. It then reshapes the images to the desired shape and calculates their Inception Score using the function defined earlier. The resulting Inception Score gives a quantitative measure of the quality and diversity of the images that the generative model is capable of producing.
Fréchet Inception Distance (FID)
The Fréchet Inception Distance, often abbreviated as FID, is a metric that quantifies the difference between the distribution of images that are generated by a model and the distribution of real-life images. This measurement is based on the concept of the Fréchet distance, which can be understood as a measure of similarity between two statistical distributions.
In the context of FID, these two distributions are derived from features extracted from an intermediate layer of the Inception network. One distribution is obtained from genuine, real-life images while the other is derived from images generated by a model.
The central principle underpinning the FID score is that if the generated images are of high quality, the two distributions should be similar, thus resulting in a lower FID score. Conversely, if the generated images are less like the real images, the FID score will be higher. Therefore, a lower FID score is indicative of better performance as it signifies that the model-generated images are more similar to the distribution of real images.
Formula:
FID = ||μ_r − μ_g||² + Tr( Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2} )
where μ_r, Σ_r and μ_g, Σ_g are the means and covariances of the Inception-feature distributions of the real and generated images, respectively.
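Before wiring the metric up to the Inception network, it can help to see the formula in isolation. The following toy sketch (synthetic two-dimensional feature vectors, not taken from the original example) applies the FID formula directly to two sets of features:

import numpy as np
from numpy import cov, trace, iscomplexobj
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
# Hypothetical "real" and "generated" feature vectors (1000 samples, 2 dimensions each)
real_feats = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
gen_feats = rng.normal(loc=0.5, scale=1.2, size=(1000, 2))

mu1, sigma1 = real_feats.mean(axis=0), cov(real_feats, rowvar=False)
mu2, sigma2 = gen_feats.mean(axis=0), cov(gen_feats, rowvar=False)

ssdiff = np.sum((mu1 - mu2) ** 2.0)      # squared difference of the means
covmean = sqrtm(sigma1.dot(sigma2))      # matrix square root of the covariance product
if iscomplexobj(covmean):
    covmean = covmean.real               # keep only the real part if numerical error adds an imaginary one
fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean)
print(fid)   # the larger the shift between the two distributions, the larger the FID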
Example: Calculating FID
from numpy import cov, trace, iscomplexobj
from scipy.linalg import sqrtm

# Function to calculate FID
def calculate_fid(real_images, generated_images):
    # Pooled Inception features are used here, so the classification head is not needed
    model = InceptionV3(include_top=False, pooling='avg', input_shape=(299, 299, 3))
    # Convert grayscale images to 3 channels and resize to the Inception input size
    real_rgb = tf.image.grayscale_to_rgb(tf.convert_to_tensor(real_images, dtype=tf.float32))
    generated_rgb = tf.image.grayscale_to_rgb(tf.convert_to_tensor(generated_images, dtype=tf.float32))
    real_images_resized = tf.image.resize(real_rgb, (299, 299))
    generated_images_resized = tf.image.resize(generated_rgb, (299, 299))
    # Scale from [0, 1] to [0, 255] before applying the Inception preprocessing
    real_images_preprocessed = preprocess_input(real_images_resized * 255.0)
    generated_images_preprocessed = preprocess_input(generated_images_resized * 255.0)
    # Extract feature activations for both image sets
    act1 = model.predict(real_images_preprocessed)
    act2 = model.predict(generated_images_preprocessed)
    # Mean and covariance of the activations
    mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False)
    # Squared difference of means plus trace term with the matrix square root of the covariance product
    ssdiff = np.sum((mu1 - mu2) ** 2.0)
    covmean = sqrtm(sigma1.dot(sigma2))
    if iscomplexobj(covmean):
        covmean = covmean.real
    fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean)
    return fid

# Sample real images
real_images = x_test[:n_samples].reshape((n_samples, 28, 28, 1))

# Calculate FID
fid_score = calculate_fid(real_images, generated_images)
print(f"FID Score: {fid_score}")
In this example:
The code begins by importing the necessary libraries and defining a function, calculate_fid(), which takes as input the two sets of images to be compared.
Next, the script loads the InceptionV3 model. This model is a pre-trained convolutional neural network that has been trained on a large image dataset and can classify images into a thousand different categories. It is highly effective at extracting useful features from images and is often used in tasks that require understanding the content of images.
The code then converts the grayscale images to three channels, resizes them to the InceptionV3 model's expected input size of 299x299 pixels, and preprocesses them to match the format expected by the model.
The preprocessed and resized images are then passed through the InceptionV3 model to extract the activations. These activations serve as a kind of 'summary' of the image content, capturing important features but discarding redundant information.
Following this, the script calculates the mean and covariance of the activations for both the real and generated images. These statistical properties capture important characteristics of the distributions of the images in the latent (feature) space.
The FID score is then computed using a formula that takes into account both the difference in the means and the covariances of the real and generated images. The square root of the product of the covariances is calculated using the sqrtm() function from the scipy.linalg library. If the result is a complex number, only the real part is kept.
The final FID score is then the sum of squared differences between the two means, plus the trace of the sum of the two covariances minus twice the matrix square root of their product.
The function calculate_fid() returns this calculated FID score. The lower the FID score, the more similar the two sets of images are in terms of their distributions in the feature space. Hence, this score serves as an effective measure of the quality of the images generated by the VAE or similar generative models.
A sample of real images is then selected from the test set and reshaped to suit the model's requirements.
Finally, the FID score is calculated for the real and generated images, and the result is printed to the console. This score provides a quantifiable measure of how well the model is performing at generating new images that resemble the real ones.
5.4.2 Qualitative Evaluation
Qualitative evaluation is a critical step in the process of assessing the output of any generative model. This non-numerical method involves a detailed visual inspection of the images that the model produces. The primary purpose of this visual inspection is to evaluate the quality and diversity of the generated images.
Although this method might seem subjective due to the reliance on visual assessment, it actually offers valuable insights into the model's performance that quantitative methods might not capture.
By evaluating the images visually, we can get a sense of the model's ability to produce diverse outputs and to capture the essential characteristics of the input data. This, in turn, helps us to understand the strengths and weaknesses of the model and to make informed decisions about possible improvements or adjustments.
Visual Inspection Process
The process of visual inspection involves the creation of a diverse set of images that are then carefully examined to evaluate their level of realism and the variety they exhibit. This hands-on approach is crucial in identifying any glaring issues that may be present.
Some of these potential problems could include a lack of sharpness resulting in blurriness, unwanted elements or irregularities that are referred to as artifacts, or a phenomenon known as mode collapse. The latter is a situation where the model, instead of generating a wide variety of outputs, repeatedly produces the same or very similar images.
Through this detailed visual inspection, we can ensure that the generated images not only appear realistic but also display a wide array of different characteristics, thus enhancing the overall performance and practical application of the model.
Example: Visualizing Generated Images
import matplotlib.pyplot as plt

# Function to visualize generated images
def visualize_generated_images(decoder, latent_dim, n_samples=10):
    # Sample random points in the latent space from a standard normal distribution
    random_latent_vectors = np.random.normal(size=(n_samples, latent_dim))
    generated_images = decoder.predict(random_latent_vectors)
    generated_images = generated_images.reshape((n_samples, 28, 28))
    # Plot the generated images in a single row
    plt.figure(figsize=(10, 2))
    for i in range(n_samples):
        plt.subplot(1, n_samples, i + 1)
        plt.imshow(generated_images[i], cmap='gray')
        plt.axis('off')
    plt.show()

# Visualize generated images
visualize_generated_images(decoder, latent_dim)
In this example:
The Python code provided is used to visualize images generated by the decoder, a component of the VAE. The function visualize_generated_images takes three parameters: decoder, latent_dim, and n_samples. The decoder is a trained model that can generate images from points in the latent space, latent_dim is the dimension of the latent space, and n_samples is the number of images to be generated.
The function begins by generating random latent vectors. These vectors are points in the latent space from which the images will be generated, and they are drawn from a standard normal distribution with shape (n_samples, latent_dim).
These random latent vectors are then passed to the decoder using its predict method, and the decoder generates the images from these latent vectors. The generated images are then reshaped to a 2D format suitable for plotting.
The function then creates a figure using matplotlib.pyplot and plots each generated image in a subplot, displayed in grayscale, with axis('off') used to turn off the axis for each subplot. Finally, the function displays the plot using plt.show().
The last line of code in the snippet calls this function, passing the decoder and latent_dim as arguments, to visualize the images generated by the decoder from the latent space. This visualization is useful in qualitative evaluation of the VAE model, where the quality and diversity of the images generated by the model are assessed.
Latent Space Traversal
Latent space traversal is a powerful technique that is primarily concerned with the interpolation between distinct points within the latent space, which is a compressed representation of our data. At each step of this process, images are generated which provide a visual representation of these points within the latent space.
This method serves as an important tool for the visualization of the smooth transitions that the Variational Autoencoder (VAE) makes between different data points. By observing these transitions, we can gain valuable insights into how the VAE processes and interprets data.
Furthermore, latent space traversal can be used to reveal the inherent structure of the latent space. By understanding this structure, we can better comprehend how the VAE learns to encode and decode data, and how it identifies and leverages the key features of the data to create a robust and efficient representation.
Example: Latent Space Traversal
# Function to perform latent space traversal
def latent_space_traversal(decoder, latent_dim, n_steps=10):
    # Pick two random points in the latent space
    start_point = np.random.normal(size=(latent_dim,))
    end_point = np.random.normal(size=(latent_dim,))
    # Linearly interpolate between the two points; shape (n_steps, latent_dim)
    interpolation = np.linspace(start_point, end_point, n_steps)
    generated_images = decoder.predict(interpolation)
    generated_images = generated_images.reshape((n_steps, 28, 28))
    # Plot the interpolated images in a single row
    plt.figure(figsize=(15, 2))
    for i in range(n_steps):
        plt.subplot(1, n_steps, i + 1)
        plt.imshow(generated_images[i], cmap='gray')
        plt.axis('off')
    plt.show()

# Perform latent space traversal
latent_space_traversal(decoder, latent_dim)
This example code defines and executes a function called latent_space_traversal, which is used to explore the latent space of generative models such as autoencoders or GANs.
In this function, a start point and an end point are randomly selected in the latent space. Then, a linear interpolation between these two points is created. The decoder is used to generate images from these interpolated points.
The generated images are then reshaped and displayed in a row, providing a visual representation of the traversal through the latent space from the start point to the end point.
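A common variation on this idea, not shown in the original example, is to interpolate between the encodings of two real test images rather than two random points; this makes it easier to judge whether the transition between familiar samples is smooth. Below is a minimal sketch, assuming the encoder returns (z_mean, z_log_var, z) as in the earlier loss calculation and that x_test holds the same flattened test images used there:

def traverse_between_images(encoder, decoder, x_a, x_b, n_steps=10):
    # Encode the two real images and use their mean latent vectors as the endpoints
    z_a = encoder.predict(x_a[np.newaxis, ...])[0][0]
    z_b = encoder.predict(x_b[np.newaxis, ...])[0][0]
    # Linearly interpolate between the two encodings; shape (n_steps, latent_dim)
    interpolation = np.linspace(z_a, z_b, n_steps)
    images = decoder.predict(interpolation).reshape((n_steps, 28, 28))
    plt.figure(figsize=(15, 2))
    for i in range(n_steps):
        plt.subplot(1, n_steps, i + 1)
        plt.imshow(images[i], cmap='gray')
        plt.axis('off')
    plt.show()

# Example usage with two images from the test set
traverse_between_images(vae.get_layer('encoder'), decoder, x_test[0], x_test[1])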
Summary
Evaluating Variational Autoencoders (VAEs) involves a combination of quantitative and qualitative methods. Quantitative metrics such as Reconstruction Loss, KL Divergence, Inception Score (IS), and Fréchet Inception Distance (FID) provide objective measures of the model's performance.
Qualitative evaluation through visual inspection and latent space traversal offers insights into the quality and diversity of the generated images. By thoroughly evaluating the VAE using these methods, you can ensure that the model has learned meaningful latent representations and can generate high-quality samples. This comprehensive evaluation process helps in fine-tuning the model and identifying areas for improvement, ultimately leading to better generative performance.