Chapter 4: Project Face Generation with GANs
4.5 Evaluating the Model
Evaluating the performance of a Generative Adversarial Network (GAN) is crucial for understanding how well the model generates realistic images and for identifying areas for improvement. This section covers both qualitative and quantitative methods for evaluating the GAN model trained on face generation. We will discuss metrics like the Inception Score (IS) and the Fréchet Inception Distance (FID), and provide example code to calculate them.
4.5.1 Qualitative Evaluation
Qualitative evaluation involves visually inspecting the generated images to assess their realism and diversity. This method is subjective but essential for gaining an initial understanding of the model's performance. Here are some aspects to consider during qualitative evaluation:
- Realism: Do the generated images look like real faces?
- Diversity: Are the generated images diverse, covering a wide range of facial features and expressions?
- Artifacts: Are there any noticeable artifacts or inconsistencies in the generated images?
Example: Visualizing Generated Images
You can visualize the generated images using matplotlib to perform a qualitative evaluation:
import matplotlib.pyplot as plt
import numpy as np

def plot_generated_images(generator, latent_dim, n_samples=10):
    noise = np.random.normal(0, 1, (n_samples, latent_dim))
    generated_images = generator.predict(noise)
    generated_images = (generated_images * 127.5 + 127.5).astype(np.uint8)  # Rescale to [0, 255]
    plt.figure(figsize=(20, 2))
    for i in range(n_samples):
        plt.subplot(1, n_samples, i + 1)
        plt.imshow(generated_images[i])
        plt.axis('off')
    plt.show()

# Generate and plot new faces for qualitative evaluation
latent_dim = 100
plot_generated_images(generator, latent_dim, n_samples=10)
The function plot_generated_images generates a specified number of images (default is 10) using the generator. It draws random noise from a standard normal distribution, feeds it to the generator model, and rescales the output images to pixel values in the range [0, 255]. The images are then displayed in a single row with the specified figure size.
The last two lines of code call this function with a generator model and a latent dimension of 100, generating and displaying 10 images.
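Another quick qualitative check is to interpolate between two latent vectors: smooth, plausible transitions between faces suggest the generator has learned a well-behaved latent space rather than memorizing training samples. The sketch below reuses the same generator and latent_dim and assumes, as above, that the generator outputs images in [-1, 1]; the helper name plot_latent_interpolation is illustrative.
def plot_latent_interpolation(generator, latent_dim, n_steps=10):
    # Two random endpoints in the latent space (standard normal, as above)
    z_start = np.random.normal(0, 1, (1, latent_dim))
    z_end = np.random.normal(0, 1, (1, latent_dim))
    # Linearly interpolate between the endpoints
    alphas = np.linspace(0, 1, n_steps).reshape(-1, 1)
    z = (1 - alphas) * z_start + alphas * z_end
    images = generator.predict(z)
    images = (images * 127.5 + 127.5).astype(np.uint8)  # Rescale to [0, 255]
    plt.figure(figsize=(2 * n_steps, 2))
    for i in range(n_steps):
        plt.subplot(1, n_steps, i + 1)
        plt.imshow(images[i])
        plt.axis('off')
    plt.show()

plot_latent_interpolation(generator, latent_dim, n_steps=10)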
4.5.2 Quantitative Evaluation
Quantitative evaluation provides objective measures of the quality and diversity of the generated images. Two widely used metrics for evaluating GANs are the Inception Score (IS) and the Fréchet Inception Distance (FID).
Inception Score (IS)
The Inception Score measures the quality and diversity of the generated images by evaluating how well they match the class labels predicted by a pre-trained Inception network. Higher scores indicate better quality and diversity.
Formula:
IS = exp( E_x [ D_KL( p(y|x) ‖ p(y) ) ] ), where p(y|x) is the class distribution predicted by the pre-trained Inception network for a generated image x, and p(y) is the marginal class distribution averaged over all generated images.
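To make the formula concrete, here is a toy illustration that applies it directly to a small matrix of hypothetical class probabilities (three images, four classes); it is not part of the face-generation pipeline.
import numpy as np
from scipy.stats import entropy

# Hypothetical p(y|x) for three images over four classes
p_yx = np.array([[0.90, 0.05, 0.03, 0.02],
                 [0.05, 0.85, 0.05, 0.05],
                 [0.10, 0.10, 0.70, 0.10]])
p_y = p_yx.mean(axis=0)                # marginal distribution p(y)
kl = [entropy(p, p_y) for p in p_yx]   # KL(p(y|x) || p(y)) per image
print("Toy Inception Score:", np.exp(np.mean(kl)))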
Example: Calculating Inception Score
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from scipy.stats import entropy

def calculate_inception_score(images, n_split=10, eps=1E-16):
    # Load InceptionV3 with its classification head so that predictions
    # are class probabilities over the 1,000 ImageNet classes
    model = InceptionV3(weights='imagenet', include_top=True)
    # Resize and preprocess (preprocess_input expects pixel values in [0, 255])
    images_resized = tf.image.resize(images, (299, 299))
    images_preprocessed = preprocess_input(images_resized)
    # Predict the class probability distribution p(y|x) for each image
    preds = model.predict(images_preprocessed)
    # Calculate the mean KL divergence over n_split splits
    split_scores = []
    for i in range(n_split):
        part = preds[i * preds.shape[0] // n_split: (i + 1) * preds.shape[0] // n_split]
        py = np.mean(part, axis=0)  # marginal distribution p(y)
        scores = []
        for p in part:
            scores.append(entropy(p + eps, py + eps))  # KL(p(y|x) || p(y))
        split_scores.append(np.exp(np.mean(scores)))
    return np.mean(split_scores), np.std(split_scores)

# Generate images and rescale to [0, 255], assuming a tanh-output generator
n_samples = 1000
noise = np.random.normal(0, 1, (n_samples, latent_dim))
generated_images = generator.predict(noise) * 127.5 + 127.5

# Calculate Inception Score
is_mean, is_std = calculate_inception_score(generated_images)
print(f"Inception Score: {is_mean} ± {is_std}")
The code first imports the necessary modules and defines a function calculate_inception_score. This function uses the InceptionV3 model (with its classification head) to predict the class probability distribution for each image. It then computes the Kullback-Leibler (KL) divergence between each predicted distribution and the marginal (mean) distribution, and the exponential of the average divergence gives the Inception Score.
A high Inception Score indicates that the model generates diverse and realistic images. The function returns the mean and standard deviation of the Inception Scores for a given set of images.
The last part of the code generates images from random noise using a 'generator' model, and then calculates and prints the Inception Score for these images.
Fréchet Inception Distance (FID)
The Fréchet Inception Distance measures the distance between the distributions of real and generated images. Lower FID scores indicate better quality and diversity of the generated images.
Formula:
FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)), where μ_r, Σ_r and μ_g, Σ_g are the means and covariances of the real and generated image distributions, respectively.
Example: Calculating FID
import numpy as np
import tensorflow as tf
from numpy import cov, trace, iscomplexobj
from scipy.linalg import sqrtm
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

def calculate_fid(real_images, generated_images):
    # Load InceptionV3 without its head; global average pooling gives a
    # 2,048-dimensional feature vector per image
    model = InceptionV3(include_top=False, pooling='avg', input_shape=(299, 299, 3))
    # Resize and preprocess images (preprocess_input expects values in [0, 255])
    real_images_resized = tf.image.resize(real_images, (299, 299))
    generated_images_resized = tf.image.resize(generated_images, (299, 299))
    real_images_preprocessed = preprocess_input(real_images_resized)
    generated_images_preprocessed = preprocess_input(generated_images_resized)
    # Calculate activations
    act1 = model.predict(real_images_preprocessed)
    act2 = model.predict(generated_images_preprocessed)
    # Calculate mean and covariance of each activation set
    mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False)
    # Calculate FID
    ssdiff = np.sum((mu1 - mu2)**2.0)
    covmean = sqrtm(sigma1.dot(sigma2))
    if iscomplexobj(covmean):
        covmean = covmean.real
    fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean)
    return fid

# Generate images and rescale to [0, 255], assuming a tanh-output generator
n_samples = 1000
noise = np.random.normal(0, 1, (n_samples, latent_dim))
generated_images = generator.predict(noise) * 127.5 + 127.5

# Sample real images (x_train is assumed to hold pixel values in [0, 255])
real_images = x_train[np.random.choice(x_train.shape[0], n_samples, replace=False)]

# Calculate FID
fid_score = calculate_fid(real_images, generated_images)
print(f"FID Score: {fid_score}")
The script defines a function calculate_fid(real_images, generated_images) that computes the FID score. It uses the InceptionV3 model from Keras to extract activations for the real and the generated images, and these activations are used to compute the mean and covariance of each image set.
The FID score is the squared difference between the means plus the trace of the sum of the covariances minus twice the matrix square root of their product.
The function is then applied to a set of real images and a set of generated images: the generated images are created by the generator network from random noise, and the real images are sampled from the training set x_train. Finally, the FID score is printed.
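A quick way to sanity-check an FID implementation is to compare a feature set with itself, which should give a score near zero, and with a shifted copy, which should not. The sketch below applies the same formula to small random matrices that stand in for Inception activations, so it checks only the arithmetic, not the full pipeline.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 16))  # stand-in for Inception activations

def fid_from_activations(act1, act2):
    mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False)
    covmean = sqrtm(sigma1.dot(sigma2))
    if iscomplexobj(covmean):
        covmean = covmean.real
    return np.sum((mu1 - mu2)**2.0) + trace(sigma1 + sigma2 - 2.0 * covmean)

print(fid_from_activations(feats, feats))        # approximately 0
print(fid_from_activations(feats, feats + 1.0))  # clearly larger (about 16 from the mean shift)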
4.5.3 Comparing with Baseline Models
To understand the performance of your GAN model, it's useful to compare the results with baseline models (a minimal comparison sketch follows the list below). This could involve:
- Comparing with a GAN trained with a different architecture.
- Comparing with a GAN trained with different hyperparameters.
- Comparing with other generative models like VAEs (Variational Autoencoders).
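As a minimal sketch of such a comparison, the loop below assumes you have several trained generator models collected in a dictionary and reuses calculate_fid, real_images, n_samples, and latent_dim from above; the model names are placeholders.
# Hypothetical trained generators to compare; the names are placeholders
candidate_generators = {
    "dcgan_baseline": generator,
    # "dcgan_larger": generator_large,
    # "wgan_gp": generator_wgan,
}

results = {}
for name, gen in candidate_generators.items():
    noise = np.random.normal(0, 1, (n_samples, latent_dim))
    fake = gen.predict(noise) * 127.5 + 127.5  # rescale, assuming a tanh output
    results[name] = calculate_fid(real_images, fake)

# Lower FID is better
for name, score in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: FID = {score:.2f}")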
4.5.4 Addressing Common Issues
During evaluation, you might encounter common issues such as the following; a short code sketch of one remedy appears after the list:
- Mode Collapse: The generator produces limited diversity in the output images. This can be addressed by techniques such as minibatch discrimination, unrolled GANs, or using different loss functions.
- Training Instability: The generator and discriminator losses oscillate significantly. This can be mitigated by using techniques like Wasserstein GANs (WGANs) or spectral normalization.
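As one concrete illustration of the remedies mentioned above, the sketch below shows the Wasserstein losses used by a WGAN-style critic and generator. It assumes a critic that outputs unbounded real-valued scores and omits the weight clipping or gradient penalty that a complete WGAN or WGAN-GP also requires.
import tensorflow as tf

def critic_loss(real_scores, fake_scores):
    # The critic maximizes the score gap between real and generated samples
    return tf.reduce_mean(fake_scores) - tf.reduce_mean(real_scores)

def generator_loss(fake_scores):
    # The generator maximizes the critic's score on its samples
    return -tf.reduce_mean(fake_scores)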
Summary
Evaluating a GAN involves both qualitative and quantitative methods to ensure that the generated images are realistic and diverse. Qualitative evaluation through visual inspection helps in identifying immediate issues, while quantitative metrics like Inception Score and Fréchet Inception Distance provide objective measures of performance. By systematically evaluating and comparing the model's outputs, you can identify areas for improvement and refine your GAN to produce high-quality images.