Introduction to Natural Language Processing with Transformers

Chapter 2: Machine Learning and Deep Learning for NLP

2.6 Practical Exercises of Chapter 2: Machine Learning and Deep Learning for NLP

Exercise 1: Text Preprocessing

For this exercise, we will practice the text preprocessing techniques we've just learned about.

Exercise 1.1: Tokenization

  1. Take a sample sentence: "The quick brown fox jumps over the lazy dog."
  2. Tokenize this sentence using NLTK's word_tokenize function.
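import nltk

# word_tokenize needs the Punkt tokenizer data; download it once.
# (Recent NLTK releases may ask for 'punkt_tab' instead.)
nltk.download('punkt')

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(tokens)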
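
To see what a tokenizer does without installing any data files, here is a minimal sketch that uses only Python's standard library. The simple_tokenize function and its regular expression are illustrative assumptions, not NLTK's algorithm; word_tokenize handles contractions, abbreviations, and other cases that this pattern ignores.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: runs of word characters become tokens,
    # and each remaining non-space character (punctuation) is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "The quick brown fox jumps over the lazy dog."
simple_tokens = simple_tokenize(sentence)
print(simple_tokens)
```

Notice that the final period comes out as a separate token, which is also how word_tokenize treats sentence-ending punctuation.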

Exercise 1.2: Lowercasing

  1. Convert all tokens to lowercase.
lowercase_tokens = [token.lower() for token in tokens]
print(lowercase_tokens)

Exercise 1.3: Stopword Removal

  1. Remove stopwords from the list of tokens using NLTK's English stopword list.
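import nltk
from nltk.corpus import stopwords

# The stopword lists ship separately from NLTK; download them once.
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in lowercase_tokens if token not in stop_words]
print(filtered_tokens)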
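
The filtering step itself is just a set-membership test. The sketch below uses a tiny hand-written stopword set (an assumption for illustration; NLTK's English list is much longer) so that it runs without downloading any corpus. The names demo_stop_words, demo_tokens, and demo_filtered are hypothetical.

```python
# A tiny, hand-picked stopword set for illustration only;
# NLTK's English list contains many more entries.
demo_stop_words = {"the", "over", "a", "an", "and", "of"}

demo_tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

# Keep only the tokens that are not in the stopword set.
# A set gives constant-time membership checks on average.
demo_filtered = [token for token in demo_tokens if token not in demo_stop_words]
print(demo_filtered)
```

In this sketch the punctuation token "." survives, because the stopword set contains words only. If you also want to drop punctuation, one option is to add a condition such as token.isalpha() to the list comprehension.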

Exercise 1.4: Stemming

  1. Perform stemming on the filtered tokens using NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_tokens)
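
To build some intuition for what a stemmer does, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies several ordered phases of rewrite rules; the naive_stem function, its suffix list, and its minimum-length check are assumptions made up for this sketch.

```python
def naive_stem(token):
    # Hypothetical toy rule set: strip one common suffix, but only if
    # at least three characters of the word would remain.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

demo_words = ["jumps", "jumped", "jumping", "dog", "fox"]
demo_stems = [naive_stem(word) for word in demo_words]
print(demo_stems)
```

The three inflected forms of "jump" collapse to the same stem, which is the effect we want from stemming: related word forms end up sharing a single representation.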

Exercise 2: Word Embeddings

For this exercise, you will practice working with word embeddings using Word2Vec and an Embedding layer in Keras.

Exercise 2.1: Word2Vec

  1. Load pre-trained Word2Vec embeddings using Gensim's KeyedVectors.load_word2vec_format.
  2. Print the vector representation for a word of your choice.
from gensim.models import KeyedVectors

# Load vectors directly from the file
word2vec = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)

# Now we can use the word2vec model to get the embedding of any word
embedding = word2vec['computer']
print(embedding)
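
A printed vector is hard to interpret on its own; embeddings become useful when we compare them. A common measure is cosine similarity, sketched below in plain Python. The three-dimensional vectors are made up for illustration (the Google News vectors have 300 dimensions), and cosine_similarity and toy_vectors are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors for illustration only;
# they are not taken from any trained model.
toy_vectors = {
    "computer": [0.9, 0.1, 0.0],
    "laptop": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

similar = cosine_similarity(toy_vectors["computer"], toy_vectors["laptop"])
dissimilar = cosine_similarity(toy_vectors["computer"], toy_vectors["banana"])
print(similar, dissimilar)
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score close to 0. With the vectors loaded in this exercise, Gensim's word2vec.similarity('computer', 'laptop') reports the same kind of score for real embeddings.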

Exercise 2.2: Embedding Layer in Keras

  1. Create a Sequential model in Keras and add an Embedding layer with 10,000 possible tokens and an embedding dimensionality of 32.
  2. Print the model summary.
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()

# The Embedding layer takes at least two arguments:
# the number of possible tokens (here, 10,000) and the dimensionality of the embeddings (here, 32).
model.add(Embedding(10000, 32))

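# Print the model summary. Depending on your Keras version, the layer may be
# reported as unbuilt until the model has been given an input shape.
model.summary()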
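
Conceptually, an Embedding layer is a trainable lookup table with one row per token. The sketch below mimics that idea in plain Python, under the assumption that random numbers can stand in for the weights; Keras initializes and trains the real weights itself, and the names embedding_table and token_ids are hypothetical.

```python
import random

vocab_size = 10000   # number of possible tokens, as in the exercise
embedding_dim = 32   # size of each embedding vector

# Stand-in for the layer's weight matrix: one row of random floats per token.
random.seed(0)
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

# "Applying" the layer to a sequence of token indices is a row lookup.
token_ids = [4, 20, 9999]
embedded = [embedding_table[i] for i in token_ids]

print(len(embedded), len(embedded[0]))  # sequence length, embedding size
print(vocab_size * embedding_dim)       # number of weights in the table
```

This also explains the parameter count you should expect for this layer: 10,000 rows times 32 columns gives 320,000 weights.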

In Exercise 2.1, remember to replace 'path/to/GoogleNews-vectors-negative300.bin' with the actual path to the file. If you don't already have the file, you can download the Google News vectors from the Word2Vec project page: https://code.google.com/archive/p/word2vec/ (warning: the file is about 1.5 GB).

These exercises are intended to help you get a hands-on understanding of the concepts we've discussed. Happy coding!

Chapter 2 Conclusion

This chapter took us on a journey through the intersection of Machine Learning, Deep Learning, and Natural Language Processing. We started with an overview of machine learning, exploring its roots, the concepts of supervised and unsupervised learning, and its relevance to NLP. Understanding this foundation is crucial because machine learning underpins many NLP tasks, from text classification to named entity recognition and beyond.

We then dove into neural networks, exploring their basic structure and how they can be leveraged in NLP tasks. We also touched upon word embeddings and why they're vital for representing words in a form that machines can work with. In particular, we explored Word2Vec and GloVe, two methods for creating word embeddings that paved the way for more complex models in NLP.

We then turned to text preprocessing. Unlike numerical data, text requires a good amount of cleaning and structuring before it can be fed into an NLP model. We explained techniques such as tokenization, lowercasing, stopword removal, stemming, and lemmatization, and demonstrated them with code examples.

Towards the end of the chapter, we rolled up our sleeves and dove into practical exercises aimed at cementing the theoretical knowledge. We practiced preprocessing techniques and worked with word embeddings using Python, NLTK, Gensim, and Keras.

As we close this chapter, it's important to remember that while we have learned a lot, there's still more ground to cover. As technology evolves, so does the field of NLP, and the models and techniques we'll explore in the next chapters are a testament to that. Models like LSTM and GRU, along with the Transformer and its descendants (like BERT and GPT-3), have revolutionized how we approach NLP problems, bringing us closer than ever to machines that can truly understand and generate human language.

In the next chapter, we'll take a closer look at some of these advanced models and explore their workings in depth. We will begin with recurrent neural networks, a type of neural network specially designed for sequential data like text, and move on to discuss models that have been built on top of them. We'll also delve into the idea of attention, a powerful mechanism that allows models to focus on important parts of the input when generating an output. So, stay tuned, and let's continue our journey into the world of NLP!
