Chapter 6: Advanced Level Exercises
Advanced Level Exercises Part 1
Exercise 1: File Parsing
Concepts:
- File I/O
- Regular expressions
Description: Write a Python script that reads a text file and extracts all URLs that are present in the file. The output should be a list of URLs.
Solution:
import re
# Open the file for reading
with open('input_file.txt', 'r') as f:
    # Read the file contents
    file_contents = f.read()
# Use regular expression to extract URLs
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', file_contents)
# Print the list of URLs
print(urls)
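The regular expression above can match the same URL more than once if it appears repeatedly in the file; an optional refinement, continuing the script above, is to deduplicate the matches while preserving their order:
# Optional refinement: deduplicate the extracted URLs while preserving order
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)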
Exercise 2: Data Analysis
Concepts:
- File I/O
- Data manipulation
- Pandas library
Description: Write a Python script that reads a CSV file containing sales data and calculates the total sales revenue for each product category.
Solution:
import pandas as pd
# Read the CSV file into a pandas dataframe
df = pd.read_csv('sales_data.csv')
# Group the data by product category and sum the sales revenue
total_revenue = df.groupby('Product Category')['Sales Revenue'].sum()
# Print the total revenue for each product category
print(total_revenue)
Exercise 3: Web Scraping
Concepts:
- Web scraping
- Requests library
- Beautiful Soup library
- CSV file I/O
Description: Write a Python script that scrapes the title and price of all products listed on an e-commerce website and stores them in a CSV file.
Solution:
import requests
from bs4 import BeautifulSoup
import csv
# Define the target URL
url = 'https://www.example.com/products'
# Headers to mimic a real browser request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Make a GET request to the website
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using Beautiful Soup
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find all product titles and prices
    titles = [title.get_text(strip=True) for title in soup.find_all('h3', class_='product-title')]
    prices = [price.get_text(strip=True) for price in soup.find_all('div', class_='product-price')]
    # Zip the titles and prices together
    data = list(zip(titles, prices))
    # Write the data to a CSV file with headers
    with open('product_data.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Product Title', 'Price'])  # Add headers
        writer.writerows(data)
    print("Scraping completed. Data saved to 'product_data.csv'.")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
Exercise 4: Multithreading
Concepts:
- Multithreading
- Requests library
- Threading library
Description: Write a Python script that uses multithreading to download multiple images from a URL list simultaneously.
Solution:
import requests
import threading
# URL list of images to download
url_list = ['https://www.example.com/image1.jpg', 'https://www.example.com/image2.jpg', 'https://www.example.com/image3.jpg']
# Function to download an image from a URL
def download_image(url):
    response = requests.get(url)
    with open(url.split('/')[-1], 'wb') as f:
        f.write(response.content)

# Create a thread for each URL and start them all simultaneously
threads = []
for url in url_list:
    thread = threading.Thread(target=download_image, args=(url,))
    threads.append(thread)
    thread.start()

# Wait for all threads to finish
for thread in threads:
    thread.join()
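As a variant, the standard library's concurrent.futures module can manage the thread pool and the joins for us; a minimal sketch reusing the same download_image function and url_list from above:
from concurrent.futures import ThreadPoolExecutor

# Download the images with a managed pool of worker threads
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(download_image, url_list)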
Exercise 5: Machine Learning
Concepts:
- Machine learning
- Scikit-learn library
Description: Write a Python script that trains a machine learning model on a dataset and uses it to predict the output for new data.
Solution:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Read the dataset into a pandas dataframe
df = pd.read_csv('dataset.csv')
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[['feature1', 'feature2']], df['target'], test_size=0.2, random_state=42)
# Train a linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)
# Use the model to predict the output for the testing data
y_pred = model.predict(X_test)
# Evaluate the model performance using the mean squared error metric
mse = ((y_test - y_pred) ** 2).mean()
print("Mean squared error:", mse)
In this exercise, we first read a dataset into a pandas dataframe. Then, we split the data into training and testing sets using the train_test_split function from the sklearn.model_selection module. We trained a linear regression model on the training data using the LinearRegression class from the sklearn.linear_model module. Finally, we used the trained model to predict the output for the testing data and evaluated the model performance using the mean squared error metric.
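The description also asks for predictions on new data; a short sketch, assuming the same two feature columns used above (the values here are made up):
# Predict the target for new, unseen observations (hypothetical feature values)
new_data = pd.DataFrame({'feature1': [5.1, 6.2], 'feature2': [3.4, 2.8]})
new_pred = model.predict(new_data)
print("Predictions for new data:", new_pred)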
Exercise 6: Natural Language Processing
Concepts:
- Natural Language Processing
- Sentiment Analysis
- NLTK library
Description: Write a Python script that reads a text file and performs sentiment analysis on the text using a pre-trained NLP model.
Solution:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Ensure the VADER lexicon is downloaded
nltk.download('vader_lexicon')
# Read the text file into a string
with open('input_file.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Create a SentimentIntensityAnalyzer object
sid = SentimentIntensityAnalyzer()
# Perform sentiment analysis on the text
scores = sid.polarity_scores(text)
# Print the sentiment scores
print(scores)
In this exercise, we first read a text file into a string. Then, we create a SentimentIntensityAnalyzer object from the nltk.sentiment.vader module. We use the polarity_scores method of the SentimentIntensityAnalyzer object to perform sentiment analysis on the text and get a dictionary of sentiment scores.
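The scores dictionary contains neg, neu, pos, and compound values. A common convention (an assumption, not part of VADER itself) is to classify the overall sentiment from the compound score:
# Classify the overall sentiment from the compound score (thresholds are conventional, not mandatory)
compound = scores['compound']
if compound >= 0.05:
    sentiment = 'positive'
elif compound <= -0.05:
    sentiment = 'negative'
else:
    sentiment = 'neutral'
print('Overall sentiment:', sentiment)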
Exercise 7: Web Development
Concepts:
- Web Development
- Flask framework
- File Uploads
Description: Write a Python script that creates a web application using the Flask framework that allows users to upload a file and performs some processing on the file.
Solution:
from flask import Flask, render_template, request
import os
app = Flask(__name__)
# Set the path for file uploads
UPLOAD_FOLDER = 'uploads'
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
# Ensure the upload directory exists
if not os.path.exists(UPLOAD_FOLDER):
    os.makedirs(UPLOAD_FOLDER)

# Route for the home page
@app.route('/')
def index():
    return render_template('index.html')

# Route for file uploads
@app.route('/upload', methods=['POST'])
def upload():
    if 'file' not in request.files:
        return 'No file part', 400
    file = request.files['file']
    if file.filename == '':
        return 'No selected file', 400
    # Save the file to the uploads folder
    file.save(os.path.join(app.config['UPLOAD_FOLDER'], file.filename))
    return 'File uploaded successfully'

if __name__ == '__main__':
    app.run(debug=True)
In this exercise, we first import the Flask module and create a Flask application. We set up a route for the home page that returns an HTML template, and a route for file uploads that receives an uploaded file and saves it to a designated uploads folder. Any processing of the uploaded file can be performed inside the upload function.
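One detail worth noting: the uploaded filename comes straight from the client, so it should be sanitized before being used in a path. Werkzeug, which Flask depends on, provides secure_filename for this; the save line inside upload could be replaced with the following sketch:
from werkzeug.utils import secure_filename

# Sanitize the client-supplied filename before building the save path
filename = secure_filename(file.filename)
file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))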
Exercise 8: Data Visualization
Concepts:
- Data Visualization
- Matplotlib library
- Candlestick Charts
Description: Write a Python script that reads a CSV file containing stock market data and plots a candlestick chart of the data.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
import mplfinance as mpf
# Read the CSV file into a pandas dataframe
df = pd.read_csv('stock_data.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True) # Set Date as index
# Plot the candlestick chart using mplfinance
mpf.plot(df, type='candle', style='charles', title='Stock Market Data', ylabel='Price')
# Display the chart
plt.show()
In this exercise, we first read a CSV file containing stock market data into a pandas dataframe, parse the date column as datetimes, and set it as the index. We then plot the candlestick chart using the plot function from the mplfinance module, which handles the date axis, labels, and title for us, and display the chart using the show function from the matplotlib.pyplot module.
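mplfinance expects the dataframe to contain columns named Open, High, Low, Close (and optionally Volume). If stock_data.csv uses different capitalization, rename the columns before plotting; a sketch assuming lowercase names in the file:
# Rename lowercase CSV columns to the names mplfinance expects (assumed input column names)
df = df.rename(columns={'open': 'Open', 'high': 'High', 'low': 'Low', 'close': 'Close', 'volume': 'Volume'})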
Exercise 9: Machine Learning
Concepts:
- Machine Learning
- Scikit-learn library
Description: Write a Python script that reads a dataset containing information about different types of flowers and trains a machine learning model to predict the type of a flower based on its features.
Solution:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Read the dataset into a pandas dataframe
df = pd.read_csv('flower_data.csv')
# Check for missing values
if df.isnull().sum().sum() > 0:
    df = df.dropna()  # Drop rows with missing values
# Define feature columns and target column
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['species']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the feature values
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train a logistic regression model on the training data
model = LogisticRegression(solver='saga', max_iter=5000) # Increased iterations & changed solver
model.fit(X_train, y_train)
# Use the model to predict the output for the testing data
y_pred = model.predict(X_test)
# Evaluate the model performance using the accuracy score metric
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
In this exercise, we first read a dataset containing information about different types of flowers into a pandas dataframe. We split the data into training and testing sets using the train_test_split function from the sklearn.model_selection module and standardize the features with StandardScaler. We trained a logistic regression model on the training data using the LogisticRegression class from the sklearn.linear_model module. Finally, we used the trained model to predict the output for the testing data and evaluated the model performance using the accuracy score metric.
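To classify a new flower, the same fitted scaler must be applied to its measurements before calling predict; a short sketch with made-up measurements:
# Predict the species of a new flower (hypothetical measurements, scaled with the fitted scaler)
new_flower = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]],
                          columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
print("Predicted species:", model.predict(scaler.transform(new_flower))[0])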
Exercise 10: Data Analysis
Concepts:
- Data Analysis
- Recommendation Systems
- Collaborative Filtering
- Surprise library
Description: Write a Python script that reads a CSV file containing customer purchase data and generates a recommendation system that recommends products to customers based on their purchase history.
Solution:
import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split
# Read the CSV file into a pandas dataframe
df = pd.read_csv('purchase_data.csv')
# Ensure that the dataset has no missing values
df = df.dropna(subset=['customer_id', 'product_id', 'rating'])
# Convert the pandas dataframe to a Surprise dataset
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['customer_id', 'product_id', 'rating']], reader)
# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2)
# Train an SVD model on the training data
model = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02)
model.fit(trainset)
# Use the model to predict the output for the testing data
predictions = model.test(testset)
# Evaluate the model performance using the root mean squared error metric
rmse = accuracy.rmse(predictions)
print("RMSE:", rmse)
# Recommend products to customers based on their purchase history
customer_ids = df['customer_id'].unique()
product_ids = df['product_id'].unique()
recommendations = {}
for customer_id in customer_ids:
    purchased_products = set(df[df['customer_id'] == customer_id]['product_id'].values)
    potential_recommendations = []
    for product_id in product_ids:
        if product_id not in purchased_products:
            pred = model.predict(customer_id, product_id)
            potential_recommendations.append((product_id, pred.est))
    # Sort by predicted rating and take the top 5 recommendations
    top_recommendations = sorted(potential_recommendations, key=lambda x: x[1], reverse=True)[:5]
    recommendations[customer_id] = top_recommendations

# Display recommendations
for customer, recs in recommendations.items():
    print(f"Customer {customer} recommended products: {recs}")
In this exercise, we first read a CSV file containing customer purchase data into a pandas dataframe. We convert the pandas dataframe to a Surprise dataset using the Reader and Dataset classes from the surprise module. We split the data into training and testing sets using the train_test_split function from the surprise.model_selection module. We trained an SVD model on the training data using the SVD class from the surprise module. We used the trained model to predict the output for the testing data and evaluated the model performance using the root mean squared error metric. Finally, we recommended products to customers based on their purchase history using the trained model.
Exercise 11: Computer Vision
Concepts:
- Computer Vision
- Object Detection
- OpenCV library
- Pre-trained models
Description: Write a Python script that reads an image and performs object detection on the image using a pre-trained object detection model.
Solution:
import cv2
import numpy as np
# Read the image file
img = cv2.imread('image.jpg')
# Check if the image is loaded correctly
if img is None:
    raise FileNotFoundError("Error: Image file not found or unable to load.")
# Load the pre-trained object detection model
model = cv2.dnn.readNetFromTensorflow('frozen_inference_graph.pb', 'ssd_mobilenet_v2_coco_2018_03_29.pbtxt')
# Prepare the input image for the model
blob = cv2.dnn.blobFromImage(img, size=(300, 300), swapRB=True, crop=False)
model.setInput(blob)
# Perform object detection
output = model.forward()
# Loop through detected objects and draw bounding boxes
h, w, _ = img.shape # Get image dimensions
for detection in output[0, 0, :, :]:
    confidence = float(detection[2])
    if confidence > 0.5:
        x1 = int(detection[3] * w)
        y1 = int(detection[4] * h)
        x2 = int(detection[5] * w)
        y2 = int(detection[6] * h)
        # Draw bounding box with label and confidence score
        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
        label = f'Confidence: {confidence:.2f}'
        cv2.putText(img, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
# Display the image with detections
cv2.imshow('Object Detection', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
In this exercise, we first read an image file into a NumPy array using the imread function from the cv2 module of OpenCV. We load a pre-trained object detection model using the readNetFromTensorflow function from the cv2.dnn module. We set the input image to the model and perform object detection using the setInput and forward methods of the model object. Finally, we loop through the detected objects and draw bounding boxes around them using the rectangle function from the cv2 module.
Exercise 12: Natural Language Processing
Concepts:
- Natural Language Processing
- Topic Modeling
- Latent Dirichlet Allocation
- Gensim library
Description: Write a Python script that reads a text file and performs topic modeling on the text using Latent Dirichlet Allocation (LDA).
Solution:
import gensim
from gensim import corpora
from gensim.models import LdaModel
# Read the text file into a list of strings
with open('input_file.txt', 'r') as f:
    text = f.readlines()
# Remove newlines and convert to lowercase
text = [line.strip().lower() for line in text]
# Tokenize the text into words
tokens = [line.split() for line in text]
# Create a dictionary of words and their frequency
dictionary = corpora.Dictionary(tokens)
# Create a bag-of-words representation of the text
corpus = [dictionary.doc2bow(token) for token in tokens]
# Train an LDA model on the text
model = LdaModel(corpus, id2word=dictionary, num_topics=5, passes=10)
# Print the topics and their associated words
for topic in model.print_topics(num_words=5):
    print(topic)
In this exercise, we first read a text file into a list of strings. We preprocess the text by removing newlines, converting to lowercase, and tokenizing into words using the split method. We create a dictionary of words and their frequency and create a bag-of-words representation of the text using the doc2bow method of the dictionary object. We train an LDA model on the corpus using the LdaModel class from the gensim.models module. Finally, we print the topics and their associated words using the print_topics method of the model object.
Exercise 13: Web Scraping
Concepts:
- Web Scraping
- Beautiful Soup library
- Requests library
- CSV file handling
Description: Write a Python script that scrapes a website for product information and saves the information to a CSV file.
Solution:
import requests
from bs4 import BeautifulSoup
import csv
# Define the URL of the website to scrape
url = 'https://www.example.com/products'
# Add headers to mimic a browser request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Send a request to the website
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code != 200:
    print(f"Failed to fetch data. Status Code: {response.status_code}")
    exit()
# Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Find all the product listings on the page
listings = soup.find_all('div', class_='product-listing')
# Write the product information to a CSV file
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Product Name', 'Price', 'Description'])
    for listing in listings:
        name = listing.find('h3')
        price = listing.find('span', class_='price')
        description = listing.find('p')
        # Extract text safely, handling missing elements
        name = name.get_text(strip=True) if name else 'N/A'
        price = price.get_text(strip=True) if price else 'N/A'
        description = description.get_text(strip=True) if description else 'N/A'
        writer.writerow([name, price, description])
print("Scraping completed. Data saved to 'products.csv'.")
In this exercise, we first define the URL of the website to scrape and send a request to the website using the get function from the requests module. We parse the HTML content of the response using Beautiful Soup and find all the product listings on the page using the find_all method. We write the product information to a CSV file using the csv module.
Exercise 14: Big Data Processing
Concepts:
- Big Data Processing
- PySpark
- Data Transformations
- Aggregation
- Parquet file format
Description: Write a PySpark script that reads a CSV file containing customer purchase data, performs some data transformations and aggregation, and saves the results to a Parquet file.
Solution:
from pyspark.sql import SparkSession
# Create a SparkSession object
spark = SparkSession.builder.appName('customer-purchases').getOrCreate()
# Verify if the file exists before reading (optional but useful)
import os
if not os.path.exists('customer_purchases.csv'):
    raise FileNotFoundError("Error: The file 'customer_purchases.csv' does not exist.")
# Read the CSV file into a Spark DataFrame
df = spark.read.csv('customer_purchases.csv', header=True, inferSchema=True)
# Perform some data transformations
df = df.filter((df['purchase_date'] >= '2020-01-01') & (df['purchase_date'] <= '2020-12-31'))
df = df.select('customer_id', 'product_id', 'price')
# Group by customer and calculate total spending
df = df.groupBy('customer_id').sum('price').withColumnRenamed('sum(price)', 'total_spent')
# Save the results to a Parquet file
df.write.mode('overwrite').parquet('customer_spending.parquet')
print("Processing completed. Data saved to 'customer_spending.parquet'.")
In this exercise, we first create a SparkSession object using the SparkSession class from the pyspark.sql module. We read a CSV file containing customer purchase data into a Spark DataFrame using the read.csv method. We perform some data transformations on the DataFrame using the filter, select, and groupBy methods. Finally, we save the results to a Parquet file using the write.parquet method.
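As a quick sanity check, the Parquet output can be read back into a DataFrame before stopping the session; a short sketch continuing the script above:
# Read the Parquet output back and show a few rows to verify the aggregation
result = spark.read.parquet('customer_spending.parquet')
result.show(5)
spark.stop()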
Exercise 15: DevOps
Concepts:
- DevOps
- Fabric library
Description: Write a Python script that automates the deployment of a web application to a remote server using the Fabric library.
Solution:
from fabric import Connection
import getpass
# Define the host and user credentials for the remote server
host = 'example.com'
user = 'user'
password = getpass.getpass("Enter SSH password: ") # Secure password entry
# Define the path to the web application on the local machine and the remote server
local_path = '/path/to/local/app'
remote_path = '/path/to/remote/app'
# Create a connection to the remote server
c = Connection(host=host, user=user, connect_kwargs={'password': password})
# Ensure the remote directory exists
c.run(f'mkdir -p {remote_path}')
# Upload the local files to the remote server
# (Fabric's put() transfers single files, so pack the app into an archive first)
c.local(f'tar czf /tmp/app.tar.gz -C {local_path} .')
c.put('/tmp/app.tar.gz', f'{remote_path}/app.tar.gz')
c.run(f'tar xzf {remote_path}/app.tar.gz -C {remote_path}')
# Change to the application directory
with c.cd(remote_path):
    # Install required dependencies
    c.run('sudo apt-get update && sudo apt-get install -y python3-pip')
    c.run('pip3 install -r requirements.txt')
    # Start the web application in the background
    c.run('nohup python3 app.py > app.log 2>&1 &', pty=False)
print("Deployment completed successfully.")
In this exercise, we first define the host and user credentials for the remote server, along with the path to the web application on the local machine and on the remote server. We create a connection to the remote server using the Connection class from the fabric module, pack the application into an archive, upload it using the put method of the connection object, and unpack it remotely. We install any required dependencies on the remote server using the run method of the connection object. Finally, we start the web application on the remote server using the run method.
Exercise 16: Reinforcement Learning
Concepts:
- Reinforcement Learning
- Q-Learning
- OpenAI Gym library
Description: Write a Python script that implements a reinforcement learning algorithm to teach an agent to play a simple game.
Solution:
import gym
import numpy as np
import time
# Create the FrozenLake environment
env = gym.make("FrozenLake-v1", is_slippery=True)
# Initialize the Q-table
Q = np.zeros([env.observation_space.n, env.action_space.n])
# Set hyperparameters
alpha = 0.8 # Learning rate
gamma = 0.95 # Discount factor
epsilon = 0.1 # Exploration probability
num_episodes = 2000 # Training episodes
# Train the agent using Q-learning
for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Choose action using epsilon-greedy policy
        if np.random.uniform() < epsilon:
            action = env.action_space.sample()  # Random action (exploration)
        else:
            action = np.argmax(Q[state, :])  # Best action from Q-table
        # Take the action and observe the next state
        next_state, reward, done, _, _ = env.step(action)
        # Update Q-value using the Bellman equation
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state, :]))
        # Move to the next state
        state = next_state
# Test the agent by playing the game
state, _ = env.reset()
done = False
print("\nTesting trained agent:\n")
while not done:
    action = np.argmax(Q[state, :])
    next_state, reward, done, _, _ = env.step(action)
    # Render the environment
    env.render()
    time.sleep(0.5)  # Pause for visibility
    state = next_state
print("\nGame Over!")
In this exercise, we first create an OpenAI Gym environment for the game using the make function from the gym module. We define the Q-table for the agent as a NumPy array and set the hyperparameters for the Q-learning algorithm. We train the agent using the Q-learning algorithm by looping through a specified number of episodes and updating the Q-table based on the rewards and next states. Finally, we test the agent by playing the game using the Q-table and visualizing the game using the render method.
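A single greedy episode says little about how good the learned policy is on a slippery lake; a rough sketch, continuing the script above, for estimating its success rate over many evaluation episodes (the 100-episode count is arbitrary):
# Estimate the success rate of the greedy policy over several episodes
eval_episodes = 100
successes = 0
for _ in range(eval_episodes):
    state, _ = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        action = int(np.argmax(Q[state, :]))
        state, reward, terminated, truncated, _ = env.step(action)
    successes += int(reward == 1.0)
print(f"Success rate over {eval_episodes} episodes: {successes / eval_episodes:.2%}")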
Exercise 17: Time Series Analysis
Concepts:
- Time Series Analysis
- Data Preprocessing
- Data Visualization
- ARIMA model
- Statsmodels library
Description: Write a Python script that reads a CSV file containing time series data, performs some data preprocessing and visualization, and fits a time series model to the data.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Read the CSV file into a pandas dataframe
df = pd.read_csv('time_series.csv')
# Convert the date column to a datetime object and set it as the index
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Check for missing values before resampling
if df.isnull().values.any():
    df = df.fillna(method='ffill')
# Ensure the column name is correct
target_col = df.columns[0] # Assuming first column is the time series value
# Resample the data to a monthly frequency
df = df.resample('M').mean()
# Plot the time series data
plt.figure(figsize=(10, 5))
plt.plot(df.index, df[target_col], label="Time Series")
plt.xlabel("Date")
plt.ylabel("Value")
plt.title("Time Series Visualization")
plt.legend()
plt.grid()
plt.show()
# Fit an ARIMA model
model = sm.tsa.ARIMA(df[target_col].dropna(), order=(1, 1, 1)) # Use dropna() to avoid errors
results = model.fit()
# Print the model summary
print(results.summary())
In this exercise, we first read a CSV file containing time series data into a pandas dataframe. We convert the date column to a datetime object and set it as the index, fill any missing values using forward fill, and resample the data to a monthly frequency. We visualize the data using the plot function from the matplotlib.pyplot module. Finally, we fit an ARIMA model to the data using the ARIMA class exposed through the statsmodels.api module and print the summary of the model using the summary method of the results object.
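With the fitted results object, future values can be projected as well; a minimal sketch (the 12-step horizon is arbitrary):
# Forecast the next 12 periods from the fitted ARIMA model
forecast = results.forecast(steps=12)
print(forecast)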
Exercise 18: Computer Networking
Concepts:
- Computer Networking
- TCP/IP Protocol
- Socket Programming
Description: Write a Python script that implements a simple TCP server that accepts client connections and sends and receives data.
Solution:
import socket
# Define the host and port for the server
host = 'localhost'
port = 12345
# Create a socket object
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Bind the socket to the host and port
s.bind((host, port))
# Listen for incoming connections
s.listen(1)
print('Server listening on', host, port)
# Accept a client connection
conn, addr = s.accept()
print('Connected by', addr)
# Send data to the client
conn.sendall(b'Hello, client!')
# Receive data from the client
data = conn.recv(1024)
print('Received:', data.decode())
# Close the connection
conn.close()
In this exercise, we first define the host and port for the server. We create a socket object using the socket function from the socket module and bind the socket to the host and port using the bind method. We listen for incoming connections using the listen method and accept a client connection using the accept method, which returns a connection object and the address of the client. We send data to the client using the sendall method of the connection object and receive data from the client using the recv method. Finally, we close the connection using the close method.
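To try the server, a matching client is needed; a minimal sketch that connects to the same host and port, reads the greeting, and replies:
import socket

# Simple TCP client for the server above
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('localhost', 12345))
# Receive the server's greeting
print('Received:', client.recv(1024).decode())
# Send a reply back to the server
client.sendall(b'Hello, server!')
client.close()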
Exercise 19: Data Analysis and Visualization
Concepts:
- Data Analysis
- Data Visualization
- PDF Report Generation
- Pandas library
- Matplotlib library
- ReportLab library
Description: Write a Python script that reads a CSV file containing sales data for a retail store, performs some data analysis and visualization, and saves the results to a PDF report.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
import os
# Read the CSV file into a pandas dataframe
df = pd.read_csv('sales_data.csv')
# Calculate the total sales by category and month
totals = df.groupby(['category', 'month'])['sales'].sum()
# Get unique categories
categories = df['category'].unique()
# Create subplots dynamically based on the number of categories
fig, axes = plt.subplots(nrows=len(categories), ncols=1, figsize=(8.5, 11))
# Ensure `axes` is always iterable (even if there's only one category)
if len(categories) == 1:
    axes = [axes]

# Plot total sales by category and month
for i, category in enumerate(categories):
    totals.loc[category].plot(ax=axes[i], kind='bar', title=f"Category: {category}")
    axes[i].set_ylabel("Sales")
plt.tight_layout()
plt.savefig('sales_plot.png') # Save the figure
plt.close(fig) # Close to free memory
# Create a PDF report
pdf_filename = 'sales_report.pdf'
c = canvas.Canvas(pdf_filename, pagesize=letter)
# Add title and description
c.setFont("Helvetica-Bold", 16)
c.drawString(50, 750, 'Sales Report')
c.setFont("Helvetica", 12)
c.drawString(50, 730, 'Total Sales by Category and Month')
# Add the image to the PDF if it exists
if os.path.exists('sales_plot.png'):
    c.drawImage('sales_plot.png', 50, 450, width=500, height=300)
# Save and close the PDF
c.showPage()
c.save()
print(f"Report saved as {pdf_filename}")
In this exercise, we first read a CSV file containing sales data for a retail store into a pandas dataframe. We calculate the total sales by category and month using the groupby and sum methods. We plot the total sales by category and month using pandas' plot method on Matplotlib axes and save the figure to a PNG file. Finally, we generate a PDF report using the Canvas class from the reportlab.pdfgen module and embed the saved plot with the drawImage method.
Exercise 20: Machine Learning
Concepts:
- Machine Learning
- Convolutional Neural Networks
- Keras library
- MNIST dataset
Description: Write a Python script that trains a machine learning model to classify images of handwritten digits from the MNIST dataset.
Solution:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# Normalize the pixel values and reshape the data
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)
# Define the CNN model
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),  # Fully connected layer
    layers.Dropout(0.5),  # Prevent overfitting
    layers.Dense(10, activation='softmax')
])
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test), batch_size=64)
# Evaluate the model on the test data
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print('Test accuracy:', test_acc)
In this exercise, we first load the MNIST dataset using the load_data function from the keras.datasets.mnist module. We normalize the pixel values and reshape the data. We define a convolutional neural network model using the Sequential class and various layers from the layers module of Keras. We compile the model using the compile method with the Adam optimizer and sparse categorical crossentropy loss function. We train the model using the fit method and evaluate the model on the test data using the evaluate method.
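Once trained, the model can classify individual images; a short usage sketch for the first test image:
import numpy as np

# Predict the digit for the first test image and compare it with the true label
probs = model.predict(x_test[:1])
print("Predicted digit:", np.argmax(probs[0]), "| True label:", y_test[0])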
Exercise 21: Natural Language Processing
Concepts:
- Natural Language Processing
- Text Preprocessing
- Text Representation
- Topic Modeling
- Latent Dirichlet Allocation
- Gensim library
Description: Write a Python script that uses natural language processing techniques to analyze a corpus of text data and extract useful insights.
Solution:
import gensim
from gensim import corpora
from gensim.models import LdaModel
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download required resources
nltk.download('stopwords')
nltk.download('punkt')
# Read the text data into a pandas dataframe
df = pd.read_csv('text_data.csv')
# Handle missing values
df['text'] = df['text'].fillna('')
# Define stop words and clean text
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Tokenization & lowercasing
    return [word for word in tokens if word.isalnum() and word not in stop_words]  # Remove punctuation & stopwords
df['cleaned_text'] = df['text'].apply(preprocess_text)
# Create a document-term matrix
texts = df['cleaned_text'].tolist()
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Train LDA model
num_topics = 5
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
# Print topics and top words for each
for topic_id, words in lda_model.show_topics(num_topics=num_topics, formatted=False):
    print(f'Topic {topic_id}:', ', '.join(word for word, _ in words))
# Convert topic distributions into a structured DataFrame
topic_dists = [{f"Topic_{topic}": prob for topic, prob in lda_model.get_document_topics(doc, minimum_probability=0)} for doc in corpus]
topic_df = pd.DataFrame(topic_dists)
# Merge topic distributions with original data
df = pd.concat([df, topic_df], axis=1)
# Save the results
df.to_csv('text_data_topics.csv', index=False)
print("Saved processed data to 'text_data_topics.csv'.")
In this exercise, we first read a corpus of text data into a pandas dataframe. We define the stop words using the stopwords corpus from the nltk.corpus module and remove them from the text data using a list comprehension and the apply method of pandas. We create a dictionary and a bag-of-words corpus from the text data using the Dictionary class and its doc2bow method from the gensim.corpora module. We perform topic modeling with latent Dirichlet allocation (LDA) using the LdaModel class and extract the topic distributions for each document. Finally, we save the results to a CSV file using the to_csv method of pandas.
Exercise 22: Web Scraping
Concepts:
- Web Scraping
- HTML Parsing
- BeautifulSoup library
- CSV File I/O
Description: Write a Python script that scrapes data from a website using the BeautifulSoup library and saves it to a CSV file.
Solution:
import requests
from bs4 import BeautifulSoup
import csv
# Define the URL to scrape
url = 'https://www.example.com'
# Headers to mimic a real browser request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Send a GET request
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code != 200:
    print(f"Error: Unable to fetch data (Status Code: {response.status_code})")
    exit()
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')
# Extract data
data = []
for item in soup.find_all('div', class_='item'):
    name_tag = item.find('h3')
    price_tag = item.find('span', class_='price')
    # Extract text safely, handling missing elements
    name = name_tag.get_text(strip=True) if name_tag else 'N/A'
    price = price_tag.get_text(strip=True) if price_tag else 'N/A'
    data.append([name, price])
# Save to CSV
csv_filename = 'data.csv'
with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Name', 'Price'])  # Add headers
    writer.writerows(data)
print(f"Scraping completed. Data saved to '{csv_filename}'.")
In this exercise, we first define the URL to scrape and send a GET request with the requests library, then parse the HTML content using the BeautifulSoup library. We extract the data from the HTML content using the find_all and find methods of the soup object. Finally, we save the data to a CSV file using the csv module.
Exercise 23: Database Interaction
Concepts:
- Database Interaction
- SQLite database
- SQL queries
- SQLite3 module
Description: Write a Python script that interacts with a database to retrieve and manipulate data.
Solution:
import sqlite3
# Connect to the database
conn = sqlite3.connect('example.db')
# Create a cursor object
c = conn.cursor()
# Execute an SQL query to create a table
c.execute('''CREATE TABLE IF NOT EXISTS customers
             (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)''')
# Execute an SQL query to insert data into the table
c.execute("INSERT INTO customers (name, email, phone) VALUES ('John Smith', 'john@example.com', '555-1234')")
# Execute an SQL query to retrieve data from the table
c.execute("SELECT * FROM customers")
rows = c.fetchall()
for row in rows:
    print(row)
# Execute an SQL query to update data in the table
c.execute("UPDATE customers SET phone='555-5678' WHERE name='John Smith'")
# Execute an SQL query to delete data from the table
c.execute("DELETE FROM customers WHERE name='John Smith'")
# Commit the changes to the database
conn.commit()
# Close the database connection
conn.close()
In this exercise, we first connect to an SQLite database using the connect function from the sqlite3 module. We create a cursor object using the cursor method of the connection object and execute SQL queries using the execute method of the cursor object. We retrieve data from the table using the fetchall method and print the results. We update data in the table using the UPDATE statement and delete data from the table using the DELETE statement. Finally, we commit the changes to the database and close the connection.
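When queries contain user-supplied values, it is safer to use parameterized queries than to build SQL strings by hand; a short sketch against the same table, with hypothetical values:
# Parameterized queries let sqlite3 handle quoting and protect against SQL injection
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute("INSERT INTO customers (name, email, phone) VALUES (?, ?, ?)",
          ('Jane Doe', 'jane@example.com', '555-9876'))
c.execute("SELECT * FROM customers WHERE name = ?", ('Jane Doe',))
print(c.fetchall())
conn.commit()
conn.close()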
Exercise 24: Parallel Processing
Concepts:
- Parallel Processing
- Multiprocessing
- Process Pool
- CPU-bound tasks
Description: Write a Python script that performs a time-consuming computation using parallel processing to speed up the computation.
Solution:
import time
import multiprocessing
# Define a CPU-bound function that takes noticeable time to compute
def compute(num):
    # Sum of squares up to num -- deliberately loop-based so multiple processes pay off
    total = 0
    for i in range(num):
        total += i * i
    return total

if __name__ == '__main__':
    # Create a process pool with the number of CPUs available
    num_cpus = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(num_cpus)
    # Generate a list of numbers to compute
    num_list = [10000000] * num_cpus
    # Compute the results using parallel processing
    start_time = time.time()
    results = pool.map(compute, num_list)
    # Close the pool properly
    pool.close()
    pool.join()
    end_time = time.time()
    # Print the results and computation time
    print('Results:', results)
    print('Computation time:', end_time - start_time, 'seconds')
In this exercise, we first define a CPU-bound function that takes a long time to compute. We then create a process pool using the Pool function from the multiprocessing module with the number of CPUs available. We generate a list of numbers to compute and compute the results using the map method of the process pool. Finally, we print the results and computation time.
Exercise 25: Image Processing
Concepts:
- Image Processing
- Pillow library
- Image Manipulation
- Image Filtering
Description: Write a Python script that performs basic image processing operations on an image file.
Solution:
from PIL import Image, ImageFilter
import os
# Define image paths
input_path = 'example.jpg'
output_path = 'processed.jpg'
# Check if the input file exists
if not os.path.exists(input_path):
    raise FileNotFoundError(f"Error: The file '{input_path}' was not found.")

try:
    # Open the image file using a context manager
    with Image.open(input_path) as image:
        # Display the original image (optional, may not work in all environments)
        image.show()
        # Resize the image
        image = image.resize((500, 500))
        # Convert the image to grayscale
        image = image.convert('L')
        # Apply a Gaussian blur filter
        image = image.filter(ImageFilter.GaussianBlur(radius=2))
        # Save the processed image to a file
        image.save(output_path)
        # Display the processed image
        image.show()
    print(f"Processed image saved as '{output_path}'.")
except Exception as e:
    print(f"An error occurred: {e}")
In this exercise, we first open an image file using the Image class from the Pillow library. We resize the image using the resize method and convert it to grayscale using the convert method with the 'L' mode. We apply a Gaussian blur filter using the filter method with the GaussianBlur class from the ImageFilter module. Finally, we save the processed image to a file using the save method and display it using the show method.
Advance Level Exercises Part 1
Exercise 1: File Parsing
Concepts:
- File I/O
- Regular expressions
Description: Write a Python script that reads a text file and extracts all URLs that are present in the file. The output should be a list of URLs.
Solution:
import re
# Open the file for reading
with open('input_file.txt', 'r') as f:
# Read the file contents
file_contents = f.read()
# Use regular expression to extract URLs
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', file_contents)
# Print the list of URLs
print(urls)
Exercise 2: Data Analysis
Concepts:
- File I/O
- Data manipulation
- Pandas library
Description: Write a Python script that reads a CSV file containing sales data and calculates the total sales revenue for each product category.
Solution:
import pandas as pd
# Read the CSV file into a pandas dataframe
df = pd.read_csv('sales_data.csv')
# Group the data by product category and sum the sales revenue
total_revenue = df.groupby('Product Category')['Sales Revenue'].sum()
# Print the total revenue for each product category
print(total_revenue)
Exercise 3: Web Scraping
Concepts
- Web scraping
- Requests library
- Beautiful Soup library
- CSV file I/O
Description: Write a Python script that scrapes the title and price of all products listed on an e-commerce website and stores them in a CSV file.
Solution:
import requests
from bs4 import BeautifulSoup
import csv
# Define the target URL
url = 'https://www.example.com/products'
# Headers to mimic a real browser request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Make a GET request to the website
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code == 200:
# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')
# Find all product titles and prices
titles = [title.get_text(strip=True) for title in soup.find_all('h3', class_='product-title')]
prices = [price.get_text(strip=True) for price in soup.find_all('div', class_='product-price')]
# Zip the titles and prices together
data = list(zip(titles, prices))
# Write the data to a CSV file with headers
with open('product_data.csv', 'w', newline='', encoding='utf-8') as f:
writer = csv.writer(f)
writer.writerow(['Product Title', 'Price']) # Add headers
writer.writerows(data)
print("Scraping completed. Data saved to 'product_data.csv'.")
else:
print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
Exercise 4: Multithreading
Concepts:
- Multithreading
- Requests library
- Threading library
Description: Write a Python script that uses multithreading to download multiple images from a URL list simultaneously.
Solution:
import requests
import threading
# URL list of images to download
url_list = ['https://www.example.com/image1.jpg', 'https://www.example.com/image2.jpg', 'https://www.example.com/image3.jpg']
# Function to download an image from a URL
def download_image(url):
response = requests.get(url)
with open(url.split('/')[-1], 'wb') as f:
f.write(response.content)
# Create a thread for each URL and start them all simultaneously
threads = []
for url in url_list:
thread = threading.Thread(target=download_image, args=(url,))
threads.append(thread)
thread.start()
# Wait for all threads to finish
for thread in threads:
thread.join()
Exercise 5: Machine Learning
Concepts:
- Machine learning
- Scikit-learn library
Description: Write a Python script that trains a machine learning model on a dataset and uses it to predict the output for new data.
Solution:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Read the dataset into a pandas dataframe
df = pd.read_csv('dataset.csv')
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[['feature1', 'feature2']], df['target'], test_size=0.2, random_state=42)
# Train a linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)
# Use the model to predict the output for the testing data
y_pred = model.predict(X_test)
# Evaluate the model performance using the mean squared error metric
mse = ((y_test - y_pred) ** 2).mean()
print("Mean squared error:", mse)
In this exercise, we first read a dataset into a pandas dataframe. Then, we split the data into training and testing sets using the train_test_split
function from the sklearn.model_selection
module. We trained a linear regression model on the training data using the LinearRegression
class from the sklearn.linear_model
module. Finally, we used the trained model to predict the output for the testing data and evaluated the model performance using the mean squared error metric.
Exercise 6: Natural Language Processing
Concepts:
- Natural Language Processing
- Sentiment Analysis
- NLTK library
Description: Write a Python script that reads a text file and performs sentiment analysis on the text using a pre-trained NLP model.
Solution:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Ensure the VADER lexicon is downloaded
nltk.download('vader_lexicon')
# Read the text file into a string
with open('input_file.txt', 'r', encoding='utf-8') as f:
text = f.read()
# Create a SentimentIntensityAnalyzer object
sid = SentimentIntensityAnalyzer()
# Perform sentiment analysis on the text
scores = sid.polarity_scores(text)
# Print the sentiment scores
print(scores)
In this exercise, we first read a text file into a string. Then, we create a SentimentIntensityAnalyzer
object from the nltk.sentiment.vader
module. We use the polarity_scores
method of the SentimentIntensityAnalyzer
object to perform sentiment analysis on the text and get a dictionary of sentiment scores.
Exercise 7: Web Development
Concepts:
- Web Development
- Flask framework
- File Uploads
Description: Write a Python script that creates a web application using the Flask framework that allows users to upload a file and performs some processing on the file.
Solution:
from flask import Flask, render_template, request
import os
app = Flask(__name__)
# Set the path for file uploads
UPLOAD_FOLDER = 'uploads'
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
# Ensure the upload directory exists
if not os.path.exists(UPLOAD_FOLDER):
os.makedirs(UPLOAD_FOLDER)
# Route for the home page
@app.route('/')
def index():
return render_template('index.html')
# Route for file uploads
@app.route('/upload', methods=['POST'])
def upload():
if 'file' not in request.files:
return 'No file part', 400
file = request.files['file']
if file.filename == '':
return 'No selected file', 400
# Save the file to the uploads folder
file.save(os.path.join(app.config['UPLOAD_FOLDER'], file.filename))
return 'File uploaded successfully'
if __name__ == '__main__':
app.run(debug=True)
In this exercise, we first import the Flask module and create a Flask application. We set up a route for the home page that returns an HTML template. We set up a route for file uploads that receives an uploaded file and saves it to a designated uploads folder. We can perform processing on the uploaded file inside the upload
function.
Exercise 8: Data Visualization
Concepts:
- Data Visualization
- Matplotlib library
- Candlestick Charts
Description: Write a Python script that reads a CSV file containing stock market data and plots a candlestick chart of the data.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
import mplfinance as mpf
# Read the CSV file into a pandas dataframe
df = pd.read_csv('stock_data.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True) # Set Date as index
# Plot the candlestick chart using mplfinance
mpf.plot(df, type='candle', style='charles', title='Stock Market Data', ylabel='Price')
# Display the chart
plt.show()
In this exercise, we first read a CSV file containing stock market data into a pandas dataframe. We convert the date column to Matplotlib dates format and create a figure and axis objects. We plot the candlestick chart using the candlestick_ohlc
function from the mpl_finance
module. We format the x-axis as dates and set the axis labels and title. Finally, we display the chart using the show
function from the matplotlib.pyplot
module.
Exercise 9: Machine Learning
Concepts:
- Machine Learning
- Scikit-learn library
Description: Write a Python script that reads a dataset containing information about different types of flowers and trains a machine learning model to predict the type of a flower based on its features.
Solution:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Read the dataset into a pandas dataframe
df = pd.read_csv('flower_data.csv')
# Check for missing values
if df.isnull().sum().sum() > 0:
df = df.dropna() # Drop rows with missing values
# Define feature columns and target column
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['species']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the feature values
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train a logistic regression model on the training data
model = LogisticRegression(solver='saga', max_iter=5000) # Increased iterations & changed solver
model.fit(X_train, y_train)
# Use the model to predict the output for the testing data
y_pred = model.predict(X_test)
# Evaluate the model performance using the accuracy score metric
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
In this exercise, we first read a dataset containing information about different types of flowers into a pandas dataframe. We split the data into training and testing sets using the train_test_split
function from the sklearn.model_selection
module. We trained a logistic regression model on the training data using the LogisticRegression
class from the sklearn.linear_model
module. Finally, we used the trained model to predict the output for the testing data and evaluated the model performance using the accuracy score metric.
Exercise 10: Data Analysis
Concepts:
- Data Analysis
- Recommendation Systems
- Collaborative Filtering
- Surprise library
Description: Write a Python script that reads a CSV file containing customer purchase data and generates a recommendation system that recommends products to customers based on their purchase history.
Solution:
import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split
# Read the CSV file into a pandas dataframe
df = pd.read_csv('purchase_data.csv')
# Ensure that the dataset has no missing values
df = df.dropna(subset=['customer_id', 'product_id', 'rating'])
# Convert the pandas dataframe to a Surprise dataset
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['customer_id', 'product_id', 'rating']], reader)
# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2)
# Train an SVD model on the training data
model = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02)
model.fit(trainset)
# Use the model to predict the output for the testing data
predictions = model.test(testset)
# Evaluate the model performance using the root mean squared error metric
rmse = accuracy.rmse(predictions)
print("RMSE:", rmse)
# Recommend products to customers based on their purchase history
customer_ids = df['customer_id'].unique()
product_ids = df['product_id'].unique()
recommendations = {}
for customer_id in customer_ids:
purchased_products = set(df[df['customer_id'] == customer_id]['product_id'].values)
potential_recommendations = []
for product_id in product_ids:
if product_id not in purchased_products:
pred = model.predict(customer_id, product_id)
potential_recommendations.append((product_id, pred.est))
# Sort by predicted rating and take the top 5 recommendations
top_recommendations = sorted(potential_recommendations, key=lambda x: x[1], reverse=True)[:5]
recommendations[customer_id] = top_recommendations
# Display recommendations
for customer, recs in recommendations.items():
print(f"Customer {customer} recommended products: {recs}")
In this exercise, we first read a CSV file containing customer purchase data into a pandas dataframe. We convert the pandas dataframe to a surprise dataset using the Reader
and Dataset
classes from the surprise
module. We split the data into training and testing sets using the train_test_split
function from the surprise.model_selection
module. We trained an SVD model on the training data using the SVD
class from the surprise
module. We used the trained model to predict the output for the testing data and evaluated the model performance using the root mean squared error metric. Finally, we recommended products to customers based on their purchase history using the trained model.
Exercise 11: Computer Vision
Concepts:
- Computer Vision
- Object Detection
- OpenCV library
- Pre-trained models
Description: Write a Python script that reads an image and performs object detection on the image using a pre-trained object detection model.
Solution:
import cv2
import numpy as np
# Read the image file
img = cv2.imread('image.jpg')
# Check if the image is loaded correctly
if img is None:
raise FileNotFoundError("Error: Image file not found or unable to load.")
# Load the pre-trained object detection model
model = cv2.dnn.readNetFromTensorflow('frozen_inference_graph.pb', 'ssd_mobilenet_v2_coco_2018_03_29.pbtxt')
# Prepare the input image for the model
blob = cv2.dnn.blobFromImage(img, size=(300, 300), swapRB=True, crop=False)
model.setInput(blob)
# Perform object detection
output = model.forward()
# Loop through detected objects and draw bounding boxes
h, w, _ = img.shape # Get image dimensions
for detection in output[0, 0, :, :]:
confidence = float(detection[2])
if confidence > 0.5:
x1 = int(detection[3] * w)
y1 = int(detection[4] * h)
x2 = int(detection[5] * w)
y2 = int(detection[6] * h)
# Draw bounding box with label and confidence score
cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
label = f'Confidence: {confidence:.2f}'
cv2.putText(img, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
# Display the image with detections
cv2.imshow('Object Detection', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
In this exercise, we first read an image file into a NumPy array using the imread
function from the cv2
module of OpenCV. We load a pre-trained object detection model using the readNetFromTensorflow
function from the cv2.dnn
module. We set the input image to the model and perform object detection using the setInput
and forward
methods of the model object. Finally, we loop through the detected objects and draw bounding boxes around them using the rectangle
function from the cv2
module.
Exercise 12: Natural Language Processing
Concepts:
- Natural Language Processing
- Topic Modeling
- Latent Dirichlet Allocation
- Gensim library
Description: Write a Python script that reads a text file and performs topic modeling on the text using Latent Dirichlet Allocation (LDA).
Solution:
import gensim
from gensim import corpora
from gensim.models import LdaModel
# Read the text file into a list of strings
with open('input_file.txt', 'r') as f:
text = f.readlines()
# Remove newlines and convert to lowercase
text = [line.strip().lower() for line in text]
# Tokenize the text into words
tokens = [line.split() for line in text]
# Create a dictionary of words and their frequency
dictionary = corpora.Dictionary(tokens)
# Create a bag-of-words representation of the text
corpus = [dictionary.doc2bow(token) for token in tokens]
# Train an LDA model on the text
model = LdaModel(corpus, id2word=dictionary, num_topics=5, passes=10)
# Print the topics and their associated words
for topic in model.print_topics(num_words=5):
print(topic)
In this exercise, we first read a text file into a list of strings. We preprocess the text by removing newlines, converting to lowercase, and tokenizing into words using the split method. We create a dictionary of words and their frequencies and build a bag-of-words representation of the text using the doc2bow method of the dictionary object. We train an LDA model on the corpus using the LdaModel class from the gensim.models module. Finally, we print the topics and their associated words using the print_topics method of the model object.
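Once the model is trained, it can also infer the topic mixture of text it has not seen before; a minimal sketch (the example sentence is made up):
new_doc = "machine learning models need lots of data".lower().split()
# get_document_topics returns (topic id, probability) pairs for the new document
print(model.get_document_topics(dictionary.doc2bow(new_doc)))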
Exercise 13: Web Scraping
Concepts:
- Web Scraping
- Beautiful Soup library
- Requests library
- CSV file handling
Description: Write a Python script that scrapes a website for product information and saves the information to a CSV file.
Solution:
import requests
from bs4 import BeautifulSoup
import csv
# Define the URL of the website to scrape
url = 'https://www.example.com/products'
# Add headers to mimic a browser request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Send a request to the website
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code != 200:
    print(f"Failed to fetch data. Status Code: {response.status_code}")
    exit()
# Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Find all the product listings on the page
listings = soup.find_all('div', class_='product-listing')
# Write the product information to a CSV file
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Product Name', 'Price', 'Description'])
    for listing in listings:
        name = listing.find('h3')
        price = listing.find('span', class_='price')
        description = listing.find('p')
        # Extract text safely, handling missing elements
        name = name.get_text(strip=True) if name else 'N/A'
        price = price.get_text(strip=True) if price else 'N/A'
        description = description.get_text(strip=True) if description else 'N/A'
        writer.writerow([name, price, description])
print("Scraping completed. Data saved to 'products.csv'.")
In this exercise, we first define the URL of the website to scrape and send a request to the website using the get function from the requests module. We parse the HTML content of the response using Beautiful Soup and find all the product listings on the page using the find_all method. Finally, we write the product information to a CSV file using the csv module.
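Product catalogues usually span several pages, so one natural extension is to repeat the request/parse step over a page query parameter before writing the CSV. A rough sketch, where the '?page=' parameter is an assumption about the example site:
listings = []
for page in range(1, 4):  # hypothetical pages 1-3
    response = requests.get(f'{url}?page={page}', headers=headers)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.content, 'html.parser')
    listings.extend(soup.find_all('div', class_='product-listing'))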
Exercise 14: Big Data Processing
Concepts:
- Big Data Processing
- PySpark
- Data Transformations
- Aggregation
- Parquet file format
Description: Write a PySpark script that reads a CSV file containing customer purchase data, performs some data transformations and aggregation, and saves the results to a Parquet file.
Solution:
from pyspark.sql import SparkSession
# Create a SparkSession object
spark = SparkSession.builder.appName('customer-purchases').getOrCreate()
# Verify if the file exists before reading (optional but useful)
import os
if not os.path.exists('customer_purchases.csv'):
    raise FileNotFoundError("Error: The file 'customer_purchases.csv' does not exist.")
# Read the CSV file into a Spark DataFrame
df = spark.read.csv('customer_purchases.csv', header=True, inferSchema=True)
# Perform some data transformations
df = df.filter((df['purchase_date'] >= '2020-01-01') & (df['purchase_date'] <= '2020-12-31'))
df = df.select('customer_id', 'product_id', 'price')
# Group by customer and calculate total spending
df = df.groupBy('customer_id').sum('price').withColumnRenamed('sum(price)', 'total_spent')
# Save the results to a Parquet file
df.write.mode('overwrite').parquet('customer_spending.parquet')
print("Processing completed. Data saved to 'customer_spending.parquet'.")
In this exercise, we first create a SparkSession object using the SparkSession class from the pyspark.sql module. We read a CSV file containing customer purchase data into a Spark DataFrame using the read.csv method. We perform some data transformations on the DataFrame using the filter, select, and groupBy methods. Finally, we save the results to a Parquet file using the write.parquet method.
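A quick way to verify the output is to read the Parquet file back into a DataFrame and inspect a few rows before stopping the session:
spending = spark.read.parquet('customer_spending.parquet')
spending.show(5)  # Print the first five aggregated rows
spark.stop()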
Exercise 15: DevOps
Concepts:
- DevOps
- Fabric library
Description: Write a Python script that automates the deployment of a web application to a remote server using the Fabric library.
Solution:
from fabric import Connection
import getpass
# Define the host and user credentials for the remote server
host = 'example.com'
user = 'user'
password = getpass.getpass("Enter SSH password: ") # Secure password entry
# Define the path to the web application on the local machine and the remote server
local_path = '/path/to/local/app'
remote_path = '/path/to/remote/app'
# Create a connection to the remote server
c = Connection(host=host, user=user, connect_kwargs={'password': password})
# Ensure the remote directory exists
c.run(f'mkdir -p {remote_path}')
# Upload the application to the remote server
# (Connection.put transfers a single file, so the app is packed into a tar archive first)
c.local(f'tar -czf /tmp/app.tar.gz -C {local_path} .')
c.put('/tmp/app.tar.gz', f'{remote_path}/app.tar.gz')
c.run(f'tar -xzf {remote_path}/app.tar.gz -C {remote_path}')
# Change to the application directory
with c.cd(remote_path):
    # Install required dependencies
    c.run('sudo apt-get update && sudo apt-get install -y python3-pip')
    c.run('pip3 install -r requirements.txt')
    # Start the web application in the background
    c.run('nohup python3 app.py > app.log 2>&1 &', pty=False)
print("Deployment completed successfully.")
In this exercise, we first define the host and user credentials for the remote server and the paths to the web application on the local machine and the remote server. We create a connection to the remote server using the Connection class from the fabric module. We pack the application into a tar archive, upload it using the put method of the connection object, and extract it on the server. We install the required dependencies on the remote server using the run method of the connection object. Finally, we start the web application on the remote server using the run method.
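After the deployment it is worth checking that the process actually started; a small optional addition using the same connection (the process name matches the app.py started above):
# Confirm the app is running and show the last few log lines
c.run('pgrep -f app.py || echo "app.py is not running"', warn=True)
c.run(f'tail -n 5 {remote_path}/app.log', warn=True)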
Exercise 16: Reinforcement Learning
Concepts:
- Reinforcement Learning
- Q-Learning
- OpenAI Gym library
Description: Write a Python script that implements a reinforcement learning algorithm to teach an agent to play a simple game.
Solution:
import gym
import numpy as np
import time
# Create the FrozenLake environment
env = gym.make("FrozenLake-v1", is_slippery=True)
# Initialize the Q-table
Q = np.zeros([env.observation_space.n, env.action_space.n])
# Set hyperparameters
alpha = 0.8 # Learning rate
gamma = 0.95 # Discount factor
epsilon = 0.1 # Exploration probability
num_episodes = 2000 # Training episodes
# Train the agent using Q-learning
for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Choose action using epsilon-greedy policy
        if np.random.uniform() < epsilon:
            action = env.action_space.sample()  # Random action (exploration)
        else:
            action = np.argmax(Q[state, :])  # Best action from Q-table
        # Take the action and observe the next state (termination and truncation are reported separately)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Update Q-value using the Bellman equation
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state, :]))
        # Move to the next state
        state = next_state
# Test the agent by playing the game (recreate the environment in text-rendering mode)
env = gym.make("FrozenLake-v1", is_slippery=True, render_mode="ansi")
state, _ = env.reset()
done = False
print("\nTesting trained agent:\n")
while not done:
    action = np.argmax(Q[state, :])
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    # Render the environment as text
    print(env.render())
    time.sleep(0.5)  # Pause for visibility
    state = next_state
print("\nGame Over!")
In this exercise, we first create an OpenAI Gym environment for the game using the make function from the gym module. We define the Q-table for the agent as a NumPy array and set the hyperparameters for the Q-learning algorithm. We train the agent by looping through a specified number of episodes and updating the Q-table based on the rewards and next states. Finally, we test the agent by playing the game greedily from the Q-table and visualizing each step using the render method.
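To put a number on how good the learned policy is, you can replay it greedily (no exploration) for many episodes and count how often the agent reaches the goal; a minimal sketch reusing the Q-table from above:
eval_env = gym.make("FrozenLake-v1", is_slippery=True)
wins = 0
for _ in range(1000):
    s, _ = eval_env.reset()
    finished = False
    while not finished:
        s, r, term, trunc, _ = eval_env.step(np.argmax(Q[s, :]))
        finished = term or trunc
    wins += r  # the reward is 1 only when the goal state is reached
print("Success rate:", wins / 1000)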
Exercise 17: Time Series Analysis
Concepts:
- Time Series Analysis
- Data Preprocessing
- Data Visualization
- ARIMA model
- Statsmodels library
Description: Write a Python script that reads a CSV file containing time series data, performs some data preprocessing and visualization, and fits a time series model to the data.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Read the CSV file into a pandas dataframe
df = pd.read_csv('time_series.csv')
# Convert the date column to a datetime object and set it as the index
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Check for missing values before resampling
if df.isnull().values.any():
    df = df.fillna(method='ffill')
# Ensure the column name is correct
target_col = df.columns[0] # Assuming first column is the time series value
# Resample the data to a monthly frequency
df = df.resample('M').mean()
# Plot the time series data
plt.figure(figsize=(10, 5))
plt.plot(df.index, df[target_col], label="Time Series")
plt.xlabel("Date")
plt.ylabel("Value")
plt.title("Time Series Visualization")
plt.legend()
plt.grid()
plt.show()
# Fit an ARIMA model
model = sm.tsa.ARIMA(df[target_col].dropna(), order=(1, 1, 1)) # Use dropna() to avoid errors
results = model.fit()
# Print the model summary
print(results.summary())
In this exercise, we first read a CSV file containing time series data into a pandas dataframe. We convert the date column to a datetime object and set it as the index. We fill any missing values using forward fill and resample the data to a monthly frequency. We visualize the data using the plot function from the matplotlib.pyplot module. Finally, we fit an ARIMA model to the data using the ARIMA class from the statsmodels.api module and print the summary of the model using the summary method of the results object.
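The fitted results object can also produce out-of-sample forecasts, for example over a 12-month horizon:
# Forecast the next 12 monthly values from the fitted ARIMA model
forecast = results.forecast(steps=12)
print(forecast)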
Exercise 18: Computer Networking
Concepts:
- Computer Networking
- TCP/IP Protocol
- Socket Programming
Description: Write a Python script that implements a simple TCP server that accepts client connections and sends and receives data.
Solution:
import socket
# Define the host and port for the server
host = 'localhost'
port = 12345
# Create a socket object
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Bind the socket to the host and port
s.bind((host, port))
# Listen for incoming connections
s.listen(1)
print('Server listening on', host, port)
# Accept a client connection
conn, addr = s.accept()
print('Connected by', addr)
# Send data to the client
conn.sendall(b'Hello, client!')
# Receive data from the client
data = conn.recv(1024)
print('Received:', data.decode())
# Close the connection
conn.close()
In this exercise, we first define the host and port for the server. We create a socket object using the socket function from the socket module and bind it to the host and port using the bind method. We listen for incoming connections using the listen method and accept a client connection using the accept method, which returns a connection object and the address of the client. We send data to the client using the sendall method of the connection object and receive data from the client using the recv method. Finally, we close the connection using the close method.
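To exercise the server, you can run a small client in a second terminal; a minimal sketch that matches the host and port used above:
import socket
# Minimal client: connect, read the greeting, send a reply
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('localhost', 12345))
print('Server says:', client.recv(1024).decode())
client.sendall(b'Hello, server!')
client.close()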
Exercise 19: Data Analysis and Visualization
Concepts:
- Data Analysis
- Data Visualization
- PDF Report Generation
- Pandas library
- Matplotlib library
- ReportLab library
Description: Write a Python script that reads a CSV file containing sales data for a retail store, performs some data analysis and visualization, and saves the results to a PDF report.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
import os
# Read the CSV file into a pandas dataframe
df = pd.read_csv('sales_data.csv')
# Calculate the total sales by category and month
totals = df.groupby(['category', 'month'])['sales'].sum()
# Get unique categories
categories = df['category'].unique()
# Create subplots dynamically based on the number of categories
fig, axes = plt.subplots(nrows=len(categories), ncols=1, figsize=(8.5, 11))
# Ensure `axes` is always iterable (even if there's only one category)
if len(categories) == 1:
    axes = [axes]
# Plot total sales by category and month
for i, category in enumerate(categories):
    totals.loc[category].plot(ax=axes[i], kind='bar', title=f"Category: {category}")
    axes[i].set_ylabel("Sales")
plt.tight_layout()
plt.savefig('sales_plot.png') # Save the figure
plt.close(fig) # Close to free memory
# Create a PDF report
pdf_filename = 'sales_report.pdf'
c = canvas.Canvas(pdf_filename, pagesize=letter)
# Add title and description
c.setFont("Helvetica-Bold", 16)
c.drawString(50, 750, 'Sales Report')
c.setFont("Helvetica", 12)
c.drawString(50, 730, 'Total Sales by Category and Month')
# Add the image to the PDF if it exists
if os.path.exists('sales_plot.png'):
    c.drawImage('sales_plot.png', 50, 450, width=500, height=300)
# Save and close the PDF
c.showPage()
c.save()
print(f"Report saved as {pdf_filename}")
In this exercise, we first read a CSV file containing sales data for a retail store into a pandas dataframe. We calculate the total sales by category and month using the groupby and sum methods. We plot the total sales by category and month using the plot method from the matplotlib.pyplot module and save the plot to a PNG file. Finally, we generate a PDF report using the Canvas class and drawImage method from the reportlab library.
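If the report should also list the overall totals as text, a few drawString calls can be added before c.showPage() and c.save(); a sketch assuming the same category and sales columns:
y = 430
for category, total in df.groupby('category')['sales'].sum().items():
    c.drawString(50, y, f"{category}: {total}")
    y -= 15  # move down one line per category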
Exercise 20: Machine Learning
Concepts:
- Machine Learning
- Convolutional Neural Networks
- Keras library
- MNIST dataset
Description: Write a Python script that trains a machine learning model to classify images of handwritten digits from the MNIST dataset.
Solution:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# Normalize the pixel values and reshape the data
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)
# Define the CNN model
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),  # Fully connected layer
    layers.Dropout(0.5),  # Prevent overfitting
    layers.Dense(10, activation='softmax')
])
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test), batch_size=64)
# Evaluate the model on the test data
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print('Test accuracy:', test_acc)
In this exercise, we first load the MNIST dataset using the load_data function from the keras.datasets.mnist module. We normalize the pixel values and reshape the data using NumPy. We define a convolutional neural network model using the Sequential class and various layers from the layers module of Keras. We compile the model using the compile method with the Adam optimizer and sparse categorical crossentropy loss function. We train the model using the fit method and evaluate it on the test data using the evaluate method.
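After training, the network can classify individual images; a minimal sketch that predicts the first test image and compares it with the true label:
import numpy as np
probs = model.predict(x_test[:1])  # probabilities over the 10 digit classes
print("Predicted digit:", np.argmax(probs), "- true label:", y_test[0])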
Exercise 21: Natural Language Processing
Concepts:
- Natural Language Processing
- Text Preprocessing
- Text Representation
- Topic Modeling
- Latent Dirichlet Allocation
- Gensim library
Description: Write a Python script that uses natural language processing techniques to analyze a corpus of text data and extract useful insights.
Solution:
import gensim
from gensim import corpora
from gensim.models import LdaModel
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download required resources
nltk.download('stopwords')
nltk.download('punkt')
# Read the text data into a pandas dataframe
df = pd.read_csv('text_data.csv')
# Handle missing values
df['text'] = df['text'].fillna('')
# Define stop words and clean text
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Tokenization & lowercasing
    return [word for word in tokens if word.isalnum() and word not in stop_words]  # Remove punctuation & stopwords
df['cleaned_text'] = df['text'].apply(preprocess_text)
# Create a document-term matrix
texts = df['cleaned_text'].tolist()
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Train LDA model
num_topics = 5
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
# Print topics and top words for each
for topic_id, words in lda_model.show_topics(num_topics=num_topics, formatted=False):
    print(f'Topic {topic_id}:', ', '.join(word for word, _ in words))
# Convert topic distributions into a structured DataFrame
topic_dists = [{f"Topic_{topic}": prob for topic, prob in lda_model.get_document_topics(doc, minimum_probability=0)} for doc in corpus]
topic_df = pd.DataFrame(topic_dists)
# Merge topic distributions with original data
df = pd.concat([df, topic_df], axis=1)
# Save the results
df.to_csv('text_data_topics.csv', index=False)
print("Saved processed data to 'text_data_topics.csv'.")
In this exercise, we first read a corpus of text data into a pandas dataframe. We define the stop words using the stopwords corpus from the nltk.corpus module and remove them from the text with a list comprehension applied through the apply method of pandas. We create a document-term matrix from the text data using the Dictionary class and the doc2bow method from the gensim corpora module. We perform topic modeling using latent Dirichlet allocation (LDA) with the LdaModel class and extract the topic distributions for each document. Finally, we save the results to a CSV file using the to_csv method of pandas.
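Whether five topics is a sensible choice can be checked with a topic-coherence score, which you could compare across different values of num_topics; a minimal sketch using the objects defined above:
from gensim.models import CoherenceModel
coherence = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
print("Coherence score:", coherence.get_coherence())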
Exercise 22: Web Scraping
Concepts:
- Web Scraping
- HTML Parsing
- BeautifulSoup library
- CSV File I/O
Description: Write a Python script that scrapes data from a website using the BeautifulSoup library and saves it to a CSV file.
Solution:
import requests
from bs4 import BeautifulSoup
import csv
# Define the URL to scrape
url = 'https://www.example.com'
# Headers to mimic a real browser request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Send a GET request
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code != 200:
    print(f"Error: Unable to fetch data (Status Code: {response.status_code})")
    exit()
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')
# Extract data
data = []
for item in soup.find_all('div', class_='item'):
    name_tag = item.find('h3')
    price_tag = item.find('span', class_='price')
    # Extract text safely, handling missing elements
    name = name_tag.get_text(strip=True) if name_tag else 'N/A'
    price = price_tag.get_text(strip=True) if price_tag else 'N/A'
    data.append([name, price])
# Save to CSV
csv_filename = 'data.csv'
with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Name', 'Price'])  # Add headers
    writer.writerows(data)
print(f"Scraping completed. Data saved to '{csv_filename}'.")
In this exercise, we first define the URL to scrape, fetch the page using the requests library, and parse the HTML content using the BeautifulSoup library. We extract the data from the HTML content using the find_all and find methods of the soup object. Finally, we save the data to a CSV file using the csv module.
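When the same script has to fetch several pages, it is polite to pause between requests; a rough sketch (the page URLs are placeholders):
import time
for page_url in ['https://www.example.com/page/1', 'https://www.example.com/page/2']:
    response = requests.get(page_url, headers=headers)
    # ... parse the response as above ...
    time.sleep(1)  # simple rate limiting between requests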
Exercise 23: Database Interaction
Concepts:
- Database Interaction
- SQLite database
- SQL queries
- SQLite3 module
Description: Write a Python script that interacts with a database to retrieve and manipulate data.
Solution:
import sqlite3
# Connect to the database
conn = sqlite3.connect('example.db')
# Create a cursor object
c = conn.cursor()
# Execute an SQL query to create a table
c.execute('''CREATE TABLE IF NOT EXISTS customers
(id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)''')
# Execute an SQL query to insert data into the table
c.execute("INSERT INTO customers (name, email, phone) VALUES ('John Smith', 'john@example.com', '555-1234')")
# Execute an SQL query to retrieve data from the table
c.execute("SELECT * FROM customers")
rows = c.fetchall()
for row in rows:
    print(row)
# Execute an SQL query to update data in the table
c.execute("UPDATE customers SET phone='555-5678' WHERE name='John Smith'")
# Execute an SQL query to delete data from the table
c.execute("DELETE FROM customers WHERE name='John Smith'")
# Commit the changes to the database
conn.commit()
# Close the database connection
conn.close()
In this exercise, we first connect to an SQLite database using the connect function from the sqlite3 module. We create a cursor object using the cursor method of the connection object and execute SQL queries using the execute method of the cursor object. We retrieve data from the table using the fetchall method and print the results. We update data in the table using the UPDATE statement and delete data from the table using the DELETE statement. Finally, we commit the changes to the database and close the connection.
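When the values come from user input, it is safer to pass them as query parameters than to build the SQL string by hand; a minimal sketch using the same cursor before the connection is closed (the example customer is made up):
c.execute("INSERT INTO customers (name, email, phone) VALUES (?, ?, ?)",
          ('Jane Doe', 'jane@example.com', '555-9876'))
c.execute("SELECT * FROM customers WHERE name = ?", ('Jane Doe',))
print(c.fetchone())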
Exercise 24: Parallel Processing
Concepts:
- Parallel Processing
- Multiprocessing
- Process Pool
- CPU-bound tasks
Description: Write a Python script that performs a time-consuming computation using parallel processing to speed up the computation.
Solution:
import time
import multiprocessing
# Define an optimized CPU-bound function
def compute(num):
    return num * (num - 1) // 2  # Uses O(1) formula instead of a loop
if __name__ == '__main__':
    # Create a process pool with the number of CPUs available
    num_cpus = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(num_cpus)
    # Generate a list of numbers to compute
    num_list = [10000000] * num_cpus
    # Compute the results using parallel processing
    start_time = time.time()
    results = pool.map(compute, num_list)
    # Close the pool properly
    pool.close()
    pool.join()
    end_time = time.time()
    # Print the results and computation time
    print('Results:', results)
    print('Computation time:', end_time - start_time, 'seconds')
In this exercise, we first define the CPU-bound function to apply to each input. We then create a process pool using the Pool function from the multiprocessing module with the number of CPUs available. We generate a list of numbers to compute and calculate the results in parallel using the map method of the process pool. Finally, we print the results and the computation time.
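The same fan-out/fan-in pattern can also be written with the standard library's concurrent.futures module, which manages the pool for you; a minimal sketch reusing the compute function above (the four inputs are just an example):
from concurrent.futures import ProcessPoolExecutor
if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:  # pool is created and cleaned up automatically
        results = list(executor.map(compute, [10000000] * 4))
    print(results)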
Exercise 25: Image Processing
Concepts:
- Image Processing
- Pillow library
- Image Manipulation
- Image Filtering
Description: Write a Python script that performs basic image processing operations on an image file.
Solution:
from PIL import Image, ImageFilter
import os
# Define image paths
input_path = 'example.jpg'
output_path = 'processed.jpg'
# Check if the input file exists
if not os.path.exists(input_path):
    raise FileNotFoundError(f"Error: The file '{input_path}' was not found.")
try:
    # Open the image file using a context manager
    with Image.open(input_path) as image:
        # Display the original image (optional, may not work in all environments)
        image.show()
        # Resize the image
        image = image.resize((500, 500))
        # Convert the image to grayscale
        image = image.convert('L')
        # Apply a Gaussian blur filter
        image = image.filter(ImageFilter.GaussianBlur(radius=2))
        # Save the processed image to a file
        image.save(output_path)
        # Display the processed image
        image.show()
    print(f"Processed image saved as '{output_path}'.")
except Exception as e:
    print(f"An error occurred: {e}")
In this exercise, we first open an image file using the Image class from the Pillow library. We resize the image using the resize method and convert it to grayscale using the convert method with the 'L' mode. We apply a Gaussian blur filter using the filter method with the GaussianBlur class from the ImageFilter module. Finally, we save the processed image to a file using the save method and display it using the show method.
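A few other common one-line Pillow operations, sketched on the same input file in case you want to extend the exercise:
with Image.open(input_path) as im:
    im.rotate(90).save('rotated.jpg')  # rotate 90 degrees counter-clockwise
    im.crop((0, 0, 100, 100)).save('cropped.jpg')  # keep the top-left 100x100 region
    im.thumbnail((128, 128))  # shrink in place, preserving aspect ratio
    im.save('thumbnail.jpg')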
I hope you find these exercises useful.
Advance Level Exercises Part 1
Exercise 1: File Parsing
Concepts:
- File I/O
- Regular expressions
Description: Write a Python script that reads a text file and extracts all URLs that are present in the file. The output should be a list of URLs.
Solution:
import re
# Open the file for reading
with open('input_file.txt', 'r') as f:
# Read the file contents
file_contents = f.read()
# Use regular expression to extract URLs
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', file_contents)
# Print the list of URLs
print(urls)
Exercise 2: Data Analysis
Concepts:
- File I/O
- Data manipulation
- Pandas library
Description: Write a Python script that reads a CSV file containing sales data and calculates the total sales revenue for each product category.
Solution:
import pandas as pd
# Read the CSV file into a pandas dataframe
df = pd.read_csv('sales_data.csv')
# Group the data by product category and sum the sales revenue
total_revenue = df.groupby('Product Category')['Sales Revenue'].sum()
# Print the total revenue for each product category
print(total_revenue)
Exercise 3: Web Scraping
Concepts
- Web scraping
- Requests library
- Beautiful Soup library
- CSV file I/O
Description: Write a Python script that scrapes the title and price of all products listed on an e-commerce website and stores them in a CSV file.
Solution:
import requests
from bs4 import BeautifulSoup
import csv
# Define the target URL
url = 'https://www.example.com/products'
# Headers to mimic a real browser request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Make a GET request to the website
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code == 200:
# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')
# Find all product titles and prices
titles = [title.get_text(strip=True) for title in soup.find_all('h3', class_='product-title')]
prices = [price.get_text(strip=True) for price in soup.find_all('div', class_='product-price')]
# Zip the titles and prices together
data = list(zip(titles, prices))
# Write the data to a CSV file with headers
with open('product_data.csv', 'w', newline='', encoding='utf-8') as f:
writer = csv.writer(f)
writer.writerow(['Product Title', 'Price']) # Add headers
writer.writerows(data)
print("Scraping completed. Data saved to 'product_data.csv'.")
else:
print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
Exercise 4: Multithreading
Concepts:
- Multithreading
- Requests library
- Threading library
Description: Write a Python script that uses multithreading to download multiple images from a URL list simultaneously.
Solution:
import requests
import threading
# URL list of images to download
url_list = ['https://www.example.com/image1.jpg', 'https://www.example.com/image2.jpg', 'https://www.example.com/image3.jpg']
# Function to download an image from a URL
def download_image(url):
response = requests.get(url)
with open(url.split('/')[-1], 'wb') as f:
f.write(response.content)
# Create a thread for each URL and start them all simultaneously
threads = []
for url in url_list:
thread = threading.Thread(target=download_image, args=(url,))
threads.append(thread)
thread.start()
# Wait for all threads to finish
for thread in threads:
thread.join()
Exercise 5: Machine Learning
Concepts:
- Machine learning
- Scikit-learn library
Description: Write a Python script that trains a machine learning model on a dataset and uses it to predict the output for new data.
Solution:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Read the dataset into a pandas dataframe
df = pd.read_csv('dataset.csv')
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[['feature1', 'feature2']], df['target'], test_size=0.2, random_state=42)
# Train a linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)
# Use the model to predict the output for the testing data
y_pred = model.predict(X_test)
# Evaluate the model performance using the mean squared error metric
mse = ((y_test - y_pred) ** 2).mean()
print("Mean squared error:", mse)
In this exercise, we first read a dataset into a pandas dataframe. Then, we split the data into training and testing sets using the train_test_split
function from the sklearn.model_selection
module. We trained a linear regression model on the training data using the LinearRegression
class from the sklearn.linear_model
module. Finally, we used the trained model to predict the output for the testing data and evaluated the model performance using the mean squared error metric.
Exercise 6: Natural Language Processing
Concepts:
- Natural Language Processing
- Sentiment Analysis
- NLTK library
Description: Write a Python script that reads a text file and performs sentiment analysis on the text using a pre-trained NLP model.
Solution:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Ensure the VADER lexicon is downloaded
nltk.download('vader_lexicon')
# Read the text file into a string
with open('input_file.txt', 'r', encoding='utf-8') as f:
text = f.read()
# Create a SentimentIntensityAnalyzer object
sid = SentimentIntensityAnalyzer()
# Perform sentiment analysis on the text
scores = sid.polarity_scores(text)
# Print the sentiment scores
print(scores)
In this exercise, we first read a text file into a string. Then, we create a SentimentIntensityAnalyzer
object from the nltk.sentiment.vader
module. We use the polarity_scores
method of the SentimentIntensityAnalyzer
object to perform sentiment analysis on the text and get a dictionary of sentiment scores.
Exercise 7: Web Development
Concepts:
- Web Development
- Flask framework
- File Uploads
Description: Write a Python script that creates a web application using the Flask framework that allows users to upload a file and performs some processing on the file.
Solution:
from flask import Flask, render_template, request
import os
app = Flask(__name__)
# Set the path for file uploads
UPLOAD_FOLDER = 'uploads'
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
# Ensure the upload directory exists
if not os.path.exists(UPLOAD_FOLDER):
os.makedirs(UPLOAD_FOLDER)
# Route for the home page
@app.route('/')
def index():
return render_template('index.html')
# Route for file uploads
@app.route('/upload', methods=['POST'])
def upload():
if 'file' not in request.files:
return 'No file part', 400
file = request.files['file']
if file.filename == '':
return 'No selected file', 400
# Save the file to the uploads folder
file.save(os.path.join(app.config['UPLOAD_FOLDER'], file.filename))
return 'File uploaded successfully'
if __name__ == '__main__':
app.run(debug=True)
In this exercise, we first import the Flask module and create a Flask application. We set up a route for the home page that returns an HTML template. We set up a route for file uploads that receives an uploaded file and saves it to a designated uploads folder. We can perform processing on the uploaded file inside the upload
function.
Exercise 8: Data Visualization
Concepts:
- Data Visualization
- Matplotlib library
- Candlestick Charts
Description: Write a Python script that reads a CSV file containing stock market data and plots a candlestick chart of the data.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
import mplfinance as mpf
# Read the CSV file into a pandas dataframe
df = pd.read_csv('stock_data.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True) # Set Date as index
# Plot the candlestick chart using mplfinance
mpf.plot(df, type='candle', style='charles', title='Stock Market Data', ylabel='Price')
# Display the chart
plt.show()
In this exercise, we first read a CSV file containing stock market data into a pandas dataframe. We convert the date column to Matplotlib dates format and create a figure and axis objects. We plot the candlestick chart using the candlestick_ohlc
function from the mpl_finance
module. We format the x-axis as dates and set the axis labels and title. Finally, we display the chart using the show
function from the matplotlib.pyplot
module.
Exercise 9: Machine Learning
Concepts:
- Machine Learning
- Scikit-learn library
Description: Write a Python script that reads a dataset containing information about different types of flowers and trains a machine learning model to predict the type of a flower based on its features.
Solution:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Read the dataset into a pandas dataframe
df = pd.read_csv('flower_data.csv')
# Check for missing values
if df.isnull().sum().sum() > 0:
df = df.dropna() # Drop rows with missing values
# Define feature columns and target column
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['species']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the feature values
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train a logistic regression model on the training data
model = LogisticRegression(solver='saga', max_iter=5000) # Increased iterations & changed solver
model.fit(X_train, y_train)
# Use the model to predict the output for the testing data
y_pred = model.predict(X_test)
# Evaluate the model performance using the accuracy score metric
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
In this exercise, we first read a dataset containing information about different types of flowers into a pandas dataframe. We split the data into training and testing sets using the train_test_split
function from the sklearn.model_selection
module. We trained a logistic regression model on the training data using the LogisticRegression
class from the sklearn.linear_model
module. Finally, we used the trained model to predict the output for the testing data and evaluated the model performance using the accuracy score metric.
Exercise 10: Data Analysis
Concepts:
- Data Analysis
- Recommendation Systems
- Collaborative Filtering
- Surprise library
Description: Write a Python script that reads a CSV file containing customer purchase data and generates a recommendation system that recommends products to customers based on their purchase history.
Solution:
import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split
# Read the CSV file into a pandas dataframe
df = pd.read_csv('purchase_data.csv')
# Ensure that the dataset has no missing values
df = df.dropna(subset=['customer_id', 'product_id', 'rating'])
# Convert the pandas dataframe to a Surprise dataset
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['customer_id', 'product_id', 'rating']], reader)
# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2)
# Train an SVD model on the training data
model = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02)
model.fit(trainset)
# Use the model to predict the output for the testing data
predictions = model.test(testset)
# Evaluate the model performance using the root mean squared error metric
rmse = accuracy.rmse(predictions)
print("RMSE:", rmse)
# Recommend products to customers based on their purchase history
customer_ids = df['customer_id'].unique()
product_ids = df['product_id'].unique()
recommendations = {}
for customer_id in customer_ids:
purchased_products = set(df[df['customer_id'] == customer_id]['product_id'].values)
potential_recommendations = []
for product_id in product_ids:
if product_id not in purchased_products:
pred = model.predict(customer_id, product_id)
potential_recommendations.append((product_id, pred.est))
# Sort by predicted rating and take the top 5 recommendations
top_recommendations = sorted(potential_recommendations, key=lambda x: x[1], reverse=True)[:5]
recommendations[customer_id] = top_recommendations
# Display recommendations
for customer, recs in recommendations.items():
print(f"Customer {customer} recommended products: {recs}")
In this exercise, we first read a CSV file containing customer purchase data into a pandas dataframe. We convert the pandas dataframe to a surprise dataset using the Reader
and Dataset
classes from the surprise
module. We split the data into training and testing sets using the train_test_split
function from the surprise.model_selection
module. We trained an SVD model on the training data using the SVD
class from the surprise
module. We used the trained model to predict the output for the testing data and evaluated the model performance using the root mean squared error metric. Finally, we recommended products to customers based on their purchase history using the trained model.
Exercise 11: Computer Vision
Concepts:
- Computer Vision
- Object Detection
- OpenCV library
- Pre-trained models
Description: Write a Python script that reads an image and performs object detection on the image using a pre-trained object detection model.
Solution:
import cv2
import numpy as np
# Read the image file
img = cv2.imread('image.jpg')
# Check if the image is loaded correctly
if img is None:
raise FileNotFoundError("Error: Image file not found or unable to load.")
# Load the pre-trained object detection model
model = cv2.dnn.readNetFromTensorflow('frozen_inference_graph.pb', 'ssd_mobilenet_v2_coco_2018_03_29.pbtxt')
# Prepare the input image for the model
blob = cv2.dnn.blobFromImage(img, size=(300, 300), swapRB=True, crop=False)
model.setInput(blob)
# Perform object detection
output = model.forward()
# Loop through detected objects and draw bounding boxes
h, w, _ = img.shape # Get image dimensions
for detection in output[0, 0, :, :]:
confidence = float(detection[2])
if confidence > 0.5:
x1 = int(detection[3] * w)
y1 = int(detection[4] * h)
x2 = int(detection[5] * w)
y2 = int(detection[6] * h)
# Draw bounding box with label and confidence score
cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
label = f'Confidence: {confidence:.2f}'
cv2.putText(img, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
# Display the image with detections
cv2.imshow('Object Detection', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
In this exercise, we first read an image file into a NumPy array using the imread
function from the cv2
module of OpenCV. We load a pre-trained object detection model using the readNetFromTensorflow
function from the cv2.dnn
module. We set the input image to the model and perform object detection using the setInput
and forward
methods of the model object. Finally, we loop through the detected objects and draw bounding boxes around them using the rectangle
function from the cv2
module.
Exercise 12: Natural Language Processing
Concepts:
- Natural Language Processing
- Topic Modeling
- Latent Dirichlet Allocation
- Gensim library
Description: Write a Python script that reads a text file and performs topic modeling on the text using Latent Dirichlet Allocation (LDA).
Solution:
import gensim
from gensim import corpora
from gensim.models import LdaModel
# Read the text file into a list of strings
with open('input_file.txt', 'r') as f:
text = f.readlines()
# Remove newlines and convert to lowercase
text = [line.strip().lower() for line in text]
# Tokenize the text into words
tokens = [line.split() for line in text]
# Create a dictionary of words and their frequency
dictionary = corpora.Dictionary(tokens)
# Create a bag-of-words representation of the text
corpus = [dictionary.doc2bow(token) for token in tokens]
# Train an LDA model on the text
model = LdaModel(corpus, id2word=dictionary, num_topics=5, passes=10)
# Print the topics and their associated words
for topic in model.print_topics(num_words=5):
print(topic)
In this exercise, we first read a text file into a list of strings. We preprocess the text by removing newlines, converting to lowercase, and tokenizing into words using the split
method. We create a dictionary of words and their frequency and create a bag-of-words representation of the text using the doc2bow
method of the dictionary object. We train an LDA model on the corpus using the LdaModel
class from the gensim.models
module. Finally, we print the topics and their associated words using the print_topics
method of the model object.
Exercise 13: Web Scraping
Concepts:
- Web Scraping
- Beautiful Soup library
- Requests library
- CSV file handling
Description: Write a Python script that scrapes a website for product information and saves the information to a CSV file.
Solution:
import requests
from bs4 import BeautifulSoup
import csv
# Define the URL of the website to scrape
url = 'https://www.example.com/products'
# Add headers to mimic a browser request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Send a request to the website
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code != 200:
print(f"Failed to fetch data. Status Code: {response.status_code}")
exit()
# Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Find all the product listings on the page
listings = soup.find_all('div', class_='product-listing')
# Write the product information to a CSV file
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
writer = csv.writer(f)
writer.writerow(['Product Name', 'Price', 'Description'])
for listing in listings:
name = listing.find('h3')
price = listing.find('span', class_='price')
description = listing.find('p')
# Extract text safely, handling missing elements
name = name.get_text(strip=True) if name else 'N/A'
price = price.get_text(strip=True) if price else 'N/A'
description = description.get_text(strip=True) if description else 'N/A'
writer.writerow([name, price, description])
print("Scraping completed. Data saved to 'products.csv'.")
In this exercise, we first define the URL of the website to scrape and send a request to the website using the get
function from the requests
module. We parse the HTML content of the response using Beautiful Soup and find all the product listings on the page using the find_all
method. We write the product information to a CSV file using the csv
module.
Exercise 14: Big Data Processing
Concepts:
- Big Data Processing
- PySpark
- Data Transformations
- Aggregation
- Parquet file format
Description: Write a PySpark script that reads a CSV file containing customer purchase data, performs some data transformations and aggregation, and saves the results to a Parquet file.
Solution:
from pyspark.sql import SparkSession
# Create a SparkSession object
spark = SparkSession.builder.appName('customer-purchases').getOrCreate()
# Verify if the file exists before reading (optional but useful)
import os
if not os.path.exists('customer_purchases.csv'):
raise FileNotFoundError("Error: The file 'customer_purchases.csv' does not exist.")
# Read the CSV file into a Spark DataFrame
df = spark.read.csv('customer_purchases.csv', header=True, inferSchema=True)
# Perform some data transformations
df = df.filter((df['purchase_date'] >= '2020-01-01') & (df['purchase_date'] <= '2020-12-31'))
df = df.select('customer_id', 'product_id', 'price')
# Group by customer and calculate total spending
df = df.groupBy('customer_id').sum('price').withColumnRenamed('sum(price)', 'total_spent')
# Save the results to a Parquet file
df.write.mode('overwrite').parquet('customer_spending.parquet')
print("Processing completed. Data saved to 'customer_spending.parquet'.")
In this exercise, we first create a SparkSession object using the SparkSession
class from the pyspark.sql
module. We read a CSV file containing customer purchase data into a Spark DataFrame using the read.csv
method. We perform some data transformations on the DataFrame using the filter
, select
, and groupBy
methods. Finally, we save the results to a Parquet file using the write.parquet
method.
Exercise 15: DevOps
Concepts:
- DevOps
- Fabric library
Description: Write a Python script that automates the deployment of a web application to a remote server using the Fabric library.
Solution:
from fabric import Connection
import getpass
# Define the host and user credentials for the remote server
host = 'example.com'
user = 'user'
password = getpass.getpass("Enter SSH password: ") # Secure password entry
# Define the path to the web application on the local machine and the remote server
local_path = '/path/to/local/app'
remote_path = '/path/to/remote/app'
# Create a connection to the remote server
c = Connection(host=host, user=user, connect_kwargs={'password': password})
# Ensure the remote directory exists
c.run(f'mkdir -p {remote_path}')
# Upload the local files to the remote server
c.put(local_path, remote_path, recursive=True) # Enables recursive copy
# Change to the application directory
with c.cd(remote_path):
# Install required dependencies
c.run('sudo apt-get update && sudo apt-get install -y python3-pip')
c.run('pip3 install -r requirements.txt')
# Start the web application in the background
c.run('nohup python3 app.py > app.log 2>&1 &', pty=False)
print("Deployment completed successfully.")
In this exercise, we first define the host and user credentials for the remote server. We define the path to the web application on the local machine and the remote server. We create a connection to the remote server using the Connection
class from the fabric
module. We upload the local files to the remote server using the put
method of the connection object. We install any required dependencies on the remote server using the run
method of the connection object. Finally, we start the web application on the remote server using the run
method.
Exercise 16: Reinforcement Learning
Concepts:
- Reinforcement Learning
- Q-Learning
- OpenAI Gym library
Description: Write a Python script that implements a reinforcement learning algorithm to teach an agent to play a simple game.
Solution:
import gym
import numpy as np
import time
# Create the FrozenLake environment
env = gym.make("FrozenLake-v1", is_slippery=True)
# Initialize the Q-table
Q = np.zeros([env.observation_space.n, env.action_space.n])
# Set hyperparameters
alpha = 0.8 # Learning rate
gamma = 0.95 # Discount factor
epsilon = 0.1 # Exploration probability
num_episodes = 2000 # Training episodes
# Train the agent using Q-learning
for episode in range(num_episodes):
state, _ = env.reset()
done = False
while not done:
# Choose action using epsilon-greedy policy
if np.random.uniform() < epsilon:
action = env.action_space.sample() # Random action (exploration)
else:
action = np.argmax(Q[state, :]) # Best action from Q-table
# Take the action and observe the next state
next_state, reward, done, _, _ = env.step(action)
# Update Q-value using the Bellman equation
Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state, :]))
# Move to the next state
state = next_state
# Test the agent by playing the game
state, _ = env.reset()
done = False
print("\nTesting trained agent:\n")
while not done:
action = np.argmax(Q[state, :])
next_state, reward, done, _, _ = env.step(action)
# Render the environment
env.render()
time.sleep(0.5) # Pause for visibility
state = next_state
print("\nGame Over!")
In this exercise, we first create an OpenAI Gym environment for the game using the make
function from the gym
module. We define the Q-table for the agent as a NumPy array and set the hyperparameters for the Q-learning algorithm. We train the agent using the Q-learning algorithm by looping through a specified number of episodes and updating the Q-table based on the rewards and next states. Finally, we test the agent by playing the game using the Q-table and visualizing the game using the render
method.
Exercise 17: Time Series Analysis
Concepts:
- Time Series Analysis
- Data Preprocessing
- Data Visualization
- ARIMA model
- Statsmodels library
Description: Write a Python script that reads a CSV file containing time series data, performs some data preprocessing and visualization, and fits a time series model to the data.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Read the CSV file into a pandas dataframe
df = pd.read_csv('time_series.csv')
# Convert the date column to a datetime object and set it as the index
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Check for missing values before resampling
if df.isnull().values.any():
df = df.fillna(method='ffill')
# Ensure the column name is correct
target_col = df.columns[0] # Assuming first column is the time series value
# Resample the data to a monthly frequency
df = df.resample('M').mean()
# Plot the time series data
plt.figure(figsize=(10, 5))
plt.plot(df.index, df[target_col], label="Time Series")
plt.xlabel("Date")
plt.ylabel("Value")
plt.title("Time Series Visualization")
plt.legend()
plt.grid()
plt.show()
# Fit an ARIMA model
model = sm.tsa.ARIMA(df[target_col].dropna(), order=(1, 1, 1)) # Use dropna() to avoid errors
results = model.fit()
# Print the model summary
print(results.summary())
In this exercise, we first read a CSV file containing time series data into a pandas dataframe. We convert the date column to a datetime object and set it as the index. We resample the data to a monthly frequency and fill any missing values using forward fill. We visualize the data using the plot
function from the matplotlib.pyplot
module. Finally, we fit an ARIMA model to the data using the ARIMA
function from the statsmodels.api
module and print the summary of the model using the summary
method of the results object.
Exercise 18: Computer Networking
Concepts:
- Computer Networking
- TCP/IP Protocol
- Socket Programming
Description: Write a Python script that implements a simple TCP server that accepts client connections and sends and receives data.
Solution:
import socket
# Define the host and port for the server
host = 'localhost'
port = 12345
# Create a socket object
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Bind the socket to the host and port
s.bind((host, port))
# Listen for incoming connections
s.listen(1)
print('Server listening on', host, port)
# Accept a client connection
conn, addr = s.accept()
print('Connected by', addr)
# Send data to the client
conn.sendall(b'Hello, client!')
# Receive data from the client
data = conn.recv(1024)
print('Received:', data.decode())
# Close the connection
conn.close()
In this exercise, we first define the host and port for the server. We create a socket object using the socket
function from the socket
module and bind the socket to the host and port using the bind
method. We listen for incoming connections using the listen
method and accept a client connection using the accept
method, which returns a connection object and the address of the client. We send data to the client using the sendall
method of the connection object and receive data from the client using the recv
method. Finally, we close the connection using the close
method.
Exercise 19: Data Analysis and Visualization
Concepts:
- Data Analysis
- Data Visualization
- PDF Report Generation
- Pandas library
- Matplotlib library
- ReportLab library
Description: Write a Python script that reads a CSV file containing sales data for a retail store, performs some data analysis and visualization, and saves the results to a PDF report.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
import os
# Read the CSV file into a pandas dataframe
df = pd.read_csv('sales_data.csv')
# Calculate the total sales by category and month
totals = df.groupby(['category', 'month'])['sales'].sum()
# Get unique categories
categories = df['category'].unique()
# Create subplots dynamically based on the number of categories
fig, axes = plt.subplots(nrows=len(categories), ncols=1, figsize=(8.5, 11))
# Ensure `axes` is always iterable (even if there's only one category)
if len(categories) == 1:
axes = [axes]
# Plot total sales by category and month
for i, category in enumerate(categories):
totals.loc[category].plot(ax=axes[i], kind='bar', title=f"Category: {category}")
axes[i].set_ylabel("Sales")
plt.tight_layout()
plt.savefig('sales_plot.png') # Save the figure
plt.close(fig) # Close to free memory
# Create a PDF report
pdf_filename = 'sales_report.pdf'
c = canvas.Canvas(pdf_filename, pagesize=letter)
# Add title and description
c.setFont("Helvetica-Bold", 16)
c.drawString(50, 750, 'Sales Report')
c.setFont("Helvetica", 12)
c.drawString(50, 730, 'Total Sales by Category and Month')
# Add the image to the PDF if it exists
if os.path.exists('sales_plot.png'):
c.drawImage('sales_plot.png', 50, 450, width=500, height=300)
# Save and close the PDF
c.showPage()
c.save()
print(f"Report saved as {pdf_filename}")
In this exercise, we first read a CSV file containing sales data for a retail store into a pandas dataframe. We calculate the total sales by category and month using the groupby
and sum
methods. We plot the total sales by category and month using the plot
function from the matplotlib.pyplot
module and save the plot to a PNG file. Finally, we generate a PDF report using the Canvas
and Image
functions from the reportlab
module.
Exercise 20: Machine Learning
Concepts:
- Machine Learning
- Convolutional Neural Networks
- Keras library
- MNIST dataset
Description: Write a Python script that trains a machine learning model to classify images of handwritten digits from the MNIST dataset.
Solution:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# Normalize the pixel values and reshape the data
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)
# Define the CNN model
model = keras.Sequential([
layers.Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(28, 28, 1)),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
layers.MaxPooling2D((2, 2)),
layers.Flatten(),
layers.Dense(128, activation='relu'), # Added a fully connected layer
layers.Dropout(0.5), # Prevent overfitting
layers.Dense(10, activation='softmax')
])
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test), batch_size=64)
# Evaluate the model on the test data
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print('Test accuracy:', test_acc)
In this exercise, we first load the MNIST dataset using the load_data
function from the keras.datasets.mnist
module. We normalize the pixel values and reshape the data using NumPy. We define a convolutional neural network model using the Sequential
class and various layers from the layers
module of Keras. We compile the model using the compile
method with the Adam optimizer and sparse categorical crossentropy loss function. We train the model using the fit
method and evaluate the model on the test data using the evaluate
method.
Exercise 21: Natural Language Processing
Concepts:
- Natural Language Processing
- Text Preprocessing
- Text Representation
- Topic Modeling
- Latent Dirichlet Allocation
- Gensim library
Description: Write a Python script that uses natural language processing techniques to analyze a corpus of text data and extract useful insights.
Solution:
import gensim
from gensim import corpora
from gensim.models import LdaModel
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download required resources
nltk.download('stopwords')
nltk.download('punkt')
# Read the text data into a pandas dataframe
df = pd.read_csv('text_data.csv')
# Handle missing values
df['text'] = df['text'].fillna('')
# Define stop words and clean text
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Tokenization & lowercasing
    return [word for word in tokens if word.isalnum() and word not in stop_words]  # Remove punctuation & stopwords
df['cleaned_text'] = df['text'].apply(preprocess_text)
# Create a document-term matrix
texts = df['cleaned_text'].tolist()
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Train LDA model
num_topics = 5
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
# Print topics and top words for each
for topic_id, words in lda_model.show_topics(num_topics=num_topics, formatted=False):
    print(f'Topic {topic_id}:', ', '.join(word for word, _ in words))
# Convert topic distributions into a structured DataFrame
topic_dists = [{f"Topic_{topic}": prob for topic, prob in lda_model.get_document_topics(doc, minimum_probability=0)} for doc in corpus]
topic_df = pd.DataFrame(topic_dists)
# Merge topic distributions with original data
df = pd.concat([df, topic_df], axis=1)
# Save the results
df.to_csv('text_data_topics.csv', index=False)
print("Saved processed data to 'text_data_topics.csv'.")
In this exercise, we first read a corpus of text data into a pandas dataframe. We define the stop words using the stopwords corpus from the nltk.corpus module and remove them with a tokenizing helper applied through the apply method of pandas. We build a bag-of-words representation of the text using the Dictionary class from gensim.corpora and its doc2bow method. We perform topic modeling with latent Dirichlet allocation (LDA) using the LdaModel class, extract the topic distribution for each document, and finally save the results to a CSV file using the to_csv method of pandas.
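Beyond the per-topic word lists, it is often useful to know which topic dominates each document. A short follow-up sketch, assuming the lda_model, corpus, and df variables from the script above:
# Label each document with its single most probable topic
def dominant_topic(bow):
    topics = lda_model.get_document_topics(bow, minimum_probability=0)
    return max(topics, key=lambda pair: pair[1])[0]
df['dominant_topic'] = [dominant_topic(doc) for doc in corpus]
print(df[['text', 'dominant_topic']].head())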
Exercise 22: Web Scraping
Concepts:
- Web Scraping
- HTML Parsing
- BeautifulSoup library
- CSV File I/O
Description: Write a Python script that scrapes data from a website using the BeautifulSoup library and saves it to a CSV file.
Solution:
import requests
from bs4 import BeautifulSoup
import csv
# Define the URL to scrape
url = 'https://www.example.com'
# Headers to mimic a real browser request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Send a GET request
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code != 200:
    print(f"Error: Unable to fetch data (Status Code: {response.status_code})")
    exit()
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')
# Extract data
data = []
for item in soup.find_all('div', class_='item'):
    name_tag = item.find('h3')
    price_tag = item.find('span', class_='price')
    # Extract text safely, handling missing elements
    name = name_tag.get_text(strip=True) if name_tag else 'N/A'
    price = price_tag.get_text(strip=True) if price_tag else 'N/A'
    data.append([name, price])
# Save to CSV
csv_filename = 'data.csv'
with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Name', 'Price'])  # Add headers
    writer.writerows(data)
print(f"Scraping completed. Data saved to '{csv_filename}'.")
In this exercise, we first define the URL to scrape, fetch the page with the requests library, and parse the HTML content using the BeautifulSoup library. We extract the data with the find_all and find methods of the soup object, handling elements that may be missing, and finally save the results to a CSV file using the csv module.
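Real listings are usually split across several pages. One way to extend the script is to loop over a page-number parameter and pause between requests to stay polite; the URL pattern and the 'item' class below are hypothetical placeholders:
import time
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.example.com/items?page={}'  # hypothetical URL pattern
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
all_names = []
for page in range(1, 4):  # scrape the first three pages
    response = requests.get(base_url.format(page), headers=headers, timeout=10)
    if response.status_code != 200:
        break  # stop at the first page that fails to load
    soup = BeautifulSoup(response.content, 'html.parser')
    for item in soup.find_all('div', class_='item'):
        name_tag = item.find('h3')
        all_names.append(name_tag.get_text(strip=True) if name_tag else 'N/A')
    time.sleep(1)  # pause between requests
print(f"Collected {len(all_names)} items across pages.")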
Exercise 23: Database Interaction
Concepts:
- Database Interaction
- SQLite database
- SQL queries
- SQLite3 module
Description: Write a Python script that interacts with a database to retrieve and manipulate data.
Solution:
import sqlite3
# Connect to the database
conn = sqlite3.connect('example.db')
# Create a cursor object
c = conn.cursor()
# Execute an SQL query to create a table
c.execute('''CREATE TABLE IF NOT EXISTS customers
(id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)''')
# Execute an SQL query to insert data into the table
c.execute("INSERT INTO customers (name, email, phone) VALUES ('John Smith', 'john@example.com', '555-1234')")
# Execute an SQL query to retrieve data from the table
c.execute("SELECT * FROM customers")
rows = c.fetchall()
for row in rows:
    print(row)
# Execute an SQL query to update data in the table
c.execute("UPDATE customers SET phone='555-5678' WHERE name='John Smith'")
# Execute an SQL query to delete data from the table
c.execute("DELETE FROM customers WHERE name='John Smith'")
# Commit the changes to the database
conn.commit()
# Close the database connection
conn.close()
In this exercise, we first connect to an SQLite database using the connect function from the sqlite3 module. We create a cursor object with the cursor method of the connection and run SQL queries with the cursor's execute method. We retrieve data from the table with fetchall and print the results, update rows with an UPDATE statement, and delete rows with a DELETE statement. Finally, we commit the changes to the database and close the connection.
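One detail worth adding to this pattern: when values come from user input, pass them as query parameters instead of formatting them into the SQL string, so SQLite handles the quoting and injection is avoided. A minimal sketch, using an in-memory database so it runs on its own:
import sqlite3
# Parameterized queries with ? placeholders
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS customers
             (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)''')
new_customer = ('Jane Doe', 'jane@example.com', '555-9876')
c.execute("INSERT INTO customers (name, email, phone) VALUES (?, ?, ?)", new_customer)
c.execute("SELECT * FROM customers WHERE name = ?", ('Jane Doe',))
print(c.fetchone())
conn.commit()
conn.close()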
Exercise 24: Parallel Processing
Concepts:
- Parallel Processing
- Multiprocessing
- Process Pool
- CPU-bound tasks
Description: Write a Python script that performs a time-consuming computation using parallel processing to speed up the computation.
Solution:
import time
import multiprocessing
# Define a CPU-bound function: summing a large range keeps each worker busy
def compute(num):
    total = 0
    for i in range(num):
        total += i
    return total
if __name__ == '__main__':
    # Create a process pool with the number of CPUs available
    num_cpus = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(num_cpus)
    # Generate a list of numbers to compute
    num_list = [10000000] * num_cpus
    # Compute the results using parallel processing
    start_time = time.time()
    results = pool.map(compute, num_list)
    # Close the pool properly
    pool.close()
    pool.join()
    end_time = time.time()
    # Print the results and computation time
    print('Results:', results)
    print('Computation time:', end_time - start_time, 'seconds')
In this exercise, we first define a CPU-bound function that takes a noticeable amount of time to run. We then create a process pool using the Pool class from the multiprocessing module, sized to the number of available CPUs. We build a list of inputs, compute the results in parallel with the pool's map method, and finally print the results and the elapsed time.
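To see what the pool actually buys you, it helps to time the same workload serially and in parallel. A small standalone sketch (the summation function is repeated so the snippet runs on its own); note that on machines with few cores, or for small inputs, the cost of starting the worker processes can outweigh the gain:
import time
import multiprocessing

def compute(num):
    # Same CPU-bound summation as above
    total = 0
    for i in range(num):
        total += i
    return total

if __name__ == '__main__':
    num_list = [10_000_000] * multiprocessing.cpu_count()
    # Serial baseline
    start = time.time()
    serial_results = [compute(n) for n in num_list]
    serial_time = time.time() - start
    # Parallel version, using the pool as a context manager
    start = time.time()
    with multiprocessing.Pool() as pool:
        parallel_results = pool.map(compute, num_list)
    parallel_time = time.time() - start
    print(f"Serial: {serial_time:.2f}s, parallel: {parallel_time:.2f}s")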
Exercise 25: Image Processing
Concepts:
- Image Processing
- Pillow library
- Image Manipulation
- Image Filtering
Description: Write a Python script that performs basic image processing operations on an image file.
Solution:
from PIL import Image, ImageFilter
import os
# Define image paths
input_path = 'example.jpg'
output_path = 'processed.jpg'
# Check if the input file exists
if not os.path.exists(input_path):
    raise FileNotFoundError(f"Error: The file '{input_path}' was not found.")
try:
    # Open the image file using a context manager
    with Image.open(input_path) as image:
        # Display the original image (optional, may not work in all environments)
        image.show()
        # Resize the image
        image = image.resize((500, 500))
        # Convert the image to grayscale
        image = image.convert('L')
        # Apply a Gaussian blur filter
        image = image.filter(ImageFilter.GaussianBlur(radius=2))
        # Save the processed image to a file
        image.save(output_path)
        # Display the processed image
        image.show()
    print(f"Processed image saved as '{output_path}'.")
except Exception as e:
    print(f"An error occurred: {e}")
In this exercise, we first open an image file using the Image class from the Pillow library. We resize the image with the resize method and convert it to grayscale with the convert method and the 'L' mode. We apply a Gaussian blur using the filter method with the GaussianBlur filter from the ImageFilter module. Finally, we save the processed image with the save method and display it with the show method.
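The same operations can be applied to a whole folder of images. A brief sketch; the 'images' and 'processed' directory names are placeholders:
from pathlib import Path
from PIL import Image, ImageFilter
input_dir = Path('images')        # hypothetical input folder
output_dir = Path('processed')
output_dir.mkdir(exist_ok=True)
for path in input_dir.glob('*'):
    if path.suffix.lower() not in {'.jpg', '.jpeg', '.png'}:
        continue  # skip non-image files
    with Image.open(path) as img:
        img = img.resize((500, 500)).convert('L')
        img = img.filter(ImageFilter.GaussianBlur(radius=2))
        img.save(output_dir / path.name)
print("Batch processing finished.")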
I hope you find these exercises useful! Let me know if you have any further questions.