Shimmer and Shine: Let's Build Our First Machine Learning Model

Serah Rashidi (She/Her)
Jun 6, 2024
Updated: Jun 11, 2024

Hey there, and thanks for joining again! Today, we're blending two of my passions: Taylor Swift and machine learning. Who says you can't mix pop culture with serious data science? We'll be building a machine learning model using Taylor Swift's tour setlists from Kaggle. Some background in machine learning will help you follow this guide. But if you're just excited to try it out and want a clearer understanding of the underlying principles, stay tuned for part two, where we'll explain the theory behind the concepts used here. So let's get started!

Step 1: Accessing the Dataset

To get started, you'll need to access our dataset, which features Taylor Swift's tour setlists. You can find this dataset on Kaggle. Just follow this link: Taylor Swift The Eras Tour Official Setlist Data.

Step 2: Setting Up Your Kaggle Environment

Once you're on the dataset's Kaggle page, click "New Notebook" at the top of the page. This attaches the dataset to your new notebook, so you can start manipulating and analyzing the data right away.
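If you'd like to double-check that the dataset really is attached, here is a minimal sketch. It assumes Kaggle's usual layout, where attached datasets appear under /kaggle/input (the same folder that the relative path ../input points to from the notebook's working directory). The helper name `list_input_files` is my own and not part of any library.

```python
import os

def list_input_files(root="/kaggle/input"):
    """Return the paths of all files under `root` (an empty list if it does not exist)."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)

# Print every attached file so you can copy the exact CSV path
for path in list_input_files():
    print(path)
```

If nothing is printed, the dataset is probably not attached to the notebook yet.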

Step 3: Exploratory Data Analysis (EDA)

Let's begin exploring our data together. To get a better understanding of its structure and details, please copy the code below and paste it into a new cell in your Kaggle notebook. This simple step will give us an initial overview of what the dataset looks like.

import pandas as pd

# Load the dataset
data = pd.read_csv('../input/taylor-swift-the-eras-tour-official-setlist-data/era_tour_setlist.csv')

# Display the first few rows of the dataset
print(data.head())

# Generate summary statistics
print(data.describe())

# Check for missing values
print(data.isnull().sum())

This code block loads the dataset, shows the first few entries, provides summary statistics, and checks for any missing data points, giving you a solid starting point for further analysis.
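One thing worth knowing: by default, `describe()` only summarizes numeric columns, so text columns such as the track name are left out unless you ask for them. The sketch below shows this on a tiny made-up table. The rows and values are hypothetical stand-ins, not real data from the setlist dataset.

```python
import pandas as pd

# Hypothetical stand-in rows; the real dataset has more rows and columns
toy = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "danceability": [0.55, 0.61, None],
    "popularity": [80, 75, 90],
})

# Column types tell you which columns are text and which are numeric
print(toy.dtypes)

# describe() covers only the numeric columns by default...
numeric_summary = toy.describe()
# ...while include="all" adds the text columns too
full_summary = toy.describe(include="all")
print(full_summary)

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

You can run the same three calls on `data` to see which columns need encoding or filling before modeling.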

Step 4: Data Preprocessing

Let's get our data ready for the model by ensuring it's clean and formatted correctly. Please copy the following code.

from sklearn.preprocessing import LabelEncoder

# Fill missing values, if any, by carrying the previous row's value forward
data = data.ffill()

# Encode the track names as integers so the model can use them
encoder = LabelEncoder()
data['track_name'] = encoder.fit_transform(data['track_name'])
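To see what these two steps actually do, here is a small sketch on made-up rows. The track names, the `energy` values, and the `track_id` column are hypothetical and only there for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows with two gaps in the energy column
toy = pd.DataFrame({
    "track_name": ["Song B", "Song A", "Song B", "Song C"],
    "energy": [0.7, None, 0.5, None],
})

# Forward fill copies the last seen value down into each gap
toy = toy.ffill()

# LabelEncoder assigns each distinct name an integer
encoder = LabelEncoder()
toy["track_id"] = encoder.fit_transform(toy["track_name"])

# classes_ lists the original names in sorted order; the position is the code
print(list(encoder.classes_))
print(toy)

# inverse_transform maps the integers back to the original names
recovered = encoder.inverse_transform(toy["track_id"])
print(list(recovered))
```

Keep in mind that these integers are just IDs handed out in alphabetical order. A linear model will still treat them as ordinary numbers, so a track encoded as 2 counts as "twice" a track encoded as 1, which is worth remembering when you interpret the results.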

Step 5: Model Building

Now for the exciting part: let's start building our model! We'll begin with a linear regression model, which will allow us to predict a track's popularity based on features like danceability, energy, loudness, acousticness, and the newly encoded track name. Please copy and paste the following code into your Kaggle notebook to set up and train your model.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Define the features and the target variable
X = data[['danceability', 'energy', 'loudness', 'acousticness', 'track_name']]
y = data['popularity']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set and compute the mean squared error
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

This step initializes our model, fits it on the training data, and evaluates it using the test data to see how well it can predict popularity.
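If you're curious what "fitting" means in practice, here is a sketch on synthetic data where we already know the answer. The target is built as exactly 5 + 2·x1 − 3·x2, so a linear regression should recover those numbers. All names and values here are made up for illustration and have nothing to do with the setlist dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features and a target that is an exact linear function of them
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 1, size=(50, 2))
y_toy = 5 + 2 * X_toy[:, 0] - 3 * X_toy[:, 1]

# Same 80/20 split as in the guide
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Fit on the training rows, then score on the held-out rows
toy_model = LinearRegression().fit(X_tr, y_tr)
toy_mse = mean_squared_error(y_te, toy_model.predict(X_te))

# The learned intercept and coefficients should be close to 5, 2, and -3
print(toy_model.intercept_, toy_model.coef_)
print(toy_mse)
```

On your real model, `model.coef_` and `model.intercept_` give you the same kind of readout, one coefficient per feature in `X`.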

Step 6: Model Evaluation

After building and running the model, it's crucial to understand its performance. This is where we evaluate how well our model predicts the popularity of Taylor Swift's tracks. To assess this, we'll look at the model's R² score and its mean squared error. Copy the following code to evaluate the model.

# Calculate the R² score (for a regression model, score() returns R², not classification accuracy)
r2 = model.score(X_test, y_test)
print("Model R² score:", r2)

# Output the mean squared error
print(f'Mean Squared Error: {mse}')

This step gives us two key metrics for assessing our model's performance: the R² score and the mean squared error. R² measures how much of the variation in popularity the model explains; 1.0 is a perfect fit, and values near or below 0 mean the model does no better than always predicting the average. The mean squared error is the average of the squared differences between our model's predictions and the actual values, so lower is better.
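To make these two numbers less mysterious, here is a sketch that computes them by hand on four made-up predictions. For a regressor like `LinearRegression`, scikit-learn's `score` method returns R² (the coefficient of determination), which is what the second half of the sketch reproduces. The values below are hypothetical.

```python
import numpy as np

# Hypothetical actual popularity values and model predictions
y_true = np.array([70.0, 80.0, 90.0, 60.0])
y_hat = np.array([72.0, 78.0, 85.0, 65.0])

# Mean squared error: average of the squared prediction errors
errors = y_true - y_hat                # [-2, 2, 5, -5]
mse_manual = np.mean(errors ** 2)      # (4 + 4 + 25 + 25) / 4 = 14.5

# Taking the square root puts the error back in popularity points
rmse_manual = np.sqrt(mse_manual)

# R²: 1 minus (squared error of the model / squared error of always guessing the mean)
ss_res = np.sum(errors ** 2)                      # 58
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 500
r2_manual = 1 - ss_res / ss_tot                   # 0.884

print(mse_manual, rmse_manual, r2_manual)
```

Because the mean squared error is in squared units, the root mean squared error is often easier to read: it is roughly the typical size of a miss, in popularity points.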

You've Just Built Your Very First Machine Learning Model!

Incredible work on creating and assessing your first machine learning model on Kaggle! You've cleverly utilized real-world data from Taylor Swift's tour setlists to forecast song popularity.

It turns out being a Swiftie now comes with a bonus! You're also on your way to becoming a machine learning engineer. Way to go, Swifties!

What's Next?

This has truly been a hands-on guide, but don't go anywhere just yet! Stay tuned for our upcoming blog posts where we'll dive deeper into machine learning concepts, explore career paths, and much more.

In Part 2 of this series, we’ll dig into the theoretical side of the machine learning techniques we’ve explored today. We’ll break down each concept and library in more detail to further enhance your understanding and ability to apply these methods in various scenarios. Keep experimenting and see you soon!

In case of any questions or suggestions, feel free to reach out to me via LinkedIn. I'm always open to fruitful discussions. 🍏🦜
