Creating a Linear regression model using Python's scikit-learn library

Introduction

Machine learning is a rapidly expanding field concerned with teaching computers to learn from data and to make predictions or decisions based on what they have learned. Linear regression is a common machine learning technique for predicting a continuous outcome variable from one or more predictor variables. Python is a popular language for machine learning because of its flexibility, ease of use, and wealth of robust libraries. One such library is scikit-learn, which offers a variety of machine learning algorithms along with tools for data processing, model selection, and model evaluation. In this tutorial, we will create a linear regression model using scikit-learn. We begin with an overview of linear regression and how it works, then move on to preparing the data, building and evaluating the model, and making predictions on new data.

By the end of this tutorial, you will understand how to create a linear regression model with scikit-learn and be able to apply this knowledge to your own machine learning projects.

Understanding Linear Regression

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables. Its objective is to identify the line (or, with several independent variables, the plane or hyperplane) that best describes that relationship. There are two primary types: simple linear regression, used when there is only one independent variable, and multiple linear regression, used when there are two or more. In both cases, the model is represented by the following equation:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

where y is the dependent variable, b0 is the intercept, and b1, b2, ..., bn are the coefficients for each independent variable x1, x2, ..., xn.
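To make the equation concrete, here is a minimal sketch that evaluates it for a single observation. The intercept, coefficients, and input values are hypothetical numbers chosen only for illustration; they do not come from any dataset in this tutorial.

```python
import numpy as np

# Hypothetical intercept and coefficients, chosen only for illustration
b0 = 2.0
b = np.array([0.5, -1.0, 3.0])   # b1, b2, b3

# One observation with three independent variables x1, x2, x3
x = np.array([4.0, 1.0, 2.0])

# y = b0 + b1*x1 + b2*x2 + b3*x3
y = b0 + np.dot(b, x)
print(y)  # 9.0
```

Writing the sum as a dot product is what lets the same expression handle any number of independent variables.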

The objective of linear regression is to find the values of b0, b1, b2, ..., bn that best fit the data. This is typically accomplished using the least squares technique, which minimizes the sum of the squared differences between the predicted and actual values. Linear regression can be used for both prediction and inference. The objective of prediction is to use the model to estimate the outcome for new data. The objective of inference is to understand the relationship between the independent and dependent variables and to determine which variables matter most for the outcome.
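As a rough illustration of the least squares idea, the sketch below generates synthetic data from known coefficients and recovers them with NumPy's `np.linalg.lstsq`. The data, the "true" coefficients, and all variable names are assumptions made up for this example.

```python
import numpy as np

# Synthetic data generated from known (made-up) coefficients plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_b0, true_b = 1.5, np.array([2.0, -0.7])
y = true_b0 + X @ true_b + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept b0 is estimated with the coefficients
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares: choose b to minimize the sum of squared residuals
b_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
sse = np.sum((y - X_design @ b_hat) ** 2)

print("estimated [b0, b1, b2]:", b_hat)
print("sum of squared errors:", sse)
```

The estimates should land close to 1.5, 2.0, and -0.7; any other choice of coefficients gives a larger sum of squared errors on this data.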

Data Preparation

Before we can build a linear regression model, we need to prepare our data. This typically involves the following steps:

  1. Data collection and exploration: We first need to collect our data and explore it to gain an understanding of its structure, variables, and relationships.

  2. Data cleaning and preprocessing: We then need to clean and preprocess our data to handle missing values, outliers, and other anomalies. This may involve techniques such as imputation, scaling, and normalization.

  3. Splitting the data into training and testing sets: We need to split our data into two sets - a training set and a testing set. The training set will be used to train our model, while the testing set will be used to evaluate its performance.

Let's see how these steps can be implemented using Python and the scikit-learn library.

# Step 1: Data collection and exploration
import pandas as pd

# Load the dataset
data = pd.read_csv('data.csv')

# Print the first five rows of the dataset
print(data.head())

# Print summary statistics of the dataset
print(data.describe())

# Step 2: Data cleaning and preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Handle missing values
imputer = SimpleImputer(strategy='mean')
data[['age', 'income']] = imputer.fit_transform(data[['age', 'income']])

# Scale the features
scaler = StandardScaler()
data[['age', 'income']] = scaler.fit_transform(data[['age', 'income']])


# Step 3: Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data[['age', 'income']], data['target'], test_size=0.2, random_state=42)

The first step in the preceding code loads the data into a pandas DataFrame and explores it to understand its structure and variables. We use the pandas read_csv function to read data from a CSV file and store it in a DataFrame named "data". The head method then prints the first five rows of the dataset, while the describe method prints summary statistics such as the mean, standard deviation, minimum, and maximum for each numeric column.

The second step consists of cleaning and preprocessing the data to account for missing values, outliers, and other anomalies. Using the SimpleImputer class from scikit-learn, we replace missing values in the "age" and "income" columns with the respective column's mean. Then, we use the StandardScaler class from scikit-learn to scale the "age" and "income" columns to have a mean of 0 and a standard deviation of 1. Scaling is not strictly required for ordinary least squares, whose predictions do not change when the features are rescaled, and linear regression does not assume that the features are normally distributed. Scaling does, however, put the coefficients on a comparable scale, which makes them easier to compare.

In the third step, the data is separated into training and testing sets. This is essential because we must evaluate the performance of our model on new, unseen data. Using the train_test_split function from scikit-learn, we randomly split the data into a training set and a testing set. We specify a value of 0.2 for the "test_size" parameter, indicating that 20% of the data will be used for testing and 80% for training. We also set "random_state" to 42 so that the split can be reproduced. The function returns four objects holding the training and testing portions of the features and the target: X_train, X_test, y_train, and y_test.
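One caveat about the order of the steps above: the imputer and scaler are fitted on the full dataset before the split, so statistics from the test rows influence the preprocessing. A common alternative is to split first and fit the preprocessing on the training set only. The sketch below shows one way to do that with scikit-learn's Pipeline; the synthetic DataFrame stands in for data.csv, and its values and the "preprocess" name are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data.csv (made-up values, for illustration only)
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "age": rng.integers(18, 70, size=100).astype(float),
    "income": rng.normal(50_000, 12_000, size=100),
})
data.loc[::10, "age"] = np.nan  # introduce some missing values
data["target"] = 0.3 * data["age"].fillna(40) + 0.001 * data["income"]

# Split first, so the test rows play no part in fitting the preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    data[["age", "income"]], data["target"], test_size=0.2, random_state=42
)

# Imputation and scaling are learned from the training set only
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)  # reuse training statistics

print(X_train_prepared.mean(axis=0))  # approximately 0 for each column
print(X_train_prepared.std(axis=0))   # approximately 1 for each column
```

The test set is transformed with the means and standard deviations learned from the training set, so its columns will not be exactly centered; that is expected.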

Building the Model

Now that we have prepared our data, we can build our linear regression model. This involves the following steps:

  1. Initializing the model: We first need to initialize a linear regression model. In scikit-learn, this can be done using the LinearRegression class.

  2. Fitting the model: We then need to fit the model to our training data. This involves finding the coefficients of the linear regression equation that best fit the training data. In scikit-learn, this can be done using the fit method.

  3. Making predictions: Once the model has been fitted, we can use it to make predictions on new data. In scikit-learn, this can be done using the predict method.

These steps can be implemented using Python and scikit-learn.

# Step 1: Initializing the model
from sklearn.linear_model import LinearRegression

# Initialize a linear regression model
model = LinearRegression()

# Step 2: Fitting the model
model.fit(X_train, y_train)

# Step 3: Making predictions
y_pred = model.predict(X_test)

Using scikit-learn's LinearRegression class, we initialize a linear regression model. The model is then fitted to the training data using the fit method. Lastly, we use the predict method to make predictions on our testing data and store them in the variable y_pred.
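After fitting, the estimated intercept and coefficients are available as the intercept_ and coef_ attributes of the model, which map directly onto b0 and b1, ..., bn in the equation from earlier. The sketch below uses noise-free synthetic data with made-up coefficients so the recovered values are easy to check; it is an illustration, not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data generated from y = 4 + 3*x1 - 2*x2 (made-up values)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1]

model = LinearRegression()
model.fit(X, y)

print("intercept b0:", model.intercept_)     # close to 4.0
print("coefficients b1, b2:", model.coef_)   # close to [3.0, -2.0]

# predict applies the linear equation to every row of X
manual = model.intercept_ + X @ model.coef_
print(np.allclose(manual, model.predict(X)))  # True
```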

Once predictions have been made, the performance of our model can be evaluated using metrics such as mean squared error, R-squared, and others. We can also visualize the results with plots to gain additional insight into the performance of our model.

Model Evaluation

Now that we have built our linear regression model and made predictions on our testing data, we need to evaluate its performance. This involves comparing the predicted values to the actual values using various metrics. The most commonly used metrics for evaluating linear regression models are:

  1. Mean squared error (MSE): This measures the average squared difference between the predicted and actual values.

  2. R-squared (R²): This measures the proportion of variance in the target variable that is explained by the model.
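The two definitions above can be written out directly. The following sketch computes both metrics by hand with NumPy on a tiny made-up example and checks that they agree with scikit-learn's functions; the arrays y_true and y_hat are hypothetical values, not results from the model in this tutorial.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Tiny made-up example: actual values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 10.0])

# MSE: average of the squared differences
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual)  # 0.375
print(r2_manual)   # 0.925

# Both agree with scikit-learn's implementations
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))  # True
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))             # True
```

A lower MSE is better, and it is expressed in squared units of the target. R-squared is at most 1 and can be negative when the model does worse than always predicting the mean.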

Let's see how these metrics can be computed using Python and scikit-learn.

# Import the necessary libraries
from sklearn.metrics import mean_squared_error, r2_score

# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Compute the R-squared score
r2 = r2_score(y_test, y_pred)

# Print the results
print("Mean Squared Error: ", mse)
print("R-squared: ", r2)

In the above code snippet, we first import the functions needed to compute the mean squared error and R-squared score. We then compute these metrics using the mean_squared_error and r2_score functions from scikit-learn, respectively. Finally, we print the results. In addition to these metrics, we can also visualize the results using plots. For example, we can plot the predicted values against the actual values to see how well our model fits the data; the closer the points lie to the diagonal, the better the fit.

import matplotlib.pyplot as plt

# Plot the predicted values against the actual values
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs Predicted Values")
plt.show()

In the above code snippet, we first import the matplotlib library for creating plots. We then create a scatter plot of the predicted values against the actual values using the scatter function. We also add labels to the x and y axes and a title to the plot. Finally, we display the plot using the show function.

Predicting New Data

Once we have built and evaluated our linear regression model, we can use it to make predictions on new data. This involves three steps:

  1. Collecting new data: We first need to collect the new data that we want to make predictions on.

  2. Preprocessing the new data: We then need to preprocess the new data in the same way as we did with our training data. This involves applying the same fitted transformations, such as imputation and scaling.

  3. Making predictions: Once the new data has been preprocessed, we can use our trained model to make predictions on it.

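# Step 1: Collecting new data
# Hypothetical values; the columns must match the features used for training
new_data = pd.DataFrame({'age': [25.0, 40.0, 58.0],
                         'income': [32000.0, 54000.0, 71000.0]})

# Step 2: Preprocessing the new data
# Reuse the fitted scaler (transform, not fit_transform) and keep the column names
new_data = pd.DataFrame(scaler.transform(new_data), columns=new_data.columns)

# Step 3: Making predictions
new_predictions = model.predict(new_data)
print(new_predictions)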

In the preceding code fragments, we collect new data on which to make predictions. This new data is then preprocessed using the same scaler object that was used to preprocess our training data. Finally, we use our trained linear regression model to make predictions on the preprocessed new data, which we store in the "new_predictions" variable.

Once we have made predictions based on the new data, we can use the resulting information for a variety of purposes, including making business decisions and conducting additional research.

Conclusion

In this tutorial, we have learned how to construct a linear regression model using Python and scikit-learn. We began with an overview of linear regression and its applications in machine learning. Then, we walked through the steps required to construct a linear regression model, including data preparation, model construction, and model evaluation. We also learned how to evaluate the performance of our model using metrics such as mean squared error and R-squared, as well as how to make predictions on new data using the trained model.

Linear regression is a powerful technique that can be used in a variety of machine learning applications where the outcome is a continuous quantity, such as stock price forecasting. Having followed the steps outlined in this tutorial, you should now have a solid understanding of how to construct and evaluate a linear regression model using Python and scikit-learn.