The OSEMN Approach for a Multiple Regression Analysis
Regression is an incredibly important tool for data scientists and analysts: it helps us understand how features impact a target variable and make new predictions based on data. While a regression analysis is an iterative process (meaning that you will have to jump back and forth between parts, repeat stages, and sometimes complete the entire process over again), there is a general outline of steps that can be followed. To perform a regression analysis, we can use the OSEMN (Obtain, Scrub, Explore, Model, iNterpret) data science framework to group our project into parts. To apply OSEMN to a regression analysis, let's look at the example of performing a multivariable regression on a popular housing dataset for the Seattle area.
Obtain is the part of OSEMN where we actually collect our data. However, before jumping into data collection, it's important to understand the needs of the stakeholders in the project and what problem we are really trying to solve. Will the data be used to make predictions in the future? Or to understand which variables are the most influential on an outcome metric? Who will be using this data or model at the end of the project? It's important to consider the overall purpose of the project in order to collect only the data that is most relevant to our analysis. For example, if we are trying to predict prices for luxury housing, we may need a different set or subset of the data than if we are designing a model to predict low-income housing prices. While having a lot of data is generally good, having more relevant data will improve the accuracy of the model as well as simplify the cleaning process. On the other hand, having too little data, or data that lacks features relevant to the analysis, may require collecting data from other sources. For the housing regression, we may require information not included in the provided dataset that we think affects housing prices, such as neighborhood, school district, and resident income information. We can collect this data from public databases or from websites via web scraping and APIs.
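For instance, if the base dataset ships as a CSV file and supplemental income data is exposed through a public API, a minimal obtain sketch might look like the following (the file name, URL, and the assumption that both tables share a zip code column are placeholders for illustration, not part of the original dataset):

import pandas as pd
import requests

# Load the base housing dataset (file name is a placeholder)
housing = pd.read_csv("kc_house_data.csv")

# Hypothetical public API returning neighborhood income records by zip code
response = requests.get("https://example.com/api/income_by_zipcode")  # placeholder URL
income = pd.DataFrame(response.json())

# Attach the supplemental data to each house record by zip code
housing = housing.merge(income, on="zipcode", how="left")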
Now that we have our base dataset to work with, we can move on to the Scrub part of the OSEMN process, which involves data cleaning and some initial data exploration. Before the data can be fed into machine learning algorithms, it's important to preprocess it. If we are using Python or R to manipulate and process data, we can read files in using libraries such as Pandas, which put the data into a tabular (DataFrame) format. After we have our data in a malleable format, here are a few considerations we should take into account:
· Treatment of missing values — Should we drop missing values or replace them with another metric, such as the mean, median or mode? How are missing values referenced?
· Duplicate values — Are there duplicate values in our dataset (repeated rows or columns)?
· Outliers — Should we include outliers in our dataset? If we decide to remove them, what metric should we use? IQR? Standard deviations?
It’s also important to make sure you’re only including the most relevant data for the analysis. In the case of the housing dataset, columns indicating whether the house has been viewed may not be as relevant in determining housing price as columns such as square footage. There may also be superfluous information about the house’s location (for example, both town and zip code) that adds little to no additional information. It’s important to put your data into the cleanest possible format before feeding it into a machine learning algorithm, as sketched below.
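A minimal pandas sketch of these scrubbing steps might look like the following (the column names are assumptions based on the King County housing data and should be adjusted to whatever the actual dataset contains):

# Inspect missing values, then fill or drop them
print(housing.isna().sum())
housing["waterfront"] = housing["waterfront"].fillna(0)   # e.g. treat missing as "no waterfront"
housing = housing.dropna(subset=["price"])                # drop rows missing the target

# Remove duplicate rows
housing = housing.drop_duplicates()

# Remove price outliers using the IQR rule
q1, q3 = housing["price"].quantile([0.25, 0.75])
iqr = q3 - q1
housing = housing[(housing["price"] >= q1 - 1.5 * iqr) &
                  (housing["price"] <= q3 + 1.5 * iqr)]

# Drop columns that add little information for this analysis
housing = housing.drop(columns=["id", "view"], errors="ignore")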
The best way to make intuitive model improvements is to understand the dataset. A few good ways to get a handle on your dataset include the following (a short exploration sketch in pandas follows the list):
· Basic descriptive statistics — Calculate the mean, median, mode, minimum, maximum, and standard deviation. This may give you an initial understanding of the distribution of your independent and dependent variables and the different magnitudes of your variables.
· Histograms and boxplots — Another good way to understand the distributions of your data. Is the data normally distributed or fairly skewed? This can provide an indication of which features need to be transformed to make their distributions more normal, through methods such as log transformations and removal of outliers.
· Scatterplots — Scatterplots are a great visualization to help determine the relationship between certain variables. Some independent variables may be linearly related to the dependent variable, while others may be better represented through a higher-order relationship (through polynomials). Scatterplots are also useful in checking some of the assumptions required to perform a multiple regression analysis, such as linearity (for linear regression models) and multicollinearity (for multivariable regression models).
· Heatmaps and correlation matrices — Take note of which independent variables may be correlated with each other by building correlation matrices or heatmaps. In the housing dataset, there may be fairly strong correlations between variables such as square footage of living area and square footage above ground. It may be important to eventually remove one of the correlated features in order to produce an accurate model.
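Under the same assumed column names, a short exploration sketch using pandas, Matplotlib, and Seaborn might look like this:

import matplotlib.pyplot as plt
import seaborn as sns

# Basic descriptive statistics for every numeric column
print(housing.describe())

# Distributions: histograms for all columns and a boxplot of the target
housing.hist(figsize=(14, 10), bins=30)
sns.boxplot(x=housing["price"])

# Relationship between a candidate predictor and the target
housing.plot.scatter(x="sqft_living", y="price", alpha=0.3)

# Correlation heatmap to spot multicollinearity between predictors
sns.heatmap(housing.corr(numeric_only=True), cmap="coolwarm")
plt.show()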
It’s also important at this point to understand which of your variables are categorical and which are continuous. Categorical variables need to be one hot encoded (expanded into binary indicator variables) in order for machine learning algorithms to work properly. In the example of the housing dataset, certain categorical features are obvious (such as grade, condition, and zip code), while others may be less clear (year built, floors). Whether a variable is treated as continuous or categorical makes a difference in how it is transformed before being fed into the model. In terms of putting this into practice, there are various Python methods and libraries to generate descriptive statistics, create visualizations, and make other transformations to the data during the explore phase.
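For example, pandas can one hot encode a chosen set of categorical columns in one step; which columns to treat as categorical is an assumption here:

# Treat grade, condition, and zip code as categories and one hot encode them;
# drop_first avoids the dummy-variable trap (perfect multicollinearity)
categorical_cols = ["grade", "condition", "zipcode"]
housing_encoded = pd.get_dummies(housing, columns=categorical_cols,
                                 drop_first=True, dtype=float)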
After the data has been cleaned, explored, and properly formatted, we can begin to fit a model to it. In order for the model to run properly, make sure all categorical variables are one hot encoded. Then, we split the data into training and testing sets. The training data will be used to engineer the model throughout the iterative modeling process, while the testing data will be used to test how the model performs on new data. This is to avoid overfitting the model to a particular dataset, allowing it to perform accurately as new data is fed into it. In the case of the housing dataset, we split the set 80% for training and 20% for testing, a split commonly seen in the data science realm. After splitting the data, perform any transformations and scaling necessary on the continuous variables (log transformation, min-max scaling) before fitting the model. When we are finally ready to model, there are a couple of useful Python libraries, such as Scikit-Learn and Statsmodels, that include regression functions. For the housing dataset, we used a Statsmodels regression function with arguments for our dependent variable (housing price) and independent variables (all other variables, such as square footage, bedrooms, bathrooms, etc.). Statsmodels produces a detailed summary table for the fitted model; a minimal sketch of this modeling step is shown below.
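A minimal sketch of this modeling step, using Scikit-Learn for the split and Statsmodels for the regression (and assuming the encoded, all-numeric frame from the scrub and explore steps above), might look like this:

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Separate the target (price) from the predictors
X = housing_encoded.drop(columns=["price"])
y = housing_encoded["price"]

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# (Any log transformations or min-max scaling of continuous predictors
# would be fit on the training data here and applied to both splits.)

# Fit an ordinary least squares regression with an intercept term
model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(model.summary())

Calling model.summary() prints the R-squared, coefficients, p-values, and diagnostic statistics discussed next.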
When the model has produced an output, there are various things we need to look at in order to understand its accuracy and make improvements:
· R-squared value — This is a number between 0 and 1 that represents how well your model fits the data. For example, in the case of performing a regression to predict housing prices, an R-squared of 0.85 means that 85% of the variation in the dependent variable (housing price) is explained by the independent variables included in your model. While having a high R-squared is desirable, it is not the only metric of success and does not guarantee an accurate model. The model may have a high R-squared value simply because it includes many variables, but still have a high average error when it comes to making predictions. It also may not meet some of the basic assumptions required to perform a regression.
· Coefficients for each of the independent variables — Do the coefficients make sense in the context of the business problem? For example, does it make sense to have a negative coefficient for the square footage of the house (meaning that as square footage increases, the predicted housing price decreases)?
· P-values — The p-value for each variable represents the probability of seeing a coefficient at least as extreme as the estimated one if there were actually no relationship between that independent variable and the dependent variable. For further iterations, variables with p-values over a certain threshold (commonly 0.05) are not statistically significant and may need to be removed.
This list is not exhaustive, and there are many other relevant metrics in the output that can help improve the model, such as the Jarque-Bera (JB) statistic for normality, and the skewness and kurtosis. After examining the output, it’s important to test how the model’s predicted values (in our case, housing prices) differ from the actual data. Root Mean Square Error (RMSE) is a commonly used metric to measure the average error of the model. For example, in the case of the housing dataset, if our model has an RMSE of 150,000, our model is on average $150,000 off in its price predictions. Depending on the output, this can signify that the model needs more tuning. It’s also important at this point to check again that the model meets the basic assumptions of a regression by plotting the residuals (the errors between the predicted and actual values of the dependent variable). This can help determine whether the model meets other basic assumptions, including normality (that the residuals are normally distributed) and homoscedasticity (that the variance of the residuals shows no specific pattern). Being able to interpret all of the previously mentioned metrics allows you to improve the model so that it generates less error in its predictions.
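Continuing the sketch above, the key summary quantities, the test-set RMSE, and the residual checks might be computed like this:

import numpy as np
import matplotlib.pyplot as plt

# Pull key quantities directly off the fitted Statsmodels result
print(model.rsquared)
print(model.pvalues[model.pvalues > 0.05])   # candidates for removal

# Predict on the held-out test data
X_test_const = sm.add_constant(X_test, has_constant="add")
y_pred = model.predict(X_test_const)

# Root Mean Square Error: the typical size of a prediction error, in dollars
residuals = y_test - y_pred
rmse = np.sqrt((residuals ** 2).mean())
print(f"Test RMSE: {rmse:,.0f}")

# Residual diagnostics: a histogram to eyeball normality, and residuals vs.
# fitted values to look for patterns (heteroscedasticity)
plt.hist(residuals, bins=50)
plt.figure()
plt.scatter(y_pred, residuals, alpha=0.3)
plt.axhline(0, color="red")
plt.show()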
As previously mentioned, OSEMN is an iterative process. It’s normal to go back and scrub the data again, take out or add features, transform features, or even search for a new dataset entirely. The idea is that at each cycle of the process, whether the model is improving or not, we better understand the data we’re working with, allowing us to build the most accurate, interpretable model possible.