See the full Jupyter Notebook on GitHub.

Let’s practice the machine learning workflow by predicting a car’s market price from its attributes. For each car in our dataset, we have information about the technical aspects of the vehicle, such as its weight, miles per gallon, acceleration, etc. Our goal is to identify which attribute, or combination of attributes, is most helpful in predicting the price of a car. To do this, we’ll use K-Nearest Neighbors.

Data Set

We’ll be working with the Automobile Data Set from the UCI Machine Learning Repo.
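
The notebook isn’t reproduced here, but loading the data might look roughly like this (the file name imports-85.data and the column names are taken from the UCI documentation; the compression-ratio attribute is named compression-rate to match the column names used below):

import pandas as pd
import numpy as np

# column names from the UCI documentation, with compression-rate
# standing in for the compression-ratio attribute
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
        'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
        'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
        'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg',
        'price']

# the raw data file has no header row
cars = pd.read_csv('imports-85.data', names=cols)
cars.head()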

The target column is price.

The columns that are numeric are symboling, wheel-base, length, width, curb-weight, engine-size, compression-rate, city-mpg, and highway-mpg.

There are no missing values reported in any column. However, looking at the output of cars.head(), the first three values in normalized-losses are ?, so we should be careful of missing values that are represented using other symbols.

Data Cleaning

Since we already noticed the data issue in the normalized-losses column, let’s replace the ? values with NaN. The same placeholder might occur in other columns, so if we do this replacement across the whole dataframe, we can see which columns actually do have missing values.

cars.replace('?',np.nan,inplace=True)

We are only going to focus on continuous numeric data, so we need to select only those columns.

continuous_values_cols = ['normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
numeric_cars = cars[continuous_values_cols]

Then we convert all columns to float

numeric_cars = numeric_cars.astype('float')

and then check for null values

numeric_cars.isnull().sum()

There are four missing values in the price column. Since price is our target, we will remove the rows where it is missing.
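
In pandas, that removal might look like this (a sketch using the same dataframe name):

numeric_cars = numeric_cars.dropna(subset=['price'])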

We only have 205 rows, so we want to preserve as much data as we can. The remaining missing values we will fill with the mean of each column:

numeric_cars = numeric_cars.fillna(numeric_cars.mean())

Let’s also rescale the values in the numeric columns so they all range from 0 to 1, except for the price column, which we keep in its original units.

price_col = numeric_cars['price']
numeric_cars = (numeric_cars - numeric_cars.min())/(numeric_cars.max() - numeric_cars.min())
numeric_cars['price'] = price_col

Now that the data is clean, we are ready to build the model to make predictions.

Univariate Model

Let’s start with a simple univariate model. We will then use this to understand the features better when we build our multivariate model.

Let’s write a function that does the training and simple validation process.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score

def knn_train_test(df, training_col, target_col, k):
    # initialize the model and the folds for cross validation
    knn = KNeighborsRegressor(n_neighbors=k)
    kf = KFold(n_splits=5, shuffle=True, random_state=1)

    # if we only want to check one column, we need to
    # convert it to a list
    if isinstance(training_col, str):
        training_col = [training_col]

    # use cross validation to determine the MSE of each fold
    mses = cross_val_score(knn,
                           df[training_col],
                           df[target_col],
                           scoring='neg_mean_squared_error',
                           cv=kf)

    # return the RMSE of the average MSE across the folds
    rmse = np.sqrt(np.absolute(np.mean(mses)))
    return rmse

Since we are starting with a univariate model, we create a list of column names and loop through them with our function, using k = 5, the default number of neighbors in scikit-learn.
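
A minimal sketch of that loop (the dictionary and the sorting step are assumptions; the other names are defined above):

# evaluate each feature on its own with k = 5
rmse_results = {}
for col in continuous_values_cols:
    if col != 'price':
        rmse_results[col] = knn_train_test(numeric_cars, col, 'price', 5)

# sort the features from lowest to highest error
univariate_rmses = pd.Series(rmse_results).sort_values()

Sorting the features by RMSE gives the following results: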

Column Name        RMSE
engine-size        3179
horsepower         3814
highway-mpg        4157
width              4241
curb-weight        4425
city-mpg           4502
length             5853
wheel-base         6155
compression-rate   6417
bore               6861
normalized-losses  7099
stroke             7134
peak-rpm           7599
height             7618

We then consider different values of k. For each individual feature, we try five different k-values.

RMSE for different K values

We see that while there is a change in the RMSE, only a few features change their order.
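
The per-feature sweep behind that plot might be sketched like this (the specific k values are an assumption for illustration):

# try several k values for each individual feature
k_values = [1, 3, 5, 7, 9]
feature_k_rmses = {}
for col in univariate_rmses.index:
    feature_k_rmses[col] = [knn_train_test(numeric_cars, col, 'price', k)
                            for k in k_values]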

Multivariate Model

Now, let’s run some multivariate models that include multiple features. We will include the best two, three, four, and five features from the previous step to evaluate the impact of using different numbers of features (a sketch of these runs follows the results table below).

Number of Features   RMSE
Two                  2994
Three                2862
Four                 3166
Five                 3159
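
A sketch of how these multivariate runs might look, reusing the sorted univariate results (univariate_rmses) from the earlier sketch:

# take the n best features (lowest univariate RMSE) and evaluate them together
multivariate_rmses = {}
for n in range(2, 6):
    best_n_features = univariate_rmses.index[:n].tolist()
    multivariate_rmses[n] = knn_train_test(numeric_cars, best_n_features, 'price', 5)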

Hyperparameter Tuning

Let’s now optimize the models that performed the best in the previous step. We’ll adjust the k value, the number of neighbors, from 1 to 25 for the three best models from above.

Hyperparameter Tuning
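
The sweep behind that plot might be sketched as follows, reading the three lowest-error feature-set sizes (two, three, and five) off the table above and reusing names from the earlier sketches:

# sweep k from 1 to 25 for the three best feature sets
k_values = list(range(1, 26))
feature_sets = {n: univariate_rmses.index[:n].tolist() for n in (2, 3, 5)}

k_sweep_rmses = {}
for n, features in feature_sets.items():
    k_sweep_rmses[n] = [knn_train_test(numeric_cars, features, 'price', k)
                        for k in k_values]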

For the multivariate models, we see that the value of k we choose should be less than about 5. A value of k = 1 produces the lowest error for the five-feature and three-feature sets, but with only one neighbor the model would be prone to large errors for unusual cars, so we should avoid it. A value of k = 2 most consistently gives the lowest RMSE, and at k = 2 there is only a small difference between the models. The five-feature model has a slightly smaller error and should handle outliers better, so we would suggest using k = 2 with the five features we have used here.
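
Putting that recommendation into practice might look like this (a sketch only; the model is fit on the full rescaled dataset, and the names reuse the earlier sketches):

# final model: k = 2 with the five best features
best_features = univariate_rmses.index[:5].tolist()
final_model = KNeighborsRegressor(n_neighbors=2)
final_model.fit(numeric_cars[best_features], numeric_cars['price'])

# sanity check: predict the prices of the first few cars
predictions = final_model.predict(numeric_cars[best_features].head())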

