Last month I was unable to take part in Kaggle's tabular competition because there was no December 2022 edition. For 2023, Kaggle has changed the name and format of its tabular competitions, organising them into seasons and episodes, so this month's competition is Season 3, Episode 1. In addition, during January 2023 a new competition will launch every Tuesday, which means I will have a lot of work to do to complete each of this month's competitions.

The first competition for January 2023 is based on the California House Price dataset. I have previously written posts on this dataset, but decided to make a few changes to the model to see whether I could improve the accuracy and reduce the error.

The problem statement for Episode 1 of Kaggle's 2023 playground competition can be found on the competition page.


I have written the program using Kaggle's free online Jupyter Notebook, which can be accessed through the competition question.

The first thing to do when creating a new notebook is to import the libraries that will be needed to execute the program. The libraries that were imported for this competition question are:-

  1. pandas for data processing,
  2. NumPy for numerical processing,
  3. os to interact with the operating system,
  4. scikit-learn (sklearn) for machine learning,
  5. Matplotlib for plotting, and
  6. Seaborn for higher-level statistical graphics.
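The import cell would look roughly like this — a sketch of the usual imports for this kind of notebook, not the exact cell:

```python
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
```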

I then used the os library to list the files that had been stored in the input directory for this competition:-

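Kaggle's notebook template does this with os.walk. A sketch — on Kaggle the files live under /kaggle/input, but the current directory stands in here so the snippet runs anywhere:

```python
import os

input_dir = "."  # would be "/kaggle/input" in the Kaggle notebook
found = []
for dirname, _, filenames in os.walk(input_dir):
    for filename in filenames:
        # Print the full path of every file under the input directory
        found.append(os.path.join(dirname, filename))
        print(found[-1])
```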

I used pandas to read the three csv files into dataframes, being:-

  1. train,
  2. test, and
  3. submit.
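In the notebook the three files are read straight from the competition's input directory, e.g. pd.read_csv("/kaggle/input/playground-series-s3e1/train.csv"). A toy in-memory CSV stands in here so the snippet runs anywhere:

```python
import pandas as pd
from io import StringIO

# Stand-in for train.csv; test.csv and sample_submission.csv are read the same way
toy_csv = StringIO("id,MedInc,MedHouseVal\n0,8.3252,4.526\n1,8.3014,3.585\n")
train = pd.read_csv(toy_csv)
print(train.shape)  # (2, 3)
```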

I decided to remove all of the rows in the train dataframe whose target value appears only once:-

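A minimal sketch of that filtering step, using a toy dataframe in place of the real train data:

```python
import pandas as pd

# Toy stand-in for the train dataframe
train = pd.DataFrame({"MedInc": [8.3, 5.6, 7.1, 2.2],
                      "MedHouseVal": [4.5, 3.1, 4.5, 0.9]})

# Keep only rows whose target value occurs more than once
counts = train["MedHouseVal"].value_counts()
train = train[train["MedHouseVal"].isin(counts[counts > 1].index)]
print(len(train))  # 2 — only the rows with MedHouseVal == 4.5 survive
```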

I used seaborn's displot to analyse the target after the values that only appear once had been removed:-


I then combined the train and test dataframes into a single combi dataframe, first dropping the target column, 'MedHouseVal', from the train dataframe.

Once the combi dataframe had been created, I dropped the column, 'id', because the dataframe's own index already identifies each row:-

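These two steps can be sketched with toy dataframes in place of the real train and test data:

```python
import pandas as pd

# Toy stand-ins for the competition dataframes
train = pd.DataFrame({"id": [0, 1], "MedInc": [8.3, 5.6], "MedHouseVal": [4.5, 3.1]})
test = pd.DataFrame({"id": [2, 3], "MedInc": [7.1, 2.2]})

# Stack train (minus the target) on top of test, then drop the redundant id
combi = pd.concat([train.drop(columns="MedHouseVal"), test], ignore_index=True)
combi = combi.drop(columns="id")
print(combi.shape)  # (4, 1)
```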

I normalised the data by scaling every column to values between 0 and 1, because the model trains and fits better when all of the features are on the same scale:-

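One common way to do this is sklearn's MinMaxScaler; a sketch with a toy combi dataframe (the exact normalisation code in the notebook may differ):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

combi = pd.DataFrame({"MedInc": [2.0, 4.0, 8.0], "AveRooms": [3.0, 5.0, 7.0]})

# Rescale every column to the [0, 1] range
scaler = MinMaxScaler()
combi = pd.DataFrame(scaler.fit_transform(combi), columns=combi.columns)
print(combi["MedInc"].tolist())  # [0.0, 0.333..., 1.0]
```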

I then defined the X and y variables. The y variable, being dependent, is the target. The X and X_test variables, being independent, are taken from the combi dataframe: X from the rows that came from train, and X_test from the rows that came from test.

In order to split the X and y variables into training and validation sets, I first had to attach the target to the X variable:-

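Sketched with toy values (the real combi dataframe and target come from the earlier steps):

```python
import pandas as pd

# combi holds the normalised train+test rows; the first train_len rows came from train
combi = pd.DataFrame({"MedInc": [0.1, 0.4, 0.9, 0.2]})
train_len = 2
y = pd.Series([4.526, 3.585], name="MedHouseVal")

X = combi.iloc[:train_len].copy()
X_test = combi.iloc[train_len:].copy()
X["MedHouseVal"] = y.values  # attach the target so rows and labels stay together
print(X.shape, X_test.shape)  # (2, 2) (2, 1)
```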

I then split the X variable into training and validation sets. The training set is 90% of the X variable and the validation set is the remaining 10%:-

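The 90/10 split is a one-liner with sklearn's train_test_split; synthetic arrays stand in for the real X and y:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100, dtype=float).reshape(100, 1)
y = np.arange(100, dtype=float)

# test_size=0.1 reserves 10% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=0)
print(len(X_train), len(X_val))  # 90 10
```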

I selected the model, and in this instance I used sklearn's GradientBoostingRegressor. I achieved a score of 74% (the R² value returned by the model's score method) when I trained and fitted the data:-

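A sketch of the fit-and-score step, using synthetic regression data in place of the competition features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic, roughly linear data standing in for the real training set
rng = np.random.default_rng(0)
X_train = rng.random((200, 3))
y_train = X_train @ np.array([1.0, 2.0, 3.0]) + rng.normal(0.0, 0.1, 200)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)  # for a regressor, .score() returns R²
print(round(train_r2, 3))
```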

When I scored the model's predictions on the validation set, I achieved 72%:-


I checked the error of the validation predictions, which came to 0.60:-

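For a regressor, sklearn's score method again returns R², and the error quoted is presumably root mean squared error, the competition's scoring metric. A sketch with synthetic data standing in for the real split:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0.0, 0.1, 300)
X_train, X_val, y_train, y_val = X[:270], X[270:], y[:270], y[270:]

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
r2 = model.score(X_val, y_val)  # validation R² — the "accuracy" quoted above
rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5  # root mean squared error
print(round(r2, 3), round(rmse, 3))
```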

I plotted the predicted values versus the actual values on a graph:-

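A minimal predicted-versus-actual scatter plot with Matplotlib; toy values stand in for y_val and the model's predictions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the snippet runs without a display
import matplotlib.pyplot as plt
import numpy as np

y_val = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.array([1.1, 1.9, 3.2, 3.8])

fig, ax = plt.subplots()
ax.scatter(y_val, preds)
# Diagonal reference line: a perfect model's points would fall on it
ax.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()])
ax.set_xlabel("Actual MedHouseVal")
ax.set_ylabel("Predicted MedHouseVal")
fig.savefig("pred_vs_actual.png")
```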

I then made predictions on the test set:-


I prepared the submission by placing the predictions in the column, 'MedHouseVal', of the submit dataframe.

I then converted the submit dataframe to a csv file, which would be submitted to Kaggle for scoring:-

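The last two steps — predicting on the test set and writing the submission file — can be sketched together, with toy stand-ins for the fitted model, the test features, and the submit dataframe:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Toy stand-ins; in the notebook these come from the earlier steps
rng = np.random.default_rng(0)
model = GradientBoostingRegressor(random_state=0).fit(rng.random((50, 2)), rng.random(50))
X_test = rng.random((5, 2))
submit = pd.DataFrame({"id": range(5), "MedHouseVal": 0.0})

submit["MedHouseVal"] = model.predict(X_test)
submit.to_csv("submission.csv", index=False)  # the file uploaded to Kaggle for scoring
```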

When I submitted my work to Kaggle, I achieved 59%, which is better than my previous score, when I used linear regression to solve the conundrum.


I am happy that I have completed the first competition of the year. Now I just have to complete the additional four for this month.

I have created a code review to accompany this post, which can be viewed here:- https://www.youtube.com/watch?v=BiELAWF7tZA