CSE151A_WIN24_GROUP

Introduction

Topic: LeetCode

The dataset we have selected is derived from LeetCode, one of the largest online platforms for coding interview preparation, with an extensive collection of coding challenges categorized as Easy, Medium, or Hard. However, the company has never explained what distinguishes these categories: how does LeetCode decide the difficulty of a problem? The dataset we analyze contains 1825 LeetCode problems, each described by 19 features such as difficulty, acceptance rate, attempted frequency, and related topics. The purpose of our project is to understand which factors matter most to LeetCode when assigning a problem's difficulty.

🔗 LeetCode Dataset

Steps

To undertake this task, we first performed a set of preliminary steps (data exploration and preprocessing) to gain insight into which features are most responsible for determining difficulty ratings. After this initial analysis, we developed three distinct classification models (k-nearest neighbors, decision tree, and neural network) to categorize the difficulty of a problem given these attributes. All three models attempted the same classification task, so we were able to compare their performance.

Applications

One possible application of our project would be to automatically label the difficulty of coding problems encountered outside of LeetCode, such as those on HackerRank or CodeSignal.

Code

For the complete version, please check the current notebook.

Method

Data exploration

We have decided to perform our data exploration in 2 stages, one before and one after the data preprocessing. Stage 1 is the initial exploration of our data, looking into our attributes, their datatypes, and the distribution of our data. Stage 2 has been performed after the initial preprocessing of our data, looking further into the correlation between the different features and our target.

For a chronological approach, the steps have been: data exploration stage 1, data preprocessing stage 1, data exploration stage 2 and data preprocessing stage 2.

Stage 1

Dataset info

The dataset contains a total of 1825 observations. We consider this an appropriate number for thorough analysis and modeling, while still being manageable enough to avoid the need for subsampling.

The first step of our exploration was observing the raw data. We looked into the data types as well as the number of null values. We also examined a small subset of the data to get a general idea of what to expect for each attribute.

Datatypes:

| Column | Non-Null Count | Dtype |
| --- | --- | --- |
| id | 1825 | int64 |
| title | 1825 | object |
| is_premium | 1825 | int64 |
| difficulty | 1825 | object |
| solution_link | 987 | object |
| acceptance_rate | 1825 | float64 |
| frequency | 1825 | float64 |
| url | 1825 | object |
| discuss_count | 1825 | int64 |
| accepted | 1825 | object |
| submissions | 1825 | object |
| companies | 1749 | object |
| related_topics | 1571 | object |
| likes | 1825 | int64 |
| dislikes | 1825 | int64 |
| rating | 1825 | int64 |
| asked_by_faang | 1825 | int64 |
| similar_questions | 745 | object |

Data Example:

id title description is_premium difficulty solution_link acceptance_rate frequency url discuss_count accepted submissions companies related_topics likes dislikes rating asked_by_faang similar_questions
0 1 Two Sum Given an array of integers `nums` and an integ... 0 Easy /articles/two-sum 46.7 100.0 https://leetcode.com/problems/two-sum 999 4.1M 8.7M Amazon,Google,Apple,Adobe,Microsoft,Bloomberg,... Array,Hash Table 20217 712 97 1 [3Sum, /problems/3sum/, Medium], [4Sum, /probl...
1 2 Add Two Numbers You are given two non-empty linked lists repre... 0 Medium /articles/add-two-numbers 35.7 93.1 https://leetcode.com/problems/add-two-numbers 999 1.9M 5.2M Bloomberg,Microsoft,Amazon,Google,Facebook,App... Linked List,Math,Recursion 11350 2704 81 1 [Multiply Strings, /problems/multiply-strings/...
2 3 Longest Substring Without Repeating Characters Given a string `s`, find the length of the lon... 0 Medium /articles/longest-substring-without-repeating-... 31.5 90.9 https://leetcode.com/problems/longest-substrin... 999 2.1M 6.7M Amazon,Bloomberg,Microsoft,Facebook,Apple,Adob... Hash Table,Two Pointers,String,Sliding Window 13810 714 95 1 [Longest Substring with At Most Two Distinct C...
3 4 Median of Two Sorted Arrays Given two sorted arrays `nums1` and `nums2` of... 0 Hard /articles/median-of-two-sorted-arrays 31.4 86.2 https://leetcode.com/problems/median-of-two-so... 999 904.7K 2.9M Amazon,Goldman Sachs,Facebook,Microsoft,Apple,... Array,Binary Search,Divide and Conquer 9665 1486 87 1 NaN
4 5 Longest Palindromic Substring Given a string `s`, return the longest palindr... 0 Medium /articles/longest-palindromic-substring 30.6 84.7 https://leetcode.com/problems/longest-palindro... 999 1.3M 4.1M Amazon,Microsoft,Wayfair,Facebook,Adobe,eBay,G... String,Dynamic Programming 10271 670 94 1 [Shortest Palindrome, /problems/shortest-palin...


Attribute description

Each observation is detailed through 19 distinct features. However, for the scope of our analysis, we have opted to exclude 6 of these features as they do not contribute to our project's objectives. These features were irrelevant or non-numeric, and we decided against attempting to implement NLP as it is beyond the scope of this class.

Below we have provided a description for each of the features included in the dataset.

| Column Name | Data Type | # of Nulls | Description |
| --- | --- | --- | --- |
| id | int64 | 0 | Unique problem id. |
| title | String | 0 | Name of the problem. |
| description | String | 0 | Problem description. |
| is_premium | int64 | 0 | Whether the question requires a premium account. Encoded as 0 or 1, where 1 means a premium account is required. |
| difficulty | String | 0 | The difficulty level of the entry (Easy, Medium, Hard). |
| solution_link | String | 838 | Link to the solution for the problem. |
| acceptance_rate | float64 | 0 | How often a submitted answer is correct. A float indicating the acceptance percentage, ranging from 13.9 to 95.6 in our dataset, with a possible range of 0 to 100. |
| frequency | float64 | 0 | How often the problem is attempted, also expressed as a percentage. |
| url | String | 0 | URL of the problem. |
| discuss_count | int64 | 0 | How many comments users have submitted. We observed that this feature is capped at 999, so problems at that value could be well over the limit or just at it. |
| accepted | String | 0 | How many times the answer was accepted. Encoded as a string, either valueK (thousands) or valueM (millions). For our project, we convert this to a number. |
| submissions | String | 0 | How many times an answer was submitted. Same format as the previous feature, so we perform the same conversion. |
| companies | String | 76 | Which companies were tagged as having asked this specific problem. |
| related_topics | String | 254 | Topics related to the problem, given as a comma-separated string of topic names. As shown in the analysis, we one-hot encode this feature. Note that 14% of observations have no value for this feature. |
| likes | int64 | 0 | How many likes the problem got. Integer value ranging from 2 to 20,200. |
| dislikes | int64 | 0 | How many dislikes the problem got. Integer value ranging from 0 to 8,900. |
| rating | int64 | 0 | A combination of the previous two features, computed as likes/(likes+dislikes). Expressed as a percentage, ranging from 7 to 100 in our dataset. |
| asked_by_faang | int64 | 0 | Whether or not the question was asked by Facebook, Apple, Amazon, Google, or Netflix. |
| similar_questions | String | 1080 | A list of similar entries, possibly including titles, links, and difficulty levels. |

Figures

For a more detailed insight into the data, please refer to the current notebook.

Below, we provide the figures that we believe give the best insights into our data.


Figure 1. Correlation plot between the different features.


Figure 2. Visualization of the number of problems associated with each unique topic


Figure 3. Distribution of problems across different difficulty levels (Easy, Medium, Hard)


Figure 4. Distribution of topics by difficulty level

Stage 2

Figures

Pairplot

After the initial preprocessing we decided to visualize the pairwise relationships between variables to identify potential patterns and correlations.

Figure 5. Pairplot analysis of encoded data

Key observations from our pair plot analysis:

  • There appears to be a significant correlation between discussion, acceptance rate, and difficulty.
  • Likes and dislikes are inversely correlated.
  • The count of non-FAANG companies and problem frequency seem to have a positive correlation.

Overall, however, the pair plot does not provide substantial information relevant to our goal, so we will proceed to examine the data using a heatmap/correlation matrix to gain more insights.

Correlation matrix

Transitioning to the correlation matrix analysis, our aim is to gain deeper insights into the relationships between variables. We begin by normalizing our data, for which we have chosen min-max normalization since not all our data is normally distributed.
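As a rough illustration of this step (not the exact notebook code), the following sketch applies scikit-learn's MinMaxScaler; the column list and the dataframe name `df` are assumptions:

```python
# Minimal sketch of min-max normalization with scikit-learn.
# `df` is assumed to be the encoded dataframe from preprocessing stage 1,
# and the column list is illustrative rather than the notebook's exact one.
from sklearn.preprocessing import MinMaxScaler

numeric_cols = ["acceptance_rate", "frequency", "discuss_count", "accepted",
                "submissions", "likes", "dislikes", "rating"]

scaler = MinMaxScaler()  # rescales each column to the [0, 1] range
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```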

With the data normalization complete, we can now examine our correlation matrix.


Figure 6. Complete correlation matrix of all attributes

This correlation matrix offers insights into the relationships between variables. Notable observations include:

  • Several "Tree" related topics have a positive correlation with each other.
  • Discussion is negatively correlated with rating (significantly more so than submissions), implying that the harder a problem is, the less discussion there will be.
  • The most negatively correlated topic with rating is "Math."

However, given the extensive information presented in this correlation matrix, we will narrow our focus on the parameter of interest: problem difficulty.


Figure 7. Correlation between all attributes and problem difficulty

To make this complex data easier to interpret, we break the analysis down into related topics and the remaining features. Looking at the related topics, we can draw some interesting conclusions:

  • "Dynamic programming" is the most correlated topic in determining problem difficulty.
  • "Array" topics are the most negatively correlated topic.
  • Most topic labels are associated with increased difficulty.


Figure 8. Correlation between related topics and problem difficulty

Looking at the remaining features, we can draw some further conclusions:

  • "Acceptance rate" is the most correlated feature determining difficulty (which seems self evident).
  • "Discuss count" is the second most correlated feature.
  • "Dislikes" seems to be a better predictor of problem difficulty than likes (which is interesting since Leetcode actually hid the dislike counter).


Figure 9. Correlation between all features (except the related topics) and problem difficulty

We can now clearly decide which features are relevant to problem difficulty and which are not. In the next stage of our preprocessing, we will drop all features whose absolute correlation with difficulty is below 0.05, leaving us with a set of 24 features (excluding 'difficulty' itself).
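A minimal sketch of this selection step, under the assumption that `df` is the encoded, normalized dataframe and that the 0.05 threshold is applied to the absolute correlation with the numeric 'difficulty' column:

```python
# Keep only features whose absolute correlation with 'difficulty' is at least 0.05.
corr_with_target = df.corr()["difficulty"].drop("difficulty")
keep = corr_with_target[corr_with_target.abs() >= 0.05].index.tolist()
df_selected = df[keep + ["difficulty"]]
print(f"Kept {len(keep)} features")
```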


Figure 10. Correlation between the features we have kept and problem difficulty

Preprocessing

Stage 1

Steps

1. Dropping Unnecessary Data

We have identified certain attributes that don't contribute to predicting 'difficulty'. These columns are:

  • "id": Unique identifiers for problems, irrelevant to difficulty prediction.
  • "title": Does not provide predictive value for difficulty without NLP.
  • "description": Unrelated to the predictive task at hand without the use of NLP.
  • "solution_link" and "url": Links that offer no predictive insight.
  • "similar_questions": Does not directly contribute to predicting difficulty without introducing too much complexity to the model. This info can also be gleaned from "related topics."
  • "asked_by_faang": Rather than a binary value of whether a FAANG company asked the question, a count of how many FAANG companies asked it might be more informative.

2. Data encoding

Certain columns require encoding to numerical attributes:

  • "difficulty": Our target variable will be encoded from categorical ("Easy", "Medium", "Hard") to a numerical format.
  • "submissions" and "accepted": These string-formatted numerical values will be converted to integers.
  • "companies": To capture the potential correlation between the problem's difficulty and the company's profile, we'll convert this into two numeric attributes: "faang_count" for the count of FAANG companies and "non_faang_count" for others.
  • "related_topics": Given the potential link between certain topics and problem difficulty, this attribute will be standardized and one-hot encoded to represent the various topics.

The changes performed can be appreciated in the following table:

is_premium difficulty acceptance_rate frequency discuss_count accepted submissions likes dislikes rating ... Sliding Window Sort Stack String Suffix Array Topological Sort Tree Trie Two Pointers Union Find
0 0 0 46.7 100.0 999 4100000.0 8700000.0 20217 712 97 ... 0 0 0 0 0 0 0 0 0 0
1 0 1 35.7 93.1 999 1900000.0 5200000.0 11350 2704 81 ... 0 0 0 0 0 0 0 0 0 0
2 0 1 31.5 90.9 999 2100000.0 6700000.0 13810 714 95 ... 1 0 0 1 0 0 0 0 1 0
3 0 2 31.4 86.2 999 904700.0 2900000.0 9665 1486 87 ... 0 0 0 0 0 0 0 0 0 0
4 0 1 30.6 84.7 999 1300000.0 4100000.0 10271 670 94 ... 0 0 0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1820 1 0 88.9 0.0 15 305 343 2 2 50 ... 0 0 0 0 0 0 0 0 0 0
1821 0 0 67.8 0.0 104 7900.0 11700.0 37 4 90 ... 0 0 0 0 0 0 0 0 0 0
1822 0 1 71.8 0.0 135 6800.0 9500.0 81 4 95 ... 0 0 0 0 0 0 0 0 0 0
1823 0 1 47.2 0.0 134 5000.0 10700.0 147 8 95 ... 0 0 0 0 0 0 0 0 0 0
1824 0 2 28.1 0.0 48 2100.0 7400.0 52 43 55 ... 0 0 0 0 0 0 0 0 0 0

1825 rows × 55 columns


3. Handling Missing Data

After the initial data cleaning:

  • No attributes currently exhibit missing values, indicating a clean dataset ready for further processing.
  • For optimal model performance, normalization will be applied to the remaining attributes to ensure uniformity in data scale and distribution.

Stage 2

Following our data exploration in stage 2, we have identified a select group of features to keep. Most of the preprocessing was done during the initial importing and cleaning phase; what remains is to refine the feature set based on our exploration findings. For this reason, we drop all features whose absolute correlation with difficulty is below 0.05, leaving us with a set of 24 features (excluding 'difficulty' itself).

A sample of the final dataframe can be observed below.

Sample output of our final dataframe


acceptance_rate discuss_count Array dislikes Hash Table String Sliding Window Trie Depth-first Search Binary Search ... Breadth-first Search Ordered Map Heap Union Find Backtracking frequency Segment Tree rating Dynamic Programming difficulty
0 46.7 999 1 712 1 0 0 0 0 0 ... 0 0 0 0 0 100.0 0 97 0 0
1 35.7 999 0 2704 0 0 0 0 0 0 ... 0 0 0 0 0 93.1 0 81 0 1
2 31.5 999 0 714 1 1 1 0 0 0 ... 0 0 0 0 0 90.9 0 95 0 1
3 31.4 999 1 1486 0 0 0 0 0 1 ... 0 0 0 0 0 86.2 0 87 0 2
4 30.6 999 0 670 0 1 0 0 0 0 ... 0 0 0 0 0 84.7 0 94 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1820 88.9 15 0 2 0 0 0 0 0 0 ... 0 0 0 0 0 0.0 0 50 0 0
1821 67.8 104 0 4 0 0 0 0 0 0 ... 0 0 0 0 0 0.0 0 90 0 0
1822 71.8 135 1 4 0 0 0 0 0 0 ... 0 0 0 0 0 0.0 0 95 0 1
1823 47.2 134 0 8 0 0 0 0 0 0 ... 1 0 0 0 0 0.0 0 95 1 1
1824 28.1 48 0 43 0 0 0 0 0 0 ... 0 0 1 0 0 0.0 0 55 0 2

1825 rows × 23 columns

Model 1 - K-Nearest Neighbors (KNN)

For the complete version, please check the current notebook.

We chose the K-Nearest Neighbors (KNN) algorithm for its simplicity and effectiveness, particularly in multi-class classification tasks. KNN operates on the principle that similar data points tend to be close to one another, and it classifies a new data point by the majority vote of its 'k' nearest neighbors. Its ability to handle multi-class classification makes it well suited for predicting the difficulty levels of LeetCode problems (Easy, Medium, Hard).

Hyperparameter tuning

In the case of K-nearest neighbors, the main hyperparameter to tune is 'k', the number of neighbors used for classification. Our approach was to build a model for every k value from 1 to 40 and take the k that minimizes the error (argmin).
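A minimal sketch of this search (whether the error was measured on a held-out split or via cross-validation is not fully specified here; the sketch uses 5-fold cross-validation, and the variable names are assumptions):

```python
# Sweep k from 1 to 40 and pick the k that minimizes the validation error.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

errors = []
k_values = range(1, 41)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Error for this k = 1 - mean cross-validated accuracy.
    errors.append(1 - cross_val_score(knn, X_train, y_train, cv=5).mean())

best_k = k_values[int(np.argmin(errors))]
print("Best k:", best_k)
```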

Our finding was that k=12 produced the best result.

Model 2 - Decision Trees

We chose the Decision Tree algorithm for its easy interpretation and its ability to show which attributes are most important, which isn't always clear with other algorithms. This is beneficial because we can see the hierarchical relationships between features and thereby determine which are most strongly predictive and which are not. Also, its ability to handle various data types, both discrete and continuous, through the use of thresholds makes it suitable for multi-class classification, in our case predicting the difficulty levels of LeetCode problems (Easy, Medium, Hard).

Hyperparameter tuning

For decision trees, we decided to tune the model on the ccp_alpha parameter. This parameter controls the degree of pruning after the decision tree has been built and is used to manage the trade-off between tree complexity and the tree's ability to minimize impurity. The goal is to obtain a decision tree that performs better on unseen data.

We implemented this by computing the model's accuracy score for each candidate ccp_alpha value and taking the argmax to select the best alpha.
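A minimal sketch of this tuning loop, assuming the usual X_train/X_test, y_train/y_test splits and using scikit-learn's cost-complexity pruning path to generate the candidate alphas:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Candidate alphas come from the pruning path of an unpruned tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

scores = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    scores.append(tree.score(X_test, y_test))  # accuracy on held-out data

best_alpha = ccp_alphas[int(np.argmax(scores))]
print("Best ccp_alpha:", best_alpha)
```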

The best validation accuracy was achieved for alpha=0.0016001376425733224

Model 3 - Neural networks

We have observed that k-nearest neighbors and decision trees have similar performance. We hope that a neural network will be able to infer more complex patterns that may not be captured by these simpler models.

Hyperparameter tuning

In the case of neural networks there is a wide range of hyperparameters that can be tuned. In order to find the best neural networks, the parameters that we have chosen to tune are:

  • Number of layers
  • Activation function for all hidden layers
  • Learning rate
  • Optimizer
    • We haven't learned many optimizers beyond SGD but we will include Adam since it is based on SGD

We have chosen not to tune:

  • Hidden layers individually (at least for now)
    • While by no means authoritative, this Stack Overflow post seems to reasonably explain that changing individual layers' activation functions is likely to be less worthwhile than simply changing the number of layers or neurons.
  • Output layer
    • Softmax is generally considered the best output activation function for multi-class classification problems.
  • Loss function
    • We don't want to use a regression loss function as our problem is a classification problem
    • Since we have multiple classes, we should use categorical cross-entropy rather than binary cross-entropy.
  • Epochs
    • We chose not to tune epochs as the keras docs say "It is generally not needed to tune the number of epochs because a built-in callback is passed to model.fit() to save the model at its best epoch evaluated by the validation_data."

Our search space ends up being as follows:

Search space summary
Default search space size: 6
hidden_layers (Int)
{'default': None, 'conditions': [], 'min_value': 2, 'max_value': 6, 'step': 1, 'sampling': 'linear'}
units_0 (Int)
{'default': None, 'conditions': [], 'min_value': 64, 'max_value': 512, 'step': 48, 'sampling': 'linear'}
activation (Choice)
{'default': 'relu', 'conditions': [], 'values': ['relu', 'tanh'], 'ordered': False}
units_1 (Int)
{'default': None, 'conditions': [], 'min_value': 64, 'max_value': 512, 'step': 48, 'sampling': 'linear'}
learning_rate (Float)
{'default': 0.0001, 'conditions': [], 'min_value': 0.0001, 'max_value': 1.0, 'step': None, 'sampling': 'log'}
optimizer (Choice)
{'default': 'adam', 'conditions': [], 'values': ['adam', 'SGD'], 'ordered': False}
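A minimal sketch of a KerasTuner model-building function that produces a search space like the one above (the layer structure, trial count, and commented-out search call are assumptions, not our exact code):

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential()
    activation = hp.Choice("activation", ["relu", "tanh"])
    for i in range(hp.Int("hidden_layers", min_value=2, max_value=6, step=1)):
        model.add(tf.keras.layers.Dense(
            units=hp.Int(f"units_{i}", min_value=64, max_value=512, step=48),
            activation=activation))
    # Softmax output over the three difficulty classes (Easy, Medium, Hard).
    model.add(tf.keras.layers.Dense(3, activation="softmax"))

    lr = hp.Float("learning_rate", min_value=1e-4, max_value=1.0, sampling="log")
    if hp.Choice("optimizer", ["adam", "SGD"]) == "adam":
        optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    else:
        optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=30)
# tuner.search(X_train, y_train_onehot, validation_data=(X_val, y_val_onehot))
```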

After the training, our best model has a loss of 0.7388 and an accuracy of 0.6776. The model is as follows:

Model: Sequential

| Layer (type) | Output Shape | Param # |
| --- | --- | --- |
| dense (Dense) | (None, 352) | 8096 |
| dense_1 (Dense) | (None, 160) | 56480 |
| dense_2 (Dense) | (None, 64) | 10304 |
| dense_3 (Dense) | (None, 64) | 4160 |
| dense_4 (Dense) | (None, 3) | 195 |

Total params: 79235
Trainable params: 79235
Non-trainable params: 0

Result

Model 1 Results / Figures

Evaluation On Test Set (KNN)

After tuning our model to the best 'k' value, we continue by assessing its performance on the unseen test data to see how well it generalizes.

Confusion Matrix:
[[[118  17]
  [ 32  16]]

 [[ 39  49]
  [ 21  74]]

 [[134   9]
  [ 22  18]]]

Classification Report:
              precision    recall  f1-score   support

           0       0.48      0.33      0.40        48
           1       0.60      0.78      0.68        95
           2       0.67      0.45      0.54        40

    accuracy                           0.59       183
   macro avg       0.58      0.52      0.54       183
weighted avg       0.59      0.59      0.57       183
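The per-class confusion matrices and the classification report above can be produced with standard scikit-learn utilities; a short sketch (the fitted estimator and split names are assumptions):

```python
from sklearn.metrics import multilabel_confusion_matrix, classification_report

y_pred = knn.predict(X_test)

# One 2x2 matrix per class, in the form [[TN, FP], [FN, TP]], as shown above.
print(multilabel_confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```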

Evaluation On Training Set (KNN)

To understand our model's learning, we also evaluate its performance on the training set to see how well it has captured the underlying patterns of the data.

Confusion Matrix:
[[[1087  126]
  [ 210  219]]

 [[ 393  381]
  [ 163  705]]

 [[1238   59]
  [ 193  152]]]

Classification Report:
              precision    recall  f1-score   support

           0       0.63      0.51      0.57       429
           1       0.65      0.81      0.72       868
           2       0.72      0.44      0.55       345

    accuracy                           0.66      1642
   macro avg       0.67      0.59      0.61      1642
weighted avg       0.66      0.66      0.64      1642

Fitting Graph and Cross Validation Result

The fitting graph below shows the relationship between the choice of 'k' in KNN and the corresponding training, testing, and cross-validation errors.


Figure 12. Training, testing and cross validation error based on the value of k for KNN

Model 2 Results / Figures

Hyperparameter tuning observations


Figure 13. Training and testing error based on the value for the effective alpha

The best model is shown below.


Figure 14. Best model for decision trees

Decision Tree visualization


Figure 15. Visualization of the decision tree model

Evaluation on Training and Test

Train accuracy: 0.7521315468940317
Test accuracy: 0.6065573770491803
              precision    recall  f1-score   support

           0       0.53      0.33      0.41        48
           1       0.62      0.72      0.66        95
           2       0.63      0.68      0.65        40

    accuracy                           0.61       183
   macro avg       0.59      0.57      0.57       183
weighted avg       0.60      0.61      0.59       183

Model 3 Results / Figures

Evaluation On Test Set (Neural Networks)

After obtaining our best model, we continue by assessing its performance on the unseen test data to see how well it generalizes.

              precision    recall  f1-score   support

          0       0.61      0.72      0.66        46
          1       0.71      0.71      0.71        98
          2       0.68      0.54      0.60        39

    accuracy                           0.68       183
  macro avg       0.67      0.66      0.66       183
weighted avg       0.68      0.68      0.68       183

Evaluation On Training Set (Neural Networks)

To understand our model's learning, we also evaluate its performance on the training set to see how well it has captured the underlying patterns of the data.

              precision    recall  f1-score   support

          0       0.65      0.66      0.66       431
          1       0.70      0.77      0.74       865
          2       0.78      0.56      0.65       346

    accuracy                           0.70      1642
  macro avg       0.71      0.67      0.68      1642
weighted avg       0.71      0.70      0.70      1642

Fitting Graph and Cross Validation Result


Figure 16. Training and testing error based on the parameter count

Our fitting graph is unfortunately quite noisy, because parameter count is not a great measure of model complexity. It still gives us some insight into our model, such as the fact that it is slightly overfit.

Discussion

Discussion of Model #1: K-Nearest Neighbours

Comparing Training and Testing Error

By comparing the error/accuracy, precision, and recall on the training and test data, we can see that:

  • The model does not perform particularly well on either set; we only achieved accuracies of 0.66 (training) and 0.59 (test).
  • Overall, the model performs better on the training data.
  • The model performs best on class '1', which corresponds to 'Medium' difficulty. This makes sense because medium problems are the most common in the dataset.
  • The model's precision is stable (around 0.60 across all three classes), but its recall is not (high for medium problems, low for the others).

Next, we examine the fitting graph of training and test error for different k values.

Observations from the Graph (Figure 12)

  • Overfitting at Low k-values: When 'k' is small (especially at k=1), the model achieves perfect accuracy on the training data, which indicates overfitting. As 'k' increases, the training error rises and the test error generally decreases, showing a reduction in overfitting.

  • Underfitting at High k-values: Larger 'k' values tend to generalize better to unseen data, reducing the test error, but an excessively large 'k' can lead to underfitting.

  • Cross-Validation: The cross-validation error does not grow monotonically with 'k'. This shows the importance of cross-validation in identifying a 'k' that balances performance on known and unseen data, and it highlights the trade-off between overfitting at lower 'k' values and underfitting at higher 'k' values.

Reference used (KNN Introduction): https://www.codecademy.com/learn/introduction-to-supervised-learning-skill-path/modules/k-nearest-neighbors-skill-path/cheatsheet

Optimal 'k' Selection

We have chosen a 'k' of 12 since it shows the most balanced performance between overfitting and underfitting, as evidenced by its superior cross-validation score.

Conclusions

Despite our efforts in fine-tuning 'k', the model's performance (a test accuracy of 0.59 and a training accuracy of 0.66, obtained through iterative cross-validation on the dataset) has not met our expectations. This might be due to the relatively weak correlation between our features and the target variable "difficulty". The exploratory analysis, particularly the heatmap, showed that even the most correlated feature, "acceptance_rate", only has a correlation coefficient of -0.39. This is likely a significant contributor to the model's underwhelming performance. Additionally, the imbalance between the number of problems at each difficulty level may be contributing.

Strategies for Improvement

Given the model's current limitations, we propose several approaches to improve its accuracy:

  • Extended 'k' range: By broadening the range of 'k' values beyond the initial 1 to 40, we may find a more effective 'k' that improves the model's performance. However, we need to keep in mind that it's crucial to balance the risk of overfitting with smaller 'k' values against underfitting with larger 'k' values.
  • Data preprocessing refinement: Revisiting our approach to preprocessing, which is currently centered around MinMax scaling, could be beneficial. Exploring alternative scaling and standardization techniques could reveal new patterns and correlations that might help the model's performance.
  • Oversampling: Sampling with replacement to create a training set with equal numbers of Easy, Medium, and Hard problems might be worth exploring (a rough sketch follows below).

As we progress to subsequent models, we hope to achieve better performance.
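A minimal sketch of the oversampling idea (the variable names and the use of sklearn.utils.resample are assumptions):

```python
import pandas as pd
from sklearn.utils import resample

# Resample each difficulty class, with replacement, up to the size of the
# largest class. X_train / y_train are assumed to exist; y_train is the
# numeric 'difficulty' column.
train = pd.concat([X_train, y_train], axis=1)
largest = train["difficulty"].value_counts().max()

balanced = pd.concat([
    resample(group, replace=True, n_samples=largest, random_state=0)
    for _, group in train.groupby("difficulty")
])

X_train_bal = balanced.drop(columns=["difficulty"])
y_train_bal = balanced["difficulty"]
```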

Discussion of Model #2: Decision Tree

Observations on hyperparameter tuning (Figure 13 and 14)

  • Overfitting at low alpha values: When ccp_alpha is small, meaning little pruning of the decision tree, the model achieves perfect training accuracy (an accuracy score of 1), which indicates overfitting.

  • Cost Complexity Pruning: On the fitting graph, we found that the cost complexity pruning alpha value that gives the highest accuracy on the testing dataset is 0.0016001376425733224.

  • Model Comparison: Compared to the previous model's fitting graph, it behaves similarly: lower values of alpha/k overfit, and higher values produce larger error.

Observations on decision tree (Figure 15)

We observe that the most useful features for classification are the ones compared in the top layers of the tree. In the top three layers, the only features used are discuss_count, acceptance_rate, and rating. In the lower layers, frequency, acceptance_rate, and many question topics (Hash Table, Union Find, etc.) are used.

Looking at the tree as a whole, we find many nodes predicting class 1, perhaps because our encoding places class 1 between the other two classes, which encourages class 1 predictions. This aligns with roughly half of the samples being class 1, as well as with class 1 having relatively fewer false negatives (precision < recall).

The first decision separates more than half of the class 2 observations from the set along with around a fourth of the class 1 observations and very few class 0 observations. This suggests that the classes do somewhat fall on a spectrum with respect to the selected features (it is easier to separate class 0 and 2 rather than 0 and 1 or 1 and 2). This is supported by the fact that there is only one instance of two sibling nodes being class 0 and 2; every other pair is 0/1 or 1/2.

Comparing Training and Testing Error

Now we conduct a comparative analysis of the model's performance across the training and the testing sets to understand how well the model can generalize beyond the data it was trained on.

  • Performance Overview: The accuracy scores indicate that the model's performance is moderate, with a training accuracy of 0.75 and a testing accuracy of 0.61. This again leaves room for improvement in the model's ability to predict unseen data accurately, although it did achieve slightly better results than the previous model. As expected, the model performs better on the training set.

  • Class-wise Performance: The model performs better on the 'medium' and 'hard' difficulty levels than on the 'easy' level. This could be due to the lower prevalence of the 'easy' difficulty class compared to 'medium' and 'hard' in the dataset.

  • Precision and Recall: The model's precision across the three classes is relatively stable at around 0.60. The recall, however, is not: it is lower for easy questions and higher for medium and hard questions. The features do not capture the difference between easy and medium questions very well, so the model frequently classifies easy questions as medium.

Conclusions

To conclude our second model: the decision tree performed largely the same as the first model, and while achieving high recall on easy problems is still difficult, the recall is more stable for medium and hard problems.

Strategies for Improvement

Some strategies that we have not implemented but that could improve the performance of the decision tree are bagging and boosting (see the sketch below). This model also revealed the most important features; obtaining more accurate data on discuss_count, along with possible feature expansion, could be useful as well.
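An untested sketch of the bagging and boosting ideas, using scikit-learn ensembles whose base learners are decision trees (the splits and hyperparameters are assumptions):

```python
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Bagging: many trees fit on bootstrap samples, combined by majority vote.
# BaggingClassifier uses a decision tree as its default base estimator.
bagging = BaggingClassifier(n_estimators=100, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: trees added sequentially, each one correcting the previous errors.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)
boosting.fit(X_train, y_train)

print("Bagging test accuracy:", bagging.score(X_test, y_test))
print("Boosting test accuracy:", boosting.score(X_test, y_test))
```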

Discussion of Model #3: Neural Network

Comparing Training and Testing Error

We can now compare the training and testing performance of our neural network, and compare it with the previous models:

  • Training vs Testing: The training accuracy was 70% and the testing accuracy was 68%. This implies that our model is slightly overfit, as the training accuracy is higher than the testing accuracy. The precision and recall on the test set are also lower than on the training set.
  • Neural Network vs KNN: Comparing training accuracy, our neural network performs about 4% better than our KNN model; comparing testing accuracy, it performs about 9% better. While not a huge difference, it is a small improvement.
  • Neural Network vs Decision Tree: Comparing training accuracy, our neural network performs about 3% worse than our decision tree model. However, comparing testing accuracy, it performs about 7% better. This shows how the decision tree model tends to overfit the data.

In general, the hyperparameter-tuned neural network predicts the difficulty of LeetCode problems more accurately. However, there is a tradeoff: the neural network is a much less transparent model than a decision tree.

Conclusions

Our Neural Network model outperformed our KNN and Decision Tree models, but only barely. With accuracies of 68% on the testing data and 70% on the training data, our model shows roughly a 7% improvement in test accuracy over the previous ones.

We also need to acknowledge that Neural Networks are much more opaque than Decision Trees or KNN models. If somebody wants to understand why a problem is labeled easy, medium, or hard, our Neural Network is unlikely to provide great insight.

Strategies for Improvement

  1. More rigorous hyperparameter tuning: For the sake of training time (and not losing our minds) we limited the parameters we tuned. In theory, a model with more layers, more neurons, and different activation functions might perform significantly better. Furthermore, we chose the RandomizedSearch tuner to save time, but GridSearch may give better results.
  2. Better data/pre-processing: As the saying goes, garbage in -> garbage out. While our models perform decently, we are unlikely to get much better performance without more and better data. An important aspect of a LeetCode problem is its problem description; however, for simplicity's sake we chose to drop it. If we were to process this text, we would likely improve our model's accuracy significantly.

Conclusion

In this study, we explored three different machine learning models — K-Nearest Neighbors (KNN), Decision Tree, and Neural Network — to predict the difficulty level of LeetCode problems.

Having analyzed each of the models, a conclusion that we are not excited to arrive at is that our dataset is not ideal for this analysis. We were not able to obtain high accuracy with any of the models; the best result came from the neural network, with an accuracy of 68% on the test data and 70% on the training data. Although this accuracy is not as high as we would like, we note that it is well above chance, which in our case (randomly guessing one of three classes) would be approximately 33%.

We believe that a reason behind this could be the imbalanced distribution of the data, with the majority of problems falling into the medium difficulty category. This imbalance might have led to skewed model predictions and hindered their ability to accurately classify instances across all difficulty levels.

Looking ahead, we recognize the need for improvements to enhance the accuracy of our predictive models. Implementing the strategies outlined in the discussion section, such as refining data preprocessing and sampling techniques, expanding our hyperparameter tuning, and incorporating natural language processing, holds promise for improving model performance. Additionally, acquiring a more comprehensive dataset with a balanced distribution of problem difficulties could yield more accurate and reliable predictions.

In summary, while our initial findings may not be as promising as hoped, we remain optimistic about the potential for future improvements.

Collaboration

  • Ana Maria Sampedro Barrera: Writer : Brainstormed potential datasets and projects with the team. Attended all team meetings. Worked on describing the data for milestone #1 as well as writing and reorganizing the contents for the final report.

  • David Choi: Team Leader : Acted as project leader in most meetings. Worked on data exploration working on pairplots and small correlation matrices. Worked on data preprocessing, dropping the data and one hot encoding related topics. Worked on a model 3 neural network and the hyper parameter tuner. Created the fitting graph. Worked on the final readme file.

  • Isabelle Krochmal: Programmer: Brainstormed potential datasets and projects with the team. Attended team meetings, and led one meeting when the usual team leader was not available. Performed initial data preprocessing: for example, decided on irrelevant or unusable features to drop and converted the lists of companies into quantitative features usable in the model. Worked on Milestone 3: KNN. Researched and reviewed code with the team. Interpreted the model's results and used them to write explanations of our reasoning for which two models to create next.

  • Evie Stergiopoulou: Programmer/Writer: Milestone 2: Conducted data exploration for analyzing problem counts, distribution of problem difficulty levels, and distribution of topics by difficulty level. Refined the data evaluation and preprocessing section and split it in subsections. Provided overview for each section, ensured consistent formatting, and elaborated on writing and code that needed clarification. Milestone 3: Refined all writing aspects and added code comments to the KNN model section. Updated the README with current information.

  • James Rey Macasa: Programmer : Did data visualization by programming the heatmap to show the correlation between different features of data. Attended most meetings, helping brainstorm potential models to use and what datasets to work on. Worked on milestone 3 model: KNN. Helped review model code and results and asked TAs for advice for model as well as wrote potential improvements that could be made for the model.

  • Rui Li: Programmer : Milestone 1: wrote the temporary abstract. Milestone 2: GitHub initialization, participated in data exploration, notebook reorganization, and writing the README. Milestone 3: trained and evaluated the KNN model, recorded the fitting graph, drafted conclusions, and edited the README.

  • Jinwoong Huh: Programmer : Did initial data preprocessing and one-hot encoding of the data. Worked on coding the decision tree model, as well as the decision tree and fitting graph visualizations. Collaborated with teammates on the observation writeup and README section for the DT model.

  • Andrew Jia: Writer : Contributed to early feature selection. Worked on model 2 (decision tree), mostly writing, formatting, changing to readme from ipynb. Includes writing observations from decision tree visualization and conclusion of milestone 4.

  • Ojasvi Tewari : Programmer : Contributed to discussion of possible datasets, helped with the abstracts. Worked on Model 3 by training the hypertuned model and setting up the code and training the Kfold hypertuned model.

  • Lucas Xu : Programmer : Worked on the model 2 decision tree code with the team. Met with the team and discussed project ideas. Coded the decision tree training, and did writeups for the decision tree conclusion and classification report.

🔗 Current Notebook
