The dataset we have selected is derived from LeetCode, one of the largest online platforms for coding interview preparation, with an extensive collection of coding challenges categorized as easy, medium, or hard. However, the company has never explained what distinguishes these categories: how does LeetCode decide the difficulty of a problem? The dataset we plan to analyze contains 1825 LeetCode problems, with 19 different features for each question, such as difficulty, acceptance rate, attempted frequency, and related topics. The purpose of our project is to understand which factors matter most to LeetCode when scoring the difficulty of a problem.
In order to undertake this task, we have followed a set of preliminary steps (data exploration and preprocessing) with the purpose of gaining some insight into which features are most responsible for determining difficulty ratings. After this initial data analysis, we developed 3 distinct classification models (k-nearest neighbors, decision tree, and neural network) to categorize the difficulty of a problem given these attributes. Each of our 3 models attempted the same classification task, so we were able to compare their performance.
One possible application of our project would be to automatically label the difficulty of coding problems encountered outside of Leetcode, such as HackerRank or Codesignal.
For the complete version, please check the current notebook.
We have decided to perform our data exploration in 2 stages, one before and one after the data preprocessing. Stage 1 is the initial exploration of our data, looking into our attributes, their datatypes, and the distribution of our data. Stage 2 has been performed after the initial preprocessing of our data, looking further into the correlation between the different features and our target.
For a chronological approach, the steps have been: data exploration stage 1, data preprocessing stage 1, data exploration stage 2 and data preprocessing stage 2.
The total number of observations gathered for this dataset is 1825. We consider this an appropriate number of observations for thorough analysis and modeling, while remaining manageable enough to avoid the need for subsampling.
The first step of our exploration has been observing our raw data. We have looked into the data types as well as the number of null values. We have also taken a look at a small subset of the data in order to get a general idea of what to expect for each attribute.
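This initial inspection can be reproduced with a few pandas calls; the sketch below is illustrative, and the CSV filename is a placeholder rather than the notebook's actual path.

```python
# Minimal sketch of the raw-data inspection.
import pandas as pd

df = pd.read_csv("leetcode_dataset.csv")  # placeholder filename

df.info()                  # dtypes and non-null counts (table below)
print(df.isnull().sum())   # null values per column
print(df.head())           # small subset of the data (data example below)
```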
Datatypes:
Column | Non-Null Count | Dtype |
---|---|---|
id | 1825 | int64 |
title | 1825 | object |
is_premium | 1825 | int64 |
difficulty | 1825 | object |
solution_link | 987 | object |
acceptance_rate | 1825 | float64 |
frequency | 1825 | float64 |
url | 1825 | object |
discuss_count | 1825 | int64 |
accepted | 1825 | object |
submissions | 1825 | object |
companies | 1749 | object |
related_topics | 1571 | object |
likes | 1825 | int64 |
dislikes | 1825 | int64 |
rating | 1825 | int64 |
asked_by_faang | 1825 | int64 |
similar_questions | 745 | object |
Data Example:
id | title | description | is_premium | difficulty | solution_link | acceptance_rate | frequency | url | discuss_count | accepted | submissions | companies | related_topics | likes | dislikes | rating | asked_by_faang | similar_questions | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Two Sum | Given an array of integers `nums` and an integ... | 0 | Easy | /articles/two-sum | 46.7 | 100.0 | https://leetcode.com/problems/two-sum | 999 | 4.1M | 8.7M | Amazon,Google,Apple,Adobe,Microsoft,Bloomberg,... | Array,Hash Table | 20217 | 712 | 97 | 1 | [3Sum, /problems/3sum/, Medium], [4Sum, /probl... |
1 | 2 | Add Two Numbers | You are given two non-empty linked lists repre... | 0 | Medium | /articles/add-two-numbers | 35.7 | 93.1 | https://leetcode.com/problems/add-two-numbers | 999 | 1.9M | 5.2M | Bloomberg,Microsoft,Amazon,Google,Facebook,App... | Linked List,Math,Recursion | 11350 | 2704 | 81 | 1 | [Multiply Strings, /problems/multiply-strings/... |
2 | 3 | Longest Substring Without Repeating Characters | Given a string `s`, find the length of the lon... | 0 | Medium | /articles/longest-substring-without-repeating-... | 31.5 | 90.9 | https://leetcode.com/problems/longest-substrin... | 999 | 2.1M | 6.7M | Amazon,Bloomberg,Microsoft,Facebook,Apple,Adob... | Hash Table,Two Pointers,String,Sliding Window | 13810 | 714 | 95 | 1 | [Longest Substring with At Most Two Distinct C... |
3 | 4 | Median of Two Sorted Arrays | Given two sorted arrays `nums1` and `nums2` of... | 0 | Hard | /articles/median-of-two-sorted-arrays | 31.4 | 86.2 | https://leetcode.com/problems/median-of-two-so... | 999 | 904.7K | 2.9M | Amazon,Goldman Sachs,Facebook,Microsoft,Apple,... | Array,Binary Search,Divide and Conquer | 9665 | 1486 | 87 | 1 | NaN |
4 | 5 | Longest Palindromic Substring | Given a string `s`, return the longest palindr... | 0 | Medium | /articles/longest-palindromic-substring | 30.6 | 84.7 | https://leetcode.com/problems/longest-palindro... | 999 | 1.3M | 4.1M | Amazon,Microsoft,Wayfair,Facebook,Adobe,eBay,G... | String,Dynamic Programming | 10271 | 670 | 94 | 1 | [Shortest Palindrome, /problems/shortest-palin... |
Each observation is detailed through 19 distinct features. However, for the scope of our analysis, we have opted to exclude 6 of these features as they do not contribute to our project's objectives. These features were irrelevant or non-numeric, and we decided against attempting to implement NLP as it is beyond the scope of this class.
Below we have provided a description for each of the features included in the dataset.
Column Name | Data Type | # of Null | Description |
---|---|---|---|
id | int64 | 0 | Unique problem id. |
title | String | 0 | Name of the problem. |
description | String | 0 | Problem description. |
is_premium | int64 | 0 | Whether the question requires a premium account. This has been encoded as a number where the only possible values are 0 or 1, and 1 indicates that the question requires a premium account. |
difficulty | String | 0 | The level of difficulty of the entry (Easy, Medium, or Hard). |
solution_link | String | 838 | Link to the solution for the problem. |
acceptance_rate | float64 | 0 | How often a submitted answer is correct. This is a float value indicating the percentage of acceptance, ranging from 13.9 to 95.6 in our dataset but with a possible range of 0 to 100. |
frequency | float64 | 0 | How often the problem is attempted, also expressed as a percentage. |
url | String | 0 | URL of the problem. |
discuss_count | int64 | 0 | How many comments have been submitted by users. We have observed a maximum count of 999, which means that problems at that value may actually lie well above this cap or stay close to it. |
accepted | String | 0 | How many times the answer was accepted. This has been encoded in the dataset as a string of the form valueK (thousands) or valueM (millions). For our project, we will convert this value to a number. |
submissions | String | 0 | How many times an answer was submitted. This feature has the same format as the previous one, so we will perform the same conversion. |
companies | String | 76 | Which companies were tagged as having asked this specific problem. |
related_topics | String | 254 | Topics related to the current problem, given as a string of the different topics that the problem is associated with. As will be seen in the analysis, we have decided to one-hot encode this feature in order to work with it. It is also important to note that 14% of the values are null, meaning that 14% of our observations have no value for this feature. |
likes | int64 | 0 | How many likes the problem got. Integer value ranging from 2 to 20,200. |
dislikes | int64 | 0 | How many dislikes the problem got. Integer value ranging from 0 to 8,900. |
rating | int64 | 0 | A combination of the previous 2 features, computed as likes/(likes+dislikes). It is therefore expressed as a percentage, ranging between 7 and 100 in our dataset. |
asked_by_faang | int64 | 0 | Whether or not the question was asked by Facebook, Apple, Amazon, Google, or Netflix. |
similar_questions | String | 1080 | A list of similar entries, possibly including titles, links, and difficulty levels. |
For a more detailed insight into the data, please refer to the current notebook.
Below, we provide the figures that we believe give the best insights into our data.
Figure 1. Correlation plot between the different features.
Figure 2. Visualization of the number of problems associated with each unique topic
Figure 3. Distribution of problems across different difficulty levels (Easy, Medium, Hard)
Figure 4. Distribution of topics by difficulty level
Pairplot
After the initial preprocessing we decided to visualize the pairwise relationships between variables to identify potential patterns and correlations.
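A minimal sketch of how such a pairplot can be generated, assuming seaborn and an encoded dataframe (`df_encoded` is an illustrative name) that already contains the numeric target `difficulty`:

```python
# Sketch of the pairwise visualization used for Figure 5.
import seaborn as sns
import matplotlib.pyplot as plt

numeric_cols = ["acceptance_rate", "frequency", "discuss_count",
                "likes", "dislikes", "rating", "difficulty"]
sns.pairplot(df_encoded[numeric_cols], hue="difficulty", corner=True)
plt.show()
```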
Figure 5. Pairplot analysis of encoded data
Key observations from our pair plot analysis:
- There appears to be a significant correlation between discussion, acceptance rate, and difficulty.
- Likes and dislikes are inversely correlated.
- The count of non-FAANG companies and problem frequency seem to have a positive correlation. However, the pair plot does not provide substantial information relevant to our goal. Therefore, we will proceed to examine the data using a heatmap/correlation matrix to gain more insights.
Correlation matrix
Transitioning to the correlation matrix analysis, our aim is to gain deeper insights into the relationships between variables. We begin by normalizing our data, for which we have chosen min-max normalization since not all our data is normally distributed.
With the data normalization complete, we can now examine our correlation matrix.
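A sketch of the normalization and correlation steps, assuming scikit-learn and seaborn and the encoded dataframe `df_encoded` (illustrative name), rather than the notebook's exact code:

```python
# Min-max normalize every feature, then inspect the correlation structure.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # rescales each feature to the [0, 1] range
df_norm = pd.DataFrame(scaler.fit_transform(df_encoded), columns=df_encoded.columns)

corr = df_norm.corr()                        # full correlation matrix (Figure 6)
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()

# Correlations with the target only (Figures 7-9)
print(corr["difficulty"].sort_values(ascending=False))
```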
Figure 6. Complete correlation matrix of all attributes
This correlation matrix offers insights into the relationships between variables. Notable observations include:
- Several "Tree" related topics have a positive correlation with each other.
- Discussion is negatively correlated with rating (significantly more so than submissions), implying that the harder a problem is, the less discussion there will be.
- The most negatively correlated topic with rating is "Math."
However, given the extensive information presented in this correlation matrix, we will narrow our focus on the parameter of interest: problem difficulty.
Figure 7. Correlation between all attributes and problem difficulty
To clarify the complex data, we break down the analysis into related topics and the remaining features. Looking at the related topics, we can conclude some interesting properties:
- "Dynamic programming" is the most correlated topic in determining problem difficulty.
- "Array" topics are the most negatively correlated topic.
- Most topic labels are associated with increased difficulty.
Figure 8. Correlation between related topics and problem difficulty
Looking at the other features we can conclude some other interesting properties:
- "Acceptance rate" is the most correlated feature determining difficulty (which seems self evident).
- "Discuss count" is the second most correlated feature.
- "Dislikes" seems to be a better predictor of problem difficulty than likes (which is interesting since Leetcode actually hid the dislike counter).
Figure 9. Correlation between all features (except the related topics) and problem difficulty
We can now clearly decide which features are relevant to problem difficulty and which are not. In the next stage of our preprocessing we will drop all features with a correlation coefficient below 0.05, leaving us with a set of 24 features (excluding 'difficulty' itself).
Figure 10. Correlation between the features we have kept and problem difficulty
1. Dropping Unnecessary Data
We have identified certain attributes that don't contribute to predicting 'difficulty'. These columns are:
- "id": Unique identifiers for problems, irrelevant to difficulty prediction.
- "title": Does not provide predictive value for difficulty without NLP.
- "description": Unrelated to the predictive task at hand without the use of NLP.
- "solution_link" and "url": Links that offer no predictive insight.
- "similar_questions": Does not directly contribute to predicting difficulty without introducing too much complexity to the model. This info can also be gleaned from "related topics."
- "asked_by_faang": Rather than a binary value of whether a FAANG company asked the question, a count of how many FAANG companies asked it might be more informative.
2. Data encoding
Certain columns require encoding to numerical attributes (a condensed sketch of these conversions follows the list):
- "difficulty": Our target variable will be encoded from categorical ("Easy", "Medium", "Hard") to a numerical format.
- "submissions" and "accepted": These string-formatted numerical values will be converted to integers.
- "companies": To capture the potential correlation between the problem's difficulty and the company's profile, we'll convert this into two numeric attributes: "faang_count" for the count of FAANG companies and "non_faang_count" for others.
- "related_topics": Given the potential link between certain topics and problem difficulty, this attribute will be standardized and one-hot encoded to represent the various topics.
The changes performed can be appreciated in the following table:
is_premium | difficulty | acceptance_rate | frequency | discuss_count | accepted | submissions | likes | dislikes | rating | ... | Sliding Window | Sort | Stack | String | Suffix Array | Topological Sort | Tree | Trie | Two Pointers | Union Find | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 46.7 | 100.0 | 999 | 4100000.0 | 8700000.0 | 20217 | 712 | 97 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 1 | 35.7 | 93.1 | 999 | 1900000.0 | 5200000.0 | 11350 | 2704 | 81 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 31.5 | 90.9 | 999 | 2100000.0 | 6700000.0 | 13810 | 714 | 95 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 0 | 2 | 31.4 | 86.2 | 999 | 904700.0 | 2900000.0 | 9665 | 1486 | 87 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 1 | 30.6 | 84.7 | 999 | 1300000.0 | 4100000.0 | 10271 | 670 | 94 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1820 | 1 | 0 | 88.9 | 0.0 | 15 | 305 | 343 | 2 | 2 | 50 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1821 | 0 | 0 | 67.8 | 0.0 | 104 | 7900.0 | 11700.0 | 37 | 4 | 90 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1822 | 0 | 1 | 71.8 | 0.0 | 135 | 6800.0 | 9500.0 | 81 | 4 | 95 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1823 | 0 | 1 | 47.2 | 0.0 | 134 | 5000.0 | 10700.0 | 147 | 8 | 95 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1824 | 0 | 2 | 28.1 | 0.0 | 48 | 2100.0 | 7400.0 | 52 | 43 | 55 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1825 rows × 55 columns
3. Handling Missing Data
After the initial data cleaning:
- No attributes currently exhibit missing values, indicating a clean dataset ready for further processing.
- For optimal model performance, normalization will be applied to the remaining attributes to ensure uniformity in data scale and distribution.
Following our comprehensive data exploration in stage 2, we have identified a select group of features to keep. Although a significant portion of our data was already preprocessed during the initial importing and cleaning phase, we now only need to make refinements based on our findings from the data exploration. For this reason, we drop all features with a correlation coefficient below 0.05, leaving us with a set of 24 features (excluding 'difficulty' itself).
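A sketch of this refinement step, using the normalized dataframe `df_norm` from the earlier exploration (illustrative name):

```python
# Keep only features whose absolute correlation with 'difficulty' is >= 0.05.
corr_with_target = df_norm.corr()["difficulty"].drop("difficulty")
kept = corr_with_target[corr_with_target.abs() >= 0.05].index.tolist()

df_final = df_norm[kept + ["difficulty"]]
print(f"Kept {len(kept)} features plus the target")
```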
A sample of the final dataframe can be observed below.
acceptance_rate | discuss_count | Array | dislikes | Hash Table | String | Sliding Window | Trie | Depth-first Search | Binary Search | ... | Breadth-first Search | Ordered Map | Heap | Union Find | Backtracking | frequency | Segment Tree | rating | Dynamic Programming | difficulty | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 46.7 | 999 | 1 | 712 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 100.0 | 0 | 97 | 0 | 0 |
1 | 35.7 | 999 | 0 | 2704 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 93.1 | 0 | 81 | 0 | 1 |
2 | 31.5 | 999 | 0 | 714 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 90.9 | 0 | 95 | 0 | 1 |
3 | 31.4 | 999 | 1 | 1486 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 86.2 | 0 | 87 | 0 | 2 |
4 | 30.6 | 999 | 0 | 670 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 84.7 | 0 | 94 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1820 | 88.9 | 15 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 50 | 0 | 0 |
1821 | 67.8 | 104 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 90 | 0 | 0 |
1822 | 71.8 | 135 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 95 | 0 | 1 |
1823 | 47.2 | 134 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 95 | 1 | 1 |
1824 | 28.1 | 48 | 0 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0.0 | 0 | 55 | 0 | 2 |
1825 rows × 23 columns
For the complete version, please check the current notebook.
We chose the K-Nearest Neighbors (KNN) algorithm for its simplicity and effectiveness, particularly in multi-class classification tasks. KNN operates on the principle that similar data points are often in close proximity and classifies a new data point based on the majority vote of its 'k' nearest neighbors. Its ability to handle multi-class classification makes it well-suited for predicting the difficulty levels of LeetCode problems (Easy, Medium, Hard).
For K-nearest neighbors, the hyperparameter we can tune is the 'k' value, i.e., the number of neighbors used for classification. We approached this by building models with k values ranging from 1 to 40 and taking the argmin of the resulting error.
Our finding was that k=12 produced the best result.
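A minimal sketch of this search is shown below, assuming scikit-learn and an existing train/test split (`X_train`, `X_test`, `y_train`, `y_test`); the cross-validation component used in the notebook is omitted here for brevity.

```python
# Sketch of the k search: fit a KNN classifier for each k in 1..40 and keep
# the k with the lowest misclassification rate on the held-out data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

errors = []
for k in range(1, 41):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    errors.append(1 - knn.score(X_test, y_test))

best_k = int(np.argmin(errors)) + 1  # +1 because k starts at 1
print("Best k:", best_k)             # k = 12 in our run
```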
We chose the Decision Tree algorithm for its ease of interpretation and its ability to show which attributes are most important, which isn't always clear with other algorithms. This is beneficial because we can see the hierarchical relationship between the features and thus determine which features are most strongly predictive and which are not. Also, its ability to handle various data types, such as discrete or continuous values, through the use of thresholds makes it suitable for multi-class classification, or in our case, predicting the difficulty levels of LeetCode problems (Easy, Medium, Hard).
For decision trees, we decided to tune the model on the ccp_alpha parameter. This parameter represents the degree of pruning after the decision tree has been built and controls the trade-off between tree complexity and its ability to minimize impurity. The goal is to obtain a decision tree that performs better on unseen data.
We implemented this by computing the model's accuracy score for each candidate ccp_alpha value and taking the argmax to select the best alpha.
The best validation accuracy was achieved for alpha=0.0016001376425733224
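A sketch of this search, assuming scikit-learn and a separate validation split (`X_val`, `y_val` are illustrative names), rather than the notebook's exact code:

```python
# Sketch of the ccp_alpha search: compute the cost-complexity pruning path of
# a fully grown tree, then pick the alpha with the best validation accuracy.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # the last alpha prunes the tree to a single node

val_scores = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    val_scores.append(tree.score(X_val, y_val))

best_alpha = ccp_alphas[int(np.argmax(val_scores))]
print("Best ccp_alpha:", best_alpha)
```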
We have observed that k-nearest neighbors and decision trees both have similar performance. We hope that a neural network will be able to infer more complex patterns that may not be captured by these more simplistic models.
In the case of neural networks there is a wide range of hyperparameters that can be tuned. In order to find the best neural networks, the parameters that we have chosen to tune are:
- Number of layers
- Activation function for all hidden layers
- Learning rate
- Optimizer
- We haven't learned many optimizers beyond SGD but we will include Adam since it is based on SGD
We have chosen not to tune:
- Hidden layers individually (at least for now)
- While by no means authoritative, this stack overflow post seems to reasonably explain that changing the individual layers activation functions is likely to be less worthwhile than simply changing the number of layers or neurons
- Output layer
- Softmax is generally considered to be the best output activation function for multi-class classification problems
- Loss function
- We don't want to use a regression loss function as our problem is a classification problem
- Since we have multiple classes, we should use categorical cross-entropy and not binary cross-entropy.
- Epochs
- We chose not to tune epochs as the keras docs say "It is generally not needed to tune the number of epochs because a built-in callback is passed to model.fit() to save the model at its best epoch evaluated by the validation_data."
Our search space ends up being as follows:
Search space summary
Default search space size: 6
hidden_layers (Int)
{'default': None, 'conditions': [], 'min_value': 2, 'max_value': 6, 'step': 1, 'sampling': 'linear'}
units_0 (Int)
{'default': None, 'conditions': [], 'min_value': 64, 'max_value': 512, 'step': 48, 'sampling': 'linear'}
activation (Choice)
{'default': 'relu', 'conditions': [], 'values': ['relu', 'tanh'], 'ordered': False}
units_1 (Int)
{'default': None, 'conditions': [], 'min_value': 64, 'max_value': 512, 'step': 48, 'sampling': 'linear'}
learning_rate (Float)
{'default': 0.0001, 'conditions': [], 'min_value': 0.0001, 'max_value': 1.0, 'step': None, 'sampling': 'log'}
optimizer (Choice)
{'default': 'adam', 'conditions': [], 'values': ['adam', 'SGD'], 'ordered': False}
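The search space above can be realized with a KerasTuner hypermodel along the following lines. This is a sketch under our assumptions (TensorFlow/Keras with `keras_tuner`, one-hot encoded difficulty labels, and an illustrative `max_trials` value), not the notebook's exact code.

```python
# Sketch of a KerasTuner hypermodel matching the search space summary above.
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential()
    n_hidden = hp.Int("hidden_layers", min_value=2, max_value=6, step=1)
    activation = hp.Choice("activation", ["relu", "tanh"])
    for i in range(n_hidden):
        units = hp.Int(f"units_{i}", min_value=64, max_value=512, step=48)
        model.add(tf.keras.layers.Dense(units, activation=activation))
    model.add(tf.keras.layers.Dense(3, activation="softmax"))  # Easy / Medium / Hard

    lr = hp.Float("learning_rate", min_value=1e-4, max_value=1.0, sampling="log")
    if hp.Choice("optimizer", ["adam", "SGD"]) == "adam":
        optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    else:
        optimizer = tf.keras.optimizers.SGD(learning_rate=lr)

    # Categorical cross-entropy assumes one-hot encoded difficulty labels.
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=20)
# tuner.search(X_train, y_train, validation_data=(X_val, y_val), epochs=50)
```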
After the training, our best model has a loss of 0.7388 and an accuracy of 0.6776. The model is as follows:
Model: Sequential
Layer (type) | Output Shape | Param # |
---|---|---|
dense (Dense) | (None, 352) | 8096 |
dense_1 (Dense) | (None, 160) | 56480 |
dense_2 (Dense) | (None, 64) | 10304 |
dense_3 (Dense) | (None, 64) | 4160 |
dense_4 (Dense) | (None, 3) | 195 |
Total params: 79235 | Trainable params: 79235 | Non-trainable params: 0
After tuning our model to the best 'k' value, we continue by assessing its performance on the unseen test data to see how well it generalizes.
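For reference, the confusion matrices and classification reports in this section can be produced with scikit-learn, roughly as follows (a sketch; `knn` denotes the fitted k=12 model):

```python
# Sketch of the test-set evaluation: one 2x2 confusion matrix per class plus
# the per-class precision/recall/f1 report.
from sklearn.metrics import multilabel_confusion_matrix, classification_report

y_pred = knn.predict(X_test)
print(multilabel_confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```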
Confusion Matrix:
[[[118 17]
[ 32 16]]
[[ 39 49]
[ 21 74]]
[[134 9]
[ 22 18]]]
Classification Report:
precision recall f1-score support
0 0.48 0.33 0.40 48
1 0.60 0.78 0.68 95
2 0.67 0.45 0.54 40
accuracy 0.59 183
macro avg 0.58 0.52 0.54 183
weighted avg 0.59 0.59 0.57 183
To understand our model's learning, we also evaluate its performance on the training set to see how well it has captured the underlying patterns of the data.
Confusion Matrix:
[[[1087 126]
[ 210 219]]
[[ 393 381]
[ 163 705]]
[[1238 59]
[ 193 152]]]
Classification Report:
precision recall f1-score support
0 0.63 0.51 0.57 429
1 0.65 0.81 0.72 868
2 0.72 0.44 0.55 345
accuracy 0.66 1642
macro avg 0.67 0.59 0.61 1642
weighted avg 0.66 0.66 0.64 1642
The fitting graph below shows the relationship between the choice of 'k' in KNN and the corresponding training, testing, and cross-validation errors.
Figure 12. Training, testing and cross validation error based on the value of k for KNN
Figure 13. Training and testing error based on the value for the effective alpha
Below is the best model we found.
Figure 14. Best model for decision trees
Figure 15. Visualization of the decision tree model
Train accuracy: 0.7521315468940317
Test accuracy: 0.6065573770491803
precision recall f1-score support
0 0.53 0.33 0.41 48
1 0.62 0.72 0.66 95
2 0.63 0.68 0.65 40
accuracy 0.61 183
macro avg 0.59 0.57 0.57 183
weighted avg 0.60 0.61 0.59 183
After obtaining our best model, we continue by assessing its performance on the unseen test data to see how well it generalizes.
precision recall f1-score support
0 0.61 0.72 0.66 46
1 0.71 0.71 0.71 98
2 0.68 0.54 0.60 39
accuracy 0.68 183
macro avg 0.67 0.66 0.66 183
weighted avg 0.68 0.68 0.68 183
To understand our model's learning, we also evaluate its performance on the training set to see how well it has captured the underlying patterns of the data.
precision recall f1-score support
0 0.65 0.66 0.66 431
1 0.70 0.77 0.74 865
2 0.78 0.56 0.65 346
accuracy 0.70 1642
macro avg 0.71 0.67 0.68 1642
weighted avg 0.71 0.70 0.70 1642
Figure 16. Training and testing error based on the parameter count
Our fitting graph is unfortunately very disorganized. This is because parameter count is not a great measure of model complexity. However, it gives us some insight into our model, such as the fact that it is slightly overfit.
By comparing the error/accuracy, precision, and recall on the training and test data, we can see that:
- The model does not perform particularly well on either the training or test data; we only achieved accuracies of 0.66 and 0.59, respectively.
- Overall, the model performs better on the training data.
- The model performs best on class '1', which corresponds to "medium" difficulty. This makes sense because medium problems are the most common in the dataset.
- The model's precision is stable (around 0.60 across the three classes), but its recall is not (high for medium problems, low for the others).
Next, we show the fitting graph of training and test error under different k values.
- Overfitting at Low k-values: When 'k' is small (especially at k=1), the model achieves perfect accuracy on the training data, indicative of overfitting. As 'k' increases, the training error rises, and the test error generally decreases, which shows a reduction in overfitting.
- Underfitting at High k-values: Larger 'k' values tend to generalize better to unseen data, reducing test error, but excessively large 'k' can lead to underfitting.
- Cross-Validation: The cross-validation error does not always grow with 'k'. This behavior shows the importance of cross-validation in identifying a 'k' that balances the model between known and unseen data. It also shows the delicate balance between avoiding overfitting with lower 'k' values and the risk of underfitting with higher 'k' values, despite the smaller errors.
Reference used (KNN Introduction): https://www.codecademy.com/learn/introduction-to-supervised-learning-skill-path/modules/k-nearest-neighbors-skill-path/cheatsheet
We have chosen a 'k' of 12 since it shows the most balanced performance between overfitting and underfitting, as evidenced by its superior cross-validation score.
Despite our efforts in fine-tuning 'k', the model's performance through iterative cross-validation on the dataset — marked by accuracies of 0.59 and 0.66 — hasn't met our expectations. This might be due to the fact that there is a relatively weak correlation between our features and the target variable "difficulty". The exploratory analysis, particularly the heatmap, showed that even the most correlated feature, "acceptance_rate", only has a correlation coefficient of -0.39. This is likely a significant contributor to the model's underwhelming performance. Additionally, the disparity between the number of problems of different difficulty levels may be contributing.
Strategies for Improvement
Given the model's current limitations, we propose some approaches to improve its accuracy:
- Extended 'k' Range: Broadening the range of 'k' values beyond the initial 1 to 40, we may find a more effective 'k' that could improve the model's performance. However, it is crucial to balance the risk of overfitting with smaller 'k' values against underfitting with larger 'k' values.
- Data Preprocessing Refinement: Revisiting our approach to preprocessing, which is currently centered around MinMax Scaling, could be beneficial. Exploring alternative scaling and standardization techniques could reveal new patterns and correlations that might help the model's performance.
- Oversampling: Sampling with replacement to create a training data set with an equal number of Easy, Medium, and Hard problems might be worth exploring; see the sketch after this list.
As we progress to subsequent models, we hope that we can achieve better performance.
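As a rough illustration of the oversampling idea (not implemented in our notebook), each class could be resampled with replacement up to the size of the largest class; `X_train`/`y_train` are assumed to be pandas objects with `difficulty` as the target name.

```python
# Sketch of oversampling the training set so Easy, Medium, and Hard are balanced.
import pandas as pd
from sklearn.utils import resample

train = pd.concat([X_train, y_train], axis=1)
target_size = train["difficulty"].value_counts().max()

balanced = pd.concat([
    resample(group, replace=True, n_samples=target_size, random_state=0)
    for _, group in train.groupby("difficulty")
])
X_train_bal = balanced.drop(columns=["difficulty"])
y_train_bal = balanced["difficulty"]
```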
Note that when the training accuracy score equals 1, the model is overfitting.
- Overfitting at low alpha values: When ccp_alpha is small, indicating less pruning of the decision tree, the model achieves perfect training accuracy, which indicates overfitting.
- Cost Complexity Pruning: From the fitting graph, the ccp_alpha value that gives the highest accuracy on the testing dataset is 0.0016001376425733224.
- Model Comparison: Compared to the previous model's fitting graph, it behaves similarly: early values of alpha/k overfit, and error is higher at later values.
We observe that the most useful features for classification are the ones being compared in the top layers of the tree. In the top 3 layers, the only features used are discuss_count, acceptance_rate, and rating. In the lower layers, frequency, acceptance_rate, and many question topics (Hash Map, Union Find, etc) are used.
Looking at the tree as a whole, we find a lot of nodes of class 1, perhaps because our encoding places class 1 between the other classes, causing many predictions of class 1. This aligns with about half of the samples being class 1, as well as class 1 having relatively fewer false negatives (precision < recall).
The first decision separates more than half of the class 2 observations from the set along with around a fourth of the class 1 observations and very few class 0 observations. This suggests that the classes do somewhat fall on a spectrum with respect to the selected features (it is easier to separate class 0 and 2 rather than 0 and 1 or 1 and 2). This is supported by the fact that there is only one instance of two sibling nodes being class 0 and 2; every other pair is 0/1 or 1/2.
Now we conduct a comparative analysis of the model's performance across the training and the testing sets to understand how well the model can generalize beyond the data it was trained on.
- Performance Overview: The accuracy scores indicate that the model's performance is moderate, with a training accuracy of 0.75 and a testing accuracy of 0.61. There is again room for improvement in the model's ability to predict unseen data accurately, although it performed slightly better than the previous model. As expected, the model performs better on the training set.
- Class-wise Performance: The model performs better on the 'medium' and 'hard' difficulty levels than on the 'easy' level. This could be because of the low prevalence of the 'easy' difficulty class compared to 'medium' and 'hard' in the dataset.
- Precision and Recall: The model's precision across the 3 classes is relatively stable at around 0.60. The recall, however, is not as stable, being lower for easy questions and higher for medium and hard questions. The features do not capture the difference between easy and medium questions very well, so the model frequently classifies easy questions as medium.
To conclude our second model, our decision tree performed mostly the same as the first model; achieving high recall on easy problems remains difficult, while recall is more stable for medium and hard problems.
Strategies for Improvement
Some strategies that we have not implemented but that would likely improve the decision tree's performance are bagging and boosting; a sketch follows below. This model revealed the most important features, and obtaining more accurate data on discuss_count could be useful, along with possible feature expansion.
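A quick sketch of how these ensemble ideas could be tried with scikit-learn (not part of our notebook; the hyperparameters are placeholders):

```python
# Bagging (random forest) and boosting (gradient boosting) on the same split.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

bagged = RandomForestClassifier(n_estimators=200, random_state=0)
bagged.fit(X_train, y_train)
print("Random forest test accuracy:", bagged.score(X_test, y_test))

boosted = GradientBoostingClassifier(random_state=0)
boosted.fit(X_train, y_train)
print("Gradient boosting test accuracy:", boosted.score(X_test, y_test))
```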
We can now compare the training and testing performance of our Neural Network against the previous models.
- Training vs Testing: The training accuracy was 70% and the testing accuracy was 68%. This implies that our model is slightly overfit, as the training accuracy is higher than the testing accuracy. The precision and recall on the testing set are also lower than on the training set.
- Neural Network vs KNN: Comparing training accuracy, our neural network performs about 4% better than our KNN model; comparing testing accuracy, it performs about 9% better. While not a huge difference, it is a small improvement.
- Neural Network vs Decision Tree: Comparing training accuracy, our neural network performs about 3% worse than our decision tree model. However, comparing testing accuracy, it performs about 7% better. This shows how the decision tree model tends to overfit the data.
In general, the hypertuned neural network seems to predict the difficulty of LeetCode problems more accurately. However, there is a tradeoff: the neural network is a much less transparent model than a decision tree.
Our Neural Network model outperformed our KNN and Decision Tree models, but only barely. With accuracies of 68% on testing data and 70% on training data, our model represents about a 7% improvement over the previous ones.
We also need to acknowledge that Neural Networks are much more opaque than Decision Trees or KNN models. If somebody wants to understand why a problem is labeled easy, medium, or hard, our Neural Network is unlikely to provide great insight.
Strategies for Improvement
- More rigorous hypertuning: For the sake of training time (and not losing our minds), we limited the parameters we were hypertuning. In theory, there could be a model with more layers, more neurons, and different activation functions that performs significantly better. Furthermore, we chose the RandomizedSearch tuner for the sake of time, but GridSearch might give better results.
- Better data/pre-processing: As the saying goes, garbage in -> garbage out. While our models perform decently, we are unlikely to get much better performance without more and better data. An important aspect of a LeetCode problem is the problem description; however, for simplicity's sake we chose to drop it. If we were to process this data, we would likely improve our model's accuracy significantly.
In this study, we explored three different machine learning models — K-Nearest Neighbors (KNN), Decision Tree, and Neural Network — to predict the difficulty level of LeetCode problems.
Having analyzed each of the different models, a conclusion that we are not excited to arrive at is that our dataset is not ideal for analysis. We were not able to obtain high accuracy with any of the models; the maximum accuracy was achieved by the neural network, at 68% on testing data and 70% on training data. Although this accuracy is not as high as we would like, we note that it is better than chance, which in our case would be randomly guessing one of three classes, giving an accuracy of approximately 33%.
We believe that a reason behind this could be the imbalanced distribution of the data, with the majority of problems falling into the medium difficulty category. This imbalance might have led to skewed model predictions and hindered their ability to accurately classify instances across all difficulty levels.
Looking ahead, we recognize the need for improvements to enhance the accuracy of our predictive models. Implementing the strategies outlined in the discussion sections, such as refining data preprocessing and sampling techniques, expanding our hyperparameter tuning, and incorporating natural language processing, holds promise for improving model performance. Additionally, acquiring a more comprehensive dataset with a balanced distribution of problem difficulties could yield more accurate and reliable predictions.
In summary, while our initial findings may not be as promising as hoped, we remain optimistic about the potential for future improvements.
- Ana Maria Sampedro Barrera: Writer: Brainstormed potential datasets and projects with the team. Attended all team meetings. Worked on describing the data for milestone #1 as well as writing and reorganizing the contents for the final report.
- David Choi: Team Leader: Acted as project leader in most meetings. Worked on data exploration, producing pairplots and small correlation matrices. Worked on data preprocessing, dropping columns and one-hot encoding related topics. Worked on the model 3 neural network and the hyperparameter tuner. Created the fitting graph. Worked on the final readme file.
- Isabelle Krochmal: Programmer: Brainstormed potential datasets and projects with the team. Attended team meetings and led one meeting when the usual team leader was not available. Performed initial data preprocessing: for example, decided on irrelevant or unusable features to drop and manipulated lists of companies into quantitative features usable in the model. Worked on milestone 3: KNN. Researched and reviewed code with the team. Interpreted the results of the model and used them to explain our reasoning for which two models we would create next.
- Evie Stergiopoulou: Programmer/Writer: Milestone 2: conducted data exploration analyzing problem counts, the distribution of problem difficulty levels, and the distribution of topics by difficulty level. Refined the data evaluation and preprocessing section and split it into subsections. Provided an overview for each section, ensured consistent formatting, and elaborated on writing and code that needed clarification. Milestone 3: refined all writing aspects and added code comments to the KNN model section. Updated the README with current information.
- James Rey Macasa: Programmer: Did data visualization by programming the heatmap showing the correlation between different features. Attended most meetings, helping brainstorm potential models to use and datasets to work on. Worked on the milestone 3 model: KNN. Helped review model code and results, asked TAs for advice on the model, and wrote potential improvements that could be made.
- Rui Li: Programmer: Milestone 1: wrote the temporary abstract. Milestone 2: did GitHub initialization, participated in data exploration, notebook reorganization, and writing the readme. Milestone 3: trained and evaluated the KNN model, recorded the fitting graph, drafted conclusions, and edited the readme.
- Jinwoong Huh: Programmer: Did initial data preprocessing and one-hot encoding of the data. Worked on coding the decision tree model, as well as the decision tree and fitting graph visualizations. Collaborated with teammates on the observation writeup and readme for the DT model.
- Andrew Jia: Writer: Contributed to early feature selection. Worked on model 2 (decision tree), mostly writing, formatting, and converting from ipynb to readme. This includes writing observations from the decision tree visualization and the conclusion of milestone 4.
- Ojasvi Tewari: Programmer: Contributed to the discussion of possible datasets and helped with the abstracts. Worked on model 3 by setting up the code, training the hypertuned model, and training the K-fold hypertuned model.
- Lucas Xu: Programmer: Worked on the model 2 decision tree, coding with the team. Met with the team and discussed project ideas. Coded the decision tree training and did writeups for the decision tree conclusion and classification report.