The dataset we have selected is derived from LeetCode, one of the largest online platforms for coding interview preparation, with an extensive collection of coding challenges categorized as easy, medium, or hard. However, the company has never explained what distinguishes these categories: how does LeetCode decide the difficulty of a problem? The dataset we plan to analyze contains 1825 LeetCode problems, with 19 different features for each question, such as difficulty, acceptance rate, attempted frequency, and related topics. The purpose of our project is to understand which factors matter most to LeetCode when scoring the difficulty of a problem.
In order to undertake this task, we have followed a set of preliminary steps (data exploration and preprocessing) with the purpose of gaining some insight into which features are most responsible for determining difficulty ratings. After this initial data analysis, we developed 3 distinct classification models (k-nearest neighbors, decision tree, and neural network) to categorize the difficulty of a problem given these attributes. Each of our 3 models attempted the same classification task, so we were able to compare their performance.
One possible application of our project would be to automatically label the difficulty of coding problems encountered outside of Leetcode, such as HackerRank or Codesignal.
For the complete version, please check the current notebook.
We have decided to perform our data exploration in 2 stages, one before and one after the data preprocessing. Stage 1 is the initial exploration of our data, looking into our attributes, their datatypes, and the distribution of our data. Stage 2 has been performed after the initial preprocessing of our data, looking further into the correlation between the different features and our target.
For a chronological approach, the steps have been: data exploration stage 1, data preprocessing stage 1, data exploration stage 2 and data preprocessing stage 2.
The total number of observations gathered for this dataset is 1825. We consider this an appropriate number of observations for thorough analysis and modeling, while remaining manageable enough to avoid the need for subsampling.
The first step of our exploration has been observing our raw data. We have looked into the data types as well as the number of null values. We have also taken a look at a small subset of the data in order to get a general idea of what to expect for each attribute.
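This initial inspection can be reproduced with a few pandas calls; the sketch below is illustrative, and the CSV filename is a placeholder rather than the notebook's actual path.

```python
# Minimal sketch of the raw-data inspection.
import pandas as pd

df = pd.read_csv("leetcode_dataset.csv")  # placeholder filename

df.info()                  # dtypes and non-null counts (table below)
print(df.isnull().sum())   # null values per column
print(df.head())           # small subset of the data (data example below)
```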
Datatypes:
Column | Non-Null Count | Dtype |
---|---|---|
id | 1825 | int64 |
title | 1825 | object |
is_premium | 1825 | int64 |
difficulty | 1825 | object |
solution_link | 987 | object |
acceptance_rate | 1825 | float64 |
frequency | 1825 | float64 |
url | 1825 | object |
discuss_count | 1825 | int64 |
accepted | 1825 | object |
submissions | 1825 | object |
companies | 1749 | object |
related_topics | 1571 | object |
likes | 1825 | int64 |
dislikes | 1825 | int64 |
rating | 1825 | int64 |
asked_by_faang | 1825 | int64 |
similar_questions | 745 | object |
Data Example:
id | title | description | is_premium | difficulty | solution_link | acceptance_rate | frequency | url | discuss_count | accepted | submissions | companies | related_topics | likes | dislikes | rating | asked_by_faang | similar_questions | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Two Sum | Given an array of integers `nums` and an integ... | 0 | Easy | /articles/two-sum | 46.7 | 100.0 | https://leetcode.com/problems/two-sum | 999 | 4.1M | 8.7M | Amazon,Google,Apple,Adobe,Microsoft,Bloomberg,... | Array,Hash Table | 20217 | 712 | 97 | 1 | [3Sum, /problems/3sum/, Medium], [4Sum, /probl... |
1 | 2 | Add Two Numbers | You are given two non-empty linked lists repre... | 0 | Medium | /articles/add-two-numbers | 35.7 | 93.1 | https://leetcode.com/problems/add-two-numbers | 999 | 1.9M | 5.2M | Bloomberg,Microsoft,Amazon,Google,Facebook,App... | Linked List,Math,Recursion | 11350 | 2704 | 81 | 1 | [Multiply Strings, /problems/multiply-strings/... |
2 | 3 | Longest Substring Without Repeating Characters | Given a string `s`, find the length of the lon... | 0 | Medium | /articles/longest-substring-without-repeating-... | 31.5 | 90.9 | https://leetcode.com/problems/longest-substrin... | 999 | 2.1M | 6.7M | Amazon,Bloomberg,Microsoft,Facebook,Apple,Adob... | Hash Table,Two Pointers,String,Sliding Window | 13810 | 714 | 95 | 1 | [Longest Substring with At Most Two Distinct C... |
3 | 4 | Median of Two Sorted Arrays | Given two sorted arrays `nums1` and `nums2` of... | 0 | Hard | /articles/median-of-two-sorted-arrays | 31.4 | 86.2 | https://leetcode.com/problems/median-of-two-so... | 999 | 904.7K | 2.9M | Amazon,Goldman Sachs,Facebook,Microsoft,Apple,... | Array,Binary Search,Divide and Conquer | 9665 | 1486 | 87 | 1 | NaN |
4 | 5 | Longest Palindromic Substring | Given a string `s`, return the longest palindr... | 0 | Medium | /articles/longest-palindromic-substring | 30.6 | 84.7 | https://leetcode.com/problems/longest-palindro... | 999 | 1.3M | 4.1M | Amazon,Microsoft,Wayfair,Facebook,Adobe,eBay,G... | String,Dynamic Programming | 10271 | 670 | 94 | 1 | [Shortest Palindrome, /problems/shortest-palin... |
Each observation is detailed through 19 distinct features. However, for the scope of our analysis, we have opted to exclude 6 of these features as they do not contribute to our project's objectives. These features were irrelevant or non-numeric, and we decided against attempting to implement NLP as it is beyond the scope of this class.
Below we have provided a description for each of the features included in the dataset.
Column Name | Data Type | # of Null | Description |
---|---|---|---|
id | int64 | 0 | Unique problem id. |
title | String | 0 | Name of the problem. |
description | String | 0 | Problem description. |
is_premium | int64 | 0 | Whether the question requires a premium account. This has been encoded as a number where the only possible values are 0 or 1, and 1 indicates that the question requires a premium account. |
difficulty | String | 0 | The level of difficulty of the entry (Easy, Medium, or Hard). |
solution_link | String | 838 | Link to the solution for the problem. |
acceptance_rate | float64 | 0 | How often a submitted answer is correct. This is a float value indicating the percentage of acceptance, ranging from 13.9 to 95.6 in our dataset but with a possible range of 0 to 100. |
frequency | float64 | 0 | How often the problem is attempted, also expressed as a percentage. |
url | String | 0 | URL of the problem. |
discuss_count | int64 | 0 | How many comments have been submitted by users. We have observed a maximum count of 999, which means that problems at that value may actually lie well above this cap or stay close to it. |
accepted | String | 0 | How many times the answer was accepted. This has been encoded in the dataset as a string of the form valueK (thousands) or valueM (millions). For our project, we will convert this value to a number. |
submissions | String | 0 | How many times an answer was submitted. This feature has the same format as the previous one, so we will perform the same conversion. |
companies | String | 76 | Which companies were tagged as having asked this specific problem. |
related_topics | String | 254 | Topics related to the current problem, given as a string of the different topics that the problem is associated with. As will be seen in the analysis, we have decided to one-hot encode this feature in order to work with it. It is also important to note that 14% of the values are null, meaning that 14% of our observations have no value for this feature. |
likes | int64 | 0 | How many likes the problem got. Integer value ranging from 2 to 20,200. |
dislikes | int64 | 0 | How many dislikes the problem got. Integer value ranging from 0 to 8,900. |
rating | int64 | 0 | A combination of the previous 2 features, computed as likes/(likes+dislikes). It is therefore expressed as a percentage, ranging between 7 and 100 in our dataset. |
asked_by_faang | int64 | 0 | Whether or not the question was asked by Facebook, Apple, Amazon, Google, or Netflix. |
similar_questions | String | 1080 | A list of similar entries, possibly including titles, links, and difficulty levels. |
For a more detailed insight into the data, please refer to the current notebook.
Below, we provide the figures that we believe give the best insights into our data.
Figure 1. Correlation plot between the different features.
Figure 2. Visualization of the number of problems associated with each unique topic
Figure 3. Distribution of problems across different difficulty levels (Easy, Medium, Hard)
Figure 4. Distribution of topics by difficulty level
Pairplot
After the initial preprocessing we decided to visualize the pairwise relationships between variables to identify potential patterns and correlations.
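A minimal sketch of how such a pairplot can be generated, assuming seaborn and an encoded dataframe (`df_encoded` is an illustrative name) that already contains the numeric target `difficulty`:

```python
# Sketch of the pairwise visualization used for Figure 5.
import seaborn as sns
import matplotlib.pyplot as plt

numeric_cols = ["acceptance_rate", "frequency", "discuss_count",
                "likes", "dislikes", "rating", "difficulty"]
sns.pairplot(df_encoded[numeric_cols], hue="difficulty", corner=True)
plt.show()
```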
Figure 5. Pairplot analysis of encoded data
Key observations from our pair plot analysis:
- There appears to be a significant correlation between discussion, acceptance rate, and difficulty.
- Likes and dislikes are inversely correlated.
- The count of non-FAANG companies and problem frequency seem to have a positive correlation. However, the pair plot does not provide substantial information relevant to our goal. Therefore, we will proceed to examine the data using a heatmap/correlation matrix to gain more insights.
Correlation matrix
Transitioning to the correlation matrix analysis, our aim is to gain deeper insights into the relationships between variables. We begin by normalizing our data, for which we have chosen min-max normalization since not all our data is normally distributed.
With the data normalization complete, we can now examine our correlation matrix.
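A sketch of the normalization and correlation steps, assuming scikit-learn and seaborn and the encoded dataframe `df_encoded` (illustrative name), rather than the notebook's exact code:

```python
# Min-max normalize every feature, then inspect the correlation structure.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # rescales each feature to the [0, 1] range
df_norm = pd.DataFrame(scaler.fit_transform(df_encoded), columns=df_encoded.columns)

corr = df_norm.corr()                        # full correlation matrix (Figure 6)
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()

# Correlations with the target only (Figures 7-9)
print(corr["difficulty"].sort_values(ascending=False))
```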
Figure 6. Complete correlation matrix of all attributes
This correlation matrix offers insights into the relationships between variables. Notable observations include:
- Several "Tree" related topics have a positive correlation with each other.
- Discussion is negatively correlated with rating (significantly more so than submissions), implying that the harder a problem is, the less discussion there will be.
- The most negatively correlated topic with rating is "Math."
However, given the extensive information presented in this correlation matrix, we will narrow our focus on the parameter of interest: problem difficulty.
Figure 7. Correlation between all attributes and problem difficulty
To clarify the complex data, we break down the analysis into related topics and the remaining features. Looking at the related topics, we can conclude some interesting properties:
- "Dynamic programming" is the most correlated topic in determining problem difficulty.
- "Array" topics are the most negatively correlated topic.
- Most topic labels are associated with increased difficulty.
Figure 8. Correlation between related topics and problem difficulty
Looking at the other features we can conclude some other interesting properties:
- "Acceptance rate" is the most correlated feature determining difficulty (which seems self evident).
- "Discuss count" is the second most correlated feature.
- "Dislikes" seems to be a better predictor of problem difficulty than likes (which is interesting since Leetcode actually hid the dislike counter).
Figure 9. Correlation between all features (except the related topics) and problem difficulty
We can now clearly decide which features are relevant to problem difficulty and which are not. In the next stage of our preprocessing we will drop all features with a correlation coefficient below 0.05, leaving us with a set of 24 features (excluding 'difficulty' itself).
Figure 10. Correlation between the features we have kept and problem difficulty
1. Dropping Unnecessary Data
We have identified certain attributes that don't contribute to predicting 'difficulty'. These columns are:
- "id": Unique identifiers for problems, irrelevant to difficulty prediction.
- "title": Does not provide predictive value for difficulty without NLP.
- "description": Unrelated to the predictive task at hand without the use of NLP.
- "solution_link" and "url": Links that offer no predictive insight.
- "similar_questions": Does not directly contribute to predicting difficulty without introducing too much complexity to the model. This info can also be gleaned from "related topics."
- "asked_by_faang": Rather than a binary value of whether a FAANG company asked the question, a count of how many FAANG companies asked it might be more informative.
2. Data encoding
Certain columns require encoding to numerical attributes (a condensed sketch of these conversions follows the list):
- "difficulty": Our target variable will be encoded from categorical ("Easy", "Medium", "Hard") to a numerical format.
- "submissions" and "accepted": These string-formatted numerical values will be converted to integers.
- "companies": To capture the potential correlation between the problem's difficulty and the company's profile, we'll convert this into two numeric attributes: "faang_count" for the count of FAANG companies and "non_faang_count" for others.
- "related_topics": Given the potential link between certain topics and problem difficulty, this attribute will be standardized and one-hot encoded to represent the various topics.
The changes performed can be appreciated in the following table:
is_premium | difficulty | acceptance_rate | frequency | discuss_count | accepted | submissions | likes | dislikes | rating | ... | Sliding Window | Sort | Stack | String | Suffix Array | Topological Sort | Tree | Trie | Two Pointers | Union Find | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 46.7 | 100.0 | 999 | 4100000.0 | 8700000.0 | 20217 | 712 | 97 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 1 | 35.7 | 93.1 | 999 | 1900000.0 | 5200000.0 | 11350 | 2704 | 81 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 31.5 | 90.9 | 999 | 2100000.0 | 6700000.0 | 13810 | 714 | 95 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 0 | 2 | 31.4 | 86.2 | 999 | 904700.0 | 2900000.0 | 9665 | 1486 | 87 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 1 | 30.6 | 84.7 | 999 | 1300000.0 | 4100000.0 | 10271 | 670 | 94 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1820 | 1 | 0 | 88.9 | 0.0 | 15 | 305 | 343 | 2 | 2 | 50 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1821 | 0 | 0 | 67.8 | 0.0 | 104 | 7900.0 | 11700.0 | 37 | 4 | 90 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1822 | 0 | 1 | 71.8 | 0.0 | 135 | 6800.0 | 9500.0 | 81 | 4 | 95 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1823 | 0 | 1 | 47.2 | 0.0 | 134 | 5000.0 | 10700.0 | 147 | 8 | 95 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1824 | 0 | 2 | 28.1 | 0.0 | 48 | 2100.0 | 7400.0 | 52 | 43 | 55 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1825 rows × 55 columns
3. Handling Missing Data
After the initial data cleaning:
- No attributes currently exhibit missing values, indicating a clean dataset ready for further processing.
- For optimal model performance, normalization will be applied to the remaining attributes to ensure uniformity in data scale and distribution.
Following our comprehensive data exploration in stage 2, we have identified a select group of features to keep. Although a significant portion of our data was already preprocessed during the initial importing and cleaning phase, we now only need to make refinements based on our findings from the data exploration. For this reason, we drop all features with a correlation coefficient below 0.05, leaving us with a set of 24 features (excluding 'difficulty' itself).
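A sketch of this refinement step, using the normalized dataframe `df_norm` from the earlier exploration (illustrative name):

```python
# Keep only features whose absolute correlation with 'difficulty' is >= 0.05.
corr_with_target = df_norm.corr()["difficulty"].drop("difficulty")
kept = corr_with_target[corr_with_target.abs() >= 0.05].index.tolist()

df_final = df_norm[kept + ["difficulty"]]
print(f"Kept {len(kept)} features plus the target")
```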
A sample of the final dataframe can be observed below.
acceptance_rate | discuss_count | Array | dislikes | Hash Table | String | Sliding Window | Trie | Depth-first Search | Binary Search | ... | Breadth-first Search | Ordered Map | Heap | Union Find | Backtracking | frequency | Segment Tree | rating | Dynamic Programming | difficulty | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 46.7 | 999 | 1 | 712 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 100.0 | 0 | 97 | 0 | 0 |
1 | 35.7 | 999 | 0 | 2704 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 93.1 | 0 | 81 | 0 | 1 |
2 | 31.5 | 999 | 0 | 714 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 90.9 | 0 | 95 | 0 | 1 |
3 | 31.4 | 999 | 1 | 1486 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 86.2 | 0 | 87 | 0 | 2 |
4 | 30.6 | 999 | 0 | 670 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 84.7 | 0 | 94 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1820 | 88.9 | 15 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 50 | 0 | 0 |
1821 | 67.8 | 104 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 90 | 0 | 0 |
1822 | 71.8 | 135 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 95 | 0 | 1 |
1823 | 47.2 | 134 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 95 | 1 | 1 |
1824 | 28.1 | 48 | 0 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0.0 | 0 | 55 | 0 | 2 |
1825 rows × 23 columns
For the complete version, please check the current notebook.
We chose the K-Nearest Neighbors (KNN) algorithm for its simplicity and effectiveness, particularly in multi-class classification tasks. KNN operates on the principle that similar data points are often in close proximity and classifies a new data point based on the majority vote of its 'k' nearest neighbors. Its ability to handle multi-class classification makes it well-suited for predicting the difficulty levels of LeetCode problems (Easy, Medium, Hard).
For K-nearest neighbors, the hyperparameter we can tune is the 'k' value, i.e., the number of neighbors used for classification. We approached this by building models with k values ranging from 1 to 40 and taking the argmin of the resulting error.
Our finding was that k=12 produced the best result.
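A minimal sketch of this search is shown below, assuming scikit-learn and an existing train/test split (`X_train`, `X_test`, `y_train`, `y_test`); the cross-validation component used in the notebook is omitted here for brevity.

```python
# Sketch of the k search: fit a KNN classifier for each k in 1..40 and keep
# the k with the lowest misclassification rate on the held-out data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

errors = []
for k in range(1, 41):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    errors.append(1 - knn.score(X_test, y_test))

best_k = int(np.argmin(errors)) + 1  # +1 because k starts at 1
print("Best k:", best_k)             # k = 12 in our run
```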
We chose the Decision Tree algorithm for its ease of interpretation and its ability to show which attributes are most important, which isn't always clear with other algorithms. This is beneficial because we can see the hierarchical relationship between the features and thus determine which features are most strongly predictive and which are not. Also, its ability to handle various data types, such as discrete or continuous values, through the use of thresholds makes it suitable for multi-class classification, or in our case, predicting the difficulty levels of LeetCode problems (Easy, Medium, Hard).
For decision trees, we decided to tune the model on the ccp_alpha parameter. This parameter represents the degree of pruning after the decision tree has been built and controls the trade-off between tree complexity and its ability to minimize impurity. The goal is to obtain a decision tree that performs better on unseen data.
We implemented this by computing the model's accuracy score for each candidate ccp_alpha value and taking the argmax to select the best alpha.
The best validation accuracy was achieved for alpha=0.0016001376425733224
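A sketch of this search, assuming scikit-learn and a separate validation split (`X_val`, `y_val` are illustrative names), rather than the notebook's exact code:

```python
# Sketch of the ccp_alpha search: compute the cost-complexity pruning path of
# a fully grown tree, then pick the alpha with the best validation accuracy.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # the last alpha prunes the tree to a single node

val_scores = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    val_scores.append(tree.score(X_val, y_val))

best_alpha = ccp_alphas[int(np.argmax(val_scores))]
print("Best ccp_alpha:", best_alpha)
```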
We have observed that k-nearest neighbors and decision trees both have similar performance. We hope that a neural network will be able to infer more complex patterns that may not be captured by these more simplistic models.
In the case of neural networks there is a wide range of hyperparameters that can be tuned. In order to find the best neural networks, the parameters that we have chosen to tune are:
- Number of layers
- Activation function for all hidden layers
- Learning rate
- Optimizer
- We haven't learned many optimizers beyond SGD but we will include Adam since it is based on SGD
We have chosen not to tune:
- Hidden layers individually (at least for now)
- While by no means authoritative, this stack overflow post seems to reasonably explain that changing the individual layers activation functions is likely to be less worthwhile than simply changing the number of layers or neurons
- Output layer
- Softmax is generally considered to be the best output activation function for multi-class classification problems
- Loss function
- We don't want to use a regression loss function as our problem is a classification problem
- Since we have multiple classes, we should use categorical cross-entropy and not binary cross-entropy.
- Epochs
- We chose not to tune epochs as the keras docs say "It is generally not needed to tune the number of epochs because a built-in callback is passed to model.fit() to save the model at its best epoch evaluated by the validation_data."
Our search space ends up being as follows:
Search space summary
Default search space size: 6
hidden_layers (Int)
{'default': None, 'conditions': [], 'min_value': 2, 'max_value': 6, 'step': 1, 'sampling': 'linear'}
units_0 (Int)
{'default': None, 'conditions': [], 'min_value': 64, 'max_value': 512, 'step': 48, 'sampling': 'linear'}
activation (Choice)
{'default': 'relu', 'conditions': [], 'values': ['relu', 'tanh'], 'ordered': False}
units_1 (Int)
{'default': None, 'conditions': [], 'min_value': 64, 'max_value': 512, 'step': 48, 'sampling': 'linear'}
learning_rate (Float)
{'default': 0.0001, 'conditions': [], 'min_value': 0.0001, 'max_value': 1.0, 'step': None, 'sampling': 'log'}
optimizer (Choice)
{'default': 'adam', 'conditions': [], 'values': ['adam', 'SGD'], 'ordered': False}
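The search space above can be realized with a KerasTuner hypermodel along the following lines. This is a sketch under our assumptions (TensorFlow/Keras with `keras_tuner`, one-hot encoded difficulty labels, and an illustrative `max_trials` value), not the notebook's exact code.

```python
# Sketch of a KerasTuner hypermodel matching the search space summary above.
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential()
    n_hidden = hp.Int("hidden_layers", min_value=2, max_value=6, step=1)
    activation = hp.Choice("activation", ["relu", "tanh"])
    for i in range(n_hidden):
        units = hp.Int(f"units_{i}", min_value=64, max_value=512, step=48)
        model.add(tf.keras.layers.Dense(units, activation=activation))
    model.add(tf.keras.layers.Dense(3, activation="softmax"))  # Easy / Medium / Hard

    lr = hp.Float("learning_rate", min_value=1e-4, max_value=1.0, sampling="log")
    if hp.Choice("optimizer", ["adam", "SGD"]) == "adam":
        optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    else:
        optimizer = tf.keras.optimizers.SGD(learning_rate=lr)

    # Categorical cross-entropy assumes one-hot encoded difficulty labels.
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=20)
# tuner.search(X_train, y_train, validation_data=(X_val, y_val), epochs=50)
```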
After the training, our best model has a loss of 0.7388 and an accuracy of 0.6776. The model is as follows:
Model: Sequential
Layer (type) | Output Shape | Param # |
---|---|---|
dense (Dense) | (None, 352) | 8096 |
dense_1 (Dense) | (None, 160) | 56480 |
dense_2 (Dense) | (None, 64) | 10304 |
dense_3 (Dense) | (None, 64) | 4160 |
dense_4 (Dense) | (None, 3) | 195 |
Total params: 79235 | Trainable params: 79235 | Non-trainable params: 0
After tuning our model to the best 'k' value, we continue by assessing its performance on the unseen test data to see how well it generalizes.
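For reference, the confusion matrices and classification reports in this section can be produced with scikit-learn, roughly as follows (a sketch; `knn` denotes the fitted k=12 model):

```python
# Sketch of the test-set evaluation: one 2x2 confusion matrix per class plus
# the per-class precision/recall/f1 report.
from sklearn.metrics import multilabel_confusion_matrix, classification_report

y_pred = knn.predict(X_test)
print(multilabel_confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```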
Confusion Matrix:
[[[118 17]
[ 32 16]]
[[ 39 49]
[ 21 74]]
[[134 9]
[ 22 18]]]
Classification Report:
precision recall f1-score support
0 0.48 0.33 0.40 48
1 0.60 0.78 0.68 95
2 0.67 0.45 0.54 40
accuracy 0.59 183
macro avg 0.58 0.52 0.54 183
weighted avg 0.59 0.59 0.57 183
To understand our model's learning, we also evaluate its performance on the training set to see how well it has captured the underlying patterns of the data.
Confusion Matrix:
[[[1087 126]
[ 210 219]]
[[ 393 381]
[ 163 705]]
[[1238 59]
[ 193 152]]]
Classification Report:
precision recall f1-score support
0 0.63 0.51 0.57 429
1 0.65 0.81 0.72 868
2 0.72 0.44 0.55 345
accuracy 0.66 1642
macro avg 0.67 0.59 0.61 1642
weighted avg 0.66 0.66 0.64 1642
The fitting graph below shows the relationship between the choice of 'k' in KNN and the corresponding training, testing, and cross-validation errors.
Figure 12. Training, testing and cross validation error based on the value of k for KNN
Figure 13. Training and testing error based on the value for the effective alpha
Below is the best model we found.
Figure 14. Best model for decision trees
Figure 15. Visualization of the decision tree model
Train accuracy: 0.7521315468940317
Test accuracy: 0.6065573770491803
precision recall f1-score support
0 0.53 0.33 0.41 48
1 0.62 0.72 0.66 95
2 0.63 0.68 0.65 40
accuracy 0.61 183
macro avg 0.59 0.57 0.57 183
weighted avg 0.60 0.61 0.59 183
After obtaining our best model, we continue by assessing its performance on the unseen test data to see how well it generalizes.
precision recall f1-score support
0 0.61 0.72 0.66 46
1 0.71 0.71 0.71 98
2 0.68 0.54 0.60 39
accuracy 0.68 183
macro avg 0.67 0.66 0.66 183
weighted avg 0.68 0.68 0.68 183
To understand our model's learning, we also evaluate its performance on the training set to see how well it has captured the underlying patterns of the data.
precision recall f1-score support
0 0.65 0.66 0.66 431
1 0.70 0.77 0.74 865
2 0.78 0.56 0.65 346
accuracy 0.70 1642
macro avg 0.71 0.67 0.68 1642
weighted avg 0.71 0.70 0.70 1642
Figure 16. Training and testing error based on the parameter count
Our fitting graph is unfortunately very disorganized. This is because parameter count is not a great measure of model complexity. However, it gives us some insight into our model, such as the fact that it is slightly overfit.
By comparing the error/accuracy, precision, and recall on the training and test data, we can see that:
- The model does not perform particularly well on either the training or test data; we only achieved accuracies of 0.66 and 0.59, respectively.
- Overall, the model performs better on the training data.
- The model performs best on class '1', which corresponds to "medium" difficulty. This makes sense because medium problems are the most common in the dataset.
- The model's precision is stable (around 0.60 across the three classes), but its recall is not (high for medium problems, low for the others).
Next, we show the fitting graph of training and test error under different k values.
- Overfitting at Low k-values: When 'k' is small (especially at k=1), the model achieves perfect accuracy on the training data, indicative of overfitting. As 'k' increases, the training error rises, and the test error generally decreases, which shows a reduction in overfitting.
- Underfitting at High k-values: Larger 'k' values tend to generalize better to unseen data, reducing test error, but excessively large 'k' can lead to underfitting.
- Cross-Validation: The cross-validation error does not always grow with 'k'. This behavior shows the importance of cross-validation in identifying a 'k' that balances the model between known and unseen data. It also shows the delicate balance between avoiding overfitting with lower 'k' values and the risk of underfitting with higher 'k' values, despite the smaller errors.
Reference used (KNN Introduction): https://www.codecademy.com/learn/introduction-to-supervised-learning-skill-path/modules/k-nearest-neighbors-skill-path/cheatsheet
We have chosen a 'k' of 12 since it shows the most balanced performance between overfitting and underfitting, as evidenced by its superior cross-validation score.
Despite our efforts in fine-tuning 'k', the model's performance through iterative cross-validation on the dataset — marked by accuracies of 0.59 and 0.66 — hasn't met our expectations. This might be due to the fact that there is a relatively weak correlation between our features and the target variable "difficulty". The exploratory analysis, particularly the heatmap, showed that even the most correlated feature, "acceptance_rate", only has a correlation coefficient of -0.39. This is likely a significant contributor to the model's underwhelming performance. Additionally, the disparity between the number of problems of different difficulty levels may be contributing.
Strategies for Improvement
Given the model's current limitations, we propose some approaches to improve its accuracy:
- Extended 'k' Range: Broadening the range of 'k' values beyond the initial 1 to 40, we may find a more effective 'k' that could improve the model's performance. However, it is crucial to balance the risk of overfitting with smaller 'k' values against underfitting with larger 'k' values.
- Data Preprocessing Refinement: Revisiting our approach to preprocessing, which is currently centered around MinMax Scaling, could be beneficial. Exploring alternative scaling and standardization techniques could reveal new patterns and correlations that might help the model's performance.
- Oversampling: Sampling with replacement to create a training data set with an equal number of Easy, Medium, and Hard problems might be worth exploring; see the sketch after this list.
As we progress to subsequent models, we hope that we can achieve better performance.
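As a rough illustration of the oversampling idea (not implemented in our notebook), each class could be resampled with replacement up to the size of the largest class; `X_train`/`y_train` are assumed to be pandas objects with `difficulty` as the target name.

```python
# Sketch of oversampling the training set so Easy, Medium, and Hard are balanced.
import pandas as pd
from sklearn.utils import resample

train = pd.concat([X_train, y_train], axis=1)
target_size = train["difficulty"].value_counts().max()

balanced = pd.concat([
    resample(group, replace=True, n_samples=target_size, random_state=0)
    for _, group in train.groupby("difficulty")
])
X_train_bal = balanced.drop(columns=["difficulty"])
y_train_bal = balanced["difficulty"]
```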
Note that when the training accuracy score equals 1, the model is overfitting.
- Overfitting at low alpha values: When ccp_alpha is small, indicating less pruning of the decision tree, the model achieves perfect training accuracy, which indicates overfitting.
- Cost Complexity Pruning: From the fitting graph, the ccp_alpha value that gives the highest accuracy on the testing dataset is 0.0016001376425733224.
- Model Comparison: Compared to the previous model's fitting graph, it behaves similarly: early values of alpha/k overfit, and error is higher at later values.
We observe that the most useful features for classification are the ones being compared in the top layers of the tree. In the top 3 layers, the only features used are discuss_count, acceptance_rate, and rating. In the lower layers, frequency, acceptance_rate, and many question topics (Hash Map, Union Find, etc) are used.
Looking at the tree as a whole, we find a lot of nodes of class 1, perhaps because our encoding places class 1 between the other classes, causing many predictions of class 1. This aligns with about half of the samples being class 1, as well as class 1 having relatively fewer false negatives (precision < recall).
The first decision separates more than half of the class 2 observations from the set along with around a fourth of the class 1 observations and very few class 0 observations. This suggests that the classes do somewhat fall on a spectrum with respect to the selected features (it is easier to separate class 0 and 2 rather than 0 and 1 or 1 and 2). This is supported by the fact that there is only one instance of two sibling nodes being class 0 and 2; every other pair is 0/1 or 1/2.
Now we conduct a comparative analysis of the model's performance across the training and the testing sets to understand how well the model can generalize beyond the data it was trained on.
- Performance Overview: The accuracy scores indicate that the model's performance is moderate, with a training accuracy of 0.75 and a testing accuracy of 0.61. There is again room for improvement in the model's ability to predict unseen data accurately, although it performed slightly better than the previous model. As expected, the model performs better on the training set.
- Class-wise Performance: The model performs better on the 'medium' and 'hard' difficulty levels than on the 'easy' level. This could be because of the low prevalence of the 'easy' difficulty class compared to 'medium' and 'hard' in the dataset.
- Precision and Recall: The model's precision across the 3 classes is relatively stable at around 0.60. The recall, however, is not as stable, being lower for easy questions and higher for medium and hard questions. The features do not capture the difference between easy and medium questions very well, so the model frequently classifies easy questions as medium.
To conclude our second model, our decision tree performed mostly the same as the first model; achieving high recall on easy problems remains difficult, while recall is more stable for medium and hard problems.
Strategies for Improvement
Some strategies that we have not implemented but that would likely improve the decision tree's performance are bagging and boosting; a sketch follows below. This model revealed the most important features, and obtaining more accurate data on discuss_count could be useful, along with possible feature expansion.
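A quick sketch of how these ensemble ideas could be tried with scikit-learn (not part of our notebook; the hyperparameters are placeholders):

```python
# Bagging (random forest) and boosting (gradient boosting) on the same split.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

bagged = RandomForestClassifier(n_estimators=200, random_state=0)
bagged.fit(X_train, y_train)
print("Random forest test accuracy:", bagged.score(X_test, y_test))

boosted = GradientBoostingClassifier(random_state=0)
boosted.fit(X_train, y_train)
print("Gradient boosting test accuracy:", boosted.score(X_test, y_test))
```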
We can now compare the training and testing performance of our Neural Network against the previous models.
- Training vs Testing: The training accuracy was 70% and the testing accuracy was 68%. This implies that our model is slightly overfit, as the training accuracy is higher than the testing accuracy. The precision and recall on the testing set are also lower than on the training set.
- Neural Network vs KNN: Comparing training accuracy, our neural network performs about 4% better than our KNN model; comparing testing accuracy, it performs about 9% better. While not a huge difference, it is a small improvement.
- Neural Network vs Decision Tree: Comparing training accuracy, our neural network performs about 3% worse than our decision tree model. However, comparing testing accuracy, it performs about 7% better. This shows how the decision tree model tends to overfit the data.
In general, the hypertuned neural network seems to predict the difficulty of LeetCode problems more accurately. However, there is a tradeoff: the neural network is a much less transparent model than a decision tree.
Our Neural Network model outperformed our KNN and Decision Tree models, but only barely. With accuracies of 68% on testing data and 70% on training data, our model represents about a 7% improvement over the previous ones.
We also need to acknowledge that Neural Networks are much more opaque than Decision Trees or KNN models. If somebody wants to understand why a problem is labeled easy, medium, or hard, our Neural Network is unlikely to provide great insight.
Strategies for Improvement
- More rigorous hypertuning: For the sake of training time (and not losing our minds), we limited the parameters we were hypertuning. In theory, there could be a model with more layers, more neurons, and different activation functions that performs significantly better. Furthermore, we chose the RandomizedSearch tuner for the sake of time, but GridSearch might give better results.
- Better data/pre-processing: As the saying goes, garbage in -> garbage out. While our models perform decently, we are unlikely to get much better performance without more and better data. An important aspect of a LeetCode problem is the problem description; however, for simplicity's sake we chose to drop it. If we were to process this data, we would likely improve our model's accuracy significantly.
In this study, we explored three different machine learning models — K-Nearest Neighbors (KNN), Decision Tree, and Neural Network — to predict the difficulty level of LeetCode problems.
Having analyzed each of the different models, a conclusion that we are not excited to arrive at is that our dataset is not ideal for analysis. We were not able to obtain high accuracy with any of the models; the maximum accuracy was achieved by the neural network, at 68% on testing data and 70% on training data. Although this accuracy is not as high as we would like, we note that it is better than chance, which in our case would be randomly guessing one of three classes, giving an accuracy of approximately 33%.
We believe that a reason behind this could be the imbalanced distribution of the data, with the majority of problems falling into the medium difficulty category. This imbalance might have led to skewed model predictions and hindered their ability to accurately classify instances across all difficulty levels.
Looking ahead, we recognize the need for improvements to enhance the accuracy of our predictive models. Implementing the strategies outlined in the discussion sections, such as refining data preprocessing and sampling techniques, expanding our hyperparameter tuning, and incorporating natural language processing, holds promise for improving model performance. Additionally, acquiring a more comprehensive dataset with a balanced distribution of problem difficulties could yield more accurate and reliable predictions.
In summary, while our initial findings may not be as promising as hoped, we remain optimistic about the potential for future improvements.
- Ana Maria Sampedro Barrera: Writer: Brainstormed potential datasets and projects with the team. Attended all team meetings. Worked on describing the data for milestone #1 as well as writing and reorganizing the contents for the final report.
- David Choi: Team Leader: Acted as project leader in most meetings. Worked on data exploration, producing pairplots and small correlation matrices. Worked on data preprocessing, dropping columns and one-hot encoding related topics. Worked on the model 3 neural network and the hyperparameter tuner. Created the fitting graph. Worked on the final readme file.
- Isabelle Krochmal: Programmer: Brainstormed potential datasets and projects with the team. Attended team meetings and led one meeting when the usual team leader was not available. Performed initial data preprocessing: for example, decided on irrelevant or unusable features to drop and manipulated lists of companies into quantitative features usable in the model. Worked on milestone 3: KNN. Researched and reviewed code with the team. Interpreted the results of the model and used them to explain our reasoning for which two models we would create next.
- Evie Stergiopoulou: Programmer/Writer: Milestone 2: conducted data exploration analyzing problem counts, the distribution of problem difficulty levels, and the distribution of topics by difficulty level. Refined the data evaluation and preprocessing section and split it into subsections. Provided an overview for each section, ensured consistent formatting, and elaborated on writing and code that needed clarification. Milestone 3: refined all writing aspects and added code comments to the KNN model section. Updated the README with current information.
- James Rey Macasa: Programmer: Did data visualization by programming the heatmap showing the correlation between different features. Attended most meetings, helping brainstorm potential models to use and datasets to work on. Worked on the milestone 3 model: KNN. Helped review model code and results, asked TAs for advice on the model, and wrote potential improvements that could be made.
- Rui Li: Programmer: Milestone 1: wrote the temporary abstract. Milestone 2: did GitHub initialization, participated in data exploration, notebook reorganization, and writing the readme. Milestone 3: trained and evaluated the KNN model, recorded the fitting graph, drafted conclusions, and edited the readme.
- Jinwoong Huh: Programmer: Did initial data preprocessing and one-hot encoding of the data. Worked on coding the decision tree model, as well as the decision tree and fitting graph visualizations. Collaborated with teammates on the observation writeup and readme for the DT model.
- Andrew Jia: Writer: Contributed to early feature selection. Worked on model 2 (decision tree), mostly writing, formatting, and converting from ipynb to readme. This includes writing observations from the decision tree visualization and the conclusion of milestone 4.
- Ojasvi Tewari: Programmer: Contributed to the discussion of possible datasets and helped with the abstracts. Worked on model 3 by setting up the code, training the hypertuned model, and training the K-fold hypertuned model.
- Lucas Xu: Programmer: Worked on the model 2 decision tree, coding with the team. Met with the team and discussed project ideas. Coded the decision tree training and did writeups for the decision tree conclusion and classification report.