@@ -84,12 +84,70 @@ file.
## Evaluating an Implementation

- Coming soon...
+ RubyTuner provides an `evaluate` command that scores an implementation of a
+ feature against the feature's original implementation. This command is
+ useful for testing the output of fine-tuned models or for comparing different
+ implementations.
+
+ ### Usage
+
+ ```bash
+ ruby_tuner evaluate FEATURE_ID [IMPLEMENTATION]
+ ```
+
+ **Parameters:**
+
+ - `FEATURE_ID`: The ID of the feature to evaluate (required).
+ - `IMPLEMENTATION`: The implementation to evaluate, as inline code (optional; it can instead come from `--file` or standard input).
+
+ **Options:**
+
+ * `--similarity-method METHOD`: Specify the similarity method to use (`tf_idf` or `exact`; default: `tf_idf`).
+ * `--acceptance-score SCORE`: Set the minimum similarity score that passes evaluation (default: `0.8`).
+ * `--file PATH`: Specify a file containing the implementation to evaluate.
+
+ ### Examples
+
+ Evaluate an inline implementation:
+
+ ```bash
+ ruby_tuner evaluate sort-array "def sort_array(arr); arr.sort; end"
+ ```
+
+ Evaluate an implementation from a file:

```bash
- ruby_tuner evaluate your-feature-description
+ ruby_tuner evaluate sort-array --file ./implementations/sort_array.rb
+ ```
+
+ Evaluate an implementation from standard input:
+
+ ```bash
+ echo "def sort_array(arr); arr.sort; end" | ruby_tuner evaluate sort-array
+ ```
+
+ Use a different similarity method and acceptance score:
+
+ ```bash
+ ruby_tuner evaluate sort-array --similarity-method exact --acceptance-score 0.9 "def sort_array(arr); arr.sort; end"
```

+ ### How it works
+
+ The `evaluate` command compares the provided implementation with the original
+ implementation stored in the feature's directory. It uses the specified
+ similarity method to calculate a similarity score, and the implementation
+ passes if that score meets the acceptance score (`--acceptance-score`).
+
+ This command is particularly useful for:
+
+ * Assessing the quality of generated code from fine-tuned models
+ * Comparing different implementations of the same feature
+ * Validating machine-generated code against human-written implementations
+
+ The evaluation results, including similarity scores and pass/fail status, are
+ displayed in the console output.
+
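As a rough illustration of the scoring described above, the following Ruby sketch shows one way an `exact` and a `tf_idf` comparison could each produce a score that is then checked against an acceptance score. It is not RubyTuner's code: the method names (`tokenize`, `tf_idf_vectors`, `cosine`, `similarity`, `evaluate_implementation`), the smoothed IDF formula, the whitespace stripping in `exact` mode, and the `>=` comparison are all assumptions made for this example.

```ruby
# Hypothetical sketch; none of these names come from RubyTuner itself.

# Split source text into lowercase word-like tokens.
def tokenize(text)
  text.downcase.scan(/[a-z0-9_]+/)
end

# Build one TF-IDF weight hash (term => weight) per document.
# Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, so that terms
# shared by every document still carry weight.
def tf_idf_vectors(docs)
  tokenized = docs.map { |doc| tokenize(doc) }
  n = tokenized.size.to_f
  df = Hash.new(0)
  tokenized.each { |tokens| tokens.uniq.each { |term| df[term] += 1 } }
  tokenized.map do |tokens|
    tokens.tally.to_h do |term, count|
      [term, count * (Math.log((1 + n) / (1 + df[term])) + 1.0)]
    end
  end
end

# Cosine similarity between two sparse weight hashes.
def cosine(a, b)
  dot = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm = ->(vector) { Math.sqrt(vector.values.sum { |weight| weight * weight }) }
  denominator = norm.(a) * norm.(b)
  denominator.zero? ? 0.0 : dot / denominator
end

# Return a score between 0.0 and 1.0 for the chosen method.
def similarity(original, candidate, method: "tf_idf")
  case method
  when "exact"
    # Assumption: surrounding whitespace is ignored in exact mode.
    original.strip == candidate.strip ? 1.0 : 0.0
  when "tf_idf"
    a, b = tf_idf_vectors([original, candidate])
    cosine(a, b)
  else
    raise ArgumentError, "unknown similarity method: #{method}"
  end
end

# Assumption: an implementation passes when score >= acceptance_score.
def evaluate_implementation(original, candidate, method: "tf_idf", acceptance_score: 0.8)
  score = similarity(original, candidate, method: method)
  { score: score, passed: score >= acceptance_score }
end

original  = "def sort_array(arr); arr.sort; end"
candidate = "def sort_array(arr); arr.sort_by { |x| x }; end"
result = evaluate_implementation(original, candidate)
puts format("score=%.2f passed=%s", result[:score], result[:passed])
```

Under these assumptions, `exact` can only yield `0.0` or `1.0`, so any acceptance score above zero behaves the same in that mode, while `tf_idf` rewards shared vocabulary even when the code is structured differently.
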
## Generating Training Data

Coming soon...