---
title: A/B testing long-form readability
description: a log of experiments done on the site design, intended to render pages more readable
created: 16 Jun 2012
tags: experiments, statistics, computer science, meta
status: in progress
belief: possible
...
> To gain some statistical & web development experience and to improve my readers' experiences, I have been running a series of CSS A/B tests since June 2012. As expected, most do not show any meaningful difference.
# Background
- https://www.google.com/analytics/siteopt/exptlist?account=18912926
- http://www.pqinternet.com/196.htm
- https://support.google.com/websiteoptimizer/bin/answer.py?hl=en&answer=61203 "Experiment with site-wide changes"
- https://support.google.com/websiteoptimizer/bin/answer.py?hl=en&answer=117911 "Working with global headers"
- https://support.google.com/websiteoptimizer/bin/answer.py?hl=en-GB&answer=61427
- https://support.google.com/websiteoptimizer/bin/answer.py?hl=en&answer=188090 "Varying page and element styles" - testing with inline CSS overriding the defaults
- http://stackoverflow.com/questions/2993199/with-google-website-optimizers-multivariate-testing-can-i-vary-multiple-css-cl
- http://www.xemion.com/blog/the-secret-to-painless-google-website-optimizer-70.html
- http://stackoverflow.com/tags/google-website-optimizer/hot
# Problems with "conversion" metric
https://support.google.com/websiteoptimizer/bin/answer.py?hl=en-AU&answer=74345 "Time on page as a conversion goal" - every page can 'convert' if the visitor stays past a timeout (mine is 40 seconds). Problem: dichotomizing a continuous variable into a single binary variable destroys a massive amount of information. This is well-known in the statistical and psychological literature (eg. [MacCallum et al 2002](http://www.psychology.sunysb.edu/attachment/measures/content/maccallum_on_dichotomizing.pdf "On the Practice of Dichotomization of Quantitative Variables")), but I'll illustrate further with some information-theoretic observations.
According to my Analytics, the mean reading time (time on page) is 1:47, the maximum bracket, hit by 1% of viewers, is 1801 seconds, and the range 1-1801 takes <10.8 bits to encode (`log2(1801) ~> 10.81`), hence each page view could be represented by <10.8 bits (less, since reading time is so highly skewed). But if we dichotomize, then we learn simply that ~14% of readers will read for at least 40 seconds, hence each reader carries not the ~6 bits of the full (skewed) reading-time distribution, nor even 1 bit (if 50% read that long), but only ~0.58 bits:
~~~{.R}
R> p=0.14; q=1-p; (-p*log2(p) - q*log2(q))
[1] 0.5842
~~~
This isn't even an efficient dichotomization: we could improve the fractional bit to 1 bit if we could somehow dichotomize at 50% of readers:
~~~{.R}
R> p=0.50; q=1-p; (-p*log2(p) - q*log2(q))
[1] 1
~~~
But unfortunately, simply lowering the timeout will have minimal returns, as Analytics also reports that 82% of readers spend 0-10 seconds on pages. So we are stuck with a severe loss.
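To put a rough number on the loss, here is a minimal sketch (the log-normal shape and its parameters are made up purely for illustration; only the 1801-second cap comes from the Analytics bracket above):

~~~{.R}
## hypothetical skewed reading times, discretized to whole seconds & capped:
set.seed(2012)
times <- pmin(round(rlnorm(100000, meanlog=log(10), sdlog=1.5)), 1801)
## entropy of the full discretized distribution:
p <- table(times) / length(times)
sum(-p * log2(p))
## versus the entropy of the 40-second dichotomization:
p40 <- mean(times >= 40); q40 <- 1 - p40
-p40 * log2(p40) - q40 * log2(q40)
~~~

Under these made-up parameters, the full distribution carries several bits per visitor, while the dichotomized version carries well under 1 bit.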
# Ideas for testing
- JS:
    - Disqus comments
- CSS:
    - differences from Readability
    - every declaration in `default.css`?
    - test the suggestions in https://code.google.com/p/better-web-readability-project/ & http://www.vcarrer.com/2009/05/how-we-read-on-web-and-how-can-we.html
- donations:
    - placement: left, right, bottom
    - donation text: "help pay for hosting"; "help sponsor X experiment"; Xah's text ("did you find this article useful?")
# Testing
## `max-width`
A CSS property: sets how wide the page may grow, in pixels, when unlimited screen real estate is available. I noticed some people complained that pages were 'too wide', making them hard to read - which apparently is a real concern, since lines are supposed to fit within comfortable eye saccades. So I tossed 800px, 900px, 1300px, and 1400px into the first A/B test.
~~~{.Html}
<!-- Google Website Optimizer Control Script -->
<script>
function utmx_section(){}function utmx(){}
(function(){var k='0520977997',d=document,l=d.location,c=d.cookie;function f(n){
if(c){var i=c.indexOf(n+'=');if(i>-1){var j=c.indexOf(';',i);return escape(c.substring(i+n.
length+1,j<0?c.length:j))}}}var x=f('__utmx'),xx=f('__utmxx'),h=l.hash;
d.write('<sc'+'ript src="'+
'http'+(l.protocol=='https:'?'s://ssl':'://www')+'.google-analytics.com'
+'/siteopt.js?v=1&utmxkey='+k+'&utmx='+(x?x:'')+'&utmxx='+(xx?xx:'')+'&utmxtime='
+new Date().valueOf()+(h?'&utmxhash='+escape(h.substr(1)):'')+
'" type="text/javascript" charset="utf-8"></sc'+'ript>')})();
</script>
<!-- End of Google Website Optimizer Control Script -->
<!-- Google Website Optimizer Tracking Script -->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['gwo._setAccount', 'UA-18912926-2']);
_gaq.push(['gwo._trackPageview', '/0520977997/test']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www')
+ '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
<!-- End of Google Website Optimizer Tracking Script -->
<!-- Google Website Optimizer Tracking Script -->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['gwo._setAccount', 'UA-18912926-2']);
setTimeout(function() {
_gaq.push(['gwo._trackPageview', '/0520977997/goal']);
}, 40000);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') +
'.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
<!-- End of Google Website Optimizer Tracking Script -->
<script>utmx_section("max width")</script>
<style type="text/css">
body { max-width: 800px; }
</style>
</noscript>
~~~
It ran from mid-June to 1 August 2012. Unfortunately, I cannot be more specific: on 1 August, Google deleted Website Optimizer and told everyone to use 'Experiments' in Google Analytics - and *deleted all my information*. The graph over time, the exact numbers - all gone. So this is from memory.
The results were initially very promising: 'conversion' was defined as staying on a page for 40 seconds (I reasoned that this meant someone was actually reading the page), and had a base of around 70% of readers converting. With a few hundred hits, 900px converted at 10-20% more than the default! I was ecstatic. So when it began falling, I was only a little bothered (one had to expect some regression to the mean since the results were too good to be true). But as the hits increased into the low thousands, the effect kept shrinking all the way down to 0.4% improved conversion. At some points, 1300px actually exceeded 900px.
The second distressing thing was that Google's estimated chance of a particular intervention beating the default (which I believe is a Bonferroni-corrected _p_-value) did not increase! Even as each version received 20,000 hits, the chance stubbornly bounced around the 70-90% range for 900px and 1300px. This remained true all the way to the bitter end: each version racked up 93,000 hits *and the estimate still hovered around 80%*. Wow.
Ironically, I was warned at the beginning about both of these behaviors by papers I had read on large-scale corporate A/B testing (http://www.exp-platform.com/Documents/puzzlingOutcomesInControlledExperiments.pdf, http://www.exp-platform.com/Documents/controlledExperimentDMKD.pdf, & http://www.exp-platform.com/Documents/2013%20controlledExperimentsAtScale.pdf). They cover at length how many apparent trends simply evaporate, and also a peculiar phenomenon where A/B tests fail to converge even after being run on ungodly amounts of data, because the standard deviations keep changing (the user composition keeps shifting, rendering previous data more uncertain). And it is a general phenomenon that even large correlations will bounce around a lot before the trend stabilizes ([Schönbrodt & Perugini 2013](http://www.psy.lmu.de/allg2/download/schoenbrodt/pub/stable_correlations.pdf "At what sample size do correlations stabilize?")).
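The evaporation is easy to reproduce in simulation; a quick sketch (with a made-up 14% conversion rate) tracking the running difference between two *identical* variants:

~~~{.R}
## two variants with the same true conversion rate; watch the observed
## difference in running conversion rates shrink as traffic accumulates:
set.seed(2012)
n <- 20000; p <- 0.14
a <- cumsum(rbinom(n, 1, p)) / seq_len(n)
b <- cumsum(rbinom(n, 1, p)) / seq_len(n)
round((a - b)[c(100, 1000, 5000, 20000)] * 100, 2) # in percentage points
~~~

Early differences of multiple percentage points are routine under the null, and mean nothing.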
Oy vey! When I discovered Google had deleted my results, I decided to simply switch to 900px. Running a new test would not provide any better answers.
## TODO
- how about a blue background?
- see http://www.overcomingbias.com/2010/06/near-far-summary.html for more design ideas
- table striping:

    ~~~{.Css}
    tbody tr:hover td { background-color: #f5f5f5;}
    tbody tr:nth-child(odd) td { background-color: #f9f9f9;}
    ~~~

- link decoration:

    ~~~{.Css}
    a { color: black; text-decoration: underline;}
    a { color:#005AF2; text-decoration:none; }
    ~~~
# Resumption: ABalytics
In March 2013, I decided to give A/B testing another whack. Google Analytics Experiments did not seem to have improved, and the commercial services continued to charge unacceptable prices, so I gave the Google Analytics custom-variable integration approach another try using [ABalytics](https://github.com/danmaz74/ABalytics). The usual puzzling, debugging, and frustration of combining so many disparate technologies (HTML *and* CSS *and* JS *and* Google Analytics) aside, it seemed to work on my test page. The remaining downsides are that the ABalytics approach may be fragile, and that the UI in GA is awful (you have to do the statistics yourself).
## `max-width` redux
The test case is to rerun the `max-width` test and finish it.
### Implementation
The exact changes:
~~~{.Diff}
Sun Mar 17 11:25:39 EDT 2013 gwern@gwern.net
* default.html: setup ABalytics a/b testing https://github.com/danmaz74/ABalytics
(hope this doesn't break anything...)
addfile ./static/js/abalytics.js
hunk ./static/js/abalytics.js 1
...
hunk ./static/templates/default.html 28
+ <!-- override CSS with a/b test -->
+ <div class="maxwidth_class1"></div>
+
...
- <noscript><p>Enable JavaScript for Disqus comments</p></noscript>
+ window.onload = function() {
+ ABalytics.applyHtml();
+ };
+ </script>
hunk ./static/templates/default.html 119
+
+ ABalytics.init({
+ maxwidth: [
+ {
+ name: '800',
+ "maxwidth_class1": "<style>body { max-width: 800px; }</style>",
+ "maxwidth_class2": ""
+ },
+ {
+ name: '900',
+ "maxwidth_class1": "<style>body { max-width: 900px; }</style>",
+ "maxwidth_class2": ""
+ },
+ {
+ name: '1100',
+ "maxwidth_class1": "<style>body { max-width: 1100px; }</style>",
+ "maxwidth_class2": ""
+ },
+ {
+ name: '1200',
+ "maxwidth_class1": "<style>body { max-width: 1200px; }</style>",
+ "maxwidth_class2": ""
+ },
+ {
+ name: '1300',
+ "maxwidth_class1": "<style>body { max-width: 1300px; }</style>",
+ "maxwidth_class2": ""
+ },
+ {
+ name: '1400',
+ "maxwidth_class1": "<style>body { max-width: 1400px; }</style>",
+ "maxwidth_class2": ""
+ }
+ ],
+ }, _gaq);
+
~~~
### Results
I wound up the test on 17 April 2013 with the following results:
Width (px)   Visits    Conversion
----------   -------   ----------
1100         18,164    14.49%
1300         18,071    14.28%
1200         18,150    13.99%
800          18,599    13.94%
900          18,419    13.78%
1400         18,378    13.68%
(total)      109,772   14.03%
### Analysis
1100px is close to the ~1000px that my original A/B test indicated as the leading candidate, which gives me additional confidence, as does the observation that 1300px and 1200px are the other leading candidates. (Curiously, the site conversion average before was 13.88%; perhaps my underlying traffic changed slightly around the time of the test? This would demonstrate why alternatives need to be tested simultaneously.) A quick and dirty R test of 1100px vs 1300px (`prop.test(c(2632,2581),c(18164,18071))`) indicates the difference isn't statistically-significant (_p_=0.58), so we might want more data; worse, there is no clear linear relation between conversion and width (the plot is erratic, and a linear fit gives a dismal _p_=0.89):
~~~{.R}
rates <- read.csv(stdin(),header=TRUE)
Width,N,Rate
1100,18164,0.1449
1300,18071,0.1428
1200,18150,0.1399
800,18599,0.1394
900,18419,0.1378
1400,18378,0.1368
rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes
g <- glm(cbind(Successes,Failures) ~ Width, data=rates, family="binomial")
...Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82e+00 4.65e-02 -39.12 <2e-16
Width 5.54e-06 4.10e-05 0.14 0.89
# not much better:
rates$Width <- as.factor(rates$Width)
rates$Width <- relevel(rates$Width, ref="900")
g2 <- glm(cbind(Successes,Failures) ~ Width, data=rates, family="binomial"); summary(g2)
~~~
But I want to move on to the next test, and by the same logic, it is highly unlikely that the difference between them is large or much in 1300px's favor. (This is the kind of mistake I care about: switching between 2 equivalent choices doesn't matter, but missing out on an improvement *does* matter - minimizing β/maximizing power, not minimizing α.)
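(For a sense of what differences a sample like this can even resolve, a quick power-analysis sketch using the realized per-arm traffic: since `p2` is left unspecified, `power.prop.test` solves for it, and the answer comes out on the order of a percentage point - so sub-percentage-point differences like those above are essentially invisible at this scale.)

~~~{.R}
## smallest alternative rate reliably distinguishable from the ~14% base rate
## at 80% power, given ~18k visits per arm (solves for p2):
power.prop.test(n=18150, p1=0.14, power=0.80, sig.level=0.05)
~~~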
## Fonts
The _New York Times_ ran [an informal online experiment](http://opinionator.blogs.nytimes.com/2012/08/08/hear-all-ye-people-hearken-o-earth/ "Hear, All Ye People; Hearken, O Earth (Part One)") with a large number of readers (_n_=60750) and found that the [Baskerville](!Wikipedia) font led to more readers agreeing with a short text passage - this seems plausible enough given their very large sample size and Wikipedia's note that "The refined feeling of the typeface makes it an excellent choice to convey dignity and tradition."
### Power analysis
Would this font work its magic on `gwern.net` too? Let's see. The sample size is quite manageable, as over a month I will easily have 60k visits, and they tested 6 fonts, expanding their necessary sample. What sample size do I actually need? Their professor estimates the effect size of Baskerville at 1.5%; I would like my A/B test to have very high statistical power (0.9) and reach more stringent statistical-significance (_p_<0.01) so I can go around and in good conscience tell people to use Baskerville. I already know the average "conversion rate" is ~13%, so I get this power calculation:
~~~{.R}
power.prop.test(p1=0.13+0.015, p2=0.13, power=0.90, sig.level=0.01)
Two-sample comparison of proportions power calculation
n = 15683
p1 = 0.145
p2 = 0.13
sig.level = 0.01
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
~~~
15000 visitors in each group seems reasonable; at ~16k visitors a week, that suggests a few weeks of testing. Of course I'm testing 4 fonts (see below), but that still fits in the ~2 months I've allotted for this test.
### Implementation
I had previously drawn on the NYT experiment for my site design:
~~~{.Css}
html {
...
font-family: Georgia, "HelveticaNeue-Light", "Helvetica Neue Light", "Helvetica Neue", Helvetica,
Arial, "Lucida Grande", garamond, palatino, verdana, sans-serif;
}
~~~
I had not used Baskerville but [Georgia](!Wikipedia "Georgia (typeface)") since Georgia seemed similar and was convenient, but we'll fix that now. Besides Baskerville & Georgia, we'll omit [Comic Sans](!Wikipedia) (of course), but we can try [Trebuchet](!Wikipedia "Trebuchet MS") for a total of 4 fonts (falling back to Georgia):
~~~{.Html}
hunk ./static/templates/default.html 28
+ <!-- override CSS with a/b test -->
+ <div class="fontfamily_class1"></div>
...
hunk ./static/templates/default.html 121
+ fontfamily: [
+ {
+ name: 'Baskerville',
+ "fontfamily_class1": "<style>html { font-family: Baskerville, Georgia; }</style>",
+ "fontfamily_class2": ""
+ },
+ {
+ name: 'Georgia',
+ "fontfamily_class1": "<style>html { font-family: Georgia; }</style>",
+ "fontfamily_class2": ""
+ },
+ {
+ name: 'Trebuchet',
+ "fontfamily_class1": "<style>html { font-family: 'Trebuchet MS', Georgia; }</style>",
+ "fontfamily_class2": ""
+ },
+ {
+ name: 'Helvetica',
+ "fontfamily_class1": "<style>html { font-family: Helvetica, Georgia; }</style>",
+ "fontfamily_class2": ""
+ }
+ ],
~~~
### Results
Running from 14 April 2013 to 16 June 2013:
Font          Type    Visits    Conversion
-----------   -----   -------   ----------
Trebuchet     sans    35,473    13.81%
Baskerville   serif   36,021    13.73%
Helvetica     sans    35,656    13.43%
Georgia       serif   35,833    13.31%
(all)         sans    71,129    13.62%
(all)         serif   71,854    13.52%
(total)               142,983   13.57%
The sample size for each font is 20k higher than I projected due to the enormous popularity of [an analysis of the lifetimes of Google services](Google shutdowns) I finished during the test. Regardless, it's clear that the results - with double the total sample size of the NYT experiment, focused on fewer fonts - are disappointing and there seems to be very little difference between fonts.
### Analysis
Picking the most extreme difference, between Trebuchet and Georgia, the difference is close to the usual definition of statistical-significance:
~~~{.R}
R> prop.test(c(0.1381*35473,0.1331*35833),c(35473,35833))
2-sample test for equality of proportions with continuity correction
data: c(0.1381 * 35473, 0.1331 * 35833) out of c(35473, 35833)
X-squared = 3.76, df = 1, p-value = 0.0525
alternative hypothesis: two.sided
95% confidence interval:
-5.394e-05 1.005e-02
sample estimates:
prop 1 prop 2
0.1381 0.1331
~~~
Which naturally implies that the much smaller difference between Trebuchet and Baskerville is not statistically-significant:
~~~{.R}
R> prop.test(c(0.1381*35473,0.1373*36021), c(35473,36021))
2-sample test for equality of proportions with continuity correction
data: c(0.1381 * 35473, 0.1373 * 36021) out of c(35473, 36021)
X-squared = 0.0897, df = 1, p-value = 0.7645
alternative hypothesis: two.sided
95% confidence interval:
-0.00428 0.00588
~~~
Since there are only small differences between individual fonts, I wondered if there might be a difference between the two sans-serifs and the two serifs. If we lump the 4 fonts into those 2 categories and look at the small difference in mean conversion rate:
~~~{.R}
R> prop.test(c(0.1362*71129,0.1352*71854), c(71129,71854))
2-sample test for equality of proportions with continuity correction
data: c(0.1362 * 71129, 0.1352 * 71854) out of c(71129, 71854)
X-squared = 0.2963, df = 1, p-value = 0.5862
alternative hypothesis: two.sided
95% confidence interval:
-0.002564 0.004564
~~~
Nothing doing there either. More generally:
~~~{.R}
rates <- read.csv(stdin(),header=TRUE)
Font,Serif,N,Rate
Trebuchet,FALSE,35473,0.1381
Baskerville,TRUE,36021,0.1373
Helvetica,FALSE,35656,0.1343
Georgia,TRUE,35833,0.1331
rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes
g <- glm(cbind(Successes,Failures) ~ Font, data=rates, family="binomial"); summary(g)
...Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.83783    0.01531 -120.05   <2e-16
FontGeorgia   -0.03608    0.02182   -1.65     0.10
FontHelvetica -0.02545    0.02181   -1.17     0.24
FontTrebuchet  0.00671    0.02171    0.31     0.76
~~~
With essentially no meaningful differences between conversion rates, this suggests that however fonts matter, they don't matter for reading duration. So I feel free to pick the font that appeals to me visually, which is Baskerville.
## Line height
I have seen complaints that lines on `gwern.net` are "too closely spaced" or "run together" or "cramped", referring to the [line height](!Wikipedia "Leading") (the CSS property `line-height`). I set the CSS to `line-height: 150%;` to deal with this objection, but this was a simple hack based on rough eyeballing of it, and it was done before I changed the `max-width` and `font-family` settings after the previous testing. So it's worth testing some variants.
Most web-design guides seem to suggest a safe default of 120%, rather than my current 150%. If we tried every 10-percentage-point step plus one on the outside, that'd give us 110/120/130/140/150/160, or 6 options, which, combined with the expected small effect, would require an unreasonable sample size (and I have nothing in the pipeline I expect might catch fire like the Google analysis and deliver an excess >50k visits). So I'll try just 120/130/140/150, and schedule a similar block of time as fonts (ending the experiment on 16 August 2013, with presumably >70k datapoints).
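(To check that intuition about the required sample size, a sketch: solving for the _n_ per arm needed to detect a hypothetical half-percentage-point improvement over a ~15% base rate gives a figure on the order of 80,000 visits per arm - far beyond what 4 arms over 2 months will see - so only larger effects are realistically detectable here.)

~~~{.R}
## n per arm to detect a (hypothetical) 0.5 percentage-point improvement:
power.prop.test(p1=0.150, p2=0.155, power=0.80, sig.level=0.05)
~~~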
### Implementation
~~~{.Html}
hunk ./static/templates/default.html 30
- <div class="fontfamily_class1"></div>
+ <div class="linewidth_class1"></div>
hunk ./static/templates/default.html 156
- fontfamily:
+ linewidth:
hunk ./static/templates/default.html 158
- name: 'Baskerville',
- "fontfamily_class1": "<style>html { font-family: Baskerville, Georgia; }</style>",
- "fontfamily_class2": ""
+ name: 'Line120',
+ "linewidth_class1": "<style>div#content { line-height: 120%;}</style>",
+ "linewidth_class2": ""
hunk ./static/templates/default.html 163
- name: 'Georgia',
- "fontfamily_class1": "<style>html { font-family: Georgia; }</style>",
- "fontfamily_class2": ""
+ name: 'Line130',
+ "linewidth_class1": "<style>div#content { line-height: 130%;}</style>",
+ "linewidth_class2": ""
hunk ./static/templates/default.html 168
- name: 'Trebuchet',
- "fontfamily_class1": "<style>html { font-family: 'Trebuchet MS', Georgia; }</style>",
- "fontfamily_class2": ""
+ name: 'Line140',
+ "linewidth_class1": "<style>div#content { line-height: 140%;}</style>",
+ "linewidth_class2": ""
hunk ./static/templates/default.html 173
- name: 'Helvetica',
- "fontfamily_class1": "<style>html { font-family: Helvetica, Georgia; }</style>",
- "fontfamily_class2": ""
+ name: 'Line150',
+ "linewidth_class1": "<style>div#content { line-height: 150%;}</style>",
+ "linewidth_class2": ""
~~~
### Analysis
From 15 June 2013 - 15 August 2013:
Line height   _n_      Conversion
-----------   ------   ----------
130%          18,124   15.26%
150%          17,459   15.22%
120%          17,773   14.92%
140%          17,927   14.92%
(total)       71,283   15.08%
Just from looking at the miserably small difference between the most extreme percentages ($15.26 - 14.92 = 0.34$%), we can predict that nothing here was statistically-significant:
~~~{.R}
x1 <- 18124; x2 <- 17927; prop.test(c(x1*0.1526, x2*0.1492), c(x1,x2))
	2-sample test for equality of proportions with continuity correction
data:  c(x1 * 0.1526, x2 * 0.1492) out of c(x1, x2)
X-squared = 0.7868, df = 1, p-value = 0.3751
~~~
I changed the 150% to 130% for the heck of it, even though the difference between 130 and 150 was trivially small.
<!-- rates <- read.csv(stdin(),header=TRUE)
Width,N,Rate
130,18124,0.1526
150,17459,0.1522
120,17773,0.1492
140,17927,0.1492
rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes
rates$Width <- as.factor(rates$Width)
g <- glm(cbind(Successes,Failures) ~ Width, data=rates, family="binomial")
...Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.74e+00 2.11e-02 -82.69 <2e-16
Width130 2.65e-02 2.95e-02 0.90 0.37
Width140 9.17e-06 2.97e-02 0.00 1.00
Width150 2.32e-02 2.98e-02 0.78 0.44
-->
## Null test
One of the suggestions in the A/B-testing papers was to run a "null" A/B test, where the payload is empty but the A/B testing framework is still measuring conversions etc. By definition, the null hypothesis of "no difference" is true here, so at an alpha of 0.05, only 5% of such null tests should yield _p_<0.05 (very different from the usual situation, where we don't know whether the null holds). The interest is that something may be going wrong in one's A/B setup or in general, so if one gets a "statistically-significant" result, the anomaly may be worth investigating.
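The 5% figure is easy to check by simulation; a sketch with a made-up 15% conversion rate and arm sizes comparable to this test's:

~~~{.R}
## fraction of simulated null A/B tests (two arms with the identical true
## conversion rate) reaching p < 0.05 -- should hover near 0.05:
set.seed(2013)
pvals <- replicate(2000, {
    hits <- rbinom(2, size=7400, prob=0.15)
    prop.test(hits, c(7400, 7400))$p.value })
mean(pvals < 0.05)
~~~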
It's easy to switch from the lineheight test to the null test; just rename the variables for Google Analytics, and empty the payloads:
~~~{.Html}
hunk ./static/templates/default.html 30
- <div class="linewidth_class1"></div>
+ <div class="null_class1"></div>
hunk ./static/templates/default.html 158
- linewidth: [
+ null: [
+ ...]]
hunk ./static/templates/default.html 160
- name: 'Line120',
- "linewidth_class1": "<style>div#content { line-height: 120%;}</style>",
+ name: 'null1',
+ "null_class1": "",
hunk ./static/templates/default.html 165
- { ...
- name: 'Line130',
- "linewidth_class1": "<style>div#content { line-height: 130%;}</style>",
- "linewidth_class2": ""
- },
- {
- name: 'Line140',
- "linewidth_class1": "<style>div#content { line-height: 140%;}</style>",
- "linewidth_class2": ""
- },
- {
- name: 'Line150',
- "linewidth_class1": "<style>div#content { line-height: 150%;}</style>",
+ name: 'null2',
+ "null_class1": "",
+ ... }
~~~
Since any difference due to the testing framework should be noticeable, this will be a shorter experiment, from 15 August to 29 August.
### Results
While, amusingly, the first ~1,000 hits per variant yielded a dramatic 18% vs 14% split, this quickly disappeared into a much more normal-looking set of data:
Option    _n_      Conversion
-------   ------   ----------
null2     7,359    16.23%
null1     7,488    15.89%
(total)   14,847   16.06%
### Analysis
Ah, but can we reject the null hypothesis that ""==""? In a rare victory for null-hypothesis-significance-testing, we do not commit a Type I error:
~~~{.R}
R> x1 <- 7359; x2 <- 7488; prop.test(c(x1*0.1623, x2*0.1589), c(x1,x2))
2-sample test for equality of proportions with continuity correction
data: c(x1 * 0.1623, x2 * 0.1589) out of c(x1, x2)
X-squared = 0.2936, df = 1, p-value = 0.5879
alternative hypothesis: two.sided
95% confidence interval:
-0.008547 0.015347
~~~
But seriously, it is nice to see that ABalytics does not seem to be broken: it favors neither option, and results are not being driven by placement in the array of options.
## Text & background color
As part of the generally monochromatic color scheme, the background was off-white (grey) and the text was black:
~~~{.Css}
html { ...
background-color: #FCFCFC; /* off-white */
color: black;
... }
~~~
The hyperlinks, on the other hand, make use of an off-black `color: #303C3C`, partially motivated by Ian Storm Taylor's advice to ["Never Use Black"](http://ianstormtaylor.com/design-tip-never-use-black/). I wonder - should all the text be off-black too? And which combination is best? White/black? Off-white/black? Off-white/off-black? White/off-black? Let's try all 4 combinations here.
### Implementation
The usual:
~~~{.Html}
hunk ./static/templates/default.html 30
- <div class="underline_class1"></div>
+ <div class="ground_class1"></div>
hunk ./static/templates/default.html 155
- underline: [
+ ground: [
hunk ./static/templates/default.html 157
- name: 'underlined',
- "underline_class1": "<style>a { color: #303C3C; text-decoration: underline; }</style>",
- "underline_class2": ""
+ name: 'bw',
+ "ground_class1": "<style>html { background-color: white; color: black; }</style>",
+ "ground_class2": ""
hunk ./static/templates/default.html 162
- name: 'notUnderlined',
- "underline_class1": "<style>a { color: #303C3C; text-decoration: none; }</style>",
- "underline_class2": ""
+ name: 'obw',
+ "ground_class1": "<style>html { background-color: white; color: #303C3C; }</style>",
+ "ground_class2": ""
+ },
+ {
+ name: 'bow',
+ "ground_class1": "<style>html { background-color: #FCFCFC; color: black; }</style>",
+ "ground_class2": ""
+ },
+ {
+ name: 'obow',
+ "ground_class1": "<style>html { background-color: #FCFCFC; color: #303C3C; }</style>",
+ "ground_class2": ""
... ]]
~~~
### Data
I was a little curious about this one, so I scheduled a full month and a half: 10 September - 20 October. Due to far more traffic than anticipated from submissions to Hacker News, I cut it short by 10 days to avoid wasting traffic on a test whose answer was already clear (a total _n_ of 231,599 was more than enough). The results:
Version   _n_      Conversion
-------   ------   ----------
bw        58,237   12.90%
obow      58,132   12.62%
bow       57,576   12.48%
obw       57,654   12.44%
### Analysis
~~~{.R}
rates <- read.csv(stdin(),header=TRUE)
Black,White,N,Rate
TRUE,TRUE,58237,0.1290
FALSE,FALSE,58132,0.1262
TRUE,FALSE,57576,0.1248
FALSE,TRUE,57654,0.1244
rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes
g <- glm(cbind(Successes,Failures) ~ Black * White, data=rates, family="binomial")
summary(g)
...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.9350 0.0125 -154.93 <2e-16
BlackTRUE -0.0128 0.0177 -0.72 0.47
WhiteTRUE -0.0164 0.0178 -0.92 0.36
BlackTRUE:WhiteTRUE 0.0545 0.0250 2.17 0.03
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.8625e+00 on 3 degrees of freedom
Residual deviance: -1.1758e-11 on 0 degrees of freedom
AIC: 50.4
summary(step(g))
# same thing
~~~
So we can estimate the net effect of the 4 possibilities:
1. Black, White: -0.0128 + -0.0164 + 0.0545 = 0.0253
2. Off-black, Off-white: 0 + 0 + 0 = 0
3. Black, Off-white: -0.0128 + 0 + 0 = -0.0128
4. Off-black, White: 0 + -0.0164 + 0 = -0.0164
The results exactly match the data's rankings.
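(That is no accident: the model is saturated, so the summed log-odds convert straight back to the observed conversion rates via the inverse logit. A quick check:)

~~~{.R}
## inverse-logit (plogis) of intercept + net effect recovers each cell's rate:
b0 <- -1.9350
round(plogis(b0 + c(bw=0.0253, obow=0, bow=-0.0128, obw=-0.0164)), 4)
##     bw   obow    bow    obw
## 0.1290 0.1262 0.1248 0.1244
~~~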
So, this suggests a change to the CSS: we switch the default background color from `#FCFCFC` to `white`, while leaving the default `color` its current `black`.
Reader Lucas asks in the comments whether including visitor type - since we would expect new visitors to be less likely to read a page in full than returning visitors (who know what they're in for & probably want more) - might improve the analysis; it is a variable Google Analytics does track. It's easy to ask GA for "New vs Returning Visitor", so I did:
~~~{.R}
rates <- read.csv(stdin(),header=TRUE)
Black,White,Type,N,Rate
FALSE,TRUE,new,36695,0.1058
FALSE,TRUE,old,21343,0.1565
FALSE,FALSE,new,36997,0.1043
FALSE,FALSE,old,21537,0.1588
TRUE,TRUE,new,36600,0.1073
TRUE,TRUE,old,22274,0.1613
TRUE,FALSE,new,36409,0.1075
TRUE,FALSE,old,21743,0.1507
rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes
g <- glm(cbind(Successes,Failures) ~ Black * White + Type, data=rates, family="binomial")
summary(g)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.134459 0.013770 -155.01 <2e-16
BlackTRUE -0.009219 0.017813 -0.52 0.60
WhiteTRUE 0.000837 0.017798 0.05 0.96
BlackTRUE:WhiteTRUE 0.034362 0.025092 1.37 0.17
Typeold 0.448004 0.012603 35.55 <2e-16
~~~
1. Black, White: (-0.009219) + 0.000837 + 0.034362 = 0.02598
2. Off-black, Off-white: 0 + 0 + 0 = 0
3. Black, Off-white: (-0.009219) + 0 + 0 = -0.009219
4. Off-black, White: 0 + 0.000837 + 0 = 0.000837
And again, 0.02598 > 0.000837. So, as one hopes, thanks to randomization, adding a missing covariate doesn't change our conclusion.
## List symbol and font-size
I make heavy use of unordered lists in articles; for no particular reason, the symbol denoting the start of each entry is the little black square, rather than the more common little circle. I've come to find the little squares a bit chunky and ugly, so I want to test that. And I just realized that I have never tested font size (just typeface), even though increasing the font size is one of the most common CSS tweaks around. I don't have any reason to expect an interaction between these two bits of design, unlike in the previous A/B test, but I like the idea of getting more out of my data, so I am doing another factorial design - this time not 2x2 but 3x5 (the full grid can be generated mechanically; see the sketch below the options). The options:
~~~{.Css}
ul { list-style-type: square; }
ul { list-style-type: circle; }
ul { list-style-type: disc; }
html { font-size: 100%; }
html { font-size: 105%; }
html { font-size: 110%; }
html { font-size: 115%; }
html { font-size: 120%; }
~~~
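(Rather than writing out all 15 variants by hand, the grid and its style payloads can be generated mechanically; a sketch:)

~~~{.R}
## the 15 cells of the 3x5 factorial design, and their CSS payloads:
conditions <- expand.grid(icon=c("square", "circle", "disc"),
                          size=c("100%", "105%", "110%", "115%", "120%"))
nrow(conditions) # 15
with(conditions, paste0("<style>ul { list-style-type: ", icon,
                        "; } html { font-size: ", size, "; }</style>"))
~~~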
### Implementation
A 3x5 design, or 15 possibilities, does get a little bulkier than I'd like:
~~~{.Html}
hunk ./static/templates/default.html 30
- <div class="ground_class1"></div>
+ <div class="ulFontSize_class1"></div>
hunk ./static/templates/default.html 146
- ground: [
+ ulFontSize: [
hunk ./static/templates/default.html 148
- name: 'bw',
- "ground_class1": "<style>html { background-color: white; color: black; }</style>",
- "ground_class2": ""
+ name: 's100',
+ "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 100%; }</style>",
+ "ulFontSize_class2": ""
hunk ./static/templates/default.html 153
- name: 'obw',
- "ground_class1": "<style>html { background-color: white; color: #303C3C; }</style>",
- "ground_class2": ""
+ name: 's105',
+ "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 105%; }</style>",
+ "ulFontSize_class2": ""
hunk ./static/templates/default.html 158
- name: 'bow',
- "ground_class1": "<style>html { background-color: #FCFCFC; color: black; }</style>",
- "ground_class2": ""
+ name: 's110',
+ "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 110%; }</style>",
+ "ulFontSize_class2": ""
hunk ./static/templates/default.html 163
- name: 'obow',
- "ground_class1": "<style>html { background-color: #FCFCFC; color: #303C3C; }</style>",
- "ground_class2": ""
+ name: 's115',
+ "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 115%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 's120',
+ "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 120%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 'c100',
+ "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 100%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 'c105',
+ "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 105%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 'c110',
+ "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 110%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 'c115',
+ "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 115%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 'c120',
+ "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 120%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 'd100',
+ "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 100%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 'd105',
+ "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 105%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 'd110',
+ "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 110%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 'd115',
+ "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 115%; }</style>",
+ "ulFontSize_class2": ""
+ },
+ {
+ name: 'd120',
+ "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 120%; }</style>",
+ "ulFontSize_class2": ""
... ]]
~~~
### Data
I halted the A/B test on 27 October because I was noticing clear damage as compared to my default CSS. The results were:
List icon   Font zoom   _n_      Reading conversion rate
---------   ---------   ------   -----------------------
square      100%        4,763    16.38%
disc        100%        4,759    16.18%
disc        110%        4,716    16.09%
circle      115%        4,933    15.95%
circle      100%        4,872    15.85%
circle      110%        4,920    15.53%
circle      120%        5,114    15.51%
square      115%        4,815    15.51%
square      110%        4,927    15.47%
circle      105%        5,101    15.33%
square      105%        4,775    14.85%
disc        115%        4,797    14.78%
disc        105%        5,006    14.72%
disc        120%        4,912    14.56%
square      120%        4,786    13.96%
(total)                 73,196   15.38%
### Analysis
Incorporating visitor type:
~~~{.R}
rates <- read.csv(stdin(),header=TRUE)
Ul,Size,Type,N,Rate
c,120,old,2673,0.1650
c,115,old,2643,0.1854
c,105,new,2636,0.1392
d,105,old,2635,0.1613
s,110,old,2596,0.1749
s,120,old,2593,0.1678
s,105,new,2582,0.1243
d,120,old,2559,0.1649
c,110,new,2558,0.1298
d,110,new,2555,0.1307
c,100,old,2553,0.2002
c,105,old,2539,0.1713
d,115,old,2524,0.1565
s,115,new,2516,0.1391
c,110,old,2505,0.1741
d,100,new,2502,0.1431
c,120,new,2500,0.1284
s,110,new,2491,0.1265
c,115,new,2483,0.1228
d,120,new,2452,0.1277
d,105,new,2448,0.1364
c,100,new,2436,0.1199
d,115,new,2435,0.1437
s,100,new,2411,0.1497
s,120,new,2411,0.1161
s,105,old,2387,0.1571
s,115,old,2365,0.1674
d,100,old,2358,0.1735
s,100,old,2329,0.1803
d,110,old,2235,0.1888
rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes
g <- glm(cbind(Successes,Failures) ~ Ul * Size + Type, data=rates, family="binomial"); summary(g)
...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.389310 0.270903 -5.13 2.9e-07
Uld -0.103201 0.386550 -0.27 0.789
Uls 0.055036 0.389109 0.14 0.888
Size -0.004397 0.002458 -1.79 0.074
Uld:Size 0.000842 0.003509 0.24 0.810
Uls:Size -0.000741 0.003533 -0.21 0.834
Typeold 0.317126 0.020507 15.46 < 2e-16
summary(step(g))
...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.40555 0.15921 -8.83 <2e-16
Size -0.00436 0.00144 -3.02 0.0025
Typeold 0.31725 0.02051 15.47 <2e-16
# examine just the list type alone, since the Size result is clear.
summary(glm(cbind(Successes,Failures) ~ Ul + Type, data=rates, family="binomial"))
...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8725 0.0208 -89.91 <2e-16
Uld -0.0106 0.0248 -0.43 0.67
Uls -0.0265 0.0249 -1.07 0.29
Typeold 0.3163 0.0205 15.43 <2e-16
summary(glm(cbind(Successes,Failures) ~ Ul + Type, data=rates[rates$Size==100,], family="binomial"))
...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8425 0.0465 -39.61 < 2e-16
Uld -0.0141 0.0552 -0.26 0.80
Uls 0.0353 0.0551 0.64 0.52
Typeold 0.3534 0.0454 7.78 7.3e-15
~~~
The results are a little confusing in factorial form: it seems pretty clear that `Size` is bad and that 100% performs best, but what's going on with the list icon type? Do we have too little data or is it interacting with the font size somehow? I find it a lot clearer when plotted:
~~~{.R}
library(ggplot2)
qplot(Size,Rate,color=Ul,data=rates)
~~~
![Reading rate, split by font size, then by list icon type](/images/2013-10-27-abtesting-ulfontsize.png)
Immediately the negative effect of increasing the font size jumps out, and the list-icon estimates become easier to understand: square performs best in the 100% (the original default) font-size condition but poorly at the other font sizes, which is why it seems to do only medium-well overall. Given how much better 100% performs than the others, I'm inclined to ignore the other sizes' results and keep the squares.
100% and squares, however, were the original CSS settings, so this means I will make no changes to the existing CSS based on these results.
## Blockquote formatting
Another bit of formatting I've been meaning to test for a while: how well [Readability](http://www.readability.com/)'s pull-quotes next to blockquotes perform, and whether my zebra-striping of nested blockquotes is helpful or harmful.
The Readability thing goes like this:
~~~{.Css}
blockquote::before {
content: "\201C";
filter: alpha(opacity=20);
font-family: "Constantia", Georgia, 'Hoefler Text', 'Times New Roman', serif;
font-size: 4em;
left: -0.5em;
opacity: .2;
position: absolute;
top: .25em }
~~~
The current blockquote striping goes thusly:
~~~{.Css}
blockquote, blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote {
z-index: -2;
background-color: rgb(245, 245, 245); }
blockquote blockquote, blockquote blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote blockquote {
background-color: rgb(235, 235, 235); }
~~~
### Implementation
This is another 2x2 design since we can use the Readability quotes or not, and the zebra-striping or not.
~~~{.Diff}
hunk ./static/css/default.css 271
-blockquote, blockquote blockquote blockquote,
- blockquote blockquote blockquote blockquote blockquote {
- z-index: -2;
- background-color: rgb(245, 245, 245); }
-blockquote blockquote, blockquote blockquote blockquote blockquote,
- blockquote blockquote blockquote blockquote blockquote blockquote {
- background-color: rgb(235, 235, 235); }
+/* blockquote, blockquote blockquote blockquote, */
+/* blockquote blockquote blockquote blockquote blockquote { */
+/* z-index: -2; */