Statistics
While there are many statistical functions, the summary command is a simple way to compute descriptive statistics for a list of series. Here is an example that computes only the most basic statistics using the --simple option:
open abdata.gdt --quiet
list Y = IND YEAR n
summary Y --simple
# Store the results as a matrix
summary Y --simple
matrix stats = $result
print stats
The output is:
Mean Median S.D. Min Max
IND 5.123 5.000 2.678 1.000 9.000
YEAR 1980 1980 2.583 1976 1984
n 1.056 0.8272 1.342 -2.263 4.687
stats (3 x 5)
Mean Median S.D. Min Max
IND 5.1232 5.0000 2.6781 1.0000 9.0000
YEAR 1980.0 1980.0 2.5830 1976.0 1984.0
n 1.0560 0.82724 1.3415 -2.2634 4.6873
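Once the statistics are stored in a matrix, individual values can be picked out by row and column index. The following is a minimal sketch that assumes the layout shown in the output above (one row per series in list order, columns Mean, Median, S.D., Min, Max); the names mean_n and sd_all are illustrative only:

```hansl
open abdata.gdt --quiet
list Y = IND YEAR n
summary Y --simple
matrix stats = $result

# Rows follow the order of the list, columns the order of the printed table
scalar mean_n = stats[3,1]    # row 3 = n, column 1 = Mean
matrix sd_all = stats[,3]     # column 3 = S.D. of all three series

printf "Mean of n = %g\n", mean_n
print sd_all
```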
By means of the --by=Series option, you can also compute statistics for each category of some other variable. The following example prints basic statistics for the series n and w for each value of the series IND (industry ID):
set verbose off
open abdata.gdt --quiet
list Y = n w
summary Y --by=IND --simple
The output for the first three industries is:
IND = 1 (n = 122):
Mean Median S.D. Min Max
n 1.234 1.095 1.172 -0.5942 4.099
w 3.186 3.183 0.1511 2.757 3.581
IND = 2 (n = 88):
Mean Median S.D. Min Max
n 1.039 0.9792 1.387 -2.104 3.223
w 3.410 3.409 0.1363 2.870 3.812
IND = 3 (n = 89):
Mean Median S.D. Min Max
n 0.7006 0.4324 1.199 -1.726 3.030
w 3.287 3.331 0.1640 2.910 3.614
The aggregate() function is powerful and allows you to aggregate data (as in pivot tables) by means of some aggregation function. Here is a simple example of how to compute the mean values of the series n and w for each unique combination of the discrete series IND and YEAR (only the initial rows are shown):
open abdata.gdt --quiet
list Y = n w
list groupby = IND YEAR
matrix mean_values = aggregate(Y, groupby, "mean")
printf "\n%12.2f\n", mean_values
The output is:
IND YEAR count n w
1.00 1976.00 8.00 0.89 3.12
1.00 1977.00 16.00 1.34 3.11
1.00 1978.00 17.00 1.37 3.09
.
.
.
2.00 1976.00 8.00 1.51 3.58
2.00 1977.00 12.00 1.14 3.50
2.00 1978.00 12.00 1.13 3.44
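If only the group sizes are of interest, aggregate() can, to the best of our knowledge, be called with null as its first argument and without a function name, in which case it returns the distinct group values together with their counts. This is a sketch under that assumption; the matrix name counts is illustrative:

```hansl
open abdata.gdt --quiet
list groupby = IND YEAR

# No series to aggregate and no function: just count observations per group
matrix counts = aggregate(null, groupby)
printf "\n%12.0f\n", counts
```

The count column should correspond to the third column of the table above.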
You can also pass your own custom aggregation function to aggregate(). The function must return a scalar value. Here is an example for the interquartile range:
set verbose off
function scalar iqr (const series y)
/* Compute the interquartile range. */
scalar result = quantile(y, 0.75) - quantile(y, 0.25)
return result
end function
open mroz87.gdt --quiet
matrix result = aggregate(FAMINC, CIT, iqr)
printf "%12.2f\n", result
This returns the following table:
byvar count f(x)
0.00 269.00 9591.00
1.00 484.00 13349.25
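One row of this table can be cross-checked by hand: restrict the sample to a single group and compute the same quantity directly. This is only a sketch; the scalar name iqr1 is illustrative, and the printed value should agree with the row for CIT = 1 provided FAMINC has no missing values in that group:

```hansl
open mroz87.gdt --quiet

# Keep only the observations with CIT equal to 1
smpl CIT == 1 --restrict
scalar iqr1 = quantile(FAMINC, 0.75) - quantile(FAMINC, 0.25)
printf "CIT = 1: count = %d, IQR of FAMINC = %.2f\n", $nobs, iqr1

# Restore the full sample
smpl full
```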
The following example shows how to run a simple OLS regression and how to store post-estimation information.
set verbose off
open abdata.gdt --quiet
ols ys const n w #OPTIONAL: --robust
matrix coeff = $coeff # point estimates
matrix stderr = $stderr # std. error
series uhat = $uhat # residuals
series yhat = $yhat # fitted values
# Print values
print coeff ~ stderr
print ys yhat uhat --byobs --range=:5
The output is:
Model 1: Pooled OLS, using 1031 observations
Included 140 cross-sectional units
Time-series length: minimum 7, maximum 9
Dependent variable: ys
coefficient std. error t-ratio p-value
--------------------------------------------------------
const 4.60388 0.0351033 131.2 0.0000 ***
n 0.00626942 0.00217539 2.882 0.0040 ***
w 0.00875566 0.0110959 0.7891 0.4302
Mean dependent var 4.638015 S.D. dependent var 0.093961
Sum squared resid 9.015800 S.E. of regression 0.093650
R-squared 0.008551 Adjusted R-squared 0.006622
F(2, 1028) 4.433125 P-value(F) 0.012105
Log-likelihood 980.1866 Akaike criterion −1954.373
Schwarz criterion −1939.558 Hannan-Quinn −1948.751
rho 0.802880 Durbin-Watson 0.305346
4.6039 0.035103
0.0062694 0.0021754
0.0087557 0.011096
ys yhat uhat
1:1
1:2 4.561294 4.636576 -0.07528
1:3 4.578384 4.636651 -0.05827
1:4 4.601245 4.636334 -0.03509
1:5 4.610656 4.636581 -0.02592
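The stored accessors can be combined to reproduce other columns of the regression table. The sketch below recomputes the t-ratios and two-sided p-values from $coeff and $stderr, assuming the default (non-robust) t distribution with $df degrees of freedom; the matrix names tstat and pv are illustrative:

```hansl
open abdata.gdt --quiet
ols ys const n w --quiet

# t-ratio = point estimate divided by its standard error (element-wise)
matrix tstat = $coeff ./ $stderr

# Two-sided p-values from the t distribution with the residual df
matrix pv = 2 * pvalue(t, $df, abs(tstat))

print tstat pv
```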
The modtest command provides various specification tests which can be conducted after a model has been estimated. Another command is reset, which runs Ramsey's RESET test. Here are examples:
set verbose off
open abdata.gdt --quiet
ols ys const n w --simple
modtest --normality --quiet
modtest --white --quiet
reset --squares-only --quiet
The output is as follows:
Pooled OLS, using 1031 observations
Included 140 cross-sectional units
Time-series length: minimum 7, maximum 9
Dependent variable: ys
coefficient std. error t-ratio p-value
--------------------------------------------------------
const 4.60388 0.0351033 131.2 0.0000 ***
n 0.00626942 0.00217539 2.882 0.0040 ***
w 0.00875566 0.0110959 0.7891 0.4302
SSR = 9.0158, R-squared = 0.008551
Test for null hypothesis of normal distribution:
Chi-square(2) = 82.810 with p-value 0.00000
White's test for heteroskedasticity
Test statistic: TR^2 = 29.920521,
with p-value = P(Chi-square(5) > 29.920521) = 0.000015
RESET test for specification (squares only)
Null hypothesis: specification is adequate
Test statistic: F = 12.054076,
with p-value = P(F(1,1027) > 12.0541) = 0.000538
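Test results do not have to be read off the printed output: after a test command, the accessors $test and $pvalue hold the statistic and p-value of the most recent test. A minimal sketch for White's test follows; the 5 percent threshold and the scalar names are illustrative choices:

```hansl
open abdata.gdt --quiet
ols ys const n w --quiet

modtest --white --quiet
scalar white_stat = $test     # TR^2 statistic
scalar white_p = $pvalue      # corresponding p-value

if white_p < 0.05
    printf "Homoskedasticity rejected (TR^2 = %g, p = %g)\n", white_stat, white_p
else
    printf "Homoskedasticity not rejected (TR^2 = %g, p = %g)\n", white_stat, white_p
endif
```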
Gretl allows you to test hypotheses in a simple manner.
First, you can call the omit command to test zero restrictions on coefficients. Here is a simple example that tests the removal of two variables by means of an F-test:
list X = n w
ols ys const X
# Test the restriction but do not re-estimate the model
omit X --test-only
# Test the restriction and re-estimate the model
omit X
The output is:
Test on Model 3:
Null hypothesis: the regression parameters are zero for the variables
n, w
Test statistic: F(2, 1028) = 4.43312, p-value 0.0121052
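Because the restriction removes all regressors except the constant, this F statistic can also be computed from the R-squared of the unrestricted model as F = (R^2 / q) / ((1 - R^2) / df), where q is the number of restrictions. A sketch of that arithmetic, with illustrative scalar names:

```hansl
open abdata.gdt --quiet
ols ys const n w --quiet

scalar q = 2                                     # number of zero restrictions
scalar Fstat = ($rsq / q) / ((1 - $rsq) / $df)   # F from R-squared
scalar pval = pvalue(F, q, $df, Fstat)

printf "F(%d, %d) = %g, p-value %g\n", q, $df, Fstat, pval
```

With the R-squared of 0.008551 reported earlier, this should come out close to the F(2, 1028) value printed by omit.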
The restrict block command provides a powerful apparatus for testing sets of (non-)linear restrictions. Here is an example using the --quiet option to suppress detailed output. You may also try the --bootstrap option:
restrict --quiet # --bootstrap
b[w] = 0.005
b[n] - b[w] = 0
end restrict
This returns:
Restriction set
1: b[w] = 0.005
2: b[n] - b[w] = 0
Test statistic: F(2, 1028) = 0.224807, with p-value = 0.798709
The Wald test is a statistical test used to assess constraints on parameters in a regression model. In this example, we demonstrate how to perform a Wald test on a nonlinear restriction using Gretl (for more details, see the help on the restrict command):
open data4-1.gdt
ols price const sqft bedrms baths
function scalar my_restr(matrix b)
ret = b[4] - b[2]*b[3]
return ret
end function
restrict
rfunc = my_restr
end restrict
This command yields:
Test statistic: chi^2(1) = 0.0400426, with p-value = 0.841397
This example illustrates how to run non-parametric tests for differences between two variables. We make use of the difftest command here.
The difftest command supports three tests. The sign test and the signed-rank test assume paired observations (such as weight pre- and post-treatment for each subject); the rank-sum test does not require pairing:
- Sign Test (Wikipedia) -- H0: the difference between X and Y has zero median
- Wilcoxon Rank-Sum Test (Wikipedia) -- H0: the two medians are equal
- Wilcoxon Signed-Rank Test (Wikipedia) -- H0: the median difference is zero
The example uses simulated series with different expected values to demonstrate the tests.
set verbose off
##################
## Non-parametric difference tests
##################
set seed 1234 # only to ensure replicability
nulldata 100 # cross-sectional dataset
# Create some random variables
series y = normal(0, 2) # expected value 0
series x = normal(10, 2) # expected value 10
list L = y x # define a list of series which can be handy
# Stats and plot
summary L --simple
boxplot L --output=display
# Non-parametric difference tests
help difftest # see the help for information
difftest y x --sign # Sign test -- less powerful
printf "\nP-value of the Sign-test = %.2f (test-stat = %g)\n", $pvalue, $test
difftest y x --rank-sum # Wilcoxon rank-sum test (aka Mann-Whitney U test)
printf "\nP-value of the Wilcoxon rank-sum test = %.2f (test-stat = %g)\n", $pvalue, $test
difftest y x --signed-rank # Wilcoxon signed-rank test
The test results are:
Test for difference between y and x
Sign Test
Number of differences: n = 100
Number of cases with y > x: w = 0 (0.00%)
Under the null hypothesis of no difference, W follows B(100, 0.5)
Prob(W <= 0) = 7.88861e-31
Prob(W >= 0) = 1
P-value of the Sign-test = 1.00 (test-stat = 0)
Test for difference between y and x
Wilcoxon Rank-Sum Test
Null hypothesis: the two medians are equal
n1 = 100, n2 = 100
w (sum of ranks, sample 1) = 5052
z = (5052 - 10050) / 409.268 = -12.2121
P(Z < -12.2121) = 1.34012e-34
Two-tailed p-value = 2.68024e-34
P-value of the Wilcoxon rank-sum test = 0.00 (test-stat = -12.2121)
Test for difference between y and x
Wilcoxon Signed-Rank Test
Null hypothesis: the median difference is zero
n = 100
W+ = 0, W- = 5050
(zero differences: 0, non-zero ties: 0)
Expected value = 2525
Variance = 84587.5
z = -8.68005
P(Z < -8.68005) = 1.97796e-18
Two-tailed p-value = 3.95591e-18
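The z statistics in the two Wilcoxon outputs follow from the standard normal approximations, and the arithmetic can be retraced with the numbers printed above. This is a sketch, not gretl's internal code: the continuity correction of 0.5 in the signed-rank case is an assumption, chosen because it reproduces the printed z value, and all scalar names are illustrative:

```hansl
# Rank-sum test: normal approximation for the rank sum of sample 1
scalar n1 = 100
scalar n2 = 100
scalar W1 = 5052                                   # sum of ranks, sample 1
scalar EW1 = n1 * (n1 + n2 + 1) / 2                # expected value under H0
scalar sdW1 = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # standard deviation under H0
scalar z_rs = (W1 - EW1) / sdW1
printf "rank-sum:    z = %g, two-tailed p = %g\n", z_rs, 2 * cdf(N, z_rs)

# Signed-rank test: normal approximation for W+
scalar nd = 100                                    # number of non-zero differences
scalar Wplus = 0
scalar EWp = nd * (nd + 1) / 4                     # expected value under H0
scalar VWp = nd * (nd + 1) * (2 * nd + 1) / 24     # variance under H0
scalar z_sr = (Wplus - EWp + 0.5) / sqrt(VWp)      # 0.5 = assumed continuity correction
printf "signed-rank: z = %g, two-tailed p = %g\n", z_sr, 2 * cdf(N, z_sr)
```

The intermediate values (expected value 10050 and standard deviation 409.268 for the rank-sum test, expected value 2525 and variance 84587.5 for the signed-rank test) match those shown in the output.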
This example loads the well-known cross-sectional MROZ dataset. By means of an OLS regression with a level dummy, we want to test whether men earn higher wages in large cities than in small cities.
open mroz87.gdt
boxplot HW CIT --factorized --output=display \
{ set title "Wages of men in small and large cities" font ',15'; }
# Regression for explaining Husband's wage by CIT (0: lives in small city, 1: lives in large city)
ols HW const CIT --robust # standard errors robust to possible heteroskedasticity
printf "\nThe null hypothesis that wages in large cities are equal \n\
to wages in small cities can be rejected at the %.2f pct. \n\
significance level\n", $pvalue
# Run a restriction by hand
help restrict
# Test the null that hourly wages are on average $3 higher in large cities
restrict --bootstrap
b[CIT] = 3
end restrict
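For comparison with the bootstrap result, the same null hypothesis can be checked with a t-type statistic built from the stored estimate and its robust standard error. This is a sketch of the asymptotic counterpart, not of what restrict --bootstrap computes internally; the scalar names are illustrative:

```hansl
open mroz87.gdt --quiet
ols HW const CIT --robust --quiet

# H0: b[CIT] = 3, using the robust standard error from the estimation
scalar tstat = ($coeff(CIT) - 3) / $stderr(CIT)
scalar pval = 2 * pvalue(t, $df, abs(tstat))

printf "t = %g, two-sided p-value = %g\n", tstat, pval
```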