
Statistics

Descriptive statistics

Basic example

While gretl offers many statistical functions, the summary command is a simple way to compute descriptive statistics for a list of series. Here is an example that computes only the most basic statistics, using the --simple option:

open abdata.gdt --quiet
list Y = IND YEAR n 
summary Y --simple

# Store the results as a matrix
matrix stats = $result
print stats

The output is:

                 Mean     Median       S.D.        Min        Max
IND             5.123      5.000      2.678      1.000      9.000
YEAR             1980       1980      2.583       1976       1984
n               1.056     0.8272      1.342     -2.263      4.687

stats (3 x 5)

             Mean       Median         S.D.          Min          Max 
 IND       5.1232       5.0000       2.6781       1.0000       9.0000 
YEAR       1980.0       1980.0       2.5830       1976.0       1984.0 
   n       1.0560      0.82724       1.3415      -2.2634       4.6873 
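
Since the result is a plain gretl matrix, single entries can be picked out by position for further use. A minimal sketch, assuming the row and column order shown above:

scalar mean_ind = stats[1, 1]   # mean of IND: row 1, column "Mean"
scalar max_n = stats[3, 5]      # maximum of n: row 3, column "Max"
printf "Mean of IND = %g, max of n = %g\n", mean_ind, max_n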

Grouped statistics

By means of the --by=seriesname option, you can also compute statistics for each category of some other discrete variable. The following example prints basic statistics for the series n and w for each value of the series IND (industry ID):

set verbose off
open abdata.gdt --quiet
list Y = n w
summary Y --by=IND --simple

The output for the first three industries is:

IND = 1 (n = 122):

                 Mean     Median       S.D.        Min        Max
n               1.234      1.095      1.172    -0.5942      4.099
w               3.186      3.183     0.1511      2.757      3.581

IND = 2 (n = 88):

                 Mean     Median       S.D.        Min        Max
n               1.039     0.9792      1.387     -2.104      3.223
w               3.410      3.409     0.1363      2.870      3.812

IND = 3 (n = 89):

                 Mean     Median       S.D.        Min        Max
n              0.7006     0.4324      1.199     -1.726      3.030
w               3.287      3.331     0.1640      2.910      3.614
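
If you need the figures for a single category for further computations, one simple (if more verbose) alternative is to restrict the sample via the smpl command. A sketch for industry 1:

open abdata.gdt --quiet
smpl IND == 1 --restrict
summary n w --simple
smpl full   # restore the full sample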

Aggregation

The aggregate() function is powerful and allows you to aggregate data (similar to pivot tables) by means of some aggregation function. Here is a simple example of how to compute the mean values of the series n and w for each unique combination of the discrete series IND and YEAR (only the initial rows are shown):

open abdata.gdt --quiet
list Y = n w
list groupby = IND YEAR

matrix mean_values = aggregate(Y, groupby, "mean")
printf "\n%12.2f\n", mean_values

The output is:

         IND        YEAR       count           n           w
        1.00     1976.00        8.00        0.89        3.12
        1.00     1977.00       16.00        1.34        3.11
        1.00     1978.00       17.00        1.37        3.09
          .
          .
          .
        2.00     1976.00        8.00        1.51        3.58
        2.00     1977.00       12.00        1.14        3.50
        2.00     1978.00       12.00        1.13        3.44
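
The returned matrix can be processed further like any other matrix. For instance, the rows belonging to a single industry can be extracted with the selifr() function; a sketch, assuming (as above) that IND is stored in the first column:

# keep only the rows referring to industry 2
matrix ind2 = selifr(mean_values, mean_values[,1] .= 2)
printf "\n%12.2f\n", ind2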

Custom aggregate function

You can also pass your own custom aggregation function to aggregate(). The function must return a scalar value. Here is an example for the inter-quartile range:

set verbose off

function scalar iqr (const series y)
    /* Compute the interquartile range. */
    scalar result = quantile(y, 0.75) - quantile(y, 0.25)
    return result
end function

open mroz87.gdt --quiet
matrix result = aggregate(FAMINC, CIT, iqr)
printf "%12.2f\n", result

which returns the following table:

       byvar       count        f(x)
        0.00      269.00     9591.00
        1.00      484.00    13349.25

OLS regression

Estimation

The following example shows how to run a simple OLS regression and how to store post-estimation information.

set verbose off

open abdata.gdt --quiet

ols ys const n w   # optional: add --robust for robust standard errors

matrix coeff = $coeff  # point estimates
matrix stderr = $stderr  # std. error
series uhat = $uhat  # residuals
series yhat = $yhat  # fitted values

# Print values
print coeff ~ stderr
print ys yhat uhat --byobs --range=:5

The output is:

Model 1: Pooled OLS, using 1031 observations
Included 140 cross-sectional units
Time-series length: minimum 7, maximum 9
Dependent variable: ys

             coefficient   std. error   t-ratio    p-value
  --------------------------------------------------------
  const      4.60388       0.0351033    131.2      0.0000  ***
  n          0.00626942    0.00217539     2.882    0.0040  ***
  w          0.00875566    0.0110959      0.7891   0.4302 

Mean dependent var   4.638015   S.D. dependent var   0.093961
Sum squared resid    9.015800   S.E. of regression   0.093650
R-squared            0.008551   Adjusted R-squared   0.006622
F(2, 1028)           4.433125   P-value(F)           0.012105
Log-likelihood       980.1866   Akaike criterion    −1954.373
Schwarz criterion   −1939.558   Hannan-Quinn        −1948.751
rho                  0.802880   Durbin-Watson        0.305346

      4.6039     0.035103 
   0.0062694    0.0021754 
   0.0087557     0.011096 

              ys         yhat         uhat
1:1                                       
1:2     4.561294     4.636576     -0.07528
1:3     4.578384     4.636651     -0.05827
1:4     4.601245     4.636334     -0.03509
1:5     4.610656     4.636581     -0.02592
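
Given the stored matrices, further post-estimation quantities are easy to compute by hand. Continuing the script above, here is a minimal sketch of two-sided 95 percent confidence intervals, using the t critical value based on $df (the residual degrees of freedom):

scalar tc = invcdf(t, $df, 0.975)   # t critical value
matrix ci = coeff ~ (coeff - tc*stderr) ~ (coeff + tc*stderr)
cnameset(ci, "coeff lower upper")
print ci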

Specification tests

The modtest command provides various specification tests which can be conducted after estimating a model. Another command is reset, which runs Ramsey's RESET test. Here are some examples:

set verbose off
open abdata.gdt --quiet
ols ys const n w --simple

modtest --normality --quiet
modtest --white --quiet
reset --squares-only --quiet

The output is as follows:

Pooled OLS, using 1031 observations
Included 140 cross-sectional units
Time-series length: minimum 7, maximum 9
Dependent variable: ys

             coefficient   std. error   t-ratio    p-value
  --------------------------------------------------------
  const      4.60388       0.0351033    131.2      0.0000  ***
  n          0.00626942    0.00217539     2.882    0.0040  ***
  w          0.00875566    0.0110959      0.7891   0.4302 

SSR = 9.0158, R-squared = 0.008551

Test for null hypothesis of normal distribution:
Chi-square(2) = 82.810 with p-value 0.00000

White's test for heteroskedasticity

Test statistic: TR^2 = 29.920521,
with p-value = P(Chi-square(5) > 29.920521) = 0.000015

RESET test for specification (squares only)
Null hypothesis: specification is adequate
Test statistic: F = 12.054076,
with p-value = P(F(1,1027) > 12.0541) = 0.000538
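
Both modtest and reset store their outcome in the $test and $pvalue accessors, which is handy in scripts. A small sketch, re-running the White test after estimating the model above:

ols ys const n w --quiet
modtest --white --quiet
printf "White test: TR^2 = %g with p-value = %.4f\n", $test, $pvalue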

Hypothesis testing

Gretl allows you to test hypotheses in a simple manner.

omit variables

First, you can call the omit command for testing zero restrictions on coefficients. Here is a simple example that tests the removal of two variables by means of an F-test:

list X = n w
ols ys const X

# Test the restriction but do not re-estimate the model
omit X --test-only

# Test the restriction and re-estimate the model
omit X

The output is:

Test on Model 3:

  Null hypothesis: the regression parameters are zero for the variables
    n, w
  Test statistic: F(2, 1028) = 4.43312, p-value 0.0121052
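
The omit command can also drop regressors sequentially, based on their p-values, via the --auto option. A sketch (the threshold of 0.05 is just an illustration):

ols ys const n w --quiet
omit --auto=0.05   # drop the least significant regressor, one at a time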

Set of linear restrictions

The restrict block provides a powerful apparatus for testing sets of (non-)linear restrictions. Here is an example using the --quiet option to suppress detailed output. You may also try the --bootstrap option:

restrict --quiet # --bootstrap
    b[w] = 0.005
    b[n] - b[w] = 0
end restrict

This returns:

Restriction set
 1: b[w] = 0.005
 2: b[n] - b[w] = 0

Test statistic: F(2, 1028) = 0.224807, with p-value = 0.798709

Non-linear restrictions

The Wald test is a statistical test used to assess constraints on parameters in a regression model. In this example, we demonstrate how to perform a Wald test on a nonlinear restriction using Gretl (for more details, see the help on the restrict command):

open data4-1.gdt
ols price const sqft bedrms baths

function scalar my_restr (matrix b)
    scalar ret = b[4] - b[2]*b[3]
    return ret
end function

restrict
    rfunc = my_restr
end restrict 

This command yields:

Test statistic: chi^2(1) = 0.0400426, with p-value = 0.841397

Non-parametric test for differences between variables

This example illustrates how to run non-parametric tests for differences between variables. We make use of the difftest command here.

The difftest command supports three tests for given pairs of observations (such as weight pre- and post-treatment) for each subject:

  1. Sign Test -- H0: the difference between X and Y has zero median
  2. Wilcoxon Rank-Sum Test -- H0: the two medians are equal
  3. Wilcoxon Signed-Rank Test -- H0: the median difference is zero

The example uses simulated series to test for differences in expected values.

set verbose off

##################
## Non-parametric difference tests
##################
set seed 1234 	# only to ensure replicability
nulldata 100 	# cross-sectional dataset

# Create some random variables
series y = normal(0, 2)   # expected value 0
series x = normal(10, 2)  # expected value 10
list L = y x			# define a list of series which can be handy

# Stats and plot
summary L --simple
boxplot L --output=display

# Non-parametric difference tests
help difftest			# see the help for information

difftest y x --sign   # Sign test -- less powerful
printf "\nP-value of the Sign-test = %.2f (test-stat = %g)\n", $pvalue, $test

difftest y x --rank-sum   # Wilcoxon rank-sum test (aka Mann-Whitney U test)
printf "\nP-value of the Wilcoxon rank-sum test = %.2f (test-stat = %g)\n", $pvalue, $test

difftest y x --signed-rank   # Wilcoxon signed-rank test

The test results are:

Test for difference between y and x
Sign Test

Number of differences: n = 100
  Number of cases with y > x: w = 0 (0.00%)
  Under the null hypothesis of no difference, W follows B(100, 0.5)
  Prob(W <= 0) = 7.88861e-31
  Prob(W >= 0) = 1

P-value of the Sign-test = 1.00 (test-stat = 0)

Test for difference between y and x
Wilcoxon Rank-Sum Test
Null hypothesis: the two medians are equal

  n1 = 100, n2 = 100
  w (sum of ranks, sample 1) = 5052
  z = (5052 - 10050) / 409.268 = -12.2121
  P(Z < -12.2121) = 1.34012e-34
  Two-tailed p-value = 2.68024e-34

P-value of the Wilcoxon rank-sum test = 0.00 (test-stat = -12.2121)

Test for difference between y and x
Wilcoxon Signed-Rank Test
Null hypothesis: the median difference is zero

  n = 100
  W+ = 0, W- = 5050
  (zero differences: 0, non-zero ties: 0)
  Expected value = 2525
  Variance = 84587.5
  z = -8.68005
  P(Z < -8.68005) = 1.97796e-18
  Two-tailed p-value = 3.95591e-18

Parametric regression-based test for differences between categories

This example loads the well-known cross-sectional MROZ dataset. By means of an OLS regression employing a level dummy, we want to test whether men earn higher wages in large cities than in small cities.

open mroz87.gdt

boxplot HW CIT --factorized --output=display \
  { set title "Wages of men in small and large cities" font ',15'; }

# Regression for explaining Husband's wage by CIT (0: lives in small city, 1: lives in large city)
ols HW const CIT --robust   # standard errors robust to possible heteroskedasticity
printf "\nThe null hypothesis that wages in large cities are equal\n\
  to wages in small cities can be rejected at the %.2f pct.\n\
  significance level\n", $pvalue * 100

# Run a restriction by hand
help restrict

# Test the null that hourly wages are on average $3 higher in large cities
restrict --bootstrap
    b[CIT] = 3
end restrict
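
As with the other test commands, restrict fills the $test and $pvalue accessors, so the (here bootstrap-based) p-value can be picked up afterwards. A minimal sketch, continuing the script above:

printf "\nP-value of H0: b[CIT] = 3 is %.3f\n", $pvalue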