
Data handling

atecon edited this page Jan 19, 2024 · 5 revisions

This section shows some examples of how to handle datasets.

Create an artificial dataset

Cross-sectional dataset

set verbose off  # avoid detailed printouts
clear   # clear memory

nulldata 3  # three observations (rows)

# Create a normally-distributed random variable
scalar mean = 4
scalar std_dev = 0.5
series y = normal(mean, std_dev)

series y_sq = y^2
series log_y = log(y)
series exp_y = exp(y)
series x = {1, 2, 3}'
series z = y - x

print y y_sq log_y exp_y x z --byobs

Returns the output:

             y         y_sq        log_y        exp_y            x

1     3.035885      9.21660     1.110503      20.8194            1
2     4.053212     16.42853     1.399510      57.5821            2
3     4.993476     24.93480     1.608132     147.4481            3

             z

1     2.035885
2     2.053212
3     1.993476

Add markers to a dataset

By means of the markers command one can attach labels to row indices (observations). These labels are also shown in some plots. The example involves a discrete series y with three distinct values and four observations. We attach four string values, stored in a string array S, to the observations by means of the markers command.

nulldata 4
series y = {1, 2, 3, 1}'
series x = normal()

strings S = defarray("Berlin", "Munich", "Hamburg", "Potsdam")
markers --from-array=S

print y x -o

This yields:

                   y            x

 Berlin            1    -1.385067
 Munich            2     0.955097
Hamburg            3     0.500555
Potsdam            1     1.256609
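Markers can also be copied back into an array or removed again. The following is a minimal sketch, assuming the --to-array and --delete options of the markers command behave as described in the gretl command reference; the array name M is chosen only for illustration:

```hansl
nulldata 4
series x = normal()

strings S = defarray("Berlin", "Munich", "Hamburg", "Potsdam")
markers --from-array=S

# Copy the current markers into an array of strings named M (hypothetical name)
markers --to-array=M
print M

# Remove the markers; observations are identified by their index again
markers --delete
print x --byobs
```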

Sample restriction

Restricting the dataset based on some condition(s) is a frequent task. The smpl command can be used for that. A sample dataset is used for illustration. We compute descriptive statistics for some variables:

open abdata.gdt --quiet
list Y = IND YEAR n   # list of series

# Statistics based on all data
summary Y --simple

# Restrict the data for years between 1976 and 1978
smpl YEAR >= 1976 && YEAR <= 1978 --restrict
summary Y --simple

smpl full  # restore the full sample
summary Y --simple

This yields:

                 Mean     Median       S.D.        Min        Max
IND             5.123      5.000      2.678      1.000      9.000
YEAR             1980       1980      2.583       1976       1984
n               1.056     0.8272      1.342     -2.263      4.687


                 Mean     Median       S.D.        Min        Max
IND             5.025      4.000      2.649      1.000      9.000
YEAR             1977       1977     0.8175       1976       1978
n               1.185     0.9738      1.346     -2.010      4.609


                 Mean     Median       S.D.        Min        Max
IND             5.123      5.000      2.678      1.000      9.000
YEAR             1980       1980      2.583       1976       1984
n               1.056     0.8272      1.342     -2.263      4.687
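Restrictions set with --restrict are cumulative by default, so a second restriction narrows the current subsample rather than the full dataset. The sketch below illustrates this with the same abdata.gdt file, using the --replace option to discard earlier restrictions and the $nobs accessor to report the sample size. The particular conditions are assumptions chosen only for illustration:

```hansl
open abdata.gdt --quiet

smpl YEAR >= 1976 && YEAR <= 1978 --restrict
printf "Years 1976-1978: %d observations\n", $nobs

# A second restriction is applied on top of the first one
smpl IND == 1 --restrict
printf "Years 1976-1978 and IND equal to 1: %d observations\n", $nobs

# --replace discards the earlier restrictions instead of adding to them
smpl YEAR == 1984 --restrict --replace
printf "Year 1984 only: %d observations\n", $nobs

smpl --full   # restore the full sample
```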

Series

Binary dummies

Let's open a sample dataset shipped with Gretl and create a binary dummy which takes the value 1 if series YEAR is either 1977 or 1980, and 0 otherwise:

open abdata.gdt --quiet
series DUM = (YEAR == 1977 || YEAR == 1980)
print YEAR DUM -o --range=1:10    # print the first ten entries 

The output is:

            YEAR          DUM

1:1         1976            0
1:2         1977            1
1:3         1978            0
1:4         1979            0
1:5         1980            1
1:6         1981            0
1:7         1982            0
1:8         1983            0
1:9         1984            0
2:1         1976            0

String valued series

In Gretl you can also create a string-valued series. First we assume that we have a series y with discrete values 1, 2 and 3. We then create an array of strings, series_labels, with three entries which get attached to y by means of the stringify() function:

set seed 1234
nulldata 20

series y = randgen(i, 1, 3)
setinfo y --description="3 different categories"
print y --byobs --range=1:5

# Create strings for categories 1, 2 and 3
strings series_labels = defarray("Low income", "Medium income", "High income")

# Attach strings to categorical series
help stringify
stringify(y, series_labels)

print y --byobs --range=1:5

This code returns:

              y

1 Medium income
2    Low income
3    Low income
4    Low income
5   High income

Instead of a string array, the user can also refer to a text file containing the string values. See the help for stringify().
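Once a series is string-valued, the attached strings can be read back and used in comparisons. This is a minimal sketch, assuming that strvals() returns the array of attached strings and that a string-valued series may be compared with a string literal via ==; the series name low is hypothetical:

```hansl
set seed 1234
nulldata 20

series y = randgen(i, 1, 3)
strings series_labels = defarray("Low income", "Medium income", "High income")
stringify(y, series_labels)

# Retrieve the attached strings as an array
strings labels = strvals(y)
print labels

# Dummy for one category, comparing against the string value (assumed to be supported)
series low = (y == "Low income")
print y low --byobs --range=1:5
```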

Metadata

Add metadata to series

A series object in Gretl can include some metadata such as a descriptive label. One can also set the name which should appear when plotting a series. Here is an example:

nulldata 3
series y = normal()

# Add a series description
setinfo y --description="Some random number"

# Instead of 'y' showing up in a graph, show another description
setinfo y --graph-name="Cool variable"

boxplot y --output=display   # See the output

Replace values

Series

Suppose you have a weirdly valued dataset such as:

set verbose off
nulldata 5
series weird_values = {5, 6, 10, 20, NA}'
print weird_values --byobs

By means of the replace() function, we want to replace value 5 by 0, 6 by 1, 10 by 2, 20 by 3 and missing values (NA) by -1:

# Let’s replace values
help replace
matrix find = {5, 6, 10, 20, NA}
matrix replace_by = {0, 1, 2, 3, -1}

# Create new series y with replaced values 
series y = replace(weird_values, find, replace_by)

print weird_values y --byobs 

The result is:

  weird_values            y

1            5            0
2            6            1
3           10            2
4           20            3
5                        -1

More complicated example

Suppose you have a dataset with integer values ranging from 0 to 20. You want to replace the numbers 0-5 by 1, 6-10 by 2 and 11-20 by 3. This can be done with repeated calls to replace():

nulldata 40    # some empty dataset

# Discrete random numbers between 0 and 20
series old = randgen(i, 0, 20)
# print old --byobs

series new = NA  # Initialize an empty series

# Replace 0-5 by 1
matrix find = seq(0, 5)
scalar subst = 1
series new = replace(old, find, subst)

# Replace 6-10 by 2
matrix find = seq(6, 10)
scalar subst = 2
series new = replace(new, find, subst)

# Replace 11-20 by 3
matrix find = seq(11, 20)
scalar subst = 3
series new = replace(new, find, subst)

print old new --byobs

This gets you (first ten observations shown):

           old          new

 1           15            3
 2           13            3
 3           10            2
 4           20            3
 5            6            2
 6            5            1
 7            5            1
 8            8            2
 9           10            2
10           14            3
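Because the three ranges are contiguous, the same recoding can also be written with comparisons, each of which evaluates to 0 or 1. The following sketch is an alternative to the replace() approach above, not a substitute prescribed by the original; the names new_alt and n_diff are hypothetical:

```hansl
set seed 1234
nulldata 40
series old = randgen(i, 0, 20)

# Recode via replace(), as above
series new = replace(old, seq(0, 5), 1)
new = replace(new, seq(6, 10), 2)
new = replace(new, seq(11, 20), 3)

# Alternative: 1 for 0-5, plus 1 if above 5, plus 1 more if above 10
series new_alt = 1 + (old > 5) + (old > 10)

# Count the observations where the two recodings differ
scalar n_diff = sum(new != new_alt)
printf "Differences: %d\n", n_diff
```

Both recodings should agree for every observation, so the reported count is expected to be zero. The comparison-based form only works when the ranges are ordered and cover all values; replace() remains the more general tool.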