Data handling
This section shows some examples of how to handle datasets.
set verbose off # avoid detailed printouts
clear # clear memory
nulldata 3 # three observations (rows)
# Create a normally-distributed random variable
scalar mean = 4
scalar std_dev = 0.5
series y = normal(mean, std_dev)
series y_sq = y^2
series log_y = log(y)
series exp_y = exp(y)
series x = {1, 2, 3}'
series z = y - x
print y y_sq log_y exp_y x z --byobs
Returns the output:
y y_sq log_y exp_y x
1 3.035885 9.21660 1.110503 20.8194 1
2 4.053212 16.42853 1.399510 57.5821 2
3 4.993476 24.93480 1.608132 147.4481 3
z
1 2.035885
2 2.053212
3 1.993476
By means of the markers command one can attach labels to row indices (observations). These labels are also shown in some plots. The example involves a discrete series y with three distinct values. We attach four strings, stored in a string array S, to the four observations by means of the markers command.
nulldata 4
series y = {1, 2, 3, 1}
series x = normal()
strings S = defarray("Berlin", "Munich", "Hamburg", "Potsdam")
markers --from-array=S
print y x -o
This yields:
y x
Berlin 1 -1.385067
Munich 2 0.955097
Hamburg 3 0.500555
Potsdam 1 1.256609
Restricting the dataset based on some condition(s) is a frequent task, and the smpl command can be used for that. A sample dataset is used for illustration. We compute descriptive statistics for some variables, first on the full sample, then on the restricted sample, and finally on the restored full sample:
open abdata.gdt --quiet
list Y = IND YEAR n # list of series
# Statistics based on all data
summary Y --simple
# Restrict the data for years between 1976 and 1978
smpl YEAR >= 1976 && YEAR <= 1978 --restrict
summary Y --simple
smpl full # restore the full sample
summary Y --simple
This yields:
Mean Median S.D. Min Max
IND 5.123 5.000 2.678 1.000 9.000
YEAR 1980 1980 2.583 1976 1984
n 1.056 0.8272 1.342 -2.263 4.687
Mean Median S.D. Min Max
IND 5.025 4.000 2.649 1.000 9.000
YEAR 1977 1977 0.8175 1976 1978
n 1.185 0.9738 1.346 -2.010 4.609
Mean Median S.D. Min Max
IND 5.123 5.000 2.678 1.000 9.000
YEAR 1980 1980 2.583 1976 1984
n 1.056 0.8272 1.342 -2.263 4.687
Let's open a sample dataset shipped with Gretl and create a binary dummy which takes the value 1 if the series YEAR is either 1977 or 1980, and 0 otherwise:
open abdata.gdt --quiet
series DUM = (YEAR == 1977 || YEAR == 1980)
print YEAR DUM -o --range=1:10 # print the first ten entries
The output is:
YEAR DUM
1:1 1976 0
1:2 1977 1
1:3 1978 0
1:4 1979 0
1:5 1980 1
1:6 1981 0
1:7 1982 0
1:8 1983 0
1:9 1984 0
2:1 1976 0
In Gretl you can also create a string-valued series. First we assume that we have a series y with discrete values 1, 2 and 3. We then create an array of strings, series_labels, with three entries which get attached to y by means of the stringify() function:
set seed 1234
nulldata 20
series y = randgen(i, 1, 3)
setinfo y --description="3 different categories"
print y --byobs --range=1:5
# Create strings for categories 1, 2 and 3
strings series_labels = defarray("Low income", "Medium income", "High income")
# Attach strings to categorical series
help stringify
stringify(y, series_labels)
print y --byobs --range=1:5
After stringify(), the second print command returns:
y
1 Medium income
2 Low income
3 Low income
4 Low income
5 High income
Instead of a string array, the user can also refer to a text file containing the string values. See the help on the stringify() function.
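To check which labels ended up attached to a string-valued series, the distinct string values can be read back. The following is a minimal sketch, assuming the strvals() function is available in your Gretl version; the array name attached is only illustrative:

```hansl
set seed 1234
nulldata 20
series y = randgen(i, 1, 3)
strings series_labels = defarray("Low income", "Medium income", "High income")
stringify(y, series_labels)

# Read back the distinct string values of the string-valued series
# (assumes strvals() exists in the installed Gretl version)
strings attached = strvals(y)
print attached
```

If the call succeeds, attached should hold the same labels that were passed to stringify(), ordered by the underlying numeric codes.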
A series object in Gretl can carry metadata such as a descriptive label. One can also set the name which should appear when plotting a series. Here is an example:
nulldata 3
series y = normal()
# Add a series description
setinfo y --description="Some random number"
# Instead of 'y' showing up in a graph, show another description
setinfo y --graph-name="Cool variable"
boxplot y --output=display # See the output
Suppose you have a weirdly valued dataset such as:
set verbose off
nulldata 5
series weird_values = {5, 6, 10, 20, NA}'
print weird_values --byobs
By means of the replace() function, we want to replace the value 5 by 0, 6 by 1, 10 by 2, 20 by 3 and missing values (NA) by -1:
# Let’s replace values
help replace
matrix find = {5, 6, 10, 20, NA}
matrix replace_by = {0, 1, 2, 3, -1}
# Create new series y with replaced values
series y = replace(weird_values, find, replace_by)
print weird_values y --byobs
The result is:
weird_values y
1 5 0
2 6 1
3 10 2
4 20 3
5 -1
Suppose you have a dataset with integer values ranging from 0 to 20. You want to replace the numbers 0-5 by 1, 6-10 by 2 and 11-20 by 3. This can be done by calling replace() once per range:
nulldata 40 # some empty dataset
# Discrete random numbers between 0 and 20
series old = randgen(i, 0, 20)
# print old --byobs
series new = NA # Initialize an empty series
# Replace 0-5 by 1
matrix find = seq(0, 5)
scalar subst = 1
series new = replace(old, find, subst)
# Replace 6-10 by 2
matrix find = seq(6, 10)
scalar subst = 2
series new = replace(new, find, subst)
# Replace 11-20 by 3
matrix find = seq(11, 20)
scalar subst = 3
series new = replace(new, find, subst)
print old new --byobs
This yields (only the first ten observations are shown):
old new
1 15 3
2 13 3
3 10 2
4 20 3
5 6 2
6 5 1
7 5 1
8 8 2
9 10 2
10 14 3
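As an alternative to the repeated replace() calls, the same recoding can be sketched as a single expression built from logical comparisons. This is only one possible approach, and it assumes that old has no missing values and lies within 0 to 20; the name new_alt is illustrative:

```hansl
nulldata 40
series old = randgen(i, 0, 20)

# Each comparison evaluates to 1 (true) or 0 (false) per observation,
# so exactly one of the three terms is non-zero for values in 0..20
series new_alt = 1 * (old <= 5) + 2 * (old >= 6 && old <= 10) + 3 * (old >= 11)
print old new_alt --byobs
```

This form avoids intermediate states in which already recoded values could collide with a later search range, which is something to watch for when chaining replace() calls.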