Named after Boltzmann's Η-theorem, Shannon's entropy Η (Greek capital letter eta) of a discrete random variable X with possible values {x1, ..., xn} and probability mass function P(X) is defined as:
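$$\mathrm{H}(X) = \mathrm{E}[\mathrm{I}(X)] = \mathrm{E}[-\log_b P(X)]$$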
Here E is the expected value operator, and I is the information content of X. I(X) is itself a random variable. The entropy can explicitly be written as:
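$$\mathrm{H}(X) = \sum_{i=1}^{n} P(x_i)\,\mathrm{I}(x_i) = -\sum_{i=1}^{n} P(x_i)\log_b P(x_i)$$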
where b is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10; the corresponding units of entropy are bits for b = 2, nats for b = e, and bans for b = 10.
One may also define the conditional entropy of two random variables X and Y, taking values xi and yj respectively, as:
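$$\mathrm{H}(X \mid Y) = -\sum_{i,j} p(x_i, y_j)\log_b \frac{p(x_i, y_j)}{p(y_j)}$$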
where p(xi, yj) is the probability that X = xi and Y = yj. This quantity should be understood as the amount of randomness remaining in the random variable X once Y is known [SOURCE].
This project calculates the Shannon entropy of a given text message based on symbol frequencies.
All the code required to get started is in the file shannon-entropy.py; only a working installation of Python 3 is necessary [LINK].
After the user enters a message, the program iterates over the given string (m), separating each character (symbol) and calculating its frequency (probability) over the length of m. Besides Shannon's entropy, the program also computes the number of bits needed to optimally encode the message and the metric entropy. Such an optimal encoding allocates fewer bits to the frequently occurring symbols and longer bit sequences to the more infrequent ones [SOURCE].
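As a rough illustration of that calculation, here is a minimal Python sketch. It is not the exact contents of shannon-entropy.py; the function name is a placeholder, and the assumptions that the encoding estimate uses rounded bits per symbol and that the metric entropy is the entropy divided by the message length are inferred from the sample output below.

```python
from collections import Counter
from math import log2

def shannon_entropy(message: str) -> float:
    """Shannon entropy H(X) in bits, computed from the symbol frequencies of message."""
    n = len(message)
    counts = Counter(message)                 # occurrences of each symbol
    return -sum((c / n) * log2(c / n) for c in counts.values())

message = "abracadabra"
h = shannon_entropy(message)                  # ~2.04039 bits per symbol
optimal_bits = round(h) * len(message)        # rounded bits/symbol times message length
metric_entropy = h / len(message)             # entropy normalized by message length

print(f"H(X) = {h:.5f} bits")
print(f"Approximate optimal encoding: {optimal_bits} bits")
print(f"Metric entropy: {metric_entropy:.5f}")
```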
This is sample output for the string "abracadabra":
Enter the message: abracadabra
Symbol-occurrence frequencies:
b --> 0.18182 -- 2
d --> 0.09091 -- 1
a --> 0.45455 -- 5
r --> 0.18182 -- 2
c --> 0.09091 -- 1
H(X) = 2.04039 bits. Rounded to 2 bits/symbol (bits per byte),
it will take 22 bits to optimally encode "abracadabra"
Metric entropy: 0.18549
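In this run, H(X) ≈ 2.04039 bits per symbol rounds to 2 bits, so the 11-symbol message needs about 2 × 11 = 22 bits, and the metric entropy is the entropy divided by the message length: 2.04039 / 11 ≈ 0.18549.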
For questions or comments:
- Author: Gianni Perez @ skylabus.com or at gianni.perez@gmail.com