Supervision Meetings
This page contains some notes, and questions for the weekly meetings.
- Week 3
- Week 4
- Week 5
- Week 6
- Week 7
- Week 8
- Week 9
- Week 10
- Week 11
- Week 12
- Week 13
- Week 14
- Week 15
- Week 16
- Week 17
- Week 19
Discussed the thesis report:
- Descriptive statistics for elevation data should be included.
- Discussion should be somewhat self-contained (~2 pages), and include
- what we set out to do,
- what we have done,
- main findings,
- what went wrong, and what can we learn from this,
- what could be done in the future,
- what open questions are there.
that is, the results of the thesis should be put into a broader perspective.
- Meta-Data Fusion: could be seen as a weighted ranking rather than a computation of probabilities.
- The PDFs of the random variables X, estimated from the observed elevations O, are now a mixture of a Gaussian and a uniform distribution, weighted towards the Gaussian when |O| is large.
- Do we have enough results to concentrate on the write-up?
- Run 2 x Resnet experiments for whole data set
- Make plots reproducible (rewrite some old plot code)
- Separate Evaluation and Predictions
The majority of the recordings in the BirdCLEF 2016 data set have meta data containing the elevation at which the bird song was recorded. This information could probably be used to improve classification accuracy.
Let Y = {y_1, ..., y_n} be the sound classes of the data set (n = 999), and let T = {r_1, ..., r_m} be the training set. Let e(r) be a function which returns the elevation of recording r, and let f(r) be the true classifier which returns the ground-truth sound class of r. We then create the elevation observations for each sound class y_i in Y by O_i = {e(r) | r in T, f(r) = y_i} for i = 1, ..., n. A random variable X_i with distribution N(mu_i, sigma_i) is estimated for each O_i, where mu_i = mean(O_i) and sigma_i = std(O_i). The classification of a recording r can now be done by combining classifier_score = f_w(r) = (p_1, ..., p_n) and elevation_score = (P(X_1 = e(r)), ..., P(X_n = e(r))), where f_w is the classifier parametrized by w, and p_i is the estimated probability that r belongs to sound class y_i.
The final classification score for sound class y_i is then
p_i * P(X_i = e(r)) / (p_1 * P(X_1 = e(r)) + ... + p_n * P(X_n = e(r))),
i.e. the combined scores normalized by their sum.
Using Bayes' theorem,
Pr(A|X) = Pr(X|A)Pr(A) / Pr(X),
where A is the event that r is species y, and X is the event that the elevation is e(r).
A possible problem with this approach is that the number of elevation observations may not be enough to get a good estimate of the distribution. This means that a recording of a bird made at an "unusual" elevation, where it has not been observed before, could result in a false classification.
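A minimal sketch of this combination, assuming per-class Gaussians estimated with SciPy and a classifier that returns a probability vector. All names are illustrative rather than the project's actual API, and the `mixture_density` variant reflects the Gaussian + uniform weighting mentioned in the meeting notes above, with an assumed weighting function and elevation range.

```python
import numpy as np
from scipy.stats import norm, uniform

def fit_elevation_models(observations_per_class):
    """Estimate (mu_i, sigma_i) from the observed elevations O_i of each sound class."""
    return [(np.mean(o), np.std(o) + 1e-6) for o in observations_per_class]

def mixture_density(elevation, mu, sigma, n_obs, elev_min=0.0, elev_max=5000.0, k=10.0):
    """Gaussian + uniform mixture, weighted towards the Gaussian when |O_i| is large.

    The weight w = n / (n + k) and the elevation range are assumptions."""
    w = n_obs / (n_obs + k)
    return (w * norm.pdf(elevation, mu, sigma)
            + (1.0 - w) * uniform.pdf(elevation, loc=elev_min, scale=elev_max - elev_min))

def fuse_scores(classifier_probs, elevation, elevation_models):
    """Combine p_i with P(X_i = e(r)) and normalize, as in Bayes' theorem."""
    elev_scores = np.array([norm.pdf(elevation, mu, sigma)
                            for mu, sigma in elevation_models])
    combined = np.asarray(classifier_probs) * elev_scores
    return combined / combined.sum()

# Hypothetical usage:
# models = fit_elevation_models(observations_per_class)
# fused = fuse_scores(model.predict(segment), elevation_of(recording), models)
```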
CubeRun | With elevation | Without elevation |
---|---|---|
Top-1 | 0.645 | 0.637 |
Top-5 | 0.815 | 0.801 |
MAP | 0.685 | 0.675 |
AUROC | 0.977 | 0.969 |
Coverage Error | 20.7 | 25.6 |
Label Ranking Average Precision | 0.724 | 0.714 |
Ranking Loss | 0.020 | 0.025 |
The most notable improvement is in the coverage error, where we on average need to predict only the ~21 most probable classes to cover all ground-truth labels, instead of ~26.
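For reference, these ranking metrics are available in scikit-learn; a minimal sketch assuming a binary indicator matrix of ground-truth labels and a matrix of (fused) class scores:

```python
from sklearn.metrics import (coverage_error, label_ranking_average_precision_score,
                             label_ranking_loss, roc_auc_score)

# y_true: (n_samples, n_classes) binary indicator matrix of ground-truth labels
# y_score: (n_samples, n_classes) raw or elevation-fused classifier scores
def ranking_metrics(y_true, y_score):
    return {
        "coverage_error": coverage_error(y_true, y_score),
        "label_ranking_average_precision": label_ranking_average_precision_score(y_true, y_score),
        "ranking_loss": label_ranking_loss(y_true, y_score),
        "auroc": roc_auc_score(y_true, y_score, average="macro"),
    }
```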
The number of predictions for a class is higher than expected when there are many training segments, and lower than expected when there are few training segments.
- Structure
- Baseline
- Results
- Methods
- Figure out points to make
- What evidence do we have for this?
- Add results which do not really support the main points but could be interesting to round up to an Appendix.
- Running times for the networks
- Version number of libraries
- How does one recreate the plots?
- What hardware is used?
Show histogram plots of elevation for each sound class.
Looking at the confusion matrix for the bot 100 sound classes for the residual network, we can see that the network tends to predict sample points as sound class 12 or 42; these get 14 and 15 predictions respectively, while other classes get around 2-4 predictions.
The number of training segments for sound classes 12 and 42 is 359 and 756 respectively, while the number of training segments for the other sound classes is on average less than 100, probably around 70 (by manual observation). This means that the sound classes which the network favors are probably over-represented.
- Confusion Matrix
- Strategies on Generalization
- Run 240 bot 100
In this confusion matrix there seems to be some structure in the confusions; there is something of a vertical line along sound class 42(?), which suggests that the network may favor this class as a prediction.
Training the CubeRun model on the small data set with seven different optimizers. The default parameters from the Keras library were used for all optimizers except Stochastic Gradient Descent, which uses a parameter configuration close to the baseline.
The only two optimizers which give decent results are Adadelta and Stochastic Gradient Descent.
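A sketch of how such an optimizer comparison could be run in Keras (Keras 2 style; the model builder, data, and SGD settings are placeholders rather than the project's exact configuration):

```python
from keras.optimizers import SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, Nadam

optimizers = {
    "sgd": SGD(lr=0.01, momentum=0.9, nesterov=True),  # roughly baseline-like settings (assumed)
    "rmsprop": RMSprop(),
    "adagrad": Adagrad(),
    "adadelta": Adadelta(),
    "adam": Adam(),
    "adamax": Adamax(),
    "nadam": Nadam(),
}

histories = {}
for name, optimizer in optimizers.items():
    model = build_cuberun(nb_classes)  # placeholder for the project's model constructor
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    histories[name] = model.fit(X_train, Y_train,
                                validation_data=(X_valid, Y_valid),
                                epochs=50, batch_size=32)
```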
- Fundamental difference between data and method.
Training the network only on the 100 sound classes on which it performs worst may give some insight into what the underlying problem could be.
The validation accuracy stays very low while training on these classes, indicating that these classes are actually harder to learn. The validation loss is in fact increasing.
The characteristic shape of this curve stays the same; however, it is shifted towards the left, which indicates worse performance.
- Top 1: 0.410958904109589
- Top 2: 0.5205479452054794
- Top 3: 0.6118721461187214
- Top 4: 0.6621004566210046
- Top 5: 0.6986301369863014
- Mean Average Precision: 0.473481438051
- Area Under Curve: 0.902126285688
Training on the same species (bot 100) but with a reshuffled data set (train/valid 90/10).
- Top 1: 0.42570281124497994
- Top 2: 0.5381526104417671
- Top 3: 0.6104417670682731
- Top 4: 0.6305220883534136
- Top 5: 0.678714859437751
- Mean Average Precision: 0.48241371466
- Area Under Curve: 0.895622895623
Similar results as for CubeRun can be observed for ResNet when training on bot 100.
In [16]: a.intersection_all([bot_10_1, bot_10_2, bot_10_3, bot_10_4, bot_10_5])
Out[16]:
['nigricollis',
'affinis',
'conspicillatus',
'fuscorufa',
'amazonum',
'chrysoptera']
In [17]: a.intersection_all([top_10_1, top_10_2, top_10_3, top_10_4, top_10_5])
Out[17]:
['rufimarginatus',
'erythrocercum',
'martinicus',
'brissonii',
'chloropterus',
'certhia',
'albescens']
In [18]: a.intersection_all([top_10_1, top_10_2, top_10_3, top_10_4, top_10_5, bot_10_1, bot_10
...: _2, bot_10_3, bot_10_4, bot_10_5])
Out[18]: []
On the small data set it is obvious that the residual neural network is not as stable in validation accuracy; the variance of its "generalization" is quite high. It may be that the network is more prone to overfitting.
- Try all optimizers
- Train 10 times for CubeRun/ResNet and check variance of top/bot
- Train MFCCs with fewer features
- Train on bot/top 100
- Training Resnet and CubeRun on large dataset.
- Performed sanity check on MFCC input data
- Performed a sanity check on classification by implementing a simple method which takes a model and a sound file as input and returns the predicted species. Just to ensure that it actually predicts a species.
- Maintenance work on the code base (restructured how results are saved and stored).
- Found and read a new ResNet publication (https://arxiv.org/pdf/1605.07146v3.pdf), which argues that the width of the network may be more important than depth due to diminishing feature reuse.
- Tempogram input data (seems to take too much time to compute)
- Which experiments would be interesting to run next?
- Should we look into "Wide Residual Neural Networks" (https://arxiv.org/pdf/1605.07146v3.pdf)?
- The training/validation split is 90/10, which is the same as in the baseline, for comparable results. However, the rather small amount of validation data may make the validation set more susceptible to "bad choices". E.g., for small sound classes we have only 1-2 validation data points; since these are randomly selected, we may very well select two data points of poor quality, making that sound class hard to predict for the classifier.
- Validation data not representative.
I trained the CubeRun model on the small data set, and manually inspected the sound class with the worst classification accuracy (0%). By listening to the validation samples and training samples, I realized that the validation samples were barely audible, and the amount of actual bird vocals present was very low. In contrast, the training samples had lots of distinct bird vocals. Since the number of validation samples for each class is quite low, as low as 1-2 for some classes, it can be the case that the randomly selected validation samples are really bad recordings, leading to an extremely bad validation accuracy for that class.
It could be interesting to know if the ResNet and CubeRun have problem with the same sound classes.
In [68]: len(a.intersection(res_top_100, cube_top_100))
Out[68]: 31
In [69]: len(a.intersection(res_bot_100, cube_bot_100))
Out[69]: 54
In [70]: len(a.intersection(res_top_100, cube_bot_100))
Out[70]: 2
In [71]: len(a.intersection(res_bot_100, cube_top_100))
Out[71]: 3
As we can see, the bot 100 sound classes for ResNet and CubeRun share around 50% of the same classes. Since we have ~800 classes, two randomly drawn 100-class subsets would be expected to share about 100 * 100 / 800 = 12.5 classes, i.e. roughly 1/8 rather than 1/2, so there probably is a pattern here. Nor do the top 100 and bot 100 of the two classifiers seem to share many classes.
In [72]: len(a.intersection(res_top_200, cube_top_200))
Out[72]: 139
In [73]: len(a.intersection(res_bot_200, cube_bot_200))
Out[73]: 127
In [75]: len(a.intersection(res_top_200, cube_bot_200))
Out[75]: 6
In [76]: len(a.intersection(res_bot_200, cube_top_200))
Out[76]: 8
The top/bot 200 of ResNet and CubeRun share around 65% of the same sound classes, meaning that they perform well on a similar set of sound classes, and poorly on a similar set of sound classes. The cross sets (top of one network vs. bot of the other) share nearly no classes (<5%).
This plot clearly shows that the network performs very well for some classes, and extremely poorly for others. One would expect this to be mainly due to the uneven distribution of training samples per class; however, the effect of this is smaller than one might expect (as seen in the image below). Therefore there must be some other factor which lets the network classify certain classes with ease, and others only with great difficulty.
It may be of interest to train a classifier on the top 100 classes, and the bot 100 classes to see if the results are consistent. That is, if the accuracy for the top 100 remains high, and the accuracy for the bot 100 classes remains low.
It may also be of interest to run the training multiple times and see if the top 100 and bot 100 remain roughly the same. If they do not, it may be that the network gets stuck in different local optima which favor the prediction of some classes over others. If they do, there is something fundamentally hard about the bot 100 classes. However, with the current computational resources available this investigation may not be feasible. (Each training round takes roughly 3 days to complete.) But we could take a subset of, say, 50 sound classes, train on them a couple of times, and see if the problems remain and how each training round affects them.
The plot shows the classification accuracy of each sound class (in descending order) plotted in the same graph as the mean energy of the training (green line) and validation (blue line) samples. The mean energy of the data points of a sound class is simply the total sum of the amplitude spectrogram for each data point in the sound class, divided by the total number of data points.
We can see that the energy distributions of the training set and the validation set are similar; however, it is not possible to see any correlation between classification accuracy and the mean energy of the data points, with the exception of the sound class with the worst validation accuracy, where the validation data barely contains any energy at all. This may explain why these validation recordings were barely audible.
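A sketch of the mean-energy computation described above (the spectrogram loader in the usage comment is a placeholder for the project's own code):

```python
import numpy as np

def mean_energy(spectrograms):
    """Mean energy of a sound class: the summed amplitude spectrogram of each
    data point, averaged over all data points in the class."""
    energies = [np.sum(s) for s in spectrograms]  # total energy per segment
    return np.mean(energies)

# Hypothetical usage, one value per sound class:
# class_energy = {y: mean_energy(load_spectrograms(y)) for y in sound_classes}
```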
The number of training samples for each sound class does affect the resulting accuracy for that class. However, not as much as one might expect.
"However, note that a lesser but still substantial improvement over the baseline MFCC system can usually be attained simply by using the raw Mel spectral data as input rather than MFCCs. One of the long-standing motivations for the MFCC transformation has been to reduce spectral data down to a lower dimensionality while hoping to preserve most of the implicit semantic information; but as we have seen, the random forest classifier performs well with high-dimensional input, and such data reduction is not necessary and often holds back classification performance. Future investigators should consider using Mel spectra as a baseline, rather than MFCCs as is common at present" - (https://arxiv.org/pdf/1405.6524.pdf)
There are also results in this paper showing that there could be a sweet spot for the number of features used when using MFCCs: http://www98.griffith.edu.au/dspace/bitstream/handle/10072/54461/81896_1.pdf?sequence=1 (page 8)
In this plot the accuracy of each "chunk" is ordered by how many training samples are available. That is, the average accuracy of the 10% of the species with the most training samples available is plotted at x=0, and so forth.
We can see that there is a trend where the accuracy becomes lower with fewer training samples, which is what one would expect, since the network will see the samples of the "small" sound classes less often, and thus learn them less well. However, the accuracy does not collapse as the number of samples decreases; the network still performs rather evenly. Bear in mind that the largest sound classes have ~200 training samples, while the smallest ones have ~10, yet the prediction accuracy only varies from ~0.65 to ~0.48.
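A sketch of how such a chunked plot could be produced, assuming per-class accuracies and training-segment counts are available as arrays (names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def chunked_accuracy(class_accuracy, train_counts, nb_chunks=10):
    """Average per-class accuracy in chunks ordered by training-set size.

    Chunk 0 contains the 10% of species with the most training samples,
    the last chunk the 10% with the fewest."""
    order = np.argsort(train_counts)[::-1]                       # most samples first
    chunks = np.array_split(np.asarray(class_accuracy)[order], nb_chunks)
    return [np.mean(c) for c in chunks]

# plt.plot(chunked_accuracy(acc_per_class, segments_per_class))
# plt.xlabel("chunk (0 = most training samples)"); plt.ylabel("mean accuracy")
```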
During the half-time report the project's progress was discussed, and the examiner approved of the progress of the thesis. Further investigative methods were discussed, as well as what the next steps could be.
- Investigate why augmented MFCCs do not work as well as logarithmic spectral input data
- Double-check implementation
- Perform a sanity check on input data (just in case something is wrong)
- Investigate which parts of the data set are hard for the respective classifiers
- Resnet: top/bot 25 classes
- CubeRun: top/bot 25 classes
- Investigate accuracy with respect to number of training samples
- Resnet
- Cuberun
- Further investigation metrics
- accuracy with respect to length of song / energy of segment
- Possible improvements
- Tempogram
- Data fusion (altitude/location/time/biotope)
The amount of training data for each sound class is very uneven; it could make sense to weight the samples, or at the very least to see how this correlates with classification accuracy (a weighting sketch follows the table below).
Top 10 (training samples) | Bot 10 (training samples) |
---|---|
leucophrys 212 | gymnops 9 |
flaveola 190 | pretrei 9 |
gujanensis 141 | melanoleucus 9 |
viridis 138 | candei 9 |
albicollis 135 | bolivianus 9 |
rufus 119 | novacapitalis 9 |
ruficauda 119 | gyrola 9 |
capensis 118 | squamigera 9 |
guttatus 118 | luteiventris 9 |
longirostris 115 | sp.nov.Alto_Pisones 0 |
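One simple way to counteract this imbalance, as suggested above, is per-class sample weighting; a sketch using Keras's class_weight argument with inverse-frequency weights (the exact weighting scheme is an assumption, not something tried in the project yet):

```python
import numpy as np

def inverse_frequency_weights(segments_per_class):
    """Map class index -> weight, inversely proportional to the number of training segments."""
    counts = np.asarray(segments_per_class, dtype=float)
    counts = np.maximum(counts, 1.0)          # guard against classes with zero segments
    weights = counts.mean() / counts
    return {i: w for i, w in enumerate(weights)}

# class_weight = inverse_frequency_weights(segments_per_class)
# model.fit(X_train, Y_train, class_weight=class_weight, validation_data=(X_valid, Y_valid))
```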
- Discuss MFCC features
- Discuss resizing
- Discuss data fusion (elevation)
- Discuss evaluation (top 10, median 10, bottom 10) (validation data may be too sparse for a fair comparison)
- Discuss noise reduction; could it be interesting to preprocess the audio files even more?
CubeRun MFCC without any augmentation at all (resized from 128x130 to 256x256).
CubeRun MFCC with data augmentation (noise/same class/time shift/pitch shift)
CubeRun MFCC with data augmentation, and Multiple-Width Frequency-Delta Data Augmentation
CubeRun Spectrogram with data augmentation
ResNet MFCC with data augmentation, and MWFD
128x130 MFCC matrix
256x256 MFCC matrix (resized)
256x512 Log Spectrogram (reference)
The baseline model has been evaluated on the large (~25000 file) data set. The results are close to what would be expected according to the paper, if not a bit better.
- Top 1: 2246
- Top 2: 2561
- Top 3: 2707
- Top 4: 2806
- Top 5: 2862
- Mean Average Precision: 0.653409677675
- Area Under Curve: 0.974919928416
- Total predictions: 3657
- What are the next steps? Multiple-Width Frequency-Delta Data Augmentation? Maybe try out using a tempogram as a data augmentation technique?
- Next week is week 10, meaning that we should schedule a half-time presentation late next week, or possibly the week after that.
It seems that the combination of dropout followed by a batch normalization on the input layer is what is causing the bad results. Without dropout the CubeRun model reaches a mAP of around 0.84. With dropout it is no good at all.
I have found a great library called librosa which provides a lot of feature extraction methods. The current implementation now uses librosa.stft to compute the spectrograms. There are methods for computing MFCCs as well as deltas of the MFCCs in librosa, meaning that it should be no problem to implement the Multiple-Width Frequency-Delta data augmentation. It may also be interesting to play around with the rhythmic representations, such as tempograms, which are available in the library.
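For reference, a sketch of the relevant librosa calls (the file path, n_fft, and hop_length are placeholders, not the project's settings):

```python
import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=22050)                  # path is illustrative

S = np.abs(librosa.stft(y, n_fft=512, hop_length=128))           # amplitude spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)               # MFCC features
mfcc_delta = librosa.feature.delta(mfcc)                         # first-order deltas
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)               # second-order deltas
tempogram = librosa.feature.tempogram(y=y, sr=sr)                # rhythmic representation
```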
Since most popular models (VGG16, VGG19, xception, inception_v3) are readily available in the Keras library, it may be of interest to test all of them and see if any other model performs well in this problem domain; it would also make the study more comprehensive.
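A sketch of how these architectures could be instantiated from keras.applications with a small classification head for this task (Keras 2 style; the input shape and the three-channel stacking of the spectrograms are assumptions):

```python
from keras.applications.vgg16 import VGG16
from keras.applications.vgg19 import VGG19
from keras.applications.xception import Xception
from keras.applications.inception_v3 import InceptionV3
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

def build_from_base(base_cls, nb_classes, input_shape=(256, 256, 3)):
    """Wrap a keras.applications base model with a small classification head.

    The single-channel spectrograms are assumed to be stacked to three
    channels to fit the stock architectures."""
    base = base_cls(weights=None, include_top=False, input_shape=input_shape)
    x = GlobalAveragePooling2D()(base.output)
    predictions = Dense(nb_classes, activation="softmax")(x)
    return Model(inputs=base.input, outputs=predictions)

# for base_cls in [VGG16, VGG19, Xception, InceptionV3]:
#     model = build_from_base(base_cls, nb_classes=999)
#     model.compile(optimizer="adadelta", loss="categorical_crossentropy", metrics=["accuracy"])
```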
Without dropout |
---|
This model reaches a mAP of 0.84
Without batch normalization |
---|
With dropout and batch normalization |
---|
The much deeper residual network seems to perform well, and reaches a mAP of around 0.86.
34 layer Resnet |
---|
- Contacted Elias Sprengel and used his [feedback](https://github.com/johnmartinsson/bird-species-classification/wiki/Journal#2016-12-14).
- Evaluated the CubeRun model again, but it still performs much better when trained on amplitude spectrograms.
- Trained a deep residual network model.
The CubeRun model seems to converge quite quickly, and keeps a validation accuracy of around 0.7 after 100 epochs. The Resnet model does not seem to have converged after 60 epochs, and could use more training. Neither of these accuracy scores is the mean over a whole file; rather, they are the accuracy over all segments of all files. The results are expected to improve when taking the mean over a file. Last time, an accuracy of 0.57 for the CubeRun model over the segments in the validation set increased to an accuracy of 0.76 when taking the mean of the segments for each file.
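A sketch of the file-level evaluation described above, averaging the segment predictions of each file before taking the arg-max (names are illustrative):

```python
import numpy as np

def file_level_prediction(model, segments):
    """Average the class probabilities over all segments of one file."""
    probs = model.predict(np.asarray(segments))   # (nb_segments, nb_classes)
    return np.argmax(probs.mean(axis=0))          # single prediction for the file

# accuracy = np.mean([file_level_prediction(model, segs) == label
#                     for segs, label in validation_files])
```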
Time:
- CubeRun (5 conv layers): 60s/epoch
- Resnet (18 conv layers): 240s/epoch
CubeRun 240 Epochs on Amplitude Spectrogram (best: 0.74) |
---|
- Top 1: 100
- Top 2: 105
- Top 3: 113
- Top 4: 114
- Top 5: 116
- Mean Average Precision: 0.846718206657
- Area Under Curve: 0.966350301984
- Total predictions: 122
CubeRun 240 Epochs on Log Amplitude Spectrogram |
---|
Resnet 60 Epochs on Log Amplitude Spectrogram (best: 0.6) |
---|
Evaluated the baseline with and without augmentation, and with and without log amplitude spectrogram. The best performing setting is without log amplitude spectrogram and with augmentation.
- Augmentation Log Amplitude Spectrogram
- Augmentation Amplitude Spectrogram
- Log Amplitude Spectrogram
- Amplitude Spectrogram
- Spectrogram Comparison
Questions:
- Could the reason that the log amp spectrogram works so badly originate in the augmentation techniques? Two signals are mixed with a factor alpha, then three noise signals are added with a dampening factor of 0.4. Should some sort of normalization be done here?
- Since values close to 0 become large negative numbers after the log transform, maybe the network learns the silence/noise and gets overfitted to silence/noise in the training set, making it useless on the validation set.
The data augmentation is in place. A set of augmentation dictionaries is created by drawing a sample at random, then finding another sample of the same class at random, and then drawing three random noise samples. The dictionary is only an object which keeps track of the relative paths to the files. These augmentation dictionaries are then supplied in mini-batches to the training scheme, at which point they are actually loaded and combined, meaning that we only keep the mini-batch in memory while it is used. The two same-class samples are combined using a random weight alpha in [0, 1). The noise samples are added to the new sample with a dampening factor of 0.4. The mini-batch generator is configurable: it is possible to set how many augmentation samples should be created, how large each mini-batch should be, and how many mini-batches should be generated.
Abstract example of how this is used:
X_valid, Y_valid = loader.load_validation_data(config)
for (X_train, Y_train) in loader.mini_batch_generator(config):
    model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid), **config)
An example of an augmentation dict:
{'augmentation_noise_filepaths': [
'datasets/birdClef2016Subset_preprocessed/train/noise/LIFECLEF2014_BIRDAMAZON_XC_WAV_RN9970_noise_chunk.wav.gz',
'datasets/birdClef2016Subset_preprocessed/train/noise/LIFECLEF2015_BIRDAMAZON_XC_WAV_RN28317_noise_chunk.wav.gz',
'datasets/birdClef2016Subset_preprocessed/train/noise/LIFECLEF2015_BIRDAMAZON_XC_WAV_RN18403_noise_chunk.wav.gz'
],
'augmentation_signal_filepath': 'datasets/birdClef2016Subset_preprocessed/train/LIFECLEF2015_BIRDAMAZON_XC_WAV_RN18700_signal_chunk.wav.gz',
'labels': ['17'],
'signal_filepath': 'datasets/birdClef2016Subset_preprocessed/train/LIFECLEF2014_BIRDAMAZON_XC_WAV_RN7087_signal_chunk.wav.gz'}
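Given such a dictionary, the combination step could look roughly like this; a sketch assuming time-domain mixing with a random alpha, the 0.4 noise dampening described above, and equal-length chunks (the exact mixing formula is an assumption):

```python
import gzip
import io
import numpy as np
from scipy.io import wavfile

def load_wav_gz(path):
    """Load a gzip-compressed wav chunk as a float array."""
    with gzip.open(path, "rb") as f:
        _, samples = wavfile.read(io.BytesIO(f.read()))
    return samples.astype(np.float32)

def augment(aug_dict):
    """Combine a signal chunk with a same-class chunk and three dampened noise chunks.

    Assumes all chunks have the same length."""
    signal = load_wav_gz(aug_dict["signal_filepath"])
    same_class = load_wav_gz(aug_dict["augmentation_signal_filepath"])
    alpha = np.random.uniform(0.0, 1.0)
    mixed = alpha * signal + (1.0 - alpha) * same_class   # same-class mix (assumed form)
    for noise_path in aug_dict["augmentation_noise_filepaths"]:
        mixed += 0.4 * load_wav_gz(noise_path)            # dampening factor 0.4
    return mixed
```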
Show:
Questions:
- At which level should I start the thesis? Acoustic Scene Classification in general, or should I review the state-of-the-art literature on bird species classification using acoustics?
- It is clear that the authors of the reference paper have treated the data set as a single-instance single-label problem. I have extracted a random subset of 20 species from the BirdClef2016 data set, which is the data set the authors tested their method on. I believe that it may be better to keep working with this data set instead of MLSP2013.
- The network seems to converge better when training on the amplitude spectrogram rather than the log amplitude spectrogram. What does this mean? Could it be that the input values are higher and that this makes the network converge faster?
Print out Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection
I have confirmed that the data augmentation is performed in the time domain. This was stated later in the paper, and I had overlooked it.
Show:
- Reproduced noise and signal masks
- Running on GPU reduces model fitting from 3-4 hours to 2-3 minutes.
Questions:
- What is a parametric equalizing function, and how should it be used? This is not mentioned in the reference paper, and may not be used; they may simply combine the same-class signals additively.
- How should we load same-class segments and combine them? What kind of "randomness" makes sense? Should they be stored to file? (For the smaller data set this is not a problem; however, if the larger set gets augmented 5-10 times, the disk space used will increase from ~70GB to 350-700GB.)
- Should the data be normalized to zero mean and identity variance?
Answers:
- Leave this out for now. However, it is a way to accentuate certain frequencies around a center frequency: the Q-factor decides the bandwidth, and the gain decides whether the band is accentuated or attenuated.
- It makes sense to load data batches at random from disk, possibly directly from a compressed file, with a fixed random seed. These batches can then be augmented with other randomly loaded signals.
- This is not clear; however, it does not make sense in the time domain, where the signals should already have zero mean. It could make sense in the spectral/image domain, where each sample is just a high-dimensional point in space.
Read up on direct current offset (DC offset), and whether it makes sense to normalize the "images" to zero mean and identity variance.
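A sketch of zero-mean, unit-variance normalization of the spectrogram "images" (whether to normalize per sample or with statistics from the whole training set is left open here):

```python
import numpy as np

def standardize(spectrogram, eps=1e-8):
    """Normalize one spectrogram 'image' to zero mean and unit variance."""
    return (spectrogram - spectrogram.mean()) / (spectrogram.std() + eps)
```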
Questions:
- Some bird vocals in the MLSP data set seem to have very low energy, and it is hard to distinguish some vocal structures from noise using a simple "mark values higher than 3 times the row/column median" + dilation/erosion filter technique. What can be done? Should anything be done in the baseline?
- It is not clear in which order, and how many times, the dilation and erosion filters are applied during the signal/noise extraction stage. It is also not clear what the neighbourhood looks like. We know that it is a 4 by 4 neighbourhood, but not whether it consists of 4 by 4 ones or the mask has some other shape. How should this be interpreted?
- In the paper referred to from the "baseline paper" they remove the 4 lowest and the 24 highest frequency bins from the spectrogram. Should I assume that this is also done in the "baseline paper"?
- Additively combine signals. Is it possible using spectrograms?
Answers:
- Use baseline specifications, and take it from there.
- Empirically test the methods, and try to reproduce the results as close as possible to the paper.
- Since they write "we follow quite closely", it can be assumed that the same method is used in their paper.
- Empirically test with sinusoids, and see if it works. I.e., confirm f(s1 + s2) = f(s1) + f(s2), where f is a function which computes a spectrogram, and s1 and s2 are audio signals.
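A sketch of that empirical check with two sinusoids; note that additivity holds exactly for the complex STFT but only approximately for the magnitude spectrogram, so the difference is compared to the peak magnitude rather than expected to be zero:

```python
import numpy as np
import librosa

sr = 22050
t = np.arange(sr) / float(sr)               # one second of audio
s1 = np.sin(2 * np.pi * 440.0 * t)          # 440 Hz sinusoid
s2 = np.sin(2 * np.pi * 880.0 * t)          # 880 Hz sinusoid

def f(s):
    """Amplitude spectrogram of a signal."""
    return np.abs(librosa.stft(s, n_fft=512, hop_length=128))

lhs = f(s1 + s2)
rhs = f(s1) + f(s2)
print(np.max(np.abs(lhs - rhs)) / np.max(lhs))  # small relative difference if f is (approximately) additive
```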