Supervision Meetings
This page contains some notes, and questions for the weekly meetings.
- Week 3
- Week 4
- Week 5
- Week 6
- Week 7
- Week 8
- Week 9
- Week 10
- Week 11
- Week 12
- Week 13
- Week 14
- Week 15
- Week 16
- Week 17
- Week 19
Discussed the thesis report:
- Descriptive statistics for elevation data should be included.
- Discussion should be somewhat self-contained (~2 pages), and include
- what we set out to do,
- what we have done,
- main findings,
- what went wrong, and what can we learn from this,
- what could be done in the future,
- what open questions are there.
that is, the results of the thesis should be put into a broader perspective.
- Meta-Data Fusion: could be seen as a weighted ranking rather than a computation of probabilities.
- The PDFs of the random variables X, estimated from the observed elevations O, are now a mixture of a Gaussian and a uniform distribution, weighted towards the Gaussian when |O| is large.
- Do we have enough results to concentrate on the write-up?
- Run 2 x Resnet experiments for whole data set
- Make plots reproducible (rewrite some old plot code)
- Separate Evaluation and Predictions
The majority of the recordings in the BirdCLEF 2016 data set have meta data containing the elevation at which the bird song was recorded. This information could probably be used to improve classification accuracy.
Let Y = {y_1, ..., y_n} be the sound classes of the data set (n = 999), and let T = {r_1, ..., r_m} be the training set. Let e(r) be a function which returns the elevation of recording r, and let f(r) be the true classifier which returns the ground-truth sound class of r. We then create the elevation observations for each sound class y_i in Y by O_i = {e(r) | r in T, f(r) = y_i} for i = 1, ..., n. A random variable X_i with distribution N(mu_i, sigma_i) is estimated for each O_i, where mu_i = mean(O_i) and sigma_i = std(O_i). The classification of a recording r can now be done by combining classifier_score = f_w(r) = (p_1, ..., p_n) and elevation_score = (P(X_1 = e(r)), ..., P(X_n = e(r))), where f_w is the classifier parametrized by w, and p_i is the estimated probability that r belongs to sound class y_i.
The final classification score for sound class y_i is then
p_i * P(X_i = e(r)) / (p_1 * P(X_1 = e(r)) + ... + p_n * P(X_n = e(r))),
i.e. the combined scores normalized by their sum.
Using Bayes' theorem,
Pr(A|X) = Pr(X|A)Pr(A) / Pr(X),
where A is the event that r is species y, and X is the event that the elevation is e(r).
A possible problem with this approach is that the number of elevation observations may not be enough to get a good estimate of the distribution. This means that a recording of a bird made at an "unusual" elevation, where it has not been observed before, could result in a false classification.
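A minimal sketch of this combination, assuming per-class Gaussians estimated with SciPy and a classifier that returns a probability vector. All names are illustrative rather than the project's actual API, and the `mixture_density` variant reflects the Gaussian + uniform weighting mentioned in the meeting notes above, with an assumed weighting function and elevation range.

```python
import numpy as np
from scipy.stats import norm, uniform

def fit_elevation_models(observations_per_class):
    """Estimate (mu_i, sigma_i) from the observed elevations O_i of each sound class."""
    return [(np.mean(o), np.std(o) + 1e-6) for o in observations_per_class]

def mixture_density(elevation, mu, sigma, n_obs, elev_min=0.0, elev_max=5000.0, k=10.0):
    """Gaussian + uniform mixture, weighted towards the Gaussian when |O_i| is large.

    The weight w = n / (n + k) and the elevation range are assumptions."""
    w = n_obs / (n_obs + k)
    return (w * norm.pdf(elevation, mu, sigma)
            + (1.0 - w) * uniform.pdf(elevation, loc=elev_min, scale=elev_max - elev_min))

def fuse_scores(classifier_probs, elevation, elevation_models):
    """Combine p_i with P(X_i = e(r)) and normalize, as in Bayes' theorem."""
    elev_scores = np.array([norm.pdf(elevation, mu, sigma)
                            for mu, sigma in elevation_models])
    combined = np.asarray(classifier_probs) * elev_scores
    return combined / combined.sum()

# Hypothetical usage:
# models = fit_elevation_models(observations_per_class)
# fused = fuse_scores(model.predict(segment), elevation_of(recording), models)
```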
CubeRun | With elevation | Without elevation |
---|---|---|
Top-1 | 0.645 | 0.637 |
Top-5 | 0.815 | 0.801 |
MAP | 0.685 | 0.675 |
AUROC | 0.977 | 0.969 |
Coverage Error | 20.7 | 25.6 |
Label Ranking Average Precision | 0.724 | 0.714 |
Ranking Loss | 0.020 | 0.025 |
The most notable improvement is in the coverage error, where we on average need to predict only the ~21 most probable classes to cover all ground-truth labels, instead of ~26.
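For reference, these ranking metrics are available in scikit-learn; a minimal sketch assuming a binary indicator matrix of ground-truth labels and a matrix of (fused) class scores:

```python
from sklearn.metrics import (coverage_error, label_ranking_average_precision_score,
                             label_ranking_loss, roc_auc_score)

# y_true: (n_samples, n_classes) binary indicator matrix of ground-truth labels
# y_score: (n_samples, n_classes) raw or elevation-fused classifier scores
def ranking_metrics(y_true, y_score):
    return {
        "coverage_error": coverage_error(y_true, y_score),
        "label_ranking_average_precision": label_ranking_average_precision_score(y_true, y_score),
        "ranking_loss": label_ranking_loss(y_true, y_score),
        "auroc": roc_auc_score(y_true, y_score, average="macro"),
    }
```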
The number of predictions for a class is higher than expected when there are many training segments, and lower than expected when there are few training segments.
- Structure
- Baseline
- Results
- Methods
- Figure out points to make
- What evidence do we have for this?
- Add results which do not really support the main points but could be interesting to round up to an Appendix.
- Running times for the networks
- Version number of libraries
- How does one recreate the plots?
- What hardware is used?
Show histogram plots of elevation for each sound class.
Looking at the confusion matrix for the bot 100 sound classes for the residual network, we can see that the network tends to predict sample points as sound class 12 or 42; these get 14 and 15 predictions respectively, while other classes get around 2-4 predictions.
The number of training segments for sound classes 12 and 42 is 359 and 756 respectively, while the number of training segments for the other sound classes is on average less than 100, probably around 70 (by manual observation). This means that the sound classes which the network favors are probably over-represented.
- Confusion Matrix
- Strategies on Generalization
- Run 240 bot 100
In this confusion matrix there seems to be some structure in the confusions; there is something of a vertical line along sound class 42(?), which suggests that the network may favor this class as a prediction.
Training the CubeRun model on the small data set with seven different optimizers. The default parameters from the Keras library were used for all optimizers except Stochastic Gradient Descent, which uses a parameter configuration close to the baseline.
The only two optimizers which give decent results are Adadelta and Stochastic Gradient Descent.
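A sketch of how such an optimizer comparison could be run in Keras (Keras 2 style; the model builder, data, and SGD settings are placeholders rather than the project's exact configuration):

```python
from keras.optimizers import SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, Nadam

optimizers = {
    "sgd": SGD(lr=0.01, momentum=0.9, nesterov=True),  # roughly baseline-like settings (assumed)
    "rmsprop": RMSprop(),
    "adagrad": Adagrad(),
    "adadelta": Adadelta(),
    "adam": Adam(),
    "adamax": Adamax(),
    "nadam": Nadam(),
}

histories = {}
for name, optimizer in optimizers.items():
    model = build_cuberun(nb_classes)  # placeholder for the project's model constructor
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    histories[name] = model.fit(X_train, Y_train,
                                validation_data=(X_valid, Y_valid),
                                epochs=50, batch_size=32)
```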
- Fundamental difference between data and method.
Training the network only on the 100 sound classes on which it performs worst may give some insight into what the underlying problem could be.
The validation accuracy stays very low while training on these classes, indicating that these classes are actually harder to learn. The validation loss is in fact increasing.
The characteristic shape of this curve stays the same; however, it is shifted towards the left, which indicates worse performance.
- Top 1: 0.410958904109589
- Top 2: 0.5205479452054794
- Top 3: 0.6118721461187214
- Top 4: 0.6621004566210046
- Top 5: 0.6986301369863014
- Mean Average Precision: 0.473481438051
- Area Under Curve: 0.902126285688
Training on the same species (bot 100) but with a reshuffled data set (train/valid 90/10).
- Top 1: 0.42570281124497994
- Top 2: 0.5381526104417671
- Top 3: 0.6104417670682731
- Top 4: 0.6305220883534136
- Top 5: 0.678714859437751
- Mean Average Precision: 0.48241371466
- Area Under Curve: 0.895622895623
Similar results as for CubeRun can be observed for ResNet when training on bot 100.
In [16]: a.intersection_all([bot_10_1, bot_10_2, bot_10_3, bot_10_4, bot_10_5])
Out[16]:
['nigricollis',
'affinis',
'conspicillatus',
'fuscorufa',
'amazonum',
'chrysoptera']
In [17]: a.intersection_all([top_10_1, top_10_2, top_10_3, top_10_4, top_10_5])
Out[17]:
['rufimarginatus',
'erythrocercum',
'martinicus',
'brissonii',
'chloropterus',
'certhia',
'albescens']
In [18]: a.intersection_all([top_10_1, top_10_2, top_10_3, top_10_4, top_10_5, bot_10_1, bot_10
...: _2, bot_10_3, bot_10_4, bot_10_5])
Out[18]: []
On the small data set it is obvious that the residual neural network is not as stable in validation accuracy; the variance of its "generalization" is quite high. It may be that the network is more prone to overfitting.
- Try all optimizers
- Train 10 times for CubeRun/ResNet and check variance of top/bot
- Train MFCCs with fewer features
- Train on bot/top 100
- Training Resnet and CubeRun on large dataset.
- Performed sanity check on MFCC input data
- Performed a sanity check on classification by implementing a simple method which takes a model and a sound file as input and returns the predicted species. Just to ensure that it actually predicts a species.
- Maintenance work on the code base (restructured how results are saved and stored).
- Found and read a new ResNet publication (https://arxiv.org/pdf/1605.07146v3.pdf), which argues that the width of the network may be more important than depth due to diminishing feature reuse.
- Tempogram input data (seems to take too much time to compute)
- Which experiments would be interesting to run next?
- Should we look into "Wide Residual Neural Networks" (https://arxiv.org/pdf/1605.07146v3.pdf)?
- The training/validation split is 90/10, which is the same as in the baseline, for comparable results. However, the rather small amount of validation data may make the validation set more susceptible to "bad choices". E.g., for small sound classes we have only 1-2 validation data points; since these are randomly selected, we may very well select two data points of poor quality, making that sound class hard to predict for the classifier.
- Validation data not representative.
I trained the CubeRun model on the small data set, and manually inspected the sound class with the worst classification accuracy (0%). By listening to the validation samples and training samples, I realized that the validation samples were barely audible, and the amount of actual bird vocals present was very low. In contrast, the training samples had lots of distinct bird vocals. Since the number of validation samples for each class is quite low, as low as 1-2 for some classes, it can be the case that the randomly selected validation samples are really bad recordings, leading to an extremely bad validation accuracy for that class.
It could be interesting to know if the ResNet and CubeRun have problem with the same sound classes.
In [68]: len(a.intersection(res_top_100, cube_top_100))
Out[68]: 31
In [69]: len(a.intersection(res_bot_100, cube_bot_100))
Out[69]: 54
In [70]: len(a.intersection(res_top_100, cube_bot_100))
Out[70]: 2
In [71]: len(a.intersection(res_bot_100, cube_top_100))
Out[71]: 3
As we can see, the bot 100 sound classes for ResNet and CubeRun share around 50% of the same classes. Since we have ~800 classes, two randomly drawn 100-class subsets would be expected to share about 100 * 100 / 800 = 12.5 classes, i.e. roughly 1/8 rather than 1/2, so there probably is a pattern here. Nor do the top 100 and bot 100 of the two classifiers seem to share many classes.
In [72]: len(a.intersection(res_top_200, cube_top_200))
Out[72]: 139
In [73]: len(a.intersection(res_bot_200, cube_bot_200))
Out[73]: 127
In [75]: len(a.intersection(res_top_200, cube_bot_200))
Out[75]: 6
In [76]: len(a.intersection(res_bot_200, cube_top_200))
Out[76]: 8
The top/bot 200 of ResNet and CubeRun share around 65% of the same sound classes, meaning that they perform well on a similar set of sound classes, and poorly on a similar set of sound classes. The cross sets (top of one network vs. bot of the other) share nearly no classes (<5%).
This plot clearly shows that the network performs very well for some classes, and extremely poorly for others. One would expect this to be mainly due to the uneven distribution of training samples per class; however, the effect of this is smaller than one might expect (as seen in the image below). Therefore there must be some other factor which lets the network classify certain classes with ease, and others only with great difficulty.
It may be of interest to train a classifier on the top 100 classes, and the bot 100 classes to see if the results are consistent. That is, if the accuracy for the top 100 remains high, and the accuracy for the bot 100 classes remains low.
It may also be of interest to run the training multiple times and see if the top 100 and bot 100 remain roughly the same. If they do not, it may be that the network gets stuck in different local optima which favor the prediction of some classes over others. If they do, there is something fundamentally hard about the bot 100 classes. However, with the current computational resources available this investigation may not be feasible. (Each training round takes roughly 3 days to complete.) But we could take a subset of, say, 50 sound classes, train on them a couple of times, and see if the problems remain and how each training round affects them.
The plot shows the classification accuracy of each sound class (in descending order) plotted in the same graph as the mean energy of the training (green line) and validation (blue line) samples. The mean energy of the data points of a sound class is simply the total sum of the amplitude spectrogram for each data point in the sound class, divided by the total number of data points.
We can see that the energy distributions of the training set and the validation set are similar; however, it is not possible to see any correlation between classification accuracy and the mean energy of the data points, with the exception of the sound class with the worst validation accuracy, where the validation data barely contains any energy at all. This may explain why these validation recordings were barely audible.
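A sketch of the mean-energy computation described above (the spectrogram loader in the usage comment is a placeholder for the project's own code):

```python
import numpy as np

def mean_energy(spectrograms):
    """Mean energy of a sound class: the summed amplitude spectrogram of each
    data point, averaged over all data points in the class."""
    energies = [np.sum(s) for s in spectrograms]  # total energy per segment
    return np.mean(energies)

# Hypothetical usage, one value per sound class:
# class_energy = {y: mean_energy(load_spectrograms(y)) for y in sound_classes}
```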
The number of training samples for each sound class does affect the resulting accuracy for that class. However, not as much as one might expect.
"However, note that a lesser but still substantial improvement over the baseline MFCC system can usually be attained simply by using the raw Mel spectral data as input rather than MFCCs. One of the long-standing motivations for the MFCC transformation has been to reduce spectral data down to a lower dimensionality while hoping to preserve most of the implicit semantic information; but as we have seen, the random forest classifier performs well with high-dimensional input, and such data reduction is not necessary and often holds back classification performance. Future investigators should consider using Mel spectra as a baseline, rather than MFCCs as is common at present" - (https://arxiv.org/pdf/1405.6524.pdf)
There are also results in this paper showing that there could be a sweet spot for the number of features used when using MFCCs: http://www98.griffith.edu.au/dspace/bitstream/handle/10072/54461/81896_1.pdf?sequence=1 (page 8)
In this plot the accuracy of each "chunk" is ordered by how many training samples are available. That is, the average accuracy of the 10% of the species with the most training samples available is plotted at x=0, and so forth.
We can see that there is a trend where the accuracy becomes lower with fewer training samples, which is what one would expect, since the network will see the samples of the "small" sound classes less often, and thus learn them less well. However, the accuracy does not collapse as the number of samples decreases; the network still performs rather evenly. Bear in mind that the largest sound classes have ~200 training samples, while the smallest ones have ~10, yet the prediction accuracy only varies from ~0.65 to ~0.48.
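A sketch of how such a chunked plot could be produced, assuming per-class accuracies and training-segment counts are available as arrays (names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def chunked_accuracy(class_accuracy, train_counts, nb_chunks=10):
    """Average per-class accuracy in chunks ordered by training-set size.

    Chunk 0 contains the 10% of species with the most training samples,
    the last chunk the 10% with the fewest."""
    order = np.argsort(train_counts)[::-1]                       # most samples first
    chunks = np.array_split(np.asarray(class_accuracy)[order], nb_chunks)
    return [np.mean(c) for c in chunks]

# plt.plot(chunked_accuracy(acc_per_class, segments_per_class))
# plt.xlabel("chunk (0 = most training samples)"); plt.ylabel("mean accuracy")
```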
During the half-time report the project's progress was discussed, and the examiner approved of the progress of the thesis. Further investigative methods were discussed, as well as what the next steps could be.
- Investigate why augmented MFCCs do not work as well as logarithmic spectral input data
- Double-check implementation
- Perform a sanity check on input data (just in case something is wrong)
- Investigate which parts of the data set are hard for the respective classifiers
- Resnet: top/bot 25 classes
- CubeRun: top/bot 25 classes
- Investigate accuracy with respect to number of training samples
- Resnet
- Cuberun
- Further investigation metrics
- accuracy with respect to length of song / energy of segment
- Possible improvements
- Tempogram
- Data fusion (altitude/location/time/biotope)
The amount of training data for each sound class is very uneven; it could make sense to weight the samples, or at the very least to see how this correlates with classification accuracy (a weighting sketch follows the table below).
Top 10 (training samples) | Bot 10 (training samples) |
---|---|
leucophrys 212 | gymnops 9 |
flaveola 190 | pretrei 9 |
gujanensis 141 | melanoleucus 9 |
viridis 138 | candei 9 |
albicollis 135 | bolivianus 9 |
rufus 119 | novacapitalis 9 |
ruficauda 119 | gyrola 9 |
capensis 118 | squamigera 9 |
guttatus 118 | luteiventris 9 |
longirostris 115 | sp.nov.Alto_Pisones 0 |
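One simple way to counteract this imbalance, as suggested above, is per-class sample weighting; a sketch using Keras's class_weight argument with inverse-frequency weights (the exact weighting scheme is an assumption, not something tried in the project yet):

```python
import numpy as np

def inverse_frequency_weights(segments_per_class):
    """Map class index -> weight, inversely proportional to the number of training segments."""
    counts = np.asarray(segments_per_class, dtype=float)
    counts = np.maximum(counts, 1.0)          # guard against classes with zero segments
    weights = counts.mean() / counts
    return {i: w for i, w in enumerate(weights)}

# class_weight = inverse_frequency_weights(segments_per_class)
# model.fit(X_train, Y_train, class_weight=class_weight, validation_data=(X_valid, Y_valid))
```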
- Discuss MFCC features
- Discuss resizing
- Discuss data fusion (elevation)
- Discuss evaluation (top 10, median 10, bottom 10) (validation data may be too sparse for a fair comparison)
- Discuss noise reduction; could it be interesting to preprocess the audio files even more?
CubeRun MFCC without any augmentation at all (resized from 128x130 to 256x256).
CubeRun MFCC with data augmentation (noise/same class/time shift/pitch shift)
CubeRun MFCC with data augmentation, and Multiple-Width Frequency-Delta Data Augmentation
CubeRun Spectrogram with data augmentation
ResNet MFCC with data augmentation, and MWFD
128x130 MFCC matrix
256x256 MFCC matrix (resized)
256x512 Log Spectrogram (reference)
The baseline model has been evaluated on the large (~25000 file) data set. The results are close to what would be expected according to the paper, if not a bit better.
- Top 1: 2246
- Top 2: 2561
- Top 3: 2707
- Top 4: 2806
- Top 5: 2862
- Mean Average Precision: 0.653409677675
- Area Under Curve: 0.974919928416
- Total predictions: 3657
- What are the next steps? Multiple-Width Frequency-Delta Data Augmentation? Maybe try out using a tempogram as a data augmentation technique?
- Next week is week 10, meaning that we should schedule a half-time presentation late next week, or possibly the week after that.
It seems that the combination of dropout followed by a batch normalization on the input layer is what is causing the bad results. Without dropout the CubeRun model reaches a mAP of around 0.84. With dropout it is no good at all.
I have found a great library called librosa which provides a lot of feature extraction methods. The current implementation now uses librosa.stft to compute the spectrograms. There are methods for computing MFCCs as well as deltas of the MFCCs in librosa, meaning that it should be no problem to implement the Multiple-Width Frequency-Delta data augmentation. It may also be interesting to play around with the rhythmic representations, such as tempograms, which are available in the library.
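For reference, a sketch of the relevant librosa calls (the file path, n_fft, and hop_length are placeholders, not the project's settings):

```python
import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=22050)                  # path is illustrative

S = np.abs(librosa.stft(y, n_fft=512, hop_length=128))           # amplitude spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)               # MFCC features
mfcc_delta = librosa.feature.delta(mfcc)                         # first-order deltas
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)               # second-order deltas
tempogram = librosa.feature.tempogram(y=y, sr=sr)                # rhythmic representation
```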
Since most popular models (VGG16, VGG19, xception, inception_v3) are readily available in the Keras library, it may be of interest to test all of them and see if any other model performs well in this problem domain; it would also make the study more comprehensive.
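A sketch of how these architectures could be instantiated from keras.applications with a small classification head for this task (Keras 2 style; the input shape and the three-channel stacking of the spectrograms are assumptions):

```python
from keras.applications.vgg16 import VGG16
from keras.applications.vgg19 import VGG19
from keras.applications.xception import Xception
from keras.applications.inception_v3 import InceptionV3
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

def build_from_base(base_cls, nb_classes, input_shape=(256, 256, 3)):
    """Wrap a keras.applications base model with a small classification head.

    The single-channel spectrograms are assumed to be stacked to three
    channels to fit the stock architectures."""
    base = base_cls(weights=None, include_top=False, input_shape=input_shape)
    x = GlobalAveragePooling2D()(base.output)
    predictions = Dense(nb_classes, activation="softmax")(x)
    return Model(inputs=base.input, outputs=predictions)

# for base_cls in [VGG16, VGG19, Xception, InceptionV3]:
#     model = build_from_base(base_cls, nb_classes=999)
#     model.compile(optimizer="adadelta", loss="categorical_crossentropy", metrics=["accuracy"])
```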
Without dropout |
---|
This model reaches a mAP of 0.84
Without batch normalization |
---|
With dropout and batch normalization |
---|
The much deeper residual network seems to perform well, and reaches a mAP of around 0.86.
34 layer Resnet |
---|
- Contacted Elias Sprengel and used his [feedback](https://github.com/johnmartinsson/bird-species-classification/wiki/Journal#2016-12-14).
- Evaluated the CubeRun model again, but it still performs much better when trained on amplitude spectrograms.
- Trained a deep residual network model.
The CubeRun model seems to converge quite quickly, and keeps a validation accuracy of around 0.7 after 100 epochs. The Resnet model does not seem to have converged after 60 epochs, and could use more training. Neither of these accuracy scores is the mean over a whole file; rather, they are the accuracy over all segments of all files. The results are expected to improve when taking the mean over a file. Last time, an accuracy of 0.57 for the CubeRun model over the segments in the validation set increased to an accuracy of 0.76 when taking the mean of the segments for each file.
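A sketch of the file-level evaluation described above, averaging the segment predictions of each file before taking the arg-max (names are illustrative):

```python
import numpy as np

def file_level_prediction(model, segments):
    """Average the class probabilities over all segments of one file."""
    probs = model.predict(np.asarray(segments))   # (nb_segments, nb_classes)
    return np.argmax(probs.mean(axis=0))          # single prediction for the file

# accuracy = np.mean([file_level_prediction(model, segs) == label
#                     for segs, label in validation_files])
```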
Time:
- CubeRun (5 conv layers): 60s/epoch
- Resnet (18 conv layers): 240s/epoch
CubeRun 240 Epochs on Amplitude Spectrogram (best: 0.74) |
---|
- Top 1: 100
- Top 2: 105
- Top 3: 113
- Top 4: 114
- Top 5: 116
- Mean Average Precision: 0.846718206657
- Area Under Curve: 0.966350301984
- Total predictions: 122
CubeRun 240 Epochs on Log Amplitude Spectrogram |
---|
Resnet 60 Epochs on Log Amplitude Spectrogram (best: 0.6) |
---|
Evaluated the baseline with and without augmentation, and with and without log amplitude spectrogram. The best performing setting is without log amplitude spectrogram and with augmentation.
- Augmentation Log Amplitude Spectrogram
- Augmentation Amplitude Spectrogram
- Log Amplitude Spectrogram
- Amplitude Spectrogram
- Spectrogram Comparison
Questions:
- Could the reason that the log amp spectrogram works so badly originate in the augmentation techniques? Two signals are mixed with a factor alpha, then three noise signals are added with a dampening factor of 0.4. Should some sort of normalization be done here?
- Since values close to 0 become large negative numbers after the log transform, maybe the network learns the silence/noise and gets overfitted to silence/noise in the training set, making it useless on the validation set.
The data augmentation is in place. A set of augmentation dictionaries is created by drawing a sample at random, then finding another sample of the same class at random, and then drawing three random noise samples. The dictionary is only an object which keeps track of the relative paths to the files. These augmentation dictionaries are then supplied in mini-batches to the training scheme, at which point they are actually loaded and combined, meaning that we only keep the mini-batch in memory while it is used. The two same-class samples are combined using a random weight alpha in [0, 1). The noise samples are added to the new sample with a dampening factor of 0.4. The mini-batch generator is configurable: it is possible to set how many augmentation samples should be created, how large each mini-batch should be, and how many mini-batches should be generated.
Abstract example of how this is used:
X_valid, Y_valid = loader.load_validation_data(config)
for (X_train, Y_train) in loader.mini_batch_generator(config):
    model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid), **config)
An example of an augmentation dict:
{'augmentation_noise_filepaths': [
'datasets/birdClef2016Subset_preprocessed/train/noise/LIFECLEF2014_BIRDAMAZON_XC_WAV_RN9970_noise_chunk.wav.gz',
'datasets/birdClef2016Subset_preprocessed/train/noise/LIFECLEF2015_BIRDAMAZON_XC_WAV_RN28317_noise_chunk.wav.gz',
'datasets/birdClef2016Subset_preprocessed/train/noise/LIFECLEF2015_BIRDAMAZON_XC_WAV_RN18403_noise_chunk.wav.gz'
],
'augmentation_signal_filepath': 'datasets/birdClef2016Subset_preprocessed/train/LIFECLEF2015_BIRDAMAZON_XC_WAV_RN18700_signal_chunk.wav.gz',
'labels': ['17'],
'signal_filepath': 'datasets/birdClef2016Subset_preprocessed/train/LIFECLEF2014_BIRDAMAZON_XC_WAV_RN7087_signal_chunk.wav.gz'}
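Given such a dictionary, the combination step could look roughly like this; a sketch assuming time-domain mixing with a random alpha, the 0.4 noise dampening described above, and equal-length chunks (the exact mixing formula is an assumption):

```python
import gzip
import io
import numpy as np
from scipy.io import wavfile

def load_wav_gz(path):
    """Load a gzip-compressed wav chunk as a float array."""
    with gzip.open(path, "rb") as f:
        _, samples = wavfile.read(io.BytesIO(f.read()))
    return samples.astype(np.float32)

def augment(aug_dict):
    """Combine a signal chunk with a same-class chunk and three dampened noise chunks.

    Assumes all chunks have the same length."""
    signal = load_wav_gz(aug_dict["signal_filepath"])
    same_class = load_wav_gz(aug_dict["augmentation_signal_filepath"])
    alpha = np.random.uniform(0.0, 1.0)
    mixed = alpha * signal + (1.0 - alpha) * same_class   # same-class mix (assumed form)
    for noise_path in aug_dict["augmentation_noise_filepaths"]:
        mixed += 0.4 * load_wav_gz(noise_path)            # dampening factor 0.4
    return mixed
```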
Show:
Questions:
- At which level should I start the thesis? Acoustic Scene Classification in general, or should I review the state-of-the-art literature on bird species classification using acoustics?
- It is clear that the authors of the reference paper have treated the data set as a single-instance single-label problem. I have extracted a random subset of 20 species from the BirdClef2016 data set, which is the data set the authors tested their method on. I believe that it may be better to keep working with this data set instead of MLSP2013.
- The network seems to converge better when training on the amplitude spectrogram rather than the log amplitude spectrogram. What does this mean? Could it be that the input values are higher and that this makes the network converge faster?
Print out Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection
I have confirmed that the data augmentation is performed in the time domain. This was stated later in the paper, and I had overlooked it.
Show:
- Reproduced noise and signal masks
- Running on GPU reduces model fitting from 3-4 hours to 2-3 minutes.
Questions:
- What is a parametric equalizing function, and how should it be used? This is not mentioned in the reference paper, and may not be used; they may simply combine the same-class signals additively.
- How should we load same-class segments and combine them? What kind of "randomness" makes sense? Should they be stored to file? (For the smaller data set this is not a problem; however, if the larger set gets augmented 5-10 times, the disk space used will increase from ~70GB to 350-700GB.)
- Should the data be normalized to zero mean and identity variance?
Answers:
- Leave this out for now. However, it is a way to accentuate certain frequencies around a center frequency: the Q-factor decides the bandwidth, and the gain decides whether the band is accentuated or attenuated.
- It makes sense to load data batches at random from disk, possibly directly from a compressed file, with a fixed random seed. These batches can then be augmented with other randomly loaded signals.
- This is not clear; however, it does not make sense in the time domain, where the signals should already have zero mean. It could make sense in the spectral/image domain, where each sample is just a high-dimensional point in space.
Read up on direct current offset (DC offset), and whether it makes sense to normalize the "images" to zero mean and identity variance.
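A sketch of zero-mean, unit-variance normalization of the spectrogram "images" (whether to normalize per sample or with statistics from the whole training set is left open here):

```python
import numpy as np

def standardize(spectrogram, eps=1e-8):
    """Normalize one spectrogram 'image' to zero mean and unit variance."""
    return (spectrogram - spectrogram.mean()) / (spectrogram.std() + eps)
```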
Questions:
- Some bird vocals in the MLSP data set seem to have very low energy, and it is hard to distinguish some vocal structures from noise using a simple "mark values higher than 3 times the row/column median" + dilation/erosion filter technique. What can be done? Should anything be done in the baseline?
- It is not clear in which order, and how many times, the dilation and erosion filters are applied during the signal/noise extraction stage. It is also not clear what the neighbourhood looks like. We know that it is a 4 by 4 neighbourhood, but not whether it consists of 4 by 4 ones or the mask has some other shape. How should this be interpreted?
- In the paper referred to from the "baseline paper" they remove the 4 lowest and the 24 highest frequency bins from the spectrogram. Should I assume that this is also done in the "baseline paper"?
- Additively combine signals. Is it possible using spectrograms?
Answers:
- Use baseline specifications, and take it from there.
- Empirically test the methods, and try to reproduce the results as close as possible to the paper.
- Since they write "we follow quite closely", it can be assumed that the same method is used in their paper.
- Empirically test with sinusoids, and see if it works. I.e., confirm f(s1 + s2) = f(s1) + f(s2), where f is a function which computes a spectrogram, and s1 and s2 are audio signals.
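A sketch of that empirical check with two sinusoids; note that additivity holds exactly for the complex STFT but only approximately for the magnitude spectrogram, so the difference is compared to the peak magnitude rather than expected to be zero:

```python
import numpy as np
import librosa

sr = 22050
t = np.arange(sr) / float(sr)               # one second of audio
s1 = np.sin(2 * np.pi * 440.0 * t)          # 440 Hz sinusoid
s2 = np.sin(2 * np.pi * 880.0 * t)          # 880 Hz sinusoid

def f(s):
    """Amplitude spectrogram of a signal."""
    return np.abs(librosa.stft(s, n_fft=512, hop_length=128))

lhs = f(s1 + s2)
rhs = f(s1) + f(s2)
print(np.max(np.abs(lhs - rhs)) / np.max(lhs))  # small relative difference if f is (approximately) additive
```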