Post date: Sep 20, 2020 5:8:57 AM
We're going to make a dataset!
1. Refer to (https://www.mygardenbirdwatch.com/?cur=page/page&id=4&title=BIRD_ID) to get the 28 most common Malaysian birds.
2. Remove birds not found in Sabah/Sarawak. (3 removed)
3. Pick the top 10 from the 2019 list. Refer to (https://www.mygardenbirdwatch.com/?cur=bird/result&date=2010). Three of these birds are not common in Sabah/Sarawak (starred).
4. Pick the remaining birds from the list in (1). We're left with 25:
5. Remove those whose Xeno-canto data is restricted or adds up to less than 1 hour of recordings. Now we're down to 13 species. The total duration for all 13 species is more than 76 hours (76:48:25, or 276,505 seconds to be exact). At a 44.1 kHz sampling rate, that's about 24 GB uncompressed (the figure works out if we assume 16-bit mono). Let's download all these sounds and then filter some more until we get down to 10 species.
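As a quick sanity check on the numbers in step 5, here is a small sketch of the arithmetic. It assumes 16-bit mono PCM, which is not stated above but is what makes the 24 GB figure come out; the function names are just for illustration.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def pcm_size_bytes(duration_s: int, sample_rate: int = 44_100,
                   bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size of uncompressed PCM audio (assumed here: 16-bit, mono)."""
    return duration_s * sample_rate * bytes_per_sample * channels


total_seconds = hms_to_seconds("76:48:25")
size_gb = pcm_size_bytes(total_seconds) / 1e9  # decimal gigabytes

print(total_seconds)      # 276505
print(round(size_gb, 1))  # 24.4
```

If the recordings turn out to be stereo, the same call with `channels=2` doubles the estimate.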
For comparison, UrbanSound8K has 8732 labeled sound clips from 10 classes, each up to 4 seconds long. That's at most 34,928 seconds. Our 276k seconds will shrink once we remove the silent passages, and we could even end up with a smaller sample set than that. We can only know after the work is done.
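To get a feel for how much audio might survive silence removal, here is a minimal sketch using a plain energy threshold on fixed-size frames. This is only one possible approach and not necessarily what we'll end up using; the 50 ms frame length and the 0.01 RMS threshold are made-up values, and the samples are assumed to be floats in the range -1 to 1.

```python
import math


def non_silent_seconds(samples, sample_rate, frame_ms=50, threshold=0.01):
    """Seconds of audio in frames whose RMS is at or above the threshold."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    kept = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            kept += len(frame)
    return kept / sample_rate


def clip_count(duration_s, clip_s=4):
    """Number of whole fixed-length clips that fit in a duration."""
    return int(duration_s // clip_s)


# Synthetic test signal: 1 s of a 440 Hz tone followed by 1 s of silence.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
signal = tone + [0.0] * rate

print(non_silent_seconds(signal, rate))  # 1.0
print(clip_count(276_505))               # 69126, upper bound before filtering
```

The 69,126 figure is only an upper bound on 4-second clips from the raw 276,505 seconds; the real count depends on how much of each recording is silence.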
Papers on datasets:
Salamon, Justin, Christopher Jacoby, and Juan Pablo Bello. "A dataset and taxonomy for urban sound research." Proceedings of the 22nd ACM international conference on Multimedia. 2014. Link at ResearchGate.
Morfi, Veronica, et al. "NIPS4Bplus: a richly annotated birdsong audio dataset." PeerJ Computer Science 5 (2019): e223. Link at publisher.
Morfi, G. Automatic detection and classification of bird sounds in low-resource wildlife audio datasets. Diss. Queen Mary University of London, 2019. Link at Queen Mary.
Salamon, Justin, et al. "Towards the automatic classification of avian flight calls for bioacoustic monitoring." PLoS ONE 11.11 (2016): e0166866. Link at publisher.