ABSTRACT
INTRODUCTION
1. MONAURAL SOURCE SEPARATION AND SPEAKER RECOGNITION
2. MODEL DESCRIPTION
2.1. Model: Deep Recurrent Neural Networks
2.2. Model Architecture
2.3. Quality metrics
2.4. Software environment
3. SOFTWARE DEVELOPMENT
3.1. Python Application (User Interface)
3.1.1. Design of the application
3.1.2. Functional requirements
3.2. Preprocessing
3.3. Model variables list
3.3.1. Model parameters
3.3.2. Meaning of the parameters
3.4. List of issues: explanations and solutions
4. TESTING
4.1. Testing of the neural network
4.1.1. Using the pretrained model
4.1.2. Training the model on our own data set
4.1.3. Training after observation
4.1.4. Training with one speaker and with many speakers
4.2. Testing of the UI
5. RESULTS DISCUSSION
CONCLUSION
REFERENCE LIST
Relevance
Artificial Intelligence (AI) is growing rapidly nowadays, and many applications already exhibit human-like intelligence, which makes this field as interesting as it is challenging.
The introduction of neural networks changed the way people think about the limits of computer intelligence. We now have neural networks that simulate the work of our brain cells, so we can get them to learn things, recognize patterns, and make decisions in a human-like way.
The power of traditional feedforward neural networks is limited, because they have no notion of order in time and do not take the recent results of previous nodes into consideration. Feedforward networks are amnesiacs regarding their recent past; they remember nostalgically only the formative moments of training. Recurrent neural networks, on the other hand, take as their input not just the current example they see, but also what they have perceived previously in time.
The decision a recurrent network reaches at time step t-1 affects the decision it will reach one moment later, at time step t. Recurrent networks therefore have two sources of input, the present and the recent past, which combine to determine how they respond to new data, much as we do in life. The term "deep" refers to the number of hidden layers: more layers increase the amount of learning and thus directly improve the results, but the time and hardware requirements grow considerably.
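The recurrence described above can be made concrete with a minimal NumPy sketch of a single recurrent layer. The weight names (W_x, W_h, b), the tanh activation, and the dimensions are illustrative assumptions, not the actual parameters of the thesis model:

```python
import numpy as np

# Illustrative dimensions and randomly initialized weights (hypothetical names).
rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3
W_x = rng.standard_normal((n_hidden, n_in)) * 0.1      # input-to-hidden weights
W_h = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # hidden-to-hidden weights (the "memory")
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    """Hidden state at step t combines the current input with the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

x = rng.standard_normal((5, n_in))  # a sequence of 5 time steps
h = np.zeros(n_hidden)
for t in range(x.shape[0]):
    h = rnn_step(x[t], h)  # h at step t depends on h at step t-1
print(h.shape)  # (3,)
```

Unrolling this loop over time is what distinguishes a recurrent network from a feedforward one: the same weights are reused at every step, but the state h carries information forward.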
Speaker recognition is one such concept that has held mankind's attention. There can be no greater testimony to this than the fact that people were already working on the idea a few decades before John McCarthy even coined the term "Artificial Intelligence"; ever since, that term has been used to refer to applications in which learning is done automatically by machines.
Speaker recognition refers to the automated method of identifying or confirming the identity of an individual based on his or her voice (voice biometrics). It can be used in numerous ways, ranging from criminal investigations to determining who is speaking and verifying a speaker's identity.
Speaker recognition is the result of cross-linking various avenues of technology such as machine learning, artificial intelligence, and neural networks. I propose to develop a system based on mathematical algorithms and principles that involve all the aforementioned technologies. That being said, speaker recognition also depends on a few other factors: the level of noise and the quality of the audio file. My software addresses these problems with a Python user interface and a MATLAB backend that reduces the noise and then uses a deep recurrent neural network to recognize the speakers you want to listen to; the output is an audio file in which the volume of the chosen speakers is raised and the noise is cleaned.
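The processing chain just described (noise reduction, then DRNN-based separation, then boosting the chosen speaker) can be sketched in Python. The function bodies below are deliberately trivial placeholders standing in for the MATLAB backend stages; the names and the synthetic input signal are assumptions for illustration only:

```python
import numpy as np

def denoise(signal):
    # Placeholder: a real backend would apply spectral noise reduction here.
    return signal - np.mean(signal)

def separate_target(signal):
    # Placeholder for the DRNN separation stage; passes the signal through.
    return signal

def amplify(signal, gain_db=6.0):
    # Raise the volume of the chosen speaker by gain_db decibels.
    return signal * (10 ** (gain_db / 20.0))

# A synthetic 1-second, 440 Hz tone stands in for the input audio file.
mixture = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 8000))
out = amplify(separate_target(denoise(mixture)))
print(out.shape)  # (8000,)
```

The point of the sketch is the ordering of the stages, not their internals: denoising first improves the separation input, and gain is applied only to the extracted target.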
Research goal and objectives
The goal of the research is to develop an application for monaural source separation and speaker recognition.
To achieve this goal, we set the following objectives:
1) to analyze contemporary speaker recognition technology and to identify the key issues that need to be addressed in its real-world deployment;
2) to explore alternative speech parameterization techniques and identify the most successful ones;
3) to study ways for improving the speaker recognition performance and noise robustness for real-world operational conditions;
4) to study alternatives to the present state-of-the-art approaches to speaker recognition, and to identify those that offer practical advantages using deep recurrent neural networks;
5) to create a prototype of a speaker recognition system that builds on the findings of objectives 2-4;
6) to test our application and improve its performance.
Practical significance
The project is beneficial as the core of source separation and speaker speech identification, which can have many significant uses for individuals and organizations.
This project can be used in:
1) training the model on the voice features of selected speakers;
2) using the application to separate the speech of each speaker;
3) using the application to extract a speaker's speech;
4) noise reduction and audio file normalization.
Structure of the thesis
The thesis consists of five chapters, plus an introduction, a conclusion, and a reference list.
In chapter one, the source separation task and speaker recognition are explained, the various approaches to solving this task are analyzed, and a conclusion is drawn to explain why the model I used produces better results.
In chapter two, the model is described in detail: first the architecture of the deep recurrent neural network is introduced, then the equations used, together with their description and significance, and finally the quality metrics used to judge the results. The software environment used throughout the thesis is also described.
In chapter three, the software development is discussed: there are screenshots of the Python user interface and its different functionalities, a description of the preprocessing of the audio files and of how the dataset was produced, and tables listing the variables used in the deep recurrent neural network together with the meaning of the most important ones. There is also a list of the errors we faced while using MATLAB and some libraries, with their solutions, to make it easier for anyone who would like to continue development in this field.
In chapter four, the most significant experiments are discussed, and their results, together with accuracy metrics, are shown in tables.
In chapter five, the overall results are discussed, and figures show the amplitude of the audio files before and after separation, which demonstrates the efficiency of the model.
In my thesis, an application for monaural source separation and speaker recognition using a deep recurrent neural network was demonstrated.
During the work we achieved the following objectives:
1) contemporary speaker recognition technology was analyzed, and the key issues that need to be addressed in its real-world deployment were identified;
2) alternative speech parameterization techniques were explored and the most successful ones were identified;
3) ways for improving the speaker recognition performance and noise robustness for real-world operational conditions were studied;
4) alternatives to the present state-of-the-art approaches to speaker recognition were studied, and those that offer practical advantages using deep recurrent neural networks were identified;
5) a prototype of a speaker recognition system that builds on the findings of objectives 2-4 was created;
6) the application was tested and its performance was improved.
A jointly optimized model with time-frequency masking functions embedded in the network layers was tested. The performance of the separation process was evaluated using three error metrics: SDR, SIR, and SAR, which show good results with appropriate bitrate and amplitude parameters of the signals in the audio files. The overall performance of the model surpasses that of NMF and ordinary deep neural networks. Further work will focus on speeding up the source separation using GPUs and on tuning the speed-performance balance.
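The time-frequency masking mentioned above can be illustrated with a minimal NumPy sketch of a soft ratio mask applied to the network outputs, in the spirit of the joint mask optimization of [1]. The array contents here are random placeholders rather than real spectrograms, and the variable names are illustrative:

```python
import numpy as np

# Stand-ins for the magnitudes the two network output streams would predict
# for each time-frequency bin of the mixture spectrogram.
rng = np.random.default_rng(1)
freq_bins, frames = 5, 4
y1 = np.abs(rng.standard_normal((freq_bins, frames)))  # network output, source 1
y2 = np.abs(rng.standard_normal((freq_bins, frames)))  # network output, source 2
mixture = y1 + y2  # stand-in for the mixture magnitude spectrogram

eps = 1e-12  # guards against division by zero in silent bins
mask1 = y1 / (y1 + y2 + eps)  # soft time-frequency mask, values in [0, 1]
mask2 = y2 / (y1 + y2 + eps)
s1_hat = mask1 * mixture      # masked estimate of source 1
s2_hat = mask2 * mixture      # masked estimate of source 2

# The two masks sum to one, so the estimates add back up to the mixture.
print(np.allclose(s1_hat + s2_hat, mixture))  # True
```

Because the masks are a deterministic function of the network outputs, they can be placed inside the network as an extra layer and optimized jointly with the rest of the model, which is the key idea behind the architecture tested here.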
1. Huang P.-S. Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation // IEEE/ACM Trans. Audio Speech Lang. Process, 2015. - Vol. 23. - No. 12. - P. 2136-2147.
2. Ephraim Y., Malah D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator // IEEE Trans. Acoust, 1985. - Vol. 33. - No. 2. - P. 443-445.
3. Lee D.D., Seung H.S. Learning the parts of objects by non-negative matrix factorization. // Nature, 1999. - Vol. 401. - No. 6755. - P. 788-791.
4. Hofmann T. Probabilistic latent semantic indexing // SIGIR, 1999. - P. 50-57.
5. Smaragdis P., Raj B., Shashanka M. A Probabilistic Latent Variable Model for Acoustic Modeling. // Adv. Model. Acoust. Process. Work, 2006. - P. 1-7.
6. Huang P.-S. et al. Deep learning for monaural speech separation // Proc. IEEE Int. Conf. Acoust. Speech Signal Process (ICASSP), 2014. - P. 1562-1566.
7. Huang P.-S., Kim M. Singing-Voice Separation from Monaural Recordings Using Deep Recurrent Neural Networks // ISMIR, 2014. - P. 477-482.
8. Hermans M., Schrauwen B. Training and Analyzing Deep Recurrent Neural Networks // NIPS, 2013. - P. 190-198.
9. Pascanu R. et al. How to Construct Deep Recurrent Neural Networks. // ICLR, 2014. - P. 1-13.
10. Wang D. Time - Frequency Masking for Speech Hearing Aid Design. // Trends Amplif, 2008. - Vol. 12. - P. 332-353.
11. Reju V.G. Blind Separation of Speech Mixtures. // IEEE Trans. Signal Process, 2009. - Vol. 52. - No. 7. - P. 1830-1847.
12. Vincent E., Gribonval R., Fevotte C. Performance measurement in blind audio source separation // IEEE Trans. Audio, Speech Lang. Process, 2006. - Vol. 14. - No. 4. - P. 1462-1469.
13. MathWorks - Makers of MATLAB and Simulink [Electronic resource] URL: https://www.mathworks.com/?s_tid=gn_logo (date of access: 06.03.2018).
14. HTK Speech Recognition Toolkit. [Electronic resource] URL: http://htk.eng.cam.ac.uk/ (date of access: 06.03.2018).
15. The Laboratory for the Recognition and Organization of Speech and Audio (LabROSA). [Electronic resource] URL: https://labrosa.ee.columbia.edu/ (date of access: 06.03.2018).