Abstract
Speech recognition has become one of the most significant parts of human-computer interaction owing to the emergence of technologies such as smartphones and smart watches, which creates a need for automatic speech recognition (ASR) in local languages. The aim of this paper is to develop an isolated-digit recognition system for the Pashto language using a deep convolutional neural network (CNN). The database contains the Pashto digits 0 to 9, with 50 utterances of each digit. Twenty Mel-frequency cepstral coefficient (MFCC) features are extracted from each isolated digit and fed as input to the CNN. The proposed network is up to four convolutional layers deep, followed by ReLU and max-pooling layers. The network was trained on 50% of the data, and the remainder was used for testing. An overall average test accuracy of 84.17% was achieved, which is 7.32% better than existing similar work.
Keywords: CNN; Computer science; MFCC; Pashto isolated digits recognition; Speech recognition
Year: 2020 PMID: 32083214 PMCID: PMC7016387 DOI: 10.1016/j.heliyon.2020.e03372
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Figure 1. Block diagram of an ASR system.
Figure 2. Steps involved in MFCC feature extraction.
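MFCC extraction warps the frequency axis onto the mel scale before the cepstral coefficients are computed. The following is a minimal sketch of that warping only, assuming the widely used formula mel = 2595 log10(1 + f/700); the record does not state which variant the authors used. The function names, the 26 filters, and the 0 to 8000 Hz range are illustrative assumptions, not values from the paper.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (common 2595*log10 variant, assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_centre_frequencies(n_filters, f_min_hz, f_max_hz):
    """Centre frequencies (Hz) of n_filters filters equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

# Hypothetical settings: 26 filters between 0 Hz and 8000 Hz.
centres = mel_centre_frequencies(26, 0.0, 8000.0)
```

Because the spacing is uniform in mels, the centres are dense at low frequencies and sparse at high frequencies, which is the property the mel filterbank relies on.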
Figure 3. Architecture of the CNN, with four convolutional layers followed by pooling and one fully connected layer at the end.
Parameters used in the CNN.
| Convolutional layer | Number of Filters | Filter size | Generated feature map | Max-pooling |
|---|---|---|---|---|
| First layer | 25 | (10,10) | 25 | Yes |
| Second layer | 22 | (10,10) | 22 | Yes |
| Third layer | 44 | (3,3) | 44 | Yes |
| Fourth layer | 44 | (2,2) | 44 | No |
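To give a sense of the size of the convolutional stack, the sketch below counts the trainable parameters implied by the filter counts and filter sizes in the table. It assumes a single-channel input (one MFCC matrix per utterance), one bias per filter, and that each layer takes all feature maps of the previous layer as input; none of these details are stated in this record, and `conv_param_counts` is a hypothetical helper name.

```python
# (layer name, number of filters, (kernel height, kernel width)) from the table above
LAYERS = [
    ("First layer", 25, (10, 10)),
    ("Second layer", 22, (10, 10)),
    ("Third layer", 44, (3, 3)),
    ("Fourth layer", 44, (2, 2)),
]

def conv_param_counts(layers, in_channels=1):
    """Return (name, parameter count) per layer: weights plus one bias per filter."""
    counts = []
    channels = in_channels  # assumed single-channel input
    for name, n_filters, (kh, kw) in layers:
        weights = kh * kw * channels * n_filters
        biases = n_filters
        counts.append((name, weights + biases))
        channels = n_filters  # outputs of this layer feed the next one
    return counts

counts = conv_param_counts(LAYERS)
for name, n in counts:
    print(f"{name}: {n} parameters")
print("Total:", sum(n for _, n in counts))  # Total: 74091
```

Under these assumptions the second layer dominates, since its 10 x 10 filters span all 25 feature maps of the first layer. The fully connected layer is not counted because its input size depends on details not given here.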
Figure 4. Rectified linear unit.
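The rectified linear unit passes positive inputs through unchanged and maps everything else to zero, f(x) = max(0, x). A minimal sketch in plain Python, with illustrative function names:

```python
def relu(x):
    """Rectified linear unit: max(0, x) for a single value."""
    return x if x > 0.0 else 0.0

def relu_map(values):
    """Apply ReLU element-wise to a list of activations."""
    return [relu(v) for v in values]

activations = relu_map([-2.0, -0.5, 0.0, 1.5])
print(activations)  # [0.0, 0.0, 0.0, 1.5]
```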
Total training and testing accuracy.
| Data | Number of Speakers | Accuracy |
|---|---|---|
| Training | 25 (both male and female) | 90.14% |
| Testing | 25 (both male and female) | 84.17% |
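The table reports 25 speakers in each partition, which suggests the 50% split was made by speaker, not by individual utterance. The sketch below shows one way such a split could be done. It assumes 50 speakers who each utter every digit once (consistent with 50 utterances per digit, but not confirmed here) and a random assignment with a fixed seed; the authors' actual procedure, including any balancing by gender, is not described in this record.

```python
import random

def split_by_speaker(speaker_ids, train_fraction=0.5, seed=0):
    """Partition speakers so that no speaker appears in both training and testing."""
    speakers = sorted(set(speaker_ids))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(speakers)
    n_train = int(len(speakers) * train_fraction)
    return set(speakers[:n_train]), set(speakers[n_train:])

# Hypothetical corpus layout: (speaker id, digit) for 50 speakers and digits 0 to 9.
utterances = [(speaker, digit) for speaker in range(50) for digit in range(10)]

train_speakers, test_speakers = split_by_speaker(s for s, _ in utterances)
train_set = [u for u in utterances if u[0] in train_speakers]
test_set = [u for u in utterances if u[0] in test_speakers]
```

Splitting by speaker keeps the test accuracy a measure of performance on unseen voices, which an utterance-level split would not guarantee.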
Figure 5. Total accuracy and loss during training and testing.