Literature DB >> 36061371

Multi-task learning neural networks for breath sound detection and classification in pervasive healthcare.

Dat Tran-Anh1, Nam Hoai Vu1, Khanh Nguyen-Trong1, Cuong Pham1.   

Abstract

With the emergence of many severe chronic obstructive pulmonary diseases (COPDs) and the COVID-19 pandemic, there is a need for the timely detection of abnormal respiratory sounds, such as deep and heavy breaths. Although numerous efficient pervasive healthcare systems have been proposed for tracking patients, few studies have focused on these breath types. This paper presents a method that supports physicians in monitoring in-hospital and at-home patients by analyzing their breathing. The proposed method is based on three deep neural networks for audio analysis: RNNoise for noise suppression, and a SincNet-Convolutional Neural Network and a Residual Bidirectional Long Short-Term Memory network for breath sound analysis at edge devices and centralized servers, respectively. We also developed a pervasive system with two configurations: (i) an edge architecture for in-hospital patients; and (ii) a central architecture for at-home ones. Furthermore, a dataset, named BreathSet, was collected from 27 COPD patients being treated at three hospitals in Vietnam to verify the proposed method. The experimental results demonstrate that our system efficiently detects and classifies breath sounds, with F1-scores of 90% and 91% for the tiny model version on low-cost edge devices, and 90% and 95% for the full model version on central servers, respectively. The proposed system was successfully deployed at hospitals to help physicians monitor respiratory patients in real time.
© 2022 Elsevier B.V. All rights reserved.


Keywords:  Breath sound detection and classification; BreathSet dataset; Pervasive deep learning; RNNoise; Residual BiLSTM; SincNet-CNN

Year:  2022        PMID: 36061371      PMCID: PMC9419997          DOI: 10.1016/j.pmcj.2022.101685

Source DB:  PubMed          Journal:  Pervasive Mob Comput        ISSN: 1574-1192            Impact factor:   3.848


Introduction

Breath sounds are made by air moving through the respiratory system and can be recorded over the chest wall, the trachea, or at the mouth. They are classified as normal or abnormal. The latter includes the absence or reduced intensity of sounds while breathing, normal breath sounds heard in abnormal areas, and adventitious sounds [1]. Several abnormal breath types, such as agonal breathing or snoring, can easily be differentiated by humans [2]. However, some signs, such as heavy and deep breaths, are subtle and hard to distinguish; if not analyzed by a well-trained physician, they may lead to a wrong diagnosis [3]. A heavy breath is considered a sign of dyspnoea, which is characterized by shortness of breath, difficulty breathing, or the inability to take a complete, satisfactory breath [4]. A deep breath, in contrast, is defined as a tidal volume several times the normal volume and is observed as a crescendo on a slow lung volume trace; deep breaths may occur in both healthy and unhealthy people. In the context of many chronic obstructive pulmonary diseases (COPDs) and the COVID-19 pandemic, the timely detection and classification of these signs is critical.

In the field of healthcare monitoring, many studies have analyzed respiratory diseases using a typical pervasive architecture. This architecture usually consists of four layers: the physical, mobile computing, cloud, and application layers [5]. The physical layer is responsible for collecting respiratory data, such as sounds, airflow, chest wall movements, or air temperature [6], [7]. The mobile computing layer contains communication, computing, and handheld devices, which acquire data from the previous layer and/or perform several processes, such as data mining and analysis, or preprocessing steps like noise-canceling.
These data are then passed to the cloud layer, which performs data storage and/or analytic tasks, such as respiratory classification or statistical operations. At the last layer, the predictions are shown to doctors. The techniques used to capture information at the physical layer are critical for the large-scale deployment of such pervasive systems. Previous studies were based either on contact (direct contact with the patient's body) or contactless (no contact with the patient's body) methods. The contact approaches cover a wide range of technologies [6], such as acoustic-based methods (e.g., microphones) to capture respiratory sounds, airflow-based methods to monitor respiratory airflow (e.g., oronasal thermistors), or methods to measure chest wall movements (e.g., smart textiles [8]). The contactless approaches leverage various signals to obtain respiratory information, such as ultrasonic sensors [9], optical sensors [10], or radio frequency technologies [11]. Despite advantages such as greater patient comfort or longer continuous monitoring, many proposed methods, both contact and contactless, nevertheless suffer from inaccuracies under movement (e.g., ultrasound, radar, or conductive-sensor approaches) or require expensive and complicated equipment setups (e.g., thermography, ultrasound, or optical approaches) [12]. In this context, monitoring respiratory sounds with wearable devices (i.e., microphones) appears to be a suitable solution that balances simplicity, accuracy, cost efficiency, and reproducibility [13], especially in developing countries such as Vietnam. The main drawback of this approach is its susceptibility to noise, such as environmental vibrations or ambient sounds; however, this can be overcome by advances in noise reduction using deep learning [14]. Recently, deep learning has been widely adopted in pervasive healthcare systems [15], and many neural networks have been proposed to derive useful high-level features from low-level respiratory signals.
Convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory (LSTM), and bidirectional long short-term memory (BiLSTM) are the most widely used networks in the field [16]. However, existing methods are still too complex and expensive to implement: they usually apply the central approach, which requires high computing resources at the cloud layer [17]. This approach can yield high accuracy but is not suitable in practical situations, where the workload can be reduced by edge computing at different levels. Moreover, to our knowledge, there are no studies analyzing heavy, deep, and normal breaths; prior work typically studied "clear" adventitious sounds, such as coughs and wheezes. Therefore, in this study, we propose three neural networks and a pervasive system to detect and classify heavy, deep, and normal breaths observed through the mouth. The proposed solution is applicable in two scenarios: at-home and in-hospital patient monitoring. We also applied different edge computing techniques to reduce the workload and the need for high computing resources. The major contributions of this work are as follows:
- We applied multi-task learning and evaluated two neural networks, the SincNet-Convolutional Neural Network (SincNet-CNN) [18] and the Residual bidirectional long short-term memory (Residual BiLSTM) network, to detect and classify breath sounds.
- We propose a noise reduction method using the RNNoise neural network [14] that can be implemented on low-cost devices.
- We built a public dataset, named BreathSet, consisting of four types of breaths collected from 27 patients at three hospitals in Vietnam.
- We designed and developed (i) a cheap contact wearable device that allows non-invasive measurement of respiratory sounds, and (ii) an embedded computer supporting edge computing that reduces the need for high-performance resources at the cloud layer. With these devices, we can continuously record patients' breaths.
A demonstrative system based on our method was implemented. The remainder of the article is structured as follows: Section 2 discusses related studies; Section 3 presents the proposed method; Section 4 presents the experimental evaluation; and Section 5 provides concluding remarks.

Related work

Several methods of breath sound analysis have been proposed in the literature. They can be summarized by characteristics such as sound types, features, and applied methods [1]. Regarding sound types, most existing works have focused on detecting or classifying adventitious sounds and comparing them with normal breaths, such as apnea episodes or dysrhythmic breathing [7]. Studies on heavy or deep breaths, which are signs of several important COPDs or of COVID-19 [19], especially in patients with a COPD history, are few. Among the machine learning methods developed for analyzing audio signals, the methods for breath sound analysis can be divided into two types: (i) traditional machine learning and (ii) deep learning [6], [10]. The former is typically based on shallow architectures used to train classifiers such as Support Vector Machines (SVM), Gaussian Mixture Models (GMM), Random Forests (RF), and Artificial Neural Networks (ANN). However, due to their limited capabilities, such architectures are not suitable for complicated large-scale real-world problems such as breath sound detection and classification [20]; these problems require a deeper, layered architecture to extract complex information [10], as in deep learning. Deep learning models have gradually replaced traditional techniques in breath sound analysis, demonstrating superior performance in terms of accuracy and computational time. Many neural networks have been proposed to learn useful features from low-level sensor data and construct high-level features of breath sounds, including CNN, RNN, LSTM, and BiLSTM [16]. For example, Islam et al. [21] proposed a Teacher-Student architecture to detect breath phases (inhalation and exhalation) using smartphone sensors.
The student model is a CNN with acoustic MFCC features as input, used for breath phase detection, while the teacher model is a signal processing model used to estimate the phase labels with the help of data from IMU sensors (acceleration and motion sensors). Although high accuracy was obtained, the proposed method was evaluated only on a Samsung Galaxy Note 8. To capture the temporal features of breath sounds, networks with memory structures, which account for the temporal relationships in the input-output mappings, such as RNNs and their variants, are usually employed. LSTM and BiLSTM, which can model long- and short-term sequences, are the most frequently used recurrent networks for breath sound detection and classification [16]. Unlike LSTM, which processes only the information obtained before the current moment, BiLSTM can use both forward and backward information [15]. For instance, Shis et al. [22] presented a BiLSTM-CNN model to detect inhalation-pause-exhalation-pause sequences, in which a CNN block was used for feature extraction while an attention-based LSTM was used for real-time detection. Similar to the work presented in [21], the accuracy achieved by Shis et al. was highly dependent on the devices used to collect data. For feature extraction, researchers either extract physical properties of respiratory signals, such as RR intervals [23], or directly use raw data, such as recorded breathing sounds. Although physical properties have been demonstrated to be helpful for breath analysis, they have obvious shortcomings: on the one hand, these approaches depend greatly on signal quality and background noise; on the other hand, they need a manual design with many specific algorithms, such as the sensitive R-wave detection algorithm [24]. In this study, owing to the automatic feature extraction and selection of deep learning models, once breaths were denoised and segmented, we used them directly to train our deep models.
This method is able to mine useful information from the input data given a suitable network architecture. Moreover, a significant advantage of this method is that model performance continuously increases as data accumulate. The spectrogram-based technique and its variants, especially Mel-Frequency Cepstral Coefficients (MFCCs), are the most popular feature extraction techniques in breath analysis. These features are calculated from the short-term Fourier transform as the cepstrum of the mel-warped spectrum. MFCCs capture the shape of the vocal tract, which generates the sound, and can therefore represent phonemes (the distinct units of sound). Moreover, they model signals based on the human sense of hearing, which has been shown to achieve better performance in recognizing breath patterns, as in [21], [22]. This makes MFCCs well suited for respiratory audio analysis. Among existing methods, MFCC with CNN usually produces the most optimized model in terms of resource usage, whereas BiLSTM provides highly accurate models at a higher cost. Therefore, CNN networks are typically used for edge computing in a pervasive system [25], whereas BiLSTM networks suit the central approach. Numerous existing works have approached breath detection and classification as two separate steps; studies that perform both tasks therefore usually use two different models, which requires more effort from researchers. For instance, Yan et al. [26] presented a method that used two separate models for detection and classification: the first was a convolutional recurrent neural network (CRNN) with learnable gated linear units (GLU) trained with a sigmoid loss function, and the second was a GLU-CRNN trained with sigmoid and softmax loss functions together. These approaches usually require more storage when deploying models, which can cause critical issues on low-configuration hardware such as edge devices.
These issues can be resolved by multi-task learning [27], in which a single model performs both tasks simultaneously. In breath sound analysis, noise artifacts are one of the key obstacles that degrade the quality of input data. Therefore, many denoising algorithms have been proposed, such as high-pass filters, power spectral subtraction, Recursive Least Squares (RLS), Least Mean Squares (LMS), RMS, Kalman filtering, and deep learning models. Among these, the deep learning-based methods outperform traditional noise suppression [14]. In this study, we propose pervasive deep learning techniques to analyze three types of breath sounds (heavy, deep, and normal) recorded through the mouth. The proposed method includes two models for different situations: edge computing with a low-cost model (SincNet-CNN) and central computing with a higher-accuracy model (Residual BiLSTM). Multi-task learning was used to train the proposed networks, which jointly learn to detect and classify breath sounds. Moreover, to avoid issues of device dependency, we also developed a cheap device to collect data (recording breath sounds).
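To make the feature pipeline concrete, the standard MFCC computation described above (short-term Fourier transform, mel-warped spectrum, then its cepstrum) can be sketched with NumPy and SciPy. This is a generic textbook sketch, not the authors' implementation; the frame length, hop, and filter-bank sizes are assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, n_fft=256, hop=128, n_mels=26, n_mfcc=13):
    """Textbook MFCCs: framing + Hamming window, power spectrum,
    triangular mel filter-bank, log, then a type-II DCT (the cepstrum
    of the mel-warped spectrum)."""
    frames = np.stack([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_mfcc]
```

For one second of 8 kHz audio with these settings, the result is a (frames x n_mfcc) matrix that can be fed to a CNN or BiLSTM.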

Materials and methods

The system overview

We propose a pervasive system named e-Breath Monitor, which consists of four layers: the physical, mobile computing, cloud, and application layers, as shown in Fig. 1. The first layer records breaths through the mouth. It contains an embedded computer and a microphone, as presented in Fig. 2. The embedded computer is based on the Raspberry Pi Zero W from Adafruit, powered by a rechargeable 3700 mAh, 3.7 V Li-ion battery from EEMB, as shown in Fig. 2a and b. The battery enables the device to run for over 10 h without recharging. The computer was packaged in a plastic box produced by a 3-D printer, as shown in Fig. 2b and c. The microphone is a Shure BETA 98H/C equipped with a three-meter (10 ft) high-flex cable. In this study, to record breath sounds, the devices were worn on the upper arm with the help of a dedicated strap, as shown in Fig. 2d. Owing to the high-flex cable, it is easy to adjust and fix the microphone position: patients can pull and place the microphone at a distance of 1-2 cm from the mouth to record breath sounds. Although not as effective as placing the microphone directly on the skin of the thorax or trachea, this method is more comfortable, less affected by movements, and supports a longer monitoring period. We also developed a module executed on the embedded device to record breath sounds at a sampling frequency of 8 kHz and a sample size of 16 bits. The recordings were first collected to train our models (the dashed arrows in Fig. 1) and then used to monitor patients (the solid arrows in Fig. 1). For the training phase, we used Bluetooth to send data to the phones at the next layer; for the monitoring phase, a Wi-Fi connection was employed to send data to the Raspberry Pi 4.
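As an illustration of the recording format described above (mono, 8 kHz sampling, 16-bit samples), a clip can be written with Python's standard wave module. This is a minimal sketch, not the actual recording module; the test tone and temporary file are placeholders:

```python
import math
import struct
import tempfile
import wave

SR = 8000        # sampling frequency used by the recording module
SAMPWIDTH = 2    # 16-bit samples

def write_breath_clip(path, samples):
    """Write a mono 8 kHz / 16-bit PCM WAV file in the recording format."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(SAMPWIDTH)
        w.setframerate(SR)
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# one second of a quiet 440 Hz tone as placeholder data for a recording
tone = [int(8000 * math.sin(2 * math.pi * 440 * t / SR)) for t in range(SR)]
path = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
write_breath_clip(path, tone)
```

At 8 kHz and 16 bits, one minute of audio occupies roughly 0.96 MB, small enough to stream over Bluetooth or Wi-Fi as described.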
Fig. 1

e-Breath monitor architecture with four layers and three scenarios (Training, At-home Monitoring, and In-hospital Monitoring).

Fig. 2

Physical layer with an embedded computer and a microphone ((a), (b): the embedded computer, packaged in a plastic box, consists of a Raspberry Pi Zero W and a rechargeable lion battery; (c), (d): the microphone and the embedded computer are attached and put in an upper arm strap).

The mobile computing layer is composed of cheap Android phones from VinSmart Research & Manufacturer Joint Stock Company, a Raspberry Pi 4 from the Raspberry Pi Foundation, and a conventional Wi-Fi router. The phones, which were used only to collect data for model training, stored data sent from the previous layer. For this purpose, a mobile app was developed to control the embedded computer (the recording module) at the physical layer; it contains a button to start recording and a label showing the duration of recordings. The Raspberry Pi and router were used to monitor patients in two different scenarios and approaches: the central approach for at-home patients and edge computing for in-hospital patients. In the first scenario, the Raspberry Pi operates as a gateway that receives monitored data from the previous layer and transmits them to the cloud layer (the at-home monitoring scenario in Fig. 1). This layer contains a high-performance model running on a central GPU server. In the second scenario, the Raspberry Pi can be considered an edge computing device that supports doctors in monitoring and diagnosing hospitalized patients. In this context, we deployed the physical layer at hospitals: the embedded computer records the breath sounds of patients and transfers them to our edge devices. An improved CNN model, which can run on low-cost hardware, was trained to analyze the monitored data. We also developed an application to execute the CNN model and send the results directly to the doctors (the in-hospital monitoring scenario in Fig. 1).
In both scenarios, we implemented a noise filter, namely RNNoise, on the Raspberry Pi, applied just after receiving recorded data from the previous layer. The cloud layer is a central server that performs three tasks: training models, analyzing the breath sounds of at-home patients, and managing patients through a management information system (MIS). Two models were trained for the first task: (i) a low-cost model in terms of computing resources, and (ii) a high-cost model with higher accuracy. Due to budget limitations and privacy policies at many local hospitals, deploying powerful servers locally (a budget issue) or using external ones (a privacy concern) is unfeasible. Therefore, the high-cost model, which requires powerful servers, is not appropriate for the in-hospital scenario. Despite its lower performance, the on-site availability of physicians makes the low-cost model, which can be executed on the embedded computer, a suitable solution. Regarding at-home patients, the unavailability of physicians requires a model that provides high accuracy; moreover, privacy is less constrained for this type of patient. For these reasons, the first model was used for the real-time monitoring of in-hospital patients, while the second was used for remotely monitoring at-home patients. The second task supports monitoring hospital-at-home patients, for whom the physical layer is deployed at the patient's home. The cloud layer receives monitored data transmitted from the gateways, analyzes them, and sends the results to doctors, as shown in the at-home monitoring scenario of Fig. 1. In both scenarios (at-home and in-hospital monitoring), the output is stored at hospitals. Thus, a lightweight web server that works as an MIS was developed to support doctors in further analyzing this information and managing their patients.
At the last layer, doctors access the MIS to monitor patients and perform different statistical analyses on their devices (computer or smartphone). Next, we present the deep learning networks used for both monitoring scenarios.

Preprocessing

Before training the models, we performed a series of preprocessing steps on the raw data, including sliding-window segmentation (with labeling) and noise reduction. First, we applied an overlapping sliding window with a 4 s window size and 80% overlap to segment the raw data; this provides better performance than disjoint segmentation. The windows were slid over the data, moving forward in time. With the help of respiratory physicians, we labeled each window as heavy, deep, or normal if at least 50% of the breath lies inside the window; otherwise, it was labeled as 'other'. To obtain a platform-independent noise filter, instead of using existing libraries, such as Android libraries, we applied RNNoise, a deep learning method for noise-canceling, as shown in the first block of Fig. 3 and Fig. 4. A comb filter defined at the pitch interval is applied to each window. The network employs a 20 ms window that overlaps by 50% and slides over the signal using the Vorbis window. RNNoise operates on full-range audio input, here at 8 kHz. We implemented this noise filtering on the Raspberry Pi 4 at the mobile computing layer.
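The windowing and 50% labeling rule can be sketched as follows. This is a sketch under the assumption that breath annotations are available as a per-sample 0/1 mask; in the paper, windows were labeled by physicians rather than derived from such a mask:

```python
import numpy as np

SR = 8000                 # sampling frequency of the recordings
WIN = 4 * SR              # 4 s window
HOP = WIN // 5            # 80% overlap: the window advances by 20%

def segment(signal, breath_mask, label):
    """Slide a 4 s window (80% overlap) over the recording and keep
    `label` only when at least half of the window overlaps annotated
    breath samples (breath_mask is 1 where a breath was marked)."""
    windows, labels = [], []
    for start in range(0, len(signal) - WIN + 1, HOP):
        windows.append(signal[start:start + WIN])
        frac = breath_mask[start:start + WIN].mean()
        labels.append(label if frac >= 0.5 else "other")
    return np.stack(windows), labels
```

With an 80% overlap, each breath appears in several windows, which is also why this segmentation acts as data augmentation for training.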
Fig. 3

Overview of the proposed SincNet-CNN network (the first two blocks are RNNoise, for noise-canceling, and MFCC features; the last three blocks belong to SincNet-CNN consisting of a SincNet, CNN layer, and DNN with an output softmax layer).

Fig. 4

Overview of the proposed Residual BiLSTM network (the first two blocks are RNNoise, for noise-canceling, and MFCC features; the last two blocks are the BiLSTM network that consists of four BiLSTM layers and a DNN with an output softmax layer).

Multi-task neural networks for real-time monitoring of in-hospital patients

To monitor in-hospital patients, we used a novel CNN architecture, named SincNet-CNN, that can run on low-cost hardware in real time. The network was successfully applied to speech recognition tasks. SincNet-CNN includes two main layers: (i) a SincNet data-filtering layer and (ii) a CNN layer. The first layer filters the input signal with a predefined parametric filter, which reduces the number of learnable parameters of the CNN layer. Let x[n] be a chunk of the signal and g[n, θ] be the filter function, which depends on a few learnable parameters θ. The filtered output y[n] is the convolution

y[n] = x[n] * g[n, θ].   (1)

The number of learnable parameters can be kept small by defining g as a generic bandpass filter:

g[n, f1, f2] = 2 f2 sinc(2π f2 n) − 2 f1 sinc(2π f1 n),

where f1 and f2 are the learned low and high cut-off frequencies, and the sinc function is defined as sinc(x) = sin(x)/x. From the sampling frequency of the input signal, we initialize random cut-off frequencies in the range [0, fs/2]; f1 and f2 are then generated based on the lowest and highest admissible cut-off frequencies, respectively. A Hamming window was used to smooth out the abrupt discontinuities at the ends of g:

g_w[n, f1, f2] = g[n, f1, f2] · w[n], with w[n] = 0.54 − 0.46 cos(2πn/L),

where L is the filter length. The filter is usually initialized with random cut-off frequencies drawn from the mel-scale filter-bank. In this way, it has the advantage of allocating more filters in the lower part of the spectrum, which contains many important cues of breath sounds. After filtering, the output is passed to the CNN layer and is jointly optimized with the CNN parameters using stochastic gradient descent. As shown in Fig. 3, the proposed network contains conventional layers, including pooling, normalization, activations, dropout, fully connected layers, and a softmax at the last layer.
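The bandpass construction used by the SincNet layer can be sketched in NumPy. The cut-off frequencies and filter length below are illustrative, not the learned values; note that np.sinc is the normalized variant sin(πx)/(πx), which absorbs the 2π factor:

```python
import numpy as np

def sinc_bandpass(f1, f2, sr=8000, taps=129):
    """Hamming-windowed ideal bandpass FIR between f1 and f2 Hz:
    the parametric filter g[n, f1, f2] that SincNet parameterizes."""
    n = np.arange(-(taps // 2), taps // 2 + 1)
    f1n, f2n = f1 / sr, f2 / sr                  # normalized cut-offs
    # difference of two ideal low-pass filters gives the bandpass
    g = 2 * f2n * np.sinc(2 * f2n * n) - 2 * f1n * np.sinc(2 * f1n * n)
    return g * np.hamming(taps)

g = sinc_bandpass(500.0, 1500.0)                 # illustrative cut-offs
H = np.abs(np.fft.rfft(g, 4096))
peak_hz = np.fft.rfftfreq(4096, d=1 / 8000)[np.argmax(H)]
# the strongest frequency response falls inside the 500-1500 Hz passband
```

Because only f1 and f2 are learned per filter, a bank of such filters has two parameters each, far fewer than a free convolution kernel of the same length.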
To speed up computation, we used a modified version of SincNet-CNN that has only one CNN layer after the SincNet layer. The SincNet layer starts with a special convolution, SincConvFast, that filters the input data as described above. The CNN layer uses a traditional convolution. Both layers employ max pooling, layer normalization, and leaky ReLU activation. The two FC layers share the same configuration: 2048 nodes, batch normalization, and leaky ReLU activation. Finally, the output of the model is obtained by a dense layer with a softmax activation function. We applied multi-task learning techniques, so that our model can determine whether a breath sound sample was captured and, if so, classify it. This allows the model to ignore any privacy-sensitive content.

High-performance neural networks for at-home remote monitoring

We propose a network that contains BiLSTM layers, a DNN layer, and an output layer, as shown in Fig. 4, to remotely monitor at-home patients. BiLSTM is composed of two hidden networks, the forward and backward LSTM networks, which capture and process information both before and after the current moment. In our previous work [15], we applied BiLSTM to recognize hand gestures, but with a different architecture and input data (3-axis accelerometer and gyroscope data with three CNN and two BiLSTM layers); the obtained results were better than those of the LSTM. Table 2 presents the detailed network, which includes four BiLSTM layers with 32 memory units each. The output of each layer is the input of the next, and the outputs are added together by a residual connection. The result is flattened into a vector and fed to a fully connected layer with 256 neurons for multi-task learning: detection (breath or not) and classification (normal, deep, heavy breath, or other). The first task ignores and removes all non-breath data, which allows us to avoid concerns about privacy-sensitive content. To mitigate overfitting, we used a dropout layer with a drop rate of 0.5.
Table 2

Structure of BiLSTM.

Layer | Structure | Output shape | # params
Input | | (None, 1200, 12) | 0
Bi_lstm | 32 | (None, 1200, 64) | 11.776K
Bi_lstm | 32 | (None, 1200, 64) | 25.088K
Add | | (None, 1200, 64) | 0
Bi_lstm | 32 | (None, 1200, 64) | 25.088K
Bi_lstm | 32 | (None, 1200, 64) | 25.088K
Add | | (None, 1200, 64) | 0
Flatten | | (None, 76,800) | 0
Dense | 256 | (None, 64) | 4,915.264K
Dropout | | (None, 64) | 0
Dense | | (None, 4) | 297
Dense | | (None, 2) | 149
Total | | | 5,002.71K
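The per-layer parameter counts in Table 2 can be reproduced with the standard cuDNN-style LSTM formula (four gates, with separate input and recurrent bias vectors per direction). A quick check, assuming the 12-dimensional input features and 64-dimensional dense output shown in the table:

```python
def bilstm_params(input_dim, units):
    """Parameter count of one bidirectional LSTM layer under the
    cuDNN convention (4 gates, two separate bias vectors)."""
    one_direction = 4 * units * (input_dim + units) + 8 * units
    return 2 * one_direction

def dense_params(input_dim, units):
    return input_dim * units + units

first_layer = bilstm_params(12, 32)   # 11,776 - first row of Table 2
deeper = bilstm_params(64, 32)        # 25,088 - subsequent BiLSTM rows
flattened = 1200 * 64                 # 76,800 features after Flatten
fc = dense_params(flattened, 64)      # 4,915,264 - the large Dense layer
```

The flattened Dense layer dominates the budget, which is why the full model is reserved for the central server rather than the edge devices.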

Experiments

To evaluate the proposed neural networks and pervasive system, we conducted experiments on two datasets: BreathSet and ICBHI'17 [28]. For an objective evaluation, we separately trained and evaluated three neural networks on the two datasets: a baseline network (LSTM) and our proposed networks (SincNet-CNN and Residual BiLSTM). We also assessed the feasibility of the pervasive system and its performance in terms of processing time. We discuss the conducted experiments in this section.

Dataset

We built a dataset, BreathSet, from 27 patients (21 males and 6 females) with confirmed COPD who were hospitalized in three major hospitals of Thai Nguyen city, Vietnam. The patients gave written informed consent under a protocol approved by the hospitals' Medical Ethics Councils. The average demographic information of the studied subjects is reported in Table 3.
Table 3

Demographic information of studied subjects.

Gender | Number of patients | Age (years) | Weight (kg) | Height (cm)
Male | 21 | 36 | 58 | 173
Female | 6 | 35.4 | 45 | 158.7
Thirty embedded computers were manufactured and examined by the ethics council to collect breath sounds, lung rales, and other adventitious sounds (in this study, we focused only on devices and methods for breath sounds). For an objective observation, we did not ask the subjects to perform the studied activities (deep, heavy, and normal breaths) intentionally; they only wore our devices and performed their usual activities, including sitting, standing, and walking. Each subject performed the activities in several 30-min collection periods a day, for 21 days (3 days for breath sounds, and 18 days for lung rales and other adventitious sounds). A total of 3645 min of recordings was collected. We filtered out samples whose type the physicians could not determine owing to abundant noise or significantly low pitch. Subsequently, we obtained a total of 552 min containing 752, 732, and 9589 cycles of deep, heavy, and normal breaths, respectively; the remaining data were labeled as 'other'. To balance the dataset, we used only 732 normal breaths and 793 'other' samples, which contained single- and multi-person speech, blows, phone notifications, and fan sounds. Finally, BreathSet consists of 3009 breath cycles with a total of 150 min. Next, we segmented BreathSet with five different window sizes (1, 2, 3, 4, and 5 s) and 80% overlap between frames. This overlapping technique allows us to augment the training data and provides better performance than disjoint segmentation. Among the segmented frames, the 4 s windows performed best, as shown in Table 4.
Table 4

Accuracy comparison among different window sizes on BreathSet.

Window size | SincNet-CNN (Precision / Recall / F1-score) | BiLSTM (Precision / Recall / F1-score)
1 s | 0.83 / 0.82 / 0.82 | 0.87 / 0.86 / 0.86
2 s | 0.85 / 0.85 / 0.85 | 0.93 / 0.93 / 0.93
3 s | 0.88 / 0.87 / 0.88 | 0.92 / 0.92 / 0.92
4 s | 0.92 / 0.91 / 0.91 | 0.95 / 0.94 / 0.95
5 s | 0.92 / 0.90 / 0.90 | 0.90 / 0.90 / 0.91
The data were annotated by supervising doctors using Audacity, with four labels, as shown in Fig. 5. Normal breaths have the longest cycle, with relatively low amplitude and power spectrogram. Deep and heavy breaths have higher amplitudes and power spectrograms with shorter cycles, heavy breaths being the shortest. The 'other' class has a different time-domain shape, which is thinner and has the highest power spectrogram.
Fig. 5

Time and frequency domain (Linear-frequency power spectrogram) of (a) normal, (b) deep, (c) heavy, (d) others (e.g., people laughing).

For an objective evaluation, we also applied our proposed method to ICBHI'17, a popular dataset for respiratory sound analysis [28]. The sounds were captured at different chest locations using four devices (a Welch Allyn Master Elite Plus Stethoscope Model 5079-400, a 3M Littmann Classic II SE, a 3M Littmann 3200 Electronic Stethoscope, and an AKG C 417 PP microphone) at various sampling frequencies (4 kHz, 10 kHz, and 44.1 kHz). To standardize, the recordings were down- or up-sampled to 8 kHz, as in our dataset. ICBHI'17 contains 920 audio samples from 126 patients, in which each breathing cycle was annotated by two respiratory physiotherapists and one medical doctor into one of four classes: normal, wheeze, crackle, and both wheeze and crackle. The dataset includes a total of 330 min of recordings containing 6898 respiratory cycles, of which 1864 contain crackles, 886 contain wheezes, 506 contain both, and the remaining are normal. Unlike our method, this dataset was not captured through the mouth; however, we conducted experiments on it for a performance comparison of multi-task learning on respiratory-related diseases.

Experiment setup

We conducted three experiments:

Experiment #1 - Feature evaluation with SincNet-CNN and Residual BiLSTM on BreathSet. We evaluated three common features (spectrograms, MFCCs, and MFCCs after noise cancellation by RNNoise) to select the best one for the remaining experiments.

Experiment #2 - Multi-task learning evaluation with LSTM, SincNet-CNN, and Residual BiLSTM on BreathSet and ICBHI'17. First, we trained and evaluated the proposed networks on BreathSet to detect breath/non-breath and to classify normal, deep, heavy, and other sounds. Next, we trained separate models on ICBHI'17 to detect normal/abnormal sounds and to classify them as normal, crackle, or wheeze.

Experiment #3 - Feasibility and processing time of the e-Breath monitor system.

Models were implemented using TensorFlow 2.3.0 and Python 3.6.9 on a 12 GB NVIDIA Tesla K80 GPU and a 2.3 GHz Intel(R) Xeon(R) processor. To optimize network parameters, we used Azure Machine Learning pipelines [29], which automate hyperparameter tuning and run experiments in parallel. We tested several configuration sets, including the training-validation-testing ratio (60-20-20, 60-30-10, 70-20-10, 50-30-20), the input shape ((32, 128), (32, 140), (40, 126), (40, 128), (40, 140)), and the batch size (32, 64, 128). The best-performing configuration used three loss functions, namely mean squared error (MSE), sparse categorical cross-entropy (SCC), and the average of these two (SCC-MSE loss); the Adam optimizer, which self-adjusts the learning rate, with an initial learning rate of 10−4; and a mini-batch size of 128.
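The averaged SCC-MSE loss can be sketched as follows (a NumPy sketch; the assumption that the MSE term compares predicted probabilities against one-hot targets is ours, as the paper does not spell this out):

```python
import numpy as np

def scc_mse_loss(y_true, y_pred, eps=1e-12):
    """Average of sparse categorical cross-entropy and MSE.

    y_true: integer class ids, shape (batch,)
    y_pred: predicted class probabilities, shape (batch, num_classes)
    """
    batch, num_classes = y_pred.shape
    one_hot = np.eye(num_classes)[y_true]                  # one-hot targets (assumption)
    scc = -np.log(y_pred[np.arange(batch), y_true] + eps)  # per-sample cross-entropy
    mse = np.mean((one_hot - y_pred) ** 2, axis=1)         # per-sample squared error
    return np.mean(0.5 * (scc + mse))                      # SCC-MSE: average of the two
```

A perfect one-hot prediction drives both terms, and hence the combined loss, to zero.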
The training–validation–testing split was 70:20:10 (2099, 609, and 301 samples from 20, 4, and 3 patients, respectively, on BreathSet; and 4892, 1379, and 627 samples from 100, 14, and 12 patients on ICBHI'17). There is no patient overlap between the sets. Further details of the network parameters are given in Tables 1 and 2 for SincNet-CNN and Residual BiLSTM, respectively.
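The no-patient-overlap split assigns whole patients, not individual samples, to each set; a sketch of that procedure (the helper name and the random shuffle are our illustration, not the paper's code):

```python
import random

def split_by_patient(samples, train_n, val_n, seed=0):
    """Split (patient_id, sample) pairs so no patient appears in two sets."""
    patients = sorted({pid for pid, _ in samples})
    random.Random(seed).shuffle(patients)
    train_p = set(patients[:train_n])
    val_p = set(patients[train_n:train_n + val_n])
    train = [s for s in samples if s[0] in train_p]
    val = [s for s in samples if s[0] in val_p]
    test = [s for s in samples if s[0] not in train_p and s[0] not in val_p]
    return train, val, test
```

With BreathSet's 27 patients, a 20/4 split leaves 3 patients for testing, mirroring the paper's split.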
Table 1

Structure of SincNet-CNN.

Layer                 Output shape      # params
SincConvFast          (None, 2150, 64)
MaxPooling1D          (None, 716, 64)   0
LayerNormalization    (None, 716, 64)   128
LeakyReLU             (None, 716, 64)   0

Conv1D                (None, 712, 32)   10,272
MaxPooling1D_1        (None, 237, 32)   0
LayerNormalization_1  (None, 237, 32)   64
LeakyReLU_1           (None, 237, 32)   0

Flatten               (None, 7584)      0
LayerNormalization_2  (None, 7584)      15,168

Dense                 (None, 64)        485,440
BatchNormalization    (None, 64)        256
LeakyReLU_2           (None, 64)        0

Dense_1               (None, 64)        4,160
BatchNormalization_1  (None, 64)        256
LeakyReLU_3           (None, 64)        0

Dense_2               (None, 4)         1,040
Dense_3               (None, 2)         520

Total                                   517,557
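The per-layer parameter counts in Table 1 follow from the standard formulas; a quick check (the kernel size of 5 for the Conv1D layer is inferred from its 716 → 712 output-shape reduction):

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out            # weight matrix + biases

def conv1d_params(kernel, c_in, c_out):
    return kernel * c_in * c_out + c_out   # kernels + biases

def layernorm_params(features):
    return 2 * features                    # per-feature gain and bias

assert dense_params(7584, 64) == 485_440   # Dense after Flatten
assert dense_params(64, 64) == 4_160       # second Dense block
assert conv1d_params(5, 64, 32) == 10_272  # Conv1D (716 -> 712 implies kernel 5)
assert layernorm_params(64) == 128         # LayerNormalization after SincConv
assert layernorm_params(7584) == 15_168    # LayerNormalization after Flatten
```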
The F1-score, precision, recall, accuracy, and confusion matrix were used to evaluate the experimental results.

Results and discussion


Feature evaluation

The results of the feature evaluation are provided in Table 5. Moving from spectrograms to MFCCs, the F1-score increased by 4% and 2%, on average, for SincNet-CNN and Residual BiLSTM, respectively (the performance did not change for LSTM); moving from MFCCs to RNNoise & MFCCs, the gains were 5%, 4%, and 5% for SincNet-CNN, LSTM, and Residual BiLSTM. The last feature set (RNNoise & MFCC) yielded the highest precision, recall, and F1-score. This can be attributed to the fact that the breath sounds were collected in noisy environments (hospitals); noise cancellation therefore improves the breath sound quality, as illustrated in Fig. 7. Consequently, we used RNNoise & MFCC features in the remaining experiments.
Table 5

Results on the feature evaluation with BreathSet.

                          SincNet-CNN           LSTM                  Residual BiLSTM
Feature       Class       Prec.  Rec.  F1       Prec.  Rec.  F1       Prec.  Rec.  F1
Spectrogram   Normal      0.79   0.74  0.74     0.80   0.76  0.78     0.86   0.83  0.81
              Deep        0.85   0.75  0.80     0.82   0.78  0.81     0.89   0.86  0.88
              Heavy       0.89   0.82  0.85     0.93   0.92  0.93     0.93   0.89  0.91
              Other       0.87   0.89  0.89     0.89   0.94  0.92     0.89   0.94  0.92
              Mean/Std    0.85   0.80  0.82     0.86   0.85  0.86     0.90   0.88  0.88

MFCC          Normal      0.81   0.82  0.79     0.82   0.79  0.80     0.87   0.88  0.86
              Deep        0.83   0.82  0.82     0.84   0.78  0.83     0.87   0.88  0.87
              Heavy       0.89   0.94  0.91     0.91   0.89  0.90     0.92   0.94  0.93
              Other       0.87   0.94  0.91     0.87   0.94  0.91     0.93   0.94  0.94
              Mean/Std    0.85   0.88  0.86     0.86   0.85  0.86     0.90   0.91  0.90

RNNoise       Normal      0.96   0.97  0.98     0.92   0.82  0.87     0.92   0.89  0.93
filter        Deep        0.74   0.83  0.95     0.92   0.83  0.88     0.96   0.90  0.94
& MFCC        Heavy       0.94   0.92  0.90     0.97   0.92  0.94     0.97   0.98  0.96
              Other       0.99   0.92  0.85     0.83   0.99  0.91     0.95   0.99  0.97
              Mean/Std    0.92   0.91  0.91     0.91   0.89  0.90     0.95   0.94  0.95
Fig. 7

The heavy breath before (a) and after (b) noise-canceling.


Multi-task learning evaluation

We conducted experiments on BreathSet and ICBHI'17 separately; the obtained results are shown in Table 6. On BreathSet, SincNet-CNN achieved a high average accuracy of 91% and 90% for classification and detection, respectively. On ICBHI'17, the proposed model outperformed recent methods [30], reaching 83% and 85% for classification and detection, respectively.
Table 6

Results on multi-task learning of the proposed models.

                             SincNet-CNN           LSTM                  Residual BiLSTM
                Class        Prec.  Rec.  F1       Prec.  Rec.  F1       Prec.  Rec.  F1
BreathSet
Classification  Normal       0.96   0.97  0.98     0.92   0.82  0.87     0.92   0.89  0.93
                Deep         0.74   0.83  0.95     0.92   0.83  0.88     0.96   0.90  0.94
                Heavy        0.94   0.92  0.90     0.97   0.92  0.94     0.97   0.98  0.96
                Other        0.99   0.92  0.85     0.83   0.99  0.91     0.95   0.99  0.97
                Mean/Std     0.92   0.91  0.91     0.91   0.89  0.90     0.95   0.94  0.95

Detection       Breath       0.90   0.91  0.90     0.92   0.88  0.89     0.93   0.87  0.90
                Non-Breath   0.92   0.91  0.90     0.92   0.88  0.89     0.93   0.87  0.90
                Mean/Std     0.91   0.91  0.90     0.92   0.88  0.89     0.93   0.87  0.90

ICBHI'17
Classification  Crackle      0.94   0.76  0.84     0.83   0.83  0.83     0.88   0.86  0.87
                Normal       0.87   0.94  0.90     0.90   0.88  0.89     0.96   0.97  0.97
                Wheeze       0.72   0.75  0.73     0.70   0.76  0.73     0.75   0.76  0.75
                Mean/Std     0.84   0.82  0.83     0.81   0.82  0.82     0.86   0.86  0.86

Detection       Abnormal     0.96   0.72  0.82     0.84   0.91  0.87     0.89   0.92  0.91
                Normal       0.80   0.97  0.88     0.92   0.84  0.88     0.94   0.88  0.92
                Mean/Std     0.87   0.85  0.85     0.88   0.88  0.88     0.91   0.90  0.91
Moreover, unlike these methods, we focused on running the SincNet-CNN models on low-cost devices; even on a lower-end configuration, the proposed method can detect and classify breath sounds accurately. For local hospitals, especially in rural areas of developing countries such as Vietnam, these results are crucial for in-hospital monitoring because they reduce investment costs. On BreathSet, the training of Residual BiLSTM was stopped after 100 epochs, as shown in Fig. 6(a). The figure also shows that the gap between training and test accuracy is acceptable, indicating that the model works correctly without overfitting. BiLSTM achieved the highest average accuracy of 95% and 90% for classification and detection, respectively. On ICBHI'17, it achieved state-of-the-art accuracies of 91% and 86% for these tasks, and it outperformed the baseline LSTM network in both the detection and classification of breath sounds. More details can be found in the confusion matrices for BreathSet and ICBHI'17 in Tables 7 and 8. The advantage can be attributed to the fact that Residual BiLSTM uses both forward and backward information, allowing it to capture breath context before and after the current frame; the obtained results are therefore more stable and accurate.
Fig. 6

Accuracy progress on the BreathSet dataset (a) and Computation time of BiLSTM and SincNet-CNN.

Table 7

Confusion matrix of BreathSet.

Ground truth   Predicted
               Normal   Deep   Heavy   Others
Normal         713      12     6       1
Deep           23       698    14      17
Heavy          8        18     690     16
Others         10       21     42      720
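Per-class precision, recall, and F1 are read directly off such a confusion matrix; a small sketch with a toy 2×2 matrix (not the paper's data):

```python
import numpy as np

def per_class_prf(cm):
    """Precision/recall/F1 per class from a confusion matrix.

    cm[i, j] = count of samples with ground truth i predicted as j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                  # true positives on the diagonal
    precision = tp / cm.sum(axis=0)   # column sums = all predictions of a class
    recall = tp / cm.sum(axis=1)      # row sums = all ground-truth instances
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, `per_class_prf([[8, 2], [1, 9]])` gives class-0 recall 8/10 and precision 8/9.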
Table 8

Confusion matrix of ICBHI’17.

Ground truth   Predicted
               Crackle   Normal   Wheeze
Crackle        4965      240      292
Normal         409       9055     380
Wheeze         108       272      2871
The results are comparable to other works such as [16], [31], while we mainly focused on detecting and classifying slight indications of COPD, such as heavy or deep breaths, which is critical in the context of the COVID-19 pandemic. The proposed network can thus diagnose patients remotely (at-home monitoring) and accurately, reducing the overload problem of many hospitals.

Feasibility and processing time evaluation

After successful testing, we deployed the proposed models and e-Breath monitor system at three hospitals in Thai Nguyen city, Vietnam. The deployment plan is shown in Fig. 8:
Fig. 8

e-Breath monitor system with four layers including (i) the physical layer; (ii) the mobile computing layer at patient’s sides; (iii) the cloud server at our project office; and (iv) the application layer at physician’s sides for monitoring patients.

The four layers were deployed as follows: (i) the physical layer: patients wore our devices on their upper arms; (ii) the mobile computing layer: a router and a gateway were deployed at the hospitals; (iii) the cloud layer: a 2.30 GHz Intel Xeon processor, 16 GB of DDR4 RAM, and a 12 GB NVIDIA Tesla K80 GPU, installed at our project office in Hanoi, Vietnam; (iv) the application layer: physicians use their computers and smartphones to access the MIS and monitor their patients. The in-hospital monitoring scenarios were performed and examined by physicians, as shown in Fig. 9, who were able to monitor and diagnose their patients through our system.
Fig. 9

In-hospital patient monitoring (a. Wearable devices are attached to the upper arm of patients; they can pull the high-flex cable of the microphone near to their mouth; b. Physicians can use their smartphones to monitor patient breaths in real time).

In addition, we conducted experiments to evaluate the real-time performance of the proposed system. The longest-running tasks were the model executions, so we first measured model execution time and then the total processing time of the e-Breath Monitor in both scenarios. The model execution times for both scenarios (in-hospital monitoring with SincNet-CNN and at-home monitoring with BiLSTM) are shown in Fig. 6(b). Excluding the first execution, which usually takes longer because of initial loading, the average computation time was approximately 603 ms for the BiLSTM model and 286 ms for SincNet-CNN. The total processing times for the in-hospital and at-home scenarios were thus 796 ms and 1193 ms, respectively, so our proposed method can monitor in real time in both scenarios.
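The timing protocol above, which discards the first-time execution, can be sketched generically (a sketch using `time.perf_counter`; a warm-up pass stands in for one-off costs such as model loading):

```python
import time

def mean_latency_ms(fn, runs=20, warmup=1):
    """Average wall-clock latency of fn in milliseconds, skipping warm-up
    calls whose one-off costs (e.g., model loading) would skew the mean."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000.0
```

For instance, `mean_latency_ms(lambda: model(x))` would estimate steady-state inference time for a hypothetical `model` and input `x`.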

Conclusions and future work

A pervasive deep learning framework was proposed to monitor respiratory patients. The architecture combines three deep neural networks to detect and classify deep, heavy, and normal breaths. We designed low-cost edge devices, including a wearable IoT device that captures breath sounds through the mouth and an embedded computer that executes a tiny deep model with high accuracy. Two multi-task deep learning architectures (SincNet-CNN and BiLSTM networks) were proposed for breath detection and classification. The experimental results showed that the proposed method is effective and reliable, with high F1-scores of 90% and 95% for the detection and classification of breath sounds under real-world settings, while being fast enough for real-time processing. In the future, we will build a larger dataset and improve the performance of our model. Deploying the system in hospitals to help doctors remotely monitor patients with respiratory diseases is also part of our future plan.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.