End-to-End Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis

We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder. To the best of our knowledge, this is the first model that efficiently integrates ASR and speaker identification modules in a multichannel setting. On simulated mixtures of LibriSpeech data, our system reduces the word error rate (WER) by up to 12% and 16% relative compared to previously proposed single-channel and multichannel approaches, respectively. Furthermore, we investigate the impact of different input features, including multichannel magnitude and phase information, on the ASR performance. Finally, our experiments on the AMI corpus confirm the effectiveness of our system for real-world multichannel meeting transcription.


INTRODUCTION
Automatic processing of multi-party (a.k.a. multi-speaker) speech recordings, such as meetings, requires multi-speaker automatic speech recognition (ASR) and diarization systems. The main challenges faced in these scenarios include overlapping speech, reverberation caused by distant microphones, and background noise. End-to-end multi-speaker ASR and diarization systems for single-channel [1][2][3][4] and multichannel recordings [5][6][7][8] have recently emerged, demonstrating promising results on meeting transcription tasks.
The earlier approaches for end-to-end multi-speaker ASR and diarization for single-channel recordings lacked information sharing between the ASR and the speaker verification modules [2] and/or required a varying number of speaker encoder (or attention) modules based on the number of speakers [1,3]. The authors of [4] proposed an end-to-end single-channel speaker-number invariant Transformer-based speaker-attributed ASR (SA-ASR) system based on serialized output training (SOT). This system shares both speech and speaker representations across the multi-speaker ASR and speaker diarization tasks. Similar to their single-channel counterparts, the ASR and speaker identification modules in end-to-end multichannel ASR and diarization approaches [5][6][7] do not share speech and speaker representations. Interestingly, the approach of multichannel word-level diarization with SOT (MC-WD-SOT) in [8] performs a fusion of ASR and speaker information. It uses multi-frame cross-channel attention (MFCCA) [9] in the ASR encoder to integrate information from different channels and hidden-layer embeddings from the ASR decoder to assist speaker identification. However, the ASR and the diarization modules in this approach use different types of attention for multichannel fusion, leading to an increase in the number of model parameters. Moreover, the communication of ASR and speaker information is not bidirectional, since the speaker information is not reciprocally exploited by ASR.
Multichannel SA-ASR (MC-SA-ASR) can exploit spatial information, which is generally advantageous for localizing different speakers. However, the predominant approaches in end-to-end multichannel ASR use Mel filterbank features as input and discard phase information [9,10]. More recently, the "all-in-one" model in [11] uses interchannel phase difference (IPD) [12][13][14] and spatial features, wherein the latter require video information to localize the speakers. In [15], magnitude and phase information from each channel were separately passed through linear layers and then concatenated to form the input for multichannel ASR. To our knowledge, there is currently no research comparing and discussing the impact of different input features on MC-SA-ASR.
In this paper, we target end-to-end MC-SA-ASR, and investigate the benefit of i) leveraging speaker information for ASR and ii) exploiting phase information. To do so, we propose an MC-SA-ASR system that tightly couples a Conformer-based encoder with MFCCA, a speaker encoder and a speaker-attributed Transformer-based decoder. We explore the use of phase information as input. We conduct extensive experiments on simulated mixtures of LibriSpeech data as well as on the AMI corpus.
The rest of the paper is organized as follows. Section 2 presents the related work. In Section 3, we introduce our proposed MC-SA-ASR system and the different features and feature encoding methods. Section 4 presents our experimental setup and results on simulated data, as well as on real meeting data. Finally, Section 5 provides a conclusion.

979-8-3503-0689-7/23/$31.00 ©2023 IEEE

Fig. 1. Overview of the proposed MC-SA-ASR system (left) and its encoder (right).

End-to-end single-channel SA-ASR
A Transformer-based end-to-end single-channel SA-ASR (SC-SA-ASR) system was proposed in [4]. Following the SOT principle, the output is the concatenation of all speakers' sentences, where each token is associated with one speaker ID and distinct speakers are separated by a <sc> token. The inputs to the model consist of an acoustic feature sequence X ∈ R^{L×A}, where L is the sequence length and A the feature dimension, and a matrix S ∈ R^{E×K} of E-dimensional reference speaker embeddings obtained from enrollment data, each corresponding to one speaker k out of K. The feature sequence X is fed to the speaker encoder and the ASR encoder, which produce the embeddings H^spk and H^asr, respectively. These embeddings and the n − 1 previous ASR output tokens ŷ_{[1:n−1]} are given to the speaker decoder to obtain a speaker posterior ŝ_n and a speaker profile s_n associated with the n-th output token; in the corresponding expressions, (·)^T denotes matrix transposition. The ASR embedding H^asr and the speaker profile s_n are provided to the ASR decoder to generate the n-th ASR output token. Given the ground truth token and speaker sequences, the training objective is to maximize the joint probability of these two sequences.
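The data flow described above can be summarized by the following sketch. It is only one reading that is consistent with the stated dimensions (S ∈ R^{E×K}) and is not a reproduction of the formulation in [4]. The intermediate query vector q_n and the dot-product scoring against S are assumptions made for illustration, and the actual system may score speakers differently, for example with a cosine similarity.

```latex
\begin{align*}
H^{\mathrm{spk}} &= \mathrm{SpeakerEncoder}(X), \\
H^{\mathrm{asr}} &= \mathrm{AsrEncoder}(X), \\
% q_n is a hypothetical speaker query produced by the speaker decoder
q_n &= \mathrm{SpeakerDecoder}\big(H^{\mathrm{spk}}, H^{\mathrm{asr}}, \hat{y}_{[1:n-1]}\big) \in \mathbb{R}^{E}, \\
% posterior over the K enrolled speakers (dot-product scoring is an assumption)
\hat{s}_n &= \mathrm{softmax}\big(S^{T} q_n\big) \in \mathbb{R}^{K}, \\
% speaker profile: posterior-weighted average of the reference embeddings
s_n &= S\,\hat{s}_n \in \mathbb{R}^{E}, \\
\hat{y}_n &= \mathrm{AsrDecoder}\big(H^{\mathrm{asr}}, s_n, \hat{y}_{[1:n-1]}\big).
\end{align*}
```

Under this reading, the speaker profile s_n is a convex combination of the columns of S, so it lies in the same E-dimensional space as the enrollment embeddings that condition the ASR decoder.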

Conformer-based multichannel ASR encoder
The Conformer-based multichannel ASR encoder is based on MFCCA [9], which is an attention mechanism that combines cross-channel and temporal context information. The h-th MFCCA head is computed from per-head query and key projections, where C represents the number of channels, W^q_h, W^k_h, b^q_h and b^k_h are learnable parameters, and X = [X_0, …, X_t, …, X_T] with X_t = [X_{t−F}, …, X_t, …, X_{t+F}] ∈ R^{(2F+1)C×D} the concatenation of the current frame with F context frames on each side at each time step t.
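To make the shapes concrete, here is a minimal NumPy sketch of a single attention head that attends over channels and neighbouring frames. It is a simplified illustration rather than the MFCCA implementation of [9]: the function names, the zero-padding at the sequence edges, the scaled dot-product form and the use of unprojected context frames as values are all assumptions.

```python
import numpy as np

def stack_context(x, F):
    """Stack F frames on each side of every time step.

    x: (T, C, D) per-channel features -> (T, (2F+1)*C, D).
    Edges are zero-padded (an assumption made for this sketch).
    """
    T = x.shape[0]
    padded = np.pad(x, ((F, F), (0, 0), (0, 0)))
    frames = [padded[i:i + T] for i in range(2 * F + 1)]  # each (T, C, D)
    return np.concatenate(frames, axis=1)

def cross_channel_head(x, F, Wq, bq, Wk, bk):
    """One simplified head: each channel at time t attends to all channels
    in frames t-F..t+F. Values are the raw context frames (assumption)."""
    ctx = stack_context(x, F)                      # (T, (2F+1)C, D)
    q = x @ Wq + bq                                # (T, C, d)
    k = ctx @ Wk + bk                              # (T, (2F+1)C, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ ctx                           # (T, C, D)

rng = np.random.default_rng(0)
T, C, D, F = 6, 4, 8, 2
x = rng.standard_normal((T, C, D))
Wq, Wk = rng.standard_normal((D, D)), rng.standard_normal((D, D))
bq, bk = np.zeros(D), np.zeros(D)
out = cross_channel_head(x, F, Wq, bq, Wk, bk)
```

With C = 4 and F = 2, each query attends over (2F+1)C = 20 key positions, which matches the first dimension of X_t given in the text.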

PROPOSED METHOD
The proposed MC-SA-ASR system is illustrated in Fig. 1. It draws inspiration from the end-to-end SC-SA-ASR model in Section 2.1, and uses reference speaker embeddings to guide the ASR decoder as in (6). However, instead of the single-channel encoder in Section 2.1, our system uses a modified version of the Conformer-based multichannel ASR encoder in Section 2.2. The ASR and speaker decoders remain the same as in the SC-SA-ASR model. In this section, we describe the elements that are specific to the proposed system.

Input features for multichannel ASR
Multichannel speech signals contain spatial information that could be advantageous for discriminating speakers. Therefore, we investigate whether incorporating phase information can help our MC-SA-ASR model to generate more accurate multi-speaker transcriptions.
We consider two alternative sets of input features. On the one hand, we compute M-dimensional log Mel filterbank features from the STFT magnitude only. On the other hand, we concatenate the STFT magnitude and the cosine and sine of the phase, each of dimension G, into a 3 × G-dimensional representation that is called the magnitude+phase feature. These features then undergo a specific processing, which is illustrated in Fig. 2 and described hereafter.
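As an illustration of the magnitude+phase feature, the following NumPy sketch frames each channel, applies an STFT, and concatenates the magnitude with the cosine and sine of the phase. The 16 kHz sampling rate, the Hann window, the absence of padding and the absence of any log compression are assumptions. The 25 ms window and 10 ms hop follow the experimental setup, and yield G = 201 frequency bins at 16 kHz.

```python
import numpy as np

def magnitude_phase_features(wave, sr=16000, win_ms=25, hop_ms=10):
    """wave: (C, N) multichannel signal -> (C, n_frames, 3 * G) features."""
    win = int(sr * win_ms / 1000)                  # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)                  # 160 samples at 16 kHz
    n_frames = 1 + (wave.shape[-1] - win) // hop
    # Index matrix selecting each analysis frame: (n_frames, win)
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[..., idx] * np.hanning(win)      # (C, n_frames, win)
    spec = np.fft.rfft(frames, axis=-1)            # (C, n_frames, G), G = win // 2 + 1
    phase = np.angle(spec)
    return np.concatenate([np.abs(spec), np.cos(phase), np.sin(phase)], axis=-1)

rng = np.random.default_rng(0)
feats = magnitude_phase_features(rng.standard_normal((4, 16000)))
```

Encoding the phase through its cosine and sine avoids the discontinuity of the raw angle at ±π while keeping the information needed to compare phases across channels.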
For the Mel filterbank, we apply depthwise separable convolution on each microphone. In the input array, C represents the number of microphones and L is the audio length. After two layers of 2-dimensional convolution, the output has a dimension of C × T × A, where T = L/4 since each layer performs a sub-sampling of factor 2 over the time dimension, and A = 32 × M/4, which is the output feature dimension.
The magnitude+phase features are processed using three layers of depthwise separable convolution. In the first layer, convolutional operations are used to fuse information across magnitude, cosine and sine values. The next two layers are similar to the ones used for the Mel filterbank, resulting in features with a dimension of C × T × A, similar to the Mel filterbank features, but with A = 32 × G/8.
Finally, for both the Mel filterbank and magnitude+phase features, the output array of dimension C × T × A is passed to a linear layer yielding a representation array of dimension C × T × D, where D is the model dimension which is the same for both input features.
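The feature dimensions above can be checked with a few lines of arithmetic. The sketch below assumes that the frequency axis is rounded up after sub-sampling. That rounding is not stated in the text, but it reproduces the values A = 640 and A = 832 reported in the model and training setup for M = 80 and G = 201, whereas an exact division would give 32 × 201/8 = 804.25.

```python
import math

def conv_feature_dim(n_bins, freq_reduction, out_channels=32):
    """Flattened feature size A after the convolutional front-end.

    n_bins: input frequency dimension (M or G).
    freq_reduction: total sub-sampling factor along frequency (4 or 8).
    Ceiling rounding is an assumption chosen to match the reported sizes.
    """
    return out_channels * math.ceil(n_bins / freq_reduction)

def subsampled_length(n_frames, time_reduction=4):
    """Sequence length T after two stride-2 layers along time (T = L/4)."""
    return n_frames // time_reduction

a_mel = conv_feature_dim(80, 4)        # Mel filterbank branch
a_magphase = conv_feature_dim(201, 8)  # magnitude+phase branch
```

Both branches are then projected to the same model dimension D by the linear layer, so the rest of the encoder is identical for the two feature types.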

Convolution fusion
Convolution fusion serves as the output layer of the multichannel ASR encoder (see Fig. 1). It combines the representations corresponding to the multiple input channels. We extend the convolution fusion from [9] to support 2-, 3-, and 4-channel input as illustrated in Fig. 3.

Speaker encoder
Mel filterbank features are first averaged across all channels and then fed to our speaker embedding model. We use an x-vector speaker embedding model based on the emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN) [16]. In order to align the dimension of the x-vectors with our model architecture, we replace the final (average pooling) layer by a linear layer.

Dataset and metrics
We simulate a multi-speaker scenario using the LibriSpeech dataset [17]. The train-960, dev-clean and test-clean subsets are used to generate our train, dev and test sets, respectively. We assume a linear 2- to 4-microphone array with an aperture of 10 cm. Room impulse responses (RIRs) are generated by the image source method (ISM) using the gpuRIR toolkit [18]. The length and width of the room are randomly drawn between 3 and 8 m, and its height between 2.4 and 3 m. The array centers and speaker positions are randomly sampled with the constraints that the array center is at most 0.5 m away from the room center, its height is in the range of 0.6 to 0.8 m, and the speakers are at least 0.5 m away from the walls. The RT60 value is constrained to the range [0.4, 1] s. The number of speakers in each mixture is randomly chosen between 1 and 3. The speaker embeddings directory includes 8 speakers, and the speaker IDs are also randomly generated. In order to simulate multi-speaker scenarios, the second and third speakers' signals are delayed relative to the previous speaker. This realistic choice also guarantees the first-in first-out (FIFO) principle behind SOT [19]. The sentences of each speaker are concatenated together and separated from the next speaker using the <sc> token.
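The geometric constraints above can be sketched as a small sampling routine. This is an illustration only and does not call gpuRIR. The function and key names are hypothetical, the 0.5 m constraint on the array center is interpreted in the horizontal plane, the array is placed along the x axis, and the speaker height range (1.0 to 1.8 m) is an assumption since it is not specified in the text.

```python
import math
import random

def sample_room_config(rng=None):
    """Draw one simulated room configuration following the stated ranges."""
    rng = rng or random.Random()
    length, width = rng.uniform(3.0, 8.0), rng.uniform(3.0, 8.0)
    height = rng.uniform(2.4, 3.0)

    # Array center: uniformly within a 0.5 m disk around the room center.
    r = 0.5 * math.sqrt(rng.random())
    theta = rng.uniform(0.0, 2.0 * math.pi)
    cx = length / 2 + r * math.cos(theta)
    cy = width / 2 + r * math.sin(theta)
    cz = rng.uniform(0.6, 0.8)

    # Linear array of 2 to 4 microphones spanning a 10 cm aperture.
    n_mics, aperture = rng.randint(2, 4), 0.10
    mics = [(cx - aperture / 2 + i * aperture / (n_mics - 1), cy, cz)
            for i in range(n_mics)]

    # 1 to 3 speakers, each at least 0.5 m from the walls.
    speakers = [(rng.uniform(0.5, length - 0.5),
                 rng.uniform(0.5, width - 0.5),
                 rng.uniform(1.0, 1.8))  # speaker height: assumption
                for _ in range(rng.randint(1, 3))]

    return {"room": (length, width, height), "array_center": (cx, cy, cz),
            "mics": mics, "speakers": speakers,
            "rt60": rng.uniform(0.4, 1.0)}

cfg = sample_room_config(random.Random(0))
```

The resulting room size, microphone positions, source positions and RT60 are the quantities an image-source simulator typically needs to produce the RIRs.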
We use the word error rate (WER) to evaluate the ASR task, and the token-level speaker error rate (T-SER) and sentence-level speaker error rate (S-SER) [20] to evaluate the speaker prediction task. In order to calculate the S-SER, a single speaker is assigned to each hypothesized sentence by taking a max over token-level speaker prediction counts within the respective sentences segmented by <sc>.
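The sentence-level speaker assignment used for S-SER amounts to a majority vote within each <sc>-delimited sentence. A minimal sketch follows. The function name and the tie-breaking rule (the first speaker encountered wins) are assumptions.

```python
from collections import Counter

def sentence_speakers(tokens, token_speakers, sc="<sc>"):
    """Assign one speaker per hypothesized sentence by majority vote.

    tokens: output tokens, with `sc` separating sentences.
    token_speakers: predicted speaker ID for each token (ignored for `sc`).
    """
    assigned, current = [], []
    for token, speaker in zip(tokens, token_speakers):
        if token == sc:
            if current:
                assigned.append(Counter(current).most_common(1)[0][0])
            current = []
        else:
            current.append(speaker)
    if current:  # last sentence has no trailing separator
        assigned.append(Counter(current).most_common(1)[0][0])
    return assigned

tokens = ["hello", "there", "friend", "<sc>", "good", "morning"]
speakers = [1, 1, 2, None, 3, 3]
result = sentence_speakers(tokens, speakers)
```

In this example the first sentence is attributed to speaker 1 even though one of its tokens was predicted as speaker 2, which is how a token-level error can be absorbed at the sentence level.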

Baseline
We choose two baseline systems to study the impact of a multichannel encoder that effectively utilizes speaker information. The first baseline is the end-to-end SC-SA-ASR system proposed in [4]. It helps us to assess the impact of a multichannel MFCCA-based ASR encoder on far-field speech. Our implementation of the SC-SA-ASR system uses ECAPA-TDNN speaker embeddings. The second baseline is the MFCCA-based multichannel ASR (MC-ASR) model proposed in [9]. The MC-ASR model has been extended to perform speaker identification in the MC-WD-SOT model presented in [8]. However, the MC-WD-SOT model does not incorporate any speaker information in the ASR decoder. Hence, the performance of the ASR module in the MC-WD-SOT model is not expected to be better than that of the MC-ASR model. Thus, in favor of a simpler implementation, we choose the MC-ASR model as the second baseline.

Model and training setup
We compute the STFT with a window length of 25 ms and a hop size of 10 ms. The magnitude information is used to generate 80-dimensional log Mel filterbank features. For the magnitude+phase features, we generate a 3 × 201-dimensional feature that includes the magnitude and the cosine and sine of the phase. According to the convolution process in Fig. 2, the convolutional feature extractor produces features A of size 640 and 832 for the Mel filterbank features and the magnitude+phase features, respectively. The speaker encoder uses 80-dimensional log Mel filterbank features averaged across all channels. For all the models (SC-SA-ASR, MC-ASR and MC-SA-ASR) in our experiments, the Conformer-based encoder has 12 layers, and the Transformer-based decoder has 1 layer. The speaker decoder in SC-SA-ASR and MC-SA-ASR has 2 layers. All multi-head attention mechanisms have 4 heads, the model dimension D is set to 256, and the size of the feedforward layer is 2048. Following [9], the context frame length F of MFCCA in MC-ASR and MC-SA-ASR is set to 2. Our text tokenizer is a SentencePiece model [21] with a vocabulary of 5000 tokens. The speaker embedding model is an ECAPA-TDNN model pre-trained on the VoxCeleb1 [22] and VoxCeleb2 [23] training data, generating 192-dimensional embeddings. In each reference speaker embedding matrix S, the number of speakers K is set to 8, and the embedding for each speaker is derived from two random enrollment sentences.
Our experiments were implemented using the SpeechBrain toolkit [24]. All the models were trained until convergence. The ASR modules in SC-SA-ASR and MC-SA-ASR were pre-trained for 80 epochs by setting S and H^spk to 0. We utilized the Adam optimizer with a learning rate of 5 × 10⁻⁴ during the pre-training process. Subsequently, the ASR and speaker modules of the SC-SA-ASR model were fine-tuned for 60 epochs using a learning rate of 2.5 × 10⁻⁴. The MC-SA-ASR model was fine-tuned for 120 epochs using a learning rate of 1.5 × 10⁻⁵, after the initial training of 60 epochs. The weight of the speaker loss was set to 0.1, following prior work [20]. The MC-ASR model was trained for 140 epochs with a learning rate of 5 × 10⁻⁴. For all the experiments, the global batch size (batch size × number of GPUs × gradient accumulation factor) is fixed to 160.

Results and discussion
Table 1 presents the test results of the baseline models (SC-SA-ASR and MC-ASR) and the proposed model (MC-SA-ASR). First of all, by comparing the results of SC-SA-ASR and MC-SA-ASR, we can conclude that using a multichannel encoder to process multi-microphone speech information improves the speech recognition performance. Specifically, the MC-SA-ASR model with Mel filterbank input features achieves a WER of 14.77% in the 4-channel scenario, that is a 12% relative reduction compared to the SC-SA-ASR model (16.81%). On average, the WER of the MC-SA-ASR models with 2, 3, and 4 channels (15.14%) is reduced by 10% relative compared to the SC-SA-ASR model. Secondly, the proposed MC-SA-ASR model, which incorporates speaker information into the ASR decoder, obtains a 16% relative reduction in WER compared to MC-ASR (18.03%). This suggests that leveraging speaker information can improve the performance of multi-speaker ASR in a multichannel setting.
The test WERs obtained by the proposed MC-SA-ASR model with Mel filterbank input features and with magnitude+phase input features do not exhibit significant differences. However, interestingly, in the 2-channel scenarios, Mel filterbank features (15.41%) outperformed magnitude+phase (16.76%) with an 8% relative lower WER. Conversely, in the test results of the 3- and 4-channel scenarios, magnitude+phase features achieved a slight relative reduction of 1.4% and 0.5% in WER, respectively. This leads us to speculate that phase features may result in better ASR performance in models with a larger number of channels. However, further testing is required to validate this conclusion. Moreover, in the test results of the 3-channel scenarios, magnitude+phase features exhibited a relative reduction of 4% in WER on both 2- and 3-speaker-mixed datasets compared to Mel filterbank features. From this, we can conclude that in multi-speaker multichannel ASR, phase features perform better on chunks with a larger number of speakers. This can be explained by the fact that as the number of speakers in a room increases, the positional information of the speakers has a greater impact on ASR performance.
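The relative reductions quoted in this discussion follow from the usual definition (baseline − system) / baseline. The short check below uses only WER values stated in the text. The observation that the 16% figure matches the channel-averaged WER of 15.14% is an inference from the arithmetic, not something stated explicitly.

```python
def relative_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# SC-SA-ASR (16.81%) vs. 4-channel MC-SA-ASR (14.77%)
r_4ch = relative_reduction(16.81, 14.77)
# SC-SA-ASR (16.81%) vs. MC-SA-ASR averaged over 2, 3 and 4 channels (15.14%)
r_avg = relative_reduction(16.81, 15.14)
# MC-ASR (18.03%) vs. channel-averaged MC-SA-ASR (15.14%)
r_mc = relative_reduction(18.03, 15.14)
```

These evaluate to roughly 12.1%, 9.9% and 16.0%, in line with the 12%, 10% and 16% reported above.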

Evaluation on AMI
We validate the effectiveness of the proposed MC-SA-ASR model on real-world data by fine-tuning and testing our pretrained model on the AMI meeting corpus [25].

Corpus preparation
The AMI multiple distant microphone (MDM) corpus consists of approximately 100 hours of 8- to 16-channel audio recordings of 3 to 5 participants in meetings. The data is annotated in terms of ASR and diarization, with the start and end timestamps for each sentence. In order to process the meeting files, which typically consist of approximately 1 hour of content, we have to divide each meeting into smaller segments. We adopt a segmentation approach inspired by "utterance groups" [26] which works as follows. (a) Segment each meeting using a chunk size of b seconds and a hop size of o seconds. (b) If the start/end time of a segment falls within a region involving more than one speaker, it is adjusted to be two seconds outside the overlap region. (c) If the start/end time of a segment falls within a word, it is adjusted to align with the start/end of that word. The benefits of segmenting in this manner are twofold. Firstly, the utilization of a hop size allows for an increased number of training samples. Secondly, segmenting outside the speaker overlap regions respects the FIFO training approach.
We conducted experiments with chunk sizes of 5, 10 or 15 s, with the hop size set to 2 s. Figure 4 illustrates the distribution of the number of segments and speakers in the datasets generated using different chunk sizes. Table 2 presents some statistics of the training, development, and test sets generated with a chunk size of 5 s. It is observed that the probability of a chunk containing multiple speakers increases as the chunk size increases. To evaluate the model's performance on datasets with varying numbers of speakers, we combined all the test sets segmented at 5, 10, and 15 s. We then divided them into four different test sets based on the number of speakers.

Table 2. AMI statistics after segmentation (chunk size of 5 seconds). The average duration is in seconds (s), and the total duration is in hours (h).
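Steps (a) to (c) can be sketched as three small helpers. This is an illustrative reading, not the authors' implementation. The function names are hypothetical, and the direction of each adjustment (start boundaries move earlier, end boundaries move later) is an assumption, since the text only states that boundaries are moved outside the overlap region or aligned to the word.

```python
def chunk_boundaries(duration, chunk, hop):
    """Step (a): sliding windows of `chunk` seconds, every `hop` seconds."""
    windows, t = [], 0.0
    while t < duration:
        windows.append((t, min(t + chunk, duration)))
        t += hop
    return windows

def move_out_of_overlap(t, overlaps, is_start, margin=2.0):
    """Step (b): move a boundary lying inside an overlapped-speech region
    to `margin` seconds outside that region."""
    for begin, end in overlaps:
        if begin < t < end:
            return max(0.0, begin - margin) if is_start else end + margin
    return t

def snap_to_word(t, words, is_start):
    """Step (c): move a boundary lying inside a word to that word's
    start (for a segment start) or end (for a segment end)."""
    for begin, end in words:
        if begin < t < end:
            return begin if is_start else end
    return t

windows = chunk_boundaries(12, 5, 2)
new_end = move_out_of_overlap(5.0, [(4.0, 6.0)], is_start=False)
new_start = snap_to_word(3.2, [(3.0, 3.5)], is_start=True)
```

Because the hop size is smaller than the chunk size, consecutive windows overlap, which is what increases the number of training samples.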

Fine-tune settings
We fine-tune the pre-trained SC-SA-ASR and MC-SA-ASR models on the AMI MDM datasets using the Full-corpus-ASR partitions. SC-SA-ASR utilizes only the 1st channel of Array 11 of the train-dev-test splits, while MC-SA-ASR utilizes the 1st and 5th channels for the 2-channel model. The datasets with chunk sizes of 5, 10, and 15 s undergo fine-tuning for 40 and 90 epochs, respectively. In each case, the first half of all training epochs updates the ASR module, while the second half jointly updates the ASR and speaker modules. All fine-tuning steps employ the Adam optimizer with a learning rate of 1 × 10⁻⁴, and a global batch size of 160.

Results and discussion
Table 3 presents the test results of SC-SA-ASR and MC-SA-ASR on datasets segmented using different chunk sizes. We also compare the performance of models trained on different chunk sizes on all the test sets. Table 4 presents the results, divided into 4 subsets based on the number of speakers. We can observe that 2-channel MC-SA-ASR consistently achieves a lower WER than SC-SA-ASR. Particularly, the 5-second model demonstrates a 6% relative reduction in WER compared to SC-SA-ASR on the 2-speaker test set. Moreover, we observe that models trained on smaller chunk sizes perform better. In MC-SA-ASR, the 5-second model exhibits relative reductions in WER of 43%, 34%, 17%, and 11% on the test sets with 1, 2, 3 and 4 speakers compared to the 15-second model, respectively (note that the SER exhibits a similar trend). This might be explained by the fact that, according to Fig. 4, when decreasing the number of speakers, both the total number and the proportion of 5-second segments in the training set increase compared to 15-second segments.
A comparison of S-SER versus T-SER performances in Table 3 shows that S-SER is much higher than T-SER on the AMI test set, for both SC-SA-ASR and MC-SA-ASR systems, as opposed to lower S-SER and higher T-SER values on the LibriSpeech test set in Table 1. The reason is that the prediction of speaker change markers <sc> is a more challenging task on the AMI test set. As a further analysis, we evaluate the performance of the two systems on the speaker counting task on the AMI test set. Table 5 presents the speaker counting accuracy obtained by counting the occurrences of <sc> tokens in the ASR output, on datasets with different numbers of speakers. We observe that the MC-SA-ASR system consistently outperforms the SC-SA-ASR system. However, the accuracy decreases by 60% relative from the 3-speaker test set to the 4-speaker set for MC-SA-ASR (61% for SC-SA-ASR). Furthermore, for scenarios involving 2, 3, and 4 speakers, the majority of errors originate from underestimating the number of speakers.
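Speaker counting from the serialized output can be sketched as follows. The sketch assumes that each <sc> token introduces a new speaker, so the estimated count is the number of <sc> tokens plus one for a non-empty hypothesis. The function names are hypothetical.

```python
def count_speakers(tokens, sc="<sc>"):
    """Estimate the number of speakers from a serialized hypothesis."""
    if not tokens:
        return 0
    return tokens.count(sc) + 1

def counting_accuracy(hypotheses, reference_counts):
    """Percentage of chunks whose estimated speaker count is correct."""
    correct = sum(count_speakers(h) == r
                  for h, r in zip(hypotheses, reference_counts))
    return 100.0 * correct / len(reference_counts)

hyps = [["hi", "<sc>", "yes"], ["ok"], ["a", "<sc>", "b"]]
refs = [2, 1, 3]
acc = counting_accuracy(hyps, refs)
```

In this toy example the third hypothesis contains a single <sc> token for a 3-speaker chunk, so the count is underestimated, which is the dominant error type reported above.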

CONCLUSION
In this paper, we have introduced an end-to-end MC-SA-ASR system that combines a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder. Experimental results demonstrate that, on simulated data, our approach achieves relative reductions in WER of up to 12% and 16% compared to existing single-channel and multichannel methods, respectively. We also studied the impact of using Mel filterbank vs. magnitude+phase features on MC-SA-ASR. On real-world data, our model achieves a relative reduction in WER of up to 6% compared to SC-SA-ASR. However, it still has limitations in accurately determining the number of speakers in scenarios involving three or more participants. Future research can focus on improving this aspect.

ACKNOWLEDGMENTS
Experiments presented in this paper were carried out using (a) the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr), and (b) HPC resources from GENCI-IDRIS (Grant 2023-[AD011013881]).

Evaluation on mixtures of LibriSpeech data

Fig. 4. Percentage of segments containing a given number of speakers for different chunk sizes on the AMI corpus.

Table 1. WER (%), sentence-level SER (S-SER) (%) and token-level SER (T-SER) (%) on the simulated multichannel multi-speaker LibriSpeech test set. Results are grouped by the number of speakers in the simulated mixture. The '1,2,3-speaker mix' column shows the results obtained on a test set containing mixtures of 1, 2, and 3 speakers.

Table 4. WER (%) and token-level SER (T-SER) (%) of models adapted to AMI when chunk sizes are different across train and test splits. Rows 5 s, 10 s and 15 s refer to the chunk size in train-dev splits. The test set consists of 5, 10 and 15 s chunks. MC-SA-ASR uses 2 channels.

Table 5. Speaker counting accuracy (%) on the total of 5, 10 and 15 seconds test chunks of AMI for models trained on chunks of 5 seconds.