PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

  • Additional information
    • Contributors:
      Tallinn University of Technology (TTÜ); Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio (IRIT-SAMoVA); Institut de recherche en informatique de Toulouse (IRIT); Université Toulouse Capitole (UT Capitole); Université de Toulouse (UT)-Université de Toulouse (UT)-Université Toulouse - Jean Jaurès (UT2J); Université de Toulouse (UT)-Université Toulouse III - Paul Sabatier (UT3); Université de Toulouse (UT)-Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique (Toulouse) (Toulouse INP); Université de Toulouse (UT)-Toulouse Mind & Brain Institut (TMBI); Université Toulouse - Jean Jaurès (UT2J); Université de Toulouse (UT)-Université de Toulouse (UT)-Université Toulouse III - Paul Sabatier (UT3); Université de Toulouse (UT)-Université Toulouse Capitole (UT Capitole); Université de Toulouse (UT); DYNamiques de l’Information (DYNI); Laboratoire d'Informatique et des Systèmes (LIS) (Marseille, Toulon) (LIS); Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS)-Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS); Centre National de la Recherche Scientifique (CNRS); Agence de l’Innovation Défense under the grant number 2022 65 0079; HPC resources of GENCI-IDRIS under the allocations AD011014274, as well as the TalTech supercomputing resources
    • Publication information:
      HAL CCSD
      ISCA
    • Subject:
      2024
    • Collection:
      Aix-Marseille Université: HAL
    • Subject:
      Quebec City, Canada
    • Abstract:
      International audience. A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet it struggles with over-separation and with adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) and MixIT for SSep. With the small extra requirement of SD labels during training, it solves the problem of over-separation and allows stitching local separated sources by leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources by applying automatic speech recognition (ASR) systems to them. PixIT boosts the performance of various ASR systems across two meeting corpora in terms of both speaker-attributed and utterance-based word error rates, without requiring any fine-tuning.
    • Relation:
      info:eu-repo/semantics/altIdentifier/arxiv/2403.02288; ARXIV: 2403.02288
    • Identifier:
      10.21437/odyssey.2024-17
    • Rights:
      info:eu-repo/semantics/OpenAccess
    • Identifier:
      edsbas.7D0BA26