CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

  • Additional Information
    • Contributors:
      Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio (IRIT-SAMoVA); Institut de Recherche en Informatique de Toulouse (IRIT); Université de Toulouse (UT); Université Toulouse Capitole (UT Capitole); Université Toulouse - Jean Jaurès (UT2J); Université Toulouse III - Paul Sabatier (UT3); Centre National de la Recherche Scientifique (CNRS); Institut National Polytechnique de Toulouse (Toulouse INP); Toulouse Mind & Brain Institut (TMBI); ANR-18-CE23-0005, LUDAU, Lightly- and un-supervised discovery of audio units using deep learning (2018); ANR-19-P3IA-0004, ANITI, Artificial and Natural Intelligence Toulouse Institute (2019)
    • Publisher:
      HAL CCSD
    • Publication Year:
      2023
    • Collection:
      Université Toulouse 2 - Jean Jaurès: HAL
    • Abstract:
      Automated Audio Captioning (AAC) involves generating natural language descriptions of audio content using encoder-decoder architectures: an audio encoder produces audio embeddings that are fed to a decoder, usually a Transformer decoder, for caption generation. In this work, we describe our model, whose novelty compared to existing models lies in the use of a ConvNeXt architecture, adapted from the vision domain to audio classification, as the audio encoder. This model, called CNext-trans, achieved state-of-the-art scores on the AudioCaps (AC) dataset and performed competitively on Clotho (CL), while using four to forty times fewer parameters than existing models. We examined potential biases in the AC dataset, which originates from AudioSet, by investigating the impact of an unbiased encoder on performance: using CNN14 from the well-known PANN as an unbiased encoder, for instance, we observed a 1.7% absolute reduction in SPIDEr score (higher scores indicate better performance). To improve cross-dataset performance, we conducted experiments combining multiple AAC datasets (AC, CL, MACS, WavCaps) for training. Although this strategy enhanced overall model performance across datasets, it still fell short of models trained specifically on a single target dataset, indicating the absence of a one-size-fits-all model. To mitigate the performance gaps between datasets, we introduced a Task Embedding (TE) token that allows the model to identify the source dataset of each input sample (a minimal sketch of this mechanism follows the record below). We provide insights into the impact of these TEs on both the form (words) and the content (sound event types) of the generated captions. The resulting model, named CoNeTTE, an unbiased CNext-trans model enriched with dataset-specific Task Embeddings, achieved SPIDEr scores of 44.1% and 30.5% on AC and CL, respectively. For the sake of reproducibility, our code is publicly available: https://github.com/Labbeti/conette-audio-captioning.
    • Relation:
      info:eu-repo/semantics/altIdentifier/arxiv/2309.00454; hal-04193791; https://ut3-toulouseinp.hal.science/hal-04193791; https://ut3-toulouseinp.hal.science/hal-04193791/document; https://ut3-toulouseinp.hal.science/hal-04193791/file/conette_AAC.pdf; ARXIV: 2309.00454
    • Rights:
      info:eu-repo/semantics/OpenAccess
    • Accession Number:
      edsbas.65E8B6F
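
The Task Embedding mechanism summarized in the abstract can be illustrated with a short PyTorch sketch: a learned per-dataset embedding is prepended to the decoder's token sequence, so the decoder knows which dataset's captioning style each sample should follow. All names below (TaskEmbeddingDecoder, DATASETS, d_model) are illustrative assumptions, not the authors' implementation; the actual code is in the repository linked above.

    # Illustrative sketch only; the real CoNeTTE implementation lives at
    # https://github.com/Labbeti/conette-audio-captioning.
    import torch
    import torch.nn as nn

    DATASETS = ["audiocaps", "clotho", "macs", "wavcaps"]  # training sources

    class TaskEmbeddingDecoder(nn.Module):
        """Transformer decoder whose input is prefixed with a learned token
        identifying the source dataset of each sample."""

        def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 6):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, d_model)
            # One learned embedding per source dataset: the TE token.
            self.task_emb = nn.Embedding(len(DATASETS), d_model)
            layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
            self.proj = nn.Linear(d_model, vocab_size)

        def forward(self, tokens, audio_embs, dataset_ids):
            # tokens: (B, T) caption token ids (teacher forcing)
            # audio_embs: (B, S, d_model) embeddings from the audio encoder
            # dataset_ids: (B,) index into DATASETS for each sample
            te = self.task_emb(dataset_ids).unsqueeze(1)       # (B, 1, d_model)
            x = torch.cat([te, self.word_emb(tokens)], dim=1)  # prepend TE token
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
            out = self.decoder(x, audio_embs, tgt_mask=mask.to(x.device))
            return self.proj(out[:, 1:])  # logits for the caption positions

    # Toy usage with random tensors standing in for ConvNeXt audio embeddings:
    dec = TaskEmbeddingDecoder(vocab_size=5000)
    audio = torch.randn(2, 31, 256)                  # (B, S, d_model)
    caps = torch.randint(0, 5000, (2, 12))           # (B, T)
    logits = dec(caps, audio, torch.tensor([0, 1]))  # AudioCaps, Clotho
    print(logits.shape)                              # torch.Size([2, 12, 5000])

At inference time, the TE index chosen for a sample selects which dataset's caption style the decoder reproduces, which is how the paper closes part of the gap between the multi-dataset model and dataset-specific ones.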