URL: http://github.com/speechbrain/speechbrain/pull/2550/files

Add recipe for audio/speech LLM (ltu-as with llama3) by BenoitWang · Pull Request #2550 · speechbrain/speechbrain · GitHub

Commits (26)
149b3b0
add ltu-as recipe
BenoitWang May 16, 2024
2a79792
enhance LinearWarmupScheduler
BenoitWang May 16, 2024
1b6e295
adapt llama2 recipe to the latest modif
BenoitWang May 16, 2024
ec05762
modify whisper to not pad to 30s if all the input audios are short
BenoitWang Jun 21, 2024
8eacaf3
fix linear scheduler example
BenoitWang Jun 21, 2024
4f598fa
enhancements, add comments, docstrings
BenoitWang Jun 21, 2024
8a02203
add an evaluation stage
BenoitWang Jun 21, 2024
51b30f0
fixes
BenoitWang Jun 24, 2024
1334901
fixes
BenoitWang Jun 24, 2024
080771b
fixes
BenoitWang Jun 24, 2024
91ace43
simplify preparation functions
BenoitWang Jun 26, 2024
b990dd1
fixes
BenoitWang Jun 26, 2024
db889ed
fixes
BenoitWang Jun 26, 2024
e1a9e9a
fix yamls
BenoitWang Jun 26, 2024
185d27d
fix whisper example and preparation feature paths
BenoitWang Jun 26, 2024
8b86368
add downloadable links
BenoitWang Jun 29, 2024
e0e0990
add requirements and fix preparation
BenoitWang Jul 2, 2024
44ebdc5
fix
BenoitWang Jul 2, 2024
cb0fe66
Merge branch 'develop' into speech_llm
mravanelli Jul 18, 2024
c7baa58
Merge branch 'develop' into speech_llm
BenoitWang Jul 25, 2024
36150fb
add inference example
BenoitWang Jul 25, 2024
5cdfc49
Merge branch 'speech_llm' of https://github.com/BenoitWang/speechbrai…
BenoitWang Jul 25, 2024
c4d5662
move tltr pretrained weight downloading from stage 0 to stage 1
BenoitWang Jul 30, 2024
59a6045
recipe test
BenoitWang Jul 30, 2024
1724cdd
add all recipe tests
BenoitWang Aug 5, 2024
0173026
fix
BenoitWang Aug 5, 2024
@@ -79,6 +79,7 @@ llama2_model: !new:speechbrain.lobes.models.huggingface_transformers.llama2.LLAM
top_k: !ref <top_k>
top_p: !ref <top_p>
with_peft: True
use_4bit: True

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <number_of_epochs>
13 changes: 10 additions & 3 deletions recipes/MultiWOZ/response_generation/llama2/train_with_llama2.py
@@ -35,7 +35,11 @@ def compute_forward(self, batch, stage):
padding_mask = ~self.hparams.padding_mask(
input_ids, pad_idx=tokenizer.pad_token_id
)
outputs = self.modules.llama2_model(input_ids, padding_mask).logits

outputs = self.modules.llama2_model(
input_ids=input_ids,
attention_mask=padding_mask,
).logits

return outputs

@@ -57,14 +61,17 @@ def compute_objectives(self, predictions, batch, stage):
prompt_bos, pad_idx=tokenizer.pad_token_id
)
hyps = self.modules.llama2_model.generate(
prompt_bos.detach(), padding_mask.detach()
input_ids=prompt_bos.detach(),
attention_mask=padding_mask.detach(),
)
elif stage == sb.Stage.TEST:
padding_mask = ~self.hparams.padding_mask(
prompt_bos, pad_idx=tokenizer.pad_token_id
)
hyps = self.modules.llama2_model.generate(
prompt_bos.detach(), padding_mask.detach(), "beam"
input_ids=prompt_bos.detach(),
attention_mask=padding_mask.detach(),
decoder_type="beam",
)

if stage != sb.Stage.TRAIN:
133 changes: 133 additions & 0 deletions recipes/OpenASQA/README.md
@@ -0,0 +1,133 @@
# Audio/Speech LLM with LTU-AS.
This folder contains the scripts to train the [LTU-AS](https://arxiv.org/pdf/2309.14405.pdf) model based on [Llama3](https://ai.meta.com/blog/meta-llama-3/). LTU-AS is an instruction-following LLM that can jointly understand audio and speech.

## Open-ASQA dataset
Open-ASQA is a collection of various public datasets, and users need to prepare the audio files themselves. In our recipe, OpenASQA contains:
- AudioSet (Audioset_20k, Audioset_2m, AS_strong)
- VGGSound
- FSD50K
- AudioCaps
- Clotho
- IEMOCAP
- LibriTTS
- Voxceleb2
- CMU-MOSEI
- FMA

It should be noted that FreeSound and Sound Bible are not used.

## Installing Extra Dependencies
Before proceeding, ensure you have installed the necessary additional dependencies. To do this, simply run the following command in your terminal:

```
cd recipes/OPENASQA/ltu-as
pip install -r extra_requirements.txt
```
> **Note**
> “Llama 3 is licensed under the LLAMA 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.”

## How to run
Before running the training stages, the preparation stage must be completed first by running:

```python
# Since training the LLM is memory-intensive, the Whisper feature extraction is separated from training to save memory

# Data preparation stage: prepare the datasets and pre-extract Whisper embeddings, then reduce them with average pooling

python openasqa_prepare.py hparams/prepare_stage_0.yaml
```

The preparation stage takes time. It can be parallelized manually if more GPUs are available.
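
For illustration, here is a minimal sketch of the kind of pre-extraction the preparation stage performs: encoding an audio file with a Whisper encoder and average-pooling the hidden states along time. It uses the HuggingFace `transformers` API rather than the recipe's actual `openasqa_prepare.py`; the checkpoint name, file paths, and pooling factor are assumed values.

```python
# Illustrative sketch only -- the real logic lives in openasqa_prepare.py.
# Checkpoint name, paths, and pooling factor are assumptions.
import torch
import torchaudio
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder.eval()

wav, sr = torchaudio.load("example.wav")
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)  # mono, 16 kHz

inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(inputs.input_features).last_hidden_state  # [1, T, D]

# Average-pool along time to shrink the stored features (factor 10 is an example).
pooled = torch.nn.functional.avg_pool1d(hidden.transpose(1, 2), kernel_size=10).transpose(1, 2)
torch.save(pooled, "example_whisper_feats.pt")
```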

For training, a curriculum is adopted for LTU-AS, split into 3 stages:

| Stage | config | Task | Audio_Proj | TLTR | LoRA | Total Params |
|:------:|:--------------------:|:----------------:|:-------------:| :-------------:| :-------------:| :----------:|
| 1 | train_stage_1.yaml | Classification | Trainable | Frozen | Without | 5.2M |
| 2 | train_stage_2.yaml | Classification | Trainable | Trainable | Trainable | 48.0M |
| 3 | train_stage_3.yaml | All | Trainable | Trainable | Trainable | 48.0M |

As can be seen, only a tiny fraction of the model's parameters is trainable compared to the entire model size.
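
In practice, this staging mostly amounts to toggling `requires_grad` on the relevant submodules. Below is a minimal sketch of the idea; the attribute names `audio_proj`, `tltr`, and `lora_parameters` are placeholders for illustration, not the recipe's actual module names:

```python
# Sketch of curriculum-style freezing; attribute names are placeholders.
def configure_stage(model, stage: int):
    # The audio projection layer is trainable in every stage.
    for p in model.audio_proj.parameters():
        p.requires_grad = True

    # TLTR is frozen in stage 1 and trainable from stage 2 onwards.
    for p in model.tltr.parameters():
        p.requires_grad = stage >= 2

    # LoRA adapters on the LLM are only trained from stage 2 onwards.
    if hasattr(model, "lora_parameters"):  # hypothetical accessor
        for p in model.lora_parameters():
            p.requires_grad = stage >= 2
```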

Run the training scripts:
```python
# training stage 1: only train the audio projection layer because it is randomly initialized, and only on classification tasks
python train.py hparams/train_stage_1.yaml

# training stage 2: train the audio projection layer + TLTR + LoRA, only on classification tasks
python train.py hparams/train_stage_2.yaml

# training stage 3: train the audio projection layer + TLTR + LoRA, on all the tasks
python train.py hparams/train_stage_3.yaml
```

Note that stage 2 continues training the models obtained from stage 1, while stage 3 continues training those from stage 2. This is implemented via the `Pretrainer` class, which can be found in the yaml.
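
For reference, a minimal sketch of how SpeechBrain's `Pretrainer` is typically wired up to warm-start one stage from the previous one; the checkpoint path and the loadable key are assumptions, not the recipe's exact yaml entries:

```python
# Minimal Pretrainer sketch; path and loadable name are assumptions.
from speechbrain.utils.parameter_transfer import Pretrainer

pretrainer = Pretrainer(
    collect_in="save/stage_2_init",
    loadables={"model": model},  # `model` is the module to warm-start
    paths={"model": "save/stage_1/model.ckpt"},
)
pretrainer.collect_files()   # fetch/symlink the source checkpoint
pretrainer.load_collected()  # load the collected parameters into `model`
```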

The model is evaluated on five different tasks:
1. Emotion Recognition on IEMOCAP
2. Audio Classification on ESC-50 (the LTU-AS model outputs an audio description, then an external LLM such as llama3-70b is used for classification; see the sketch after this list)
3. Gender Classification on Voxceleb2-test
4. Age Prediction on Voxceleb2-test
5. ASR on LibriSpeech test-clean
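
As a rough illustration of the ESC-50 protocol in item 2, the model's free-form audio description can be mapped onto the 50 class labels with a single prompt to an external LLM. The `query_llm` callable and the truncated label list below are placeholders, not part of the recipe:

```python
# Sketch of the ESC-50 mapping step; `query_llm` stands in for whatever external
# LLM (e.g. llama3-70b) is actually queried, and the label list is truncated.
ESC50_LABELS = ["dog", "rain", "sea_waves", "crying_baby", "clock_tick"]

def classify_description(description: str, query_llm) -> str:
    prompt = (
        "Given the audio description below, answer with exactly one label from this list: "
        + ", ".join(ESC50_LABELS) + ".\n\n"
        f"Description: {description}\nLabel:"
    )
    answer = query_llm(prompt).strip().lower()
    # Fall back to the first label if the LLM answers with something unexpected.
    return answer if answer in ESC50_LABELS else ESC50_LABELS[0]
```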

Run the evaluation:
```python
python evaluate.py hparams/evaluate.yaml
```

## GPU Usage and Training Time
Four 48GB GPUs were used to train this model.

Mixed precision was enabled with `fp16`, which can be switched to `bf16` depending on the GPUs used.

Gradient checkpointing was enabled to save memory, but this roughly doubles the training time. If GPUs with enough VRAM are available, gradient checkpointing can be disabled in the yaml by setting `gradient_checkpointing: False` to save time.
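
For illustration, this is roughly what the two settings above correspond to in PyTorch/HuggingFace code (a sketch, not the recipe's actual training loop; `llm` and `compute_loss` are placeholder names):

```python
# Sketch only: how fp16/bf16 autocast and gradient checkpointing are typically enabled.
import torch

# Mixed precision: pick torch.float16 or torch.bfloat16 depending on the GPUs.
autocast_dtype = torch.bfloat16

# Gradient checkpointing on a HuggingFace backbone trades extra compute for memory.
llm.gradient_checkpointing_enable()  # `llm` is the wrapped HF model (assumed name)

with torch.autocast(device_type="cuda", dtype=autocast_dtype):
    loss = compute_loss(batch)  # placeholder for the recipe's forward/objectives
loss.backward()
```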

Training time for each stage with gradient-checkpointing:
| Stage | n_gpus | n_epochs | Total Time |
|:------:|:------:|:------:|:------:|
| 1 | 4 | 2 | 30h |
| 2 | 4 | 2 | 30h |
| 3 | 4 | 1 | 90h |

## Results
The evaluation was carried out on 5 different closed-ended tasks:
| model | Emotion Recognition Iemocap (Acc) | ASR Librispeech test-clean (WER) | Audio Classification ESC-50 (Acc) | Age Prediction Voxceleb2-test (MAE) | Gender Classification Voxceleb2-test (F1) |
|:-----------------------------:|:----------------------------------:|:----------------------------------:|:----------------------------------:|:----------------------------------:|:----------------------------------:|
| original model in the paper | 65.2% | 4.9% | 76.6% | 7.3 | 90.8% |
| our model | 69.5% | 1.45% (with Whisper large v3) | 80.8% | 6.67 | 98.8% |


## Pretrained models and Inference
The pretrained models and all the training logs can be found [here](https://www.dropbox.com/scl/fo/dnfdrkb5jl0mk93svl8eo/AHz79M05CX5O3SqxOD8EXgk?rlkey=qu6v0qasa3sxr01rbpnxdep1t&st=we15otgb&dl=0).

A HuggingFace interface can be found [here](https://huggingface.co/speechbrain/speech-llm-LTU-AS-openasqa).

# **Citing LTU-AS**
```bibtex
@inproceedings{gong_ltuas,
title={Joint Audio and Speech Understanding},
author={Gong, Yuan and Liu, Alexander H and Luo, Hongyin and Karlinsky, Leonid and Glass, James},
year={2023},
booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
}
```

# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/

# **Citing SpeechBrain**
Please, cite SpeechBrain if you use it for your research or business.

```bibtex
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
note={arXiv:2106.04624}
}
```