基于PyTorch(mps)的人工智能本地语音识别库Whisper(Python

前文回溯，之前一篇：含辞未吐,声若幽兰,史上最强免费人工智能AI语音合成TTS服务微软Azure(.10接入)，利用AI技术将文本合成语音，现在反过来，利用开源库再将语音转回文字，所谓闻其声而知雅意。

是一个开源的语音识别库，它是由 AI (FAIR)开发的，支持多种语言的语音识别。它使用了双向循环神经网络（bi- RNNs）来识别语音并将其转换为文本。支持自定义模型，可以用于实现在线语音识别，并且具有高级的语音识别功能，支持语音识别中的语音活动检测和语音识别中的语音转文本。它是使用进行开发，可以使用 API来调用语音识别，并且提供了一系列的预训练模型和数据集来帮助用户开始使用。

基于MPS的安装

我们知道一直以来在M芯片的MacOs系统中都不支持cuda模式原人工智能语音计算器，而现在，新的MPS后端扩展了生态系统并提供了现有的脚本功能来在 GPU上设置和运行操作。

截止本文发布，与 3.11不兼容，所以我们将使用最新的 3.10.x 版本。

确保安装.10最新版：

➜  transformers git:(stable) python3Python 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)] on darwinType "help", "copyright", "credits" or "license" for more information.>>>

随后运行安装命令：

pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

安装成功后，在终端里验证-MPS的状态：

➜  transformers git:(stable) python3Python 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)] on darwinType "help", "copyright", "credits" or "license" for more information.>>> import torch>>> torch.backends.mps.is_available()True>>>

返回True即可。

MPS (Multi- )性能测试

MPS (Multi- )是中的一种分布式训练方式。它是基于Apple的MPS(Metal ) 框架开发的。MPS可以在多核的苹果设备上加速的运算。MPS使用了多个设备上的多个核心来加速模型的训练。它可以将模型的计算过程分配到多个核心上，并且可以在多个设备上进行训练，从而提高训练速度。

MPS 可以在 Apple 的设备（如和 iPad）上加速模型训练，也可以在 Mac 上使用。可以使用MPS来加速卷积神经网络（CNNs）、循环神经网络（RNNs）和其他类型的神经网络。使用MPS可以在不改变模型结构的情况下，通过分布式训练来加速模型的训练速度。

现在我们来做一个简单测试：

import torchimport timeitimport randomx = torch.ones(50000000,device='cpu')print(timeit.timeit(lambda:x*random.randint(0,100),number=1))

首先创建一个大小为的全为1的张量 x，并将其设置为在cpu上运算。最后使用 . 函数来测量在 CPU 上执行 x 乘以一个随机整数的时间。 =1表示只运行一次。这段代码的作用是在cpu上测量运算一个张量的时间。

运行结果：

➜  nlp_chinese /opt/homebrew/bin/python3.10 "/Users/liuyue/wodfan/work/nlp_chinese/mps_test.py"0.020812375005334616

在10核M1pro的cpu芯片加持下原人工智能语音计算器，运行时间为：0.

随后换成MPS模式：

import torchimport timeitimport randomx = torch.ones(50000000,device='mps')print(timeit.timeit(lambda:x*random.randint(0,100),number=1))

程序返回：

➜  nlp_chinese /opt/homebrew/bin/python3.10 "/Users/liuyue/wodfan/work/nlp_chinese/mps_test.py"0.003058041911572218

16核的GPU仅用时：0.

也就是说MPS的运行速度比CPU提升了7倍左右。

语音识别

安装好了，我们安装:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

安装好之后进行验证：

➜  transformers git:(stable) whisper   usage: whisper [-h] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large}] [--model_dir MODEL_DIR]               [--device DEVICE] [--output_dir OUTPUT_DIR] [--verbose VERBOSE] [--task {transcribe,translate}]               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,hi,hr,ht,hu,hy,id,is,it,iw,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]

随后安装:

brew install ffmpeg

然后编写语音识别代码：

import whispermodel = whisper.load_model("small")# load audio and pad/trim it to fit 30 secondsaudio = whisper.load_audio("/Users/liuyue/wodfan/work/mydemo/b1.wav")audio = whisper.pad_or_trim(audio)# make log-Mel spectrogram and move to the same device as the modelmel = whisper.log_mel_spectrogram(audio).to("cpu")# detect the spoken language_, probs = model.detect_language(mel)print(f"Detected language: {max(probs, key=probs.get)}")# decode the audiooptions = whisper.DecodingOptions(fp16 = False)result = whisper.decode(model, mel, options)# print the recognized textprint(result.text)

这里导入音频后，通过.方法自动检测语言，然后输出文本：

➜  minGPT git:(master) ✗ /opt/homebrew/bin/python3.10 "/Users/liuyue/wodfan/work/minGPT/wisper_test.py"Detected language: zhHello大家好,这里是刘悦的技术博客,众神殿内,高朋满座,圣有如云,VMware,Virtual Box,UPM等虚拟机大神群英汇翠,指见位于C位王座上的Parallels唱网抬头,缓缓群寻,屁腻群小,目光到处,无人敢抬头对视。是的,如果说虚拟机领域有一位王者,非Parallels不能领袖群伦,毕竟大厂背书,功能满格,美中不足之处就是价格略高,

这里使用的small模型，也可以用更大的模型比如：、large。模型越大，效果越好。

如果想使用MPS的方式，需要改写一下源码，将方法的参数改为mps即可：

def load_model(name: str, device: Optional[Union[str, torch.device]] = None, download_root: str = None, in_memory: bool = False) -> Whisper:    """    Load a Whisper ASR model    Parameters    ----------    name : str        one of the official model names listed by `whisper.available_models()`, or        path to a model checkpoint containing the model dimensions and the model state_dict.    device : Union[str, torch.device]        the PyTorch device to put the model into    download_root: str        path to download the model files; by default, it uses "~/.cache/whisper"    in_memory: bool        whether to preload the model weights into host memory    Returns    -------    model : Whisper        The Whisper ASR model instance    """    if device is None:        device = "cuda" if torch.cuda.is_available() else "mps"

代码在第18行。

随后运行脚本也改成mps:

import whispermodel = whisper.load_model("medium")# load audio and pad/trim it to fit 30 secondsaudio = whisper.load_audio("/Users/liuyue/wodfan/work/mydemo/b1.wav")audio = whisper.pad_or_trim(audio)# make log-Mel spectrogram and move to the same device as the modelmel = whisper.log_mel_spectrogram(audio).to("mps")# detect the spoken language_, probs = model.detect_language(mel)print(f"Detected language: {max(probs, key=probs.get)}")# decode the audiooptions = whisper.DecodingOptions(fp16 = False)result = whisper.decode(model, mel, options)# print the recognized textprint(result.text)

这回切换为模型，程序返回：

➜  minGPT git:(master) ✗ /opt/homebrew/bin/python3.10 "/Users/liuyue/wodfan/work/minGPT/wisper_test.py"100%|█████████████████████████████████████| 1.42G/1.42G [02:34<00:00, 9.90MiB/s]Detected language: zhHello 大家好,这里是刘悦的技术博客,众神殿内,高朋满座,圣有如云,VMware,Virtualbox,UTM等虚拟机大神群音惠翠,只见位于C位王座上的Parallels唱往抬头,缓缓轻寻,屁逆群小,目光到处,无人敢抬头对视。

效率和精准度提升了不少，但模型的体积也更大，达到了1.42g。

结语

作为一个开源的语音识别库，支持多种语言，并且使用双向循环神经网络（bi- RNNs）来识别语音并将其转换为文本，支持自定义模型，可以用于实现在线语音识别，并且具有高级的语音识别功能，支持语音识别中的语音活动检测和语音识别中的语音转文本，在的MPS加成下，更是猛虎添翼，绝世好库，值得拥有。

本文到此结束，希望对大家有所帮助。

本文内容由互联网用户自发贡献，该文观点仅代表作者本人。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容，请发送邮件至81118366@qq.com举报，一经查实，本站将立刻删除。发布者：简知小编，转载请注明出处：https://www.jianzixun.com/101461.html