whisper
Automatic Speech Recognition • OpenAI
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
Usage
Workers - TypeScript
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {
    const res = await fetch(
      "https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/samples/cpp/windows/console/samples/enrollment_audio_katie.wav"
    );
    const blob = await res.arrayBuffer();

    const input = {
      audio: [...new Uint8Array(blob)],
    };

    const response = await env.AI.run(
      "@cf/openai/whisper",
      input
    );

    return Response.json({ input: { audio: [] }, response });
  },
} satisfies ExportedHandler<Env>;
curl
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/openai/whisper \
  -X POST \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --data-binary "@talking-llama.mp3"
Parameters
* indicates a required field
Input
- 0 (string)
- 1 (object)
  - audio * (array): An array of integers that represent the audio data constrained to 8-bit unsigned integer values
    - items (number): A value between 0 and 255
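The object form of the input expects `audio` as a plain array of numbers, each between 0 and 255. A minimal sketch of building and checking that payload might look like the following. The helper names `toWhisperInput` and `isValidAudioArray` are hypothetical and are not part of the Workers AI API, and the sample bytes are a stand-in rather than real audio.

```typescript
// Hypothetical helpers for illustration; not part of the Workers AI API.

interface WhisperInput {
  audio: number[];
}

// Convert raw audio bytes into the object form of the input.
function toWhisperInput(bytes: Uint8Array): WhisperInput {
  // A Uint8Array only holds integers in 0..255, so the result
  // satisfies the documented constraint by construction.
  return { audio: Array.from(bytes) };
}

// Check that an arbitrary array satisfies the documented constraint.
function isValidAudioArray(values: unknown[]): values is number[] {
  return values.every(
    (v) => typeof v === "number" && Number.isInteger(v) && v >= 0 && v <= 255
  );
}

// Usage with a tiny stand-in byte sequence (not real audio).
const sampleBytes = new Uint8Array([82, 73, 70, 70]);
const sampleInput = toWhisperInput(sampleBytes);
console.log(sampleInput.audio.length, isValidAudioArray(sampleInput.audio));
```

In a Worker, the same conversion would apply to `new Uint8Array(await res.arrayBuffer())`.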
 
Output
- text * (string): The transcription
- word_count (number)
- words (array)
  - items (object)
    - word (string)
    - start (number): The second this word begins in the recording
    - end (number): The ending second when the word completes
- vtt (string)
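The `words` entries carry per-word timing in seconds, which makes it possible to print a simple timeline next to the transcription. The sketch below assumes the field shapes listed above; the `WhisperOutput` and `WhisperWord` type names, the `formatSeconds` and `describeWords` helpers, and the sample values are all invented for illustration.

```typescript
// Types mirror the output fields listed above; only `text` is required.
// The type names themselves are illustrative assumptions.
interface WhisperWord {
  word?: string;
  start?: number;
  end?: number;
}

interface WhisperOutput {
  text: string;
  word_count?: number;
  words?: WhisperWord[];
  vtt?: string;
}

// Hypothetical helper: format a number of seconds as m:ss.mmm.
function formatSeconds(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const minutes = Math.floor(totalMs / 60000);
  const secs = Math.floor((totalMs % 60000) / 1000);
  const ms = totalMs % 1000;
  return `${minutes}:${String(secs).padStart(2, "0")}.${String(ms).padStart(3, "0")}`;
}

// Hypothetical helper: one line per word with its time range.
function describeWords(output: WhisperOutput): string[] {
  const lines: string[] = [];
  for (const w of output.words ?? []) {
    // None of the word fields are marked required, so skip incomplete entries.
    if (w.word === undefined || w.start === undefined || w.end === undefined) {
      continue;
    }
    lines.push(`${formatSeconds(w.start)} - ${formatSeconds(w.end)} ${w.word}`);
  }
  return lines;
}

// Invented sample values, not a real model response.
const exampleOutput: WhisperOutput = {
  text: "hello world",
  word_count: 2,
  words: [
    { word: "hello", start: 0.5, end: 0.9 },
    { word: "world", start: 1.0, end: 1.5 },
  ],
};
console.log(describeWords(exampleOutput).join("\n"));
```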
API Schemas
The following schemas are based on JSON Schema
Input
{
    "oneOf": [
        {
            "type": "string",
            "format": "binary"
        },
        {
            "type": "object",
            "properties": {
                "audio": {
                    "type": "array",
                    "description": "An array of integers that represent the audio data constrained to 8-bit unsigned integer values",
                    "items": {
                        "type": "number",
                        "description": "A value between 0 and 255"
                    }
                }
            },
            "required": [
                "audio"
            ]
        }
    ]
}
Output
{
    "type": "object",
    "contentType": "application/json",
    "properties": {
        "text": {
            "type": "string",
            "description": "The transcription"
        },
        "word_count": {
            "type": "number"
        },
        "words": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "word": {
                        "type": "string"
                    },
                    "start": {
                        "type": "number",
                        "description": "The second this word begins in the recording"
                    },
                    "end": {
                        "type": "number",
                        "description": "The ending second when the word completes"
                    }
                }
            }
        },
        "vtt": {
            "type": "string"
        }
    },
    "required": [
        "text"
    ]
}
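Because the output schema marks only `text` as required, a caller may want to check a response at runtime before relying on the optional fields. The following is a minimal hand-written check under that assumption, not a full JSON Schema validator; the `TranscriptionResult` type and `isTranscriptionResult` function are hypothetical names.

```typescript
// Illustrative type matching the output schema; the name is an assumption.
type TranscriptionResult = {
  text: string;
  word_count?: number;
  words?: { word?: string; start?: number; end?: number }[];
  vtt?: string;
};

// Returns true when `value` has a string `text` and, where present,
// optional properties of the types given in the schema.
function isTranscriptionResult(value: unknown): value is TranscriptionResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;

  if (typeof v.text !== "string") return false;
  if (v.word_count !== undefined && typeof v.word_count !== "number") return false;
  if (v.vtt !== undefined && typeof v.vtt !== "string") return false;

  if (v.words !== undefined) {
    if (!Array.isArray(v.words)) return false;
    for (const entry of v.words) {
      if (typeof entry !== "object" || entry === null) return false;
      const w = entry as Record<string, unknown>;
      if (w.word !== undefined && typeof w.word !== "string") return false;
      if (w.start !== undefined && typeof w.start !== "number") return false;
      if (w.end !== undefined && typeof w.end !== "number") return false;
    }
  }
  return true;
}

// Invented sample values, not real model responses.
console.log(isTranscriptionResult({ text: "hello" }));
console.log(isTranscriptionResult({ word_count: 1 }));
```

A guard like this narrows an `unknown` response to a typed value, so later code can read `words` or `vtt` without extra casts.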