keras
/

whisper_medium_en

@@ -1,18 +1,107 @@
 ---
 library_name: keras-hub
 ---
-This is a [`Whisper` model](https://keras.io/api/keras_hub/models/whisper) uploaded using the KerasHub library and can be used with JAX, TensorFlow, and PyTorch backends.
-Model config:
-* **name:** whisper_backbone
-* **trainable:** True
-* **vocabulary_size:** 51864
-* **num_layers:** 24
-* **num_heads:** 16
-* **hidden_dim:** 1024
-* **intermediate_dim:** 4096
-* **num_mels:** 80
-* **dropout:** 0.0
-* **max_encoder_sequence_length:** 3000
-* **max_decoder_sequence_length:** 448
-This model card has been generated automatically and should be completed by the model author. See [Model Cards documentation](https://huggingface.co/docs/hub/model-cards) for more information.

+### Model Overview
+⚠️ Whisper is currently only available via the `keras-hub-nightly` package. Use `pip install keras-hub-nightly` to try this model.
+A Whisper encoder-decoder network for speech.
+This class implements a Transformer-based encoder-decoder model as
+described in
+["Robust Speech Recognition via Large-Scale Weak Supervision"](https://arxiv.org/abs/2212.04356).
+It includes the embedding lookups and transformer layers, but not the head
+for predicting the next token.
+The default constructor gives a fully customizable, randomly initialized Whisper
+model with any number of layers, heads, and embedding dimensions. To load
+preset architectures and weights, use the `from_preset()` constructor.
+Disclaimer: Pre-trained models are provided on an "as is" basis, without
+warranties or conditions of any kind. The underlying model is provided by a
+third party and subject to a separate license, available
+[here](https://github.com/openai/whisper).
+__Arguments__
+- __vocabulary_size__: int. The size of the token vocabulary.
+- __num_layers__: int. The number of transformer encoder layers and
+    transformer decoder layers.
+- __num_heads__: int. The number of attention heads for each transformer.
+    The hidden size must be divisible by the number of attention heads.
+- __hidden_dim__: int. The hidden size of the transformer encoder and decoder layers.
+- __intermediate_dim__: int. The output dimension of the first Dense layer in
+    a two-layer feedforward network for each transformer.
+- __num_mels__: int. The number of mel-frequency filters. Defaults to `80`.
+- __dropout__: float. Dropout probability for the Transformer encoder.
+- __max_encoder_sequence_length__: int. The maximum sequence length that the
+    audio encoder can consume. Since the second convolutional layer in
+    the encoder reduces the sequence length by half (stride of 2), we
+    use `max_encoder_sequence_length // 2` as the sequence length for the
+    positional embedding layer.
+- __max_decoder_sequence_length__: int. The maximum sequence length that the
+    text decoder can consume.
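The divisibility constraint on `num_heads` and the halving of the encoder sequence length come down to simple arithmetic. Here is a minimal sketch in plain Python, using the `whisper_medium_en` config values (`hidden_dim=1024`, `num_heads=16`, `max_encoder_sequence_length=3000`). The helper name `derived_shapes` is hypothetical and is not part of KerasHub.

```python
def derived_shapes(hidden_dim, num_heads, max_encoder_sequence_length):
    """Return (per-head dimension, encoder positional-embedding length).

    Hypothetical helper for illustration only; not a KerasHub API.
    """
    if hidden_dim % num_heads != 0:
        raise ValueError(
            f"hidden_dim ({hidden_dim}) must be divisible by "
            f"num_heads ({num_heads})"
        )
    head_dim = hidden_dim // num_heads
    # The stride-2 convolution halves the number of encoder frames, so the
    # positional embedding only needs half as many positions.
    encoder_positions = max_encoder_sequence_length // 2
    return head_dim, encoder_positions


# Values taken from the whisper_medium_en config.
head_dim, encoder_positions = derived_shapes(
    hidden_dim=1024, num_heads=16, max_encoder_sequence_length=3000
)
print(head_dim, encoder_positions)  # 64 1500
```

A config whose `hidden_dim` is not a multiple of `num_heads` raises `ValueError` in this sketch; how the library itself reports that case is not shown here.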
+### Example Usage
+```python
+import keras_hub
+import keras
+import numpy as np
+```
+```python
+input_data = {
+    "encoder_features": np.ones(shape=(1, 12, 80), dtype="float32"),
+    "decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
+    "decoder_padding_mask": np.array(
+        [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
+    ),
+}
+# Randomly initialized Whisper encoder-decoder model with a custom config.
+model = keras_hub.models.WhisperBackbone(
+    vocabulary_size=51864,
+    num_layers=4,
+    num_heads=4,
+    hidden_dim=256,
+    intermediate_dim=512,
+    max_encoder_sequence_length=128,
+    max_decoder_sequence_length=128,
+)
+model(input_data)
+```
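The `decoder_padding_mask` in the example marks real tokens with `1` and padding positions with `0`. Below is a minimal sketch of building such a mask from per-sequence token counts with NumPy. The helper `make_padding_mask` is a hypothetical name introduced for illustration and is not a KerasHub function.

```python
import numpy as np


def make_padding_mask(lengths, max_length):
    """Build a (batch, max_length) int32 mask: 1 for tokens, 0 for padding.

    Hypothetical helper for illustration only; not a KerasHub API.
    """
    positions = np.arange(max_length)[None, :]   # shape (1, max_length)
    lengths = np.asarray(lengths)[:, None]       # shape (batch, 1)
    # A position holds a real token when its index is below the length.
    return (positions < lengths).astype("int32")


# One sequence with 10 real tokens padded to length 12, as in the example.
mask = make_padding_mask([10], max_length=12)
print(mask)  # [[1 1 1 1 1 1 1 1 1 1 0 0]]
```

The result can be passed as the `"decoder_padding_mask"` entry of the input dictionary, alongside `"decoder_token_ids"` of the same shape.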
+### Example Usage with Hugging Face URI
+```python
+import keras_hub
+import keras
+import numpy as np
+```
+```python
+input_data = {
+    "encoder_features": np.ones(shape=(1, 12, 80), dtype="float32"),
+    "decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
+    "decoder_padding_mask": np.array(
+        [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
+    ),
+}
+# Load the pretrained Whisper backbone from the Hugging Face Hub.
+model = keras_hub.models.WhisperBackbone.from_preset(
+    "hf://keras/whisper_medium_en"
+)
+model(input_data)
+```