register model & update readme

Files changed (5)

.gitattributes +1 -0
README.md +101 -190
config.json +4 -0
model.png +3 -0
modeling_prot2text2.py +288 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -3,205 +3,116 @@ license: mit
 language:
 - en
 base_model:
-- meta-llama/Llama-3.1-8B-Instruct
 - facebook/esm2_t36_3B_UR50D
 pipeline_tag: text-generation
 tags:
 - biology
-- medical
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 language:
 - en
 base_model:
+- meta-llama/Llama-3.1-8B-Instruct
 - facebook/esm2_t36_3B_UR50D
 pipeline_tag: text-generation
 tags:
 - biology
 ---
+# Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment
+This is the official repository for the paper "Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment" by Xiao Fei, Michail Chatzianastasis, Sarah Almeida Carneiro, Hadi Abdine, Lawrence P. Petalidis, and Michalis Vazirgiannis.
+We're excited to share that our paper has been accepted to **NeurIPS 2025**! The pretrained model weights and the dataset are now publicly available here.
+Resources and Documentation:
+* [📃 arXiv Preprint 2505.11194](https://arxiv.org/abs/2505.11194)
+* [📜 NeurIPS 2025 Poster](https://neurips.cc/virtual/2025/poster/115368)
+* [💻 GitHub Repository](https://github.com/ColinFX/Prot2Text-V2)
+* [🤗 Experimental Dataset](https://huggingface.co/datasets/habdine/Prot2Text-Data)
+## Model Details
+**Prot2Text-V2** treats a protein sequence as if it were another language and translates it into English. The model takes the raw amino acid sequence as input and generates a clear, human-readable paragraph describing what the protein does.
+The model combines three components:
+* Protein language model as sequence encoder: `facebook/esm2_t36_3B_UR50D`
+* Modality adapter: a lightweight component that bridges the gap between protein embeddings and the language model.
+* Natural language decoder that generates the textual description from the sequence embeddings: `meta-llama/Llama-3.1-8B-Instruct`
+<img src="./model.png" alt="Model Architecture" width="100%"/>
+## Usage: inference
+```python
+import torch
+from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained(
+    pretrained_model_name_or_path="xiao-fei/Prot2Text-V2-11B-Instruct-hf",
+    trust_remote_code=True,
+    torch_dtype=torch.bfloat16,
+    device_map="cuda"
+)
+esm_tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t36_3B_UR50D")
+llama_tokenizer = AutoTokenizer.from_pretrained(
+    pretrained_model_name_or_path="meta-llama/Llama-3.1-8B-Instruct",
+    pad_token='<|reserved_special_token_0|>'
+)
+example_sequence = (
+    "MCYSANGNTFLIVDNTQKRIPEEKKPDFVRENVGDLDGVIFVELVDGKYFMDYYNRDGSMAAFCGNGARAFSQ"
+    "YLIDRGWIKEKEFTFLSRAGEIKVIVDDSIWVRMPGVSEKKEMKVDGYEGYFVVVGVPHFVMEVKGIDELDVE"
+    "KLGRDLRYKTGANVDFYEVLPDRLKVRTYERGVERETKACGTGVTSVFVVYRDKTGAKEVKIQVPGGTLFLKE"
+    "ENGEIFLRGDVKRCSEE"
+)
+system_message = (
+    "You are a scientific assistant specialized in protein function "
+    "predictions. Given the sequence embeddings and other information "
+    "of a protein, describe its function clearly and concisely in "
+    "professional language. "
+)
+placeholder = '<|reserved_special_token_1|>'
+user_message = "Sequence embeddings: " + placeholder * (len(example_sequence)+2)
+tokenized_prompt = llama_tokenizer.apply_chat_template(
+    [
+        {"role": "system", "content": system_message},
+        {"role": "user", "content": user_message}
+    ],
+    add_generation_prompt=True,
+    tokenize=True,
+    return_tensors="pt",
+    return_dict=True
+)
+tokenized_sequence = esm_tokenizer(
+    example_sequence,
+    return_tensors="pt"
+)
+model.eval()
+generated = model.generate(
+    inputs=tokenized_prompt["input_ids"].to("cuda"),
+    attention_mask=tokenized_prompt["attention_mask"].to("cuda"),
+    protein_input_ids=tokenized_sequence["input_ids"].to("cuda"),
+    protein_attention_mask=tokenized_sequence["attention_mask"].to("cuda"),
+    max_new_tokens=1024,
+    eos_token_id=128009,
+    pad_token_id=128002,
+    return_dict_in_generate=False,
+    num_beams=4,
+    do_sample=False,
+)
+print(llama_tokenizer.decode(generated[0], skip_special_tokens=True))
+```
+For detailed instructions on fine-tuning the model and reproducing the experiments, please refer to our [GitHub page](https://github.com/ColinFX/Prot2Text-V2).
+## Ⓒ Citation
+If you find our research helpful, feel free to 🖋️ cite our work or ❤️ like the page:
+```bibtex
+@misc{prot2textv2,
+      title={Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment},
+      author={Xiao Fei and Michail Chatzianastasis and Sarah Almeida Carneiro and Hadi Abdine and Lawrence P. Petalidis and Michalis Vazirgiannis},
+      year={2025},
+      eprint={2505.11194},
+      archivePrefix={arXiv},
+      primaryClass={cs.CE},
+      url={https://arxiv.org/abs/2505.11194},
+}
+```
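
Note on the usage snippet in the README above: it sizes the placeholder run as `len(example_sequence)+2`. A likely reading, stated here as an assumption rather than something the diff confirms, is that the ESM-2 tokenizer wraps each sequence in one start and one end special token, so the number of placeholder tokens has to equal the number of encoder positions that get written into the prompt. The helper below is a hypothetical sketch (`build_user_message` and `NUM_ESM_SPECIAL_TOKENS` are illustrative names, not part of the repository) that makes this bookkeeping explicit:

```python
PLACEHOLDER = "<|reserved_special_token_1|>"
# Assumption: the ESM-2 tokenizer adds one start and one end token per sequence.
NUM_ESM_SPECIAL_TOKENS = 2


def build_user_message(sequence: str, prefix: str = "Sequence embeddings: ") -> str:
    """Return the user turn with one placeholder per expected encoder position."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    num_slots = len(sequence) + NUM_ESM_SPECIAL_TOKENS
    return prefix + PLACEHOLDER * num_slots


message = build_user_message("MCYSANG")
print(message.count(PLACEHOLDER))  # 7 residues + 2 special tokens = 9
```

If the placeholder count and the number of unmasked encoder positions disagree, the replacement step in `prepare_decoder_inputs` has nothing sensible to align, so it may be worth checking the two counts before calling `generate`.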

config.json CHANGED

@@ -70,6 +70,10 @@
   "architectures": [
     "Esm2LlamaInstructForCausalLM"
   ],
+  "auto_map": {
+    "AutoConfig": "modeling_prot2text2.Esm2LlamaInstructConfig",
+    "AutoModelForCausalLM": "modeling_prot2text2.Esm2LlamaInstructForCausalLM"
+  },
   "esm_config": {
     "_attn_implementation_autoset": true,
     "_name_or_path": "/ssd1/huggingface-models/esm2_t36_3B_UR50D",
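
Note on the `auto_map` entries added to `config.json`: each one pairs an Auto class name with a `module.ClassName` reference inside the repository, which is what lets `from_pretrained(..., trust_remote_code=True)` locate the custom classes in `modeling_prot2text2.py`. The snippet below only illustrates that naming convention; it is not the `transformers` loader, and `split_auto_map` is a hypothetical helper:

```python
import json

CONFIG_FRAGMENT = """
{
  "auto_map": {
    "AutoConfig": "modeling_prot2text2.Esm2LlamaInstructConfig",
    "AutoModelForCausalLM": "modeling_prot2text2.Esm2LlamaInstructForCausalLM"
  }
}
"""


def split_auto_map(config: dict) -> dict:
    """Map each Auto class name to a (module name, class name) pair."""
    resolved = {}
    for auto_class, reference in config.get("auto_map", {}).items():
        # The class name is everything after the last dot.
        module_name, _, class_name = reference.rpartition(".")
        resolved[auto_class] = (module_name, class_name)
    return resolved


for auto_class, (module_name, class_name) in split_auto_map(
    json.loads(CONFIG_FRAGMENT)
).items():
    print(f"{auto_class}: {class_name} from {module_name}.py")
```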

model.png ADDED

Git LFS Details

SHA256: faff935131de00279a4a6232ff4ae1fdaa73fd8ed15f35bfd64b84298f45bcdd
Pointer size: 131 Bytes
Size of remote file: 451 kB

modeling_prot2text2.py ADDED

@@ -0,0 +1,288 @@

+from typing import Dict, Optional, Tuple, Union
+import torch
+from transformers import AutoConfig, AutoModelForCausalLM
+from transformers import EsmConfig, LlamaConfig, PretrainedConfig
+from transformers import EsmModel, LlamaForCausalLM, PreTrainedModel
+from transformers.modeling_outputs import CausalLMOutputWithPast
+from transformers.generation.utils import Cache, GenerateOutput
+class ModalityAdapterConfig(PretrainedConfig):
+    model_type = "modality_adapter"
+    def __init__(
+            self,
+            input_dim: int,
+            intermediate_dim: int,
+            output_dim: int,
+            dropout_rate: float = 0.3,
+            **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.input_dim = input_dim
+        self.intermediate_dim = intermediate_dim
+        self.output_dim = output_dim
+        self.dropout_rate = dropout_rate
+class Esm2LlamaInstructConfig(PretrainedConfig):
+    model_type = "esm2llama_instruct"
+    def __init__(
+            self,
+            # model components
+            esm_config: Optional[Union[EsmConfig, Dict]] = None,
+            adapter_config: Optional[Union[ModalityAdapterConfig, Dict]] = None,
+            llama_config: Optional[Union[LlamaConfig, Dict]] = None,
+            # standalone attributes
+            placeholder_id: int = 128003,
+            **kwargs
+    ):
+        super().__init__(**kwargs)
+        if isinstance(esm_config, dict):
+            self.esm_config = EsmConfig(**esm_config)
+        else:
+            self.esm_config = esm_config
+        if isinstance(llama_config, dict):
+            self.llama_config = LlamaConfig(**llama_config)
+        else:
+            self.llama_config = llama_config
+        if isinstance(adapter_config, dict):
+            self.adapter_config = ModalityAdapterConfig(**adapter_config)
+        else:
+            self.adapter_config = adapter_config
+        self.placeholder_id = placeholder_id
+class ModalityAdapter(PreTrainedModel):
+    config_class = ModalityAdapterConfig
+    def __init__(self, config: ModalityAdapterConfig):
+        super().__init__(config)
+        self.config = config
+        self.fc1 = torch.nn.Linear(config.input_dim, config.intermediate_dim)
+        self.fc2 = torch.nn.Linear(config.intermediate_dim, config.output_dim)
+        self.activation = torch.nn.GELU()
+        self.ln1 = torch.nn.LayerNorm(normalized_shape=config.intermediate_dim)  # DEPRECATED
+        self.ln2 = torch.nn.LayerNorm(normalized_shape=config.output_dim)  # DEPRECATED
+        self.dropout = torch.nn.Dropout(p=config.dropout_rate)
+        self.post_init()  # initialize weights and apply final processing
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
+        # input: (bsz, seq_len, input_dim)
+        hidden_states = self.activation(self.fc1(hidden_states))
+        hidden_states = self.dropout(hidden_states)
+        # interm: (bsz, seq_len, interm_dim)
+        hidden_states = self.activation(self.fc2(hidden_states))
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = torch.nn.functional.normalize(hidden_states, p=2, dim=-1)
+        return hidden_states  # (bsz, seq_len, output_dim)
+class Esm2LlamaInstructForCausalLM(PreTrainedModel):
+    """
+    Esm2LlamaInstructForCausalLM model for protein function prediction.
+    Similar to `EncoderDecoderModel` but with a more complicated architecture.
+    Initialize with either a configuration OR all three components.
+    `kwargs` can override standalone attributes in `Esm2LlamaInstructConfig`.
+    """
+    config_class = Esm2LlamaInstructConfig
+    def __init__(
+            self,
+            config: Optional[Esm2LlamaInstructConfig] = None,
+            esm_encoder: Optional[EsmModel] = None,
+            adapter: Optional[ModalityAdapter] = None,
+            llama_decoder: Optional[LlamaForCausalLM] = None,
+            **kwargs
+    ):
+        if config is not None:  # components ignored if config is provided
+            super().__init__(config)
+            self.esm_encoder = EsmModel(
+                config.esm_config,
+                add_pooling_layer=False
+            )
+            self.adapter = ModalityAdapter(config.adapter_config)
+            self.llama_decoder = LlamaForCausalLM(config.llama_config)
+        else:
+            config = Esm2LlamaInstructConfig(
+                esm_config=esm_encoder.config,
+                adapter_config=adapter.config,
+                llama_config=llama_decoder.config,
+                **kwargs  # override standalone attributes
+            )
+            super().__init__(config)
+            self.esm_encoder = esm_encoder
+            self.adapter = adapter
+            self.llama_decoder = llama_decoder
+    def prepare_decoder_inputs(
+            self,
+            input_ids: torch.LongTensor,
+            encoder_hidden_states: torch.FloatTensor,
+            attention_mask: Optional[torch.LongTensor] = None,
+            encoder_attention_mask: Optional[torch.LongTensor] = None,
+    ):
+        """
+        Embed and replace placeholder in `input_ids` by encoder hidden states.
+        `input_ids` must be passed to locate placeholder for replacement.
+        """
+        # preparation
+        batch_size, seq_len = input_ids.size()
+        _, encoder_seq_len, _ = encoder_hidden_states.size()
+        if attention_mask is None:
+            attention_mask = torch.ones(
+                (batch_size, seq_len),
+                dtype=torch.long,
+                device=input_ids.device
+            )
+        if encoder_attention_mask is None:
+            encoder_attention_mask = torch.ones(
+                (batch_size, encoder_seq_len),
+                dtype=torch.long,
+                device=encoder_hidden_states.device
+            )
+        inputs_embeds = self.llama_decoder.get_input_embeddings()(input_ids)
+        # replacement
+        placeholder_mask = input_ids == self.config.placeholder_id
+        encoder_mask = encoder_attention_mask.bool()
+        inputs_embeds[placeholder_mask] = encoder_hidden_states[encoder_mask]
+        return inputs_embeds, attention_mask
+    def forward(
+            self,
+            # chat template text inputs
+            input_ids: Optional[torch.LongTensor] = None,
+            attention_mask: Optional[torch.LongTensor] = None,
+            position_ids: Optional[torch.LongTensor] = None,
+            past_key_values: Optional[Cache] = None,
+            labels: Optional[torch.LongTensor] = None,
+            # protein amino-acid sequence inputs
+            protein_input_ids: Optional[torch.LongTensor] = None,
+            protein_attention_mask: Optional[torch.LongTensor] = None,
+            protein_position_ids: Optional[torch.LongTensor] = None,
+            protein_head_mask: Optional[torch.LongTensor] = None,
+            protein_inputs_embeds: Optional[torch.FloatTensor] = None,
+            # behavior control arguments
+            use_cache: Optional[bool] = None,
+            output_attentions: Optional[bool] = None,
+            output_hidden_states: Optional[bool] = None,
+            return_dict: Optional[bool] = None,
+            return_encoder_outputs: bool = False,
+            return_adapter_outputs: bool = False,
+            return_decoder_inputs: bool = False,
+            cache_position: Optional[torch.LongTensor] = None
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        """
+        Compute encoder and adapter outputs, then pass to decoder.
+        `input_ids` is expected to be [prompt + description] in teacher-forcing
+        scenario and [prompt] only in first iteration of inference (with
+        return_decoder_inputs=True).
+        Attention: possible concatenation of the mask and labels should be
+        handled before calling this method.
+        `inputs_embeds` not allowed due to placeholder replacement scheme.
+        """
+        # esm_encoder forward
+        encoder_output = self.esm_encoder(
+            input_ids=protein_input_ids,
+            attention_mask=protein_attention_mask,
+            position_ids=protein_position_ids,
+            head_mask=protein_head_mask,
+            inputs_embeds=protein_inputs_embeds,
+            use_cache=False, # because config.esm_config.is_decoder=False
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict
+        )
+        encoder_hidden_states = encoder_output[0]
+        encoder_attention_mask = protein_attention_mask
+        if return_encoder_outputs:
+            return encoder_output
+        # adapter forward
+        adapter_output = self.adapter(encoder_hidden_states)
+        if return_adapter_outputs:
+            return adapter_output, encoder_attention_mask
+        # decoder input preparation
+        inputs_embeds, attention_mask = self.prepare_decoder_inputs(
+            input_ids=input_ids,
+            encoder_hidden_states=adapter_output,
+            attention_mask=attention_mask,
+            encoder_attention_mask=encoder_attention_mask,
+        )
+        if return_decoder_inputs:
+            return inputs_embeds, attention_mask
+        # llama_decoder forward
+        return self.llama_decoder.forward(
+            input_ids=None,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            labels=labels,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            cache_position=cache_position
+        )
+    def generate(
+        self,
+        inputs: torch.LongTensor,  # alias of `input_ids`
+        attention_mask: Optional[torch.LongTensor] = None,
+        protein_input_ids: Optional[torch.LongTensor] = None,
+        protein_attention_mask: Optional[torch.LongTensor] = None,
+        protein_inputs_embeds: Optional[torch.FloatTensor] = None,
+        **kwargs
+    ) -> Union[GenerateOutput, torch.LongTensor]:
+        """
+        Do inference based on given input prompt.
+        `inputs` is expected to be [prompt] only.
+        Output will not keep the input prompt due to input in form of embeds.
+        Generation behavior can be controlled by `kwargs`; see
+        `GenerationMixin.generate` for more info.
+        """
+        # get decoder inputs
+        prompt_inputs_embeds, prompt_attention_mask = self(
+            input_ids=inputs,
+            attention_mask=attention_mask,
+            protein_input_ids=protein_input_ids,
+            protein_attention_mask=protein_attention_mask,
+            protein_inputs_embeds=protein_inputs_embeds,
+            use_cache=False,
+            output_attentions=False,
+            output_hidden_states=False,
+            return_dict=False,
+            return_decoder_inputs=True
+        )
+        # do generate on llama_decoder
+        return self.llama_decoder.generate(
+            inputs_embeds=prompt_inputs_embeds,
+            attention_mask=prompt_attention_mask,
+            **kwargs
+        )
+    def gradient_checkpointing_enable(self):
+        """
+        Enable gradient checkpointing for all submodules that support it.
+        Note: the model needs to be in train mode before calling this method.
+        """
+        if hasattr(self.esm_encoder, "gradient_checkpointing_enable"):
+            self.esm_encoder.gradient_checkpointing_enable()
+        if hasattr(self.llama_decoder, "gradient_checkpointing_enable"):
+            self.llama_decoder.gradient_checkpointing_enable()
+        # the adapter is small, so it does not need gradient checkpointing
+    def gradient_checkpointing_disable(self):
+        if hasattr(self.esm_encoder, "gradient_checkpointing_disable"):
+            self.esm_encoder.gradient_checkpointing_disable()
+        if hasattr(self.llama_decoder, "gradient_checkpointing_disable"):
+            self.llama_decoder.gradient_checkpointing_disable()
+AutoConfig.register("esm2llama_instruct", Esm2LlamaInstructConfig)
+AutoModelForCausalLM.register(Esm2LlamaInstructConfig, Esm2LlamaInstructForCausalLM)
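
Note on `prepare_decoder_inputs` in the file above: the line `inputs_embeds[placeholder_mask] = encoder_hidden_states[encoder_mask]` overwrites every placeholder position in the prompt with the next unmasked encoder state, in order. The sketch below is a single-sequence, list-based analogue written for illustration only; it uses plain Python lists instead of tensors, and `replace_placeholders` is a hypothetical name that does not appear in the repository:

```python
from typing import List, Sequence

Vector = List[float]


def replace_placeholders(
    token_ids: Sequence[int],
    token_embeds: Sequence[Vector],
    encoder_states: Sequence[Vector],
    encoder_mask: Sequence[int],
    placeholder_id: int = 128003,
) -> List[Vector]:
    """Fill placeholder slots with unmasked encoder states, in order."""
    # Keep only encoder positions that are not padding.
    kept = [state for state, keep in zip(encoder_states, encoder_mask) if keep]
    num_slots = sum(1 for token in token_ids if token == placeholder_id)
    if num_slots != len(kept):
        raise ValueError(f"{num_slots} placeholders but {len(kept)} encoder states")
    remaining = iter(kept)
    # Ordinary token embeddings pass through untouched.
    return [
        next(remaining) if token == placeholder_id else embed
        for token, embed in zip(token_ids, token_embeds)
    ]


merged = replace_placeholders(
    token_ids=[1, 128003, 128003, 2],
    token_embeds=[[0.1], [0.0], [0.0], [0.9]],
    encoder_states=[[5.0], [6.0], [7.0]],
    encoder_mask=[1, 1, 0],
)
print(merged)  # [[0.1], [5.0], [6.0], [0.9]]
```

The explicit count check is a choice made for this sketch; the module itself relies on the masked assignment and does not validate the counts up front.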