Upload folder using huggingface_hub
Files changed:

- README.md (+79, -27)
- config.json (+7, -3)
- dnabert2_exon_intron_classification.py (+100, new file)
README.md
CHANGED

---
license: mit
base_model:
- zhihan1996/DNABERT-2-117M
tags:
- genomics
- bioinformatics
- DNA
- sequence-classification
- introns
- exons
- DNABERT2
---

# Exons and Introns Classifier

A DNABERT2 model fine-tuned for **classifying DNA sequences** into **introns** and **exons**, trained on a large cross-species GenBank dataset (34,627 different species).

---

## Architecture

- Base model: DNABERT2
- Approach: Full-sequence classification

---

## Usage

You can use this model through its own custom pipeline:

```python
from transformers import pipeline

pipe = pipeline(
    task="dnabert2-exon-intron-classification",
    model="GustavoHCruz/ExInDNABERT2",
    trust_remote_code=True,
)

out = pipe(
    "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG"
)

print(out)  # EXON
```

This model uses the same maximum context length as the standard DNABERT2 (512 tokens), but it was trained on DNA sequences of up to 256 nucleotides.

The pipeline will automatically truncate the nucleotide sequence if it exceeds this limit.
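
Based on the custom pipeline code shipped with this repository (`dnabert2_exon_intron_classification.py`) and the generic `transformers` pipeline API, two extra behaviours should hold, although they are not hard guarantees of this card: a `max_length` keyword is forwarded to the tokenizer, and a list of sequences is processed item by item and returns a list of labels. A minimal sketch, reusing the `pipe` object from the snippet above:

```python
# Sketch: batch input and max_length forwarding (assumptions based on the
# pipeline implementation in dnabert2_exon_intron_classification.py).
sequences = [
    "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG",
    "GTAAGGAGGGGGAT",
]

labels = pipe(sequences, max_length=256)  # max_length is routed to the tokenizer
print(labels)  # e.g. ["EXON", "INTRON"], one label per input sequence
```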

---

## Custom Usage Information

The model expects the same tokens as DNABERT2, i.e., raw nucleotides as input, for example:

```
GTAAGGAGGGGGAT
```

The model outputs a class label: 0 (Intron) or 1 (Exon).
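
For reference, a minimal sketch of calling the model directly, without the custom pipeline (both calls need `trust_remote_code=True` because DNABERT2 ships custom modeling code; turning the predicted id into a class name is what `process_label` in `dnabert2_exon_intron_classification.py` does):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Sketch only: manual tokenization and classification of a raw nucleotide string.
model_id = "GustavoHCruz/ExInDNABERT2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("GTAAGGAGGGGGAT", return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits = model(**inputs).logits

pred_id = logits.argmax(dim=-1).item()
print(pred_id)  # class id; the custom pipeline's process_label() maps it to EXON/INTRON
```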

---

## Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
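
To inspect or reuse the training data, a minimal sketch with the `datasets` library (the split and column names are not documented in this card, so print the dataset first):

```python
from datasets import load_dataset

# Sketch: load the dataset and inspect its splits and columns before use.
ds = load_dataset("GustavoHCruz/DNA_coding_regions")
print(ds)  # shows the available splits and column names

first_split = list(ds.keys())[0]
print(ds[first_split][0])  # first record of the first split
```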

---

## Publications

- **Full Paper**
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
  DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575)
- **Short Paper**
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
  DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)

---

## Training

- Trained on a setup with 8x NVIDIA H100 GPUs.

---

## Metrics

**Average accuracy:** **0.9956**

| Class      | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9943    | 0.9922 | 0.9932   |
| **Exon**   | 0.9962    | 0.9972 | 0.9967   |

### Notes

- Metrics were computed on a fully isolated test set (see the evaluation sketch below).
- The classes follow a ratio of approximately 2 exons to 1 intron, allowing for direct interpretation of the scores.
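
A minimal sketch of this kind of evaluation, using the custom pipeline and scikit-learn (the sequences and gold labels below are placeholders, not the actual held-out set):

```python
from sklearn.metrics import classification_report
from transformers import pipeline

pipe = pipeline(
    task="dnabert2-exon-intron-classification",
    model="GustavoHCruz/ExInDNABERT2",
    trust_remote_code=True,
)

# Placeholder data: replace with the real held-out sequences and labels.
test_sequences = ["GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG", "GTAAGGAGGGGGAT"]
gold_labels = ["EXON", "INTRON"]

predictions = [pipe(seq) for seq in test_sequences]
print(classification_report(gold_labels, predictions, labels=["INTRON", "EXON"]))
```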

---

## GitHub Repository

The full code for **data processing, model training, and inference** is available on GitHub:
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

You can find scripts for:

- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models

config.json
CHANGED

@@ -1,8 +1,6 @@
 {
   "alibi_starting_size": 512,
-  "architectures": [
-    "BertForSequenceClassification"
-  ],
+  "architectures": ["BertForSequenceClassification"],
   "attention_probs_dropout_prob": 0.0,
   "auto_map": {
     "AutoConfig": "configuration_bert.BertConfig",
@@ -11,6 +9,12 @@
     "AutoModelForSequenceClassification": "bert_layers.BertForSequenceClassification"
   },
   "classifier_dropout": null,
+  "custom_pipelines": {
+    "dnabert2-exon-intron-classification": {
+      "impl": "dnabert2_exon_intron_classification.DNABERT2ExonIntronClassificationPipeline",
+      "pt": ["BertForSequenceClassification"]
+    }
+  },
   "dtype": "float32",
   "gradient_checkpointing": false,
   "hidden_act": "gelu",

dnabert2_exon_intron_classification.py
ADDED

@@ -0,0 +1,100 @@

from typing import Any

import torch
from transformers import BertForSequenceClassification, Pipeline
from transformers.pipelines import PIPELINE_REGISTRY
from transformers.utils.generic import ModelOutput


def process_label(p: int) -> str:
    # Map the predicted label id to a human-readable class name.
    return "EXON" if p == 0 else "INTRON"


class DNABERT2ExonIntronClassificationPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # Route recognized kwargs to preprocess; everything else goes to _forward.
        preprocess_kwargs = {}

        for k in ("max_length",):
            if k in kwargs:
                preprocess_kwargs[k] = kwargs[k]

        forward_kwargs = {
            k: v for k, v in kwargs.items()
            if k not in preprocess_kwargs
        }

        postprocess_kwargs = {}

        return preprocess_kwargs, forward_kwargs, postprocess_kwargs

    def preprocess(self, input_, **preprocess_parameters):
        assert self.tokenizer

        if isinstance(input_, str):
            sequence = input_
        elif isinstance(input_, dict):
            sequence = input_.get("sequence", "")
        else:
            raise TypeError("input_ must be str or dict with 'sequence' key")

        # The model was trained on sequences of up to 256 nucleotides.
        sequence = sequence[:256]

        max_length = preprocess_parameters.get("max_length", 256)
        if not isinstance(max_length, int):
            raise TypeError("max_length must be an int")

        token_kwargs: dict[str, Any] = {"return_tensors": "pt"}
        token_kwargs["max_length"] = max_length
        token_kwargs["truncation"] = True

        enc = self.tokenizer(sequence, **token_kwargs).to(self.model.device)

        return {"prompt": sequence, "inputs": enc}

    def _forward(self, input_tensors: dict, **forward_params):
        assert isinstance(self.model, BertForSequenceClassification)
        kwargs = dict(forward_params)

        inputs = input_tensors.get("inputs")

        if inputs is None:
            raise ValueError("Model inputs missing in input_tensors (expected key 'inputs').")

        if hasattr(inputs, "items") and not isinstance(inputs, torch.Tensor):
            # BatchEncoding or plain dict: move every tensor to the model's device.
            expanded_inputs: dict[str, torch.Tensor] = {
                k: v.to(self.model.device) if isinstance(v, torch.Tensor) else v
                for k, v in dict(inputs).items()
            }
        elif isinstance(inputs, torch.Tensor):
            expanded_inputs = {"input_ids": inputs.to(self.model.device)}
        else:
            expanded_inputs = {"input_ids": torch.tensor(inputs, device=self.model.device)}

        self.model.eval()
        with torch.no_grad():
            outputs = self.model(**expanded_inputs, **kwargs)

        pred_id = torch.argmax(outputs.logits, dim=-1).item()

        return ModelOutput({"pred_id": pred_id})

    def postprocess(self, model_outputs: dict, **kwargs):
        pred_id = model_outputs["pred_id"]
        return process_label(pred_id)


PIPELINE_REGISTRY.register_pipeline(
    "dnabert2-exon-intron-classification",
    pipeline_class=DNABERT2ExonIntronClassificationPipeline,
    pt_model=BertForSequenceClassification,
)
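
The `preprocess` method above accepts either a raw string or a dict with a `"sequence"` key, so record-style inputs can be passed straight through. A minimal sketch (the `"id"` field is just an illustrative extra key; it is ignored by the pipeline):

```python
from transformers import pipeline

pipe = pipeline(
    task="dnabert2-exon-intron-classification",
    model="GustavoHCruz/ExInDNABERT2",
    trust_remote_code=True,
)

# Sketch: dict input; only the "sequence" key is read by preprocess().
record = {"sequence": "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG", "id": "example-1"}
print(pipe(record))  # -> "EXON" or "INTRON"
```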