GustavoHCruz committed
Commit fb7f477 · verified · 1 Parent(s): 9761972

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +79 -27
  2. config.json +7 -3
  3. dnabert2_exon_intron_classification.py +100 -0
README.md CHANGED
@@ -1,62 +1,114 @@
  ---
  license: mit
  base_model:
- - zhihan1996/DNABERT-2-117M
  tags:
- - genomics
- - bioinformatics
- - DNA
- - sequence-classification
- - introns
- - exons
- - DNABERT2
  ---
  # Exons and Introns Classifier

- BERT finetuned model for **classifying DNA sequences** into **introns** and **exons**, trained on a large cross-species GenBank dataset.

  ## Architecture
  - Base model: DNABERT2
  - Approach: Full-sequence classification
- - Framework: PyTorch + Hugging Face Transformers
-
  ## Usage

  ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- tokenizer = AutoTokenizer.from_pretrained("GustavoHCruz/ExInDNABERT2")
- model = AutoModelForSequenceClassification.from_pretrained("GustavoHCruz/ExInDNABERT2")
  ```

- Prompt format:

- The model expects nucleotide sequences.

  The model should predict the next token as the class label: 0 (Intron) or 1 (Exon).

- ## Data

- The model was trained on a processed version of GenBank sequences spanning multiple species.

  ## Publications

- - **Full Paper – 2nd Place (National)**
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
- [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575)
- - **Short Paper (International)**
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
- [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)
-
  ## Training

  - Trained on an architecture with 8x H100 GPUs.

  ## GitHub Repository

  The full code for **data processing, model training, and inference** is available on GitHub:
  [CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

- You can find scripts for:
- - Preprocessing GenBank sequences
- - Fine-tuning models
- - Evaluating and using the trained models

  ---
  license: mit
  base_model:
+ - zhihan1996/DNABERT-2-117M
  tags:
+ - genomics
+ - bioinformatics
+ - DNA
+ - sequence-classification
+ - introns
+ - exons
+ - DNABERT2
  ---
+
  # Exons and Introns Classifier

+ DNABERT2 fine-tuned model for **classifying DNA sequences** into **introns** and **exons**, trained on a large cross-species GenBank dataset (34,627 different species).
+
+ ---

  ## Architecture
+
  - Base model: DNABERT2
  - Approach: Full-sequence classification
+
+ ---
+
  ## Usage

+ You can use this model through its own custom pipeline:
+
  ```python
+ from transformers import pipeline
+
+ pipe = pipeline(
+     task="dnabert2-exon-intron-classification",
+     model="GustavoHCruz/ExInDNABERT2",
+     trust_remote_code=True,
+ )
+
+ out = pipe(
+     "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG"
+ )
+
+ print(out)  # EXON
  ```

+ This model uses the same maximum context length as the standard DNABERT2 (512 tokens), but it was trained on DNA sequences of up to 256 nucleotides.
+
+ The pipeline will automatically truncate nucleotide sequences that exceed this limit.
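+
+ The task name is registered via the `custom_pipelines` entry added to `config.json` in this commit, which points to `DNABERT2ExonIntronClassificationPipeline` in the bundled `dnabert2_exon_intron_classification.py`; this is why `trust_remote_code=True` is required. The pipeline's preprocess step also accepts a dict with a `sequence` key, so the following variation (an illustrative example, not part of the original card) should be equivalent:
+
+ ```python
+ # Dict input form handled by the custom pipeline's preprocess step
+ out = pipe({"sequence": "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG"})
+ print(out)  # EXON
+ ```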
+
+ ---

+ ## Custom Usage Information
+
+ The model expects the same tokens as DNABERT2, i.e., raw nucleotide sequences as input, for example:
+
+ ```
+ GTAAGGAGGGGGAT
+ ```

  The model predicts a single class label for the whole input sequence (intron or exon); the bundled pipeline decodes the predicted class index into the strings `INTRON`/`EXON` for you.
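
+ If you prefer to bypass the custom pipeline, the sketch below shows direct usage (illustrative only, assuming the checkpoint loads through `AutoTokenizer`/`AutoModelForSequenceClassification` with `trust_remote_code=True`, as in earlier versions of this card):
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("GustavoHCruz/ExInDNABERT2", trust_remote_code=True)
+ model = AutoModelForSequenceClassification.from_pretrained("GustavoHCruz/ExInDNABERT2", trust_remote_code=True)
+
+ enc = tokenizer("GTAAGGAGGGGGAT", return_tensors="pt", truncation=True, max_length=256)
+ with torch.no_grad():
+     pred = model(**enc).logits.argmax(-1).item()
+
+ # Decode the index with the same convention as process_label() in the bundled
+ # dnabert2_exon_intron_classification.py module.
+ print("EXON" if pred == 0 else "INTRON")
+ ```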

+ ---
+
+ ## Dataset

+ The model was trained on a processed version of GenBank sequences spanning multiple species, available as the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
+
+ ---

  ## Publications

+ - **Full Paper**
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
+ DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575).
+ - **Short Paper**
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
+ DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113).
+
+ ---
+
  ## Training

  - Trained on a machine with 8x H100 GPUs.

+ ---
+
+ ## Metrics
+
+ **Average accuracy:** **0.9956**
+
+ | Class      | Precision | Recall | F1-Score |
+ | ---------- | --------- | ------ | -------- |
+ | **Intron** | 0.9943    | 0.9922 | 0.9932   |
+ | **Exon**   | 0.9962    | 0.9972 | 0.9967   |
+
+ ### Notes
+
+ - Metrics were computed on a fully held-out test set.
+ - The classes follow a ratio of approximately 2 exons to 1 intron, so the per-class scores can be interpreted directly (see the quick check below).
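+
+ As a quick consistency check (an illustrative snippet, assuming the approximate 2:1 exon-to-intron ratio stated above): each F1 score is the harmonic mean of precision and recall, and the average accuracy is the prevalence-weighted average of the per-class recalls.
+
+ ```python
+ p_ex, r_ex = 0.9962, 0.9972  # exon precision / recall
+ p_in, r_in = 0.9943, 0.9922  # intron precision / recall
+
+ f1_ex = 2 * p_ex * r_ex / (p_ex + r_ex)  # ~0.9967
+ f1_in = 2 * p_in * r_in / (p_in + r_in)  # ~0.9932
+ accuracy = (2 * r_ex + 1 * r_in) / 3     # ~0.9956 under a 2:1 class ratio
+ ```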
+
+ ---
+
  ## GitHub Repository

  The full code for **data processing, model training, and inference** is available on GitHub:
  [CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

+ You can find scripts for:
+
+ - Preprocessing GenBank sequences
+ - Fine-tuning models
+ - Evaluating and using the trained models
config.json CHANGED
@@ -1,8 +1,6 @@
  {
  "alibi_starting_size": 512,
- "architectures": [
- "BertForSequenceClassification"
- ],
  "attention_probs_dropout_prob": 0.0,
  "auto_map": {
  "AutoConfig": "configuration_bert.BertConfig",
@@ -11,6 +9,12 @@
  "AutoModelForSequenceClassification": "bert_layers.BertForSequenceClassification"
  },
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",

  {
  "alibi_starting_size": 512,
+ "architectures": ["BertForSequenceClassification"],
  "attention_probs_dropout_prob": 0.0,
  "auto_map": {
  "AutoConfig": "configuration_bert.BertConfig",
  "AutoModelForSequenceClassification": "bert_layers.BertForSequenceClassification"
  },
  "classifier_dropout": null,
+ "custom_pipelines": {
+ "dnabert2-exon-intron-classification": {
+ "impl": "dnabert2_exon_intron_classification.DNABERT2ExonIntronClassificationPipeline",
+ "pt": ["BertForSequenceClassification"]
+ }
+ },
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
dnabert2_exon_intron_classification.py ADDED
@@ -0,0 +1,100 @@
+ from typing import Any
+
+ import torch
+ from transformers import BertForSequenceClassification, Pipeline
+ from transformers.pipelines import PIPELINE_REGISTRY
+ from transformers.utils.generic import ModelOutput
+
+
+ def process_label(p: int) -> str:
+     # Map the predicted class index to its label string.
+     return "EXON" if p == 0 else "INTRON"
+
+
+ class DNABERT2ExonIntronClassificationPipeline(Pipeline):
+     def _sanitize_parameters(self, **kwargs):
+         preprocess_kwargs = {}
+
+         # Route max_length to preprocess; everything else goes to the forward pass.
+         for k in ("max_length",):
+             if k in kwargs:
+                 preprocess_kwargs[k] = kwargs[k]
+
+         forward_kwargs = {
+             k: v for k, v in kwargs.items()
+             if k not in preprocess_kwargs
+         }
+
+         postprocess_kwargs = {}
+
+         return preprocess_kwargs, forward_kwargs, postprocess_kwargs
+
+     def preprocess(self, input_, **preprocess_parameters):
+         assert self.tokenizer
+
+         if isinstance(input_, str):
+             sequence = input_
+         elif isinstance(input_, dict):
+             sequence = input_.get("sequence", "")
+         else:
+             raise TypeError("input_ must be str or dict with 'sequence' key")
+
+         # The model was trained on sequences of up to 256 nucleotides.
+         sequence = sequence[:256]
+
+         max_length = preprocess_parameters.get("max_length", 256)
+         if not isinstance(max_length, int):
+             raise TypeError("max_length must be an int")
+
+         token_kwargs: dict[str, Any] = {"return_tensors": "pt"}
+         token_kwargs["max_length"] = max_length
+         token_kwargs["truncation"] = True
+
+         enc = self.tokenizer(sequence, **token_kwargs).to(self.model.device)
+
+         return {"prompt": sequence, "inputs": enc}
+
+     def _forward(self, input_tensors: dict, **forward_params):
+         assert isinstance(self.model, BertForSequenceClassification)
+         kwargs = dict(forward_params)
+
+         inputs = input_tensors.get("inputs")
+
+         if inputs is None:
+             raise ValueError("Model inputs missing in input_tensors (expected key 'inputs').")
+
+         # Move the tokenized inputs to the model device, accepting either a
+         # BatchEncoding/dict of tensors or a bare tensor / list of token ids.
+         if hasattr(inputs, "items") and not isinstance(inputs, torch.Tensor):
+             expanded_inputs: dict[str, torch.Tensor] = {
+                 k: v.to(self.model.device) if isinstance(v, torch.Tensor) else v
+                 for k, v in dict(inputs).items()
+             }
+         else:
+             if isinstance(inputs, torch.Tensor):
+                 expanded_inputs = {"input_ids": inputs.to(self.model.device)}
+             else:
+                 expanded_inputs = {"input_ids": torch.tensor(inputs, device=self.model.device)}
+
+         self.model.eval()
+         with torch.no_grad():
+             outputs = self.model(**expanded_inputs, **kwargs)
+
+         pred_id = torch.argmax(outputs.logits, dim=-1).item()
+
+         return ModelOutput({"pred_id": pred_id})
+
+     def postprocess(self, model_outputs: dict, **kwargs):
+         pred_id = model_outputs["pred_id"]
+         return process_label(pred_id)
+
+
+ PIPELINE_REGISTRY.register_pipeline(
+     "dnabert2-exon-intron-classification",
+     pipeline_class=DNABERT2ExonIntronClassificationPipeline,
+     pt_model=BertForSequenceClassification,
+ )