---
license: mit
base_model:
- zhihan1996/DNABERT-2-117M
tags:
- genomics
- bioinformatics
- DNA
- sequence-classification
- introns
- exons
- DNABERT2
---
# Exons and Introns Classifier

A DNABERT-2 model fine-tuned to **classify DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset.

## Architecture
- Base model: DNABERT-2 (zhihan1996/DNABERT-2-117M)
- Approach: full-sequence classification
- Framework: PyTorch + Hugging Face Transformers

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Note: DNABERT-2-based checkpoints may require trust_remote_code=True when loading.
tokenizer = AutoTokenizer.from_pretrained("GustavoHCruz/ExInDNABERT2")
model = AutoModelForSequenceClassification.from_pretrained("GustavoHCruz/ExInDNABERT2")
```

Input format:

The model expects raw nucleotide sequences.

For each sequence, the classification head predicts a label: 0 (intron) or 1 (exon).

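A minimal inference sketch, reusing the `tokenizer` and `model` from the snippet above; the example sequence is made up, and the label mapping (0 = intron, 1 = exon) follows the description above:

```python
import torch

# Hypothetical nucleotide sequence, for illustration only.
sequence = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Label mapping as described above: 0 = intron, 1 = exon.
prediction = logits.argmax(dim=-1).item()
print("exon" if prediction == 1 else "intron")
```
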
## Data

The model was trained on a processed version of GenBank sequences spanning multiple species.

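The actual preprocessing scripts live in the GitHub repository linked below. Purely as an illustrative sketch (not the authors' pipeline), labeled examples could be pulled from a GenBank file with Biopython; the file name and label mapping here are placeholders:

```python
from Bio import SeqIO  # Biopython

# "sequences.gb" is a hypothetical input file; see the repository below
# for the real dataset construction.
for record in SeqIO.parse("sequences.gb", "genbank"):
    for feature in record.features:
        if feature.type in ("exon", "intron"):
            subseq = str(feature.extract(record.seq))
            label = 1 if feature.type == "exon" else 0  # assumed mapping
            print(label, subseq[:60])
```
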
## Publications

- **Full Paper – 2nd Place (National)**
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
  [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575)
- **Short Paper (International)**
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
  [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)

## Training

- Trained on 8x NVIDIA H100 GPUs.

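No training script or hyperparameters are reproduced here (see the repository below). As a loose sketch only, a multi-GPU fine-tuning run with the Hugging Face `Trainer` could look like the following, where every value is a placeholder and `train_dataset` stands for a hypothetical tokenized dataset:

```python
from transformers import Trainer, TrainingArguments

# Placeholder hyperparameters, not the reported training setup.
args = TrainingArguments(
    output_dir="exin-dnabert2",
    per_device_train_batch_size=32,  # per GPU; launch via torchrun across 8 GPUs
    num_train_epochs=3,
    bf16=True,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,                  # loaded as in the Usage section
    args=args,
    train_dataset=train_dataset,  # hypothetical tokenized dataset
)
trainer.train()
```
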
## GitHub Repository

The full code for **data processing, model training, and inference** is available on GitHub:
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

You can find scripts for:
- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models