Commit c00c38b (verified), committed by AnnyNguyen
1 Parent(s): 13df8cb

Upload README.md with huggingface_hub

  1. README.md +137 -0
README.md ADDED
---
license: apache-2.0
base_model: uitnlp/visobert
tags:
- vietnamese
- spam-detection
- text-classification
- e-commerce
datasets:
- ViSpamReviews
metrics:
- accuracy
- macro-f1
- macro-precision
- macro-recall
model-index:
- name: visobert-spam-multi-class
  results:
  - task:
      type: text-classification
      name: Spam Review Detection
    dataset:
      name: ViSpamReviews
      type: ViSpamReviews
    metrics:
    - type: accuracy
      value: N/A
    - type: macro-f1
      value: N/A
---
# visobert-spam-multi-class: Spam Review Detection for Vietnamese Text

This model is a fine-tuned version of [uitnlp/visobert](https://huggingface.co/uitnlp/visobert) on the **ViSpamReviews** dataset for spam review detection in Vietnamese e-commerce reviews.

## Model Details

* **Base Model**: `uitnlp/visobert`
* **Description**: ViSoBERT - Vietnamese Social BERT
* **Dataset**: ViSpamReviews (Vietnamese Spam Review Dataset)
* **Fine-tuning Framework**: HuggingFace Transformers
* **Task**: Spam Review Detection (multi-class)
* **Number of Classes**: 4

### Hyperparameters

* Max sequence length: `256`
* Learning rate: `5e-5`
* Batch size: `32`
* Epochs: `100`
* Early stopping patience: `5`
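
How the epoch limit and the early-stopping patience interact can be sketched in plain Python. This is an illustration only, not the training script used for this model: the function name `epochs_run`, the choice of a validation score (higher is better) as the monitored quantity, and the rule "stop after 5 consecutive epochs without a new best score" are all assumptions.

```python
def epochs_run(val_scores, max_epochs=100, patience=5):
    """Return how many epochs run before training stops.

    Hypothetical helper: `val_scores` holds one validation score per epoch
    (higher is better). Training stops at `max_epochs`, or earlier once
    `patience` consecutive epochs fail to beat the best score so far.
    """
    best = float("-inf")
    stale = 0   # consecutive epochs without improvement
    epochs = 0
    for score in val_scores[:max_epochs]:
        epochs += 1
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs
```

Under these assumptions, three improving epochs followed by a plateau stop training after eight epochs (three improving plus five stale), far short of the 100-epoch limit.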

## Dataset

The model was trained on the **ViSpamReviews** dataset, which contains 19,860 Vietnamese e-commerce reviews, split as follows:

* **Train set**: 14,299 samples (72%)
* **Validation set**: 1,590 samples (8%)
* **Test set**: 3,971 samples (20%)
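
The percentages follow directly from the sample counts. A quick arithmetic check (the variable names are illustrative only):

```python
# Sample counts as stated in the split list above.
splits = {"train": 14299, "validation": 1590, "test": 3971}
total = sum(splits.values())

# Share of each split, rounded to the nearest whole percent.
shares = {name: round(100 * count / total) for name, count in splits.items()}

print(total)   # 19860
print(shares)  # {'train': 72, 'validation': 8, 'test': 20}
```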

### Label Distribution

Each review is assigned one of four labels:

* **NO-SPAM** (0): Genuine product reviews
* **SPAM-1** (1): Fake reviews (synthetic or manipulated)
* **SPAM-2** (2): Brand-only reviews (mention only the brand, with no product details)
* **SPAM-3** (3): Irrelevant reviews (unrelated content)

## Results

The model is evaluated on the test set with accuracy, macro-F1, macro-precision, and macro-recall. The scores have not been filled in yet:

* Results: `<INSERT_METRICS>`

## Usage

You can use this model to detect spam in Vietnamese reviews. Below is an example:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "visolex/visobert-spam-multiclass"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

# Example review text ("This product is very good, the shop delivers fast!")
text = "Sản phẩm này rất tốt, shop giao hàng nhanh!"

# Tokenize, truncating to the 256-token length used in training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

probabilities = torch.softmax(outputs.logits, dim=-1)
predicted_class = outputs.logits.argmax(dim=-1).item()

# Map class id to label
label_map = {
    0: "NO-SPAM",
    1: "SPAM-1 (fake review)",
    2: "SPAM-2 (brand-only)",
    3: "SPAM-3 (irrelevant)",
}
predicted_label = label_map[predicted_class]
confidence = probabilities[0][predicted_class].item()

print(f"Text: {text}")
print(f"Predicted: {predicted_label} (confidence: {confidence:.2%})")
```
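
The final step of the example, turning logits into a label and a confidence, does not depend on the model itself. Below is a minimal sketch in plain Python, assuming the label order listed in this README. The function name `decode_logits` and the sample logits are made up for illustration and are not real model output.

```python
import math

# Label ids 0..3 in the order listed in this README.
LABELS = ["NO-SPAM", "SPAM-1", "SPAM-2", "SPAM-3"]


def decode_logits(logits):
    """Turn one row of four raw logits into (label, confidence)."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability (first index wins on ties).
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits, chosen by hand.
label, confidence = decode_logits([2.0, 0.5, -1.0, 0.1])
print(label, f"{confidence:.2%}")
```

Keeping this step separate makes it easy to unit-test the label mapping without downloading the model.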

## Citation

If you use this model, please cite:

```bibtex
@misc{{model_key}_spam_detection,
  title={{description}},
  author={ViSoLex Team},
  year={2025},
  howpublished={\url{https://huggingface.co/visolex/visobert-spam-multiclass}}
}
```

## License

This model is released under the Apache-2.0 license.

## Acknowledgments

* Base model: [uitnlp/visobert](https://huggingface.co/uitnlp/visobert)
* Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
* ViSoLex Toolkit for Vietnamese NLP