# bkai-foundation-models/vietnamese-bi-encoder

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
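
Below is a minimal usage sketch (not part of the original card) showing how a sentence-transformers bi-encoder like this one is typically loaded and used for semantic search. The example sentences are illustrative, and PhoBERT-based models generally expect word-segmented Vietnamese input, which is omitted here for brevity.

```python
# Minimal usage sketch (assumption: standard sentence-transformers API;
# example sentences are illustrative and not word-segmented).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bkai-foundation-models/vietnamese-bi-encoder")

query = "Trường nào đào tạo công nghệ thông tin?"               # "Which school teaches IT?"
corpus = [
    "Đại học Bách khoa Hà Nội đào tạo công nghệ thông tin.",    # relevant
    "Hôm nay trời mưa rất to.",                                 # unrelated
]

# Each embedding is a 768-dimensional dense vector.
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query.
scores = util.cos_sim(query_emb, corpus_emb)[0]
for sentence, score in sorted(zip(corpus, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}  {sentence}")
```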

We train the model on a merged training dataset consisting of:

- MS MARCO (translated into Vietnamese)
- SQuAD v2.0 (translated into Vietnamese)
- 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge

We use PhoBERT-base-v2 as the pre-trained backbone.

Here are the results on the remaining 20% of the training set from the Legal Text Retrieval Zalo 2021 challenge:

| Pretrained Model | Training Datasets | Acc@1 | Acc@10 | Acc@100 | Pre@10 | MRR@10 |
|------------------|-------------------|:-----:|:------:|:-------:|:------:|:------:|
| [Vietnamese-SBERT](https://huggingface.co/keepitreal/vietnamese-sbert) | - | 32.34 | 52.97 | 89.84 | 7.05 | 45.30 |
| | MS MARCO | 54.06 | 84.69 | 93.75 | 8.33 | 64.56 |
| PhoBERT-base-v2 | MS MARCO | 47.81 | 77.19 | 92.34 | 7.72 | 58.37 |
| | MS MARCO + SQuAD v2.0 + 80% Zalo | 73.28 | 93.59 | 98.85 | 9.36 | 80.73 |
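
For reference, Acc@k is the fraction of queries whose relevant article appears in the top-k retrieved results, and MRR@10 is the mean reciprocal rank of the first relevant result within the top 10. The sketch below is an illustrative assumption (not the authors' evaluation code) of how these two metrics can be computed from ranked retrieval output with one relevant document per query.

```python
# Illustrative sketch (assumption, not the original evaluation script):
# compute Acc@k and MRR@k given one relevant document id per query.
from typing import Dict, List

def acc_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, str], k: int) -> float:
    """Fraction of queries whose relevant doc id appears in the top-k ranking."""
    hits = sum(1 for q, docs in ranked.items() if relevant[q] in docs[:k])
    return hits / len(ranked)

def mrr_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, str], k: int = 10) -> float:
    """Mean reciprocal rank of the relevant doc id, counting only the top-k results."""
    total = 0.0
    for q, docs in ranked.items():
        topk = docs[:k]
        if relevant[q] in topk:
            total += 1.0 / (topk.index(relevant[q]) + 1)
    return total / len(ranked)

# Toy example: two queries, each with one relevant document.
ranked = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2", "d4"]}
relevant = {"q1": "d1", "q2": "d9"}
print(acc_at_k(ranked, relevant, k=1))   # 0.5  (only q2 hits at rank 1)
print(mrr_at_k(ranked, relevant, k=10))  # (1/3 + 1/1) / 2 ≈ 0.667
```
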
<!--- Describe your model here -->