ytu-ce-cosmos/turkish-mini-bert-uncased

tkesgin committed on Jul 29, 2023

Commit 789b3ec · 1 Parent(s): b2c2c47

Update README.md

Files changed (1)

README.md +75 -1

@@ -1,3 +1,77 @@
 ---
-license: mit
 ---

 ---
+widget:
+- text: "gelirken bir litre [MASK] aldım."
+  example_title: "Örnek 1"
 ---
+# turkish-mini-bert-uncased
+This is a Turkish Mini uncased BERT model, developed to fill the gap in small BERT models for Turkish. Because this model is uncased, it does not distinguish between turkish and Turkish.
+#### ⚠ Uncased use requires manual lowercase conversion
+**Don't** use the `do_lower_case = True` flag with the tokenizer. Instead, convert your text to lower case as follows:
+```python
+text.replace("I", "ı").lower()
+```
+This is due to a [known issue](https://github.com/huggingface/transformers/issues/6680) with the tokenizer.
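The one-liner above remaps only the dotless pair (`I` to `ı`). As a minimal sketch, the dotted capital `İ` can be mapped explicitly as well, because Python's `str.lower()` turns `İ` into `i` followed by a combining dot above (U+0307). The helper name `turkish_lower` is hypothetical and is not part of this commit or of the model's API. Whether the combining mark affects this model's tokenizer is not stated in the source, so treat the extra mapping as an assumption.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for the dotted/dotless i pair."""
    # Map the two Turkish-specific capitals before calling lower():
    #   "I" (U+0049) -> "ı" (U+0131, dotless i), as in the README one-liner
    #   "İ" (U+0130) -> "i" (U+0069), an added assumption of this sketch
    # Without the second mapping, str.lower() yields "i" + U+0307 for "İ".
    return text.replace("I", "ı").replace("İ", "i").lower()


print(turkish_lower("ISPARTA"))   # ısparta
print(turkish_lower("İstanbul"))  # istanbul
```

The result can then be passed to the tokenizer or pipeline in place of the raw text.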
+Be aware that this model may exhibit biased predictions as it was trained primarily on crawled data, which inherently can contain various biases.
+Other relevant information can be found in the [paper](https://arxiv.org/abs/2307.14134).
+## Example Usage
+```python
+from transformers import AutoTokenizer, BertForMaskedLM, pipeline
+# Load the masked-language model and its tokenizer from the Hugging Face Hub.
+model = BertForMaskedLM.from_pretrained("ytu-ce-cosmos/turkish-mini-bert-uncased")
+# or
+# model = BertForMaskedLM.from_pretrained("ytu-ce-cosmos/turkish-mini-bert-uncased", from_tf=True)
+tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/turkish-mini-bert-uncased")
+unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
+unmasker("gelirken bir litre [MASK] aldım.")  # output shown below
+[{'score': 0.16809749603271484,
+  'token': 2417,
+  'token_str': 'su',
+  'sequence': 'gelirken bir litre su aldım.'},
+ {'score': 0.16734205186367035,
+  'token': 11818,
+  'token_str': 'benzin',
+  'sequence': 'gelirken bir litre benzin aldım.'},
+ {'score': 0.11109649389982224,
+  'token': 4521,
+  'token_str': 'süt',
+  'sequence': 'gelirken bir litre süt aldım.'},
+ {'score': 0.03409354388713837,
+  'token': 5662,
+  'token_str': 'suyu',
+  'sequence': 'gelirken bir litre suyu aldım.'},
+ {'score': 0.031942177563905716,
+  'token': 7157,
+  'token_str': 'kahve',
+  'sequence': 'gelirken bir litre kahve aldım.'}]
+```
+## Acknowledgments
+- Research supported with Cloud TPUs from [Google's TensorFlow Research Cloud](https://sites.research.google/trc/about/) (TFRC). Thanks for providing access to the TFRC ❤️
+- Thanks to the generous support from the Hugging Face team, it is possible to download models from their S3 storage 🤗
+## Citations
+```bibtex
+@article{kesgin2023developing,
+  title={Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models},
+  author={Kesgin, Himmet Toprak and Yuce, Muzaffer Kaan and Amasyali, Mehmet Fatih},
+  journal={arXiv preprint arXiv:2307.14134},
+  year={2023}
+}
+```
+## License
+MIT