IMPLEMENTATION OF DIFFERENT TYPES OF TOKENIZERS IN TRANSFORMER ARCHITECTURE FOR MACHINE TRANSLATION TASK
DOI: https://doi.org/10.35546/kntu2078-4481.2024.1.25
Keywords: language model, machine translation, tokenization, transformer architecture
Abstract
Tokenization is the first step for almost all natural language processing tasks, and all modern language models use subword tokenization algorithms to process the input text. Due to the unique characteristics of various languages, the development of a tokenization algorithm typically requires language-specific customization. Pre-trained models for languages with limited training resources use the same tokenizers as models for English. The impact of tokenization algorithms may differ for resource-constrained languages, particularly those in which words commonly carry prefixes and suffixes. In addition, the impact of different tokenization methods has not been studied in detail for low-resource languages, including Ukrainian. In this work, we train WordPiece, BPE, and Unigram tokenizers to study their effectiveness in terms of the accuracy of machine translation of sentences from English into Ukrainian. To conduct an experimental comparison of tokenizers for the English-to-Ukrainian translation task, we did not use an existing pre-trained language model. Instead, we pre-trained our own medium-sized language models based on the configuration and training procedure of the Marian model. The developed pipeline consists of collecting and cleaning a training corpus of sentence pairs, training a tokenizer with a fixed-size vocabulary, and pre-training a deep language model with the selected tokenizer. The accuracy of the models was then evaluated using metrics such as SacreBLEU and ROUGE. The obtained experimental results emphasize the role of tokenization in language modeling, in particular for morphologically rich languages. Moreover, the higher morphological accuracy of Unigram tokenization leads to better performance in machine translation tasks.
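As a concrete illustration of the tokenizer-training step of this pipeline, the sketch below uses the Hugging Face tokenizers library to train the three compared subword models on a parallel corpus. It is a minimal example under stated assumptions rather than the authors' training script: the corpus file name, the 32,000-entry vocabulary size, and the special-token set are illustrative choices not taken from the paper.

# Illustrative sketch (not the authors' exact code): training the three subword
# tokenizers compared in the paper with the Hugging Face "tokenizers" library.
# The corpus path, vocabulary size, and special tokens are assumed values.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

VOCAB_SIZE = 32_000                      # fixed-size vocabulary (assumed)
CORPUS = ["parallel.en-uk.txt"]          # cleaned English-Ukrainian sentences (hypothetical file)
SPECIAL = ["<unk>", "<pad>", "<s>", "</s>"]

def train_tokenizer(kind: str) -> Tokenizer:
    # Choose the subword model and the matching trainer.
    if kind == "bpe":
        tok = Tokenizer(models.BPE(unk_token="<unk>"))
        trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=SPECIAL)
    elif kind == "wordpiece":
        tok = Tokenizer(models.WordPiece(unk_token="<unk>"))
        trainer = trainers.WordPieceTrainer(vocab_size=VOCAB_SIZE, special_tokens=SPECIAL)
    else:  # unigram
        tok = Tokenizer(models.Unigram())
        trainer = trainers.UnigramTrainer(vocab_size=VOCAB_SIZE, special_tokens=SPECIAL,
                                          unk_token="<unk>")
    tok.pre_tokenizer = pre_tokenizers.Whitespace()   # split on whitespace/punctuation first
    tok.train(files=CORPUS, trainer=trainer)
    return tok

for kind in ("bpe", "wordpiece", "unigram"):
    train_tokenizer(kind).save(f"{kind}-en-uk.json")

Each resulting tokenizer would then be paired with a Marian-configuration translation model, and the produced translations scored with SacreBLEU and ROUGE, as described in the abstract.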
References
Vaswani, A. et al. (2017). Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS). https://doi.org/10.48550/arXiv.1706.03762
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. In 54th Annual Meeting of the Association for Computational Linguistics (pp. 1715–1725). Association for Computational Linguistics (ACL).
Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 66–75).
Schuster, M., & Nakajima, K. (2012). Japanese and Korean voice search. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5149–5152). IEEE.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Liu, Y., Ott, M., et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Lewis, M., Liu, Y., et al. (2019). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Conference on Empirical Methods in Natural Language Processing.
Yang, Z., Dai, Z., et al. (2019). XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
Lan, Z., Chen, M., et al. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Raffel, C., Shazeer, N., et al. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
MarianNMT. (n.d.). https://marian-nmt.github.io/
Domingo, M., García-Martínez, M., Helle, A., Casacuberta, F., & Herranz, M. (2019). How much does tokenization affect neural machine translation?. In International Conference on Computational Linguistics and Intelligent Text Processing (pp. 545–554). Cham: Springer Nature Switzerland.
Gallé, M. (2019). Investigating the effectiveness of BPE: The power of shorter sequences. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 1375–1381). Association for Computational Linguistics.
Bostrom, K., & Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining. arXiv preprint arXiv:2004.03720.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., et al. (2019). HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.
Hugging Face. (n.d.). https://huggingface.co/datasets/kde4