IMPLEMENTATION OF THE MULTI-HEAD ATTENTION MECHANISM AND TRANSFORMER MODEL FOR MACHINE TRANSLATION
DOI: https://doi.org/10.35546/kntu2078-4481.2023.1.15
Keywords: attention mechanism, machine translation, natural language processing, transformer model
Abstract
The attention mechanism is used in a wide range of neural architectures and has been researched within diverse application domains. It has become a popular deep learning technique for several reasons. First, state-of-the-art models that incorporate attention mechanisms achieve strong results on a variety of tasks such as text classification, image captioning, sentiment analysis, natural language recognition, and machine translation. Using an attention mechanism, neural architectures can automatically weight the relevance of any region of the input text and take those weights into account when solving the underlying problem. In addition, the popularity of attention mechanisms increased further with the introduction of the transformer model, which once again demonstrated how effective the attention mechanism is. The transformer architecture does not use sequential processing or recurrence; it relies only on the self-attention mechanism to capture global dependencies between input and output sequences. In this paper, a transformer model that implements scaled dot-product attention is used, which follows the procedure of the general attention mechanism. The model is built on the multi-head attention mechanism, in which the self-attention module repeats its calculation several times in parallel, and these parallel calculations are then combined to produce a final attention estimate. Applying multi-head attention gives the model more capacity to encode multiple relations and nuances for each word. The multi-head attention mechanism also allows the attention function to draw information from different parts of the representation space, which is not possible with a single self-attention head. The transformer model was implemented using the TensorFlow and Keras frameworks for the task of machine translation from English to Ukrainian. The dataset for model training, validation, and testing was obtained from the Tatoeba Project. A custom word embedding layer was implemented using a positional encoding matrix.
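The sketch below is a minimal TensorFlow/Keras illustration of the building blocks described in the abstract: the scaled dot-product attention formula, a word-embedding layer that adds a sinusoidal positional encoding matrix, and Keras's built-in MultiHeadAttention layer standing in for the multi-head module. The PositionalEmbedding class name, the vocabulary size, the model dimension, and the number of heads are illustrative assumptions rather than the exact configuration used in the paper.

import numpy as np
import tensorflow as tf


def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    if mask is not None:
        scores += (1.0 - mask) * -1e9  # suppress masked (e.g. padding) positions
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v), weights


class PositionalEmbedding(tf.keras.layers.Layer):
    # Word embedding plus a fixed sinusoidal positional encoding matrix
    # (illustrative layer; not the authors' exact implementation).

    def __init__(self, vocab_size, d_model, max_len=2048, **kwargs):
        super().__init__(**kwargs)
        self.d_model = d_model
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = self._positional_encoding(max_len, d_model)

    @staticmethod
    def _positional_encoding(max_len, d_model):
        positions = np.arange(max_len)[:, np.newaxis]
        dims = np.arange(d_model)[np.newaxis, :]
        angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
        angles = positions * angle_rates
        angles[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sine
        angles[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cosine
        return tf.cast(angles[np.newaxis, ...], tf.float32)

    def call(self, tokens):
        seq_len = tf.shape(tokens)[1]
        # Scale embeddings by sqrt(d_model) before adding positions, as in Vaswani et al.
        x = self.embedding(tokens) * tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        return x + self.pos_encoding[:, :seq_len, :]


# Multi-head attention repeats the scaled dot-product calculation in several
# parallel heads and combines the results; Keras provides this as a layer.
vocab_size, d_model, num_heads = 8000, 128, 8  # assumed sizes
tokens = tf.random.uniform((2, 10), 0, vocab_size, dtype=tf.int32)
x = PositionalEmbedding(vocab_size, d_model)(tokens)
ctx, attn = scaled_dot_product_attention(x, x, x)  # single-head self-attention
mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
out = mha(query=x, value=x, key=x)  # multi-head self-attention over the sequence
print(out.shape)  # (2, 10, 128)

In a full encoder-decoder translator this self-attention block would be combined with feed-forward sublayers, residual connections, and layer normalization; the sketch only demonstrates how the attention estimates are computed and combined.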
References
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1409.0473
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS 2017). https://doi.org/10.48550/arXiv.1706.03762
Galassi, A., Lippi, M., & Torroni, P. (2021). Attention in Natural Language Processing. IEEE Transactions on Neural Networks and Learning Systems, Vol. 32, No. 10. https://doi.org/10.1109/TNNLS.2020.3019893
Chaudhari, Sh., Mithal, V., Polatkan, G., & Ramanath, R. (2021). An Attentive Survey of Attention Models. ACM Transactions on Intelligent Systems and Technology, Vol. 1, No. 1. https://doi.org/10.1145/3465055
Brauwers, G., & Frasincar, F. (2021). A General Survey on Attention Mechanisms in Deep Learning. IEEE Transactions on Knowledge and Data Engineering (TKDE). https://doi.org/10.1109/TKDE.2021.3126456
Cristina, S., & Saeed, M. (2022). Building Transformer Models with Attention: Implementing a Neural Machine Translator from Scratch in Keras. Machine Learning Mastery.
Rothman, D. (2022). Transformers for Natural Language Processing: Build, train, and fine-tune deep neural network architectures for NLP with Python, PyTorch, TensorFlow, BERT, and GPT-3, 2nd Edition. Packt Publishing.
Yıldırım, S., & Asgari-Chenaghlu, M. (2021). Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques. Packt Publishing.
Tatoeba Project. (n.d.). http://tatoeba.org/home