Scientific projects

Grant funding for young scientists under the "Zhas Galym" project for 2024-2026

Project Manager: Kamshat Tusupova, PhD

Project Title:

Development of an intelligent information system for optimal resource allocation for manufacturing enterprises and forecasting their development dynamics.
IRN: AP22684879

Project Description: The project aims to develop an intelligent information system that ensures optimal resource allocation for manufacturing enterprises and forecasts their development dynamics. The research is based on the Cobb-Douglas production function, adapted to external influences and uncertainty, allowing for the construction of an economic and mathematical model reflecting real-world processes. The project plans to develop control algorithms that ensure flexible and efficient resource allocation, as well as integrate machine learning methods to improve forecasting accuracy. The system will include a user interface and a database for storing and processing information. To evaluate its effectiveness, testing on real data and subsequent monitoring of the results of implementation in a production environment are planned.

Results: An economic and mathematical model based on the Cobb-Douglas function was developed, taking into account external factors and uncertainty. Control algorithms were created to ensure optimal resource allocation under changing conditions. Machine learning methods were integrated, improving forecasting accuracy and system efficiency.
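
As a minimal illustration of the kind of allocation problem a Cobb-Douglas model leads to, the sketch below maximizes output Y = A * K^alpha * L^beta under a fixed budget. The parameter values, the input prices, and all names are assumptions chosen for the example. The project's own model, which also accounts for external factors and uncertainty, is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CobbDouglas:
    """Production function Y = A * K**alpha * L**beta (textbook form)."""
    A: float      # total factor productivity (assumed value)
    alpha: float  # output elasticity of capital (assumed value)
    beta: float   # output elasticity of labor (assumed value)

    def output(self, capital: float, labor: float) -> float:
        return self.A * capital ** self.alpha * labor ** self.beta

    def optimal_allocation(self, budget: float, capital_price: float,
                           labor_price: float) -> tuple[float, float]:
        """Maximize output subject to capital_price*K + labor_price*L = budget.

        The first-order conditions give each input a share of the budget
        proportional to its elasticity.
        """
        share_k = self.alpha / (self.alpha + self.beta)
        capital = share_k * budget / capital_price
        labor = (1.0 - share_k) * budget / labor_price
        return capital, labor


# Hypothetical enterprise: all numbers are illustrative only.
model = CobbDouglas(A=1.0, alpha=0.3, beta=0.7)
K, L = model.optimal_allocation(budget=1000.0, capital_price=2.0, labor_price=5.0)
print(f"capital = {K:.1f}, labor = {L:.1f}, output = {model.output(K, L):.2f}")
```

Any other allocation that spends the same budget, for example K = 160 and L = 136 at these prices, yields a lower output under this function.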

Funding volume by year for 2024-2026:

Total amount: 25,068,744 tenge
Amount for 2024: 7,764,468 tenge
Amount for 2025: 8,513,893 tenge
Amount for 2026: 8,790,383 tenge

Project Manager: Prof. Ualsher Tukeyev

Project Title: Study of Neural Models for the Generation of Speech Transcripts and Meeting Minutes in Turkic Languages.
IRN: AP23487816

Project Description:

The development of interaction between Turkic languages is a pressing issue in light of the growing cooperation among Turkic states. One important area of this interaction is the efficient processing of the minutes of meetings held in Turkic languages.
The project aims to develop a comprehensive technology based on neural models for the automatic generation of speech transcripts and meeting minutes in Turkic languages. The objects of study are Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek. Particular attention is paid to the following pairs: Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkish–Kazakh, Turkmen–Kazakh, and Uzbek–Kazakh.

Results:

Expected results of the project:
1) models, methods, and tools for a comprehensive technology for generating speech transcripts and meeting minutes in Turkic languages;
2) models, methods, and tools for machine translation of Turkic language transcripts into Kazakh;
3) datasets for training neural models;
4) models, methods, and tools for generating meeting minutes in Kazakh from transcripts.

Project results achieved as of July 2025:
- Speech-to-text recognition tools have been selected for the Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek languages;
- Relational machine translation technologies have been developed for the following language pairs: Turkish-Kazakh, Uzbek-Kazakh, and Kyrgyz-Kazakh;
- Five synthetic parallel corpora have been generated using back-translation: Azeri-Kazakh, Kyrgyz-Kazakh, Turkish-Kazakh, Turkmen-Kazakh, and Uzbek-Kazakh, each containing 1 million parallel sentences (200,000, 300,000, and 500,000);
- The total size of the resulting synthetic parallel corpora is 5 million parallel sentences across these five pairs;
- The 5 million parallel sentences were cleaned using an automated program to remove abbreviation and repetition errors;
- After cleaning and removing duplicates, approximately 782,000 sentences remained for each language pair, totaling approximately 3,910,000 sentences;
- The NLLB 1.3B (Meta) model was trained on 487,000 synthetic, cleaned parallel sentences for all language pairs;
- Three models were evaluated: the base NLLB 1.3B model, the NLLB 1.3B model trained on synthetic corpora, and the NLLB 1.3B model trained on synthetic and cleaned corpora (497,000 words). Evaluation was performed using the WER, TER, BLEU, and chrF metrics. The BLEU scores for the three models are: Azeri-Kazakh: 22.99, 37.28, 47.84; Kyrgyz-Kazakh: 17.72, 29.73, 48.27; Turkmen-Kazakh: 9.18, 22.82, 33.22; Turkish-Kazakh: 18.52, 33.2, 42.10; Uzbek-Kazakh: 19.7, 31.71. The BLEU score of the model trained on the cleaned corpus improved by an average of about 25 points compared to the base model.
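
The reported average gain can be checked against the BLEU scores listed above. The short sketch below assumes that the scores are given in the order base model, second model, model trained on the cleaned corpus. It uses only the four language pairs for which all three scores are listed, so the Uzbek-Kazakh pair is left out. Variable names are illustrative.

```python
# BLEU scores as listed in the results, assumed order:
# (base model, second model, model trained on the cleaned corpus)
bleu = {
    "Azeri-Kazakh": (22.99, 37.28, 47.84),
    "Kyrgyz-Kazakh": (17.72, 29.73, 48.27),
    "Turkmen-Kazakh": (9.18, 22.82, 33.22),
    "Turkish-Kazakh": (18.52, 33.2, 42.10),
}

# Gain of the last model over the base model, per language pair.
gains = {pair: scores[2] - scores[0] for pair, scores in bleu.items()}
average_gain = sum(gains.values()) / len(gains)

for pair, gain in gains.items():
    print(f"{pair}: +{gain:.2f} BLEU")
print(f"Average gain over {len(gains)} pairs: {average_gain:.2f} BLEU")
```

Over these four pairs the average gain comes out slightly above 25 BLEU points, which is consistent with the figure stated in the results.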

Project Manager: Aidana Karibayeva, PhD
Project Title: Research on Automatic Generation of Parallel Speech Corpora of Turkic Languages and Their Use for Neural Models
IRN: AP23488624
This project is dedicated to the creation of a parallel speech corpus for Turkic languages, a pressing issue in linguistics, computational linguistics, and information technology. Such a corpus will enable more accurate research on linguistic features, the development of new analysis and processing models, and the development of machine translation, speech recognition, and synthesis systems. The project explores two approaches: a cascaded corpus generation scheme (STT-TTT-TTS) and its direct application in training STS (Speech-to-Speech) systems. The project's implementation will lay the foundation for automated applications, including translators, chatbots, and intelligent communication systems. The results obtained will be useful for both scientific research and practical developments aimed at supporting and developing Turkic languages. They will also contribute significantly to the preservation of cultural and linguistic heritage, strengthening interlingual communication, and expanding access to modern digital technologies for speakers of Turkic languages.
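
The cascaded scheme can be pictured as three stages applied in sequence: speech-to-text (STT), text-to-text translation (TTT), and text-to-speech (TTS). The sketch below only shows how such stages might compose into one entry of a parallel speech corpus. The stage functions are hypothetical placeholders, not the project's models or any real library API.

```python
from dataclasses import dataclass
from typing import Callable

Audio = bytes  # placeholder type standing in for raw audio data


@dataclass(frozen=True)
class SpeechPair:
    """One entry of a parallel speech corpus."""
    source_audio: Audio
    source_text: str
    target_text: str
    target_audio: Audio


def make_cascade(
    stt: Callable[[Audio], str],
    ttt: Callable[[str], str],
    tts: Callable[[str], Audio],
) -> Callable[[Audio], SpeechPair]:
    """Compose STT -> TTT -> TTS into a single corpus-entry generator."""
    def run(source_audio: Audio) -> SpeechPair:
        source_text = stt(source_audio)   # recognize source speech
        target_text = ttt(source_text)    # translate the transcript
        target_audio = tts(target_text)   # synthesize target speech
        return SpeechPair(source_audio, source_text, target_text, target_audio)
    return run


# Hypothetical stand-ins so the sketch runs without any models.
def fake_stt(audio: Audio) -> str:
    return audio.decode("utf-8")


def fake_ttt(text: str) -> str:
    return f"[translated] {text}"


def fake_tts(text: str) -> Audio:
    return text.encode("utf-8")


cascade = make_cascade(fake_stt, fake_ttt, fake_tts)
pair = cascade(b"salem")
print(pair.source_text, "->", pair.target_text)
```

Keeping the intermediate texts alongside the audio means each generated entry can later serve either cascaded or direct speech-to-speech training.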
Expected project results:
A cascade-based machine translation scheme for speech from Kazakh to Turkic languages will be developed; parallel audio corpora from Kazakh to Turkish, Tatar, and Uzbek will be created and used in training machine translation systems; and the results of cascade-based and neural machine translation technologies will be obtained and evaluated using quality metrics.
Over the entire project period, articles will be prepared and published in highly ranked journals in accordance with the requirements of the competition documentation:
- At least three (3) articles and/or reviews in peer-reviewed scientific journals in the project's scientific field that are indexed in the Science Citation Index Expanded and fall within the 1st (first), 2nd (second), and/or 3rd (third) quartile by impact factor in the Web of Science database and/or have a CiteScore percentile in Scopus of at least sixty (60);
- At least one (1) article or review in a peer-reviewed foreign or domestic publication recommended by the Committee for Quality Assurance in Science and Higher Education of the Republic of Kazakhstan;
- One monograph published by a domestic publisher.

Total amount: 91,286,606.18 tenge
Amount for 2024: 26,922,345.17 tenge
Amount for 2025: 30,489,305.39 tenge
Amount for 2026: 33,874,955.62 tenge

Project Title: Research of Models and Development of an Intelligent Question-Answering System Based on Semantic Approaches for the State Language in the Field of Legislation of the Republic of Kazakhstan
IRN: AP19677835

Project Manager: Diana Rakhimova, PhD, Associate Professor

Implementation Period: 2023-2025

Relevance

The relevance of this topic stems from the need to improve the accessibility and effectiveness of legal information in the state language of the Republic of Kazakhstan. Modern legal documents have a complex structure, making them difficult to understand without specialized knowledge. The use of semantic approaches in intelligent question-answering systems enables the automatic analysis and interpretation of legal texts, providing citizens and specialists with convenient access to information. In the context of digitalization of public administration, such a system could significantly improve the quality of legal consulting. The development of natural language processing (NLP) technologies for the Kazakh language requires the adaptation and creation of specialized models that take into account its morphological and syntactic features. Researching various models for question-answering systems will help identify the most effective methods for semantic analysis of legal texts. Thus, the development of an intelligent system based on semantic approaches contributes to the development of a digital legal information ecosystem and the strengthening of Kazakhstan's language policy.

Goal

The project aims to create an intelligent question-answering system that understands questions in Kazakh related to the legislation of the Republic of Kazakhstan and provides accurate and understandable answers. The system relies on semantic approaches, analyzing the meaning of the text to correctly interpret queries and find answers. Machine learning and natural language processing (NLP) technologies, including semantic models, are used.
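
As a rough illustration of the retrieval step in a question-answering system, the sketch below ranks a few passages against a question by cosine similarity of word-count vectors. This bag-of-words similarity is a deliberately simple stand-in for the semantic models the project refers to. The passages are invented placeholders rather than actual legal text, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_passage(question: str, passages: list[str]) -> str:
    """Return the passage most similar to the question."""
    q = Counter(tokenize(question))
    return max(passages, key=lambda p: cosine(q, Counter(tokenize(p))))


# Invented placeholder passages, not quotations from any law.
passages = [
    "An employment contract is concluded in writing between employee and employer.",
    "Property tax is paid annually by owners of real estate.",
]
answer = best_passage("How is an employment contract concluded?", passages)
print(answer)
```

A system for Kazakh would need tokenization and representations that account for the language's morphology, which plain word matching like this does not capture.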