SOTASTREAM: A Streaming Approach to Machine Translation Training
Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer.
Tags: Paper and LLMs, Machine Translation management
Pricing Type
- Pricing Type: Free
- Price Range Start($):
GitHub Link
The GitHub link is https://github.com/marian-nmt/sotastream
Introduction
“Sotastream is a data augmentation tool designed for training pipelines. It utilizes infinibatch to create a continuous stream of shuffled training data and offers real-time data manipulation, augmentation, mixing, and sampling. It can be installed from PyPI or GitHub and provides entry points for both module and command-line usage. Developers are encouraged to use the editable mode for direct code edits. The tool supports various usage examples and pipeline options. Sotastream’s development is led by the TextMT Team at Microsoft Translator.”
Content
Sotastream is a data augmentation tool for training pipelines. Internally it uses infinibatch to generate an infinite stream of shuffled training data, and it supports on-the-fly data manipulation, augmentation, mixing, and sampling.
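The idea of an infinite, shuffled, mixed stream can be illustrated with a minimal sketch in plain Python. This is not the Sotastream or infinibatch API: the function names (infinite_shuffled, mix, augment_uppercase), the tab-separated sample lines, and the mixing weights are all hypothetical and chosen only for illustration.

```python
import itertools
import random
from typing import Iterator, List, Sequence


def infinite_shuffled(lines: Sequence[str], seed: int) -> Iterator[str]:
    """Yield lines forever, reshuffling before every pass over the data."""
    pool: List[str] = list(lines)
    if not pool:
        raise ValueError("cannot stream from an empty data source")
    rng = random.Random(seed)
    while True:
        rng.shuffle(pool)
        yield from pool


def mix(streams: Sequence[Iterator[str]], weights: Sequence[float], seed: int) -> Iterator[str]:
    """Draw each next item from one of the streams, chosen by weight."""
    rng = random.Random(seed)
    indices = list(range(len(streams)))
    while True:
        (i,) = rng.choices(indices, weights=weights, k=1)
        yield next(streams[i])


def augment_uppercase(stream: Iterator[str], prob: float, seed: int) -> Iterator[str]:
    """Toy on-the-fly augmentation: uppercase a line with probability prob."""
    rng = random.Random(seed)
    for line in stream:
        yield line.upper() if rng.random() < prob else line


# Hypothetical data: source and target separated by a tab.
parallel = infinite_shuffled(["hello\thallo", "cat\tkatze", "dog\thund"], seed=1)
backtranslated = infinite_shuffled(["tree\tbaum", "house\thaus"], seed=2)

# Roughly 80% of lines come from the first stream, 20% from the second.
mixed = mix([parallel, backtranslated], weights=[0.8, 0.2], seed=3)

# The stream never ends, so the consumer decides how much to take.
sample = list(itertools.islice(augment_uppercase(mixed, prob=0.1, seed=4), 10))
```

Because every stage is a generator, nothing is materialized ahead of time: shuffling, mixing, and augmentation happen only as the consumer pulls lines, which is the property that lets a trainer read data without a separate preprocessing pass.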











