SOTASTREAM: A Streaming Approach to Machine Translation Training

Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer.

Pricing Type

  • Pricing Type: Free

GitHub Link

The GitHub link is https://github.com/marian-nmt/sotastream

Introduction

“Sotastream is a data augmentation tool designed for training pipelines. It utilizes infinibatch to create a continuous stream of shuffled training data and offers real-time data manipulation, augmentation, mixing, and sampling. It can be installed from PyPI or GitHub and provides entry points for both module and command-line usage. Developers are encouraged to use the editable mode for direct code edits. The tool supports various usage examples and pipeline options. Sotastream’s development is led by the TextMT Team at Microsoft Translator.”

Content

Sotastream is a data augmentation tool for training pipelines. Internally it uses infinibatch to generate an infinite stream of shuffled training data, and it supports on-the-fly data manipulation, augmentation, mixing, and sampling.
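The streaming idea can be illustrated with a minimal sketch in plain Python. This is not Sotastream's or infinibatch's actual API: the function names (`infinite_shuffled`, `augment`, `mix`), the tab-separated line format, and the toy data are all assumptions made for illustration. It only shows how an endless shuffled stream, an on-the-fly augmentation, and weighted mixing of two streams might fit together.

```python
import itertools
import random
from typing import Callable, Iterator, List, Sequence


def infinite_shuffled(data: Sequence[str], seed: int) -> Iterator[str]:
    """Yield items forever, reshuffling the data before every pass."""
    rng = random.Random(seed)
    items = list(data)
    while True:
        rng.shuffle(items)
        yield from items


def augment(stream: Iterator[str], fn: Callable[[str], str]) -> Iterator[str]:
    """Apply a transformation to each line as it flows through the stream."""
    for line in stream:
        yield fn(line)


def mix(streams: List[Iterator[str]], weights: List[float], seed: int) -> Iterator[str]:
    """Draw each next line from one of the streams, chosen by weight."""
    rng = random.Random(seed)
    while True:
        (chosen,) = rng.choices(streams, weights=weights, k=1)
        yield next(chosen)


# Hypothetical toy corpus: one "source<TAB>target" pair per line.
corpus = ["hello\thallo", "thank you\tdanke", "good night\tgute nacht"]

plain = infinite_shuffled(corpus, seed=1)
uppercased = augment(infinite_shuffled(corpus, seed=2), str.upper)
mixed = mix([plain, uppercased], weights=[0.5, 0.5], seed=3)

# The stream never ends, so a consumer takes as many lines as it needs.
sample = list(itertools.islice(mixed, 6))
```

Because every stage is a generator, no tensorized copy of the data is written to disk first; a trainer would simply keep reading lines from `mixed`.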
