Sure thing. From what I understand, it's TUBITAK's hand-rolled version of an end-to-end machine learning platform. ML/data scientists and software engineers at various institutions across Turkey doing critical basic science and (defence) R&D probably used (or wanted to use) commercial MLaaS (Machine Learning as a Service) platforms like Azure ML or Amazon SageMaker, which pose data ownership issues. It seems Safir Zeka (SZ) covers a lot of what those platforms have to offer. I will be simplifying a lot, so the following is not a rigorous overview of MLOps.
- Web/Data Scraping: A specialized "crawler" is submitted to SZ, which likely schedules and distributes (lightweight) VMs (virtual machines) to collect from the targeted data sources (for speed and to avoid throttling). These are used to build large textual/media datasets for training models. OpenAI infamously scraped a large portion of the open web to train its family of GPT LLMs (large language models). There's a rough sketch of such a crawler after this list.
- Data Sources: When datasets get sufficiently large, it's difficult to organize and distribute them to users (machines or people). SZ probably provides an interface to databases (structured data only) or data lakes (all types of data, less structured, like a simplified computer filesystem for very large files on the cloud) that users can stream (Apache Spark/Flink) or download on demand. The sources can be versioned, provisioned and distributed across multiple servers for added convenience, safety and redundancy. See the Spark sketch after this list.
- Data Pipelines: Raw data usually needs to be prepared before experimentation or training of ML models. In the diagram, they give the example of normalization, e.g. disparate date formats converted to a standard ISO version, or numbers min-max normalized to lie in a certain interval (usually [0, 1]). SZ probably interfaces with Apache Airflow, a library that lets you chain scripts that do this kind of work into pipelines. These pipelines can be scheduled to run at specified intervals or triggered by programmed events (like new data arriving), and they can process data incrementally or in batches. A toy pipeline follows this list.
- Data Visualization: Processed data from these pipelines is usually cached and written to one of those data sources. Since they mention containers, which behave like lightweight VMs that can be spun up quickly on the cloud, SZ presumably provides services where you launch one of these containers (there are probably templates) and work on your experiments/analyses on the cloud via a notebook interface. Databricks and/or Jupyter notebooks are probably integrated. Google has Colab, which works similarly but is closed-source and runs only on their cloud. These notebooks can be used to develop models, do data analysis and visualization, etc., and they can be automated as part of a pipeline. A sample notebook cell follows this list.
- Training: When a model is ready to be trained, either a notebook or a script is submitted to SZ, which likely distributes the processing across (multiple) powerful clusters to massively speed up training. It's indicated that SZ provides a dashboard interface that lets users monitor the progress/health of the training process. The most basic metric for tracking training progress is the loss function: when the loss is no longer decreasing in sufficiently large steps, you've just about finished training. Below is a screenshot of TensorBoard, which SZ probably provides an interface over. You'll need to report these statistics to SZ from your training script for the monitoring to work, so there's definitely an API (application programming interface) for it; see the logging sketch after this list.
View attachment 62911
- Publishing: When your model is ready, it's essentially a massive multidimensional grid of numbers (a tensor). SZ likely lets users easily host these models inside a container image that holds, in addition to the model, the necessary server code to respond to web requests (e.g., asking ChatGPT a question): the request is first transformed into a feature vector (what the grid of numbers gets fed), and the reply is a processed response, i.e., the model's prediction. The data blob that forms the model, as well as the container, are stored and executed on the same cloud infrastructure as the rest of the features discussed above. A minimal serving sketch closes out the examples below.
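Since SZ's actual APIs aren't public, here are a few rough sketches (in Python, using common open-source tools) of what each stage above might look like. Every URL, file path, endpoint and name in them is a placeholder I made up for illustration, not something taken from SZ.

First, the scraping step: a minimal crawler of the kind a user might submit as a job, with basic rate limiting so the target sites aren't hammered. The seed URLs and output file are assumptions.

```python
# Hypothetical sketch of a "crawler" job; seed URLs, output path, and delay are placeholders.
import time
import requests

SEED_URLS = ["https://example.org/docs", "https://example.org/news"]  # placeholder targets
OUT_PATH = "corpus.txt"   # placeholder output file
DELAY_SECONDS = 1.0       # throttle requests so we don't hammer (or get blocked by) the site

def crawl(urls):
    with open(OUT_PATH, "a", encoding="utf-8") as out:
        for url in urls:
            try:
                resp = requests.get(url, timeout=10)
                if resp.status_code == 200:
                    out.write(resp.text + "\n")  # raw HTML; cleaning happens downstream
            except requests.RequestException:
                pass  # skip unreachable pages; a real crawler would log and retry
            time.sleep(DELAY_SECONDS)  # basic rate limiting between requests

if __name__ == "__main__":
    crawl(SEED_URLS)
```

In practice SZ would fan such a job out across many VMs, each taking a slice of the URL list.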
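Next, the data-source step: reading a large dataset straight off a data lake with PySpark instead of downloading it. The bucket path is invented; the point is that the data stays on shared storage and is streamed to wherever the compute runs.

```python
# Hypothetical sketch of reading a dataset from a data lake with PySpark.
# The object-storage path is a placeholder; cluster/storage credentials are assumed to be configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sz-data-access").getOrCreate()

# Read a large Parquet dataset directly from object storage rather than copying it locally.
df = spark.read.parquet("s3a://sz-datalake/raw/news_articles/")  # placeholder path

df.printSchema()
print(df.count())  # quick sanity check on dataset size
```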
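For the pipeline step, here's a toy Airflow DAG (recent Airflow 2.x style) chaining the two normalization examples from the diagram: dates to ISO 8601, then min-max scaling of a numeric column into [0, 1]. The DAG name, file paths and column names are all made up.

```python
# Hypothetical Airflow DAG chaining two preparation steps: date normalization, then min-max scaling.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def normalize_dates():
    df = pd.read_csv("/data/raw/events.csv")                          # placeholder input
    df["date"] = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")   # disparate formats -> ISO 8601
    df.to_csv("/data/staged/events_dates.csv", index=False)

def min_max_scale():
    df = pd.read_csv("/data/staged/events_dates.csv")
    col = df["amount"]
    df["amount"] = (col - col.min()) / (col.max() - col.min())        # scale into [0, 1]
    df.to_csv("/data/processed/events.csv", index=False)

with DAG(
    dag_id="sz_prepare_events",        # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # could also be triggered by an event, e.g. new data landing
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="normalize_dates", python_callable=normalize_dates)
    t2 = PythonOperator(task_id="min_max_scale", python_callable=min_max_scale)
    t1 >> t2  # run the steps in order
```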
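The visualization step is mostly just "open a notebook in a container and poke at the processed data", so the sketch is a single notebook-style cell; the file and column are the placeholders from the pipeline above.

```python
# Hypothetical notebook cell: quick look at the processed dataset from the pipeline sketch.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("/data/processed/events.csv")   # placeholder path

df["amount"].hist(bins=50)                        # distribution after min-max scaling
plt.title("Scaled amount distribution")
plt.xlabel("amount (scaled to [0, 1])")
plt.ylabel("count")
plt.show()
```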
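For training and monitoring, SZ's own metrics API is unknown, so this sketch uses PyTorch's TensorBoard `SummaryWriter` as a stand-in to show the kind of per-step loss reporting the dashboard would plot. The toy model, fake data and log directory are all placeholders.

```python
# Hypothetical training loop reporting its loss so a TensorBoard-style dashboard can plot it.
import torch
from torch import nn
from torch.utils.tensorboard import SummaryWriter

model = nn.Linear(10, 1)                               # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
writer = SummaryWriter(log_dir="runs/sz_demo")         # placeholder log directory

x = torch.randn(256, 10)                               # fake data for illustration
y = torch.randn(256, 1)

for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    writer.add_scalar("train/loss", loss.item(), step)  # the curve the dashboard displays

writer.close()
```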
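Finally, publishing: a minimal serving sketch of the kind of server code that would sit in the container next to the model, turning an incoming request into a feature vector and returning the prediction. It assumes a scikit-learn-style model saved with joblib and a made-up `/predict` endpoint; SZ's real serving layer could look quite different.

```python
# Hypothetical serving code baked into the container image alongside the model.
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")    # placeholder: the trained model shipped inside the image

class PredictRequest(BaseModel):
    features: list[float]              # the raw inputs the client sends

@app.post("/predict")
def predict(req: PredictRequest):
    x = np.array(req.features).reshape(1, -1)   # request -> feature vector
    y = model.predict(x)                        # the model's prediction
    return {"prediction": y.tolist()}
```

Run it with something like `uvicorn serve:app`, and the container plus model blob live on the same cloud infrastructure as everything else above.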
I've previously worked in data engineering and data scientist roles, and life was a pain before proper MLOps, so this is definitely a lovely homegrown platform if it matches or approaches Microsoft, Amazon, or Google's offerings. Cheers.
PS: This post should probably be moved to the Science and Tech thread.