https://shre.ink/Tobias-Zwingmann-SOT-Approach
Tobias Zwingmann's article introduces the S-O-T Framework, a simple yet powerful method to structure AI observability across three layers: Strategic, Operational, and Technical.
Why watching AI systems hits differently
AI systems, unlike traditional IT applications, evolve over time and interact with the real world. Because of this, monitoring them isn't just about system uptime or speed: it's about ensuring quality, fairness, accuracy, and relevance over time.
Monitoring vs Observability
Zwingmann explains that while monitoring is reactive and based on predefined metrics, observability is proactive and offers deeper insights. It answers why something happened, not just what happened, which is crucial in AI, where errors often stem from data drift or model degradation rather than code bugs.
The four pillars of observability for AI systems
To effectively monitor AI systems, four key pillars are necessary:
Metrics: predefined KPIs such as accuracy, latency, or user engagement.
Logs: system logs that capture real-time model behavior.
Traces: end-to-end visibility of the AI lifecycle, from data input to prediction output.
Data: the foundation of any AI system, including both training and production data, which must be continuously checked for quality, bias, and drift.
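The data pillar's drift check can be sketched in a few lines. The function below is a hypothetical, deliberately simple stand-in for proper drift tests (such as PSI or a Kolmogorov-Smirnov test): it flags drift when the production mean of a feature moves more than a chosen number of training standard deviations from the training mean. The function name and threshold are illustrative assumptions, not from the article.

```python
import statistics

def mean_shift_drift(train_values, prod_values, threshold=2.0):
    """Flag drift when the production mean deviates from the training
    mean by more than `threshold` training standard deviations.
    A minimal sketch; real systems would use PSI, KS tests, etc."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    shift = abs(statistics.mean(prod_values) - mu) / sigma
    return shift > threshold

# Stable feature: production values stay near the training distribution.
print(mean_shift_drift([10, 11, 9, 10, 12], [10, 11, 10]))   # False

# Shifted feature: production values have clearly moved.
print(mean_shift_drift([10, 11, 9, 10, 12], [25, 27, 26]))   # True
```

In practice a check like this would run on a schedule against each monitored feature, with alerts feeding the operational layer described below.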
The S-O-T Framework - a simple approach for complex systems
To bring structure to AI observability, Zwingmann introduces the S-O-T Framework, a three-layered approach that simplifies what needs to be monitored and how.
Strategic (S): covers business-level metrics and strategic goals. Think about metrics that align with business outcomes, such as customer retention or fraud detection rates. These are often reviewed monthly or quarterly by leadership.
Operational (O): focuses on the day-to-day performance of the AI system. It includes tracking model accuracy, error rates, and alerts about drift or anomalies. These are the metrics product managers or data leads would track weekly.
Technical (T): drills down into infrastructure and runtime metrics: GPU usage, latency, data pipeline failures. It's the domain of DevOps and MLOps, reviewed daily or in real time.
How often should you review or update these metrics?
Each layer of the S-O-T Framework operates on a different cadence:
Strategic: Monthly or quarterly
Operational: Weekly
Technical: Daily or real-time
This staggered review ensures coverage across all dimensions without overwhelming teams.
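One lightweight way to operationalize the staggered schedule is a small registry mapping each layer to its metrics and cadence. The structure below is a hypothetical sketch (the metric names and keys are illustrative, not prescribed by the article), but it shows how little machinery a first S-O-T setup needs.

```python
# Hypothetical S-O-T registry: each layer owns its metrics
# and its review cadence, mirroring the staggered schedule above.
SOT_REGISTRY = {
    "strategic":   {"cadence": "quarterly",
                    "metrics": ["customer_retention", "fraud_detection_rate"]},
    "operational": {"cadence": "weekly",
                    "metrics": ["model_accuracy", "error_rate", "drift_alerts"]},
    "technical":   {"cadence": "real-time",
                    "metrics": ["gpu_usage", "latency_p95", "pipeline_failures"]},
}

def metrics_for(layer):
    """Return the metric names tracked at a given S-O-T layer."""
    return SOT_REGISTRY[layer]["metrics"]

print(metrics_for("operational"))
```

A registry like this can start as a plain dictionary or spreadsheet tab and later back a dashboard, without changing the layering.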
Getting started - practical next steps
Zwingmann provides actionable advice to implement the S-O-T Framework in real projects.
Start small: Choose one AI use case and define just a few metrics per layer (S, O, T) to begin with.
Use simple tools: No need for complex platforms. Use spreadsheets, open-source tools, or dashboards like Grafana to track metrics.
Scale gradually: Expand your observability stack as your AI portfolio grows. Don't over-engineer early on.
Learn and Iterate: Observability is an evolving practice. Be open to refining metrics and adjusting tools as you gain experience.
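To make the "use simple tools" advice concrete: a spreadsheet-compatible CSV log is enough to start tracking S-O-T metrics. The snippet below is a minimal sketch (the helper name and column layout are assumptions); it writes dated observations tagged with their layer, which a spreadsheet or a Grafana CSV data source could then chart.

```python
import csv
import datetime
import io

def log_metric(writer, layer, name, value):
    """Append one dated metric observation as a CSV row."""
    writer.writerow([datetime.date.today().isoformat(), layer, name, value])

# Write to an in-memory buffer here; in practice, open a file in append mode.
buf = io.StringIO()
w = csv.writer(buf)
w.writerow(["date", "layer", "metric", "value"])
log_metric(w, "operational", "model_accuracy", 0.93)
log_metric(w, "technical", "latency_p95_ms", 180)
print(buf.getvalue())
```

When the portfolio outgrows a flat file, the same rows can be shipped to a proper time-series store without changing what gets measured.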
Conclusion
Keeping AI alive and effective isn't just a technical challenge; it's a strategic and operational one. By embracing observability and adopting the S-O-T Framework, organizations can ensure their AI systems remain aligned with business goals and continue to deliver value over time.