Reliability Engineering in the Cloud

Posted By: lucky_aut

Published 08/2025
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 kHz, 2 Ch
Language: English | Duration: 5h 2m | Size: 1.2 GB

This video course teaches engineering strategies for promoting chaos engineering practices, observability and monitoring techniques, disaster recovery exercises, reliability metrics, fast data-driven decision-making, and the application of Gen AI/LLMs.

Participants will learn how to increase the reliability and scalability of their systems in the cloud, improve the efficiency of their operations, and gain valuable skills to enable faster incident response. They will also learn to automate operations to improve time to restore and time to detect to the greatest possible extent using modern cloud services, AI/LLMs, and best-in-class tools. The course will help participants understand how operational agility, lean principles, and chaos experimentation can foster a culture of continuous improvement built on collaboration and knowledge sharing. Given the lack of literature and established frameworks in this domain, learners will benefit from practical, domain-specific approaches and examples they can apply directly within their organizations and teams.

Check out Mariya and Carlos's book Reliability Engineering in the Cloud: Strategies and Practices for AI-Powered Cloud-Based Systems (Addison-Wesley, 2025) for an even deeper dive.

Learn How To

Set an enterprise-wide CRE strategy for thousands of applications and dependencies
Evaluate methods to increase the reliability and scalability of systems in the cloud
Enable faster incident response while automating operations to improve time to restore and time to detect as far as possible
Recognize how operational agility and chaos experimentation foster a culture of continuous improvement built on collaboration and knowledge sharing between teams
Build effective strategies that promote chaos engineering practices, observability and monitoring techniques, disaster recovery exercises, reliability metrics, and fast data-driven decision-making, supported by practical examples of techniques and tooling
Identify domain-specific approaches and review examples you can apply within your organizations and teams

Who Should Take This Course

Software engineers and development teams responsible for designing, deploying, or maintaining cloud-native applications, with a focus on improving system reliability, scalability, and fault tolerance.
Enterprise and technology leaders who are seeking to enhance the resilience of their cloud infrastructure, streamline operational efficiency, and reduce incident response times through modern reliability engineering practices.