Developing Resilient IT Systems with Chaos Engineering and Automated Recovery Protocols
Keywords:
Chaos Engineering, Automated Recovery, IT Resilience, Fault Injection, Self-Healing Systems, Cloud Reliability, Site Reliability Engineering (SRE), Fault Tolerance
Abstract
Modern IT infrastructure demands high availability, robustness, and fault tolerance, especially in distributed cloud-native systems. As digital ecosystems grow increasingly complex, traditional manual recovery mechanisms prove insufficient. This paper investigates how Chaos Engineering combined with automated recovery protocols enhances system resilience by proactively identifying vulnerabilities and swiftly recovering from disruptions. We explore recent advancements, methodologies, and implementations, illustrating their effectiveness in real-world deployments. Through the integration of controlled fault injection and intelligent self-healing mechanisms, organizations can achieve near-zero downtime and ensure operational continuity even in adverse scenarios.