Site Reliability Engineer in AI Resume Example

Professional ATS-optimized resume template for Site Reliability Engineer in AI positions

Jane Doe

Senior Site Reliability Engineer – AI & Machine Learning

Email: jane.doe@example.com | Phone: (555) 123-4567 | LinkedIn: linkedin.com/in/janedoe | GitHub: github.com/janedoe

PROFESSIONAL SUMMARY

Detail-oriented Senior Site Reliability Engineer specializing in AI infrastructure, model deployment, and scalable systems. Over 8 years of experience optimizing AI pipelines, automating critical operations, and improving system reliability in fast-paced environments. Experienced in deploying large-scale ML models, implementing observability frameworks, and ensuring high availability for AI-driven applications. Focused on using automation and cloud technologies to deliver robust AI solutions.

SKILLS

- **Hard Skills:**

- AI/ML Model Deployment & Optimization

- Cloud Platforms: AWS, GCP, Azure

- Kubernetes & Docker Containerization

- CI/CD Pipelines & Automation (Jenkins, GitLab CI, ArgoCD)

- Infrastructure as Code (Terraform, Pulumi)

- Monitoring & Observability (Prometheus, Grafana, Datadog, ELK Stack)

- Distributed Systems & Microservices Architecture

- Data Pipeline Orchestration (Apache Airflow, Kubeflow)

- SLO/SLA Management & Incident Response

- **Soft Skills:**

- Strong analytical and problem-solving abilities

- Cross-functional collaboration with Data Science and Engineering teams

- Effective communicator with technical and executive stakeholders

- Continuous improvement mindset

- Agile methodologies and DevOps culture adoption

WORK EXPERIENCE

*Senior Site Reliability Engineer – AI Infrastructure*

*InnovateAI Labs | San Francisco, CA*

June 2022 – Present

- Led the migration of AI model deployment pipelines to a Kubernetes-based platform, reducing deployment time by 35%.

- Built and maintained scalable data ingestion and processing pipelines supporting real-time AI inference workloads using Apache Kafka and Airflow.

- Implemented comprehensive monitoring for ML pipelines, reducing latency-related issues and raising system uptime to 99.99%.

- Collaborated with ML teams to optimize resource utilization, resulting in a 20% cost reduction for cloud infrastructure.

- Developed automated incident response scripts and runbooks, accelerating resolution times during outages.

*Cloud Operations & SRE Engineer – Machine Learning Platforms*

*DataX Solutions | New York, NY*

March 2018 – May 2022

- Architected end-to-end deployment solutions for ML models with Kubernetes, Docker, and Terraform, ensuring repeatability and security.

- Maintained high availability of AI services running on GCP AutoML and Cloud Run, managing autoscaling policies for fluctuating workloads.

- Established alerting and dashboarding using Prometheus and Grafana, improving proactive issue detection and shortening time to resolution.

- Automated onboarding of new models and data pipelines, reducing manual intervention by 40%.

- Supported AI research teams by developing reproducible CI/CD workflows integrated with GitLab and Jenkins.

*Junior Infrastructure Engineer – Data Science Ops*

*FastData Analytics | Boston, MA*

July 2015 – February 2018

- Assisted in deploying and maintaining ML model repositories, ensuring reproducibility and version control.

- Implemented containerization practices to streamline environment setup for data scientists.

- Managed data pipeline workflows and supported model validation processes across cloud environments.

EDUCATION

**Bachelor of Science in Computer Science**

Massachusetts Institute of Technology (MIT)

*2011 – 2015*

CERTIFICATIONS

- Certified Kubernetes Administrator (CKA) – 2023

- Google Cloud Professional Data Engineer – 2022

- DevOps Foundations (AWS Certified DevOps Engineer – prelims) – 2021

PROJECTS

- **Real-Time AI Monitoring Platform:** Developed a custom observability platform leveraging Prometheus, Grafana, and machine learning anomaly detection models to predict system failures before incidents occurred.

- **Automated Model Deployment Pipeline:** Led a project to build a CI/CD framework automating deployment, rollback, and versioning of ML models, reducing manual steps by 60%.

- **Scalable Data Ingestion System:** Designed a streaming data architecture with Kafka, Spark, and Flink, supporting real-time analytics for NLP applications with 99.999% uptime.

TOOLS & TECHNOLOGIES

- Kubernetes, Docker, Helm

- Terraform, Pulumi

- Prometheus, Grafana, Datadog, ELK Stack

- Apache Kafka, Spark, Flink

- ML Workflow Orchestration: Kubeflow, Airflow

- CI/CD: Jenkins, GitLab CI, ArgoCD

- Cloud Platforms: AWS (SageMaker, EKS, Lambda), GCP (Vertex AI, Cloud Composer), Azure (ML Studio)

LANGUAGES

- Python (Advanced, ML & Automation)

- Bash & PowerShell

- SQL & NoSQL (BigQuery, DynamoDB)
