Job Title: AI DevOps Engineer

 

Location:

[Location] (remote, hybrid, and flexible work options available)


Reports to:

Head of DevOps or DevOps Manager

 

Role Purpose

We are looking for an AI DevOps Engineer to design, implement, and maintain robust, high-performance infrastructure supporting advanced machine learning applications. You will help ensure that AI models are efficiently built, tested, and deployed at scale while adhering to best practices around operational reliability and security. This role offers on-site, remote, or hybrid work options based on your preference. You will be a key contributor, working closely with diverse teams to support the end-to-end lifecycle of cutting-edge AI solutions that address real-world challenges and drive tangible impact.

 

Company Overview

[Company Name] is a forward-thinking organization in the [Industry] sector, recognized for pioneering AI-focused products and services. We nurture a culture that values collaboration, integrity, and continuous growth. Named among the top employers in the industry, we maintain an inclusive environment that encourages skill development and exposure to modern technology. Our teams are dedicated to staying at the forefront of AI research and development, making [Company Name] an ideal place for individuals who enjoy delivering high-quality innovations with measurable results.

 

Key Responsibilities

  • Build and Maintain AI Infrastructure

    • Set up, configure, and manage cloud-based environments (AWS, GCP, or Azure) tailored for machine learning workloads.

    • Leverage containerization (Docker) and orchestration (Kubernetes) to streamline deployment and scaling of AI applications.

  • Design and Optimize MLOps Pipelines

    • Develop continuous integration (CI) and continuous delivery (CD) pipelines for machine learning models, ensuring smooth code integration and minimal deployment downtime.

    • Automate testing frameworks for AI models, incorporating checks for model accuracy, data integrity, and performance metrics.

  • Performance Monitoring and Troubleshooting

    • Monitor resource utilization (GPU/CPU usage, memory, and network) to pinpoint bottlenecks and improve infrastructure efficiency.

    • Diagnose and resolve production incidents, collaborating with Data Scientists, ML Engineers, and other stakeholders to implement quick and lasting fixes.

  • Infrastructure as Code (IaC)

    • Use Terraform, AWS CloudFormation, or equivalent to define reproducible, scalable, and secure infrastructure.

    • Maintain version-controlled configurations that simplify environment rollouts and updates.

  • Security and Compliance

    • Enforce security best practices across AI environments, from data encryption at rest and in transit to identity and access management.

    • Ensure systems comply with relevant standards or regulations (e.g., GDPR, HIPAA) where applicable.

  • Collaboration and Documentation

    • Work closely with Data Scientists and Machine Learning Engineers to understand model requirements, experiment tracking, and deployment strategies.

    • Document processes, workflows, and system architectures for transparency and consistency among teams.

    • Present findings, metrics, and recommendations to leadership and other technical groups.

  • Innovation and Best Practices

    • Stay informed about the latest AI DevOps and MLOps methods, tooling, and industry trends.

    • Identify potential improvements to existing workflows and propose enhancements to tools or technologies used.

 

Required Skills and Qualifications

  • Educational Background

    • Bachelor’s degree (or higher) in Computer Science, Engineering, or an equivalent technical field, or demonstrable professional experience.

  • DevOps Expertise

    • Proficiency in version control (Git) and continuous integration tools (Jenkins, GitLab CI/CD, or similar).

    • Experience orchestrating containerized workloads using Docker and Kubernetes.

  • AI/Machine Learning Domain Knowledge

    • Experience working with at least one major ML framework (TensorFlow, PyTorch, or scikit-learn).

    • Familiarity with data preprocessing, model training, and model evaluation workflows.

  • Cloud Services and Infrastructure

    • Hands-on experience with AWS, GCP, or Azure for AI and big data workloads (e.g., EC2, S3, IAM, GCS, Cloud Dataflow).

    • Comfort with Infrastructure as Code (Terraform, AWS CloudFormation) and ability to manage production-level environments at scale.

  • Automation and Scripting

    • Strong scripting abilities in Python, Bash, or similar languages.

    • Ability to automate repetitive tasks, integrations, and deployments for both infrastructure and ML pipelines.

  • Security and Reliability

    • Understanding of best practices for security, including network segmentation, encryption, and secure configuration.

    • Familiarity with incident response processes and building high-availability systems.

  • Problem-Solving and Communication

    • Capable of diagnosing production-level issues under time constraints, coordinating closely with various teams to find solutions.

    • Skilled at explaining technical concepts to both technical and non-technical audiences, facilitating effective collaboration.

 

Preferred Qualifications

  • Experience with GPU-based workloads, distributed ML training, or HPC environments.

  • Knowledge of data versioning tools (e.g., DVC) and experiment tracking tools (e.g., MLflow).

  • Exposure to Agile or Scrum methodologies for project management.

Perks and Benefits:

Clearly outline the benefits and perks of the role.

How to Apply:

End with a strong call to action encouraging candidates to apply. Include a direct link to the application page and provide contact information for further queries.

Please ensure each job description includes all relevant information in compliance with local, state, and national laws. This includes:

 

  • Salary Information: Provide a clear salary range to maintain transparency and meet legal requirements.

  • Privacy Policies: Protect candidate privacy by following all applicable data protection and privacy laws.

  • Equality & Non-Discrimination: Include an equal opportunity statement to uphold our commitment to a diverse, inclusive workplace that does not discriminate based on race, gender, age, disability, or any other protected characteristic.

  • Accessibility: Make reasonable accommodations available for candidates with disabilities and include information on how they can request assistance throughout the hiring process.

  • Environmental and Social Responsibility: If your company has sustainability initiatives or community engagement programs, mentioning them briefly can attract candidates who prioritize working for socially responsible employers.

  • Transparent Hiring Process: Briefly explain the hiring process (e.g., “Our interview process typically includes three stages: an initial screening, a technical interview, and a final interview”) to help candidates know what to expect.
