
DevOps Interview Questions: Infrastructure, CI/CD, and What They're Really Testing

DevOps interviews test how you think about systems, not just which tools you know. Prepare for the questions that matter.

interview prep · devops · cloud computing · tech career

DevOps interviews have a reputation for being hard to prepare for. Part of the reason is that "DevOps engineer" covers an enormous range of roles — from infrastructure-heavy SRE positions at large tech companies, to CI/CD pipeline specialists at mid-size SaaS businesses, to cloud migration engineers at enterprises moving off legacy systems. The tools and priorities vary significantly.

But there is a consistent thread underneath what DevOps interviewers are trying to assess: how you reason about systems. Can you trace a failure back to its root cause? Can you think about a deployment pipeline end-to-end? Do you understand the tradeoffs between reliability and velocity? Do you know when to automate and when to leave something manual? These reasoning patterns show up in every DevOps interview, regardless of the specific tech stack.

This guide covers the questions you will actually get asked, what they are really testing, and how to answer in a way that demonstrates the thinking patterns experienced DevOps engineers exhibit.

What DevOps Interviewers Are Really Evaluating

Before diving into specific questions, it is worth understanding the evaluation axes DevOps interviewers are using.

Systems thinking: Do you understand how components interact? When something breaks, do you trace the problem upstream and downstream, or do you fix the obvious symptom and move on? Interviewers test this with failure scenario questions and postmortem-style discussions.

Operational maturity: Have you shipped things that other people depend on? Have you been on-call and dealt with real incidents? Candidates who have only worked in purely greenfield or purely development contexts often struggle with the operational questions — what breaks at scale, how you handle toil, what a good alert looks like.

Collaboration and communication: DevOps is intrinsically cross-functional. Do you work effectively with developers who do not know much about infrastructure? Can you explain a deployment architecture to a non-technical stakeholder? How do you handle tension between developer velocity and infrastructure stability?

Automation instinct: The cultural core of DevOps is automating toil. Interviewers want to see that you reflexively ask "can this be automated?" when describing manual processes — not as a buzzword, but as a genuine operational instinct.

Infrastructure and Systems Questions

"Walk me through what happens when a Kubernetes pod crashes."

This is a diagnostic question. The interviewer is not testing whether you know the crash loop backoff sequence from memory — they want to see whether you can reason through a system's behavior end-to-end.

A strong answer covers: the container exits (with what exit code — exit codes matter for distinguishing application errors from OOM kills from healthcheck failures), the kubelet detects the state change and restarts the container according to the pod's restart policy, CrashLoopBackOff kicks in (exponential backoff, capped at five minutes by default), liveness probe failures can themselves trigger restarts, and the observability tools you would reach for (kubectl describe pod, kubectl logs --previous, events in the namespace).
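Exit codes are worth knowing cold. By convention, a process killed by a signal exits with 128 plus the signal number — which is how you tell an ordinary application failure apart from an OOM kill, since the kernel's OOM killer sends SIGKILL (signal 9) and the container exits 137. A quick local illustration:

```shell
# An ordinary application failure: the process chooses its own exit code.
sh -c 'exit 1'
echo "app error exit code: $?"     # 1

# A process killed by SIGKILL — the same signal the OOM killer sends —
# exits with 128 + 9 = 137, the code you see on OOM-killed containers.
sh -c 'kill -KILL $$'
echo "SIGKILL exit code: $?"       # 137
```

Quoting "137 means it was killed, likely by the OOM killer; check kubectl describe for an OOMKilled status" is a strong, concrete signal in this question.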

The interviewer is also watching to see if you ask clarifying questions: "What kind of pod — stateless deployment, statefulset? Is this in production or dev? Are there resource limits set?" That narrowing-down instinct is a signal of real troubleshooting experience.

"How would you design the networking for a three-tier web application on AWS?"

This is an architecture question that tests how you think about security, availability, and traffic flow simultaneously. A well-structured answer covers:

  • Public subnet for the load balancer (ALB) with internet gateway
  • Private subnets for the application tier (ECS, EKS, or EC2), preferably across multiple AZs
  • Private subnets for the data tier (RDS with Multi-AZ for HA)
  • Security groups as the primary access control layer — principle of least privilege, only allowing traffic from the layer above
  • VPC flow logs for observability
  • NAT gateway for outbound internet from private subnets if needed

Then add nuance: "In a real implementation I'd also think about VPC endpoints to avoid traffic leaving the AWS network for S3 and DynamoDB calls, and whether we need a WAF in front of the ALB."

Strong candidates build the answer incrementally and mention tradeoffs. Weak candidates recite a list of AWS services without explaining why they are used or what the alternatives are.

"What is the difference between a rolling deployment and a blue-green deployment?"

A deceptively simple question that tests whether you understand deployment strategies and when each is appropriate.

Rolling deployment: gradually replace instances of the old version with the new version. At any point during the deployment, some traffic is hitting the old version and some is hitting the new. Simpler to implement, uses fewer resources, but risks mixed-version states that can cause issues with database migrations or breaking API changes.

Blue-green deployment: run two identical environments (blue = current, green = new). Switch traffic all at once (or gradually via weighted routing). Instant rollback (flip back to blue), no mixed-version state, but requires double the infrastructure resources during the cutover window.

Canary release is a third option worth mentioning: route a small percentage of traffic (1%, 5%) to the new version, monitor, expand gradually. Better for risk management but more complex to implement and monitor.

The right answer also includes when you would choose each: "Blue-green when a clean cutover is critical and you can afford the resource cost. Rolling when you need to save infrastructure cost and the changes are backward-compatible. Canary when you are making a change with high uncertainty and want real traffic validation before full rollout."
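The blue-green mechanic is easy to demonstrate locally. One traditional on-host form is the symlink flip: both releases exist side by side, and "switching traffic" is a single atomic pointer change (in a cloud setup, the pointer is a load balancer target group or a DNS weight instead). The directory and file names here are purely illustrative:

```shell
# Two complete releases exist side by side — that is the resource cost.
mkdir -p releases/blue releases/green
echo "v1" > releases/blue/app
echo "v2" > releases/green/app

# "Blue" is live: the server serves whatever 'current' points at.
ln -sfn releases/blue current

# Cutover: one atomic symlink swap moves all traffic to green...
ln -sfn releases/green current
cat current/app                      # v2

# ...and rollback is the same operation in reverse — no rebuild, no redeploy.
ln -sfn releases/blue current
cat current/app                      # v1
```

The property to call out in the interview is that there is never a mixed-version state: every request sees either v1 or v2, and rollback is as fast as the cutover.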


CI/CD Pipeline Questions

"Tell me about a CI/CD pipeline you built or significantly improved."

Behavioral question, but technical in content. The interviewer wants to know: what was the starting state, what did you change, what were the outcomes, and what did you learn?

A strong answer is specific about the tooling (GitHub Actions, Jenkins, GitLab CI, CircleCI, ArgoCD, etc.), the specific problems you solved (slow builds, flaky tests, manual deployment steps, lack of environment parity), and the measurable outcomes (build time reduced from 45 to 12 minutes, deployment frequency increased from weekly to daily, rollback time from 2 hours to 5 minutes).

What makes the answer strong is not the specific tools — it is demonstrating that you understood the system well enough to diagnose bottlenecks, had the judgment to prioritize the right improvements, and measured the outcomes in terms that connect to developer experience and business velocity.

"How do you handle secrets in a CI/CD pipeline?"

This is both a security question and a practical one. Common approaches: environment variables injected at runtime, HashiCorp Vault with dynamic secrets, AWS Secrets Manager or GCP Secret Manager with IAM-based access, Kubernetes secrets (with the caveat that they are only base64-encoded, not encrypted, by default — encryption at rest in etcd is a separate configuration step).
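The base64 caveat is worth being able to demonstrate, because candidates sometimes conflate encoding with encryption. Anyone who can read the Secret object can recover the plaintext in one command (the password here is obviously a placeholder):

```shell
# Kubernetes stores Secret values base64-encoded: an encoding, not encryption.
secret=$(printf '%s' 'hunter2' | base64)
echo "stored:  $secret"                    # aHVudGVyMg==
printf '%s' "$secret" | base64 -d; echo    # hunter2 — trivially reversed
```

The interview-ready framing: base64 protects against accidental display, not against an attacker; for actual confidentiality you need etcd encryption at rest, RBAC on who can read Secrets, or an external secrets manager.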

Strong answers also cover what not to do: secrets in environment variables that get logged, secrets in code repositories, secrets passed as build arguments that get baked into Docker image layers.

The follow-up is often: "What happens if a secret gets accidentally committed to the repo?" A strong answer covers: rotate immediately, scan git history (git-secrets, truffleHog, GitHub's own secret scanning), check if the secret was used in any build logs, and conduct a brief incident review to understand how it happened and prevent recurrence.
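The key point in that follow-up — a secret deleted from HEAD still lives in history — can be sketched with plain git, no extra tooling. The credential below is AWS's documented example access key ID, not a real one, and the AKIA pattern matches the shape of AWS access key IDs:

```shell
# Build a throwaway repo with a "leaked" credential in its history.
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.email dev@example.com && git config user.name dev
echo 'AWS_KEY=AKIAIOSFODNN7EXAMPLE' > config.env
git add config.env && git commit -qm "add config"

# Deleting the file from HEAD does NOT remove it from history...
git rm -q config.env && git commit -qm "remove config"

# ...which is why rotation comes first. The key is still right there:
git log -p --all | grep -E 'AKIA[0-9A-Z]{16}'
```

Purpose-built scanners (git-secrets, truffleHog) do this across many credential patterns and entropy checks, but being able to show the underlying mechanic is what demonstrates understanding.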

Linux and Scripting Questions

"A server is running slowly. How do you diagnose it?"

The diagnostic methodology question. The answer demonstrates whether you have a systematic approach or just try random things.

A structured approach: start with the four main resource categories — CPU, memory, disk I/O, network. Commands for each:

  • CPU: top, htop, mpstat, vmstat — is there a specific process consuming CPU? Is it user-space or kernel-space? Is there CPU steal (common on shared cloud instances)?
  • Memory: free -h, vmstat, /proc/meminfo — is there memory pressure? Swap usage? Is the OOM killer active (dmesg | grep -i oom)?
  • Disk: iostat -x 1, iotop — high I/O wait? Which processes are doing the I/O? Is it reads or writes?
  • Network: netstat -s, ss -s, iftop — are there connection table exhaustion issues? Packet drops? Retransmits?
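A first-pass triage of the categories above can be scripted so you are not typing commands one at a time during an incident. This sketch deliberately uses only tools present on nearly any Linux box — the specialist tools listed above (iostat, iotop, iftop) give more depth where they are installed:

```shell
#!/bin/sh
# First-pass triage: one coarse snapshot per resource category.

echo "== load (1/5/15 min) =="
cut -d' ' -f1-3 /proc/loadavg

echo "== cpu (top 5 by %cpu) =="
ps aux --sort=-%cpu | head -6

echo "== memory =="
grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree' /proc/meminfo

echo "== recent OOM kills (needs dmesg access) =="
command -v dmesg >/dev/null && dmesg 2>/dev/null | grep -i 'killed process' | tail -3

echo "== disk usage over 80% =="
df -h | awk 'NR==1 || $5+0 > 80'   # header plus any filesystem over 80% full
```

Mentioning that you keep a script like this (or a runbook of equivalent commands) is itself a signal of operational maturity.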

Then: "Once I have identified the resource bottleneck, I'd look at the application logs alongside the resource timeline to correlate the slowdown with specific requests or events."

"Write a script to find the top 10 processes by memory usage."

Scripting questions test whether you can write functional code quickly, not perfectly optimized code. A correct answer: ps aux --sort=-%mem | head -11 (the extra line accounts for the header row; note that --sort is a GNU/procps ps option). Bonus points for mentioning that a ps snapshot may not reflect transient spikes and that you might prefer top -b -n 1 for single-pass batch output or /proc/<pid>/status for more detail.
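If the interviewer asks you to tidy it up, a slightly more polished version — still a sketch — prints just the columns that matter, using the standard ps aux field positions (USER is $1, %MEM is $4, RSS is $6, the command starts at $11):

```shell
# Top 10 processes by resident memory, with tidy columns.
# RSS is reported by ps in KiB; the awk body converts it to MiB.
ps aux --sort=-%mem | awk '
    NR == 1  { printf "%-10s %6s %9s  %s\n", "USER", "%MEM", "RSS(MiB)", "COMMAND"; next }
    NR <= 11 { printf "%-10s %6s %9.1f  %s\n", $1, $4, $6/1024, $11 }'
```

The follow-the-header discipline (knowing which ps column is which) is exactly the kind of small fluency these questions probe.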

The Incident and Reliability Questions

"Tell me about an incident you handled. What happened, and what did you do?"

The most important behavioral question in a DevOps interview. The interviewer is evaluating: how you diagnose under pressure, how you communicate during an incident, your depth of technical understanding, and your post-incident habits.

A good incident story has: a clear timeline (when did you first notice, what were the initial symptoms, what was the business impact), a diagnosis narrative (what you checked, what you ruled out, what the root cause turned out to be), a resolution (temporary mitigation vs. permanent fix), and a retrospective (what the incident revealed about the system's weaknesses and what changed afterward).

The best candidates show that they learned something from the incident that changed how they build or operate systems. That learning loop — incident → understanding → systemic improvement — is the signature of a mature DevOps engineer.


Preparing for the System Design Component

Many senior DevOps interviews include a system design question specific to infrastructure: "Design a logging and monitoring system for a microservices architecture" or "How would you design the infrastructure for a globally distributed API with 99.99% SLA?"

These questions are not primarily about knowing the exact right answers. They are about demonstrating a structured thinking process: clarify requirements first (traffic volume, geographic distribution, latency SLA, cost constraints), identify the key design decisions and tradeoffs, build up from fundamentals, and be explicit about what you are optimizing for.

Preparing your CV to reflect this kind of systems thinking — not just tool proficiency — is the foundation for these conversations. Tools like NextCV can help you frame your experience in terms of the outcomes and reasoning patterns that DevOps interviewers are looking for, rather than just the list of technologies you have used.

The underlying truth of DevOps interviews is that the tools are largely interchangeable — Kubernetes vs. ECS, Jenkins vs. GitHub Actions, Prometheus vs. Datadog. What is not interchangeable is the mindset: operational ownership, systems thinking, and the discipline to build things that are observable, reliable, and maintainable.

Ready to build your tailored CV?

Paste any job posting and get a CV optimized for that specific role — in seconds.

Try NextCV free