Recruo

AI-native engineering hire

Hire AI evals engineers who can actually tell you when your LLM is regressing.

Most candidates know the eval frameworks. Very few have ever built a golden dataset with rater-disagreement resolution or calibrated LLM-as-judge against human agreement. We shortlist the 3–5 who have — in 5 business days, at a 15% success fee.

Scope the role on a 30-min call and we deliver a 3-candidate shortlist in 5 business days. Every candidate is pre-screened by AI and reviewed by a human recruiter with an evals background. 90-day replacement guarantee.

Why this role, why now

What an AI evals engineer actually does in 2026

Two years ago, eval engineering was a research hobby. Now it is the job standing between your latest prompt change and a silent 9% quality regression that leaks into production for six weeks before a customer complains. The 2025–2026 shift everyone in the field calls eval-driven development — writing the eval before the feature, running it on every PR, treating a failed eval like a failed unit test — crossed from research practice into product engineering standard across every scale-up we work with.

Concretely, an evals engineer owns four things. First, the golden-reference datasets: curated input/output pairs with documented disagreement resolution, versioned like code, re-labelled when the task definition drifts. Second, the automated eval harness: programmatic checks for exact match, regex, embedding similarity, JSON-schema validity, latency, cost, and rubric-based LLM-as-judge flows, wired into CI and a dashboard the product team actually looks at. Third, human-in-the-loop rating: rubric design, inter-rater agreement (Cohen's kappa or Krippendorff's alpha), and calibration of the LLM-judge against those human scores so you can trust automation beyond a small held-out set. Fourth, adversarial and safety evals: jailbreak suites, prompt-injection, bias probes across protected characteristics, and the red-team harness you need if your product falls under Annex III of the EU AI Act.
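To make the second item concrete, here is a minimal sketch of three of the programmatic checks named above (exact match, regex, and a simplified stand-in for JSON-schema validity). The names `Case`, `exact_match`, `matches_regex`, `valid_json_with_keys` and `pass_rate` are illustrative assumptions, not part of any framework mentioned on this page, and the JSON check only verifies required top-level keys rather than a full JSON Schema.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One golden-dataset row (hypothetical structure for this sketch)."""
    prompt: str
    output: str    # model output under test
    expected: str  # golden reference answer


def exact_match(case: Case) -> bool:
    # Whitespace-insensitive at the edges only; inner text must match exactly.
    return case.output.strip() == case.expected.strip()


def matches_regex(pattern: str) -> Callable[[Case], bool]:
    compiled = re.compile(pattern)
    return lambda case: compiled.search(case.output) is not None


def valid_json_with_keys(required: set[str]) -> Callable[[Case], bool]:
    # Simplified stand-in for schema validation: the output must parse as a
    # JSON object and contain every required top-level key.
    def check(case: Case) -> bool:
        try:
            parsed = json.loads(case.output)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and required.issubset(parsed)
    return check


def pass_rate(cases: list[Case], check: Callable[[Case], bool]) -> float:
    """Fraction of cases for which the check returns True."""
    if not cases:
        raise ValueError("need at least one case")
    return sum(check(c) for c in cases) / len(cases)


if __name__ == "__main__":
    cases = [
        Case("2+2?", '{"answer": "4"}', '{"answer": "4"}'),
        Case("2+2?", "four, probably", '{"answer": "4"}'),
    ]
    print(pass_rate(cases, exact_match))
    print(pass_rate(cases, valid_json_with_keys({"answer"})))
```

In a real harness each check would report per-example results so that a failing case can be traced back to the dataset row that caught it.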

Demand reflects the shift. The UK AI Safety Institute's open-source Inspect AI framework, released in 2024 and now at v0.3+, has become the de facto backbone for safety-grade evals in UK and EU AI teams; the OpenAI Evals repo has 15K+ GitHub stars; Humanity's Last Exam, the benchmark from the Center for AI Safety and Scale AI published in early 2025, reset what frontier eval design looks like. Every Series B AI-adjacent company we have talked to in the last six months is either hiring an evals engineer or has just made the hire and is already asking for the second. The bottleneck is not supply of framework-literate candidates; it is supply of candidates who have actually operated an eval loop against a real product team under deadline.

One honest distinction up front: an evals engineer is not an evals researcher. The researcher writes papers on benchmark design and publishes at NeurIPS or ICLR. The engineer ships the harness, argues with the PM about the rubric, resolves the six ambiguous labels the raters disagreed on, and owns the green/red light on your Friday release. You almost certainly need the second, not the first, until you are at ~40 AI engineers and actively shipping foundation-model work.

How we source

How Recruo sources evals engineers specifically

This role confuses generalist recruiters more than any other in our catalogue. Job boards return a blend of traditional QA engineers who wrote a single LLM eval once, and ML researchers who can describe HELM and BIG-bench but have never versioned a dataset against a product team's Friday deploy. Both profiles miss. Our filter is built to separate the production operator from the framework tourist inside the first screen.

We source across six channels specific to evals engineering: open-source contributor graphs for Inspect AI (AISI's framework), OpenAI Evals, Ragas, TruLens, LangSmith integrations, and Langfuse; authors on the HELM, BIG-bench, MMLU-Pro and Humanity's Last Exam leaderboards and reproducibility reports; Hugging Face dataset maintainers for eval-focused datasets with sustained downloads; Papers with Code contributors on robustness, faithfulness, and adversarial-eval benchmarks; the Anthropic and OpenAI public red-teaming and model-card communities; and a private network of 640+ CEE AI engineers built by Nikita during his time at Neurons Lab, where the shop delivered 80+ AI projects including several safety-eval and bias-audit engagements for European clients.

Every candidate goes through a 14-minute AI technical interview that probes operator signals, not framework recall: 'walk me through the last time your eval suite caught a regression — what did the delta look like and how did you decide it was signal, not noise?', 'how do you calibrate an LLM-as-judge rubric against human raters when the raters themselves disagree 18% of the time?', 'tell me about a golden dataset you owned — how did you handle ambiguous examples and concept drift?'. The AI asks adaptive follow-ups; a human recruiter reviews the transcript and scores before a shortlist lands in your inbox. Candidates passing our evals filter in 2025-Q4–2026-Q1 had a median of 2.5 years of post-2023 production evals experience and a 93% interview-pass rate at our clients.
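The first interview question above turns on telling signal from noise. One common way to frame that, shown here as a sketch rather than as the method any candidate or Recruo uses, is a paired bootstrap over per-example score differences between a baseline run and a candidate run. The function names, the 2,000 resamples and the 95% interval are assumptions made for this illustration.

```python
import random


def bootstrap_delta_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(candidate - baseline).

    `baseline` and `candidate` are per-example scores on the SAME eval cases,
    in the same order. Returns (mean_delta, ci_low, ci_high).
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need equal-length, non-empty paired scores")
    rng = random.Random(seed)  # fixed seed so the gate is reproducible
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high


def is_regression(baseline, candidate, **kwargs):
    """Flag a regression only when the whole interval sits below zero."""
    _, _, high = bootstrap_delta_ci(baseline, candidate, **kwargs)
    return high < 0


if __name__ == "__main__":
    old = [1.0] * 100
    new = [1.0] * 80 + [0.0] * 20  # 20 cases newly failing
    print(bootstrap_delta_ci(old, new))
    print(is_regression(old, new))
```

Pairing matters: comparing the two runs case by case removes the variance that comes from some eval cases simply being harder than others.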

The last layer is role-specific: every shortlisted evals engineer must have owned at least one eval loop that blocked a production release — meaning a concrete example where they flagged a regression, the team rolled back or fixed forward before shipping, and the outcome was documented. We verify via a combination of public artifacts (OSS PRs, blog posts, conference talks, model cards) and a reference call with the previous engineering manager. Candidates who have only built evals in a research context do not make the shortlist for production-engineering roles; we route those to a separate pipeline for applied-research teams who want that profile explicitly.

Placed talent

A recent placement, anonymised

Senior AI evals engineer, Bucharest-based · Placed 2026-Q1

Outcome: Shortlisted in 5 business days. Client interview pass: first round with two additional deep-dives on rubric design. Signed offer in 9 days from shortlist. Still in role (3 months in at time of writing); her harness has already blocked 2 model upgrades that regressed on the demographic-parity dimension.

  • Built the bias + jailbreak eval harness at a UK AI safety startup (Series A, Annex III product) covering 14 output dimensions: accuracy, faithfulness, refusal calibration, demographic parity, jailbreak resistance, prompt-injection resistance, latency p95, cost per output, toxicity, PII leakage, hallucination rate, sycophancy, instruction-following drift, and JSON-schema validity.
  • Contributor to **Inspect AI** (AISI's open-source eval framework) — merged 4 PRs covering custom scorer patterns and rubric-based grading; co-authored a cookbook entry on calibrating LLM-judges against human raters.
  • Designed the golden-reference dataset workflow adopted company-wide: 2-rater labelling with Cohen's kappa ≥ 0.78 gate, adjudication protocol for sub-threshold examples, monthly drift audit against freshly-labelled stratified sample.
  • Prior role: ML engineer at a Romanian fintech running a credit-scoring model; that experience is how she knew what a production release gate actually has to look like.
  • Daily working language: English (C1, verified in our interview); Romanian and French native.
  • Working setup: hybrid from Bucharest, attended onsite in London once per quarter for rater-calibration workshops.
  • B2B contractor model (SRL in Romania); total comp to client €84K/yr vs London-local €128K equivalent for a comparable senior evals engineer.

Profile composed from 2 real placements in this role in 2025-Q4–2026-Q1 plus one near-placement (offer declined for geographic reasons). Personally identifying details anonymised per GDPR Art. 5. Salary figures are averaged across all three.
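The 2-rater kappa gate in the workflow above can be sketched in a few lines. This assumes the 0.78 threshold is applied per labelling batch and that every example the two raters disagree on goes to adjudication; the placement description does not spell out either detail, and the function names are hypothetical.

```python
from collections import Counter

KAPPA_GATE = 0.78  # threshold quoted in the placement description


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need equal-length, non-empty label lists")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def batch_passes_gate(rater_a, rater_b, gate=KAPPA_GATE):
    return cohens_kappa(rater_a, rater_b) >= gate


def adjudication_queue(rater_a, rater_b):
    """Indices of examples the two raters disagreed on."""
    return [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]


if __name__ == "__main__":
    a = ["pass", "pass", "fail", "fail"]
    b = ["pass", "pass", "fail", "pass"]
    print(cohens_kappa(a, b))        # observed 0.75, chance 0.5
    print(batch_passes_gate(a, b))
    print(adjudication_queue(a, b))
```

Kappa corrects raw agreement for chance, which is why a batch with 75% raw agreement can still fall well short of the gate.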

Hiring difficulty

Benchmarks we track

Evals engineering is the second-hardest AI role we hire for in 2026. The scarcity is not framework familiarity — most inbound candidates can name Inspect AI, Ragas and LangSmith — it is the operator instinct to know when an eval is measuring something real versus rewarding pattern-match artifacts.

  • CV → AI screen pass rate: 19%. Source: Recruo internal (n=143 inbound CVs, 2025-Q4–2026-Q1)
  • AI screen → human shortlist pass rate: 42%. Source: Recruo internal (n=27 AI-screen passes, 2025-Q4–2026-Q1)
  • Shortlist → offer rate at client: 78%. Source: Recruo internal (n=9 shortlists delivered, 2025-Q4–2026-Q1)
  • Median time-to-shortlist: 5 business days. Source: Recruo internal (n=9 engagements, 2025-Q4–2026-Q1)
  • UK market median time-to-hire (AI safety / evals roles): 84 days. Source: Hays UK AI Roles Salary Guide, 2026 edition, AI safety subset (accessed 2026-04-12)
  • CEE salary delta vs UK-local: 34–44% lower. Source: Recruo placements (n=3 evals roles) cross-referenced with DOU 2026-Q1 senior ML survey and BestJobs RO 2026-Q1 data

The 19% CV→screen pass rate is higher than for LLM engineers: the pool of self-identified evals candidates is smaller and slightly more self-selecting, since "evals engineer" is not yet a title every junior claims. But the 42% AI-screen→shortlist rate is where the real filter happens: more than half of framework-literate candidates cannot answer the rater-disagreement-resolution question, and those candidates tend to ship harnesses that look thorough and measure nothing. The 78% shortlist→offer rate is the highest in our catalogue; when we do surface a production-ready evals engineer, clients almost always close.
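For readers who want to check how the stage rates combine, the arithmetic is short. This sketch uses only figures quoted on this page; the helper names are illustrative, and treating the two screening stages as independent multipliers is a simplifying assumption.

```python
def compound_rate(*stage_rates: float) -> float:
    """Multiply per-stage pass rates into an end-to-end pass rate."""
    result = 1.0
    for rate in stage_rates:
        result *= rate
    return result


def pct_lower(local_cost: float, remote_cost: float) -> float:
    """How much lower remote_cost is than local_cost, as a fraction of local_cost."""
    return (local_cost - remote_cost) / local_cost


if __name__ == "__main__":
    # 19% of CVs pass the AI screen; 42% of those reach the human shortlist.
    cv_to_shortlist = compound_rate(0.19, 0.42)
    print(f"CV -> shortlist: {cv_to_shortlist:.1%}")

    # 143 inbound CVs at a 19% pass rate lines up with the n=27 AI-screen passes.
    print(f"Expected AI-screen passes: {143 * 0.19:.1f}")

    # EUR 84K versus the EUR 128K London-local equivalent from the placement above.
    print(f"Salary delta: {pct_lower(128_000, 84_000):.1%}")
```

Roughly one inbound CV in twelve reaches a shortlist, and the example placement sits at the lower end of the 34–44% salary range.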

Reviewed by

Oleh Datskiv

CEO & Co-founder

Oleh is CEO of Recruo and an AI engineer with 7 years of experience. Most recently he was Associate AI Lead at N-iX (2024–2026), leading GenAI/ML R&D prototypes; before that he did production computer vision and robotics work at GlobalLogic and SoftServe (including MBZIRC 2020). NeurIPS 2020 workshop co-author; MSc in Data Science from Ukrainian Catholic University. He personally reviews every evals-engineer shortlist before it reaches you.

Book a 30-min discovery call

Scope one open AI evals engineer role and get a 3-candidate shortlist in 5 business days. £0 upfront, 90-day replacement guarantee.