Question 1

What is the difference between an evals engineer and a traditional QA engineer for AI products?

Accepted Answer

Scope of what "correct" means. QA engineers test deterministic systems — given X, the output should equal Y. Evals engineers test stochastic systems — given X, the distribution of outputs should satisfy a rubric, and the rubric itself is fuzzy. A good evals engineer comes from one of three backgrounds: an ML engineer who got dragged into shipping, a senior QA lead who picked up rubric design and calibration, or an applied researcher who joined product. All three work; none of them are interchangeable with generic QA talent. We screen for the transition, not the origin story.
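To make the deterministic-versus-stochastic distinction concrete, here is a minimal sketch of a distribution-level check. Everything in it is a hypothetical stand-in: `toy_model` simulates a model that answers correctly about 90% of the time, `rubric_is_four` is a deliberately crisp rubric, and the 0.85 bar is an arbitrary illustration, not a recommended threshold.

```python
import random
from typing import Callable


def pass_rate(
    generate: Callable[[str, random.Random], str],
    rubric: Callable[[str], bool],
    prompt: str,
    n_samples: int = 200,
    seed: int = 0,
) -> float:
    """Sample the system n_samples times and return the share of outputs passing the rubric."""
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    outputs = [generate(prompt, rng) for _ in range(n_samples)]
    return sum(1 for out in outputs if rubric(out)) / n_samples


def toy_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: correct roughly 90% of the time."""
    return "4" if rng.random() < 0.9 else "5"


def rubric_is_four(output: str) -> bool:
    """A deliberately simple rubric; real rubrics are usually fuzzier than this."""
    return output.strip() == "4"


# A QA-style test would assert one output equals "4".
# An evals-style check asserts that the pass rate over many samples clears a bar.
rate = pass_rate(toy_model, rubric_is_four, "What is 2 + 2?")
meets_bar = rate >= 0.85  # illustrative threshold only
```

The point of the sketch is the shape of the assertion: the unit under test is a pass rate over samples, not a single output.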

Question 2

Do your evals engineers have EU AI Act compliance experience?

Accepted Answer

Yes. We screen for it explicitly for any client whose product falls under an Annex III high-risk category (employment, credit scoring, critical infrastructure, education, law enforcement, migration). Candidates on our shortlist for those roles have working knowledge of Article 9 (risk management), Article 10 (data governance), Article 13 (transparency), and Article 15 (accuracy, robustness, cybersecurity). We do not pretend this is a legal qualification; for that you still need counsel. But your evals engineer is the person who translates the regulation into concrete test dimensions in your harness. We cover the intersection in detail in our EU AI Act hiring checklist.

Question 3

How do you verify a candidate has built golden datasets and calibrated LLM-as-judge flows, not just read about them?

Accepted Answer

Three layers. (1) Our AI interview asks for a walkthrough of a real dataset they owned: labelling protocol, disagreement-resolution steps, the inter-rater agreement numbers (kappa or alpha) they hit, and how they handled concept drift. Someone who has not owned a dataset cannot invent plausible answers, because the follow-ups drill into edge cases. (2) We require a public artifact where possible: an OSS PR to Inspect AI / Ragas / OpenAI Evals, a blog post, a conference talk at an applied-ML venue, or a published model card. (3) We do a reference call with the previous engineering manager focused specifically on rubric design and rater calibration. Candidates who cannot produce evidence in at least two of the three layers do not make the shortlist.
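For readers unfamiliar with the agreement numbers mentioned in layer (1), here is a minimal sketch of one of them, assuming the kappa in question is Cohen's kappa for two raters (for example a human labeller and an LLM judge). The function name and the sample labels are illustrative only; they are not taken from any candidate's dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n

    # Expected agreement if each rater labelled at random with their own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels: a human rater versus an LLM judge on ten items.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

In this example the raters agree on 8 of 10 items, but chance alone would produce 52% agreement, so kappa comes out near 0.58 rather than 0.8. That gap is why a candidate who has calibrated a judge should be able to quote kappa or alpha rather than raw agreement.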

Question 4

Should we hire an evals engineer or add evals responsibilities to our LLM engineers?

Accepted Answer

Honest answer: if you have fewer than 3 LLM engineers and at most one shipped feature, assign evals as a 20% responsibility to your most senior LLM engineer. Above that scale, dedicate the role. The failure mode of the "shared" model is predictable: evals work gets deprioritised the week a feature ships, harness rot accumulates, and by month six you are running evals that pass on inputs nobody actually sends. A dedicated evals engineer cuts that feedback loop from quarters to days. If you are genuinely unsure where you sit on that curve, we will tell you on the intro call; we have turned down two engagements in the last year where the client was better served by hiring a second LLM engineer first.

Question 5

What stacks and frameworks do your evals engineers typically work in?

Accepted Answer

Primary stacks we place into: Python (Inspect AI, OpenAI Evals, Ragas, TruLens, DeepEval), observability and trace-linking (LangSmith, Langfuse, Weights & Biases, Arize Phoenix), dataset tooling (Argilla, Label Studio, custom Streamlit/Gradio apps), CI integration (GitHub Actions, GitLab CI, Buildkite), and human-rating platforms (Surge AI, Scale Rapid, in-house annotator tooling). Comfort reading model-card and red-teaming literature from Anthropic, OpenAI, DeepMind and AISI is a baseline requirement. If you have an unusual stack (e.g. a custom eval runner built on Temporal or a proprietary rater platform), flag it on the intro call. We have placed into stacks more unusual than these, but the timeline may stretch to 8 business days.

Question 6

What salary range should we expect for a senior evals engineer from CEE?

Accepted Answer

Ranges we have placed at in Q1 2026: Poland €72–98K, Ukraine €60–82K, Romania €68–90K (all annual, B2B contractor, senior level with 5–8 years of experience, including at least 2 years on evals specifically). London-local equivalents run £108–138K. The 34–44% delta is consistent across roles and has been stable for four quarters. This is not a quality gap; it is a local-market gap. See our CEE hiring guide for benchmarks across our full role catalogue.

Question 7

How does your pricing compare to Toptal or Proxify for an evals engineer?

Accepted Answer

On an €82K senior evals engineer, Recruo charges a success fee of typically 15% (€12,300), paid once the candidate passes their 90-day mark. Toptal typically charges a marked-up hourly rate that works out to a markup of roughly 50% on an annual-equivalent basis; Proxify uses a different monthly-retainer model, which we cover in detail on our vs Toptal and vs Proxify pages. All three have legitimate strengths. The honest take is that Recruo fits long-term hires into product teams, and Toptal/Proxify fit short-term contract capacity. Evals engineers are almost always long-term hires, which is why we rarely see clients churn from us to a marketplace for this role specifically.

Question 8

Can I hire just one evals engineer, or do I need to commit to a team?

Accepted Answer

Single-role engagements are the default. Most clients start with one evals engineer to validate the flow before scaling. A success fee of typically 15% applies per placement — no retainer, no upfront cost, no minimum commitment. If the first shortlist does not produce a hire, we re-run sourcing at no extra cost; if the second also misses, you owe nothing on Standard, and we refund upfront payments on Hybrid or Retained. We have only hit that second-shortlist point once in this role, and we ended up placing the candidate on an adjacent team.

Hire AI evals engineers who can actually tell you when your LLM is regressing.

Get a shortlist of 3–5 vetted candidates in 5 days
