Research by i10X has uncovered a serious problem with AI-driven recruitment: hiring recommendations for the same candidate can differ by as much as 42 percentage points depending solely on which Large Language Model (LLM) generated the resume. The study analyzed 1,576 data points across 100 profiles evaluated by leading models, including GPT-5.4, Claude Sonnet 4.6, Gemini 3 Pro, and Grok 4.3, and found systemic evaluation biases that skew candidate selection before any human review takes place. Claude Sonnet 4.6 was the strictest evaluator, giving the lowest scores, but it also showed a heavy self-bias, approving 84% of the resumes it had generated and only 42% of those written by GPT. GPT, by contrast, penalized its own writing style by as much as 15 percentage points.
Resumes written by Gemini, meanwhile, won approval across the board, achieving a 94.5% average hire rate across all test groups. The largest scoring gap between evaluators was between GPT and Claude, which differed by as much as 29 points on identical documents. Because a slight drop in score or a “maybe” decision in an automated tracking system usually results in a candidate being rejected without human intervention, this hidden model-to-model friction poses a serious integrity risk for enterprise HR technology. The results show that the choice of algorithm remains a source of arbitrary outcomes as well as compliance and operational weaknesses. B2B procurement executives are therefore being advised to adopt objective multi-model validation techniques to ensure fair, reproducible corporate talent acquisition.
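As a rough illustration of what multi-model validation could look like in practice, the Python sketch below computes per-evaluator hire rates, a self-preference gap in percentage points, and a simple rule that routes a resume to human review when evaluators disagree. The function names, the record layout, and the unanimity rule are assumptions made for illustration; they are not taken from the i10X study, and any data passed in would need to come from an organization's own evaluations.

```python
from collections import defaultdict


def hire_rates(records):
    """Return {(evaluator, author): hire rate} from (evaluator, author, hired) tuples.

    `evaluator` is the model scoring the resume, `author` is the model that
    wrote it, and `hired` is True when the evaluator recommended hiring.
    """
    counts = defaultdict(lambda: [0, 0])  # [hire recommendations, total evaluations]
    for evaluator, author, hired in records:
        counts[(evaluator, author)][0] += int(hired)
        counts[(evaluator, author)][1] += 1
    return {key: hires / total for key, (hires, total) in counts.items()}


def self_preference_gap(rates, evaluator, other_author):
    """Gap in percentage points between an evaluator's approval rate for its
    own resumes and its approval rate for resumes written by `other_author`."""
    own = rates[(evaluator, evaluator)]
    other = rates[(evaluator, other_author)]
    return 100 * (own - other)


def needs_human_review(decisions):
    """Hypothetical routing rule: escalate when evaluators are not unanimous.

    `decisions` maps evaluator name to its hire/no-hire boolean for one resume.
    """
    return len(set(decisions.values())) > 1
```

Under this sketch, a positive gap would suggest an evaluator favors its own writing style and a negative gap would suggest it penalizes it. The unanimity rule is deliberately conservative: any split decision goes to a person rather than being auto-rejected.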
