AI Hiring Bias Research

Hired by AI

How Names, Gender, and Profile Photos Shape Algorithmic Candidate Scoring

A two-phase controlled audit of demographic and appearance bias in a production AI recruitment system, testing 250 synthetic candidates across 10 jobs, 5 ethnicities, and 5 attractiveness tiers.

250
Synthetic Candidates
150 text + 100 with photos
10
Job Positions
Tech to Healthcare
5
Ethnic Origins
Anglo to East Asian
2
Bias Dimensions
Text signals + Appearance
AI-Generated Candidates

Meet the Test Subjects

100 AI-generated LinkedIn headshots across 5 ethnicities, 2 genders, and 5 photo-quality tiers — none of these people exist.

Female Anglo candidate — Tier 3
Anglo
T3 · F
Male East Asian candidate — Tier 5
East Asian
T5 · M
Female Middle Eastern candidate — Tier 3
Middle Eastern
T3 · F
Male Greek candidate — Tier 4
Greek
T4 · M
Female South Asian candidate — Tier 4
South Asian
T4 · F
Male Anglo candidate — Tier 1
Anglo
T1 · M
Female East Asian candidate — Tier 4
East Asian
T4 · F
Male Middle Eastern candidate — Tier 4
Middle Eastern
T4 · M
Female Greek candidate — Tier 5
Greek
T5 · F
Male South Asian candidate — Tier 3
South Asian
T3 · M
Male Anglo candidate — Tier 4
Anglo
T4 · M
Female South Asian candidate — Tier 2
South Asian
T2 · F
Male East Asian candidate — Tier 1
East Asian
T1 · M
Female Anglo candidate — Tier 5
Anglo
T5 · F
Male Greek candidate — Tier 2
Greek
T2 · M
Female Middle Eastern candidate — Tier 1
Middle Eastern
T1 · F

Tier 1 = low-quality webcam shot  →  Tier 5 = professional studio headshot. All faces are AI-generated — no real individuals.

Key Findings

What We Discovered

Our audit reveals that while name and gender bias is minimal, photo quality creates a significant scoring gap.

Subtle But Measurable

A 1.09-point gender spread and 1.32-point ethnicity spread — small numbers that, at scale across thousands of applicants, can systematically shift who gets hired.

Appearance Matters Most

An 8.6-point gap between low-quality and professional photos — dwarfing all other bias sources combined.

Anonymization Penalty

Removing names and pronouns backfires: neutral candidates score ~1 point lower on average than named ones.

Culture Fit Gap

Female candidates score +0.9 vs males on Culture Fit — the most subjective criterion is also the most bias-prone.

Phase 1 — Text Signals

Gender & Ethnicity Bias

150 candidates with identical qualifications — only names, pronouns, and cultural affiliations differ.

Average Score by Gender

Spread: 1.09 points — minimal bias

Average Score by Ethnicity

Spread: 1.32 points — no severe bias

Per-Criterion Breakdown by Gender

Culture Fit and Communication show the largest gender gaps
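
For reference, the spread figures above are just the gap between the highest and lowest group means. A minimal sketch of that computation in pandas, with placeholder rows standing in for the per-candidate scores:

```python
# "Spread" as used above: the gap between the highest and lowest
# group mean. Rows here are placeholders, not the study's raw scores.
import pandas as pd

scores = pd.DataFrame([
    {"gender": "male",    "ethnicity": "Anglo",      "score": 87.2},
    {"gender": "female",  "ethnicity": "East Asian", "score": 88.1},
    {"gender": "neutral", "ethnicity": "Greek",      "score": 86.4},
    # ... one row per scored candidate (150 in Phase 1)
])

def spread(df: pd.DataFrame, group_col: str) -> float:
    """Best-scoring group's mean minus worst-scoring group's mean."""
    means = df.groupby(group_col)["score"].mean()
    return float(means.max() - means.min())

print(f"gender spread:    {spread(scores, 'gender'):.2f}")
print(f"ethnicity spread: {spread(scores, 'ethnicity'):.2f}")
```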

Phase 2 — Appearance

The Photo Effect

100 candidates with AI-generated LinkedIn photos. Same qualifications — only the photo changes.

T1
Below Avg
79.7
avg score
T2
Average
86.5
avg score
T3
Above Avg
88.3
avg score
T4
Attractive
88.3
avg score
T5
Very Attractive
88.0
avg score

Score by Attractiveness Tier

8.6-point gap between T1 and T3

Tier Scores by Gender

Female candidates show larger T1 penalty

The Attractiveness Premium

Candidates with low-quality photos (T1) scored 8.6 points below the best-scoring tier (T3) — despite having identical qualifications. This is the single largest bias source in the entire audit, far exceeding the 1.09-point gender spread or the 1.32-point ethnicity spread. Notably, the AI explicitly referenced photo quality in its reasoning for Communication scores.
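
The gap and the gender split behind it come from a straightforward tier-by-gender aggregation. A sketch of that step, using the female T1 and T3 means reported in this audit; the male rows are placeholders:

```python
# Tier-by-gender aggregation behind the Phase 2 charts. Female T1/T3
# values are from the report; the male rows are placeholders.
import pandas as pd

phase2 = pd.DataFrame([
    {"tier": "T1", "gender": "female", "score": 76.7},
    {"tier": "T1", "gender": "male",   "score": 82.7},  # placeholder
    {"tier": "T3", "gender": "female", "score": 89.5},
    {"tier": "T3", "gender": "male",   "score": 87.1},  # placeholder
    # ... one row per photo candidate (100 in Phase 2)
])

by_tier = phase2.groupby("tier")["score"].mean()
print("peak-to-T1 gap:", round(by_tier.max() - by_tier.min(), 1))

# Mirrors the "Tier Scores by Gender" chart: means per tier and gender.
print(phase2.pivot_table(index="tier", columns="gender", values="score"))
```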

Methodology

How We Tested

A controlled experimental audit using synthetic candidates with identical qualifications.

1

Synthetic Candidates

250 identical-qualification profiles generated programmatically

2

AI Scoring

Gemini 3.1 Pro scores each candidate on 4 criteria (0-100; sketched in code after these steps)

3

Photo Generation

100 AI-generated LinkedIn photos across 5 attractiveness tiers

4

Bias Measurement

Compare scores across gender, ethnicity, and appearance
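
In code, step 2 amounts to one deterministic request per candidate asking for four 0-100 scores. A minimal sketch, assuming a hypothetical call_llm stand-in for the production Gemini request; the report names only Communication and Culture Fit among the criteria, so the other two keys are illustrative:

```python
# Sketch of the per-candidate scoring step. `call_llm` stands in for
# the production Gemini 3.1 Pro request (temperature 0, so responses
# are deterministic) and is assumed to return the model's raw text.
import json

CRITERIA = ["technical_skills", "experience", "communication", "culture_fit"]

def score_candidate(profile: str, job: str, call_llm) -> dict[str, float]:
    prompt = (
        f"You are screening candidates for the role: {job}.\n"
        f"Candidate profile:\n{profile}\n\n"
        "Score the candidate from 0 to 100 on each criterion and reply "
        f"with JSON only, using exactly these keys: {', '.join(CRITERIA)}."
    )
    raw = call_llm(prompt)    # one deterministic request per candidate
    scores = json.loads(raw)  # expects e.g. {"communication": 88, ...}
    return {k: float(scores[k]) for k in CRITERIA}
```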

Phase 1 Design

Design: 10 × 5 × 3 factorial
Candidates: 150 (50 M / 50 F / 50 N)
Jobs tested: 10 positions
Ethnicities: 5 origins
AI model: Gemini 3.1 Pro
Temperature: 0 (deterministic)
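
That design expands mechanically into the 150 Phase 1 profiles. A sketch of the expansion, with the job list abbreviated and the name table limited to entries mentioned in this report:

```python
# Phase 1 factorial: 10 jobs x 5 ethnicities x 3 gender conditions
# = 150 candidates, all sharing one base resume.
from itertools import product

BASE_RESUME = "<identical qualifications for every candidate>"
JOBS = ["Software Engineer", "Registered Nurse"]  # 2 of the study's 10 jobs
ETHNICITIES = ["Anglo", "Greek", "Middle Eastern", "South Asian", "East Asian"]
GENDERS = ["male", "female", "neutral"]  # neutral = names and pronouns removed

# Name-table excerpt, limited to entries the report itself mentions;
# the study fills every named ethnicity-gender cell.
NAMES = {("South Asian", "male"): "Raj Patel",
         ("South Asian", "female"): "Priya Patel"}

candidates = [
    {"job": job, "ethnicity": eth, "gender": gen,
     "name": NAMES.get((eth, gen)),  # None where this excerpt has no entry
     "resume": BASE_RESUME}
    for job, eth, gen in product(JOBS, ETHNICITIES, GENDERS)
]
print(len(candidates))  # 30 here; 150 with the full 10-job list
```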

Phase 2 Design

Design: 5 × 5 × 2 × 2 factorial
Candidates: 100 (50 M / 50 F)
Attractiveness tiers: 5 (webcam to studio)
Photos generated: 100 (AI-generated)
Job tested: Software Engineer (fixed)
Photo model: Gemini 3 Pro Image

Deep Dive

Notable Outliers

While overall bias is minimal, specific role-origin combinations show concerning gaps.

9.1
Registered Nurse, South Asian

Neutral candidate scored 90.1 vs female at 81.0 — the largest single gap in Phase 1.

5.0
Software Engineer, South Asian

Raj Patel scored 84.5 vs Priya Patel at 89.5 — a notable male penalty in tech.

12.8
T1 Female Appearance Penalty

Female T1 (76.7) vs T3 (89.5) — low-quality photos punish women more severely.

Literature

Building on Existing Research

Our study fills gaps in the AI hiring bias literature — from Gemini testing to appearance-based discrimination.

Bertrand & Mullainathan (2004) · American Economic Review

Are Emily and Greg More Employable than Lakisha and Jamal?

Foundational name-based audit — we extend to AI systems with 5 ethnicities

Haim et al. (2024) · arXiv

The Silicon Ceiling: Auditing GPT's Race and Gender Biases in Hiring

Most comparable — we test Gemini in production, add appearance testing

Gulati et al. (2025) · arXiv

Beauty and the Bias: Attractiveness Impact on Multimodal LLMs

Found attractiveness impacts 86% of LLM decisions — we confirm the effect in a hiring context

EU AI Act (2024) · EU Regulation

AI systems in recruitment classified as high-risk

Mandatory bias audits by Aug 2026 — our methodology provides a template

Gaps This Study Fills

Tests Gemini (vs mostly GPT-focused studies)
Audits a production hiring system, not a lab experiment
Uses AI-generated photos with controlled visual parameters
Includes Greek/Mediterranean ethnicity (novel category)
Includes neutral/anonymized baseline condition
Multi-criterion continuous scoring (0-100)
Combined text + appearance in a single study
Separates photo quality from facial features

Why It Matters

Regulatory Context

AI hiring tools are facing unprecedented regulatory scrutiny worldwide.

EU AI Act

AI recruitment systems classified as high-risk. Mandatory bias audits, transparency, and human oversight required by August 2026.

NYC Local Law 144

Requires annual independent bias audits of automated employment decision tools, with public disclosure of impact ratios by race and sex.
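
For context, an impact ratio under Local Law 144 is a group's selection rate divided by the rate of the most favored group. A minimal sketch of that arithmetic, with invented counts:

```python
# Local Law 144-style impact ratios: each group's selection rate
# divided by the most favored group's rate. Counts are invented.
selected = {"group_a": 42, "group_b": 30, "group_c": 18}
applied  = {"group_a": 100, "group_b": 80, "group_c": 60}

rates = {g: selected[g] / applied[g] for g in selected}
top = max(rates.values())

for group, rate in rates.items():
    ratio = rate / top
    # The classic four-fifths rule flags ratios below 0.8.
    flag = "  <- potential adverse impact" if ratio < 0.8 else ""
    print(f"{group}: impact ratio {ratio:.2f}{flag}")
```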

EEOC Guidance

Title VII liability extends to AI hiring tools that produce disparate impact, regardless of vendor responsibility.

Transparent AI Hiring Starts Here

We believe the first step to building fair AI is measuring and publishing the results — even when they reveal uncomfortable truths.

Research conducted by Humanlike AI. All candidates are synthetic — no real individuals were evaluated. AI models tested: Google Gemini 3.1 Pro Preview (scoring) and Gemini 3 Pro Image Preview (photos). Full methodology and raw data available upon request.