Software Engineer Intern, Microsoft AI
Redmond, WA –
- Doubled GDPval-AA task completion scores from 40% to 80% for Copilot Cowork and Copilot Tasks by diagnosing eval quality gaps and improving prompt and workflow coverage.
- Diagnosed intermittent failures in the eval infrastructure and cut the run failure rate by 70%, distributing execution across time windows to eliminate resource contention.
- Built an internal leaderboard to surface Copilot eval score trends across model versions, system prompts, and skills, enabling the team to catch capability regressions before they reach production.