First impressions play a big role in human interaction, but rarely do they play a more influential role than in hiring. The hiring process is full of first impressions: from the first time a recruiter sees a resume, to a candidate's final interview with a hiring manager. Every first impression the candidate makes informs the evaluator's perception of their responses and experiences. It is even estimated that 30% of interviewers make their decision about an interviewee within the first five minutes of the interview.
At HireVue, we remove this sort of “first impression” bias from the hiring process by turning recorded video interviews into validated pre-hire assessments; the entire context of the candidate’s response is evaluated objectively with machine learning. Predicting competencies and personality traits from video is therefore of prime interest to the HireVue Data Science team.
HireVue's Director of Data Science, Lindsey Zuloaga, explains why HireVue predicts competencies using video interviews and game-based challenges.
As with many machine learning questions, it all comes down to getting the proper training data. Upon discovering the ChaLearn First Impressions dataset, we were excited by the potential. The dataset includes 10,000 clips (average duration 15s) extracted from more than 3,000 different YouTube high-definition (HD) videos of people facing a camera and speaking in English. Amazon Mechanical Turk workers (human evaluators) were given some training, then shown pairs of videos to compare on the Big Five traits (which person is more open, agreeable, etc.), along with an interview flag (which person they would rather invite in for a job interview).
The Big Five personality traits are a common way to classify personality: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (OCEAN). Psychologists usually measure these traits with multiple-choice questionnaires, but because gathering video and administering those tests would both be time-consuming and expensive, the creators of this dataset decided to use YouTube videos and get personality traits from human evaluators.
These comparisons were repeated across many different video pairs, and the pairs were shown to several evaluators. Some fancy math was then used to translate these comparisons into an overall score for each video on each of the six measures (Big Five + interview progression).
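To give a feel for how pairwise "which person is more X" judgments can become a single score per video, here is a minimal sketch using the Bradley-Terry model, one common technique for this kind of data. This is an assumption for illustration only: the post does not say which method the dataset creators used, and the function name, toy comparisons, and iteration count below are all hypothetical.

```python
import numpy as np

def bradley_terry_scores(n_items, comparisons, n_iter=200):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Assumes every item wins at least one comparison; otherwise the
    maximum-likelihood estimate does not exist and this sketch breaks down.
    """
    wins = np.zeros((n_items, n_items))
    for winner, loser in comparisons:
        wins[winner, loser] += 1
    games = wins + wins.T          # how often each pair was compared
    total_wins = wins.sum(axis=1)  # total wins per item

    p = np.full(n_items, 1.0 / n_items)
    for _ in range(n_iter):
        pair_sum = p[:, None] + p[None, :]
        # Standard minorize-maximize update for Bradley-Terry strengths.
        denom = (games / pair_sum).sum(axis=1)
        p = total_wins / denom
        p /= p.sum()               # strengths are only defined up to scale
    return p

# Hypothetical toy data: three videos, each pair compared four times.
toy_comparisons = (
    [(0, 1)] * 3 + [(1, 0)] * 1
    + [(1, 2)] * 3 + [(2, 1)] * 1
    + [(0, 2)] * 3 + [(2, 0)] * 1
)
scores = bradley_terry_scores(3, toy_comparisons)
print(scores)
```

In this toy example, video 0 wins most of its comparisons and video 2 loses most of them, so the estimated strengths come out in that order. The strengths can then be rescaled to a 0 to 1 range if a normalized score is wanted.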
The ECCV ChaLearn LAP 2016 challenge was a competition for individuals and teams to see how well they could predict the Big Five personality traits from video. Several groups were able to obtain impressive results using cutting-edge data processing techniques and algorithms.
Given our history with looking at human judgment of job candidates, however, we were curious as to what was actually being measured through these “apparent personality traits”. The truth is, even with highly trained human evaluators and consistent conditions and questions, human-evaluated personality assessment can be difficult. Fifteen seconds of a random video clip very likely would not contain the information necessary to assess these traits. What evaluators are then left to rely on is truly a “first impression”, with the information they are gathering informed by what the person looks like and sounds like in the snippet.
While this dataset is described as a “personality dataset”, it actually gives us much more insight into how humans perceive personality than into how people display it. In this case: how do Mechanical Turk workers intuit personality traits from only 15 seconds of video?
To investigate, we used our trained deep learning models to predict age, race, gender, and attractiveness for the subject of each of these videos. These models are trained with self-identified age, race, and gender, and average attractiveness as evaluated by other people. We then looked at Mechanical Turk-generated score distributions for each of the measured attributes for different groups. The results were striking.
In all of these graphs, the x-axis is the normalized score, meaning that when we look at people between 0.9 and 1.0, we are looking at the top 10% of scorers. The y-axis shows the proportion of people who were given that score (the area under the curve is 1).
Below you can see three examples of score distributions. In the graph on the left, we can see that the majority of the population was ranked in the upper half of the score range: most of the people evaluated were scored highly. In the center graph, the distribution is flat. This indicates that the same percentage of people received a score of 0.2 (for example) as received a 0.8. The graph on the right tells us that most of the population received low scores.
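For readers who want to reproduce this kind of graph, the sketch below builds the three shapes from synthetic scores and checks that each density histogram has an area of 1. The scores are randomly generated stand-ins, not the First Impressions data, and the distribution parameters and bin count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic normalized scores in [0, 1]; placeholders for real score data.
examples = {
    "mostly high": rng.beta(5, 2, size=10_000),
    "flat": rng.uniform(0, 1, size=10_000),
    "mostly low": rng.beta(2, 5, size=10_000),
}

bins = np.linspace(0, 1, 11)  # ten equal-width score bins
densities = {}
for name, scores in examples.items():
    # density=True rescales bar heights so that the total bar area is 1.
    density, edges = np.histogram(scores, bins=bins, density=True)
    densities[name] = density
    area = (density * np.diff(edges)).sum()
    upper_half = (scores > 0.5).mean()
    print(f"{name}: area={area:.2f}, share above 0.5={upper_half:.2f}")
```

The share of scores above 0.5 is a quick numeric summary of what the graph shows visually: well above one half for the left-hand shape, near one half for the flat one, and well below one half for the right-hand shape.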
With that context, let’s dig into the data.
Let’s start with age. The score differences here show that older people are seen as more conscientious and less neurotic, which may be seen as positive, but also as less agreeable, less open, less extroverted, and less likely to be given a job interview.
Looking at male/female differences, we see that female scores are generally distributed more towards the higher score end, with the exception of agreeableness.
Score distributions by ethnicity show that white and Asian subjects were consistently rated higher than Black subjects and those in other groups on all six dimensions.
Splitting the attractiveness rating into three tiers, we see what is probably the strongest trend in the data. This is especially interesting because fairness based on looks is not addressed in most processes (being unattractive is not a legally protected class). In the First Impressions dataset, better-looking people are rated higher on everything, below-average-looking people are rated lower, and average-looking people have a pretty flat distribution.
These results are illuminating on several levels. It's well observed that first impressions play a big role in the interviewing process. As mentioned previously, a 2015 study found that 30% of interviewers made their decision about an interviewee within the first five minutes of the interview. More than that, the first impression interviewers form of a candidate greatly influences how they perceive the candidate's responses throughout the interview.
On the data science practitioner side of things, these results are a powerful reminder of the importance of auditing algorithmic assessments for adverse impact to ensure that AI-driven evaluation is not mimicking bias from the training data.
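As one concrete example of such an audit, a common first check is the "four-fifths rule" from US employment selection guidelines: compute each group's selection rate, divide it by the highest group's rate, and flag any ratio below 0.8. The sketch below is illustrative only. It is not a description of HireVue's audit procedure, the group labels and counts are made up, and the function name is hypothetical.

```python
from collections import defaultdict

def adverse_impact_ratios(records):
    """Return each group's selection rate divided by the highest group's rate.

    records: iterable of (group, selected) pairs, where selected is a bool.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [selected, total]
    for group, selected in records:
        counts[group][0] += int(selected)
        counts[group][1] += 1
    rates = {g: sel / total for g, (sel, total) in counts.items()}
    best = max(rates.values())
    if best == 0:
        raise ValueError("no one was selected, so ratios are undefined")
    return {g: rate / best for g, rate in rates.items()}

# Made-up outcomes: 60 of 100 selected in group_a, 30 of 100 in group_b.
toy_records = (
    [("group_a", True)] * 60 + [("group_a", False)] * 40
    + [("group_b", True)] * 30 + [("group_b", False)] * 70
)
ratios = adverse_impact_ratios(toy_records)
flagged = [g for g, r in ratios.items() if r < 0.8]  # below four-fifths
print(ratios, flagged)
```

A flagged ratio is a prompt for closer investigation rather than proof of bias; in practice it is usually paired with statistical significance tests and a review of the features driving the model.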
Unfortunately, none of these results are particularly shocking for those of us combating bias in the hiring space. While humans do add value to the decision-making process, they are often a source of bias. It's incredibly important for recruiters and hiring managers to have an objective evaluation of each candidate that they can check their "first impression" and "gut feeling" against. Properly vetted, AI can be that objective decision support, providing crucial insight so humans can make better, less biased hiring decisions.
Lindsey Zuloaga, PhD, is HireVue’s Director of Data Science. She holds a PhD in Applied Physics from Rice University and leads HireVue’s Data Science team, building the sophisticated machine learning algorithms that analyze video interviews and make hiring fairer. Find her on LinkedIn.