THE BIG HR DATA THEORY

Ben Taylor

In this video, HireVue’s Chief Data Scientist Ben Taylor breaks down the basics of big data and cloud computing, and how HR professionals can use this to make more informed talent decisions.

Watch this on-demand webinar now to learn:

  • What resume modeling is, and how recruiters can use it to avoid hiring bias
  • Quick wins recruiters can apply just by using the data already in their ATS
  • How predictive analytics can be used to identify top talent from resumes and video interviews

Webinar Transcript:

Scott: Please join us in welcoming Ben Taylor, Chief Data Scientist at HireVue.

Ben: This is some specific research I’ve been working on: gold nanoparticle biosensor modeling and optimization. And then I did what every chemical engineer does: I went and worked for Intel and Micron doing photolithography, process control, and yield modeling. Then I took a break from that and went and worked for a hedge fund. We had a million-dollar GPU cluster, did algorithmic stock trading and distributed optimization, and then I ended up going over to HireVue.

And now I do interview prediction for them and we do HR analytics. So this is an example of a chemical engineer getting into HR. Our most recent hire for data science was a physicist. So it’s just a neat cross-pollination that we’re seeing within HR. So HR is starting to get disrupted with analytics and with outside talent coming in and looking at this problem.

So then, on the right, these are some photos of some of the activities that we do in our state. This is Utah in the morning before work, something I personally try to do every week. I have this gear. It’s a nice way to take a break from this stuff. But before I jump into this, there’s a lot of hype around big data. I don’t know if you guys use Google Trends, but I love Google Trends. You can pull up HR analytics, you can pull up any topic you’re interested in, and you can see how it’s trending, when the trend started, and when it’s peaking out. And so, if you did this for big data, which is in blue, you would see that it appears to start around 2011. But the thing that you’re missing is Hadoop. Hadoop is the name of an open-source software stack that has enabled big data, and you’d see it actually starts back in 2007.

So what was happening in 2007, and why did this big data technology need to be created? The question to ask is: what pain was being fixed? So I’m going to do a little experiment right now. Let’s come up with a problem where I give everyone who is listening a sheet of paper. On this paper there are ten e-mail addresses, and next to each e-mail address is a birthday. And I’m going to tell you my e-mail address and I want you to tell me what my birthday is.

If I hand this piece of paper out to everyone that’s listening right now, we would be amazed at how quickly people are able to return back my birthday once I give them my e-mail. It would be sub second. What if I made it a little bit more complicated and I said, “You must start at the beginning of the list and read it in order to the end,” and my e-mail happens to be at the end. Well now it’s a few seconds.

Let’s change that list and let’s make it so it’s a thousand users long. Well, now it’s taking a few minutes. What if it’s a million? What if it’s a billion? Now it’s something that can’t even . . . We can’t return the response for my request. This is an example where we’ve broken the system. And so I know you guys are already thinking, “Well, that’s a dumb example, because we could sort those e-mails out alphabetically. And even if you gave a million e-mails, I could still find the e-mail that I’m interested in and the birthday, because I could sort it alphabetically and I could find it that way.”

And you’re right, because that’s something called indexing. Computers already do that. They sort the data in the database to enable lookup to be that much faster. So let’s make this even harder. Let’s say we hand out 10,000 resumes. So in HR we deal with this a lot. How do we get through the resumes quickly? How do we pick out the right people to phone screen in a timely manner?
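Before we get to the resumes, here is what that indexing idea looks like as a minimal Python sketch, with made-up data: a linear scan versus an indexed lookup, which is essentially what a database is doing when it sorts or hashes a column.

```python
# Toy birthday lookup: a linear scan vs. an index (a Python dict, i.e. a hash table).
# The addresses and birthdays here are made up for illustration.
birthdays = [
    ("jane@example.com", "1984-03-02"),
    ("ben@example.com", "1980-07-14"),
]  # imagine a million of these rows

def linear_scan(email):
    # Reads the list from the beginning: gets slower as the list grows.
    for address, birthday in birthdays:
        if address == email:
            return birthday

index = dict(birthdays)  # build the index once, up front

def indexed_lookup(email):
    # Roughly constant time, no matter how long the list is.
    return index.get(email)

print(linear_scan("ben@example.com"))     # 1980-07-14
print(indexed_lookup("ben@example.com"))  # 1980-07-14
```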

So if you have 10,000 resumes, and I ask you to find me an individual who has an Ivy League education, their GPA is about 3.8, they’ve done an internship, and they like hockey. And if I give you that specific query, you’re going to see that’s not something you can really sort very well. It’s really hard to deal with that, because you can’t just index it. And so, that’s a good example of a big data problem. So to fix this, we’re going to do something called MapReduce. So we’re going to map out a request and then we’re going to reduce it back to a single response or a single result.

So, rather than having one individual go through the 10,000 resumes one by one, looking for the strange query or request that I’ve asked, we’re going to divide that up. So everyone who’s listening has a few resumes and you’ll see when I send out the request, now even though we have tens of thousands of resumes, every individual represents a single computer. They can look at those quickly even though it’s a complex query, and they can aggregate those results back. 
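As a rough illustration of that map-and-reduce idea, here is a toy Python sketch. It assumes the resumes have already been parsed into simple records; the field names are illustrative, and the local worker pool just stands in for a cluster of machines, not for how Hadoop itself is invoked.

```python
# Toy map/reduce over a resume collection. Field names are illustrative only.
from multiprocessing import Pool

IVY = {"Harvard", "Yale", "Princeton", "Columbia", "Brown",
       "Dartmouth", "Cornell", "Penn"}

resumes = [
    {"school": "Yale",    "gpa": 3.9, "internship": True,  "hobbies": ["hockey"]},
    {"school": "State U", "gpa": 3.2, "internship": False, "hobbies": ["chess"]},
]  # imagine 10,000 of these, spread across many machines

def map_step(resume):
    # Each "worker" checks its own small pile of resumes against the query.
    match = (resume["school"] in IVY
             and resume["gpa"] >= 3.8
             and resume["internship"]
             and "hockey" in resume["hobbies"])
    return [resume] if match else []

def reduce_step(partial_results):
    # Aggregate every worker's partial results back into a single answer.
    return [r for partial in partial_results for r in partial]

if __name__ == "__main__":
    with Pool() as pool:  # stands in for a cluster of nodes
        matches = reduce_step(pool.map(map_step, resumes))
    print(matches)
```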

So I can still find out where those resumes are, if they exist. But to make it even more challenging, let’s say that some of you aren’t reliable. You’re not quite listening, you’re actually working on the side or you’re playing with your phone, and you have that resume that I need and you get the request and you fail to execute on it. And that’s something that actually happens with computers, where you’re dealing with unreliable hardware. 

And so Hadoop, that software that we mentioned previously, takes care of all of that. It takes care of the distribution, it distributes the data, and it takes care of the reliability issue, where some nodes that do have the data aren’t reliable. And it does that by replicating the data. So imagine if I took that stack of resumes and I made two or three copies and I divided it up among the group; that would be a pretty robust way for me to send out requests and get responses on a large data set.

So that’s essentially the magic that’s happening within big data. You’re dealing with MapReduce on distributed systems and you’re dealing with this reliability concern. So before we can really appreciate some of the magic that we use now with big data, we need to understand cloud computing. So cloud computing, the funny thing about it is we all use it every day, even if we don’t realize it. So we use cloud computing and a lot of us don’t really appreciate what it is.

If my kids are watching Netflix and I’m streaming Dora, that Dora movie is actually saved somewhere. It’s in a physical location, it’s not just in the cloud. It is coming from an Amazon server in Virginia, or maybe in Oregon, depending on where the data center is that’s servicing my request. And the brilliant thing about cloud computing, the big thing that changed there, was this concept of infrastructure as a service. We’ve all heard of software as a service, but before 2007, you couldn’t just go out and rent 100 servers for one hour to do some analysis. You’d actually have to commit to it and it would be a big expense. You’d spend $10,000 or $100,000 to get the equipment in, and then you’d spend a few weeks provisioning the hardware and the software and getting it all set up. And this is something I personally did when I worked with the hedge fund, where we built a million-dollar supercomputer, and it’s a very intense task.

It took us over a month to get all the hardware, get everything imaged, get it stood up, and then sanity check the software as we added more nodes to the system. So it doesn’t scale very well, and it’s not an efficient way for you to run your business. And then the other concern there, too, is you can’t pause depreciation. We all know computers depreciate like crazy. The iPhones we have, the laptops we’ve purchased, in a few years they’ll be worth less than half of what they were when we purchased them. And in five or ten years, they’ll be worth essentially nothing. So this is something that we’re all aware of. But with cloud computing, I can use the resources and then I can turn them off so I’m not being billed for them, and that’s essentially pausing depreciation.

So, continuing on with the concept of the cloud being used every day, Google Maps. So that’s definitely a cloud application. It’s running on multiple computers. No single computer can hold all of the satellite images that are used when we look at Google Maps, or that route information. So, when we send a request, it’s being routed to the appropriate individual to give us what we want. And then, same with search results when you’re searching for photos in Instagram or something like that, that’s being serviced by a cloud network. It’s really stuff that we use every day, so Dropbox, Amazon, these are all services that run on the cloud.

So something that has happened with this wonderful trend of big data and cloud computing has been this explosion in data variety. We’re all familiar with data variety in HR. HR has some of the worst data out there, as far as industries go. Our data’s unstructured, it’s messy, it’s not consistent. You’ve got performance appraisals, you’ve got inter-rater reliability. The classic example for data variety coming from HR is the resume, where you’ve got things like GPA that we think are predictive, but they’re not consistently in the same place. We have cover letters. Some of them are formatted, some of them aren’t. And what we’re hitting on, which we’ve already touched on briefly, is this concept of structured and unstructured data. A lot of people use the Excel example for structured data. But really, you can think of it as a format where the place for each piece of data has been predefined. So, one way that you can think of it is this clean room on the left, and I know none of our playrooms look like this for anyone listening who has kids. They only look like this when we have visitors coming over. But then on the right, this is what the playrooms usually look like, because they’re being used and things are everywhere.

So you can imagine, if you’re given the data set on the right, that data set essentially represents the resume. The GPA isn’t where I expect it, I can’t find it, maybe it wasn’t included. There are hobbies. Where did they go? It’s just all over the place. The picture on the left, by contrast, shows you where a schema has been predefined. The schema tells us where the data should be. So a database is a classic example of a structured data set, or Excel. But when you go to the right, you start thinking about things like resumes, images, and audio. Then, in the kids’ part of the picture, we have videos. These are things that don’t really fit in Excel; they’re not well-defined. They have to be structured somehow. So we’ll start with resume modeling and we’ll go through this in a little bit more detail, about how we can structure this data and make it useful.
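To make the “clean room versus messy room” contrast concrete, here is a minimal sketch: a predefined schema for candidate data next to the same information buried in free text. The field names are illustrative only, not an actual ATS schema.

```python
# The "clean room": a schema defined up front, so every piece of data has a fixed place.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CandidateRecord:
    name: str
    school: str
    gpa: Optional[float]                      # the slot exists even if the value is missing
    hobbies: List[str] = field(default_factory=list)

# The "messy room": the same information with no schema, buried in free text.
raw_resume = """
Jane Doe -- Yale University. Enjoys hockey and road biking.
Graduated with a 3.9 GPA (Dean's List).
"""
```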

So, the great thing I like about resume modeling is, before we get into that, we can talk about how good humans are at this process. Humans are historically terrible when it comes to screening talent, and it’s because we can be affected both negatively and positively by our own limited personal scope, or whatever bias we may have. So I may be interviewing an individual and they’re doing all right, they have some of the competencies I’m interested in, but I’m not really excited about the individual. I think there could be better individuals out there. And then they mention that they like road biking, and I’m a huge fan of road biking. Let’s say this is something that I do competitively, on a weekly basis. That can change my perception of this individual to the point where those competencies that were weaker are maybe overlooked. And we see this a lot, where people will hire based on likeability, humor or other soft skills that may not translate to actual job performance.

And then it can also happen on the other side, where an individual who is skilled can be disqualified for reasons that aren’t justified, because they’re coming from our own bias. And a sad example that we still see in the hiring process is with resumes. An individual with the first name Jose will send a resume out to multiple employers and wait to see what kind of callbacks he gets on the positions he applied to, and then he’ll send the otherwise identical resume to the exact same employers and get an increase in callbacks, because his first name is Joe instead of Jose.

So there are sad examples like that, where a human will look at the name and make a judgment call based on an individual’s race, whereas when it comes to computer modeling with the resume, the computer is not going to use that in the decision process to decide whether or not this individual should be ranked higher or lower than one of their peers.

And a cool thing, before we get into the details of how we can do resume modeling: there are some really quick wins that can be realized before you jump into the resume model. One example would be: what school did this individual go to? That’s something that’s pretty easy to do in your ATS, your applicant tracking system. You may already have fields for schools. And if you can query that and pull it out, you can just rank that by performance or acceptance rate, and you may see that . . . When we checked this, we saw that Stanford was number one among the schools out there for doing well in an interview. So, compared to your standard acceptance rates, if you went to Stanford, you were more likely to do well in an interview than people who did not. And then in the top-ten list, I believe we saw Harvard and then other schools that we’re familiar with, MIT and Berkeley. And then down below, you saw some of the less-known community colleges coming out towards the tail.
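If your ATS lets you export the school field alongside interview outcomes, that quick win is essentially one group-by. A minimal sketch, with an inline stand-in for the export and hypothetical column names:

```python
import pandas as pd

# In practice this would come from your ATS export; the column names are hypothetical.
df = pd.DataFrame({
    "school":          ["Stanford", "Stanford", "State U", "State U", "Harvard"],
    "interview_score": [92,          88,          71,        64,        90],
})

school_rank = (
    df.groupby("school")["interview_score"]
      .agg(["mean", "count"])
      # On real data you would also drop schools with too few candidates,
      # e.g. .loc[lambda d: d["count"] >= 30]
      .sort_values("mean", ascending=False)
)

print(school_rank.head(10))  # the top-ten list described above
```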

And the other thing that we saw was the resume file extension. We found that “docx” was more predictive of success than “doc.” So an individual using “docx” was more likely to get a good interview score than people who only used “doc.” And then we saw that PDF was more predictive than “docx.” And the neat thing there, which isn’t really a surprise and should just be common sense, is that when you’re applying for a job, it’s important for you to be up-to-date. Using “docx” is an example of being up-to-date. So whether it’s being up-to-date on software or on methods, being up-to-date can help applicants during the hiring process.

And then the other thing we saw: we have a coding platform that we sell with our application. So looking at some of the code data, we wanted to see if there were things that differentiated candidates. Even though they all passed the coding assessment, we wanted to see if there were code elements in there that differentiated the people who were more likely to be accepted from those who weren’t.

And some of the things that we saw, which were also common sense, came back to code readability. So if you didn’t have a space between your import statements, where you’re importing libraries at the beginning of your code, and your main functions, that was a negative indicator. And then we also saw that if you didn’t have spaces between your operations, so if you have an equals sign or you’re doing addition or something and you’re taking out that white space and smashing your code together, that was also a negative indicator.
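The two spacing indicators look something like this in practice (a made-up example, not code from the assessment data):

```python
# Negative indicators: the import smashed right against the function definition,
# and no spaces around the operators.
import math
def circle_area(r):
    return math.pi*r**2


# Positive indicators: a blank line after the imports and spaces around the operators.
import math


def circle_area_readable(r):
    return math.pi * r ** 2
```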

And these are things that really are just common sense. You want your code to be readable, because ideally it’s going to be supported by team members. So those are some quick things that we were able to look at without doing some of the heavier resume modeling. You may be able to see some benefits like that, where you just look at your ATS and see whether there’s something like school, or something else that’s already structured for you, where you can check how well it does.

So we’ve been talking a lot about unstructured data and this is a one-slide representation of what you can do to make sense of that data, and what data scientists will do with it. So this example on the left, this could represent a resume, but it really doesn’t matter what it is. It could be a Facebook profile, it could be LinkedIn, or it could be Twitter. We have this data and somehow we need to clean up this messy room. We need to define where these elements belong.

And so for this case, we’re taking the name, we’re taking the job title, the school, the hobbies, and we’re assigning columns for these specific features. But you can see with school names like Yale or Harvard, they’re still not usable by the computer. And so once we’ve structured the data, we have to do what’s called munging or data wrangling, where you tokenize the data into some numeric format. In the end, the computer needs a numeric input that it can actually make sense of. So this resume is being boiled down into a numeric vector that represents a single resume, and that’s used later for the supervised learning. Like I said, you can apply this to LinkedIn or Facebook. And if you have additional insights with LinkedIn endorsements or something, where you actually want to be placing them into a specific column, you can also do that. And there are resume parsing services that will do this for you, so you don’t necessarily need to complete this entire flow yourself. You can use a resume parsing service, if your ATS doesn’t already provide you one, and parse these resumes out into an Excel-style format.
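Here is a minimal sketch of that munging step: a couple of parsed resumes going from text fields to one numeric vector per candidate. The features are illustrative only, not the actual pipeline.

```python
# Turn structured-but-textual resume fields into numeric vectors.
import pandas as pd

resumes = pd.DataFrame([
    {"name": "Jane Doe", "school": "Yale",    "gpa": 3.9, "hobby": "hockey"},
    {"name": "John Roe", "school": "State U", "gpa": 3.2, "hobby": "chess"},
])

features = pd.get_dummies(resumes[["school", "hobby"]])  # one-hot encode the text fields
features["gpa"] = resumes["gpa"]                         # numeric fields pass through

X = features.to_numpy()  # one numeric vector per resume, ready for supervised learning
print(X)
```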

And then on the flip side, there are generalized prediction services out there, like Data Robot, Amazon, and Google, who will all take that data set and predict an outcome. And so you could essentially build a resume prediction tool without having a data scientist at all, just by outsourcing the unstructured-to-structured step to a resume parsing service, and then outsourcing the prediction step by taking that structured data and predicting an outcome.

So I’m going to show you a neat case study that we did, and it’s actually the main focus of our group right now: interview prediction. So you have an input set, and let’s say this input set, instead of being resumes or LinkedIn profiles or Facebook profiles, is now a recorded video interview, and the target output set is an annual performance review after the individual is hired. So did they exceed expectations? Did they meet them, or were they below? And so the challenge is, “Can I build a model that will predict on future interviews whether this is a top performer or a low performer?” Someone who should be reviewed or not reviewed.

And this problem amuses me, because your classical, traditional statistician doesn’t know where to start. There’s not really a clear starting point, because this is so unfamiliar, and we see that a lot with unstructured data. But from the data science perspective, there’s always a pathway to structure this and get it into a workable format. So, in the end, we need to figure out how to take a video and represent it as a one-dimensional vector, and then we can make use of that.

So it may seem really daunting or challenging, but like any unstructured problem, you just need to cut it up into pieces that you can manage. And so if you piecemeal this out, you have the audio coming out of the video, and then you’ve got the video itself. For the audio, you can do voice-to-text transcription, and then from that you can run the text through a personality model or work with the individual words. And from the audio you can also do some signal processing. From the video, you can get expression and then do similar processing on the expression.

So there’s a pathway for different elements of the video to be structured down into something that’s manageable. We’re all familiar with raw audio. We see it in iTunes, or when we’re listening to a call or something we might see the waveform moving to the side. And with that, you can break it out into utterances. So in this top part on the right, we’ve broken out these different segments into what we call utterances. An utterance can be a sound, a word or a group of words.

So you can think this is an utterance. Those are all utterances. So you can look at your utterance rate, your utterance length, and then we can look at your utterance gap. We can also do some spectrum analysis, which is that second plot on the right. And on the bottom, we can do some repetition analysis. So if you’re being repetitive, it doesn’t matter what language you’re speaking, it’s easy to see that your utterances are repetitive compared to your peers.
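A toy sketch of those utterance statistics, assuming you have the raw audio as a NumPy array: threshold the signal’s energy to find voiced segments, then measure how many utterances there are, how long they run, and the gaps between them. The frame size and threshold here are arbitrary choices, not tuned values.

```python
import numpy as np

def utterance_stats(signal, sample_rate=16000, frame=0.05, threshold=0.02):
    # Average absolute amplitude per short frame, as a crude energy measure.
    frame_len = int(sample_rate * frame)
    n_frames = len(signal) // frame_len
    energy = np.array([np.abs(signal[i*frame_len:(i+1)*frame_len]).mean()
                       for i in range(n_frames)])
    voiced = energy > threshold                    # True where someone is speaking

    # Runs of consecutive voiced frames: each run is one "utterance".
    edges = np.diff(voiced.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if voiced[0]:
        starts = np.insert(starts, 0, 0)
    if voiced[-1]:
        ends = np.append(ends, len(voiced))

    lengths = (ends - starts) * frame              # utterance lengths in seconds
    gaps = (starts[1:] - ends[:-1]) * frame        # pauses between utterances
    duration = n_frames * frame
    return {"utterance_rate": len(starts) / duration,
            "mean_length": lengths.mean() if len(lengths) else 0.0,
            "mean_gap": gaps.mean() if len(gaps) else 0.0}

# Example: one second of silence, one second of "speech" (noise), one of silence.
audio = np.concatenate([np.zeros(16000), 0.1*np.random.randn(16000), np.zeros(16000)])
print(utterance_stats(audio))
```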

And then with text, which we mentioned, that can be put through these secondary models, where we can predict engagement, motivation, distress, aggression or sentiment. Depending on what you’re trying to predict, you can build other models against text. And . . . (audio cuts) predict those outcomes. Then IBM . . . (audio cuts) one that’s well known, where you can take text and throw it up into the Personality Insights engine that IBM has with Watson, and then you can get all these Big Five personality traits coming back that you put into your model . . . (audio cuts)
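As a rough illustration only (this is not the Watson Personality Insights API, just a toy stand-in for the idea of a secondary text model), you could score the transcript text against simple word lists and feed the scores in as features:

```python
# Toy text scoring: hand-made word lists standing in for a real sentiment/personality model.
positive = {"enjoy", "excited", "love", "great", "team"}
negative = {"hate", "boring", "quit", "angry"}

def text_scores(transcript):
    words = transcript.lower().split()
    n = max(len(words), 1)
    return {
        "sentiment": (sum(w in positive for w in words)
                      - sum(w in negative for w in words)) / n,
        "word_count": len(words),
    }

print(text_scores("I love working with a great team and I am excited to learn"))
```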

So video indicators, these are interesting. You can take expressions coming from the interview, whether it’s smiling, confusion or other action units, and you can get a time series from that. And then from that, you do some signal processing and you can get it down to specific numbers that represent different features, and you can pull those into a model. So once you’ve built a model . . . And right now, you can generalize this, so it doesn’t matter if we built the model from a resume or from a Facebook profile. Once we’ve structured the data, so we have columns for every feature and it’s numeric, we can validate that model.
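Before getting into validation, here is a minimal sketch of that expression step, assuming you already have a per-frame expression series (say, a smile probability) from the video; the specific features are illustrative only.

```python
import numpy as np

def expression_features(smile_series, fps=30):
    # Summarize a per-frame smile probability into a few numeric features.
    s = np.asarray(smile_series, dtype=float)
    duration = len(s) / fps
    smiling = (s > 0.5).astype(int)
    onsets = (np.diff(smiling) == 1).sum()         # number of times a smile starts
    return {
        "smile_mean": s.mean(),                    # overall level
        "smile_std": s.std(),                      # variability
        "smile_onsets_per_sec": onsets / duration,
        "smile_max": s.max(),
    }

# Example: a 10-second clip (300 frames) with two brief smiles.
t = np.linspace(0, 10, 300)
series = 0.2 + 0.7 * (np.sin(t) > 0.9)
print(expression_features(series))
```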

So within data science, when you validate, you define a training set, which can be 70 or 80% of your data, and then you train the model. But you need to be careful. You need to make sure that you hold out what we call a validation or hold-out set, so the remainder is not used at all to influence the model or the decisions that you make about how you’re modeling. And then once you do that, you can take that validation set that was untouched, bring it back into the picture, pass it through the model, and gauge how well you’re doing.
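In scikit-learn terms, that hold-out validation looks roughly like this. The data here is synthetic stand-in data, since the point is the workflow, not the numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for the structured feature matrix X and the outcome labels y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42)      # 80% train, 20% untouched hold-out

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Only now does the hold-out set come back into the picture, to gauge the model.
print(roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1]))
```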

And the problem with that is you may say, “Well, that 60 or 70% that was selected, that could have been cherry-picked.” Or maybe . . . (audio cuts) what they like to do is something called k-folding, and I’ll go through this quickly, because I don’t want this to be too technical. But this is useful if you’re thinking about analytics and you’re trying to gauge assessment providers or other HR analytics providers, to see if they’re up-to-date on how they do their validation.

So with this, I can take the data set and cut it up into nine sections. And now the powerful thing about this is I can actually train. I have the nine sections, so the red shows you . . . (audio cuts) I can train, I can validate on that model, and then for the second fold I can do the same. But now I’m validating on the second step down, and I can keep doing this. And what this allows me to do is . . . (audio cuts) by using . . . and I’m predicting all out of sample. So that’s great, because I can come up with these plots. So this is called an ROC curve. I know this is more technical than maybe the general audience wants to get into, but I want to show you some cool things that you can get from doing some of this analysis. (audio cuts)
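The k-folding itself looks roughly like this, again on synthetic stand-in data: every candidate gets scored out of sample exactly once, and those out-of-sample scores are what the ROC curve is drawn from.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=9, shuffle=True, random_state=0)   # the nine sections
scores = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                           cv=cv, method="predict_proba")[:, 1]  # out-of-sample scores

fpr, tpr, thresholds = roc_curve(y, scores)       # points for the ROC curve
print("out-of-sample AUC:", roc_auc_score(y, scores))
```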

Once we’ve built this type of an output, I can find points on this plot and see what the value gain is. So this line on the diagonal, that shows me how well all the interviews do across the board. It shows me what my percent reduction is and what my score is on the bottom. So, if I only start looking at interviews that are above 85, I can say, “Well, what does that look like?” So if I do that, and I come up here to where this dotted dash line intersects the green line, I can see we’ve got 70% retention of our top performers. But if I look at the blue line, I can see that we realized this 75% reduction in the total interview volume. And so if you look at that, you can say, “Well, since we already have 70% retention . . .(audio cuts)performers . . . ” Let’s say sourcing is a trivial thing for this particular data set. I may say, “Well, I’m going to increase sourcing by 30%.” So if I increase sourcing by 30%, I can hit my top performer target, but for the total number of interviews that I would have to watch within this range, I’m actually getting a 67 . . .(audio cuts)
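Reading an operating point off those scores can be sketched like this (synthetic numbers, not the dataset from the webinar): pick a score cutoff and measure top-performer retention, the reduction in interview volume, and the concentration effect described next.

```python
import numpy as np

def operating_point(scores, is_top_performer, cutoff):
    scores = np.asarray(scores, dtype=float)
    top = np.asarray(is_top_performer, dtype=bool)
    kept = scores >= cutoff                        # interviews you still review
    retention = top[kept].sum() / top.sum()        # share of top performers you keep
    reduction = 1 - kept.mean()                    # share of interviews you skip
    concentration = top[kept].mean() / top.mean()  # lift in top-performer rate
    return retention, reduction, concentration

rng = np.random.default_rng(0)
top = rng.random(1000) < 0.10                            # ~10% base rate of top performers
scores = 60 + 20 * top + 10 * rng.standard_normal(1000)  # scores loosely track performance

print(operating_point(scores, top, cutoff=75))
```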

And the other thing that’s happening: if you look in that pool of individuals getting scores above 75, there’s a 300% increase in top performer occurrence. So if you think about it, without a model, top performers are coming through just by chance. If I stumble across a top performer in an interview, it’s all random. They happen to apply, I stumble across them, that’s great. But now, using some of the predictive modeling, we can get a concentration effect, where I’m three times more effective at coming across top performers in the interviews that I’m reviewing.

And these types of results can be applied to resume modeling, if you want to do . . . I was speaking . . . (audio cuts) about doing competency modeling for Facebook. So just from a public Facebook profile, can they predict performance . . . (audio cuts) from Facebook? So if I’m trying to pull in these outside things and come up with some ranking, this is a great way for you to . . . (audio cuts) some of this is language. And in the end, you get sorted candidates coming in . . . (audio cuts)

So you can sort the top performers towards the top and react to them quickly, which is a big problem because if you’re in the market for recruiting great talent, there’s a good chance that they’re already being recruited. And so sometimes you’re competing against other employers to get . . .(audio cuts) [inaudible 00:28:18] them . . . (audio cuts) you react to it. And then on the next slide, yeah, you can sort the candidates. So what can you do? So the riskiest thing you can do and also the most expensive, but it could be the best thing you could do, is you can actually hire an HR data scientist…(audio cuts)

So for a lot of us, that’s a big commitment. We don’t have the justification. We don’t have the momentum to do that. The next step down would be maybe hiring a consultant. And so for . . .(audio cuts) and define a problem and they can deliver on a result, which you can use to get that momentum. And then down towards the end, you have an HR analyst. They don’t quite have a programming background, but they have enough to at least scope some of the problems. And then at the lower end, you have third party tools where you can use Tableau, you can use . . . (audio cuts) to fill in the gaps for you and give you value…(audio cuts)

So in summary, we talked about Hadoop, we talked about MapReduce, we talked about unstructured data and predictive analytics, cloud computing, and “alpha,” which is the term used for your prediction quality. We’re all fighting for more “alpha.” I think I’m out of time. But if anyone has any questions, I know this was kind of rushed, please contact me on Twitter @bentaylordata. And if you’re tweeting, I prefer you use the hashtag #talentinsights. If you have any questions, you can tweet at me directly and I’m more than happy to answer any data science questions related to HR.

We didn’t really talk about sparse data, but that’s another issue with HR data. There’s a lot of [inaudible 00:30:09] information and inconsistencies. But yeah, just feel free to reach out to me on LinkedIn or Twitter and I’m more than happy to answer what questions you might have around HR analytics. So I think I’m out of time.