This post was compiled using my responses to John Paul
Titlow's questions about data transformation for his Fast Company piece The Productivity Snag That Keeps Data
Scientists Up At Night:
After separately compiling serial killer databases for James Alan Fox and Eric Hickey, I built a community where serial homicide experts could share
information in a digital workspace: the Serial Homicide Expertise and Information Sharing
Collaborative (SHEISC). I act as the data coordinator for the
SHEISC initiative, the goal of which is to synchronize and standardize all
serial homicide data collection efforts so the stereotypes, myths and lore that
surround the study of serial murderers can be addressed. The following folks
contributed their entire datasets to the project - Ronald Hinch, Brigadier
GĂ©rard Labuschagne, Janet McClellan, Bryan Nelson, Kenna Quinet and John White.
It was my job to collect these files, merge them with the Fox/Hickey data, and make sense of their sometimes disparate layouts.
Mike Aamodt of Radford University manages the Serial
Killer Information Center. Since 1992, Mike and his students have been amassing
information on serial killers in what is now a database of 3,000 offenders.
Mike and I partnered to combine all of our data and create the first
national serial homicide database for use by law enforcement officers and
academic researchers.
Aside from the seamless merging of the SHEISC and Radford data, almost every other aspect of this joint data project was challenging. First and foremost, convincing folks who operated in information silos to share data they had a vested interest in keeping private took the better part of two years. Since it was an ongoing process, large datasets arrived spread out over a long period of time. Each new file put our current work on the main database on hold while we examined it for new offender names and inconsistencies.
As I mentioned, the layout of each file was different. Some files were messy while others were pristine. Misspellings created several duplicate records that had to be removed before further research could be done. Most times, an author would have a unique variable that we would then add to the main database, which required us to revisit every offender in the main file and search for information on the new variable to fill in the gaps. We also had to take into account each author's motivations for including or excluding certain pieces of information, or for categorizing variables in a certain manner. What we discovered was that the addition of offenders was a subjective process, with higher priority given to offenders who received greater amounts of news coverage.
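All of that folding-in was done by hand in Excel, but the step itself is easy to describe. If I were scripting it today, a rough pandas sketch might look like the following; the column names, file layout and helper names here are hypothetical, just to illustrate renaming a contributor's columns to the main database's conventions and flagging offenders we hadn't seen before.

```python
import pandas as pd

# Hypothetical mapping from one contributor's headers to the main database's
# column names; every contributed file needed its own version of this table.
COLUMN_MAP = {
    "Offender": "offender_name",
    "Name of Killer": "offender_name",
    "Race/Ethnicity": "race",
    "Sex": "gender",
}

def fold_in_new_file(main_df: pd.DataFrame, path: str):
    """Load a newly contributed dataset, standardize its layout, and flag
    offenders who are not yet in the main database."""
    new_df = pd.read_excel(path).rename(columns=COLUMN_MAP)

    # Light normalization so trailing spaces and capitalization differences
    # don't hide a match ('gary ridgway ' vs. 'Gary Ridgway').
    new_df["offender_name"] = new_df["offender_name"].str.strip().str.title()
    known = set(main_df["offender_name"].str.strip().str.title())

    new_offenders = new_df[~new_df["offender_name"].isin(known)]
    return new_df, new_offenders
```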
Some of the authors I contacted had lost their original data files due to computer issues and sent in paper records that had to be transcribed into Excel. I also manually copied data from both editions of Michael Newton's The Encyclopedia of Serial Killers.
Due to lack of access to primary sources like police records and FBI files, information across all databases was gathered from secondary sources, and as a result we often encountered conflicting records. None of the databases recorded the source of the information, so it was impossible to verify which record was accurate. Differences in how we each defined the term 'serial killer' were a consistent problem that led to wide variation not only in the number of offenders contained in each dataset but in the types as well. We have just assembled a 'Serial Homicide Data Expert Panel' to discuss which classifications of offenders can be included in the main Radford dataset. The inclusion of hitmen and gang members has been an ongoing point of contention among folks in the field for decades (coming to a head now because of Aaron Hernandez). We hope to resolve this issue once and for all with the outcome of the expert panel.
The scrubbing was all a manual (and tedious) process. I accomplished the de-duping task by copying all offender names into one Excel file and assigning a color to the names from each dataset so that I could track where the data came from. We discovered over time that data from some authors was more reliable than data from others. In particular, data gathered by teams was more likely to contain errors than data compiled by one person alone. My theory is that teams had a harder time maintaining "version control".
But...since we got new data on a rolling basis, this color-coding process had to be repeated at least five times. After a while, I began to recognize names and could recall which records would be duplicates before I even color coded them. The issue was determining which records to exclude. We resolved to combine the duplicates and defer to our most trusted source when deciding which piece of information to keep whenever we encountered conflicts across the 160 variables.
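In code, the color-coding trick boils down to carrying a 'source' column instead of a cell color. Here is a minimal sketch of how the combine-and-defer-to-the-trusted-source step could work; the source labels and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical source labels, ordered from most to least trusted.
SOURCE_PRIORITY = ["radford", "contributor_a", "contributor_b"]

def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # A normalized key so stray spaces or capitalization differences don't
    # create false duplicates (genuine misspellings still need a manual look).
    df["name_key"] = df["offender_name"].str.strip().str.lower()
    df["source_rank"] = df["source"].map({s: i for i, s in enumerate(SOURCE_PRIORITY)})

    # Sort so the most trusted source comes first, then take the first
    # non-missing value of every variable for each offender: the trusted
    # source wins conflicts and the other sources only fill remaining gaps.
    df = df.sort_values("source_rank")
    merged = df.groupby("name_key", as_index=False).first()
    return merged.drop(columns=["source_rank"])
```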
One of the most difficult aspects was determining the offender's race. Oftentimes, news reports would not include a photo of the offender. Some authors would make assumptions based on the region of the country where the crimes occurred and couple that information with the offender's name. Others looked at the race of the victims, since the vast majority of these offenses are intraracial.
In regards to cleaning, some authors coded race by spelling out the entire word or using either a W for White or a C for Caucasian, AA for African American or B for Black. All of those records had to be recoded to match the primary scheme, which ended up being W, B, H (Hispanic), A (Asian) and O (Other). The use of 'Other' varied as well. Some authors offered a few examples (such as Native American), and most lumped Hispanic and Asian into 'Other'. Most of the time, the datasets did not include a legend, so records tagged with 'Other' had to be researched further.
As for gender, most authors used M, F or O, while others spelled out Male, Female or Other. The use of 'Other' in this case was reserved for the one instance of a transgender offender.
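Recoding those categorical variables is the kind of step that translates directly into a lookup table. A small sketch, assuming hypothetical column names and the primary scheme described above:

```python
import pandas as pd

# Lookup tables collapsing each contributor's codes onto the primary scheme:
# W, B, H (Hispanic), A (Asian), O (Other) for race; M, F, O for gender.
RACE_MAP = {
    "W": "W", "WHITE": "W", "C": "W", "CAUCASIAN": "W",
    "B": "B", "BLACK": "B", "AA": "B", "AFRICAN AMERICAN": "B",
    "H": "H", "HISPANIC": "H",
    "A": "A", "ASIAN": "A",
    "O": "O", "OTHER": "O",
}
GENDER_MAP = {"M": "M", "MALE": "M", "F": "F", "FEMALE": "F", "O": "O", "OTHER": "O"}

def recode(df: pd.DataFrame):
    df = df.copy()
    df["race"] = df["race"].str.strip().str.upper().map(RACE_MAP)
    df["gender"] = df["gender"].str.strip().str.upper().map(GENDER_MAP)
    # Codes that match no legend come back as NaN and are set aside
    # for further research, much like the untagged 'Other' records were.
    needs_review = df[df["race"].isna() | df["gender"].isna()]
    return df, needs_review
```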
Another frustrating aspect of the cleaning was that most datasets had all the information for 'victim type' lumped into one cell. Others broke the victim types down into separate columns for men, women, children, etc. and marked the relevant selection with either an X or a 0/1 (no/yes) scheme. The information from the single cells had to be parsed out following the 0/1 (no/yes) scheme, and each X was converted to a 1 while the gaps got a 0.
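That parsing step is also straightforward to express in code. A sketch, assuming a hypothetical 'victim_type' column and only a small subset of the real categories:

```python
import pandas as pd

# A few illustrative victim-type categories; the real database has many more.
VICTIM_TYPES = ["men", "women", "children"]

def split_victim_cell(df: pd.DataFrame) -> pd.DataFrame:
    """Turn a single free-text 'victim_type' cell (e.g. 'women; children')
    into one 0/1 (no/yes) column per category."""
    text = df["victim_type"].fillna("").str.lower()
    for vt in VICTIM_TYPES:
        # The \b word boundary keeps 'men' from matching inside 'women'.
        df[f"victim_{vt}"] = text.str.contains(rf"\b{vt}\b").astype(int)
    return df

def convert_x_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Convert X-marked columns to the 0/1 scheme: an X becomes 1, a gap becomes 0."""
    for col in columns:
        df[col] = (
            df[col].fillna("").astype(str).str.strip().str.upper().eq("X").astype(int)
        )
    return df
```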
An offender's record is never truly complete since new information about additional victims (or false claims) is sometimes discovered. In order for this information to be incorporated into the database, it must first be found by one of the researchers (I have a Google alert set up and run the terms "serial killer" and "serial homicide" through LexisNexis once a week).
In the end, we used SPSS to run the statistics.
The SHEISC's role is to use data to present an accurate depiction of serial murderers so that the public can be educated about these threats to their safety. There are approximately ten television shows on the air at the moment that exist merely to exploit this phenomenon, and they often misinform their audiences.