Your browser (Internet Explorer 6) is out of date. It has known security flaws and may not display all features of this and other websites. Learn how to update your browser.
X
Post

Cleaning and standardizing White House visitor data

By Earl F Glynn | Franklin Center

This is the fourth technical article in a series about analyzing White House visitor data. This article looks at cleaning and standardizing visitor data, and creating descriptive statistics of the data fields.

The complete White House visitor data file released each month has a number of data problems that need to be cleaned up before working with the file.

Problems are encountered in most of the data fields of interest, including names, dates, and locations.  Standardization of formats and values are introduced for some of the key fields to simplify searches and analysis.

See the many technical details of the cleanup and standardization below.


“Problem” characters in file?

An earlier article explained how to download the complete visitor data file via the Socrata data playerWhite_House_Visitor_Records_Requests.csv.

Output from CharCount utility showing frequency counts of all ASCII characters in White House visitor file from Nov 30, 2012. (Click on graphic to enlarge)

Having seen a number of problem data files over the years, I often use a CharCount utility to inspect a file for  “problem” characters.

ASCII codes are fairly standard for 128 characters (x’00′ to x’7F’), but the 128 “extended ASCII codes” from x’80′ to x’FF’ are not.

Output from the CharCount utility (click on graphic at right) shows a small number of characters in the “extended” range.  Some of these characters are interpreted differently on Macintosh, Linux and Windows computers.

Since the July 27, 2012 released visitor file, three “new” characters now appear in the file:  x’92′, x’A2′, x’B1′.  Macs and Windows computers will interpret x’92′ differently.  The characters x’A2′ (¢) and x’B1′ (±) appear to be interpreted the same way, but it’s unclear why they are now in the file.

The “problem” characters could be removed or replaced with something like the Linux sed command, but for now these problem characters will be ignored.

Some names with non-English characters might include these special characters and could be difficult to find, or might sort in an unexpected order.


Duplicate Records

I wrote the WhiteHouse-Visitors.R script, and refined it over the past few months, to define a repeatable process for working with the White House visitor data.

This script first read the text file as a vector of strings, and looked for duplicates using R’s table function  (comments in green show results):

FILENAME <- "White_House_Visitor_Records_Requests.csv"

# Read as vector of strings.  Check for duplicates.
s <- readLines(FILENAME)
length(s)
# Counts include header row
#[1] 2834619   (Nov  2012)

counts <- table(s)
table(counts)

#counts        (Nov 2012)
#      1       2       3       4       5       8
#2820434    6915      99      10       2       1

# Create unique list of records
s <- unique(s)
length(s)
#[1] 2827461   (Nov  2012)

The counts vector above shows 1 record occurred 8 times, 2 records occurred 5 times, 10 records occurred 4 times, 99 records occurred 3 times, and 6915 records occurred twice.

The remaining 2,820,434 records were unique and occurred only once.

The file WhiteHouse-Visitors-Duplicate-Records.txt shows the duplicates.

Selected duplicates are shown below:

counts record
     8 WINTER,BENJAMIN,P,U87115,0,VA,3/14/10 11:05,B0402,,,3/12/10 11:11,3/14/10 11:25 ...
     5 Barron,Matthew,E,U16109,0,VA,6/16/2012 19:15,B0402,,,6/15/2012 0:00,6/16/2012 19:15 ...
     4 Hepner,Brent,E,U26651,,VA,,,,,7/24/12 0:00,8/4/12 7:30 ...
     3 DAVID,,,,,,,,,,,9/22/10 ...

It’s fairly unlikely that 8 different people named “Benjamin P Winter” all attended the same White House meeting on March 14, 2010.  Likewise, it’s fairly unlikely that three different people named “Matthew E Barron” or four people named “Brent E Hepner” attended the same White House meetings.  The cause of such duplicates is unknown, but could be human error.

There are several cases in the file of apparent duplicates visitors attending a Rosh Hashana Reception at the Vice President’s Residence (code VPR) where only a single first name is given as the last name.  In this case, it’s quite plausible three different people named “David” attended the event, but without any other information, there is no information loss caused by deletion of the duplicates.

When all the duplicate strings were removed using R’s unique function, the file was found to have 2,827,461 unique records (including the header record).


Standardizing all names

Inspection of various name fields (visitor, visitee, caller) shows a number of problems.  For example, names should not contain numbers (e.g., 2, 5, 9) or punctuation (e.g., [ or ?).

The R standardize_name function (in separate file WhiteHouse-SharedFunctions.R) was written to make names more consistent so searches and sort orders would be more predictable:

# Example:
# standardize_name("           e3.         E..          [(Cumm`ings?           ")
# [1] "e e cummings"

standardize_name <- function(s)
{
  s[s=="n/a"] <- ""

  s <- gsub("/$",  "", s)   # Why do a number of names end with a slash?
  s <- gsub("_|`|0|1|2|3|4|5|6|7|8|9", "", s)  # strange middle names
  s <- gsub("\\-|\\(|\\)|\\[|\\?",     "", s)  # Remove strange characters

  s <- gsub("\\."," ",s)   # Get rid of initials in names
  s <- gsub(" +", " ", s)  # Replace multiple blanks with single blank

  s <- str_trim(s)         # Remove any leading/trailing blanks
  s <- tolower(s)          # force to lower case

  s
}

Hyphenated last names sometimes have hyphens and sometimes do not. Perhaps hyphens should be retained in such last names, but for uniformity and consistency in name searches, all hyphens were removed.


Removing records with missing visitor names

If the point of the visitor records is to identify a visitor by name, it doesn't make sense to keep records without complete names -- at least a first and a last name.

The file WhiteHouse-Visitors-Missing-Names.txt (tab delimited file) shows 326 records with incomplete names.

One record from the 3/30/2012 release is missing both names and all other information.

The majority of the records with incomplete names are from visits to the Vice President's Residence (VPR) released on 12/31/2010 where only a first name (in the last name column) was specified.


Combined name fields

In the original file the visitor names are broken into three fields (NAMELAST, NAMEFIRST, NAMEMID), while the visitee and caller names are broken into two fields (e.g., visitee_namelast and visitee_namefirst).

To enable searching for a name  in a single field, combined names were introduced using a pipe ("|") separator:  NAME (for visitor), visitee and CALLER.

# Combined name fields
d$NAME    <- paste(d$NAMELAST, d$NAMEFIRST, d$NAMEMID, sep="|")
d$visitee <- paste(d$visitee_namelast, d$visitee_namefirst, sep="|")
d$CALLER  <- paste(d$CALLER_NAME_LAST, d$CALLER_NAME_FIRST, sep="|")

Examples:

NAME:

  • dunn|anita|b
  • geithner|timothy|f
  • reich|robert|b

visitee:

  • axelrod|david
  • office|visitors        [tourists]
  • plouffe|david
  • potus|                [President of the United States]

CALLER:

  • doebler|max
  • gallagher|clare

Standardizing selected visitee names

Unfortunately, there is little standardization of names in the file.  Names of the same person on different visits can be specified in various ways.

For now it does not seem wise to introduce any standardization of visitor names, but at least certain visitee names need standardization.

With the visitee names, I have identified 100 variations of “POTUS” (President of the United States), 10 variations of “FLOTUS” (First Lady of the United States), 5 variations of POTUS/FLOTUS combinations, and 6 variations of VPOTUS (Vice President of the United States).  Other variations may be found in the future.

I combine all known variations for a single person in a file now, and use a simple R function to assign the same standard name to all the variations:

# For now let's only standardize these visitees:
# POTUS, FLOTUS, POTUS/FLOTUS and VPOTUS

fix.visitee <- function(v, file.list, standard.name)
{
  name.set <- read.csv(file.list, as.is=TRUE)
  v[v %in% name.set$visitee] <- standard.name
  v
}

d$visitee <- fix.visitee(d$visitee, "F:/WhiteHouse/List/visitee-POTUS.csv",        "potus|")
d$visitee <- fix.visitee(d$visitee, "F:/WhiteHouse/List/visitee-FLOTUS.csv",       "flotus|")
d$visitee <- fix.visitee(d$visitee, "F:/WhiteHouse/List/visitee-POTUS-FLOTUS.csv", "potus/flotus|")
d$visitee <- fix.visitee(d$visitee, "F:/WhiteHouse/List/visitee-VPOTUS.csv",       "vpotus|")

Additional standardization of visitee names may be introduced after studying known White House staff names.


Single location field and standardization

Similar to the treatment of names, a single LOCATION field was introduced to be the combination of MEETING_LOC (building) and MEETING_ROOM.

Examples:

  • wh|oval office
  • wh|rose garden
  • wh|situation room
  • oeob|bowling alley

The visitor file lacks standardization of location names, e.g., 21 different LOCATION values have been identified to mean “wh|oval office”, 6 variations identify the “wh|rose garden”, 27 variations identify the “wh|situation room”, and 39 variations identify the “oeob|bowling alley”.

At present the R script imposes some standardization for 18 different locations in a way similar to the standardization of visitee values.  (See “location” files in this directory.)

Additional standardization may be added for selected locations at a later date.


Standardization of date/time fields

The White House visitor data has five date/time fields:

  • APPT_MADE_DATE,
  • APPT_START_DATE,
  • APPT_CANCEL_DATE,
  • LastEntryDate, and
  • RELEASE_DATE.

(See White House WAVES Data Fields for more information).

Date/time fields were observed to have a number of inconsistencies:

  • Extra leading or trailing blanks.
  • Most fields had dates and times, but some only had dates.
  • Some times had a 12-hour clock, others had a 24-hour clock.
  • Some 12-hour clock dates used am/pm, other used AM/PM.
  • Some dates had 2-digit years, others had 4-digit years.
  • Some dates mysteriously were numbers like 40619.61714.

After some programmatic editing, all dates fields were convert to an ISO 8601 format (yyyy-mm-dd HH:MM:SS) using R’s strptime function.  See the function fix.WhiteHouse.dates in the file WhiteHouse-SharedFunctions.R.

The ISO 8601 format allows dates as character strings to be sorted into chronological order.

The most unexpected conversions needed were the numbers, like 40619.61714.

These numbers can be entered into Excel and formatted as dates to reveal what that really mean, such as 3/17/2011 2:48 PM for 40619.61714.  This online tip was found to do this conversion in R.

A few date/time values were problems that have not been fixed programmatically yet:

  • 6/1/20091 1:39:23AM
  • 8/14/20091 0:29:51AM
  • 8/31/200 912:00:00PM
  • 9/18/200 912:01:00PM

Standardization of the description field

The description fields sometime contains comments about a visit, but text was entered with a number of inconsistencies.

Here are programmatic changes made to the description fields to make them more consistent:

  • All letters were forced to lower case.
  • Leading and trailing blanks were removed.
  • Multiple internal spaces were replaced by a single space.
  • Slashes or periods at the end of description fields were removed.

“Cleaned-Enhanced” Files

After making all the changes and standardizations described above to the White House visitor file in R, “cleaned enhanced files” were created in this online “CLEANED-ENHANCED” directory as a new starting point for any analysis.

These files are all tab-delimited.

The 715 MB file WhiteHouse-Visitor-Records.txt has 2,827,135 visit records.  [2,827,461 unique visit records – 326 records missing names).

The 715 MB file was zipped into a 121 MB file, WhiteHouse-Visitor-Records-all-through-2012-11-30-Release.ZIP for download.

Separate monthly files for all release dates are also available in the CLEANED-ENHANCED directory.

The new 82,614 visit records released on Nov. 30, 2012 can be downloaded in a separate 21 MB file:


Metadata

Frequency counts for all values in each data field (including the added NAME, visitee, CALLER and LOCATION fields) were created for the full visitor file, as well as a separate file with the statistics only for the Nov. 30, 2012 release.

Each file in the following directories shows the frequency counts for all possible values found:

In addition to the single files by data field, a summary file of all fields was created corresponding to the ALL and RELEASE sets:

See Table 3 in White House WAVES Data Fields for a formatted version of the ALL summary file.


Next article:  The cleaned-enhanced file  WhiteHouse-Visitor-Records.txt will be divided into categories to simplify analysis of POTUS and White House Staff visits.


Series about analyzing White House visitor data


Please, feel free to steal our stuff! Just remember to credit WatchdogLabs.org or the Franklin Center for Government & Public Integrity.


efg

Contact Info: Email: Earl.Glynn@FranklinCenterHq.org, Twitter: @WatchdogLabs, Facebook: http://www.facebook.com/WatchdogLabs

Leave a comment  

name*

email*

website

Submit comment