When City Hall expanded free early years education to more than 50,000 four-year-olds across the five boroughs, MODA stepped in to provide crucial data management and analytics support.
“On receiving – winning the funding in Albany on April 1st, 2014 to take our full day Pre-K program from 20,000 students to 53,000 students, we awoke to find that that left us approximately four months or five months to achieve the goal. Luckily, one City office stepped up, created a new analysis mechanism and allowed us to accurately target where the kids were who needed this opportunity […] On that day, we saw a whole new reality begin in New York City education.” Mayor Bill de Blasio, March 3, 2015
In spring 2014, the City of New York embarked on a historic effort to give every four-year-old access to free and high-quality early years education. By the summer’s end, more than 53,000 children would be enrolled in the new pre-Kindergarten (‘pre-K’) programs.
Before the school year began, the City faced the daunting task of contacting the families of each four-year-old to guide them through the new enrollment process. Making use of methods such as entity resolution and fuzzy matching, MODA created a set of “golden records” that volunteer outreach officers could use to contact families and monitor progress towards enrollment targets.
What is the analytics question?
A nationwide push for universal access to early childhood education gained pace during the 1990s and 2000s as scholarly research quantified its effect on children’s educational outcomes. Benefits of pre-K include developing social and emotional skills, boosting academic achievement and narrowing the vocabulary gap between rich and poor children.
In March 2014, New York City Mayor Bill de Blasio won $340 million in State funding to make pre-K free across the city.
With just eight months until the start of the school year, the City’s Pre-Kindergarten for All (PKA) team partnered with MODA to address the data management challenges posed by this historic outreach effort. The PKA team assembled a legion of volunteer workers to contact parents and guardians of all New York City four-year-olds. However, volunteers lacked a reliable list of the names, addresses and phone numbers of those parents and guardians. MODA was tasked with assembling such a list from disparate and overlapping records on households that might include a four-year-old child.
What data is required?
MODA began by reviewing what datasets the City already had on households likely to include a four-year-old child. HHS Connect, a City program and data system for coordinating case management between agencies, maintains data related to health and social programs from eight agencies. These data range from birth records to data on families who had been involved in the City’s homeless shelter system.
Experian, a consumer credit reporting agency, provided data on families believed to have a four-year-old child, based on consumer transactions. For each family, the company provided a name, address and telephone number.
The team also used data from three sources related to the City’s existing pre-K education programs: the billing system used to pay Early Education Centers for their services (‘Pre-Kids’); the portal through which parents select and match with available seats (SEMS); and the post-enrollment tracking system for families using the centers (ATS).
What data analysis is applicable?
Simply stacking every agency’s records on top of one another in a single list would result in many duplicates. Outreach workers would risk alienating families through repeated and uncoordinated phone calls, while authoritatively tracking progress on contacting each eligible family would be impossible.
The key analysis task was to create a set of “golden records” that included a single, unique record for each family with a four-year-old child, removing any duplicates from overlapping records in the source datasets.
Creating unique records by linking multiple data sources is often complicated by different formats (e.g., ‘Janet A. Smith’ vs. ‘Smith, J. A.’), different spellings, and missing data. The team addressed the challenges in several steps:
- The structure of key fields such as last name, address and phone number was consolidated across datasets.
- MODA changed all text characters to uppercase, replaced bad data (e.g., phone number ‘000-000-0000’) with null values, and removed special characters.
- MODA created unique household IDs. Where records from more than one dataset matched on two or more key fields, MODA assigned the same unique ID.
- Fuzzy matching methods joined records that referred to the same entity, but were spelled or formatted differently. Insignificant words such as ‘the’ or ‘Mr’ were removed, and the key parts of each string were matched.
- The method was applied using SAS DQMATCH, a proprietary algorithm for which open source equivalents are widely available today.
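The steps above can be sketched with open-source tools. The following Python example is a minimal illustration, not the team's SAS implementation and not a reimplementation of DQMATCH. The field names (`last_name`, `address`, `phone`), the noise-word list and the function names are all hypothetical, and exact comparison of cleaned values stands in for true fuzzy matching.

```python
import re
from itertools import combinations

# Hypothetical field names and noise words, chosen for illustration only.
KEY_FIELDS = ("last_name", "address", "phone")
NOISE_WORDS = {"THE", "MR", "MRS", "MS"}
BAD_PHONES = {"0000000000"}


def clean(value):
    """Uppercase, drop special characters and noise words; None if nothing is left."""
    if value is None:
        return None
    text = re.sub(r"[^A-Z0-9 ]", "", value.upper())
    words = [w for w in text.split() if w not in NOISE_WORDS]
    return " ".join(words) or None


def clean_phone(value):
    """Keep digits only and treat placeholder numbers as missing."""
    digits = re.sub(r"\D", "", value or "")
    return None if not digits or digits in BAD_PHONES else digits


def normalise(record):
    """Bring one source record into the shared structure of key fields."""
    return {
        "last_name": clean(record.get("last_name")),
        "address": clean(record.get("address")),
        "phone": clean_phone(record.get("phone")),
    }


def assign_household_ids(records, min_matches=2):
    """Give records the same ID when they agree on at least min_matches key fields."""
    cleaned = [normalise(r) for r in records]
    parent = list(range(len(cleaned)))  # union-find forest, one node per record

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(cleaned)), 2):
        matches = sum(
            1
            for field in KEY_FIELDS
            if cleaned[i][field] is not None and cleaned[i][field] == cleaned[j][field]
        )
        if matches >= min_matches:
            parent[find(j)] = find(i)  # merge the two households

    return [find(i) for i in range(len(cleaned))]
```

Missing values never count as a match, so two records with a nulled phone number are not linked on that field. The pairwise comparison is quadratic in the number of records, so a dataset of realistic size would first need to be grouped into smaller candidate blocks (for example by ZIP code, which is an assumption here and not something the source describes).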
Having created unique household IDs, the team built individual identifiers for each child within a household, using Levenshtein distance to allow flexibility in matching records that differed only in minor spelling (e.g., ‘Sara’ vs. ‘Sarah’).
    /* BY-group processing requires upk_res_full to be sorted by household_id. */
    data upk_res_full1;
        set upk_res_full;
        by household_id;

        /* Compare each first name with the one on the previous row. */
        prev_child = lag(child_first_name);
        dist = complev(child_first_name, prev_child);  /* Levenshtein edit distance */

        if first.household_id then new_house = 1;

        /* A position of 1 means the first string begins with the second. */
        index  = index(strip(prev_child), strip(child_first_name));
        index1 = index(strip(child_first_name), strip(prev_child));
        index2 = index(strip(child_first_name), strip(child_last_name));

        /* New child: first row of a household, or a name more than two edits
           from the previous one with no shared leading string. */
        if new_house = 1 then new_child = 1;
        else if dist > 2 and new_house ne 1 and index ne 1 and index1 ne 1 and index2 ne 1 then new_child = 1;

        /* Running count of distinct children within the household. */
        if first.household_id then inhh_child_count = 0;
        if new_child then inhh_child_count + 1;
        house_child = cats(household_id, inhh_child_count);

        /* Encode the first six letters of the name as alphabet positions (A = 1). */
        child_fs_num1 = rank(substr(child_first_name, 1, 1)) - 64;
        child_fs_num2 = rank(substr(child_first_name, 2, 2)) - 64;
        child_fs_num3 = rank(substr(child_first_name, 3, 3)) - 64;
        child_fs_num4 = rank(substr(child_first_name, 4, 4)) - 64;
        child_fs_num5 = rank(substr(child_first_name, 5, 5)) - 64;
        child_fs_num6 = rank(substr(child_first_name, 6, 6)) - 64;

        /* Characters ranked below 'A', such as blanks and digits, become missing. */
        if child_fs_num1 < 1 then child_fs_num1 = .;
        if child_fs_num2 < 1 then child_fs_num2 = .;
        if child_fs_num3 < 1 then child_fs_num3 = .;
        if child_fs_num4 < 1 then child_fs_num4 = .;
        if child_fs_num5 < 1 then child_fs_num5 = .;
        if child_fs_num6 < 1 then child_fs_num6 = .;

        child_fs_num = cats(child_fs_num1, child_fs_num2, child_fs_num3,
                            child_fs_num4, child_fs_num5, child_fs_num6);
        prev_child_fs_num = lag(child_fs_num);

        /* Child ID: household ID plus the name code. A repeat of the previous
           child reuses that row's code; blank names get a placeholder code. */
        if child_first_name ne '' then do;
            if new_child = 1 then child_id = cats(household_id, child_fs_num);
            else child_id = cats(household_id, prev_child_fs_num);
        end;
        else child_id = cats(household_id, '000000');
    run;
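For readers without SAS, the core of the child-matching rule can be sketched in Python. This is a simplified illustration under stated assumptions, not a port of the step above: the function names are hypothetical, the check of first name against last name is omitted, and blank names are assumed never to match.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def same_child(name, other, max_distance=2):
    """Two first names refer to one child if they are close or one begins with the other."""
    name, other = name.strip().upper(), other.strip().upper()
    if not name or not other:
        return False  # assumption: blank names never match
    return (
        levenshtein(name, other) <= max_distance
        or name.startswith(other)
        or other.startswith(name)
    )


def number_children(first_names):
    """Number the children in one household, comparing each name with the one before it."""
    numbers, count, previous = [], 0, None
    for name in first_names:
        if previous is None or not same_child(name, previous):
            count += 1
        numbers.append(count)
        previous = name
    return numbers
```

Because each name is compared only with the one before it, as in the SAS step, spelling variants of the same name need to sit next to each other in the input, for example by sorting names within each household.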
How can the analysis improve the operation?
The extended project team created a Customer Relationship Management (CRM) system (“Brianna”) to provide volunteer outreach workers with the golden records MODA produced. The system provided each volunteer with a unique list of families to contact, their telephone numbers, and details of the pre-K programs nearest to them.
Between July 1 and the start of the school year in September, more than 1.2 million texts, emails and phone calls were delivered to eligible families. The golden records, combined with the Brianna CRM system, allowed volunteers to reach eligible families systematically, crossing families off their list once successfully contacted.
Is the model sustainable?
The MODA-generated golden records, which were developed under significant time constraints, underpinned the outreach effort required to launch universal pre-K.
Ultimately the code was handed off to the Department of Education and the Department of Information Technology and Telecommunications (DoITT), providing a method suitable for future annual enrollment drives where reaching the maximum number of eligible families at lowest volunteer cost is a priority.
By the time the effort concluded, a total of 53,520 students were enrolled in one of the new pre-K facilities, helping to deliver on a major mayoral promise and improve lifelong learning outcomes for children.