Tuesday, January 24, 2017

Week 1 - Cleaning the data and creating a file of dictionaries

This week was filled with looking at the large dataset that we were given from CARES, Inc. in Albany, NY. There are 12 different files, with the largest file containing over 310,000 lines (each line represents an individual's entry) and the smallest containing only 2 lines. These files contain information that individuals staying in homelessness programs have entered into HMIS's system. We created a Google spreadsheet to explain the fields that are in each file. This will require some time, since the fields must be analyzed against two different manuals that HMIS has provided. My goal is to document the fields from two files per week.

Last semester we were given a sample dataset, from which we put the following variables into a dictionary: start date of program, end date of program, length of stay, personal ID, and project type (Emergency Shelter, Rapid Re-housing, Street Outreach, etc.). With the new dataset, we need to add these same variables into a new dictionary. This will be a general function that can read in any of the 12 files and return a list of dictionaries.
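A rough sketch of what that general function could look like, assuming the files are comma-separated with a header row. The column names here (PersonalID, EntryDate, etc.) are placeholders; the real HMIS export headers may differ, which is exactly what the spreadsheet documentation will pin down:

```python
import csv
from datetime import datetime

# Hypothetical column names -- the actual HMIS headers may differ per file.
FIELDS = {
    "PersonalID": "personal_id",
    "EntryDate": "start_date",
    "ExitDate": "end_date",
    "ProjectType": "project_type",
}

def read_entries(path):
    """Read one HMIS export file and return a list of dictionaries
    with the same keys we used last semester."""
    entries = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            entry = {new: row.get(old, "") for old, new in FIELDS.items()}
            # Length of stay in days, when both dates are present and parseable.
            try:
                start = datetime.strptime(entry["start_date"], "%Y-%m-%d")
                end = datetime.strptime(entry["end_date"], "%Y-%m-%d")
                entry["length_of_stay"] = (end - start).days
            except ValueError:
                entry["length_of_stay"] = None
            entries.append(entry)
    return entries
```

Because the 12 files share this shape, one function with a per-file column map should cover all of them rather than 12 separate readers.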

My goal for this past week: 

  • To analyze the dataset and familiarize myself with the new data
  • Write the existing dictionary to a file and read it in using Python
  • Determine which process is more efficient/faster: (1) reading in the text file and creating the dictionaries, or (2) reading in a file of the pre-built dictionaries

Results: I started to analyze the dataset and have finished documenting the headings for the first file. Reading in a file of pre-built dictionaries is much quicker (~35x faster!)
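One way to run that comparison, assuming the dictionaries are saved with Python's pickle module (the post doesn't say which serialization format we used, so pickle is an assumption here). The helper below times any function call so the two approaches can be measured the same way:

```python
import pickle
import time

def save_entries(entries, path):
    """Write the list of dictionaries to disk so later runs can skip parsing."""
    with open(path, "wb") as f:
        pickle.dump(entries, f)

def load_entries(path):
    """Read the pre-built list of dictionaries back in one step."""
    with open(path, "rb") as f:
        return pickle.load(f)

def time_call(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Timing `time_call(read_entries, "file.csv")` against `time_call(load_entries, "file.pkl")` gives the head-to-head numbers; loading skips all the date parsing and key remapping, which is where a speedup like ~35x would come from.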

My goal for the following week:
  • Create a function that reads in the new dataset but creates the same dictionary as last semester 
  • Analyze the variables within two of the files 
