LivingData – Building collaborative datasets
Long before socially inclined technology projects start creating tangible impact for their shareholders, begins the torrent of data generated by technology interventions. This data is typically valuable, since it contains new information about communities who have often never been connected to any digital media. Moreover, the data usually also contains performance related information related to the intervention itself. However, the data is collected in so many different forms and formats that it is usually very difficult to analyze data from all sources comprehensively.
Ever since Mojolab began, we’ve experienced a consistent challenge in being able to manage data that has been coming at us like an avalanche from our many sources in the field.In our case, data is collected both passively by automated systems as well as actively by human particpants. The automatically collected data is stored in several databases, each on a different remotely placed notebook server, connected only by a VPN.
The manual data is collected as a set of excel sheets, often made by participants from the field. These sheets often vary in format, even for similar data. For example, two volunteers in two different locations may collect participant data for workshops in completely dissimilar looking Excel sheets, when compared based on column names, but which represent the same type of data, in that both are lists of people.
To solve this problem we came up with the idea of a tool that would be able to combine tables in different formats to make more comprehensive data sets.
In the simplest case, the tool would be able to take two tables (as CSV files), for example (see figure below)
Both these tables represent lists of people. However, trying to map them into a database based on the field names would require writing a translator for each set of column names, since computers are typically sensitive to typecase and order. Add to this the fact that this data is incremental to pre-existing data which is possibly mapped to an existing table structure (see figure below):
In such scenarios, the typical field response is to try to train people to use standard formats. However, this adds a training overhead to the addition of new people. Also, human error can never be completely eliminated.
What we needed was an automated tool to take column names from new tables, merge them to the best degree possible and then map the merged column names to the columns in our pre-existing dataset. New columns discovered should be appended to the existing table structure and the pre-existing records should all be populated with null values for the new fields (see below)
The long term vision is that people will be able to use this toolset to stitch together datasets from the field. We call it LivingData and its available at https://github.com/mojolab/LivingData. We’re very keen to have people collaborate on the code, so people who want to develop, please send an email to arjun at mojolab dot org.
Next steps planned are separate libraries to autofill latitudes and longitudes of location fields using web based geocoding services. Down the road we’d also like to integrate it into Google Docs and see if services can be built on top of it.
It’s also important to mention that we came up with the name before we read this article in Wired by Phil Agre – but the theme is pretty similar- http://www.wired.com/wired/archive/2.11/agre.if.html
All you need to run the code is a Linux command line with python installed…here is a sample of how the tool works in its most basic form:
Usage for tools/MergeCSV.py
1. Take two CSV Files you want to merge (For example: F1.csv, F2.csv)
2. Copy the script
3. Run the following command on termainal: python MergeCSV.py F1.csv F2.csv
4. Follow the instructions on the command line
5. A new file Merged.csv will be created
Tested on Ubuntu 11.10, 12.04
1. wget https://github.com/mojolab/LivingData/archive/master.zip
2. unzip master.zip
3. mv LivingData-master livingdata
4. cd livingdata
5. cd tools
6. python MergeCSV.py <path to first file> <path to second file> <path to merged file>
NOTE: One way to do this on a Windows machine is to have an Ubuntu VM using VirtualBox, install guest additions and share a folder from the Windows machine into the VM.
Then a putty session can be used for the rest.
SAMPLE OUTPUT FOR MergeCSV.py
bin lib LICENSE README samples status tools
root@maa:/opt/dev/LivingData# cd tools/
root@maa:/opt/dev/LivingData/tools# cat ../samples/1.csv
root@maa:/opt/dev/LivingData/tools# cat ../samples/2.csv
root@maa:/opt/dev/LivingData/tools# python MergeCSV.py ../samples/1.csv ../samples/2.csv
Headers for First file are: [‘NAME’, ‘EMAIL’, ‘PHONE’]
Headers for Second file are: [‘name’, ‘email’]
Please select corresponding Attribute Number for <name> from:
Enter Your Choice, write None if nothing matches:0
Enter New Name for name & NAME: MERGEDNAME
Please select corresponding Attribute Number for <email> from:
Enter Your Choice, write None if nothing matches:MERGEDEMAIL
Wrong Choice, Please Enter Your Again: 0
Enter New Name for email & EMAIL: MERGEDEMAIL
Enter New Name for PHONE: MERGEDPHONE
Merged Headers are: [(‘name’, ‘NAME’), (‘email’, ‘EMAIL’), (None, ‘PHONE’)]
New Headers are: [‘MERGEDNAME’, ‘MERGEDEMAIL’, ‘MERGEDPHONE’]
root@maa:/opt/dev/LivingData/tools# cat Merged.csv