The idea behind this project was two-fold:
- To collect and present the data requested.
- To automate the process of data collection and presentation so that it can remain dynamic and incorporate new data through tools like cron.
To do this, I employed a number of techniques, which I will briefly outline here:
First, I needed to collect the names and campaign donation information for each Congressman in the current (112th) Congress. To do this, I used the following process:
- Select a site that has the information I require:
- I decided that Open Secrets was the best option because their data are consistent and the site is widely regarded as reputable.
- Then, I had to figure out how to get the information from their site. I discovered that they use a unique numerical identifier for each Congressman, and that the URL also makes it easy to see which election cycle you are accessing. In the following examples, the relevant fields are the cid and cycle parameters:
- http://www.opensecrets.org/politicians/summary.php?cid=N00005906&cycle=2010
- http://www.opensecrets.org/politicians/summary.php?cycle=Career&type=I&cid=N00003675&newMem=N
- Then, I needed to download their index file and run an XSLT stylesheet on it to extract the relevant profile ID numbers to a text file. (Note: I'm only looking at career and 2010 data right now.)
- Next, I wrote a shell script to handle the following processes (a sketch of this pipeline appears after this list):
- To grab all of the files, I used curl in a loop over the identification numbers listed in the text file.
- Since these files were not well-formed XML, I used a command-line version of jTidy to convert them into well-formed XML so that an XSLT stylesheet could extract the information I needed.
- I used sed to remove the DTD and XML namespace declarations, and also to escape characters, such as stray ampersands ("&"), that would otherwise break the transformation.
- Then, to apply the XSLT in batch mode, I ran the command-line version of SaxonHE in a loop over the directory.
- Next, I uploaded the transformed XML to the eXist database on Obdurodon so that I could run queries (see the upload sketch after this list).
- Finally, I used XQuery and PHP to search for and output the data I wanted in a readable format.
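The shell script step above can be summarized with a minimal sketch. All file and directory names below (cids.txt, summary.xsl, raw/, tidy/, out/), the jTidy and Saxon jar names, and the exact sed expressions are my own illustrative assumptions, not the project's actual files, and the separate steps are combined into a single loop here for brevity:

    #!/bin/bash
    # Sketch of the fetch / tidy / clean / transform pipeline described above.
    mkdir -p raw tidy out

    while read -r cid; do
        # Grab each profile page by its OpenSecrets cid (2010 cycle shown here)
        curl -s -o "raw/${cid}.html" \
            "http://www.opensecrets.org/politicians/summary.php?cid=${cid}&cycle=2010"

        # Convert the HTML to well-formed XML with command-line jTidy;
        # the jar name and flags vary by installation
        java -jar jtidy.jar -asxml -quiet "raw/${cid}.html" > "tidy/${cid}.xml"

        # Drop the DOCTYPE, strip the default namespace, and escape stray
        # ampersands so the stylesheet can be applied cleanly (GNU sed syntax)
        sed -i -e '/<!DOCTYPE/d' \
               -e 's/ xmlns="[^"]*"//' \
               -e 's/ & / \&amp; /g' "tidy/${cid}.xml"

        # Apply the extraction stylesheet with command-line SaxonHE
        java -jar saxon9he.jar -s:"tidy/${cid}.xml" -xsl:summary.xsl -o:"out/${cid}.xml"
    done < cids.txt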
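Loading the results into eXist can be done in several ways; one possibility is the database's REST interface, sketched below. The host, port, collection name (/db/congress), and credentials are placeholders, not the actual Obdurodon configuration:

    # PUT each transformed document into an eXist collection over REST
    for f in out/*.xml; do
        curl -s -u admin:password \
             -X PUT -H "Content-Type: application/xml" \
             --data-binary @"$f" \
             "http://localhost:8080/exist/rest/db/congress/$(basename "$f")"
    done

Once loaded, the same interface can also be queried ad hoc (shown here with curl rather than the PHP used in the project; the element and attribute names are placeholders):

    curl -G --data-urlencode '_query=//member[@cid="N00005906"]' \
         "http://localhost:8080/exist/rest/db/congress"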
Second, I needed to get the bills from the current Congressional session, make them queryable, and present them in a readable format:
- These are freely available to the public from the House.gov site in relatively decent, though not well-formed, XML. I followed the same steps as above, using jTidy and sed to convert the files into well-formed XML.
- Then, I used an XSLT stylesheet to extract all of the links from the index file and output them to a text file, so that I could use a loop and curl to grab all of the files (a sketch of this download loop appears after this list).
- Next, since I can link directly to the full text of each bill, I only needed to extract the specific data I wanted; I applied an XSLT stylesheet and stored the resulting files in the eXist database.
- Finally, I used XQuery and PHP to search for and output the data I wanted in a readable format.
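A minimal sketch of that download loop, assuming the link-extraction stylesheet wrote one URL per line to a text file; the file name (bill_urls.txt) and output directory (bills/) are my own placeholders:

    mkdir -p bills
    while read -r url; do
        # Save each bill under the last component of its URL
        curl -s -o "bills/$(basename "$url")" "$url"
    done < bill_urls.txt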
Then, in order to connect the data, I took the following steps:
- Created an XML file that contains each Congressman's name, state, state abbreviation, and party affiliation to use to synchronize the data (a small lookup sketch follows this list):
- I took the table provided here, ran it through an XSL stylesheet, and uploaded that data to the eXist database.
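As an illustration of how that file can be used to synchronize the data, here is a hypothetical lookup with xmllint; the file name (members.xml), the element names (member, name, party), and the member name queried are assumptions about the markup, not the project's actual structure:

    # Look up the party affiliation recorded for a given member
    # (all names in this query are hypothetical placeholders)
    xmllint --xpath 'string(//member[name="John Doe"]/party)' members.xml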
To Be Continued...