The idea behind this project was two-fold:
- To collect and present the data requested.
- To automate the process of data collection and presentation so that it can remain dynamic and incorporate new data through tools like cron.
To do this, I employed a number of techniques, which I will briefly outline here:
First, I needed to collect the names and campaign donation information for each Congressman in the current (112th) Congress. To do this, I used the following process:
- Select a site that has the information I require:
- I decided that Open Secrets was the best option because their data are consistent and the site is widely regarded as reputable.
- Then, I had to figure out how to get the information from their site. I discovered that they use a unique numerical identifier for each Congressman, and that the URL also makes it easy to see which election cycle you are accessing. In the following examples, the relevant fields are the cid and cycle parameters:
- http://www.opensecrets.org/politicians/summary.php?cid=N00005906&cycle=2010
- http://www.opensecrets.org/politicians/summary.php?cycle=Career&type=I&cid=N00003675&newMem=N
- Then, I needed to download their index file and run an XSLT stylesheet on it to extract the relevant profile ID numbers to a text file. (Note: I'm only looking at career and 2010 data right now.)
- Next, I wrote a shell script to handle the following processes (a sketch of this pipeline appears after this list):
- To grab all of the files, I used curl in a loop over the identification numbers listed in the text file.
- Since these files were not well-formed XML, I used a command-line version of jTidy to convert them into well-formed XML so that an XSLT stylesheet could extract the information I needed.
- I used sed to remove the DTD and XML namespace declarations, and also to escape characters, such as stray ampersands ("&"), that would otherwise break the transformation.
- Then, to apply the XSLT in batch mode, I ran the command-line version of SaxonHE in a loop over the directory.
- Next, I uploaded the transformed XML to the eXist database on Obdurodon so that I could run queries (see the upload sketch after this list).
- Finally, I used XQuery and PHP to search for and output the data I wanted in a readable format.
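The shell script step above can be summarized with a minimal sketch. All file and directory names below (cids.txt, summary.xsl, raw/, tidy/, out/), the jTidy and Saxon jar names, and the exact sed expressions are my own illustrative assumptions, not the project's actual files, and the separate steps are combined into a single loop here for brevity:

    #!/bin/bash
    # Sketch of the fetch / tidy / clean / transform pipeline described above.
    mkdir -p raw tidy out

    while read -r cid; do
        # Grab each profile page by its OpenSecrets cid (2010 cycle shown here)
        curl -s -o "raw/${cid}.html" \
            "http://www.opensecrets.org/politicians/summary.php?cid=${cid}&cycle=2010"

        # Convert the HTML to well-formed XML with command-line jTidy;
        # the jar name and flags vary by installation
        java -jar jtidy.jar -asxml -quiet "raw/${cid}.html" > "tidy/${cid}.xml"

        # Drop the DOCTYPE, strip the default namespace, and escape stray
        # ampersands so the stylesheet can be applied cleanly (GNU sed syntax)
        sed -i -e '/<!DOCTYPE/d' \
               -e 's/ xmlns="[^"]*"//' \
               -e 's/ & / \&amp; /g' "tidy/${cid}.xml"

        # Apply the extraction stylesheet with command-line SaxonHE
        java -jar saxon9he.jar -s:"tidy/${cid}.xml" -xsl:summary.xsl -o:"out/${cid}.xml"
    done < cids.txt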
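Loading the results into eXist can be done in several ways; one possibility is the database's REST interface, sketched below. The host, port, collection name (/db/congress), and credentials are placeholders, not the actual Obdurodon configuration:

    # PUT each transformed document into an eXist collection over REST
    for f in out/*.xml; do
        curl -s -u admin:password \
             -X PUT -H "Content-Type: application/xml" \
             --data-binary @"$f" \
             "http://localhost:8080/exist/rest/db/congress/$(basename "$f")"
    done

Once loaded, the same interface can also be queried ad hoc (shown here with curl rather than the PHP used in the project; the element and attribute names are placeholders):

    curl -G --data-urlencode '_query=//member[@cid="N00005906"]' \
         "http://localhost:8080/exist/rest/db/congress"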
Second, I needed to get the bills from the current Congressional session, make them queryable, and present them in a readable format:
- These are freely available to the public from the House.gov site in relatively decent, though not well-formed, XML. I followed the same steps as above, using jTidy and sed to convert the files into well-formed XML.
- Then, I used an XSLT stylesheet to extract all of the links from the index file and output them to a text file, so that I could use a loop and curl to grab all of the files (a sketch of this download loop appears after this list).
- Next, since I can link directly to the full text of each bill, I only needed to extract the specific data I wanted; I applied an XSLT stylesheet and stored the resulting files in the eXist database.
- Finally, I used XQuery and PHP to search for and output the data I wanted in a readable format.
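A minimal sketch of that download loop, assuming the link-extraction stylesheet wrote one URL per line to a text file; the file name (bill_urls.txt) and output directory (bills/) are my own placeholders:

    mkdir -p bills
    while read -r url; do
        # Save each bill under the last component of its URL
        curl -s -o "bills/$(basename "$url")" "$url"
    done < bill_urls.txt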
Then, in order to connect the data, I took the following steps:
- Created an XML file that contains each Congressman's name, state, state abbreviation, and party affiliation to use to synchronize the data (a small lookup sketch follows this list):
- I took the table provided here, ran it through an XSL stylesheet, and uploaded that data to the eXist database.
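As an illustration of how that file can be used to synchronize the data, here is a hypothetical lookup with xmllint; the file name (members.xml), the element names (member, name, party), and the member name queried are assumptions about the markup, not the project's actual structure:

    # Look up the party affiliation recorded for a given member
    # (all names in this query are hypothetical placeholders)
    xmllint --xpath 'string(//member[name="John Doe"]/party)' members.xml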
To Be Continued...