Data mining tools

Big stories lurk online and in computer files

When the I-35 bridge in Minneapolis collapsed in August 2007, the programmers and journalists at msnbc.com flew into action. Less than 48 hours after the bridge tumbled into the Mississippi River, they had turned a national database on the condition of bridges into a clickable state-by-state map, giving viewers the chance to see whether their bridges might be dangerous.

It was a dramatic demonstration of the power that data or “structured information” bring to storytelling. And, of course, many news organizations jumped at the chance to bring local angles to the story as everyone asked the key question: “Are the bridges I cross safe?”

Investigative Reporters and Editors, a professional group that sells data files including the bridge inventory to journalists through its database library, fielded more than 150 requests for bridge information in the first day after the collapse.

Some CitJ sites also saw the opportunity, either for local or national coverage. The blog fortiusone.com, which provides a geographic angle to many stories (it calls itself “Moving Beyond Push Pins,” a reference to Google mapping), posted a detailed explanation of how it got and used the data.

The bridge collapse is only one example of how data could be incorporated into your site. And more folks seem to be getting on the data bandwagon.

Derek Willis, database editor at washingtonpost.com, says, “Journalists are doing more with data in general, thanks to the Internet.”

Finding data isn’t hard

And, he says, finding data is fairly simple. “Most of the information available is generated by various government agencies in the form of reports, lists and databases that are posted to the Web. Some non-governmental organizations have also produced or collected additional data, and in some cases individuals are building datasets on their own,” Willis says.

For example, here is a page put together by the Virginia State Auditor’s office reporting up-to-date payments by the state to various localities. The page even includes a button for exporting the information to an Excel spreadsheet in case you want to do more analysis or build your own graphics.

Mapping is a popular way to use data, especially with the advent of Google maps.

An early demonstration of this was chicagocrime.org, developed by Adrian Holovaty, a journalist and computer programmer. Holovaty is generally recognized as one of the best at building data applications for use in journalism. Besides chicagocrime, he has helped build projects for washingtonpost.com and recently won a grant from the Knight Foundation to bring the concepts of chicagocrime to other communities and other data. He explains how he does chicagocrime, and gives his thoughts on “journalist programmers,” in this interview with Online Journalism Review.

“Doing journalism through computer programming is just a different way of accomplishing (the usual journalistic tasks of gathering, distilling and presenting information). Namely, the technique favors automation wherever possible,” he says in the interview.

In this model, it’s “journalism” that conceives the “story” the data could tell. It’s the “programming” that provides the automation. And, especially when you are working with large datasets, automation is the only sane way to go.

What you need to know to get started

Obviously, some of the tools and skills required here are fairly sophisticated. “But people starting out need only curiosity and some basic tools,” Willis says. “They include knowledge of HTML and various data formats (spreadsheets, databases, text), a knowledge of building and querying data and, lately, experience with a programming language to automate most of those tasks.”

It also helps to have some familiarity, or at least a comfort level, with working with numbers, because most structured data is, in reality, a table full of numbers. If you don’t have the computer or math skills, you could “crowdsource” on your site and in your community to find someone who would be willing to help, either by teaching or by taking on projects. You might also consider building a partnership with a news organization that has computer-assisted reporting expertise.

The next thing you need is some data or “structured information,” which simply means that the information is arranged in a consistent form, such as rows and columns. You won’t have to go far to find it, Willis says. “The fairly easy starting point is to look online at government sites for information provided in a consistent format, whether that’s HTML, a spreadsheet or something more sophisticated. Then download them and start building your database,” he says. Excel (the spreadsheet program included in Microsoft Office) is a good choice to get started using these data tools.
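If you later move beyond a spreadsheet, the same rows-and-columns idea carries over to a short script. The sketch below is only an illustration of reading a downloaded file into rows keyed by column name; the column names (locality, payment) and the figures are invented for the example and do not come from any real agency file.

```python
import csv
import io

# Hypothetical export: column names and figures are made up for illustration.
SAMPLE_CSV = """locality,payment
Arlington,1200.50
Fairfax,980.00
Richmond,1500.25
"""

def load_rows(text):
    """Read CSV text into a list of dicts, one per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def total_payments(rows):
    """Sum the 'payment' column, converting each value from text to a number."""
    return sum(float(row["payment"]) for row in rows)

rows = load_rows(SAMPLE_CSV)
print(len(rows), "rows; total payments:", total_payments(rows))
```

With a real download you would open the saved file instead of the sample string, but the structure stays the same: every row has the same columns, which is what makes the information “structured.”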

The best data-driven applications are not static. Instead, they provide users with ways to ask questions and look at the information the way they want. See, for example, the Indianapolis Star’s Indy911 application, which tracks emergency calls in the region in real time.

Issues with the data

You also will want to ask the experts who build the databases, typically government officials, to supply you with the “record layout” of the tables or files you are using. This layout should give the names of the “fields” or “columns” in the file and tell you how many characters each one holds and what kind of information is stored in it.
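To see why the record layout matters, consider a file where every field sits at a fixed position on each line. Here is a minimal sketch of how a layout might be turned into a parser. The field names, positions and widths are hypothetical, not taken from any actual agency layout; a real one would come from the documentation the agency supplies.

```python
# Hypothetical record layout: (field name, start position, width, type).
# A real layout would be copied from the agency's documentation.
LAYOUT = [
    ("school_id", 0, 5, str),
    ("name", 5, 20, str),
    ("enrollment", 25, 6, int),
]

def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into named fields using the layout."""
    record = {}
    for name, start, width, kind in layout:
        raw = line[start:start + width].strip()
        # Blank fields become None rather than raising a conversion error.
        record[name] = kind(raw) if raw else None
    return record

# Build one sample line that follows the hypothetical layout above.
sample = "00042" + "Lincoln Elementary".ljust(20) + "   412"
print(parse_record(sample))
```

Keeping the layout in one list means that when the agency changes a field width, you change one line instead of hunting through the code.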

Here is an example from the federal Education Department on a dataset containing statistics about elementary and secondary schools. This data has been widely used by journalists trying to assess school performance and would be a great addition to a hyperlocal CitJ site. You might be a bit bewildered the first time you look at one of these things, but usually agencies are willing to provide experts to help you understand it better.

Once you start poking around in the data you will find mistakes, including typos, missing numbers, data entered in the wrong field — the list is endless. So, the first rule in working with data is “all data is dirty.” And dirty data produces errors in reporting.

Obviously, as journalists (professional or amateur) we don’t want to make mistakes. So be on the lookout for numbers that seem way out of line (a $1,000,000 political donation, for example, probably is a typo). But still, it probably isn’t possible to check every number and every name in a database such as the Federal Election Commission’s massive files on campaign contributions and spending.
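One way to stay on the lookout without reading every row is to let a script flag the numbers that seem way out of line so you can check them by hand. The sketch below assumes a simple rule of thumb, flagging any amount more than 50 times the median; that threshold and the sample donations are arbitrary choices for illustration, not a standard from the Federal Election Commission or anyone else.

```python
from statistics import median

def flag_outliers(amounts, factor=50):
    """Return amounts more than `factor` times the median (an arbitrary rule of thumb)."""
    typical = median(amounts)
    if typical <= 0:
        return []  # The rule is meaningless without a positive typical value.
    return [a for a in amounts if a > factor * typical]

# Made-up donation amounts; one looks like it has extra zeros.
donations = [250, 500, 1000, 100, 1_000_000, 2300, 750]
suspicious = flag_outliers(donations)
print("Check these against the original records:", suspicious)
```

A flagged number is not necessarily wrong. It is only a prompt to go back to the original record or call the person involved before publishing.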

One rule you could follow is that if you single out individuals or companies in your stories or posts, call them to double-check any information you plan to use. Sometimes, Willis says, “You can compare the results of your work to original records to make sure that it matches up.”

But despite the difficulties, there are major advantages to giving your audience access to data. Not only does it help people understand complex situations, but folks in the community can often bring new insights. Willis says one of the best features of providing data is that it creates “the ability to allow more people to ask meaningful questions of data.” He adds, “Professional reporters don’t always know all of the questions to ask a given set of data. In this case, the more eyes on the data, the better.”

Graphic display of Census data

You’ll find that graphics make it easier for people to understand complex information; that’s one reason why mapping has been so popular. A couple of sites really show the power of graphical display of information: manyeyes.com, which presents visualizations on hundreds of topics, and ilovemountains.org, which built a plug-in for Google Earth to show the devastation of mountaintop strip mining.

The Census Bureau is a treasure trove of information about communities. Try asking the American FactFinder section of the Census site for demographic information about a particular ZIP code. Sometimes getting to the data and making sense of it takes a bit of work, but a growing number of online tools can help you chart the data.

