Wednesday, June 11, 2014

Want to be a Data Journalist?

Data journalism is huge. I don't mean 'huge' as in fashionable - although it has become that in recent months - but 'huge' as in 'incomprehensibly enormous'. It represents the convergence of a number of fields which are significant in their own right - from investigative research and statistics to design and programming. The idea of combining those skills to tell important stories is powerful - but also intimidating. Who can do all that?




The reality is that almost no one is doing all of that, but there are enough different parts of the puzzle for people to get involved in easily, and go from there. To me, those parts come down to four things:



1. Finding data

'Finding data' can involve anything from having expert knowledge and contacts to being able to use computer-assisted reporting skills or, for some, specific technical skills such as MySQL or Python to gather the data for you (a minimal sketch of that scripted approach follows this list).



2. Interrogating data

Interrogating data well requires a good understanding of jargon and the wider context within which the data sits, plus statistics; a familiarity with spreadsheets can save a lot of time.



3. Visualising data

4. Mashing data

Visualising and mashing data has historically been the responsibility of designers and coders, but an increasing number of people with editorial backgrounds are trying their hand at both - partly because of a widening awareness of what is possible, and partly because of a lowering of the barriers to experimenting with them.

Tools such as ManyEyes for visualisation, and Yahoo! Pipes for mashups, have made it possible for me to get journalism students stuck in quickly with the possibilities - and many catch the data journalism bug soon after.
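On the 'finding data' point above, here is a minimal Python sketch of what gathering a dataset by script can look like. The URL is a placeholder, not a real dataset:

import csv
import urllib.request

# Placeholder URL - substitute a real CSV link from a data portal
url = "https://example.gov/spending.csv"
with urllib.request.urlopen(url) as response:
    text = response.read().decode("utf-8")

rows = list(csv.reader(text.splitlines()))
header, data = rows[0], rows[1:]
print(header)
print(len(data), "rows downloaded")

From here the rows can be pasted into a spreadsheet or fed straight into the interrogation stage.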



How to begin?

So where does a budding data journalist start? An obvious answer would be "with the data" - but there's a second answer too: "with a question".



Journalists have to balance their role in responding to events with their role as an active seeker of stories - and data is no different. The New York Times' Aron Pilhofer recommends that you "Start small, and start with something you already know and already do. And always, always, always remember that the goal here is journalism." The Guardian's Charles Arthur suggests "Find a story that will be best told through numbers", while The Times' Jonathan Richards and The Telegraph's Conrad Quilty-Harper both recommend finding your feet and coming up with ideas by following blogs in the field and attending meetups such as Hacks/Hackers.



There is no shortage of data being released that you can get your journalistic teeth into. The open data movement in the UK and internationally is seeing a continual release of newsworthy data, and it's relatively easy to find datasets being released by regulators, consumer groups, charities, scientific institutions and businesses. You can also monitor the responses to Freedom of Information requests on WhatDoTheyKnow, and on organisations' own disclosure logs. And of course, there's the Guardian's own datablog.



A second approach, however, is to start with a question - "Do speed cameras cost or save money?" for example, was one topical question that was recently asked on Help Me Investigate, the crowdsourcing investigative journalism site that I run - and then to search for the data that might answer it (so far that has come from a government review and a DfT report). Submitting a Freedom of Information request is a useful avenue too (make sure you ask for the data in CSV or similar format).



Whichever approach you take, it's likely that the real work will lie in finding the further bits of information and data to fill out the picture you're trying to clarify. Government data, for example, will often come littered with jargon and codes you'll need to understand. A call to the relevant organisation can shed some light. If that's taking too long, an advanced search for one of the more obscure codes can help too - limiting your search, for example, by including site:gov.uk filetype:pdf (or equivalent limitations for your particular search) at the end.



You'll also need to contextualise the initial data with further data. Say you have some information about a government department's changing wage bill, for example: has the department's workforce expanded? How does it compare to other government departments? What about wider wages within the industry? What about inflation and changes in the cost of living? This context can make the difference between missing and spotting a story.
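To make that contextual step concrete, here is a small Python sketch that adjusts a nominal wage bill for inflation. Every figure below is invented for illustration:

# All figures are invented for illustration, not real data
cpi = {2008: 100.0, 2009: 102.1, 2010: 105.3}  # hypothetical price index
wage_bill = {2008: 50_000_000, 2009: 52_000_000, 2010: 54_500_000}

base_year = 2008
for year in sorted(wage_bill):
    # Deflate to base-year prices so the years are comparable
    real = wage_bill[year] * cpi[base_year] / cpi[year]
    change = (real / wage_bill[base_year] - 1) * 100
    print(year, f"nominal {wage_bill[year]:,}", f"real {real:,.0f}", f"({change:+.1f}% vs {base_year})")

A wage bill that looks like growth in nominal terms can turn out to be flat, or even a cut, once deflated.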



Quite often your data will need cleaning up: look out for different names for the same thing, spelling and punctuation errors, poorly formatted fields (e.g. dates that are formatted as text), incorrectly entered data and information that is missing entirely. Tools like Freebase Gridworks (later renamed OpenRefine) can help here.
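As a sketch of what that clean-up can look like in code - pandas here, with an invented file and invented column names; Gridworks/OpenRefine does the same job interactively:

import pandas as pd

df = pd.read_csv("spending.csv")  # hypothetical file

# Normalise different names for the same thing
df["supplier"] = df["supplier"].str.strip().str.title()
df["supplier"] = df["supplier"].replace({"B.C.C.": "Birmingham City Council"})

# Parse dates that were stored as text; bad entries become NaT
df["date"] = pd.to_datetime(df["date"], errors="coerce", dayfirst=True)

# Surface missing or unparseable values for manual review
print(df[df["date"].isna() | df["amount"].isna()])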



At other times the dataset you need will come in an inconvenient format, such as a PDF, PowerPoint, or a rather ugly webpage. If you're lucky, you may be able to copy and paste the data into a spreadsheet. But you won't always be lucky.



At these moments some programming knowledge comes in handy. There's a sliding scale here: at one end are those who can write scripts from scratch that scrape a webpage and store the information in a spreadsheet. Alternatively, you can use a website like Scraperwiki, which already has example scripts that you can customise to your own ends - and a community to help. Then there are online tools like Yahoo! Pipes and the Firefox plugin OutWit Hub. If the data is in an HTML table you can even write a one-line formula in Google Spreadsheets - such as =ImportHtml("http://example.com/page", "table", 1) - to pull it in. Failing all the above, you might just have to record it by hand - but whatever you do, make sure you publish your spreadsheet online and blog about it so others don't have to repeat your hard work.
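At the scripting end of that scale, a short Python script like the sketch below can scrape an HTML table into a CSV. The URL and the page layout are assumptions for the example:

import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.gov.uk/stats.html"  # hypothetical page
soup = BeautifulSoup(requests.get(url).text, "html.parser")

table = soup.find("table")  # assumes the first table is the one you want
with open("stats.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        writer.writerow(cells)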



Once you have the data you need to tell the story, get it ready to visualise: trim off everything peripheral to the story you want to show. There are dozens of free online tools for the visualisation itself. ManyEyes and Tableau Public are good places to start for charts. This poster by A. Abela (PDF) is a good guide to which charts work best for different types of data.
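For example, a hedged pandas sketch of that trimming step, with invented file and column names:

import pandas as pd

df = pd.read_csv("department_spending.csv")  # hypothetical file

# Keep only the columns the chart needs, and only the top ten rows
chart_data = (df[["department", "spend"]]
              .sort_values("spend", ascending=False)
              .head(10))

# Ready to upload to ManyEyes or Tableau Public
chart_data.to_csv("chart_data.csv", index=False)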



Play around. If you're good with a graphics package, try making the visualisation clearer through colour and labelling. And always include a piece of text giving a link to the data and its source - because infographics tend to become separated from their original context as they make their way around the web.



For maps, the wonderful OpenHeatMap is very easy to use - as long as your data is categorised by country, local authority, constituency, region or county. Or you can use Yahoo! Pipes to map the points of interest. Both of these are actually examples of mashups, which is useful if you like the word "mashups" and want to use it at parties. There are other tools too, but if you want to get serious about mashing up, you will need to explore the world of programming and APIs. At that point you may sit back and think: "Data journalism is huge."



And you know what? I said that once.



Paul Bradshaw is the founder of Help Me Investigate and Reader in Online Journalism at Birmingham City University, and also teaches at City University London. He publishes the Online Journalism Blog.

Why we need more data journalism


Chris Walker · January 20, 2014



I read a lot of news. You might call me a news junkie, and I suspect many of you are news junkies, too. Every morning I dedicate a couple of hours to reading news articles from four or five news sites. I enjoy investing time in reading the news, because I like to be informed about important developments in the world and about the theories that attempt to explain them. Being informed, I believe, makes me a more enlightened citizen and a more interesting person. Armed with my daily news studies, I like to think that I can go out into the world and make better decisions as a voter and consumer.



I’ve been at it for several years now. And the early verdict on whether I’ve attained enlightened citizen status is, well, disappointing. Given the quantity of news I consume every day, I should understand the world far more deeply than I do. I feel informed about current events—after all I can spout off the major headlines of the day and even tell you the name of the Chinese president (with correct pronunciation). But there’s this gnawing sense that most of the news articles and blog posts I’m consuming are empty calories, and that I’m not getting any closer to the crux of things.



A promising development for journalism, and for those of us who hope to become better informed, is the rise of open data. According to the Open Knowledge Foundation:



Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike.

There are ever-expanding oceans of open data on the internet, free and accessible to the public. For example, data.gov, the U.S. government’s open data portal, now contains about 85,000 searchable datasets. That’s a lot of data. The wealth of information available on data.gov and similar websites can inform the broader public on many issues that matter to us, such as crime rates, healthcare outcomes, affordable housing construction, government budgets, the health of the economy, disease prevalence, quality of education, attitudes toward gay marriage, and equality of opportunity. You get the idea.



It might be tempting then to conclude that all this data is ushering us into a golden age of public discourse, in which citizens can easily become well-informed on any topic. But while the effort by governments and research institutions to publish open datasets is commendable, the availability of the data doesn’t necessarily make it accessible to most people. The main problem is that pretty much every open dataset looks like this:

[Screenshot: an excerpt of the raw 2012 ACS interstate migration spreadsheet]
This is an excerpt of interstate migration data from the 2012 American Community Survey (ACS), published by the U.S. Census Bureau. There are needles of truth buried in that haystack that are relevant to interesting questions on migration trends. But how is the average person supposed to figure out what the data has to say? The 2012 ACS migration dataset isn’t huge—it’s only about 70 KB—but it still contains over 6,200 individual data points. The irony is that the data is publicly available and free to use—it’s by definition open—but it’s presented in a format that’s essentially useless to the vast majority of people. This should come as no surprise to anyone familiar with the term big data. If I’ve learned anything from years spent doing data analysis and customizing data analytics software, it’s that even with small data it takes the right tools and a lot of work to separate the signal from the noise, to interpret what the data is really saying and how it relates to things people care about. It takes effort to distill 6,200 data points into a few useful insights.
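To give a sense of what that distillation looks like in practice, here is a hedged Python sketch that boils a state-to-state migration table down to a few headline numbers. The file name and column names are assumptions, not the actual ACS layout:

import pandas as pd

# Assumed layout: one row per origin-destination pair of states
df = pd.read_csv("acs_2012_migration.csv")  # hypothetical extract

flows = df.groupby(["origin", "destination"])["movers"].sum()
print("Largest single flow:", flows.idxmax(), flows.max())

# Net migration per state: arrivals minus departures
net = df.groupby("destination")["movers"].sum().sub(
    df.groupby("origin")["movers"].sum(), fill_value=0)
print("Biggest net gainers:", net.nlargest(5).to_dict())
print("Biggest net losers:", net.nsmallest(5).to_dict())

Even a dozen lines like these turn 6,200 raw values into something a reader can react to.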



More fundamentally, how would the average person even know to pull up that particular ACS dataset in the first place? One does not simply get up in the morning and casually peruse data.gov over a cup of coffee, looking for trends in interstate migration (okay, I do). You would already have to be interested in migration to find a dataset that sheds light on it. Put another way, discovery doesn’t happen without motivation, which means the bulk of those 85,000 datasets on data.gov are essentially invisible to the average person.



Data alone doesn’t lead to a better informed public; the other half of the equation, of course, is a journalism sector that’s able to use the data to enhance storytelling and communication of big complex issues. We already outsource much information-processing to bloggers and reporters, relying on them to curate the daily deluge of information involving everything from politics to pop culture. Asking the right questions and separating the signal from the noise, in the interest of the public, is exactly what good journalism is all about.



But there isn’t enough data-driven storytelling making its way into the news cycle. By data-driven storytelling, I don’t mean burying a handful of statistics in a long-form article. I mean using the wealth of data available today to put a story into its proper context, for example to convey the historical trends, the categorical patterns and outliers, and the geographic distributions relevant to the story. I may be biased because my background is in data analytics, and I’ll fully concede that not every issue can be presented with quantified information, but we could be getting much more value from open datasets.



Consider the variety of news stories that can be enriched by incorporating data on a topic as seemingly academic as U.S. migration trends. To list just a few issues, migration data helps us to better understand regional differences in economic hardship, the effectiveness of economic policy reforms, which cities face urban planning challenges, the ability of people to become entrepreneurs, the American psyche of reinventing oneself, and the evolution of party affiliation and political beliefs in battleground congressional districts.



The lack of depth in data reporting is related to a more general trend in journalism today, which is that news stories increasingly prioritize immediacy at the expense of context. We now learn about more developments from more parts of the world faster than we ever have before, but each story comes with shallower context. A recent example that sticks out in my mind is the U.S. government shutdown episode and the subsequent budget deal at the end of 2013. Covering the shutdown was an occasion for the news media to help the public better grasp the composition of the federal budget, how various proposals impacted components of the budget, and the relative impacts of budget proposals on the deficit and national debt. Instead, news coverage was more of a play-by-play of the mudslinging and partisan theatrics within Congress.



It’s important to point out that journalists aren’t solely responsible for the shift towards immediacy. It’s our fault too. Reading habits have changed, as we now have access to more news sources and are almost always plugged in to them, either on our mobile devices or desktops. As a result our attention spans are much shorter. When we open a news story, we want to get to the main point quickly, then swipe to the next item in our never-ending feeds.



There’s got to be a better way to tell the whole story without losing the reader’s attention. I believe one viable option is through data visualization. Journalists can address the tension between immediacy and context by integrating more interactive graphics into storytelling. A great data visualization can capture and hold a reader’s attention while also conveying broader context about the subject, literally painting the bigger picture for the reader. As an example of what I mean, here is the 2012 ACS migration dataset, presented as a visualization that anyone can explore.
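As a static stand-in for that kind of graphic, here is a minimal matplotlib sketch of net migration by state. The figures below are invented placeholders, not the ACS numbers:

import matplotlib.pyplot as plt

# Invented placeholder figures, not the real ACS numbers
states = ["TX", "FL", "CO", "NY", "IL"]
net_migration = [106_000, 84_000, 31_000, -95_000, -67_000]

# Colour gainers and losers differently so the story reads at a glance
colors = ["steelblue" if n >= 0 else "indianred" for n in net_migration]
plt.bar(states, net_migration, color=colors)
plt.axhline(0, color="black", linewidth=0.8)
plt.ylabel("Net domestic migration (people)")
plt.title("Net interstate migration, 2012 (illustrative)")
plt.tight_layout()
plt.show()

An interactive version adds hover details and filtering, but even a static chart like this carries more context than a paragraph of numbers.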

I’m launching datawovn.com to help address the issues in journalism discussed above. First, that a great wealth of knowledge is locked up in open datasets, and unlocking that knowledge requires more exploration and analysis of data by investigative journalists and independent bloggers. Second, too many stories prioritize immediacy over context, but an engaging interactive visualization can hold on to a reader’s attention while simultaneously conveying more substance than text could alone. We don’t have to resort to sensational sound bites. Data visualization is a powerful tool for communicating ideas and one that is especially suited to mobile and desktop browsing, which is how most of us consume news today.



Catching journalism up to the data-driven era can’t be accomplished by a single news outlet or blog. I firmly believe we need many more reporters and bloggers to integrate open data into their work. We need more data journalists. Part of the reason we don’t have more data journalists is a lack of familiarity with the tools for data journalism. If you’re interested in getting involved yourself, there are many great resources for getting started. Sign up for Alberto Cairo’s next MOOC on infographics and data visualization. Read Scott Murray’s book on developing interactive visualizations for the web. Check out ProPublica’s Nerd Blog, and all the incredible data analysis and visualization software tools compiled by Visualising Data.



I hope you enjoy interacting with the visualizations on this site, and more importantly that they help you to better understand a complex issue affecting our world, make you a slightly more enlightened citizen, and maybe even inspire you to investigate and report on some data yourself.



My goal is to keep this site free of ads or a paywall, and I’m working full-time on maintaining the site and producing data visualizations. Last summer I quit my job in New York City, moved to India, and have thrown myself completely into this project. It’s a lot of work, it’s been thrilling and terrifying, and I’ve only just started. Please consider supporting me with a recurring monthly subscription by visiting the About page.



If you’d like to comment on what I’ve written here, or have ideas for stories, drop me a note. I’d love to hear from you.



-Chris

Mumbai, India