Advancing Science by Making Big Data More Social for Researchers

Making the huge volume of data out there useful to humans is about contextualizing it, but if Facebook has taught us anything, it is that this must be done responsibly.

Back in 2012 Forbes Magazine declared that data was “the new oil,” and nobody these days disputes the fact that there is a ridiculously large amount of useful data available out there. The use of HTTP as an access method and semantic web languages as interchange formats has turned the Web into the largest decentralised database the world has ever seen. The problem we face, however, is that there are major issues around reliability, accessibility and socialisation of that data that stop it from being as universally useful as it could be.


It was Tim Berners-Lee who once said that the next evolution of the World Wide Web – or Web 3.0 if you prefer – would be about the “Giant Global Graph”. What he was talking about was Big Data, but in a social dynamic context which people can easily access and take advantage of. In the words of Gerhard Weikum, Research Director at the Max Planck Institute for Computer Science, nearly all experimental “Big Data” is “utterly boring,” with evaluations ending up in “completely synthetic data with a synthetic workload that has been beaten to death for the last twenty years”. To make this data “interesting”, what he proposes is to bring Big Data and Open Data together, creating Linked Open Data.

Major administrative authorities already publish their statistical data in a Linked Data aware format, but the actual value of these datasets is not fully exploited, because data needs context to be of value, and “socialising” is what provides such context. One example of this is the Digital Agenda EU Portal, which has a huge number of datasets on important European indicators, but does not allow people to share their findings or to discuss their interpretations. This means that the context, which gives the data most of its meaning, is simply missing.


That is the problem that a group of EU-funded researchers are trying to tackle, together with industry partners such as London-based research collaboration platform Mendeley. They launched an open beta version of 42-data, a portal that aims – as the name will suggest to any fans of “Hitchhiker’s Guide to the Galaxy” – to provide answers to the universe and everything by socialising statistical data. This is the main output of the CODE project, which has a remit of facilitating “community-driven insight generation” by lifting non-semantic web data silos into an ecosystem around Linked Open Data, bootstrapped by micropayments and trust mechanisms. Their goal is essentially to create a “flea market for research data” by combining crowdsourced workflows with offline statistical data. This would create a Linked Open Data cloud capable of generating customised datasets to back up and answer all manner of research questions.

Scientific articles are obviously the perfect fodder for this cloud database, but they come with a major problem attached: most papers are in PDF format, which means that it’s difficult (not to say impossible) to extract the primary research data contained in tables and figures. The CODE project, which Mendeley participates in, addresses this by reverse-engineering the paper to extract this information in a format that can then be easily processed and analysed.

“We have a long-standing partnership with Mendeley which started with the TEAM-Beam project, and was then extended to CODE,” says Professor Michael Granitzer, from the University of Passau, Germany, the academic partners responsible for the 42-data portal. “The vision with CODE is basically to make the daily lives of researchers a bit easier, and Mendeley is the perfect partner for that, because it already offers so many tools like the group collaboration and the open API. In the scope of the CODE project, we developed and deployed lots of tools to analyse research publications. Most of that analysis consists of information from inside the paper itself, the primary statistical research data such as tables. We enrich this analysis with linked Open Data to generate meaningful insights and broaden a researcher’s view in a user-friendly way, with sophisticated visualisations that can generate interactive charts and other assets for their research,” he explains.

As a researcher, you tend to spend most of your time actually making (or trying to make) sense out of datasets, and this process means that you have less time to come up with interesting insights and advance research in your field. Take, for example, a researcher preparing to write up their paper: Why is the proposed approach better? The hypothesis they’re putting forward must be backed up by meaningful data, so they are faced with the task of extracting and aggregating statistical primary research data that is stored in tables within various research papers, and then combining, comparing and contrasting this with their own evaluation data. Without integrating these workflows, you’d need a plethora of tools, especially since copying and pasting from PDFs does not work for this type of data. Within 42-data, however, Mendeley hosts and pre-processes those papers, using the Know-Center services to extract the information in a format that is easy to process and manipulate.

The platform itself collects that table-based data and the University of Passau uses Mendeley’s API to merge all those single results accordingly, creating a “data cube” of merged and linked data. This data cube presents the researcher with an integrated view of all those disparate data sources. “But that’s not even the end of it, as a data cube can then be enriched with Open Data to offer up even more insights,” concludes Granitzer. The analysis and discovery is thus not limited to the initial dataset, as the platform offers virtually endless possibilities for customised mixing and matching within what 42-data calls “Data Cubes” to address specific research questions and needs. Individual cubes can be interconnected and aggregated using a graphical interface, which guides the user through and warns of any integrity constraint violations, and how these can be solved, by modifying its structure.
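To make the idea of a data cube a little more concrete, here is a minimal Python sketch. It is not the 42-data implementation. It assumes that each value extracted from a table is an observation with two dimensions and one measure, and that an integrity violation means two sources disagree on the same cell. The names `Observation` and `merge_into_cube`, the dimensions `method` and `dataset`, and the sample values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One table cell extracted from a paper (hypothetical structure)."""
    source: str      # paper the value came from
    method: str      # first dimension
    dataset: str     # second dimension
    accuracy: float  # the measure being compared


def merge_into_cube(observations):
    """Merge observations into a cube keyed by (method, dataset).

    Returns the cube plus a list of integrity violations: cells where
    two sources report different values for the same coordinates.
    """
    cube = {}
    violations = []
    for obs in observations:
        key = (obs.method, obs.dataset)
        if key in cube and cube[key].accuracy != obs.accuracy:
            # Assumption: keep the first value and flag the conflict
            # so that a person can decide how to resolve it.
            violations.append((key, cube[key].source, obs.source))
            continue
        cube.setdefault(key, obs)
    return cube, violations


# Illustrative, invented rows standing in for tables from three papers.
rows = [
    Observation("paper-A", "svm", "corpus-1", 0.81),
    Observation("paper-B", "crf", "corpus-1", 0.84),
    Observation("paper-C", "svm", "corpus-1", 0.79),  # disagrees with paper-A
]
cube, violations = merge_into_cube(rows)
```

Here the cube ends up with two cells, and the disagreement between the first and third rows is reported rather than silently overwritten, which is roughly the kind of warning the graphical interface is described as giving.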

This uncovers some exciting possibilities for accelerating scientific discovery; if some of the sensemaking legwork was automated by such portals, we could see the emergence of a virtual meeting place for people interested in getting insights from such Open Data sets, similarly to how Mendeley users interact in groups based around their research interests. “It is a well-known fact that discoveries in academia come out of intense communication processes, and that is what we’re looking to support,” says Florian Stegmaier, Senior Researcher at the University of Passau. “In addition, the social/crowdsourcing aspect of the platform means that we’re going way beyond the text-based model of asking questions, broadening the scope of discussions to include virtually everything. You could assess the suitability of your research ideas based on existing data, ask for statistics to be included in a paper, or simply discuss a range of published papers to get an in-depth view of the subject,” he enthuses.

But analysing, integrating and sharing data comes with associated costs, as does running such a portal. Beyond the initial EU Seventh Framework Programme grant, how does 42-data actually propose to fund itself? “It’s crucial to establish a value chain for data that creates a positive benefit-to-cost ratio, and we are doing that through two main mechanisms: Reputation and Donations,” Michael Granitzer explains. Reputation is certainly the core motivation driver in the crowdsourcing ecosystem, as we’ve seen with Stack Exchange and GitHub, amongst many other high-profile examples. The 42-data team set out to provide a similar proposition, where users contribute to open-source data projects, analysing data sets and creating interesting insights.

In order for this reputation model to work within the Web of Data you need to establish provenance. This means there is a solid chain of data, which tells you the origin or source of every individual piece of information within that chain. That includes records about which individuals were involved in creating, changing or extracting the data at any given point in time. If a particular person generates a data cube with their query, their ID is stored in that cube to guarantee this reproducible mapping (in the case of data extracted from a paper hosted on Mendeley, the metadata referring to the author name, abstract, publication date, academic status, discipline, research interests, etc. is automatically extracted and linked to the cube). The plan as the platform develops is to triangulate this information with community ratings and recommendation algorithms to produce a “user trust score” that will further feed the reputational ecosystem.
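One way to picture such a provenance chain is as an append-only log attached to each cube. The Python sketch below is an illustration under that assumption only. The article does not describe how 42-data stores provenance, and the class names, fields and user IDs here are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """Who did what to the cube, and when (hypothetical fields)."""
    agent_id: str
    action: str
    timestamp: datetime


@dataclass
class DataCube:
    name: str
    provenance: list = field(default_factory=list)

    def record(self, agent_id, action):
        """Append a provenance entry; earlier entries are never changed."""
        entry = ProvenanceRecord(agent_id, action, datetime.now(timezone.utc))
        self.provenance.append(entry)

    def contributors(self):
        """Return the distinct agents in order of first involvement."""
        seen = []
        for entry in self.provenance:
            if entry.agent_id not in seen:
                seen.append(entry.agent_id)
        return seen


demo = DataCube("demo-cube")
demo.record("user-1", "extracted table from paper")
demo.record("user-2", "modified dimension labels")
demo.record("user-1", "merged with open dataset")
```

Because every change is logged with the ID of the person responsible, a later trust score could in principle be computed from who appears in the chain, although how the platform would weight that is not specified.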


Donations also provide monetary incentives, in the community-driven financing model that Wikipedia pioneered. A “revenue chain” is created by allowing people to donate to users, questions, answers or resources that they find particularly helpful. The idea is to explore the long tail of micro-payments by keeping it flexible. You can target your donation to a specific user or, if it’s a collaborative effort, split it between multiple targets. Users’ trust and reputation scores on the site also influence how well they do out of those transactions, which it is hoped will foster a stronger and more cooperative community. “The complete ecosystem is driven by trust and reputation mechanisms. The higher the trust is, the more likely one will donate for something,” says Granitzer.

What we see with data today is a similar situation to what we had in the era prior to Web 2.0, where there was a lot of content around, but socialisation over that content was not enabled. Just as we’ve seen with the social media boom of recent years, however, there is now an opportunity and appetite for creating communities of interest around the socialisation of data. Through exploring Linked Open Data, users should be empowered to aggregate and integrate interesting data, quickly tailoring it to their specific research needs. That is, however, just the first step, as this increased socialisation could make these datasets accessible to non-scientists as well. The growing momentum of the Citizen Science movement goes to show the enormous potential of opening up science in this way, and the possibilities that this opens up are truly amazing.


Originally published in the Huffington Post

Alice Bonasio is a VR Consultant and Tech Trends’ Editor in Chief. She also regularly writes for Fast Company, Ars Technica, Quartz, Wired and others. Connect with her on LinkedIn and follow @alicebonasio and @techtrends_tech on Twitter.
