Big Data: The Security Perspective (Part 1)

If you would like to read the next part in this article series please go to Big Data: The Security Perspective (Part 2).


In the IT industry, the timeline can be measured by the emergence of buzzwords (or buzzphrases). Catchy terms that make for good soundbites, they’re often not fully understood even by their biggest proponents and most respected “experts.” One of the hottest buzzwords of the past year was “big data,” but what does it really mean? And what does it mean in terms of security? In this article, we’ll talk about the big data trend from a security perspective.

In this, Part 1, we’re going to address the problem of securing big data and in Part 2, we’ll look at whether and how big data can be used to enhance overall network security.

The simplest definition of “big data” is exactly what it sounds like: massive amounts (think petabytes, exabytes, zettabytes and beyond). A zettabyte is equal to a trillion gigabytes (or, in binary units, 1,099,511,627,776 GiB). That’s a lot of data by anybody’s standards. Of course, the amount of data that constitutes “big” changes over time. When I bought my first IBM PC with its massive 10MB hard drive, the “puny” 500GB drives that come on today’s $299 computers would have been seen as the repository for an enormous amount of data. Twenty years from now, multi-domegemegrottebyte drives (a whimsical, unofficial name for 10^33 bytes) might be commonplace.
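To put those scales in perspective, here is a small Python sketch converting between decimal (SI) byte units; the unit list and helper names are my own illustration:

```python
# Decimal (SI) byte units: each step up is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * 1000 ** UNITS.index(unit)

def humanize(n_bytes):
    """Render a byte count in the largest unit that yields a value >= 1."""
    for i in reversed(range(len(UNITS))):
        if n_bytes >= 1000 ** i:
            return f"{n_bytes / 1000 ** i:g} {UNITS[i]}"
    return f"{n_bytes} B"

# One zettabyte really is a trillion gigabytes:
print(to_bytes(1, "ZB") // to_bytes(1, "GB"))  # 1000000000000
```

Running the conversion both ways is a quick sanity check on just how far a zettabyte outstrips the drives most of us work with.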

But “big data” isn’t only about volume. It’s also about complexity – the relationships, hierarchies and links between data points and data sets. And it’s about velocity – the speed at which data is produced and processed. Finally, it’s about variety – structured, unstructured and semi-structured data.

Data is considered “big data” when traditional relational database approaches to processing it no longer work: because of the sheer volume of data, because of the velocity at which it is collected or must be processed, or because of its unstructured or semi-structured nature.

Big Data Analytics

The benefit of collecting such massive amounts of data is the potential ability to glean information from it that businesses can use in the decision-making process. “Big Data Analytics” is another currently popular buzzword that refers to methods of “data mining” (yet another buzzword) to extract the relevant information that can be used for trend spotting, predictive modeling and forecasting.

The most basic form of analytics is encompassed by “business intelligence” (which you might remember as a top buzzword from a few years back), which focuses on identifying and understanding what has happened in the past. Modern analytics goes further, applying the analysis to predict what will happen in the future.

I recently wrote an article for the GFI Talk Tech to Me blog called “TMI/NEK: Too Much Information, Not Enough Knowledge.” Big data analytics seeks to turn the overload of information that is big data into knowledge that can be useful to a business.

Where does security come in?

There are two different aspects from which we can examine the relationship between big data and security:

  1. The task of securing the big data itself, and
  2. The use of big data analytics to spot security trends and make security-related predictions.

We’ll look at each of these and what you need to know about each when working with big data.

Securing big data

You might think securing big data would be pretty much the same as securing smaller volumes of data, but there are important differences. Going back to the complexity factor that we mentioned above, big data generally derives from many different sources. Because the different data points are likely to physically reside in a variety of locations, you have to be aware of how access restrictions are configured. You must keep track of where the data is stored, who has access to it, and how they are able to access it (i.e., from the internal network, from outside the network via mobile devices, etc.).

You also need to assess the nature of the data and pay special attention to data that requires high security, which includes both data that belongs to you (trade secrets, intellectual property, business plans and strategies) and data that belongs to others (client social security numbers, credit card and bank account numbers, medical records, employee personal information).

You’re obviously looking at many of the same issues you always deal with in securing data: internal and external threats. But the volume and complexity of the data complicate the security processes. Add to that the fact that Apache Hadoop, the open source framework most commonly used for big data deployments and for the distributed applications that process big data, is Java-based and was not designed with security as a priority.

That means security is “tacked on” as an afterthought, rather than incorporated deeply into the design of the systems. The good news is that there are security solutions targeted specifically at this problem. For example, SHadoop (Secure Hadoop) is layered on top of Hadoop in Zettaset implementations to provide Kerberos authentication, role-based security and audit logging.
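For instance, enabling Kerberos (“secure mode”) in a stock Hadoop deployment starts with settings like the following in core-site.xml; this is only a minimal fragment, and a real deployment also needs keytabs and per-service principal configuration:

```xml
<!-- core-site.xml: switch Hadoop from the default "simple" auth to Kerberos -->
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
```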

These distributed clusters of computers that are managed by Hadoop present new security challenges that can’t be adequately secured using “old school” IT technologies that rely on perimeter boundaries (firewalls, IDS/IPS and other “edge” solutions). This means security must move inward, with controls that are designed to protect the data itself rather than the network (akin to placing your valuables in a safe so that even if a burglar does penetrate your perimeter security by getting into your house, there is still a strong barrier to getting to the important assets).
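A minimal sketch of that data-centric idea, using only Python’s standard library: pseudonymize a sensitive field with a keyed hash (HMAC) so records stay joinable for analytics even if an attacker reaches the data store. The key, field names and record layout here are hypothetical; in practice the key would live in a key management system, not in code.

```python
import hmac
import hashlib

# Hypothetical key for illustration only; store real keys in a KMS/vault.
SECRET_KEY = b"rotate-me-and-store-in-a-vault"

def pseudonymize(value: str) -> str:
    """Replace a sensitive value with a keyed, deterministic token.
    Same input -> same token, so joins and aggregation still work,
    but the raw value can't be recovered without the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

record = {"name": "Alice Example", "ssn": "123-45-6789", "zip": "90210"}
safe = {**record, "ssn": pseudonymize(record["ssn"])}
```

The point of the design is that protection travels with the data itself: even a copy exfiltrated from a Hadoop node carries only tokens, not the original identifiers.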

We also have to look at the analytics tools. The problem here is that there are dozens of different tools that work in very different ways, many of them free/open source software. Because many of them are specialized for handling particular types of data, you may end up using a mix of tools together. Because they are free, there may be no centralized support or point of accountability for the tools’ security.

Because big data is about utilizing data stored across different company departments or domains, responsibility for the security of specific sets of data has to shift. It becomes necessary to take an organization-wide approach to security rather than applying security measures in the context of separate, isolated “islands” of data.

The unstructured data that makes up much of big data has to be stored and sorted differently from the highly structured data that many IT security professionals are used to dealing with. Traditional (SQL) database management provides many security controls, such as multi-factor authentication, data encryption, and firewalls protecting the (contained) databases. The “NoSQL” systems used in big data infrastructures offer fewer and less sophisticated security controls, often limited to Kerberos authentication and access control lists.
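The coarse-grained access control lists such systems expose can be sketched roughly like this; the table names, roles and permission sets are hypothetical:

```python
# Hypothetical coarse-grained ACL: per-table allow-lists of principals,
# roughly the granularity older NoSQL stores offered (contrast with
# row- or column-level controls in mature SQL databases).
ACL = {
    "customer_records": {"read": {"analytics", "support"}, "write": {"ingest"}},
    "web_logs": {"read": {"analytics"}, "write": {"ingest"}},
}

def allowed(principal: str, action: str, table: str) -> bool:
    """Return True if the principal may perform the action on the table."""
    return principal in ACL.get(table, {}).get(action, set())
```

Note what such a model can’t express: there is no way to let the support role read customer records except the social security number column, which is exactly the kind of gap that pushes security controls down to the data itself.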

Because a big data ecosystem contains such a plethora of information, it’s an especially attractive target to attackers. This makes it even more important to ensure that it’s secured. But as always, security must be balanced with accessibility and performance because, in general, the first is on the opposite end of a continuum from the last two. More security usually equals less accessibility and decreased performance. This can matter more in a big data environment because in many cases, the usefulness of the data is dependent on real-time processing and access.

In some ways, big data security is where network security in general was 15 years ago, and companies may soon find themselves in the same “it just grew that way” position. Organizations are excited about the potential uses of big data, and that may lead to pushing the security implications onto a back burner.

Solutions to the problem of securing big data require thinking outside the box in which data has traditionally been confined. Technological solutions are only part of the answer. You also need strong policies that govern how data is handled and where it is stored; otherwise you’ll find that the data is more distributed than you’d like – including across external systems such as mobile devices and laptop and home computers that are not under the control of the organization’s IT security team. A first step is a data discovery effort to locate all of the data so it can be classified according to sensitivity and security needs.
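That discovery-and-classification step can be sketched with a simple pattern scan; the regular expressions below are illustrative only, and real discovery tools use many more signals (context, checksums, proximity of keywords) to avoid false positives:

```python
import re

# Illustrative patterns only; production data-discovery tools are far richer.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify(text: str) -> set:
    """Return the set of sensitive-data categories found in a blob of text."""
    return {label for label, pat in PATTERNS.items() if pat.search(text)}

sample = "Contact alice@example.com, SSN 123-45-6789 on file."
# classify(sample) flags both the email address and the SSN.
```

Run over every discovered data store, a scan like this yields the sensitivity labels that drive where the data may live and who may touch it.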

The steps for securing big data involve all of the following:

  • preventative measures (setting up the correct access controls and encrypting data)
  • monitoring both the data itself and data access on an on-going basis to detect changes in security needs and to identify threats and potential threats
  • securely deleting data that is no longer useful to the organization, so it can no longer pose a risk

In regard to the last, once bitten by the big data bug, organizations are prone to becoming “data hoarders,” collecting more and more data just for the sake of increasing the volume and keeping it long past its useful life and the time when it should be “thrown away.”


Big data holds a big potential for businesses, as it can be a powerful tool that – used properly – will provide information that can be leveraged to an organization’s advantage. However, it can also create big problems if the security issues inherent in collecting, storing, and using it are not addressed up front.

In Part 2 of this series, we’ll look at big data security analytics, or how big data techniques can be applied to security data in order to improve overall network security.

