Doing your Database on AWS (Part 4)

If you would like to read the other parts in this article series please go to:

Introduction

In Part 1 of this series, we discussed the evolution of the digital database and the benefits and drawbacks of DBaaS (Database as a Service) and we provided a brief overview of Amazon Relational Database Service (RDS). In Part 2, we continued that discussion with an examination of how to work with an RDS instance in a Virtual Private Cloud (VPC) and how to move a database instance into a VPC if it isn’t in one already. In Part 3, we talked about Amazon’s DynamoDB services, which would be your database option of choice for less structured and less predictable data that doesn’t lend itself to the traditional SQL-based tabular model.

Going big: data warehouse in the cloud

Data warehousing refers to storing large amounts of data, often from multiple sources, in a central repository where a single query engine can be used to view data from different departments and transaction processing databases across the entire enterprise. This provides a consistent structure, presentation and data model and makes it easier to analyze the data for business intelligence. Data warehouses are designed for optimal analytic access patterns.

Data warehousing has been around at least since the 1980s, but the concept has evolved – from offline data warehouses to real time data warehouses to integrated data warehouses and now to cloud-based data warehouses.

For those organizations whose data storage and database needs extending as far as the petabyte range, Amazon Web Services has a service for you. It’s called RedShift and it gives you a way to warehouse tremendous amounts of data in the cloud in a cost effective way, and benefit from a fully managed solution without sacrificing performance or security.

Building and maintaining a traditional data warehouse can be prohibitively expensive. The upfront and ongoing costs involved in procuring the physical space and the large amount of hardware required for storing and processing multiple terabytes or petabytes of data represents a huge investment, even on a successful enterprise’s budget. Even though per-unit storage costs continue to decline, the sheer amount of data that we collect is increasing at an even faster rate.

Assuming all of this data isn’t going to just sit there in archives, but it is actually going to be used, in order to achieve maximum functionality and performance you need a parallel processing environment so that indexing, caching, accessing, analyzing and performing normal operations (creating, updating, deleting) of data can be done quickly. But the implementation of large scale parallel processing is costly.

You also need to take into consideration the fact that the infrastructure isn’t the only cost that will increase exponentially as the amount of data increases. Administrative overhead will also increase. Managing a data warehouse properly requires a number of employees, and personnel costs can be another very expensive factor. Of course, even before you start to construct your data warehouse, you have the cost of consulting and design; the last thing you want is a massive data warehouse that “just grew that way.”

Once you do have the data warehouse infrastructure in place, there will be a fairly hefty upfront cost in the form of ETL (extract, transform and load), which refers to the process of transforming the data from your legacy applications to the data warehouse environment. Of course, you’ll also have the cost of the server operating systems and the database and data management software. If all of these costs seem overwhelming, it makes sense to consider cloud-based data warehousing as an alternative.

But first: Factors to consider before moving your data warehouse to the cloud

As with any decision to move resources to the cloud, there are a number of factors to consider. Organizations shouldn’t get so excited about the potential cost savings that they overlook the necessity of a careful evaluation to determine whether a cloud-based data warehousing solution is the best choice in a particular situation.

Of course, one of the important considerations is security. You need to assess the risk of putting your information into the cloud based on its sensitivity. If your organization operates in a regulated industry, you also must think about compliance requirements and whether they can be met by your cloud provider.

In addition to security, one of the biggest concerns regarding cloud-based data warehousing is performance. Internet connectivity speeds have increased to an amazing degree over the years, but still are and probably always will be behind the transfer rates that can be achieved over a local network. However, the network transfer rate isn’t the only thing that affects how quickly you’re able to access warehoused data. There are “tricks of the trade” that cloud providers can use to provide you with overall performance that is higher than that of a traditional database.

The volume of data that you will be storing and how much data will be loaded on a daily or weekly basis are some other things to think about as part of the decision-making process. Some cloud providers have limitations on maximum capacity that might exceed your needs. In other cases, if you’re just getting started with your data warehousing, some providers might have a minimum capacity for which you have to pay that is greater than what you need.

First look at Amazon Redshift

Amazon Redshift attempts to address all of these issues with features to improve security and performance and flexible options to make it more cost effective for different needs and situations.

For example, Amazon Redshift allows you to get started with a single node data warehouse with 160 GB of storage. You can scale all the way up to a 1.6 PB multi-node configuration as your needs expand. And of course, as with most AWS services, you can take advantage of the free trial to evaluate whether Redshift works for you. If you already have an AWS account, you can easily sign up for Redshift via the AWS Management Console or you can create a new account and sign up on the Amazon Redshift web page.

Security is controlled through Redshift cluster security groups. Using the principle of least privilege, clusters are completely locked down by default (nobody has access). You must create security groups and associate them with a cluster in order to give users access. A default security group is created when you create the first cluster, but it is empty. You have to add inbound access rules to the security group, and you can create your own security groups for more fine-grained management of inbound traffic when you don’t want to provide the same access to all users.

You can enable database encryption to protect the warehoused data when it is at rest; this will encrypt the active cluster and any backups, including the system metadata, with AES 256 encryption. This is optional but recommended and of course if your data is sensitive in any way and/or if you operate in a regulated industry, encrypting your data is mandatory. Of course, the down side of encryption is that it will negatively impact performance.

Speaking of performance, Amazon claims performance “up to ten times” higher than traditional databases for warehousing and analytics workloads. This is because of their use of columnar data storage, advanced compression technology and massively parallel processing (MPP) that automatically distributes the data and queries across multiple nodes (if your cluster has multiple nodes).

Data warehousing is a complex subject, so you might need to do a little studying to become familiar with the nomenclature and concepts. Amazon attempts to make it as simple as possible but you’ll need to know, for example, the difference between a leader node (which receives the client queries, parses them and creates executions plans) and a compute node (which is responsible for carrying out the steps that are delineated in the execution plan and sending results back to the leader node, which aggregates the data and returns it to the client applications).

There is also another distinction, between dense storage and dense compute nodes. As their names imply, the first is optimized for maximum data storage and the latter for highest performance for your workloads.

You need to understand all of this because part of setting up your Redshift data warehouse cluster is specifying the number of nodes and node types that will make up your cluster and the type of node determines its storage capacity, memory and CPU resources. You can have as few as 1 node (depending on node size; not all sizes allow for single node clusters) in a cluster. On the other end, it’s possible to have up to 128 nodes in a cluster (depending on node type), but there is a default limit of 16 nodes per region (per account). However, you can request more.

Summary

In this, Part 4 of our series on AWS databases in the cloud, we took a brief look at Amazon Redshift and how it can be used for data warehousing without the very high investments required to set up and maintain traditional data warehouses by those organizations with massive amounts of data to store and analyze. This concludes our series and I hope it helps you to understand some of the options that are available for you if you’re considering moving your databases to AWS.

If you would like to read the other parts in this article series please go to:

Leave a Comment

Your email address will not be published.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Scroll to Top