AWS Redshift Optimization – A Case Study

AWS Redshift – An Overview


Amazon Redshift is a fast, fully managed, petabyte-scaled data warehouse solution that uses columnar storage to minimize Input/Output (I/O), provide high data compression rates, and offer fast performance. As a typical Data Warehouse, it is primarily designed for Online Analytic Processing (OLAP) and Business Intelligence (BI) and not designed to be used as an Online Transaction Processing (OLTP) tool. It supports ANSI-SQL and is a massively parallel processing database.

Redshift architecture (See Figure 01) consists of a tightly coupled EC2 Compute nodes cluster. The Redshift cluster is built on a single availability zone in order to negate any network latency issue between availability zones. Having all nodes in close proximity will reduce network latency and will improve performance.

Users can create one or more clusters with each cluster having multiple databases. Most of the time there is one Redshift cluster and additional clusters can be added for resilience purposes. Any cluster can have two types of nodes, namely a leader node and compute nodes.

The leader node facilitates the communication between the BI client and the compute nodes. Each leader node has a SQL end point. It coordinates the parallel query execution. When a request comes to the leader node, it parses the query and generates an execution plan and a compiled code to be executed in the compute nodes.

The compute nodes process the incoming requests in parallel. Each compute node has a dedicated CPU, memory and a storage. Each compute node can scale out/in and scale up/down (resizing the Redshift cluster). Each compute node consists of slices. The slices are portions of memory and disk. The data is loaded to the slices in parallel. It has a “shared nothing” architecture. All compute nodes are independent of each other and there is no contention between nodes.

Redshift can decide automatically how the data distributes between slices. Also a user can specify one column as the distribution key. When a query is executed, the query optimizer on the leader node redistributes the data on the compute nodes as needed in order to perform any joins and aggregations.

The Challenge

With time, as you load more and more data and apply DML commands, the performance can deteriorate. A US client from the healthcare sector wanted to apply best practices on to their Redshift cluster in order to speed up and improve its performance.

How Auxenta Helped

Leveraging Auxenta’s Redshift expertise, the team proposed a set of best practices and optimization strategies for the client’s Redshift cluster installation.

The Solution

The given Redshift cluster was analyzed based on the following key indicators.

  • Table Design
  • Sort keys
  • Compression
  • Analyze
  • Vacuum
  • Primary and foreign keys
  • Query Design
  • Queries with alerts
  • Queries affected by work load management (WLM) configurations
  • Tables being used in queries with maximum impact on query performance
  • Percentage of queries being affected by tables
  • Tables scanned in select join queries
  • Select queries in peak CPU usage
  • Tables using peak CPU usage
  • WLM Management
  • Queue resources hourly
  • Queue resources hourly with CPU usage
  • Query patterns per user/group
  • WLM configurations for Redshift

Benefits to the client

  • Identify the causes and performance drawbacks in the Redshift cluster
  • Guidelines for improvement
  • Gaining a good knowledge about Redshift optimization knowledge

References

Ishara Nuwan Karunathilake

Senior Tech Lead