{"id":2942,"date":"2019-06-26T00:36:07","date_gmt":"2019-06-25T22:36:07","guid":{"rendered":"http:\/\/miro.borodziuk.eu\/?p=2942"},"modified":"2019-09-01T00:51:42","modified_gmt":"2019-08-31T22:51:42","slug":"amazon-redshift","status":"publish","type":"post","link":"http:\/\/miro.borodziuk.eu\/index.php\/2019\/06\/26\/amazon-redshift\/","title":{"rendered":"Amazon Redshift"},"content":{"rendered":"<p>Redshift is a managed <strong>data warehousing<\/strong> solution, which can scale to <strong>petabytes<\/strong> or more.<\/p>\n<p><!--more--><br \/>\nUse Redshift with SQL clients, Business Intelligence (BI) tools, and Extract, Transform, Load (ETL) tools to generate reports.<\/p>\n<p>Redshift Spectrum allows you to run <strong>SQL queries<\/strong> against exabytes of <strong>unstructured data<\/strong> in <strong>S3<\/strong>. It scales compute capacity based on the data being retrieved.<\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"color: #3366ff;\">Redshift Architecture<\/span><br \/>\nA Redshift cluster is a set of nodes consisting of a <strong>leader node<\/strong> and one or more <strong>compute nodes<\/strong>. The type and number of compute nodes needed depend on the size of the data, the number of queries executed, and the required query execution performance.<\/p>\n<p><span style=\"color: #999999;\">Leader node<\/span>: Receives queries from client applications, parses them, and develops execution plans, which are ordered sets of steps for processing these queries. The leader node then coordinates the parallel execution of these plans with the compute nodes, aggregates the intermediate results from these nodes, and finally returns the results to the client applications. A cluster has exactly <strong>one<\/strong> leader node.<\/p>\n<p><span style=\"color: #999999;\">Compute nodes<\/span>: Execute the steps specified in the execution plans and transmit data among themselves to serve these queries. 
The intermediate results are returned to the leader node for aggregation before being sent to the client applications.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-2947 aligncenter\" src=\"http:\/\/miro.borodziuk.eu\/wp-content\/uploads\/AmazonPayments.jpg\" alt=\"\" width=\"588\" height=\"780\" srcset=\"http:\/\/miro.borodziuk.eu\/wp-content\/uploads\/AmazonPayments.jpg 588w, http:\/\/miro.borodziuk.eu\/wp-content\/uploads\/AmazonPayments-226x300.jpg 226w\" sizes=\"(max-width: 588px) 100vw, 588px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"color: #3366ff;\">Redshift vs. RDS<\/span><br \/>\nBoth enable you to run traditional RDBMSs. RDS is typically used for <strong>OLTP<\/strong> and reporting. Redshift is appropriate for <strong>very large<\/strong> data sets. Redshift provides excellent <strong>scale-out<\/strong> options and lets you run analytical queries without interfering with an OLTP workload.<\/p>\n<p>&nbsp;<\/p>\n<p>When to use which product?<\/p>\n<p><span style=\"color: #999999;\">RDS<\/span><\/p>\n<ul>\n<li><strong>OLTP<\/strong><\/li>\n<li>Read replicas across regions<\/li>\n<li>Snapshots in S3<\/li>\n<li>Lives inside VPC<\/li>\n<li>Security is DB user based<\/li>\n<\/ul>\n<p><span style=\"color: #999999;\">Redshift<\/span><\/p>\n<ul>\n<li><strong>OLAP<\/strong>, accessed via SQL<\/li>\n<li>Massive amounts of data<\/li>\n<li>Complex queries across multiple data sources<\/li>\n<li>Lives inside VPC<\/li>\n<li>Security is DB user based<\/li>\n<li>Best for structured data (e.g., CSV files)<\/li>\n<\/ul>\n<p><span style=\"color: #999999;\">DynamoDB<\/span><\/p>\n<ul>\n<li>Millisecond read latency<\/li>\n<li>Fully managed<\/li>\n<li>No manual backups required when point-in-time recovery (PITR) is enabled<\/li>\n<li>Security is IAM based<\/li>\n<\/ul>\n<p><span style=\"color: #999999;\">Athena<\/span><\/p>\n<ul>\n<li>SQL queries (Presto-based), with Apache Hive DDL for table definitions<\/li>\n<li>Single data source<\/li>\n<li>Queries generally faster than Redshift<\/li>\n<li>Security is IAM 
based<\/li>\n<li>Better for <strong>ad hoc<\/strong> querying<\/li>\n<\/ul>\n<p><span style=\"color: #999999;\">Elastic MapReduce (EMR)<\/span><\/p>\n<ul>\n<li>Based on Apache Hadoop<\/li>\n<li>Best for <strong>unstructured<\/strong> data<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Redshift is a managed data warehousing solution, which can scale to petabytes or more.<\/p>\n","protected":false},"author":1,"featured_media":2943,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[79],"tags":[],"_links":{"self":[{"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/posts\/2942"}],"collection":[{"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/comments?post=2942"}],"version-history":[{"count":7,"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/posts\/2942\/revisions"}],"predecessor-version":[{"id":2951,"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/posts\/2942\/revisions\/2951"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/media\/2943"}],"wp:attachment":[{"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/media?parent=2942"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/categories?post=2942"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/miro.borodziuk.eu\/index.php\/wp-json\/wp\/v2\/tags?post=2942"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}