Streams, Firehoses, and Buckets: An Introduction to AWS Data Analytics
AWS Data Analytics can mean different things to different organizations and can encompass a wide range of applications. In general, data analytics involves the use of statistical methods to describe data and extract trends.
The collection phase of AWS data analytics gathers raw data from a source and stores it in a database or other storage resource. Data collection resources are based on a publication and subscription model, where producers publish data to the resource and consumers subscribe to collect data. In AWS, two common methods of collecting data are Kinesis and SQS.
AWS Kinesis can be broken into Data Streams and Firehoses:
Kinesis Data Streams are a real-time solution. They can gather data from CloudWatch Logs, Kinesis Analytics, the Kinesis SDK, Kinesis Agents installed on EC2 or on-premises instances, or third-party libraries. The data is broken down into immutable data blobs, each attached to a partition key and a sequence number. The partition key determines which shard of the Data Stream the data blob is sent through.
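To make the key-to-shard routing concrete, here is a minimal sketch. Kinesis hashes the key attached to each record (called the partition key in the Kinesis API) with MD5 into a 128-bit number and sends the record to the shard whose hash range contains it. The function name shard_for_key and the assumption that shards divide the hash space into equal ranges are illustrative only; real shard ranges depend on how the stream has been split and merged.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: the shards split the hash space into equal, contiguous
    ranges. A real stream may have uneven ranges after splits and merges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so that any remainder at the top of the space maps to the last shard.
    return min(hash_value // range_size, shard_count - 1)
```

Because the mapping is deterministic, all records that share a key land on the same shard, which is what preserves per-key ordering.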
Data Streams are made up of one or more shards. Each shard has a limited throughput of 1 MB/s for producers and 2 MB/s for consumers. The number of provisioned shards can be changed to meet throughput requirements: an API call splits shards to increase capacity or merges them to decrease it.
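A rough sizing calculation follows from the per-shard limits quoted above, taken here as 1 MB/s in and 2 MB/s out. The helper below is a back-of-the-envelope sketch, not an official sizing tool; the function name is made up, and it ignores other per-shard limits such as records per second.

```python
import math

WRITE_MB_PER_SHARD = 1.0  # producer-side limit per shard
READ_MB_PER_SHARD = 2.0   # consumer-side limit per shard


def shards_needed(write_mb_per_s: float, read_mb_per_s: float) -> int:
    """Estimate the smallest shard count that covers both directions."""
    by_write = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    by_read = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write, by_read)
```

For example, 3.5 MB/s of writes needs four shards even if reads are light, while 1 MB/s of writes with 10 MB/s of reads is bound by the read side and needs five.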
Kinesis Data Firehose, on the other hand, is not quite real time. Firehose uses a buffer with a size limit and a time limit, which range from 1-128 MB and 60-900 seconds respectively. Data is written to the buffer, and when either limit is reached, the buffered data is delivered all at once to the destination.
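The "whichever limit is hit first" behavior can be sketched with a small in-memory buffer. This is a toy model for illustration, not Firehose itself: the class name BufferSketch is made up, timestamps are passed in by the caller, and the time limit is only checked when a record arrives, whereas the real service flushes on its own timer.

```python
class BufferSketch:
    """Toy buffer that flushes when a size or age limit is reached."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # time the first record of the batch arrived

    def add(self, record: bytes, now: float):
        """Buffer a record; return the flushed batch if a limit was hit, else None."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            return self.flush()
        return None

    def flush(self):
        """Hand over everything buffered so far and start a new batch."""
        batch = self.records
        self.records, self.size, self.opened_at = [], 0, None
        return batch
```

A larger buffer means fewer, bigger deliveries at the cost of latency, which is why Firehose is described as near real time rather than real time.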
Kinesis Firehose accepts producers similar to those of Data Streams but only has Redshift, S3, Elasticsearch, and Splunk as available consumers. However, unlike in Streams, the data in Firehose is mutable, and Firehose integrates with Lambda. Lambda functions can modify the data in transit or use an SDK to integrate with other AWS data analytics resources.
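A transformation Lambda for Firehose receives a batch of base64-encoded records and returns one result per record, each carrying the original recordId, a result status, and the re-encoded data. The handler below is a minimal sketch that assumes each record is a JSON object; the "processed" field it adds is a hypothetical transformation chosen only for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Decode each Firehose record, modify it, and hand it back."""
    output = []
    for record in event["records"]:
        # Assumption: every incoming record is a JSON object.
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # hypothetical transformation
        # A trailing newline keeps records separable once they are concatenated.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],  # must echo the incoming ID
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}
```

A production handler would also catch parsing errors and mark those records as failed rather than letting the whole batch raise.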
While Kinesis is used to stream data on a more consistent basis, another method of data collection worth considering is Simple Queue Service (SQS). SQS is a more traditional message queue: producers send messages to the queue, and consumers poll the queue, process the messages, and then remove them from the queue.
Once an SQS consumer has processed a message, it deletes the message from the queue, so each message is handled only once. Kinesis, by contrast, lets multiple consumers read the same data.
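The difference in consumption semantics can be illustrated with plain Python containers. This is a conceptual sketch only; neither structure is an AWS API, and the names consume_from_queue and StreamConsumer are made up for the example.

```python
from collections import deque

# Queue semantics: a message is gone once a consumer has taken and processed it.
queue = deque(["m1", "m2", "m3"])


def consume_from_queue(q):
    """Take the next message off the queue, or None if it is empty."""
    return q.popleft() if q else None


# Stream semantics: records stay in place and each consumer tracks its own position.
stream = ["r1", "r2", "r3"]


class StreamConsumer:
    def __init__(self, records):
        self.records = records
        self.position = 0  # private read position for this consumer

    def read(self):
        """Return the next unread record for this consumer, or None at the end."""
        if self.position >= len(self.records):
            return None
        record = self.records[self.position]
        self.position += 1
        return record
```

Two StreamConsumer instances created over the same list each see "r1" first, while two callers of consume_from_queue would receive different messages.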
Another limitation of SQS is that messages only support strings of at most 256 KB. A way around this is to send the data to an S3 bucket and send the metadata as the message, so that the consumer can access the data in the S3 bucket.
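The S3 workaround amounts to sending a pointer instead of the payload. The sketch below shows the idea with an in-memory dictionary standing in for the S3 bucket so that it runs without AWS access. The function names, the message layout, and the "payloads/" key prefix are all assumptions made for illustration.

```python
import json
import uuid

MAX_MESSAGE_BYTES = 256 * 1024  # size limit described in the text


def build_message(payload: str, bucket: str, store: dict) -> str:
    """Return a message body, offloading oversized payloads to the store."""
    body = json.dumps({"inline": payload})
    if len(body.encode("utf-8")) <= MAX_MESSAGE_BYTES:
        return body
    key = f"payloads/{uuid.uuid4()}"  # hypothetical key naming scheme
    store[(bucket, key)] = payload.encode("utf-8")  # stand-in for an S3 upload
    return json.dumps({"s3_bucket": bucket, "s3_key": key})


def read_message(body: str, store: dict) -> str:
    """Resolve a message body back to the payload, following a pointer if present."""
    message = json.loads(body)
    if "inline" in message:
        return message["inline"]
    # Stand-in for an S3 download using the metadata carried in the message.
    return store[(message["s3_bucket"], message["s3_key"])].decode("utf-8")
```

In a real deployment the dictionary writes and reads would be replaced by S3 uploads and downloads, and the consumer would need read permission on the bucket.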
A common task for data analytics is Extract, Transform, Load (ETL). The collection phase extracts raw data from a source and sends it into a cloud architecture. The processing phase transforms the raw data into structured data that is more usable for analytics and loads it into storage or sends it to another resource.
Lambda functions are serverless functions that are triggered by events. A Lambda function stores a script written in one of various languages; when the event is triggered, the script is executed.
For AWS data analytics, Lambda functions can be used for real-time processing or transformation. Events from Kinesis Firehose, SQS, or S3 can trigger a Lambda function to take the data and use the script to perform transformations. Lambda functions, being serverless, are meant for short, stateless processes. Lambda functions time out after a maximum of 15 minutes.
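As an illustration of an event-triggered function, the handler below sketches how a Lambda might react to an S3 notification by reading the bucket name and object key out of the event. It only collects the object locations; what happens to each object afterwards is left out, and the return shape is an assumption for the example.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """List the S3 objects named in an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications, so decode them.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append(f"s3://{bucket}/{key}")
    # The handler keeps no state between invocations.
    return {"objects": objects}
```

Keeping the handler stateless like this is what lets the platform run many copies in parallel as events arrive.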
To perform complex or long-running transformations, it may be necessary to split the script into multiple smaller, well-defined steps. These steps can be organized with Step Functions, which manages a sequence of Lambda functions called a workflow. In a workflow, the output of each function is used as the input to the next step.
When Lambda functions are set up in Step Functions, error handling, ordering, and state are managed by the service, allowing more complex processing jobs to be completed serverlessly.
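A workflow of this kind is described in Amazon States Language, a JSON format. The sketch below builds a hypothetical three-step definition as a Python dictionary, with a retry policy and a catch-all error route on the first step. The state names, the placeholder function ARNs, and the happy_path helper are assumptions for illustration, not part of any real deployment.

```python
import json


def task(function_name, next_state=None):
    """Build a Task state that invokes a (placeholder) Lambda function."""
    state = {
        "Type": "Task",
        "Resource": f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{function_name}",
        # Any unhandled error is routed to the Failed state.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Failed"}],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


definition = {
    "Comment": "Hypothetical three-step transformation workflow",
    "StartAt": "Validate",
    "States": {
        "Validate": task("validate", "Transform"),
        "Transform": task("transform", "Load"),
        "Load": task("load"),
        "Failed": {"Type": "Fail", "Error": "WorkflowFailed",
                   "Cause": "A step raised an unhandled error"},
    },
}

# Retry the first step with exponential backoff before giving up.
definition["States"]["Validate"]["Retry"] = [{
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}]


def happy_path(defn):
    """Follow the Next links from StartAt to list the normal execution order."""
    order, name = [], defn["StartAt"]
    while name is not None:
        order.append(name)
        name = defn["States"][name].get("Next")
    return order


state_machine_json = json.dumps(definition, indent=2)
```

The ordering lives in the Next links and the error handling in Retry and Catch, so the individual Lambda functions can stay small and unaware of each other.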
The Glue service provides multiple solutions for AWS data analytics. A crawler looks at unstructured data stored in S3 buckets and auto-discovers its schemas. Once a schema is discovered, tables are created in the database section and the schema can be registered in the Schema Registry. Schemas can be stored in the Data Catalog so that EMR or Athena can query S3 buckets with SQL code.
Glue Studio can be used to create ETL jobs, either with a low-code console or by uploading Python or Spark scripts to perform the processing. For Glue ETL jobs, a source is identified from S3, a relational database, Redshift, Kinesis, or Kafka. An extensive suite of predefined transformations (map, join, drop, filter, etc.) can then be applied to the source data. The transformed data can then be loaded into S3 or a Glue Catalog database.
Once the data has gone through processing and ETL, insights can be extracted from it. In the analytics phase, descriptions of the data and its trends can be inferred. Identifying maximums, minimums, variations, outliers, and cyclical or upward/downward trends helps to better understand the data.
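The descriptive measures mentioned here can be computed with nothing more than the standard library. The following is a minimal sketch; the function name describe and the two-standard-deviation rule for flagging outliers are assumptions chosen for illustration, and other thresholds or methods may suit a given dataset better.

```python
import statistics


def describe(values):
    """Summarize a list of numbers and flag values far from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    # Assumption: a value more than two standard deviations away is an outlier.
    outliers = [v for v in values if stdev and abs(v - mean) > 2 * stdev]
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "stdev": stdev,
        "outliers": outliers,
    }
```

For readings such as [10, 11, 9, 10, 12, 10, 50], this rule singles out the value 50 while leaving the rest alone.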
For real-time analytics, Kinesis Data Analytics integrates easily with the other Kinesis streaming services. As data passes through Kinesis Analytics, the service can auto-detect schemas for the data and allow analytics to be applied via SQL queries. The output of Kinesis Analytics can then be streamed back into a Data Stream or Firehose.
Athena uses the data catalog generated by Glue to allow unstructured data to be searched using SQL queries. Querying S3 directly removes the need for extra steps to load the data into a database before querying. Athena provides a powerful tool to perform ad-hoc analysis on unstructured data without managing servers.
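Submitting an ad-hoc query programmatically might look like the sketch below. It assumes a client object that exposes the Athena start_query_execution call, for example one created with boto3.client("athena"); the client is passed in so the function can be exercised without AWS credentials. The function name run_query and any database, table, or bucket names used with it are hypothetical.

```python
def run_query(athena_client, sql, database, output_location):
    """Submit a SQL query to Athena and return its execution ID.

    athena_client: an object with a start_query_execution method,
        such as boto3.client("athena") (assumed, not created here).
    database: name of the Glue Data Catalog database to query.
    output_location: S3 URI where Athena should write the results.
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    # The query runs asynchronously; the ID is used to poll for completion.
    return response["QueryExecutionId"]
```

Note that the call only starts the query. Results have to be fetched separately once the execution has finished.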
Athena works best when the data is in a common format in an S3 bucket. When data needs to be pulled from multiple sources, another service such as Redshift would be better suited.
QuickSight is an end-user-focused Business Intelligence (BI) tool for building dashboards and visualizing data. It draws on many different sources and types of data. As a service meant for end users, QuickSight provides an in-memory engine to create graphics, allowing users to quickly explore their data.
This overview is not an exhaustive list of the services that can be used to build a data analytics application, but these services are ubiquitous and would probably appear in most such applications. With so many services available, it may be hard to decide which one to pick given the costs and benefits associated with each.
Instead of compromising on requirements, AWS recommends decoupling the requirements and using the appropriate service to meet each one. For example, if data needs to be streamed in real time to identify outliers or spikes and also needs to be processed and stored, then instead of choosing between a Data Stream and a Firehose, use both.
A Kinesis Data Stream and Kinesis Analytics can provide the real-time streaming and analytics needed to identify outliers and spikes. A Kinesis Firehose can be used in parallel to consume the same data and deliver it to an S3 bucket, where Glue can then perform ETL to process the data.