Scraping ETF KPIs with AWS Lambda, AWS Fargate, and Alpha Vantage & Yahoo Finance APIs
Note: The section Wrapping Up links to the complete project codebase. You can always return to this guide for a more detailed walkthrough of the project design and infrastructure.
Overview
In this post, we will create a Python 3-based daily scraper that gathers key performance indicators and metrics for actively listed Exchange-Traded Funds (ETFs) using the Alpha Vantage and Yahoo Finance APIs. The scraper leverages several AWS services to ensure seamless, automated data collection and storage.
At a high level, the scraper involves the following resources and tools:
AWS Services and Tools
- AWS CloudFormation: Automates and templatizes the provisioning of AWS resources.
- Amazon VPC: Isolates the compute resources within a logically isolated virtual network.
- AWS Fargate: Runs the containerized scraping application code.
- AWS Lambda: Triggers the Fargate task to run the scraper.
- Amazon EventBridge: Schedules the daily execution of the Lambda function.
- Amazon ECR: Stores Docker images used by the AWS Fargate tasks.
- Amazon S3: Stores the scraped ETF data as well as the Lambda function source code.
- AWS IAM: Creates roles and policies that allow AWS principals to interact with each other.
- Amazon CloudWatch: Collects logs from the Lambda function and Fargate tasks.
Development and Deployment Tools
- Poetry: Manages the dependencies of the project.
- Docker: Containerizes the application code.
- Boto3: AWS SDK for Python to interact with AWS services.
- GitHub Actions: Automates the deployment processes to ECR and Lambda directly from the GitHub repository.
API Setup
The Alpha Vantage API offers a wide range of financial data, including stock time series, technical and economic indicators, and intelligence capabilities. To access the hosted endpoints, we need to claim a free API key from Alpha Vantage’s website. This key will be used as an environment variable in our scraper code.
The Yahoo Finance API, accessed via the yfinance Python package, provides a simple interface to obtain key performance indicators and metrics for ETFs. The package is not an official Yahoo Finance API but is widely used for financial data extraction.
Important Considerations
Alpha Vantage API: The free tier allows up to 25 requests per day. More details can be found in the support section of Alpha Vantage’s website. In this project, we will use the Listing & Delisting Status endpoint, which returns a list of active or delisted US stocks and ETFs as of the latest trading day.
Yahoo Finance API: There are no officially documented usage limits (that I am aware of). However, to avoid triggering Yahoo’s blocker, the package author recommends respecting the rate limiter as documented in the Smarter Scraping section of the readme. For this project, we will limit our requests to 60 per minute, which is sufficient to gather data for thousands of ETFs within a reasonable time frame (roughly one hour).
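For reference, the rate-limiting approach from the yfinance readme can be adapted to our 60-requests-per-minute budget. The sketch below is only illustrative and assumes the `requests-cache`, `requests-ratelimiter`, and `pyrate-limiter` packages are installed:

```python
# A minimal sketch of throttling yfinance to ~60 requests/minute, adapted from
# the "Smarter Scraping" example in the yfinance readme. Assumes requests-cache,
# requests-ratelimiter, and pyrate-limiter are installed.
from pyrate_limiter import Duration, Limiter, RequestRate
from requests import Session
from requests_cache import CacheMixin, SQLiteCache
from requests_ratelimiter import LimiterMixin, MemoryQueueBucket
import yfinance as yf


class CachedLimiterSession(CacheMixin, LimiterMixin, Session):
    """Session that caches responses and throttles outgoing requests."""


session = CachedLimiterSession(
    limiter=Limiter(RequestRate(60, Duration.MINUTE)),  # at most 60 requests per minute
    bucket_class=MemoryQueueBucket,
    backend=SQLiteCache("yfinance.cache"),
)

# Pass the throttled session to yfinance so every API call respects the limit
spy = yf.Ticker("SPY", session=session)
print(spy.info.get("previousClose"))
```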
Infrastructure Setup
To keep the resource creation process organized, we will use CloudFormation YAML templates to break the AWS resources into manageable components that can be easily deployed and torn down as logical units. The following diagram depicts the entire infrastructure for the ETF scraper on the AWS cloud:
Virtual Private Cloud
AWS Fargate requires a virtual private cloud to function properly. Every AWS account comes with a default VPC with a public subnet in each availability zone. For this project, we create a separate VPC using either the private subnets template (link) or the public subnets template (link).
Public Subnets
In this setup, the ECS Fargate task that runs the containerized application is placed in public subnets, which allows the application code to directly access the internet via the internet gateway. The table below summarizes the key components from the template:
Component | Description |
---|---|
VPC | A virtual private network with a CIDR block of `10.0.0.0/16`, providing 65,536 IP addresses. |
Internet Gateway | Connects the VPC to the internet, defined by the `InternetGateway` and `AttachGateway` resources. Each VPC can have only one internet gateway. |
Public Subnets | Two subnets (`10.0.3.0/24` and `10.0.4.0/24`) with public IP addresses across two availability zones. These subnets are routable to the internet through the internet gateway. |
Route Table | `RouteTable` handles routes for public subnets to the internet gateway. |
Security Groups | The `SecurityGroup` resource for this project allows all outbound traffic for API calls but does not allow any inbound traffic. |
The diagram below depicts the infrastructure in the `us-east-1` region:
Private Subnets
In this stack, the ECS Fargate task is placed in private subnets. The table below summarizes the key components from the template:
Component | Description |
---|---|
VPC | A virtual private network with a CIDR block of `10.0.0.0/16`, providing 65,536 IP addresses. |
Internet Gateway | Connects the VPC to the internet, defined by the `InternetGateway` and `AttachGateway` resources. Each VPC can have only one internet gateway. |
Public Subnets | Two subnets (`10.0.1.0/24` and `10.0.2.0/24`) with public IP addresses across two availability zones. |
Private Subnets | Two subnets (`10.0.3.0/24` and `10.0.4.0/24`) without public IP addresses across two availability zones. |
NAT Gateways | Located in the public subnets, NAT gateways (`NATGateway1` in `PublicSubnet1` and `NATGateway2` in `PublicSubnet2`) allow instances in private subnets to connect to the internet. Each NAT gateway is in a different availability zone to ensure robust internet access, even in the event of an outage in an availability zone. |
Elastic IPs | Public IP addresses associated with the NAT gateways. Each NAT gateway must have an Elastic IP for internet connectivity. |
Route Tables | `RouteTablePublic` handles routes for public subnets to the internet gateway, while `RouteTablePrivate1` and `RouteTablePrivate2` manage routes for private subnets to NAT gateways. |
Security Groups | The `SecurityGroup` resource for this project allows all outbound traffic for API calls but does not allow any inbound traffic. |
This infrastructure can be visualized as follows:
Which Subnet Setup Should We Choose?
When deciding between public and private subnets, consider the following definitions from the official documentation:
- Public Subnet: The subnet has a direct route to an internet gateway. Resources in a public subnet can access the public internet.
- Private Subnet: The subnet does not have a direct route to an internet gateway. Resources in a private subnet require a NAT device to access the public internet.
As we shall see, the ETF scraper code only needs outbound traffic to make API calls, and no inbound traffic is expected. Therefore, the two setups are functionally equivalent for this project. Regardless of whether the ECS Fargate task is deployed in public subnets or private subnets, we can specify a security group with rules that prevent any inbound traffic and allow only outbound traffic.
Still, while the public subnets setup may seem simpler, it leaves the task directly reachable from the internet if the security group is ever misconfigured to allow inbound traffic. The best practice, in general, is to deploy in private subnets and use NAT gateways to access the internet.
S3 & Elastic Container Registry
Two critical resources are:
- S3 Bucket: Stores the ETF data scraped by the application and the packaged Lambda function source code.
- ECR Repository: Stores the Docker image used by the ECS Fargate task to run the scraper.
The CloudFormation template (link) is parameterized with the following inputs from the user:
- `S3BucketName` (String): The name of the S3 bucket to be created.
- `ECRRepoName` (String): The name of the ECR repository to be created.
IAM Roles and Policies
To run this ETF data scraper, we need to set up various IAM roles and policies to give principals (i.e., AWS services like ECS and Lambda) permissions to interact with each other. In addition, we need to create a role with the necessary permissions for workflows to automate the deployment tasks. The following CloudFormation template (link) defines these roles and policies for Lambda, ECS, and GitHub Actions workflows.
The template requires the following parameters:
- `S3BucketName` (String): The name of the S3 bucket created earlier.
- `ECRRepoName` (String): The name of the ECR repository created earlier.
- `ECRRepoArn` (String): The ARN of the ECR repository created earlier.
- `GithubUsername` (String): The GitHub username.
- `GithubRepoName` (String): The GitHub repository name.
The last two parameters ensure that only GitHub Actions workflows from the specified repository (and the `main` branch) can assume the role with permissions to update the Lambda function code and push Docker images to ECR.
Compared to using an IAM user with long-term credentials stored as repository secrets, creating roles assumable by workflows with short-term credentials is a more secure method. This is the approach recommended by AWS for automating deployment tasks. To learn more about this approach, consider exploring the following resources:
Lambda Execution Role
The Lambda execution role allows Lambda to interact with other AWS services.
- Role Name: `${AWS::StackName}-lambda-execution-role`
- Policies:
- LambdaLogPolicy: Allows Lambda to write logs to CloudWatch.
- LambdaECSPolicy: Allows Lambda to run ECS tasks.
- LambdaIAMPolicy: Allows Lambda to pass the ECS execution role and task role to ECS; this policy is useful for restricting the Lambda function to only pass specified roles to ECS.
ECS Execution Role
The ECS execution role allows ECS to interact with other AWS services.
- Role Name: `${AWS::StackName}-ecs-execution-role`
- Policies:
- ECSExecutionPolicy: Allows ECS to authenticate with and pull images from ECR, write logs to CloudWatch, and get environment files from S3.
ECS Task Role
The ECS task role allows the Fargate task to interact with S3, enabling the application code to upload the scraped data. The task role should contain all permissions required by the application code running in the container. It is separate from the ECS execution role, which is used by ECS to manage the task and not by the task itself.
- Role Name: `${AWS::StackName}-ecs-task-role`
- Policies:
- ECSTaskPolicy: Allows the Fargate task to upload and get objects from S3.
GitHub Actions Role
We enable workflows to authenticate with AWS through GitHub’s OIDC provider, facilitating secure and direct interactions with AWS services without needing to store long-term credentials as secrets.
- Role Name: `${AWS::StackName}-github-actions-role`
- Trust Relationship:
  - Establishes a trust relationship with GitHub’s OIDC provider, allowing it to assume this role when authenticated via OIDC.
  - Access is restricted to actions triggered by a push to the `main` branch of the specified GitHub repository, ensuring that only authorized code changes can initiate AWS actions.
- Policies:
- GithubActionsPolicy: Allows the workflows that assume this role to update the Lambda function, push Docker images to ECR, and interact with S3.
Outputs
The template outputs the ARNs of the roles, which can then be accessed from the CloudFormation console:
- LambdaExecutionRoleArn: ARN of the Lambda execution role.
- ECSExecutionRoleArn: ARN of the ECS execution role.
- ECSTaskRoleArn: ARN of the ECS task role.
- GithubActionsRoleArn: ARN of the GitHub Actions role.
ECS Fargate
The CloudFormation template (link) for ECS Fargate requires the following parameters:
The IAM role ARNs:
- `ECSExecutionRoleArn` (String): The ARN of the ECS execution role exported from the IAM template.
- `ECSTaskRoleArn` (String): The ARN of the ECS task role exported from the IAM template.

The task definition parameters:
- `CpuArchitecture` (String): The CPU architecture of the task. Default is `X86_64`. Important: Ensure this is compatible with the architecture for which the Docker image is built.
- `OperatingSystemFamily` (String): The operating system family of the task. Default is `LINUX`.
- `Cpu` (Number): The hard limit of CPU units for the task. Default is `1024` (i.e., 1 vCPU).
- `Memory` (Number): The hard limit of memory (in MiB) to reserve for the container. Default is `2048` (i.e., 2 GB).
- `SizeInGiB` (Number): The amount of ephemeral storage (in GiB) to reserve for the container. Default is `21`.

Other parameters:
- `EnvironmentFileS3Arn` (String): The S3 ARN of the environment file for the container. This file contains the environment variables required by the application code. More details on the environment file are in the Application Code section below.
- `ECRRepoName` (String): The name of the ECR repository created earlier.
Cluster
An ECS Fargate task is typically run in a cluster, which is a logical grouping of tasks. The template linked above creates an ECS cluster with the following properties:
- ClusterSettings: Enables container insights for the cluster, which automatically collects usage metrics for CPU, memory, disk, and network.
- CapacityProviders: Specifies `FARGATE` and `FARGATE_SPOT` (i.e., interruption-tolerant tasks at a discounted rate relative to on-demand) as capacity providers to optimize cost and availability.
- DefaultCapacityProviderStrategy: Distributes tasks evenly between `FARGATE` and `FARGATE_SPOT`.
- Configuration: Enables `ExecuteCommandConfiguration` with `DEFAULT` logging using `awslogs`, which uses the logging configurations defined in the container definition.
Task & Container Definitions
The task definition specifies the IAM roles and compute resources for the task, while the container definition specifies the Docker image, the location of the environment variable file, and the logging configuration for the container.
Important: ECS Fargate requires the `awsvpc` network mode, which provides each task with its own elastic network interface, improving isolation and security. In our Lambda function code (link), we use the `boto3` library to run the ECS Fargate task, specifying the subnets and security group to attach to the network interface.
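As a rough illustration (not the project’s actual `lambda_function.py`), a handler that launches the Fargate task with `boto3` could look like the sketch below; the environment variable names mirror the ones added to the Lambda function in the Deployment section:

```python
# A rough sketch of a Lambda handler that launches the Fargate task with boto3.
# The environment variable names follow the Deployment section of this post.
import os

import boto3

ecs = boto3.client("ecs")


def lambda_handler(event, context):
    response = ecs.run_task(
        cluster=os.environ["ECS_CLUSTER_NAME"],
        taskDefinition=os.environ["ECS_TASK_DEFINITION"],
        # launchType can be omitted to fall back on the cluster's default
        # capacity provider strategy (FARGATE / FARGATE_SPOT)
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                # awsvpc mode attaches an elastic network interface in these
                # subnets, guarded by this security group
                "subnets": [os.environ["SUBNET_1"], os.environ["SUBNET_2"]],
                "securityGroups": [os.environ["SECURITY_GROUP"]],
                "assignPublicIp": os.environ.get("ASSIGN_PUBLIC_IP", "DISABLED"),
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": os.environ["ECS_CONTAINER_NAME"],
                    # Forward the prod/dev flag from the invocation event
                    "environment": [{"name": "ENV", "value": event.get("env", "prod")}],
                }
            ]
        },
    )
    return {"taskArns": [task["taskArn"] for task in response["tasks"]]}
```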
Outputs
The template outputs three values:
- ECSFargateClusterName: The name of the ECS cluster.
- ECSFargateTaskDefinitionFamily: The name of the task definition family.
- ECSFargateContainerName: The name of the container within the task definition.
All the above will be used as environment variables in the Lambda function to properly trigger the ECS Fargate task.
Lambda & EventBridge
The last CloudFormation template (link) sets up an AWS Lambda function and an Amazon EventBridge rule to automate the execution of our ETF scraper.
The template requires the following parameters:
- `S3BucketName` (String): The name of the S3 bucket where the Lambda function code is stored.
- `EventBridgeScheduleExpression` (String): The schedule expression for the EventBridge rule (e.g., `rate(1 day)`), which defines how frequently the Lambda function is triggered.
- `LambdaExecutionRoleArn` (String): The ARN of the Lambda execution role, which grants the Lambda function the necessary permissions to interact with other AWS services.
- `Architectures` (String): The architecture of the Lambda function (`x86_64` or `arm64`). Default is `x86_64`.
- `Runtime` (String): The runtime environment for the Lambda function. Default is `python3.11`.
- `Timeout` (Number): The timeout duration for the Lambda function in seconds. Default is `30` seconds. Since the Lambda function simply triggers the ECS Fargate task and does not perform the scraping itself, the timeout can be set to a lower value.
Lambda Function
The Lambda function is responsible for invoking the ECS Fargate task, via `boto3`, which runs the ETF scraper application code. The following properties are important to note:
- Handler: The handler method within the Lambda function’s code. For this project, it is `lambda_function.lambda_handler`. In general, it should match the file name and the method name in the source code.
- Runtime: The runtime environment for the Lambda function, set to `python3.11` to match the Python version specified in `pyproject.toml`.
- Code: The location in S3 where the Lambda function’s deployment package (ZIP file) is stored.
EventBridge Rule
The EventBridge rule triggers the Lambda function on a predefined schedule. For this project, the expression is set to `cron(00 22 ? * MON-FRI *)`, which triggers the Lambda function at 10 PM UTC (i.e., 5 PM EST or 4 PM CST) from Monday to Friday, after the market closes.
EventBridge allows us to create a serverless, time-based trigger for our Lambda function. This means we can automate the scraping task to run at regular intervals (e.g., daily), ensuring timely data collection without manual scheduling.
Important: To allow EventBridge to invoke the Lambda function, we need to grant this principal the necessary permissions to invoke the function.
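In the template this is handled with a Lambda permission resource; the `boto3` equivalent below is only illustrative, and the function name and rule ARN are hypothetical placeholders:

```python
# Illustrative boto3 equivalent of the permission that lets EventBridge invoke
# the function; the function name and rule ARN are hypothetical placeholders.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.add_permission(
    FunctionName="etf-scraper-lambda",  # hypothetical function name
    StatementId="AllowEventBridgeInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",  # the EventBridge service principal
    SourceArn="arn:aws:events:us-east-1:123456789012:rule/etf-scraper-schedule",  # hypothetical rule ARN
)
```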
Cost Considerations
The cost of running this project on AWS will depend on the frequency of data collection, the number of ETFs scraped, and the AWS services usage and infrastructure decisions.
All estimates below are generated using the AWS Pricing Calculator.
Nevertheless, to optimize costs and ensure efficient resource usage, it’s essential to fine-tune the resources to match the actual requirements of the project. The Container Insights feature in CloudWatch is helpful for monitoring the performance of the Fargate tasks. This feature is already enabled in the ECS Cluster template, allowing us to track metrics such as CPU and memory usage, network traffic, and task health.
Potentially Non-Negligible Costs
VPC
NAT Gateway:
- Gateway usage: \(730\) hours/month x \(\$0.045\)/hour \(= \$32.85\)
- Data processing: \(3\) GB/month x \(\$0.045\)/GB \(= \$0.14\) (this may vary depending on the data processed)
- Total NAT Gateway cost: \(\$32.99\)/month
Note: This cost can be avoided by using the public subnets setup.
Public IPv4 Address:
- \(1\) address x \(730\) hours/month x \(\$0.005\)/hour \(= \$3.65\)
- Total Public IPv4 Address cost: \(\$3.65\)/month
Potentially Negligible Costs & Free Tier
Fargate
Assuming \(21\) trading days per month, the cost of running the Fargate task once per trading day with the following resources is:
- vCPU hours:
  - \(21\) tasks x \(1\) vCPU x \(0.67\) hours x \(\$0.04048\)/vCPU/hour \(= \$0.57\)
- GB hours:
  - \(21\) tasks x \(2.00\) GB x \(0.67\) hours x \(\$0.004445\)/GB/hour \(= \$0.13\)
- Ephemeral storage:
  - \(20\) GB (no additional charge)
- Total Fargate cost: \(\$0.70\)/month
Lambda
Assuming the Lambda function is triggered \(21\) times per month, the cost of running the Lambda function falls within the free tier:
- Memory allocation:
- \(128\) MB (\(0.125\) GB)
- Ephemeral storage:
- \(512\) MB (\(0.5\) GB)
- Compute time:
- \(21\) requests x \(2,000\) ms \(= 42\) seconds; \(42\) seconds x \(0.125\) GB \(= 5.25\) GB-s
- Free tier:
- \(400,000\) GB-s and \(1,000,000\) requests
- Billable GB-s and requests:
- \(0\) GB-s, \(0\) requests
- Total Lambda cost: \(\$0.00\)/month
S3 and ECR
For this project, the scraped data covers a few thousand ETFs and the individual files are on the order of KBs. The cost of storing the data in S3 per month is negligible:
- Storage:
  - \(1\) GB x \(\$0.023\)/GB-month \(= \$0.02\)
- Total S3 cost: \(\$0.02\)/month
Similarly, the Docker image for this project is \(\sim 142\) MB:
- Storage:
  - \(142.16\) MB x \(0.0009765625\) GB/MB \(= 0.1388\) GB; \(0.1388\) GB x \(\$0.10\)/GB-month \(= \$0.0139\)
- Total ECR cost: \(\$0.0139\)/month
EventBridge Scheduler
- Invocations:
- \(21\) invocations (first \(14,000,000\) free)
- Total EventBridge cost: \(\$0.00\)/month
CloudWatch
Assuming we are only storing logs for the Lambda function and Fargate task, the cost of CloudWatch is very much negligible:
- Total CloudWatch cost: \(\$0.00\)/month
Total Estimated Monthly Cost
Service | Monthly Cost |
---|---|
NAT Gateway | \(\$32.99\)/month |
Public IPv4 Address | \(\$3.65\)/month |
Fargate | \(\$0.70\)/month |
Lambda | \(\$0.00\)/month |
S3 | \(\$0.02\)/month |
ECR | \(\$0.0139\)/month |
EventBridge Scheduler | \(\$0.00\)/month |
CloudWatch | \(\$0.00\)/month |
Total | \(\$37.3739\)/month |
This breakdown provides an estimate of the monthly costs associated with the project. The biggest cost driver is the NAT Gateway, which can be avoided by using a public subnet setup. The costs of Fargate, Lambda, S3, ECR, EventBridge, and CloudWatch are all within the free tier or negligible for this project.
GitHub Actions Workflows
To automate the deployment processes, we use two workflows:
- `ecr_deployment.yaml` (link): Builds and pushes the Docker image to Amazon ECR.
- `lambda_deployment.yaml` (link): Zips the Lambda source code, uploads it to S3, and reflects the changes in the Lambda function via an `update` operation.
The workflows require the following GitHub secrets:
- AWS_GITHUB_ACTIONS_ROLE_ARN: The ARN of the GitHub Actions role created in the IAM template.
- AWS_REGION: The AWS region where the resources are deployed.
- ECR_REPOSITORY: The name of the ECR repository for storing Docker images (for `ecr_deployment.yaml`).
- S3_BUCKET: The name of the S3 bucket where the Lambda function code is stored (for `lambda_deployment.yaml`).
- LAMBDA_FUNCTION: The name of the Lambda function to update (for `lambda_deployment.yaml`).
Both workflows are triggered on push to the `main` branch if certain pre-specified files are modified. The `ecr_deployment.yaml` workflow is triggered if any of the following files are modified:
- `Dockerfile`: Changes to the Dockerfile
- `src/**`: Changes to the source code
- `!src/deploy_stack.py`: Excludes changes to the deployment script since it is not part of the Docker image
- `main.py`: Changes to the entry point of the application
- `pyproject.toml` & `poetry.lock`: Changes to dependencies
The `lambda_deployment.yaml` workflow is triggered only if the `lambda_function.py` file is modified.
Application Code
The application code consists of the following files:
├── main.py
└── src
    ├── __init__.py
    ├── api.py
    └── utils.py
Modules
The `main.py` script (link) serves as the entry point for the scraper application. Its primary function is to orchestrate the overall scraping process, which includes querying ETF data from external APIs, processing the data, and writing the results to an S3 bucket. The following environment variables are required:
- `API_KEY` (String): The Alpha Vantage API key.
- `S3_BUCKET` (String): The name of the S3 bucket where the scraped data should be stored.
- `IPO_DATE` (String): The cutoff date for the IPO status check. The scraper will only fetch data for ETFs that were listed on or after this date. The format is `YYYY-MM-DD`.
- `MAX_ETFS` (Int): The maximum number of ETFs to scrape. This can be set to a lower value for testing purposes.
- `PARQUET` (String): Whether to save the scraped data as Parquet files. If set to the string value `True`, the data is saved in Parquet format; otherwise, it is saved as a CSV flat file.
In addition, add the following environment variable to `.env` to run the scraper in dev mode when running `main.py` locally:
- `ENV` (String): The environment in which the scraper is running. Set this to `dev` (e.g., `ENV=dev`).

Important: Ensure that this environment variable is removed from `.env` before uploading the file to S3 for production.
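To illustrate how the `S3_BUCKET` and `PARQUET` variables might be consumed, here is a minimal sketch (not the project’s actual code) that writes a pandas DataFrame to S3, assuming the `s3fs` and `pyarrow` packages are installed so that `s3://` URIs work directly:

```python
# A minimal sketch of the write step driven by the S3_BUCKET and PARQUET
# environment variables; the key layout shown here is an assumption.
import os
from datetime import date

import pandas as pd


def write_to_s3(data: pd.DataFrame) -> str:
    bucket = os.environ["S3_BUCKET"]
    as_parquet = os.environ.get("PARQUET", "False") == "True"
    key = f"etf_kpis/{date.today().isoformat()}"  # hypothetical key prefix
    if as_parquet:
        uri = f"s3://{bucket}/{key}.parquet"
        data.to_parquet(uri, index=False)  # Parquet output (requires pyarrow)
    else:
        uri = f"s3://{bucket}/{key}.csv"
        data.to_csv(uri, index=False)  # CSV flat file output
    return uri
```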
The `api.py` module (link) contains the `query_etf_data` function, which is responsible for fetching ETF data from the Alpha Vantage and Yahoo Finance APIs.
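The snippet below is a simplified sketch of the same idea, not the actual `query_etf_data` implementation: it pulls the active listings from Alpha Vantage’s `LISTING_STATUS` endpoint and a handful of KPIs from yfinance (the `info` field names can vary by fund and package version):

```python
# A simplified sketch: active listings from Alpha Vantage's LISTING_STATUS
# endpoint, then a few KPIs from yfinance for a single ETF.
import io
import os

import pandas as pd
import requests
import yfinance as yf


def fetch_active_etfs(api_key: str) -> pd.DataFrame:
    """Return active US ETF listings from the Alpha Vantage LISTING_STATUS endpoint."""
    params = {"function": "LISTING_STATUS", "state": "active", "apikey": api_key}
    resp = requests.get("https://www.alphavantage.co/query", params=params, timeout=30)
    resp.raise_for_status()
    listings = pd.read_csv(io.StringIO(resp.text))
    return listings[listings["assetType"] == "ETF"]


def fetch_etf_kpis(symbol: str) -> dict:
    """Return a subset of KPIs for a single ETF via yfinance."""
    info = yf.Ticker(symbol).info
    keys = [
        "previousClose", "navPrice", "trailingPE", "volume",
        "averageVolume", "bid", "ask", "category", "ytdReturn",
    ]
    return {"symbol": symbol, **{key: info.get(key) for key in keys}}


if __name__ == "__main__":
    etfs = fetch_active_etfs(os.environ["API_KEY"])
    print(fetch_etf_kpis(etfs["symbol"].iloc[0]))
```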
Key Performance Indicators
The key performance indicators (KPIs) and metrics fetched and processed by the scraper include:
- Previous Close: The last closing price of the ETF, useful for understanding recent prices of the ETF.
- NAV Price: Net Asset Value price, which represents the value of each share’s portion of the fund’s underlying assets and cash at the end of the trading day.
- Trailing P/E: Trailing price-to-earnings ratio, indicating the ETF’s valuation relative to its earnings.
- Volume: The total number of shares traded during the last trading day.
- Average Volume: The average number of shares traded over a specified period, e.g., 30 days.
- Bid and Ask Prices: The highest price a buyer is willing to pay (bid) and the lowest price a seller is willing to accept (ask), along with their respective sizes. More details on bid size can be found here.
- Category: The classification of the ETF, providing context on the type of assets it holds.
- Beta (Three-Year): A volatility measure of the ETF relative to the market, typically proxied by the S&P 500, over the past three years.
- YTD Return: Year-to-date return, measuring the ETF’s performance since the first trading day of the current calendar year.
- Three-Year and Five-Year Average Returns: The average returns over the past three and five years, respectively, providing long-term performance insights. These are derived from the compound annual growth rate formula.
\[\begin{align*} \text{CAGR} = \left( \frac{\text{Ending Value}}{\text{Beginning Value}} \right)^{\frac{1}{n}} - 1 \end{align*}\]
where \(n\) is the number of years in the measurement period.
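As a quick worked example of the formula, a fund whose NAV grows from 100 to 133.1 over three years has a three-year average annual return of 10%:

```python
# Worked example of the CAGR formula above.
beginning_value, ending_value, n = 100.0, 133.1, 3
cagr = (ending_value / beginning_value) ** (1 / n) - 1
print(f"{cagr:.2%}")  # 10.00%
```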
Dockerfile
The application code is containerized with Docker so it can be deployed on the AWS cloud. The following `Dockerfile` (link) takes a multi-stage approach to build the image efficiently.
- Base Stage (`python-base`):
  - Sets up a lightweight Python environment using a base image.
  - Sets essential environment variables for optimized performance and dependency management.
- Builder Stage (`builder`):
  - Installs system dependencies and Poetry.
  - Copies the `pyproject.toml` and `poetry.lock` files onto the container and installs the project dependencies in a virtual environment.
- Production Stage (`production`):
  - Copies the project directory with dependencies from the builder stage.
  - Copies the application code onto the container.
  - Sets the working directory and specifies the command to run the application using the Python interpreter from the virtual environment created during the builder stage.
This `Dockerfile` is adapted from a discussion in the Poetry GitHub repository.
Deployment
There are two options for deploying the AWS resources:
- Using the AWS Console:
  - Use an IAM user with the necessary permissions, i.e., a user with administrator access.
  - This is straightforward as everything is accomplished via a graphical user interface.
- Using the Python script `deploy_stack.py`:
  - Follow the Steps to Deploy Programmatically section below.
Steps to Deploy Via the AWS Console
When creating the stacks using the console, follow the steps in the order specified below:
- VPC Stack: Create the VPC stack.
- S3 & ECR Stack: Create the S3 & ECR stack.
- IAM Stack: Create the IAM stack.
- Add Secrets to GitHub:
  - `AWS_GITHUB_ACTIONS_ROLE_ARN`: The ARN of the GitHub Actions role created in the IAM template.
  - `AWS_REGION`: The AWS region where the resources are deployed.
  - `ECR_REPOSITORY`: The name of the ECR repository created in the S3 & ECR template.
- Trigger the `ecr_deployment.yaml` Workflow: This workflow builds the Docker image and pushes it to ECR.
- Run the `upload_env_to_s3.sh` (link) and `zip_lambda_to_s3.sh` (link) Scripts: These scripts upload the environment file and the Lambda function code to the S3 bucket we created. If the AWS CLI is not installed, manual upload to the S3 bucket also works.
- Lambda & EventBridge Stack: Create the Lambda & EventBridge stack.
- Add Two More Secrets to GitHub:
  - `S3_BUCKET`: The name of the S3 bucket created in the S3 & ECR template.
  - `LAMBDA_FUNCTION`: The name of the Lambda function created in the previous step.
- ECS Fargate Stack: Create the ECS Fargate stack.
- Add Environment Variables to Lambda:
  - `ASSIGN_PUBLIC_IP`: Set to `ENABLED` when using public subnets and `DISABLED` when using private subnets.
  - `ECS_CLUSTER_NAME`: Output from the ECS Fargate stack.
  - `ECS_CONTAINER_NAME`: Output from the ECS Fargate stack.
  - `ECS_TASK_DEFINITION`: Output from the ECS Fargate stack.
  - `SECURITY_GROUP`: Output from the VPC stack.
  - `SUBNET_1`: Output from the VPC stack.
  - `SUBNET_2`: Output from the VPC stack.
  - `env`: Defaults to `prod`, but can be overridden to `dev` for testing purposes.
Steps to Deploy Programmatically
Install AWS CLI
The AWS command line interface (CLI) is a tool for managing AWS services from the command line. Follow the installation instructions here to install the tool for your operating system. Verify the installation by running the following commands:
$ which aws
$ aws --version
There are several ways to configure the AWS CLI and credentials in an enterprise setting to enhance security. For this project, however, we will use the long-term credentials approach on our personal machines.
- Log in to the AWS Management Console:
- Use the user with administrator access since we need to create a new IAM user.
- Create a New IAM User With Programmatic Access:
- Navigate to the IAM console and create a new user with programmatic access.
- Attach the `AdministratorAccess` policy (documentation) to this user directly or to the user group to which this user belongs.
- Securely store the access key ID and secret access key provided.
- Configure the AWS CLI:
- Run the following command to configure the AWS CLI:
$ aws configure
AWS Access Key ID [None]: aws_access_key_id
AWS Secret Access Key [None]: aws_secret_access_key
Default region name [None]: aws_region
Default output format [None]: json
- Replace `aws_access_key_id` and `aws_secret_access_key` with the values provided when creating the IAM user.
- Note: While this method is sufficient for this project, it is not recommended for enterprise settings. Even for non-enterprise settings, it is advisable to:
- Rotate the access keys regularly via the IAM console of the user with administrator access.
- Delete any unused access keys to reduce potential security risks.
- Avoid adding these keys to any public repositories or sharing them with unauthorized users.
The steps above ensure that `boto3` can interact with AWS services using the configured credentials.
Run the Python Script
The `deploy_stack.py` Python script requires the following command line arguments:
- `--template_file`: The path to the CloudFormation template file.
- `--parameters_file`: The path to the JSON file containing stack parameters.
- `--save_outputs`: An optional flag to save the stack outputs to a JSON file.
An example parameters file is provided here.
# Activate the virtual environment
$ poetry shell
# Navigate to the project root directory
$ cd path_to_project_root
# Set the PYTHONPATH to the current directory
$ export PYTHONPATH=$PWD
We can then deploy each stack as follows (e.g. using the VPC stack):
# Run the script and save output to json
$ poetry run python3 src/deploy_stack.py --template_file cloudformation_templates/vpc-private.yaml --parameters_file cloudformation_templates/stack_parameters.json --save_outputs
The script outputs will be saved in the outputs directory, assuming the `--save_outputs` flag is used. These outputs can be utilized to populate the parameters for subsequent stack deployments.
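Under the hood, such a script boils down to a few CloudFormation calls via `boto3`. The condensed sketch below is not the project’s exact implementation: argument parsing and error handling are omitted, and the stack name and parameters-file layout are assumptions:

```python
# A condensed sketch of the boto3 CloudFormation calls behind a deployment
# script; the stack name and parameters-file layout are assumptions.
import json

import boto3


def deploy_stack(template_file: str, parameters_file: str, stack_name: str) -> list:
    cf = boto3.client("cloudformation")
    with open(template_file) as f:
        template_body = f.read()
    with open(parameters_file) as f:
        # Assumed layout: [{"ParameterKey": "...", "ParameterValue": "..."}, ...]
        parameters = json.load(f)

    cf.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=parameters,
        Capabilities=["CAPABILITY_NAMED_IAM"],  # needed for stacks that create IAM roles
    )
    # Block until the stack finishes creating, then return its outputs
    cf.get_waiter("stack_create_complete").wait(StackName=stack_name)
    stack = cf.describe_stacks(StackName=stack_name)["Stacks"][0]
    return stack.get("Outputs", [])


if __name__ == "__main__":
    outputs = deploy_stack(
        "cloudformation_templates/vpc-private.yaml",
        "cloudformation_templates/stack_parameters.json",
        stack_name="etf-scraper-vpc",  # hypothetical stack name
    )
    print(json.dumps(outputs, indent=2))
```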
The steps to deploy the resources are the same as those outlined in the AWS Console section above.
- VPC stack
- S3 & ECR stack
- IAM stack
  - Add the secrets to GitHub: `AWS_GITHUB_ACTIONS_ROLE_ARN`, `AWS_REGION`, `ECR_REPOSITORY`
  - Upload the environment file and the Lambda function code to S3, either manually or via the shell scripts
  - Build and push the Docker image to ECR via the GitHub Actions workflow
- Lambda & EventBridge stack
  - Add the secrets to GitHub: `S3_BUCKET`, `LAMBDA_FUNCTION`
- ECS Fargate stack
- Add the environment variables to the Lambda function
Tips for troubleshooting:
- Ensure that the paths to the template and parameters files are correct.
- Verify that the AWS credentials are configured correctly and that the necessary permissions are attached to deploy CloudFormation stacks, i.e., administrator access.
- Check the JSON parameters file `stack_parameters.json` (link) for any syntax errors or missing values that could affect the deployment.
Test Triggering the Lambda Function
To test the Lambda function, we can manually trigger it from the AWS console. The logs from the Lambda function and the ECS Fargate task can both be viewed in CloudWatch.
- Configure a test event with a payload consisting of `{'env': 'dev'}`:
- View container logs in CloudWatch:
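Alternatively, the same test event can be sent programmatically with `boto3`; the function name below is a hypothetical placeholder:

```python
# Sending the dev test event to the Lambda function programmatically.
import json

import boto3

lambda_client = boto3.client("lambda")
response = lambda_client.invoke(
    FunctionName="etf-scraper-lambda",     # hypothetical function name
    InvocationType="RequestResponse",      # wait for the handler's response
    Payload=json.dumps({"env": "dev"}).encode(),
)
print(json.loads(response["Payload"].read()))
```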
Wrapping Up
ETFs are a great, beginner-friendly way to build a diversified portfolio before gaining the confidence to manage our own holdings more actively.
By automating the data collection and storage with AWS services and Python, we can ensure up-to-date and accurate information with minimal manual effort. This allows us to focus on analyzing the data and making informed investment decisions.
Finally, all source files are available in the following repository.