Scraping ETF KPIs with AWS Lambda, AWS Fargate, and Alpha Vantage & Yahoo Finance APIs
Note: The section Wrapping Up links to the complete project codebase. You can always return to this guide for a more detailed walkthrough of the project design and infrastructure.
Overview
In this post, we will create a Python 3-based daily scraper that gathers key performance indicators and metrics for actively listed Exchange-Traded Funds (ETFs) using the Alpha Vantage and Yahoo Finance APIs. The scraper leverages several AWS services to ensure seamless, automated data collection and storage.
At a high level, the scraper involves the following resources and tools:
AWS Services and Tools
- AWS CloudFormation: Automates and templatizes the provisioning of AWS resources.
- Amazon VPC: Isolates the compute resources within a logically isolated virtual network.
- AWS Fargate: Runs the containerized scraping application code.
- AWS Lambda: Triggers the Fargate task to run the scraper.
- Amazon EventBridge: Schedules the daily execution of the Lambda function.
- Amazon ECR: Stores Docker images used by the AWS Fargate tasks.
- Amazon S3: Stores the scraped ETF data as well as the Lambda function source code.
- AWS IAM: Creates roles and policies that allow AWS principals to interact with each other.
- Amazon CloudWatch: Collects logs from the Lambda function and Fargate tasks.
Development and Deployment Tools
- Poetry: Manages the dependencies of the project.
- Docker: Containerizes the application code.
- Boto3: AWS SDK for Python to interact with AWS services.
- GitHub Actions: Automates the deployment processes to ECR and Lambda directly from the GitHub repository.
API Setup
The Alpha Vantage API offers a wide range of financial data, including stock time series, technical and economic indicators, and intelligence capabilities. To access the hosted endpoints, we need to claim a free API key from Alpha Vantage’s website. This key will be used as an environment variable in our scraper code.
The Yahoo Finance API, accessed via the yfinance Python package, provides a simple interface to obtain key performance indicators and metrics for ETFs. The package is not an official Yahoo Finance API but is widely used for financial data extraction.
Important Considerations
Alpha Vantage API: The free tier allows up to 25 requests per day. More details can be found in the support section of Alpha Vantage’s website. In this project, we will use the Listing & Delisting Status endpoint, which returns a list of active or delisted US stocks and ETFs as of the latest trading day.
Yahoo Finance API: There are no officially documented usage limits (that I am aware of). However, to avoid triggering Yahoo’s blocker, the package author recommends respecting the rate limiter as documented in the Smarter Scraping section of the readme. For this project, we will limit our requests to 60 per minute, which is sufficient to gather data for thousands of ETFs within a reasonable time frame (roughly one hour).
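For reference, the rate-limiting approach from the yfinance readme can be adapted to our 60-requests-per-minute budget. The sketch below is only illustrative and assumes the `requests-cache`, `requests-ratelimiter`, and `pyrate-limiter` packages are installed:

```python
# A minimal sketch of throttling yfinance to ~60 requests/minute, adapted from
# the "Smarter Scraping" example in the yfinance readme. Assumes requests-cache,
# requests-ratelimiter, and pyrate-limiter are installed.
from pyrate_limiter import Duration, Limiter, RequestRate
from requests import Session
from requests_cache import CacheMixin, SQLiteCache
from requests_ratelimiter import LimiterMixin, MemoryQueueBucket
import yfinance as yf


class CachedLimiterSession(CacheMixin, LimiterMixin, Session):
    """Session that caches responses and throttles outgoing requests."""


session = CachedLimiterSession(
    limiter=Limiter(RequestRate(60, Duration.MINUTE)),  # at most 60 requests per minute
    bucket_class=MemoryQueueBucket,
    backend=SQLiteCache("yfinance.cache"),
)

# Pass the throttled session to yfinance so every API call respects the limit
spy = yf.Ticker("SPY", session=session)
print(spy.info.get("previousClose"))
```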
Infrastructure Setup
To keep the resource creation process organized, we will use CloudFormation YAML templates to break the AWS resources into manageable components that can be easily deployed and torn down as logical units. The following diagram depicts the entire infrastructure for the ETF scraper on the AWS cloud:
Virtual Private Cloud
AWS Fargate requires a virtual private cloud to function properly. Every AWS account comes with a default VPC with a public subnet in each availability zone. For this project, we create a separate VPC using either the private subnets template (link) or the public subnets template (link).
Public Subnets
In this setup, the ECS Fargate task that runs the containerized application is placed in public subnets, which allows the application code to directly access the internet via the internet gateway. The table below summarizes the key components from the template:
Component | Description |
---|---|
VPC | A virtual private network with a CIDR block of `10.0.0.0/16`, providing 65,536 IP addresses. |
Internet Gateway | Connects the VPC to the internet, defined by the `InternetGateway` and `AttachGateway` resources. Each VPC can have only one internet gateway. |
Public Subnets | Two subnets (`10.0.3.0/24` and `10.0.4.0/24`) with public IP addresses across two availability zones. These subnets are routable to the internet through the internet gateway. |
Route Table | `RouteTable` handles routes for public subnets to the internet gateway. |
Security Groups | The `SecurityGroup` resource for this project allows all outbound traffic for API calls but does not allow any inbound traffic. |
The diagram below depicts the infrastructure in the `us-east-1` region:
Private Subnets
In this stack, the ECS Fargate task is placed in private subnets. The table below summarizes the key components from the template:
Component | Description |
---|---|
VPC | A virtual private network with a CIDR block of `10.0.0.0/16`, providing 65,536 IP addresses. |
Internet Gateway | Connects the VPC to the internet, defined by the `InternetGateway` and `AttachGateway` resources. Each VPC can have only one internet gateway. |
Public Subnets | Two subnets (`10.0.1.0/24` and `10.0.2.0/24`) with public IP addresses across two availability zones. |
Private Subnets | Two subnets (`10.0.3.0/24` and `10.0.4.0/24`) without public IP addresses across two availability zones. |
NAT Gateways | Located in the public subnets, NAT gateways (`NATGateway1` in `PublicSubnet1` and `NATGateway2` in `PublicSubnet2`) allow instances in private subnets to connect to the internet. Each NAT gateway is in a different availability zone to ensure robust internet access, even in the event of an outage in an availability zone. |
Elastic IPs | Public IP addresses associated with the NAT gateways. Each NAT gateway must have an Elastic IP for internet connectivity. |
Route Tables | `RouteTablePublic` handles routes for public subnets to the internet gateway, while `RouteTablePrivate1` and `RouteTablePrivate2` manage routes for private subnets to NAT gateways. |
Security Groups | The `SecurityGroup` resource for this project allows all outbound traffic for API calls but does not allow any inbound traffic. |
This infrastructure can be visualized as follows:
Which Subnet Setup Should We Choose?
When deciding between public and private subnets, consider the following definitions from the official documentation:
- Public Subnet: The subnet has a direct route to an internet gateway. Resources in a public subnet can access the public internet.
- Private Subnet: The subnet does not have a direct route to an internet gateway. Resources in a private subnet require a NAT device to access the public internet.
As we shall see, the ETF scraper code only needs outbound traffic to make API calls, and no inbound traffic is expected. Therefore, the two setups are functionally equivalent for this project. Regardless of whether the ECS Fargate task is deployed in public subnets or private subnets, we can specify a security group with rules that prevent any inbound traffic and allow only outbound traffic.
Still, while the public subnets setup may seem simpler, it leaves the task directly reachable from the internet if the security group is ever misconfigured to allow inbound traffic. The best practice, in general, is to deploy in private subnets and use NAT gateways to access the internet.
S3 & Elastic Container Registry
Two critical resources are:
- S3 Bucket: Stores the ETF data scraped by the application and the packaged Lambda function source code.
- ECR Repository: Stores the Docker image used by the ECS Fargate task to run the scraper.
The CloudFormation template (link) is parameterized with the following inputs from the user:
- `S3BucketName` (String): The name of the S3 bucket to be created.
- `ECRRepoName` (String): The name of the ECR repository to be created.
IAM Roles and Policies
To run this ETF data scraper, we need to set up various IAM roles and policies to give principals (i.e., AWS services like ECS and Lambda) permissions to interact with each other. In addition, we need to create a role with the necessary permissions for workflows to automate the deployment tasks. The following CloudFormation template (link) defines these roles and policies for Lambda, ECS, and GitHub Actions workflows.
The template requires the following parameters:
- `S3BucketName` (String): The name of the S3 bucket created earlier.
- `ECRRepoName` (String): The name of the ECR repository created earlier.
- `ECRRepoArn` (String): The ARN of the ECR repository created earlier.
- `GithubUsername` (String): The GitHub username.
- `GithubRepoName` (String): The GitHub repository name.
The last two parameters ensure that only GitHub Actions workflows from the specified repository (and the `main` branch) can assume the role with permissions to update the Lambda function code and push Docker images to ECR.
Compared to using an IAM user with long-term credentials stored as repository secrets, creating roles assumable by workflows with short-term credentials is a more secure method. This is the approach recommended by AWS for automating deployment tasks. To learn more about this approach, consider exploring the following resources:
Lambda Execution Role
The Lambda execution role allows Lambda to interact with other AWS services.
- Role Name: `${AWS::StackName}-lambda-execution-role`
- Policies:
- LambdaLogPolicy: Allows Lambda to write logs to CloudWatch.
- LambdaECSPolicy: Allows Lambda to run ECS tasks.
- LambdaIAMPolicy: Allows Lambda to pass the ECS execution role and task role to ECS; this policy is useful for restricting the Lambda function to only pass specified roles to ECS.
ECS Execution Role
The ECS execution role allows ECS to interact with other AWS services.
- Role Name: `${AWS::StackName}-ecs-execution-role`
- Policies:
- ECSExecutionPolicy: Allows ECS to authenticate with and pull images from ECR, write logs to CloudWatch, and get environment files from S3.
ECS Task Role
The ECS task role allows the Fargate task to interact with S3, enabling the application code to upload the scraped data. The task role should contain all permissions required by the application code running in the container. It is separate from the ECS execution role, which is used by ECS to manage the task and not by the task itself.
- Role Name: `${AWS::StackName}-ecs-task-role`
- Policies:
- ECSTaskPolicy: Allows the Fargate task to upload and get objects from S3.
GitHub Actions Role
We enable workflows to authenticate with AWS through GitHub’s OIDC provider, facilitating secure and direct interactions with AWS services without needing to store long-term credentials as secrets.
- Role Name: `${AWS::StackName}-github-actions-role`
- Trust Relationship:
  - Establishes a trust relationship with GitHub’s OIDC provider, allowing it to assume this role when authenticated via OIDC.
  - Access is restricted to actions triggered by a push to the `main` branch of the specified GitHub repository, ensuring that only authorized code changes can initiate AWS actions.
- Policies:
- GithubActionsPolicy: Allows the workflows that assume this role to update the Lambda function, push Docker images to ECR, and interact with S3.
Outputs
The template outputs the ARNs of the roles, which can then be accessed from the CloudFormation console:
- LambdaExecutionRoleArn: ARN of the Lambda execution role.
- ECSExecutionRoleArn: ARN of the ECS execution role.
- ECSTaskRoleArn: ARN of the ECS task role.
- GithubActionsRoleArn: ARN of the GitHub Actions role.
ECS Fargate
The CloudFormation template (link) for ECS Fargate requires the following parameters:
The IAM role ARNs:
- `ECSExecutionRoleArn` (String): The ARN of the ECS execution role exported from the IAM template.
- `ECSTaskRoleArn` (String): The ARN of the ECS task role exported from the IAM template.

The task definition parameters:
- `CpuArchitecture` (String): The CPU architecture of the task. Default is `X86_64`. Important: Ensure this is compatible with the architecture for which the Docker image is built.
- `OperatingSystemFamily` (String): The operating system family of the task. Default is `LINUX`.
- `Cpu` (Number): The hard limit of CPU units for the task. Default is `1024` (i.e., 1 vCPU).
- `Memory` (Number): The hard limit of memory (in MiB) to reserve for the container. Default is `2048` (i.e., 2 GB).
- `SizeInGiB` (Number): The amount of ephemeral storage (in GiB) to reserve for the container. Default is `21`.

Other parameters:
- `EnvironmentFileS3Arn` (String): The S3 ARN of the environment file for the container. This file contains the environment variables required by the application code. More details on the environment file are in the Application Code section below.
- `ECRRepoName` (String): The name of the ECR repository created earlier.
Cluster
An ECS Fargate task is typically run in a cluster, which is a logical grouping of tasks. The template linked above creates an ECS cluster with the following properties:
- ClusterSettings: Enables container insights for the cluster, which automatically collects usage metrics for CPU, memory, disk, and network.
- CapacityProviders: Specifies `FARGATE` and `FARGATE_SPOT` (i.e., interruption-tolerant tasks at a discounted rate relative to on-demand) as capacity providers to optimize cost and availability.
- DefaultCapacityProviderStrategy: Distributes tasks evenly between `FARGATE` and `FARGATE_SPOT`.
- Configuration: Enables `ExecuteCommandConfiguration` with `DEFAULT` logging using `awslogs`, which uses the logging configurations defined in the container definition.
Task & Container Definitions
The task definition specifies the IAM roles and compute resources for the task, while the container definition specifies the Docker image, the location of the environment variable file, and the logging configuration for the container.
Important: ECS Fargate requires the `awsvpc` network mode, which provides each task with its own elastic network interface, improving isolation and security. In our Lambda function code (link), we use the `boto3` library to run the ECS Fargate task, specifying the subnets and security group to attach to the network interface.
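As a rough illustration (not the project’s actual `lambda_function.py`), a handler that launches the Fargate task with `boto3` could look like the sketch below; the environment variable names mirror the ones added to the Lambda function in the Deployment section:

```python
# A rough sketch of a Lambda handler that launches the Fargate task with boto3.
# The environment variable names follow the Deployment section of this post.
import os

import boto3

ecs = boto3.client("ecs")


def lambda_handler(event, context):
    response = ecs.run_task(
        cluster=os.environ["ECS_CLUSTER_NAME"],
        taskDefinition=os.environ["ECS_TASK_DEFINITION"],
        # launchType can be omitted to fall back on the cluster's default
        # capacity provider strategy (FARGATE / FARGATE_SPOT)
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                # awsvpc mode attaches an elastic network interface in these
                # subnets, guarded by this security group
                "subnets": [os.environ["SUBNET_1"], os.environ["SUBNET_2"]],
                "securityGroups": [os.environ["SECURITY_GROUP"]],
                "assignPublicIp": os.environ.get("ASSIGN_PUBLIC_IP", "DISABLED"),
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": os.environ["ECS_CONTAINER_NAME"],
                    # Forward the prod/dev flag from the invocation event
                    "environment": [{"name": "ENV", "value": event.get("env", "prod")}],
                }
            ]
        },
    )
    return {"taskArns": [task["taskArn"] for task in response["tasks"]]}
```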
Outputs
The template outputs three values:
- ECSFargateClusterName: The name of the ECS cluster.
- ECSFargateTaskDefinitionFamily: The name of the task definition family.
- ECSFargateContainerName: The name of the container within the task definition.
All the above will be used as environment variables in the Lambda function to properly trigger the ECS Fargate task.
Lambda & EventBridge
The last CloudFormation template (link) sets up an AWS Lambda function and an Amazon EventBridge rule to automate the execution of our ETF scraper.
The template requires the following parameters:
- `S3BucketName` (String): The name of the S3 bucket where the Lambda function code is stored.
- `EventBridgeScheduleExpression` (String): The schedule expression for the EventBridge rule (e.g., `rate(1 day)`), which defines how frequently the Lambda function is triggered.
- `LambdaExecutionRoleArn` (String): The ARN of the Lambda execution role, which grants the Lambda function the necessary permissions to interact with other AWS services.
- `Architectures` (String): The architecture of the Lambda function (`x86_64` or `arm64`). Default is `x86_64`.
- `Runtime` (String): The runtime environment for the Lambda function. Default is `python3.11`.
- `Timeout` (Number): The timeout duration for the Lambda function in seconds. Default is `30` seconds. Since the Lambda function simply triggers the ECS Fargate task and does not perform the scraping itself, the timeout can be set to a lower value.
Lambda Function
The Lambda function is responsible for invoking the ECS Fargate task, via `boto3`, which runs the ETF scraper application code. The following properties are important to note:
- Handler: The handler method within the Lambda function’s code. For this project, it is `lambda_function.lambda_handler`. In general, it should match the file name and the method name in the source code.
- Runtime: The runtime environment for the Lambda function, set to `python3.11` to match the Python version specified in `pyproject.toml`.
- Code: The location in S3 where the Lambda function’s deployment package (ZIP file) is stored.
EventBridge Rule
The EventBridge rule triggers the Lambda function on a predefined schedule. For this project, the expression is set to `cron(00 22 ? * MON-FRI *)`, which triggers the Lambda function at 10 PM UTC (i.e., 5 PM EST or 4 PM CST) from Monday to Friday, after the market closes.
EventBridge allows us to create a serverless, time-based trigger for our Lambda function. This means we can automate the scraping task to run at regular intervals (e.g., daily), ensuring timely data collection without manual scheduling.
Important: To allow EventBridge to invoke the Lambda function, we need to grant this principal the necessary permissions to invoke the function.
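In the template this is handled with a Lambda permission resource; the `boto3` equivalent below is only illustrative, and the function name and rule ARN are hypothetical placeholders:

```python
# Illustrative boto3 equivalent of the permission that lets EventBridge invoke
# the function; the function name and rule ARN are hypothetical placeholders.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.add_permission(
    FunctionName="etf-scraper-lambda",  # hypothetical function name
    StatementId="AllowEventBridgeInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",  # the EventBridge service principal
    SourceArn="arn:aws:events:us-east-1:123456789012:rule/etf-scraper-schedule",  # hypothetical rule ARN
)
```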
Cost Considerations
The cost of running this project on AWS will depend on the frequency of data collection, the number of ETFs scraped, and the AWS services usage and infrastructure decisions.
All estimates below are generated using the AWS Pricing Calculator.
Nevertheless, to optimize costs and ensure efficient resource usage, it’s essential to fine-tune the resources to match the actual requirements of the project. The Container Insights feature in CloudWatch is helpful for monitoring the performance of the Fargate tasks. This feature is already enabled in the ECS Cluster template, allowing us to track metrics such as CPU and memory usage, network traffic, and task health.
Potentially Non-Negligible Costs
VPC
NAT Gateway:
- Gateway usage: \(730\) hours/month x \(\$0.045\)/hour \(= \$32.85\)
- Data processing: \(3\) GB/month x \(\$0.045\)/GB \(= \$0.14\) (this may vary depending on the data processed)
- Total NAT Gateway cost: \(\$32.99\)/month
Note: This cost can be avoided by using the public subnets setup.
Public IPv4 Address:
- \(1\) address x \(730\) hours/month x \(\$0.005\)/hour \(= \$3.65\)
- Total Public IPv4 Address cost: \(\$3.65\)/month
Potentially Negligible Costs & Free Tier
Fargate
Assuming \(21\) trading days per month, the cost of running the Fargate task once per trading day with the following resources is:
- vCPU hours:
  - \(21\) tasks x \(1\) vCPU x \(0.67\) hours x \(\$0.04048\)/vCPU/hour \(= \$0.57\)
- GB hours:
  - \(21\) tasks x \(2.00\) GB x \(0.67\) hours x \(\$0.004445\)/GB/hour \(= \$0.13\)
- Ephemeral storage:
  - \(20\) GB (no additional charge)
- Total Fargate cost: \(\$0.70\)/month
Lambda
Assuming the Lambda function is triggered \(21\) times per month, the cost of running the Lambda function falls within the free tier:
- Memory allocation:
- \(128\) MB (\(0.125\) GB)
- Ephemeral storage:
- \(512\) MB (\(0.5\) GB)
- Compute time:
- \(21\) requests x \(2,000\) ms \(= 42\) seconds; \(42\) seconds x \(0.125\) GB \(= 5.25\) GB-s
- Free tier:
- \(400,000\) GB-s and \(1,000,000\) requests
- Billable GB-s and requests:
- \(0\) GB-s, \(0\) requests
- Total Lambda cost: \(\$0.00\)/month
S3 and ECR
For this project, the scraped data covers a few thousand ETFs and the individual files are on the order of KBs. The cost of storing the data in S3 per month is negligible:
- Storage:
  - \(1\) GB x \(\$0.023\)/GB-month \(= \$0.02\)
- Total S3 cost: \(\$0.02\)/month
Similarly, the Docker image for this project is \(\sim 142\) MB:
- Storage:
  - \(142.16\) MB x \(0.0009765625\) GB/MB \(= 0.1388\) GB; \(0.1388\) GB x \(\$0.10\)/GB-month \(= \$0.0139\)
- Total ECR cost: \(\$0.0139\)/month
EventBridge Scheduler
- Invocations:
- \(21\) invocations (first \(14,000,000\) free)
- Total EventBridge cost: \(\$0.00\)/month
CloudWatch
Assuming we are only storing logs for the Lambda function and Fargate task, the cost of CloudWatch is very much negligible:
- Total CloudWatch cost: \(\$0.00\)/month
Total Estimated Monthly Cost
Service | Monthly Cost |
---|---|
NAT Gateway | \(\$32.99\)/month |
Public IPv4 Address | \(\$3.65\)/month |
Fargate | \(\$0.70\)/month |
Lambda | \(\$0.00\)/month |
S3 | \(\$0.02\)/month |
ECR | \(\$0.0139\)/month |
EventBridge Scheduler | \(\$0.00\)/month |
CloudWatch | \(\$0.00\)/month |
Total | \(\$37.3739\)/month |
This breakdown provides an estimate of the monthly costs associated with the project. The biggest cost driver is the NAT Gateway, which can be avoided by using a public subnet setup. The costs of Fargate, Lambda, S3, ECR, EventBridge, and CloudWatch are all within the free tier or negligible for this project.
GitHub Actions Workflows
To automate the deployment processes, we use two workflows:
- `ecr_deployment.yaml` (link): Builds and pushes the Docker image to Amazon ECR.
- `lambda_deployment.yaml` (link): Zips the Lambda source code, uploads it to S3, and reflects the changes in the Lambda function via an `update` operation.
The workflows require the following GitHub secrets:
- AWS_GITHUB_ACTIONS_ROLE_ARN: The ARN of the GitHub Actions role created in the IAM template.
- AWS_REGION: The AWS region where the resources are deployed.
- ECR_REPOSITORY: The name of the ECR repository for storing Docker images (for `ecr_deployment.yaml`).
- S3_BUCKET: The name of the S3 bucket where the Lambda function code is stored (for `lambda_deployment.yaml`).
- LAMBDA_FUNCTION: The name of the Lambda function to update (for `lambda_deployment.yaml`).
Both workflows are triggered on push to the `main` branch if certain pre-specified files are modified. The `ecr_deployment.yaml` workflow is triggered if any of the following files are modified:
- `Dockerfile`: Changes to the Dockerfile
- `src/**`: Changes to the source code
- `!src/deploy_stack.py`: Excludes changes to the deployment script since it is not part of the Docker image
- `main.py`: Changes to the entry point of the application
- `pyproject.toml` & `poetry.lock`: Changes to dependencies
The `lambda_deployment.yaml` workflow is triggered only if the `lambda_function.py` file is modified.
Application Code
The application code consists of the following files:
├── main.py
└── src
    ├── __init__.py
    ├── api.py
    └── utils.py
Modules
The `main.py` script (link) serves as the entry point for the scraper application. Its primary function is to orchestrate the overall scraping process, which includes querying ETF data from external APIs, processing the data, and writing the results to an S3 bucket. The following environment variables are required:
- `API_KEY` (String): The Alpha Vantage API key.
- `S3_BUCKET` (String): The name of the S3 bucket where the scraped data should be stored.
- `IPO_DATE` (String): The cutoff date for the IPO status check. The scraper will only fetch data for ETFs that were listed on or after this date. The format is `YYYY-MM-DD`.
- `MAX_ETFS` (Int): The maximum number of ETFs to scrape. This can be set to a lower value for testing purposes.
- `PARQUET` (String): Whether to save the scraped data as Parquet files. If set to the string value `True`, the data is saved in Parquet format; otherwise, it is saved as a CSV flat file.
In addition, add the following environment variable to `.env` to run the scraper in dev mode when running `main.py` locally:
- `ENV` (String): The environment in which the scraper is running. Set this to `dev` (e.g., `ENV=dev`).

Important: Ensure that this environment variable is removed from `.env` before uploading the file to S3 for production.
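To illustrate how the `S3_BUCKET` and `PARQUET` variables might be consumed, here is a minimal sketch (not the project’s actual code) that writes a pandas DataFrame to S3, assuming the `s3fs` and `pyarrow` packages are installed so that `s3://` URIs work directly:

```python
# A minimal sketch of the write step driven by the S3_BUCKET and PARQUET
# environment variables; the key layout shown here is an assumption.
import os
from datetime import date

import pandas as pd


def write_to_s3(data: pd.DataFrame) -> str:
    bucket = os.environ["S3_BUCKET"]
    as_parquet = os.environ.get("PARQUET", "False") == "True"
    key = f"etf_kpis/{date.today().isoformat()}"  # hypothetical key prefix
    if as_parquet:
        uri = f"s3://{bucket}/{key}.parquet"
        data.to_parquet(uri, index=False)  # Parquet output (requires pyarrow)
    else:
        uri = f"s3://{bucket}/{key}.csv"
        data.to_csv(uri, index=False)  # CSV flat file output
    return uri
```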
The `api.py` module (link) contains the `query_etf_data` function, which is responsible for fetching ETF data from the Alpha Vantage and Yahoo Finance APIs.
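The snippet below is a simplified sketch of the same idea, not the actual `query_etf_data` implementation: it pulls the active listings from Alpha Vantage’s `LISTING_STATUS` endpoint and a handful of KPIs from yfinance (the `info` field names can vary by fund and package version):

```python
# A simplified sketch: active listings from Alpha Vantage's LISTING_STATUS
# endpoint, then a few KPIs from yfinance for a single ETF.
import io
import os

import pandas as pd
import requests
import yfinance as yf


def fetch_active_etfs(api_key: str) -> pd.DataFrame:
    """Return active US ETF listings from the Alpha Vantage LISTING_STATUS endpoint."""
    params = {"function": "LISTING_STATUS", "state": "active", "apikey": api_key}
    resp = requests.get("https://www.alphavantage.co/query", params=params, timeout=30)
    resp.raise_for_status()
    listings = pd.read_csv(io.StringIO(resp.text))
    return listings[listings["assetType"] == "ETF"]


def fetch_etf_kpis(symbol: str) -> dict:
    """Return a subset of KPIs for a single ETF via yfinance."""
    info = yf.Ticker(symbol).info
    keys = [
        "previousClose", "navPrice", "trailingPE", "volume",
        "averageVolume", "bid", "ask", "category", "ytdReturn",
    ]
    return {"symbol": symbol, **{key: info.get(key) for key in keys}}


if __name__ == "__main__":
    etfs = fetch_active_etfs(os.environ["API_KEY"])
    print(fetch_etf_kpis(etfs["symbol"].iloc[0]))
```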
Key Performance Indicators
The key performance indicators (KPIs) and metrics fetched and processed by the scraper include:
- Previous Close: The last closing price of the ETF, useful for understanding recent prices of the ETF.
- NAV Price: Net Asset Value price, which represents the value of each share’s portion of the fund’s underlying assets and cash at the end of the trading day.
- Trailing P/E: Trailing price-to-earnings ratio, indicating the ETF’s valuation relative to its earnings.
- Volume: The total number of shares traded during the last trading day.
- Average Volume: The average number of shares traded over a specified period, e.g., 30 days.
- Bid and Ask Prices: The highest price a buyer is willing to pay (bid) and the lowest price a seller is willing to accept (ask), along with their respective sizes. More details on bid size can be found here.
- Category: The classification of the ETF, providing context on the type of assets it holds.
- Beta (Three-Year): A volatility measure of the ETF relative to the market, typically proxied by the S&P 500, over the past three years.
- YTD Return: Year-to-date return, measuring the ETF’s performance since the first trading day of the current calendar year.
- Three-Year and Five-Year Average Returns: The average returns over the past three and five years, respectively, providing long-term performance insights. These are derived from the compound annual growth rate formula.
\[\begin{align*} \text{CAGR} = \left( \frac{\text{Ending Value}}{\text{Beginning Value}} \right)^{\frac{1}{n}} - 1 \end{align*}\]
where \(n\) is the number of years in the measurement period.
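As a quick worked example of the formula, a fund whose NAV grows from 100 to 133.1 over three years has a three-year average annual return of 10%:

```python
# Worked example of the CAGR formula above.
beginning_value, ending_value, n = 100.0, 133.1, 3
cagr = (ending_value / beginning_value) ** (1 / n) - 1
print(f"{cagr:.2%}")  # 10.00%
```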
Dockerfile
The application code is containerized with Docker so it can be deployed on the AWS cloud. The following `Dockerfile` (link) takes a multi-stage approach to build the image efficiently.
- Base Stage (`python-base`):
  - Sets up a lightweight Python environment using a base image.
  - Sets essential environment variables for optimized performance and dependency management.
- Builder Stage (`builder`):
  - Installs system dependencies and Poetry.
  - Copies the `pyproject.toml` and `poetry.lock` files onto the container and installs the project dependencies in a virtual environment.
- Production Stage (`production`):
  - Copies the project directory with dependencies from the builder stage.
  - Copies the application code onto the container.
  - Sets the working directory and specifies the command to run the application using the Python interpreter from the virtual environment created during the builder stage.
This `Dockerfile` is adapted from a discussion in the Poetry GitHub repository.
Deployment
There are two options for deploying the AWS resources:
- Using the AWS Console:
  - Use an IAM user with the necessary permissions, i.e., a user with administrator access.
  - This is straightforward as everything is accomplished via a graphical user interface.
- Using the Python script `deploy_stack.py`:
  - Follow the Steps to Deploy Programmatically section below.
Steps to Deploy Via the AWS Console
When creating the stacks using the console, follow the steps in the order specified below:
- VPC Stack: Create the VPC stack.
- S3 & ECR Stack: Create the S3 & ECR stack.
- IAM Stack: Create the IAM stack.
- Add Secrets to GitHub:
  - `AWS_GITHUB_ACTIONS_ROLE_ARN`: The ARN of the GitHub Actions role created in the IAM template.
  - `AWS_REGION`: The AWS region where the resources are deployed.
  - `ECR_REPOSITORY`: The name of the ECR repository created in the S3 & ECR template.
- Trigger the `ecr_deployment.yaml` Workflow: This workflow builds the Docker image and pushes it to ECR.
- Run the `upload_env_to_s3.sh` (link) and `zip_lambda_to_s3.sh` (link) Scripts: These scripts upload the environment file and the Lambda function code to the S3 bucket we created. If the AWS CLI is not installed, manual upload to the S3 bucket also works.
- Lambda & EventBridge Stack: Create the Lambda & EventBridge stack.
- Add Two More Secrets to GitHub:
  - `S3_BUCKET`: The name of the S3 bucket created in the S3 & ECR template.
  - `LAMBDA_FUNCTION`: The name of the Lambda function created in the previous step.
- ECS Fargate Stack: Create the ECS Fargate stack.
- Add Environment Variables to Lambda:
  - `ASSIGN_PUBLIC_IP`: Set to `ENABLED` when using public subnets and `DISABLED` when using private subnets.
  - `ECS_CLUSTER_NAME`: Output from the ECS Fargate stack.
  - `ECS_CONTAINER_NAME`: Output from the ECS Fargate stack.
  - `ECS_TASK_DEFINITION`: Output from the ECS Fargate stack.
  - `SECURITY_GROUP`: Output from the VPC stack.
  - `SUBNET_1`: Output from the VPC stack.
  - `SUBNET_2`: Output from the VPC stack.
  - `env`: Defaults to `prod`, but can be overridden to `dev` for testing purposes.
Steps to Deploy Programmatically
Install AWS CLI
The AWS command line interface (CLI) is a tool for managing AWS services from the command line. Follow the installation instructions here to install the tool for your operating system. Verify the installation by running the following commands:
$ which aws
$ aws --version
There are several ways to configure the AWS CLI and credentials in an enterprise setting to enhance security. For this project, however, we will use the long-term credentials approach on our personal machines.
- Log in to the AWS Management Console:
- Use the user with administrator access since we need to create a new IAM user.
- Create a New IAM User With Programmatic Access:
- Navigate to the IAM console and create a new user with programmatic access.
- Attach the `AdministratorAccess` policy (documentation) to this user directly or to the user group to which this user belongs.
- Securely store the access key ID and secret access key provided.
- Configure the AWS CLI:
- Run the following command to configure the AWS CLI:
$ aws configure
AWS Access Key ID [None]: aws_access_key_id
AWS Secret Access Key [None]: aws_secret_access_key
Default region name [None]: aws_region
Default output format [None]: json
- Replace `aws_access_key_id` and `aws_secret_access_key` with the values provided when creating the IAM user.
- Note: While this method is sufficient for this project, it is not recommended for enterprise settings. Even for non-enterprise settings, it is advisable to:
- Rotate the access keys regularly via the IAM console of the user with administrator access.
- Delete any unused access keys to reduce potential security risks.
- Avoid adding these keys to any public repositories or sharing them with unauthorized users.
The steps above ensure that `boto3` can interact with AWS services using the configured credentials.
Run the Python Script
The `deploy_stack.py` Python script requires the following command line arguments:
- `--template_file`: The path to the CloudFormation template file.
- `--parameters_file`: The path to the JSON file containing stack parameters.
- `--save_outputs`: An optional flag to save the stack outputs to a JSON file.
An example parameters file is provided here.
# Activate the virtual environment
$ poetry shell
# Navigate to the project root directory
$ cd path_to_project_root
# Set the PYTHONPATH to the current directory
$ export PYTHONPATH=$PWD
We can then deploy each stack as follows (e.g. using the VPC stack):
# Run the script and save output to json
$ poetry run python3 src/deploy_stack.py --template_file cloudformation_templates/vpc-private.yaml --parameters_file cloudformation_templates/stack_parameters.json --save_outputs
The script outputs will be saved in the outputs directory, assuming the `--save_outputs` flag is used. These outputs can be utilized to populate the parameters for subsequent stack deployments.
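Under the hood, such a script boils down to a few CloudFormation calls via `boto3`. The condensed sketch below is not the project’s exact implementation: argument parsing and error handling are omitted, and the stack name and parameters-file layout are assumptions:

```python
# A condensed sketch of the boto3 CloudFormation calls behind a deployment
# script; the stack name and parameters-file layout are assumptions.
import json

import boto3


def deploy_stack(template_file: str, parameters_file: str, stack_name: str) -> list:
    cf = boto3.client("cloudformation")
    with open(template_file) as f:
        template_body = f.read()
    with open(parameters_file) as f:
        # Assumed layout: [{"ParameterKey": "...", "ParameterValue": "..."}, ...]
        parameters = json.load(f)

    cf.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=parameters,
        Capabilities=["CAPABILITY_NAMED_IAM"],  # needed for stacks that create IAM roles
    )
    # Block until the stack finishes creating, then return its outputs
    cf.get_waiter("stack_create_complete").wait(StackName=stack_name)
    stack = cf.describe_stacks(StackName=stack_name)["Stacks"][0]
    return stack.get("Outputs", [])


if __name__ == "__main__":
    outputs = deploy_stack(
        "cloudformation_templates/vpc-private.yaml",
        "cloudformation_templates/stack_parameters.json",
        stack_name="etf-scraper-vpc",  # hypothetical stack name
    )
    print(json.dumps(outputs, indent=2))
```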
The steps to deploy the resources are the same as those outlined in the AWS Console section above.
- VPC stack
- S3 & ECR stack
- IAM stack
  - Add the secrets to GitHub: `AWS_GITHUB_ACTIONS_ROLE_ARN`, `AWS_REGION`, `ECR_REPOSITORY`
  - Upload the environment file and the Lambda function code to S3, either manually or via the shell scripts
  - Build and push the Docker image to ECR via the GitHub Actions workflow
- Lambda & EventBridge stack
  - Add the secrets to GitHub: `S3_BUCKET`, `LAMBDA_FUNCTION`
- ECS Fargate stack
- Add the environment variables to the Lambda function
Tips for troubleshooting:
- Ensure that the paths to the template and parameters files are correct.
- Verify that the AWS credentials are configured correctly and that the necessary permissions are attached to deploy CloudFormation stacks, i.e., administrator access.
- Check the JSON parameters file `stack_parameters.json` (link) for any syntax errors or missing values that could affect the deployment.
Test Triggering the Lambda Function
To test the Lambda function, we can manually trigger it from the AWS console. The logs from the Lambda function and the ECS Fargate task can both be viewed in CloudWatch.
- Configure a test event with a payload consisting of `{'env': 'dev'}`:
- View container logs in CloudWatch:
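Alternatively, the same test event can be sent programmatically with `boto3`; the function name below is a hypothetical placeholder:

```python
# Sending the dev test event to the Lambda function programmatically.
import json

import boto3

lambda_client = boto3.client("lambda")
response = lambda_client.invoke(
    FunctionName="etf-scraper-lambda",     # hypothetical function name
    InvocationType="RequestResponse",      # wait for the handler's response
    Payload=json.dumps({"env": "dev"}).encode(),
)
print(json.loads(response["Payload"].read()))
```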
Wrapping Up
ETFs are a great, beginner-friendly way to build a diversified portfolio before gaining the confidence to manage our own holdings more actively.
By automating the data collection and storage with AWS services and Python, we can ensure up-to-date and accurate information with minimal manual effort. This allows us to focus on analyzing the data and making informed investment decisions.
Finally, all source files are available in the following repository.