
AWS: Using the AWS CLI to search within files in an S3 Bucket - Part 1 - CloudFormation
When setting up infrastructure is more straightforward than the 'simple' task you need to perform
Recently as part of my job I needed to help colleagues find some very specific files within an S3 bucket. This post deals with setting up some basic infrastructure to mimic how the original data was generated, and then a script to generate some test data. In the next post I’ll go through how the search script works.
Repository
All of the code and the scripts are available on my GitHub.
Starting with the project itself, I’ve employed a devcontainer based on the TypeScript-Node devcontainer.
The devcontainer.json has been updated at lines 10-13 to pull in the AWS CLI feature and the AWS CDK feature contribution:
"features": {
"ghcr.io/devcontainers/features/aws-cli:1": {},
"ghcr.io/devcontainers-contrib/features/aws-cdk:2": {}
},
This means that the AWS CLI and CDK will be available in the container. To avoid the need to set up the CLI each time the container is built, I’ve mapped my local .aws folder into the devcontainer (lines 17-19):
"mounts": [
"source=${localEnv:HOME}/.aws,target=/home/node/.aws,type=bind,consistency=cached"
],
The above configuration was put together based on Rehan Haider’s blog post.
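Once the container has been rebuilt, a quick check confirms both tools are available:
aws --version
cdk --version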
Infrastructure
The aim of the stack is to set up the following pipeline: messages published to an SNS topic are delivered to an SQS queue, which in turn triggers a Lambda function that writes each message into an S3 bucket.
The lambda is backed by a JavaScript function that writes incoming events directly to the S3 bucket; it is defined in the stack as follows:
const lambda = new Function(this, 'findansqsevent-lambda', {
  runtime: Runtime.NODEJS_18_X,
  code: Code.fromAsset(path.join(__dirname, '/../js-function/findansqsevent')),
  handler: "index.handler",
  environment: {
    bucket: bucket.bucketName,
    prefix: "outputFolder/"
  }
});
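The snippet above references a bucket created elsewhere in the stack, and the lambda still needs to be connected to the queue. A minimal sketch of how the surrounding resources might be wired together in CDK v2 follows; the construct ids here are my own assumptions, not necessarily what the repository uses:
import * as path from 'path';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { Topic } from 'aws-cdk-lib/aws-sns';
import { SqsSubscription } from 'aws-cdk-lib/aws-sns-subscriptions';
import { Queue } from 'aws-cdk-lib/aws-sqs';
import { Code, Function, Runtime } from 'aws-cdk-lib/aws-lambda';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

// The bucket the lambda writes into, and the topic/queue pair that feeds it
const bucket = new Bucket(this, 'findansqsevent-bucket');
const topic = new Topic(this, 'findansqsevent-topic');
const queue = new Queue(this, 'findansqsevent-queue');

// Deliver published messages to the queue
// (rawMessageDelivery: true would strip the SNS envelope from the stored body)
topic.addSubscription(new SqsSubscription(queue));

// ... the lambda Function construct shown above is defined here ...

// Trigger the lambda from the queue and allow it to write to the bucket
lambda.addEventSource(new SqsEventSource(queue));
bucket.grantPut(lambda);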
When the lambda is initialised, the bucket and prefix environment variables are read once and reused in each execution:
const { bucket, prefix } = process.env;
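The handler also needs an S3 client; with the AWS SDK for JavaScript v3, which the Node.js 18 runtime bundles, the top of index.js would need something like the following (a sketch; the exact setup in the repository may differ):
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

// Create the client once, outside the handler, so it is reused across warm invocations
const client = new S3Client({});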
The function is set up to deal with multiple incoming messages, cycling through all of them and PUTting each one into the S3 bucket in turn, keyed by its SQS message id under the configured prefix (e.g. outputFolder/<messageId>.json):
for (const { messageId, body } of event.Records) {
  console.log('SQS message %s: %j', messageId, body);
  // Key each object by its message id, normalising the prefix so there is
  // exactly one '/' between the prefix and the file name
  const command = new PutObjectCommand({
    Body: body,
    Bucket: bucket,
    Key: prefix.endsWith('/') ?
      `${prefix}${messageId}.json` :
      `${prefix}/${messageId}.json`
  });
  const response = await client.send(command);
  console.log('S3 response: ' + JSON.stringify(response));
}
The stack should first be built using the command npm run build. Once this has completed it can be deployed with cdk deploy. This assumes you have the AWS CLI and CDK installed and configured.
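Putting it together, the full sequence looks something like this (cdk bootstrap is only needed the first time a CDK app is deployed into a given account and region):
npm run build
cdk bootstrap
cdk deploy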
Data generation script
The repository includes the generate-messages.sh shell script that publishes events.
The script takes two parameters:
-t|--topic <value> - the ARN of the deployed topic
-i|--iterations <value> - the number of events to generate
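For example, to publish 10 test messages (the topic ARN below is a placeholder; use the ARN output by your deployment):
./generate-messages.sh --topic arn:aws:sns:eu-west-1:123456789012:my-topic --iterations 10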
The script uses the aws sns publish command to publish events to the SNS topic created by the stack above:
aws sns publish --topic-arn $topic --message "$value" > /dev/null
The output of the aws sns publish command is redirected to /dev/null to keep the console clean; without the redirect, each call prints something similar to the following:
{
  "MessageId": "123a45b6-7890-12c3-45d6-333322221111"
}
For this use case I’m not too concerned with the message id; however, it may be important to you when publishing messages.
In order to run the script you need to set the execute permission; the following command will do this for the current user only: chmod u+x generate-messages.sh
As with deploying the stack, you will need the AWS CLI installed and configured to run the script.