Previously on codemunkies: AWS: Using the AWS CLI to search within files in an S3 Bucket - Part 1 - Cloudformation


Executing find-file.sh

As with the generate-messages.sh script, you will need to set the execution bit, which is easily done with this command: chmod u+x find-file.sh

Once you have made find-file.sh executable, you invoke it like so: ./find-file.sh -b "<bucket name>" -p "<prefix>" -n "<grep pattern>"

There are various parameters that can be set; a worked example follows the list:

  • -b|--bucket <value> - the bucket to search
  • -p|--prefix <value> (optional) - the prefix (or folder) to search
  • -o|--objectCount <value> (optional) - the number of object keys to retrieve from the bucket at one time
  • -n|--pattern <value> - the grep pattern to use to test messages in files
  • -m|--manyMatches (optional) - when included, flags that the script should carry on past the first match
  • -j|--jsonPath <value> - the jq path of the value to extract from each file
  • -i|--iterations <value> - the maximum number of times to retrieve object keys from the bucket
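For example, to search a (hypothetical) bucket named my-test-bucket for messages containing "error", pulling 20 object keys per request for at most 10 requests:

./find-file.sh -b "my-test-bucket" -p "messages/" -n "error" -j ".Message" -o 20 -i 10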

Processing command options

Almost half of the script is concerned with setting up variables:

bucket="testBucket"
prefix=""
objectCount=10
pattern="test"
breakFirst=1
jsonPath=".Message"
iterations=5

And then processing the command line to override them where arguments are supplied:

while [[ $# -gt 0 ]]; do
    case $1 in
        -b|--bucket)
            bucket=$2
            shift 2
            ;;
        -p|--prefix)
            prefix=$2
            shift 2
            ;;
        -o|--objectCount)
            objectCount=$2
            shift 2
            ;;
        -n|--pattern)
            pattern=$2
            shift 2
            ;;
        -m|--manyMatches)
            breakFirst=0
            shift
            ;;
        -j|--jsonPath)
            jsonPath=$2
            shift 2
            ;;
        -i|--iterations)
            iterations=$2
            shift 2
            ;;
        *)
            shift    # ignore unrecognised arguments
            ;;
    esac
done

In truth, being reasonably new to bash scripting, this was probably the toughest part of the script to write šŸ˜”. Josh Sherman has a good description of how the parsing works.
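To make the shift mechanics concrete, here is a trace of the loop for a short (hypothetical) command line:

# ./find-file.sh -b my-bucket -m
# pass 1: $1 is -b, $2 is my-bucket  ->  bucket=my-bucket, shift 2
# pass 2: $1 is -m                   ->  breakFirst=0, shift
# $# is now 0, so the while loop exits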

Iterations

The script provides two modes of operation. The default is to quit at the first match. By specifying -m or --manyMatches, however, the script can be forced to run through all of the search iterations; beware that this could turn out to be an expensive (and time-consuming) choice. Lines 49 and 50 of the script set up a couple of variables that are used to control this behaviour:

found=0
count=1

Processing

The main processing loop of the script can be expressed in pseudo-code like this:

Get a limited list of objects, and save in files.json

while(files.json exists) {
    foreach(key in files.json) {
        Get the object named in the key, save it to temp.json

        Extract the specific json key from the file
        if(the pattern exists in the extracted value) {
            set found to true
            output object key
        }

        if(found is true and not finding many matches) {
            break the foreach key loop
        }
    }

    if(found is true and not finding many matches) {
        break the while loop
    } else {
        increment the loop count
    }

    if(the loop count exceeds the iterations) {
        break the while loop
    }

    get the token for the next objects
    get the next list of objects using the token and save in files.json
}

remove temp.json
remove files.json

The actual code looks like this:

while [[ -f files.json ]]
do
    while read key
    do
        aws s3api get-object --bucket "$bucket" --key "$key" temp.json > /dev/null

        if grep -q -i "$pattern"; then
            echo "Match: $key - Message: $(cat temp.json | jq "$jsonPath")"
            found=1
        fi < <(cat temp.json | jq "$jsonPath")

        if [[ $found -gt 0 ]] && [[ $breakFirst -gt 0 ]]; then
            break
        fi
    done < <(cat files.json | jq -r '.Contents | .[] | .Key')

    if [[ $found -gt 0 ]] && [[ $breakFirst -gt 0 ]]; then
        break
    else
        echo "Iteration $count completed"
        count=$((count + 1))
    fi

    if [[ $count -gt $iterations ]]; then
        break
    fi

    if [[ -f files.json ]]; then
        nextToken=$(cat files.json | jq -r '.NextToken')

        if [[ "$nextToken" == "null" ]]; then
            break
        fi

        aws s3api list-objects --bucket "$bucket" --prefix "$prefix" --max-items "$objectCount" --starting-token "$nextToken" > files.json
    fi
done

if [[ -f temp.json ]]; then
    rm temp.json
fi

if [[ -f files.json ]]; then
    rm files.json
fi

Getting a list of objects from S3

There are two subtly different calls made to get a list of objects from S3. The first call gets the first $objectCount objects using the default sort order (key name, ascending):

aws s3api list-objects --bucket "$bucket" --prefix "$prefix" --max-items $objectCount > files.json
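Trimmed down to just the fields the script relies on, the JSON written to files.json looks roughly like this (the Contents, Key and NextToken names are real; the values are invented):

{
    "Contents": [
        { "Key": "messages/0001.json" },
        { "Key": "messages/0002.json" }
    ],
    "NextToken": "eyJNYXJrZXIiOiBudWxs..."
}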

In the JSON that is returned there is a NextToken key-value pair; it is included because the --max-items argument has limited the result set. To carry on and get the next set of objects, the value in NextToken must be supplied with the next call. I use command substitution to get the token value from the file:

nextToken=$( cat files.json | jq -r '.NextToken' )

The token can then be used in the next call to the s3api list-objects command in the --starting-token argument:

aws s3api list-objects --bucket "$bucket" --prefix "$prefix" --max-items $objectCount --starting-token $nextToken > files.json
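One wrinkle: on the last page there is no NextToken key, and jq -r then prints the literal string null rather than nothing. That is not a usable token, which is why the main loop above checks for it before making the call:

if [[ "$nextToken" == "null" ]]; then
    break    # no more pages to fetch
fi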

Process Substitution

A repeated issue I encountered whilst developing the script (and a reason why the loop is not structured in a more straightforward way) is that variables I set (in particular found) would not be visible outside the loop. Much searching brought me to I set variables in a loop thatā€™s in a pipeline. Why do they disappear after the loop terminates? Or, why canā€™t I pipe data to read?. The basic issue is that I was piping commands together to get the output I wanted. Originally I had something similar to this: cat temp.json | jq "$jsonPath" | grep -q -i "$pattern". This works, and youā€™ll get an actionable exit value from grep, but every stage of a bash pipeline runs in a subshell, so when I tested that value and then set a variable based upon it, I was doing so in a subshell. The script itself runs in the parent shell, and the variable is not automatically passed back up.
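A minimal way to see the problem, and the fix, outside of this script (paste it into a bash prompt):

found=0
echo "hello" | while read line; do
    found=1              # set in the pipeline's subshell
done
echo "$found"            # prints 0 - the assignment was lost

found=0
while read line; do
    found=1              # set in the parent shell
done < <(echo "hello")
echo "$found"            # prints 1 - the assignment survives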

Of the workarounds offered, I plumped for Process Substitution as this resulted in the code that was most understandable to me (Lines 59-62 demonstrate this most clearly):

if grep -q -i "$pattern"; then
    echo "Match: $key - Message: $(cat temp.json | jq "$jsonPath")"
    found=1
fi < <(cat temp.json | jq "$jsonPath")

Here <(cat temp.json | jq "$jsonPath") is the process being substituted in, and its output is redirected (fi < <(cat temp.json | jq "$jsonPath")) into the if statement, where it becomes grepā€™s standard input. Because the if statement itself runs in the parent shell, the found=1 assignment survives once the loop is done.
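Another of the workarounds offered is bash 4.2ā€™s shopt -s lastpipe, which runs the last command of a pipeline in the current shell when job control is inactive (as it is in scripts). A sketch of how that would have looked here:

shopt -s lastpipe    # bash 4.2+, needs job control to be inactive
cat files.json | jq -r '.Contents | .[] | .Key' | while read key; do
    found=1          # with lastpipe this runs in the current shell
done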

Is this the best way of finding a file?

A question I have asked myself whilst writing up this little script is ā€œis this the best way of finding a file in S3?ā€, and the honest answer is: it depends.

Going back to the original situation, I was asked to help colleagues who were looking for examples of a type of file that was written to the bucket with high frequency. I didnā€™t need a specific file, I just needed an example of a type of file. So rather than looking for a needle in a haystack, I was really looking for the brown M&Ms (or, more accurately, the red, green and blue M&Ms).

If I needed to find something very specific in a large set of objects, then I would strongly consider syncing the bucket to a local machine and then searching over those files.
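With the CLI that is only a couple of commands (the bucket, prefix and pattern here are hypothetical):

aws s3 sync "s3://my-test-bucket/messages/" ./local-copy
grep -r -i -l "error" ./local-copy

grep -l prints just the names of the files that match.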

This is just one solution to my particular issue. Iā€™ve provided it here so that I can remember it in future, and hopefully it will help someone else too.