AWS: Using the AWS CLI to search within files in an S3 Bucket - Part 2 - the Search Script
How to 'just' search for a file in an S3 bucket
Previously on codemunkies: AWS: Using the AWS CLI to search within files in an S3 Bucket - Part 1 - Cloudformation
Executing find-file.sh
As with the generate-messages.sh script, you will need to set the execute bit, which is easily done with this command: chmod u+x find-file.sh
Once you have made find-file.sh executable you invoke it like so: ./find-file.sh -b "<bucket name>" -p "<prefix>" -n "<grep pattern>"
There are various parameters that can be set:
-b|--bucket <value> - the bucket to search
-p|--prefix <value> (optional) - the prefix (or folder) to search
-o|--objectCount <value> (optional) - the number of object keys to retrieve from the bucket at one time
-n|--pattern <value> - the grep pattern to use to test messages in files
-m|--manyMatches - when included, flags that the script should carry on past the first match
-j|--jsonPath <value> - a jq json path within the file to extract
-i|--iterations <value> - the maximum number of times to retrieve object keys from the bucket
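For illustration, here is a hypothetical invocation (the bucket name, prefix and pattern are all made up) that searches up to 10 batches of objects and reports every match rather than stopping at the first:
./find-file.sh -b "my-message-bucket" -p "2023/06/" -n "orderId" -m -i 10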
Processing command options
Almost half of the script is concerned with setting up variables:
bucket="testBucket"
prefix=""
objectCount=10
stopAtFirst=1
pattern="test"
breakFirst=1
jsonPath=".Message"
iterations=5
The script then processes the command line, overriding the defaults with any options supplied:
while [[ $# -gt 0 ]]; do
    case $1 in
        -b|--bucket)
            bucket=$2
            shift 2
            ;;
        -p|--prefix)
            prefix=$2
            shift 2
            ;;
        -o|--objectCount)
            objectCount=$2
            shift 2
            ;;
        -n|--pattern)
            pattern=$2
            shift 2
            ;;
        -m|--manyMatches)
            breakFirst=0
            shift
            ;;
        -j|--jsonPath)
            jsonPath=$2
            shift 2
            ;;
        -i|--iterations)
            iterations=$2
            shift 2
            ;;
        *)
            shift
            ;;
    esac
done
In truth, being reasonably new to bash scripting, this was probably the toughest part of the script to create. Josh Sherman has a good description of how the parsing works.
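The key to the whole pattern is shift, which discards arguments from the front of the positional parameter list. A minimal sketch (run in isolation, with made-up arguments) shows how it walks the options:
set -- -b "my-bucket" -n "error"   # simulate the script being invoked with these arguments
echo $#                            # 4 - the number of arguments still to process
shift 2                            # consume -b and its value
echo $1                            # -n is now the first argument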
Iterations
The script provides two modes of operation. The default is to stop at the first match. However, by specifying -m or --manyMatches the script can be forced to run through all of the search iterations. Be warned, though: this could turn out to be an expensive (and time-consuming) choice. Lines 49 and 50 of the script set up a couple of variables that control this behaviour:
found=0    # set to 1 when a match is found
count=1    # the current iteration
Processing
The main processing loop of the script can be expressed in pseudo-code like this:
Get a limited list of objects, and save in files.json
while(files.json exists) {
    foreach(key in files.json) {
        Get the object named in the key, save it to temp.json
        Extract the specific json key from the file
        if(the pattern exists in the extracted value) {
            set found to true
            output the object key
        }
        if(found is true and not finding many matches) {
            break the foreach key loop
        }
    }
    if(found is true and not finding many matches) {
        break the while loop
    } else {
        increment the loop count
    }
    if(the loop count exceeds the iterations) {
        break the while loop
    }
    get the token for the next objects
    if(there is no next token) {
        break the while loop
    }
    get the next list of objects using the token and save in files.json
}
remove temp.json
remove files.json
The actual code looks like this:
while [[ -f files.json ]]
do
    while read key
    do
        # Download the object so its contents can be inspected
        aws s3api get-object --bucket "$bucket" --key "$key" temp.json > /dev/null
        if grep -q -i "$pattern"; then
            echo "Match: $key - Message: $(cat temp.json | jq "$jsonPath")"
            found=1
        fi < <(cat temp.json | jq "$jsonPath")
        if [ $found -gt 0 ] && [ $breakFirst -gt 0 ]; then
            break
        fi
    done < <(cat files.json | jq -r '.Contents | .[] | .Key')
    if [[ $found -gt 0 ]] && [[ $breakFirst -gt 0 ]]; then
        break
    else
        echo "Iteration $count completed"
        count=$((count + 1))
    fi
    if [[ $count -gt $iterations ]]; then
        break
    fi
    if [[ -f files.json ]]; then
        nextToken=$( cat files.json | jq -r '.NextToken' )
        if [[ -z "$nextToken" || "$nextToken" == "null" ]]; then
            break    # no more objects left to list
        fi
        aws s3api list-objects --bucket "$bucket" --prefix "$prefix" --max-items $objectCount --starting-token $nextToken > files.json
    fi
done
if [[ -f temp.json ]]; then
    rm temp.json
fi
if [[ -f files.json ]]; then
    rm files.json
fi
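As an aside, an alternative I might consider for that cleanup is a trap registered near the top of the script, which removes the temporary files however the script exits (a sketch, not what the script currently does):
trap 'rm -f temp.json files.json' EXIT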
Getting a list of objects from S3
There are two subtly different calls made to get a list of objects from S3. The first call gets the first $objectCount objects using the default sorting (the key name, ascending):
aws s3api list-objects --bucket "$bucket" --prefix "$prefix" --max-items $objectCount > files.json
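The json saved to files.json looks something like this (abbreviated and purely illustrative; the real response carries more fields per object, such as LastModified and Size):
{
    "Contents": [
        { "Key": "messages/0001.json" },
        { "Key": "messages/0002.json" }
    ],
    "NextToken": "eyJNYXJrZXIiOiBudWxs..."
}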
In the json that is returned there is a NextToken key-value pair, which appears because the --max-items argument has limited the results. To carry on and get the next set of objects, the value in NextToken must be supplied with the next call. I use a Command Substitution to get the token value from the file:
nextToken=$( cat files.json | jq -r '.NextToken' )
The token can then be used in the next call to the s3api list-objects command, in the --starting-token argument:
aws s3api list-objects --bucket "$bucket" --prefix "$prefix" --max-items $objectCount --starting-token $nextToken > files.json
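Stripped of the searching, the pagination pattern on its own looks something like this condensed sketch (assuming bucket, prefix and objectCount are already set):
aws s3api list-objects --bucket "$bucket" --prefix "$prefix" --max-items "$objectCount" > files.json
nextToken=$( jq -r '.NextToken' files.json )
while [[ -n "$nextToken" && "$nextToken" != "null" ]]; do
    # ... process the keys in files.json here ...
    aws s3api list-objects --bucket "$bucket" --prefix "$prefix" --max-items "$objectCount" --starting-token "$nextToken" > files.json
    nextToken=$( jq -r '.NextToken' files.json )
done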
Process Substitution
A repeated issue I encountered whilst developing the script (and a reason why the loop is not structured in a more straightforward way) is that variables I set (in particular found) would not be set outside the loop. Much searching brought me to I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?. The basic issue is that I am piping commands together to get the output I want. Originally I had something similar to this: cat temp.json | jq "$jsonPath" | grep -q -i "$pattern". This will work, and you'll get an actionable exit value from grep, but when I test that value and then set a variable based upon it, I am doing so in a subshell created for the pipeline. The script, however, is running in the parent shell, and the variable is not automatically passed back up.
Of the workarounds offered, I plumped for Process Substitution as it resulted in the code that was most understandable to me (lines 59-62 of the script demonstrate this most clearly):
if grep -q -i "$pattern"; then
    echo "Match: $key - Message: $(cat temp.json | jq "$jsonPath")"
    found=1
fi < <(cat temp.json | jq "$jsonPath")
Here <(cat temp.json | jq "$jsonPath") is the process being substituted in, and its output is redirected (fi < <(cat temp.json | jq "$jsonPath")) into the grep command. Because no pipeline is involved, the if statement (and with it the found=1 assignment) runs in the context of the parent shell.
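The difference is easy to demonstrate in isolation. This toy example (not from the script) shows a variable set in a piped loop disappearing, while the process substitution version keeps it:
# Piped version: the while loop runs in a subshell, so the assignment is lost
found=0
printf 'a\nb\n' | while read line; do found=1; done
echo "$found"    # prints 0

# Process substitution version: the loop runs in the current shell
found=0
while read line; do found=1; done < <(printf 'a\nb\n')
echo "$found"    # prints 1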
Is this the best way of finding a file?
A question I have asked myself whilst writing up this little script is "is this the best way of finding a file in s3?" and the honest answer is: it depends.
Going back to the original situation, I was asked to help colleagues who were looking for examples of a type of file that was written to the bucket with high frequency. I didn't need a specific file, I just needed an example of a type of file. So rather than looking for a needle in a haystack, I was really looking for the brown M&Ms, or, more likely still, the red, green and blue M&Ms.
If I needed to find something very specific in a large set of objects then I would strongly consider syncing the bucket to a local machine and then searching over those files.
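That approach would be something like the following (the local directory name here is arbitrary):
aws s3 sync "s3://<bucket name>/<prefix>" ./local-copy
grep -r -i -l "<grep pattern>" ./local-copy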
This is just one solution to my particular issue. I've written it up here so that I can remember it in future, but hopefully it will help someone else too.