Scraping S3 Buckets
An open S3 bucket is a special type of open directory. It cannot be scraped with Wget because the directory is not returned as a list of hyperlinks. Instead, it is returned as XML data which must be parsed into a list of files. This article explains two different methods to do so.
Using a Python Script
We have created a Python script to automatically grab a list of URLs from an open S3 bucket. Follow these steps to use the script:
- Download and install Python if you don't have it yet.
- Get the script from here. Press Ctrl-S to save it to your computer.
- Open the folder where you saved the Python script. Click the path bar at the top of the window and press Ctrl-C to copy the path.
- Open the Command Prompt, then type `cd` followed by a space. Then paste in the path and press Enter.
- Type `python scrapeS3.py BucketURL` to run the script. Replace "BucketURL" with the actual bucket URL, such as `https://example.s3.amazonaws.com/`.
- A file called `urls.txt` will appear containing all of the URLs extracted from the bucket. You can download the URLs with a tool like Wget or cURLsDownloader.
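The linked script isn't reproduced here, but if you're curious what it has to do, the following is a minimal sketch of the same approach: fetch the bucket URL, parse the `ListBucketResult` XML, and follow the `marker` pagination until every key has been seen. The function names are illustrative, and it assumes the bucket permits anonymous listing.

```python
import sys
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# All S3 listing responses use this XML namespace.
NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"

def parse_keys(xml_text):
    """Extract the object keys from one ListBucketResult page,
    plus a flag saying whether more pages follow."""
    root = ET.fromstring(xml_text)
    keys = [el.text for el in root.iter(NS + "Key")]
    truncated = root.findtext(NS + "IsTruncated") == "true"
    return keys, truncated

def scrape(bucket_url):
    """Return the full URL of every object in an anonymously listable bucket."""
    keys, marker = [], ""
    while True:
        query = "?marker=" + urllib.parse.quote(marker)
        page_keys, truncated = parse_keys(
            urllib.request.urlopen(bucket_url + query).read())
        keys.extend(page_keys)
        if not truncated or not page_keys:
            break
        marker = page_keys[-1]  # resume the listing after the last key seen
    return [bucket_url + urllib.parse.quote(key) for key in keys]

if __name__ == "__main__" and len(sys.argv) > 1:
    with open("urls.txt", "w") as f:
        f.write("\n".join(scrape(sys.argv[1])))
```

Each response page carries at most 1,000 keys, which is why the loop re-requests with a `marker` parameter until `IsTruncated` is no longer `true`.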
Using the AWS CLI
If the S3 bucket URL contains `s3.amazonaws.com`, then Amazon's official AWS command-line tool can be used to grab a list of the files in the bucket.
First, install the tool by following these instructions. Alternatively, if you have Python installed, you can run `pip install awscli`.
Next, determine the bucket name. There are two possible formats that the URL can follow:
- If the bucket URL begins with `s3.amazonaws.com`, then the subsequent part of the URL is the bucket name. The bucket name of `https://s3.amazonaws.com/example/` would be `example`.
- Otherwise, the bucket name is the part of the URL before `s3.amazonaws.com`. The bucket name of `https://example.s3.amazonaws.com/` would be `example`.
- Note that sometimes you'll find an S3 URL referencing a specific region, for example `https://example.s3.us-east-1.amazonaws.com/`. In this case, you can simply remove the region string from the URL to get `https://example.s3.amazonaws.com/`, then treat the URL according to the previous two rules.
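These rules are mechanical enough to automate. Here is a small illustrative helper (the function name and regex are our own, not part of any official tool) that applies all three rules:

```python
import re
from urllib.parse import urlparse

def bucket_name(url):
    """Derive the bucket name from an S3 URL using the rules above."""
    # Rule 3: drop an optional region string such as "s3.us-east-1.amazonaws.com".
    host = re.sub(r"s3[.-][a-z0-9-]+\.amazonaws\.com", "s3.amazonaws.com",
                  urlparse(url).netloc)
    if host == "s3.amazonaws.com":
        # Rule 1: path-style URL, bucket name is the first path segment.
        return urlparse(url).path.strip("/").split("/")[0]
    # Rule 2: virtual-hosted-style URL, bucket name precedes s3.amazonaws.com.
    return host[: -len(".s3.amazonaws.com")]

print(bucket_name("https://s3.amazonaws.com/example/"))            # example
print(bucket_name("https://example.s3.amazonaws.com/"))            # example
print(bucket_name("https://example.s3.us-east-1.amazonaws.com/"))  # example
```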
Next, run the following command, replacing `bucketname` with the bucket name you found:
aws s3api list-objects --no-sign-request --bucket bucketname --output text --query "Contents[].{Key: Key}" > files.txt
You will now have a list of files. To transform it into a list of URLs, add the bucket URL to the beginning of each line of the file. You can do this in a text editor like Sublime Text or Notepad++ using the Find and Replace feature. Open the file, then press Ctrl-H to open the Find and Replace dialog. Activate the "Regular Expression" option, then type the following into the input boxes:
- Find: `(.*)`
- Replace: `BucketURL\1`
Type the actual bucket URL instead of "BucketURL", for example `https://example.s3.amazonaws.com/\1`.
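If you'd rather not use a text editor, the same prepend step can be scripted. This is a sketch, not the only way to do it; the bucket URL and file names follow the example above:

```python
import os

def prepend_bucket_url(keys, bucket_url):
    """Turn a list of object keys into full URLs, skipping blank lines."""
    return [bucket_url + key.strip() for key in keys if key.strip()]

if __name__ == "__main__" and os.path.exists("files.txt"):
    # Read the files.txt produced by the aws command and write urls.txt.
    with open("files.txt") as f:
        urls = prepend_bucket_url(f, "https://example.s3.amazonaws.com/")
    with open("urls.txt", "w") as out:
        out.write("\n".join(urls))
```

The resulting `urls.txt` can then be fed to Wget or another downloader, just as in the Python-script method above.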