Scraping S3 Buckets
An open S3 bucket is a special type of open directory. It cannot be scraped with Wget because the directory is not returned as a list of hyperlinks. Instead, it is returned as XML data which must be parsed into a list of files. This article explains two different methods to do so.
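To give an idea of what that XML looks like, here is a trimmed-down, made-up example of a bucket listing and one way to pull the object keys out of it with Python's standard library (real responses include more fields for each object):

```python
import xml.etree.ElementTree as ET

# A heavily trimmed, illustrative example of the XML an open bucket returns.
# Real responses also include LastModified, Size, ETag, etc. for each object.
listing = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example</Name>
  <IsTruncated>false</IsTruncated>
  <Contents><Key>photos/cat.jpg</Key></Contents>
  <Contents><Key>photos/dog.jpg</Key></Contents>
</ListBucketResult>"""

ns = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}
root = ET.fromstring(listing)
print([el.text for el in root.findall("s3:Contents/s3:Key", ns)])
# ['photos/cat.jpg', 'photos/dog.jpg']
```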
Using a Python Script
We have created a Python script to automatically grab a list of URLs from an open S3 bucket. Follow these steps to use the script:
- Download and install Python if you don't have it yet.
- Get the script from here. Press Ctrl-S to save it to your computer.
- Open the folder where you saved the Python script. Click the path bar at the top of the window and press Ctrl-C to copy the path.
- Open the Command Prompt, then type `cd` followed by a space. Then paste in the path and press Enter.
- Type `python scrapeS3.py BucketURL` to run the script. Replace "BucketURL" with the actual bucket URL, such as `https://example.s3.amazonaws.com/`.
- A file called `urls.txt` will appear containing all of the URLs extracted from the bucket. You can download the URLs with a tool like Wget or cURLsDownloader.
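For reference, below is a rough sketch of what a script of this kind typically does: request the bucket's listing, parse the returned XML, follow the pagination marker, and write the resulting URLs to `urls.txt`. It is not the linked scrapeS3.py; the function names and details are illustrative.

```python
import sys
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

def list_keys(bucket_url):
    """Yield every object key in an open bucket, following pagination."""
    marker = ""
    while True:
        query = "?" + urllib.parse.urlencode({"marker": marker}) if marker else ""
        url = bucket_url.rstrip("/") + "/" + query
        with urllib.request.urlopen(url) as resp:
            root = ET.parse(resp).getroot()
        keys = [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]
        yield from keys
        truncated = root.findtext("s3:IsTruncated", "false", S3_NS) == "true"
        if not truncated or not keys:
            break
        marker = keys[-1]  # the last key of a truncated listing is the next marker

if __name__ == "__main__":
    bucket_url = sys.argv[1]  # e.g. https://example.s3.amazonaws.com/
    with open("urls.txt", "w", encoding="utf-8") as out:
        for key in list_keys(bucket_url):
            out.write(bucket_url.rstrip("/") + "/" + urllib.parse.quote(key) + "\n")
```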
Using the AWS CLI
If the S3 bucket URL contains `s3.amazonaws.com`, then Amazon's official AWS command-line tool can be used to grab a list of files in the bucket.

First, install the tool by following these instructions. Or if you have Python installed, you can run `pip install awscli`.
Next, determine the bucket name. There are two possible formats that the URL can follow:
- If the bucket URL begins with `s3.amazonaws.com`, then the subsequent part of the URL is the bucket name. The bucket name of `https://s3.amazonaws.com/example/` would be `example`.
- Otherwise, the bucket name is the part of the URL before `s3.amazonaws.com`. The bucket name of `https://example.s3.amazonaws.com/` would be `example`.
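If you prefer to apply this rule programmatically, it can be expressed in a few lines of Python (a sketch; the function name is just for illustration):

```python
from urllib.parse import urlparse

def bucket_name(bucket_url):
    """Extract the bucket name from either URL format described above."""
    parsed = urlparse(bucket_url)
    if parsed.netloc == "s3.amazonaws.com":
        # Path-style URL: the bucket name is the first path segment.
        return parsed.path.strip("/").split("/")[0]
    # Virtual-hosted-style URL: the bucket name comes before ".s3.amazonaws.com".
    return parsed.netloc.split(".s3.amazonaws.com")[0]

print(bucket_name("https://s3.amazonaws.com/example/"))   # example
print(bucket_name("https://example.s3.amazonaws.com/"))   # example
```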
Next, run the following command, replacing `bucketname` with the bucket name you found:
aws s3api list-objects --no-sign-request --bucket bucketname --output text --query "Contents[].{Key: Key}" > files.txt
You will now have a list of files. To transform it into a list of URLs, add the bucket URL to the beginning of each line of the file. You can do this in a text editor like Sublime Text or Notepad++ using the Find and Replace feature. Open the file, then press Ctrl-H to open the Find and Replace dialog. Activate the "Regular Expression" option, then type the following into the input boxes:
- Find: `(.*)`
- Replace: `BucketURL\1`
Type the actual bucket URL instead of "BucketURL", for example `https://example.s3.amazonaws.com/\1`.
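If you would rather not use a text editor, the same transformation can be done with a short Python snippet (a sketch; adjust the file names and the bucket URL to match yours):

```python
# Prefix every key in files.txt with the bucket URL and save the result as urls.txt.
bucket_url = "https://example.s3.amazonaws.com/"  # replace with the actual bucket URL

with open("files.txt", encoding="utf-8") as src, open("urls.txt", "w", encoding="utf-8") as dst:
    for line in src:
        key = line.strip()
        if key:
            dst.write(bucket_url + key + "\n")
```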