Scraping S3 Buckets
An open S3 bucket is a special type of open directory. It cannot be scraped with Wget because the directory is not returned as a list of hyperlinks. Instead, it is returned as XML data which must be parsed into a list of files. This article explains two different methods to do so.
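For example, requesting the root of an open bucket typically returns an XML listing roughly like this (abridged; the bucket name and keys here are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>photos/cat.jpg</Key>
    <Size>102400</Size>
  </Contents>
  <Contents>
    <Key>photos/dog.jpg</Key>
    <Size>204800</Size>
  </Contents>
</ListBucketResult>

Each Contents element describes one object, and its Key is the object's path within the bucket.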
Using a Python Script
We have created a Python script to automatically grab a list of URLs from an open S3 bucket. Follow these steps to use the script (a rough sketch of how such a script works appears after the steps):
1. Download and install Python if you don't have it yet.
2. Get the script from here. Press Ctrl-S to save it to your computer.
3. Open the folder where you saved the Python script. Click the path bar at the top of the window and press Ctrl-C to copy the path.
4. Open the Command Prompt, then type cd followed by a space. Then paste in the path and press Enter.
5. Type python scrapeS3.py BucketURL to run the script. Replace "BucketURL" with the actual bucket URL, such as https://example.s3.amazonaws.com/.
6. A file called urls.txt will appear containing all of the URLs extracted from the bucket.
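For reference, the following is not the actual linked script, only a minimal sketch of the same idea using the Python standard library: it pages through the bucket's XML listing (via S3's documented list-type=2 parameter) and writes each key out as a full URL. The urls.txt output and command-line usage mirror the steps above; everything else is illustrative.

# Minimal sketch, not the actual scrapeS3.py: page through an open
# bucket's XML listing and write every object URL to urls.txt.
import sys
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

def list_keys(bucket_url):
    token = None
    while True:
        query = {"list-type": "2"}
        if token:
            query["continuation-token"] = token
        page = bucket_url + "?" + urllib.parse.urlencode(query)
        root = ET.fromstring(urllib.request.urlopen(page).read())
        for contents in root.findall("s3:Contents", NS):
            yield contents.find("s3:Key", NS).text
        # Keep requesting pages until the bucket says the listing is complete.
        if root.findtext("s3:IsTruncated", default="false", namespaces=NS) != "true":
            break
        token = root.findtext("s3:NextContinuationToken", namespaces=NS)

if __name__ == "__main__":
    bucket_url = sys.argv[1]  # assumed to end with "/", e.g. https://example.s3.amazonaws.com/
    with open("urls.txt", "w") as out:
        for key in list_keys(bucket_url):
            out.write(bucket_url + urllib.parse.quote(key) + "\n")

Because the requests are anonymous, this only works on buckets that allow public listing.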
Using the AWS CLI
If the S3 bucket URL contains s3.amazonaws.com, then Amazon's official AWS command-line tool can be used to grab a list of files in the bucket.
First, install the tool by following these instructions. Or, if you have Python installed, you can run pip install awscli.
Next, determine the bucket name. There are two possible formats that the URL can follow; a short snippet that applies these rules is sketched after the list:
- If the bucket URL begins with s3.amazonaws.com, then the subsequent part of the URL is the bucket name. The bucket name of https://s3.amazonaws.com/example/ would be example.
- Otherwise, the bucket name is the part of the URL before s3.amazonaws.com. The bucket name of https://example.s3.amazonaws.com/ would be example.
- Note that sometimes you'll find an S3 URL referencing a specific region, for example https://example.s3.us-east-1.amazonaws.com/. In this case, you can simply remove the region string from the URL to get https://example.s3.amazonaws.com/. Then you can treat the URL according to the previous two rules.
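For illustration, here is a small sketch of how those rules could be applied in Python; the function name and the example URLs are placeholders, not part of any official tooling:

from urllib.parse import urlparse

def bucket_name_from_url(url):
    # Apply the rules above: path-style URLs keep the bucket name in the
    # path, virtual-hosted-style URLs keep it in the hostname.
    parsed = urlparse(url)
    if parsed.netloc == "s3.amazonaws.com":
        # e.g. https://s3.amazonaws.com/example/  ->  example
        return parsed.path.strip("/").split("/")[0]
    # e.g. https://example.s3.amazonaws.com/ or
    #      https://example.s3.us-east-1.amazonaws.com/  ->  example
    return parsed.netloc.split(".s3.")[0]

print(bucket_name_from_url("https://s3.amazonaws.com/example/"))            # example
print(bucket_name_from_url("https://example.s3.us-east-1.amazonaws.com/"))  # example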
Next, run the following command, replacing bucketname with the bucket name you found:
aws s3api list-objects --no-sign-request --bucket bucketname --output text --query "Contents[].{Key: Key}" > files.txt
You will now have a list of files. To transform it into a list of URLs, add the bucket URL to the beginning of each line of the file. You can do this in a text editor like Sublime Text or Notepad++ using the Find and Replace feature. Open the file, then press Ctrl-H to open the Find and Replace dialog. Activate the "Regular Expression" option, then type the following into the input boxes:
- Find: (.*)
- Replace: BucketURL\1

Type the actual bucket URL instead of "BucketURL", for example https://example.s3.amazonaws.com/\1.
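If you would rather not use a text editor, a small Python snippet along these lines could do the same prepending; the bucket URL below is a placeholder:

# Prepend the bucket URL to every key in files.txt and write urls.txt.
bucket_url = "https://example.s3.amazonaws.com/"  # replace with your bucket URL

with open("files.txt") as keys, open("urls.txt", "w") as out:
    for line in keys:
        key = line.strip()
        if key:
            # Keys containing spaces or other special characters may also
            # need URL-encoding before they form valid URLs.
            out.write(bucket_url + key + "\n")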
Downloading the Files
Now that you have a list of URLs, you can download them with a tool like Wget or cURLsDownloader (for example, Wget can read the list with wget -i urls.txt).