Scraping S3 Buckets
An open S3 bucket is a special type of open directory. It cannot be scraped with Wget because the directory is not returned as a list of hyperlinks. Instead, it is returned as XML data which must be parsed into a list of files. This article explains two different methods to do so.
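
Concretely, the bucket listing is a ListBucketResult XML document with one <Key> element per file, returned in pages of up to 1,000 entries. The sketch below shows one way such a parser can work, assuming the bucket allows anonymous listing; it is not the scrapeS3.py script referenced in the next section, and the bucket URL is a placeholder.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 bucket listings
NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

def list_bucket(bucket_url):
    """Yield every key in an open bucket, following pagination markers."""
    marker = ""
    while True:
        url = bucket_url + "?" + urllib.parse.urlencode({"marker": marker})
        with urllib.request.urlopen(url) as response:
            root = ET.fromstring(response.read())
        keys = [el.text for el in root.findall("s3:Contents/s3:Key", NS)]
        yield from keys
        # IsTruncated stays "true" while more pages of results remain
        if not keys or root.findtext("s3:IsTruncated", "false", NS) != "true":
            break
        marker = keys[-1]  # resume the listing after the last key seen

if __name__ == "__main__":
    bucket = "https://example.s3.amazonaws.com/"  # placeholder bucket URL
    with open("urls.txt", "w", encoding="utf-8") as f:
        for key in list_bucket(bucket):
            f.write(bucket + urllib.parse.quote(key) + "\n")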

Using a Python Script

We have created a Python script to automatically grab a list of URLs from an open S3 bucket. Follow these steps to use the script:

  1. Download and install Python if you don't have it yet.
  2. Get the script from here. Press Ctrl-S to save it to your computer.
  3. Open the folder where you saved the Python script. Click the path bar at the top of the window and press Ctrl-C to copy the path.
  4. Open the Command Prompt, then type cd followed by a space. Then paste in the path and press Enter.
  5. Type python scrapeS3.py BucketURL to run the script. Replace "BucketURL" with the actual bucket URL, such as https://example.s3.amazonaws.com/.
  6. A file called urls.txt will appear containing all of the URLs extracted from the bucket.

Notes on using the Python script:

  • If you get an error while the script is running, you can restart it from a certain point. Type python scrapeS3.py BucketURL StartKey, with "StartKey" being the directory or file name you want to continue from. An example would be python scrapeS3.py https://example.s3.amazonaws.com/ file.png.
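
Resuming works because the S3 listing API accepts a marker query parameter that restarts the listing after a given key, so the example above corresponds to a request like:

https://example.s3.amazonaws.com/?marker=file.png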

Using the AWS CLI

If the S3 bucket URL contains s3.amazonaws.com, then Amazon's official AWS command-line tool can be used to grab a list of files in the bucket.

First, install the tool by following these instructions. Alternatively, if you have Python installed, you can run pip install awscli.
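
You can confirm that the installation worked by running:

aws --version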

Next, determine the bucket name. There are two possible formats that the URL can follow:

  • If the bucket URL begins with s3.amazonaws.com, then the subsequent part of the URL is the bucket name. The bucket name of https://s3.amazonaws.com/example/ would be example.
  • Otherwise, the bucket name is the part of the URL before s3.amazonaws.com. The bucket name of https://example.s3.amazonaws.com/ would be example.
  • Note that sometimes, you'll find an S3 URL referencing a specific region, for example https://example.s3.us-east-1.amazonaws.com/. In this case, you can simply remove the region string from the URL to get https://example.s3.amazonaws.com/. Then you can treat the URL according to the previous two rules, as the sketch after this list shows.
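
These rules are mechanical enough to express in code. Here is a small sketch in Python (the helper name is hypothetical, chosen for illustration):

import re
from urllib.parse import urlparse

def bucket_name(url):
    """Apply the three rules above to extract the bucket name."""
    parts = urlparse(url)
    # Rule 3: strip a region string such as ".s3.us-east-1." down to ".s3."
    host = re.sub(r"\.s3\.[a-z0-9-]+\.amazonaws\.com$", ".s3.amazonaws.com",
                  parts.hostname)
    if host == "s3.amazonaws.com":
        # Rule 1: the bucket name is the first path segment
        return parts.path.strip("/").split("/")[0]
    # Rule 2: the bucket name is everything before ".s3.amazonaws.com"
    return host[:-len(".s3.amazonaws.com")]

print(bucket_name("https://s3.amazonaws.com/example/"))            # example
print(bucket_name("https://example.s3.amazonaws.com/"))            # example
print(bucket_name("https://example.s3.us-east-1.amazonaws.com/"))  # example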

Next, run the following command, replacing bucketname with the bucket name you found:

aws s3api list-objects --no-sign-request --bucket bucketname --output text --query "Contents[].{Key: Key}" > files.txt

You will now have a list of files. To transform it into a list of URLs, add the bucket URL to the beginning of each line of the file. You can do this in a text editor like Sublime Text or Notepad++ using the Find and Replace feature. Open the file, then press Ctrl-H to open the Find and Replace dialog. Activate the "Regular Expression" option, then type the following into the input boxes:

  • Find: (.*)
  • Replace: BucketURL\1

Type the actual bucket URL instead of "BucketURL", for example https://example.s3.amazonaws.com/\1.
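
If you prefer a script over a text editor, the same transformation takes a few lines of Python (the file names match those used above; the bucket URL is a placeholder):

bucket_url = "https://example.s3.amazonaws.com/"  # replace with the real bucket URL

# Prefix every key in files.txt with the bucket URL, producing urls.txt
with open("files.txt", encoding="utf-8") as keys, \
        open("urls.txt", "w", encoding="utf-8") as urls:
    for line in keys:
        key = line.strip()
        if key:  # skip any blank lines
            urls.write(bucket_url + key + "\n")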

Downloading the Files

Now that you have a list of URLs, you can download them with a tool like Wget or cURLsDownloader.
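
For example, Wget can read the list of URLs directly from the file:

wget -i urls.txt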