Using S3 Tools

Once you have an Object Storage API key and at least one bucket, you can use standard S3-compatible tools to manage objects. All S3 operations are performed directly from your Crusoe Cloud VMs.

S3 Endpoint

Use the following endpoint format for your location:

https://object.<location>.crusoecloudcompute.com

For example:

https://object.us-east1-a.crusoecloudcompute.com
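
If you assemble the endpoint programmatically (for example, when the location is configurable in your pipeline), a minimal Python helper might look like this. The helper name `endpoint_for` is ours, and `us-east1-a` is only an example slug:

```python
def endpoint_for(location: str) -> str:
    """Build the Crusoe Object Storage S3 endpoint URL for a location slug."""
    return f"https://object.{location}.crusoecloudcompute.com"

print(endpoint_for("us-east1-a"))
# → https://object.us-east1-a.crusoecloudcompute.com
```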

Configuring s3cmd

s3cmd is a command-line tool for interacting with S3-compatible storage.

Installation

# Ubuntu/Debian
sudo apt-get install s3cmd

Configuration

Create or edit ~/.s3cfg with the following:

[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
host_base = object.<location>.crusoecloudcompute.com
host_bucket = object.<location>.crusoecloudcompute.com
use_https = True
signature_v2 = False
Info: Set host_bucket to the same value as host_base (without a %(bucket)s prefix), because Crusoe Object Storage uses path-style URLs only.

Alternatively, run s3cmd --configure and manually set the endpoint values.

Common Operations

# List all buckets
s3cmd ls

# List objects in a bucket
s3cmd ls s3://my-training-data

# Upload a file
s3cmd put model-checkpoint.tar s3://my-training-data/checkpoints/

# Upload a directory recursively
s3cmd put --recursive ./dataset/ s3://my-training-data/datasets/

# Download a file
s3cmd get s3://my-training-data/checkpoints/model-checkpoint.tar ./

# Download a directory recursively
s3cmd get --recursive s3://my-training-data/datasets/ ./local-datasets/

# Delete an object
s3cmd del s3://my-training-data/checkpoints/old-checkpoint.tar

# Get object info (metadata)
s3cmd info s3://my-training-data/checkpoints/model-checkpoint.tar

# Multipart upload (automatic for files > 15 MB)
# Adjust chunk size if needed:
s3cmd put --multipart-chunk-size-mb=64 large-dataset.tar s3://my-training-data/

# List active multipart uploads
s3cmd multipart s3://my-training-data

Configuring rclone

rclone is a versatile tool for managing files on cloud storage, and is particularly useful for syncing data and migrating objects from other cloud providers into Crusoe.

Installation

# Ubuntu/Debian
sudo apt-get install rclone

Configuration

Run rclone config and create a new remote, or manually add the following to ~/.config/rclone/rclone.conf:

[crusoe]
type = s3
provider = Other
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
endpoint = https://object.<location>.crusoecloudcompute.com
acl = private
force_path_style = true
Info: The force_path_style = true setting is required because Crusoe Object Storage does not support virtual-hosted-style URLs.

Common Operations

# List all buckets
rclone lsd crusoe:

# List objects in a bucket
rclone ls crusoe:my-training-data

# Upload a file
rclone copy ./model-checkpoint.tar crusoe:my-training-data/checkpoints/

# Upload a directory
rclone copy ./dataset/ crusoe:my-training-data/datasets/

# Download a file
rclone copy crusoe:my-training-data/checkpoints/model-checkpoint.tar ./

# Sync a local directory to a bucket (mirror)
rclone sync ./dataset/ crusoe:my-training-data/datasets/

# Check data integrity
rclone check ./dataset/ crusoe:my-training-data/datasets/

# Get file info
rclone lsl crusoe:my-training-data/checkpoints/

Migrating Data from AWS S3

rclone can transfer data directly between cloud providers. To copy data from AWS S3 into Crusoe Object Storage:

  1. Configure an AWS S3 remote in rclone (named aws in this example).
  2. Run:
rclone copy aws:source-bucket/path/ crusoe:my-training-data/path/ \
--transfers 16 \
--checkers 8 \
--s3-upload-concurrency 4

Adjust --transfers and related flags based on your available bandwidth and the number of files.
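
If your migration runs inside an existing Python pipeline rather than through rclone, a per-object transfer can be sketched with two boto3 clients, streaming each object so it is never fully staged on local disk. This is an illustration, not a Crusoe-specific API; the function name and bucket arguments are placeholders:

```python
def copy_object_between_providers(src, dst, src_bucket, dst_bucket, key):
    """Stream a single object from one S3 provider to another.

    `src` and `dst` are boto3 S3 clients (e.g., one configured for AWS,
    one for the Crusoe endpoint). The response body is a file-like
    stream, so upload_fileobj forwards it without a local temp file.
    """
    body = src.get_object(Bucket=src_bucket, Key=key)["Body"]
    dst.upload_fileobj(body, dst_bucket, key)
```

For many objects, wrap this in a loop over a key listing and parallelize with a thread pool, mirroring rclone's --transfers flag.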


Configuring boto3 (Python)

boto3 is the AWS SDK for Python, widely used in ML pipelines and data processing scripts.

Installation

# Ubuntu/Debian
sudo apt install python3-boto3

Configuration

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://object.<location>.crusoecloudcompute.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)
Note: The region_name parameter is not required. If your S3 client requires one, you can set it to any placeholder value (e.g., us-east-1). The Crusoe S3 endpoint handles routing internally.
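
Because Crusoe Object Storage is path-style only (matching the s3cmd and rclone settings above), you can also pin boto3 to path-style addressing explicitly via botocore's Config object. This is a defensive setting rather than a documented boto3 requirement:

```python
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://object.<location>.crusoecloudcompute.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    # Force path-style requests (https://endpoint/bucket/key) instead of
    # virtual-hosted style (https://bucket.endpoint/key).
    config=Config(s3={"addressing_style": "path"}),
)
```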

Common Operations

# List buckets
response = s3.list_buckets()
for bucket in response["Buckets"]:
    print(bucket["Name"])

# List objects in a bucket
response = s3.list_objects_v2(Bucket="my-training-data")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Upload a file
s3.upload_file(
    "model-checkpoint.tar",
    "my-training-data",
    "checkpoints/model-checkpoint.tar",
)

# Upload with multipart (automatic for large files)
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)
s3.upload_file(
    "large-dataset.tar",
    "my-training-data",
    "datasets/large-dataset.tar",
    Config=config,
)

# Download a file
s3.download_file(
    "my-training-data",
    "checkpoints/model-checkpoint.tar",
    "./model-checkpoint.tar",
)

# Delete an object
s3.delete_object(
    Bucket="my-training-data",
    Key="checkpoints/old-checkpoint.tar",
)

# Get object metadata
response = s3.head_object(
    Bucket="my-training-data",
    Key="checkpoints/model-checkpoint.tar",
)
print(f"Size: {response['ContentLength']}, Last Modified: {response['LastModified']}")

# Copy an object within the same bucket
s3.copy_object(
    Bucket="my-training-data",
    Key="checkpoints/model-checkpoint-backup.tar",
    CopySource="my-training-data/checkpoints/model-checkpoint.tar",
)
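
Note that list_objects_v2 returns at most 1,000 keys per call. For larger buckets, a paginator-based helper such as the following avoids truncated listings (`iter_keys` is our own sketch, not part of boto3):

```python
def iter_keys(s3_client, bucket, prefix=""):
    """Yield every key under `prefix`, following pagination past 1,000 objects."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Usage with the client configured above:
# for key in iter_keys(s3, "my-training-data", prefix="checkpoints/"):
#     print(key)
```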

Using boto3 with a Session

For scripts that interact with multiple buckets or need credential management:

import boto3

session = boto3.Session(
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3 = session.resource(
    "s3",
    endpoint_url="https://object.<location>.crusoecloudcompute.com",
)

# Upload using the resource interface
bucket = s3.Bucket("my-training-data")
bucket.upload_file("local-file.bin", "remote-key/local-file.bin")

# Iterate all objects
for obj in bucket.objects.all():
    print(obj.key, obj.size)
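
One pattern the examples above do not cover is checking whether an object exists before downloading or overwriting it. head_object raises on a missing key, so a sketch of an existence check might look like this; the helper name is ours, and the error handling assumes the response shape of botocore's ClientError:

```python
def object_exists(s3_client, bucket, key):
    """Return True if the key exists, False on a 404; re-raise other errors."""
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as exc:  # botocore.exceptions.ClientError in practice
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey"):
            return False
        raise
```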

Benchmarking Object Storage

You can use standard S3 benchmarking tools to measure performance from within your Crusoe Cloud VMs. Below is an example using Warp, a purpose-built S3 benchmarking tool.

Install Warp

# For Linux x86_64
wget https://dl.min.io/aistor/warp/release/linux-amd64/archive/warp.v1.4.0

# For Linux arm64
wget https://dl.min.io/aistor/warp/release/linux-arm64/archive/warp.v1.4.0

chmod +x warp.v1.4.0
mv warp.v1.4.0 warp

Run a Write Benchmark

./warp put \
--host object.<location>.crusoecloudcompute.com \
--access-key YOUR_ACCESS_KEY \
--secret-key YOUR_SECRET_KEY \
--tls \
--bucket bench-bucket \
--obj.size 256MiB \
--concurrent 16 \
--duration 60s

Run a Read Benchmark

./warp get \
--host object.<location>.crusoecloudcompute.com \
--access-key YOUR_ACCESS_KEY \
--secret-key YOUR_SECRET_KEY \
--tls \
--bucket bench-bucket \
--obj.size 256MiB \
--concurrent 16 \
--duration 60s

Run a Mixed Workload Benchmark

./warp mixed \
--host object.<location>.crusoecloudcompute.com \
--access-key YOUR_ACCESS_KEY \
--secret-key YOUR_SECRET_KEY \
--tls \
--bucket bench-bucket \
--obj.size 64MiB \
--get-distrib 80 \
--put-distrib 15 \
--delete-distrib 5 \
--concurrent 32 \
--duration 120s
Note: Create a dedicated bucket for benchmarking to avoid impacting production data.