
Using S3 Tools

Once you have an Object Storage API key and at least one bucket, you can use standard S3-compatible tools to manage objects. All S3 operations are performed directly from your Crusoe Cloud VMs.

S3 Endpoint

Use the following endpoint format for your location:

https://object.<location>.crusoecloudcompute.com

For example:

https://object.us-east1-a.crusoecloudcompute.com
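If you build endpoint URLs in scripts, the pattern is easy to generate; a minimal Python sketch (the location value is an example):

```python
def s3_endpoint(location: str) -> str:
    """Build the Crusoe Object Storage endpoint URL for a location."""
    return f"https://object.{location}.crusoecloudcompute.com"

print(s3_endpoint("us-east1-a"))
# https://object.us-east1-a.crusoecloudcompute.com
```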

Configuring s3cmd

s3cmd is a command-line tool for interacting with S3-compatible storage.

Installation

# Ubuntu/Debian
sudo apt-get install s3cmd

Configuration

Create or edit ~/.s3cfg with the following:

[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
host_base = object.<location>.crusoecloudcompute.com
host_bucket = object.<location>.crusoecloudcompute.com
use_https = True
signature_v2 = False
Info: Set host_bucket to the same value as host_base (without a %(bucket)s prefix) because Crusoe Object Storage uses path-style URLs only.

Alternatively, run s3cmd --configure and manually set the endpoint values.
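If you provision many VMs, the same file can be generated with Python's standard configparser; a sketch (the keys, location, and output path are placeholders, so point it at ~/.s3cfg in practice):

```python
import configparser

def write_s3cfg(path: str, access_key: str, secret_key: str, location: str) -> None:
    """Generate an s3cmd configuration file for Crusoe Object Storage."""
    host = f"object.{location}.crusoecloudcompute.com"
    cfg = configparser.ConfigParser()
    cfg["default"] = {
        "access_key": access_key,
        "secret_key": secret_key,
        "host_base": host,
        "host_bucket": host,  # same as host_base: path-style URLs only
        "use_https": "True",
        "signature_v2": "False",
    }
    with open(path, "w") as f:
        cfg.write(f)

write_s3cfg("/tmp/s3cfg-example", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY", "us-east1-a")
```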

Common Operations

# List all buckets
s3cmd ls

# List objects in a bucket
s3cmd ls s3://my-training-data

# Upload a file
s3cmd put model-checkpoint.tar s3://my-training-data/checkpoints/

# Upload a directory recursively
s3cmd put --recursive ./dataset/ s3://my-training-data/datasets/

# Download a file
s3cmd get s3://my-training-data/checkpoints/model-checkpoint.tar ./

# Download a directory recursively
s3cmd get --recursive s3://my-training-data/datasets/ ./local-datasets/

# Delete an object
s3cmd del s3://my-training-data/checkpoints/old-checkpoint.tar

# Get object info (metadata)
s3cmd info s3://my-training-data/checkpoints/model-checkpoint.tar

# Multipart upload (automatic for files > 15 MB)
# Adjust chunk size if needed:
s3cmd put --multipart-chunk-size-mb=64 large-dataset.tar s3://my-training-data/

# List active multipart uploads
s3cmd multipart s3://my-training-data
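In long-running pipelines it can help to wrap these commands with retries, since individual requests can fail transiently; a sketch of one way to do that (the s3cmd invocation in the comment is illustrative):

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, delay=1.0):
    """Run a command, retrying with exponential backoff on non-zero exit."""
    for attempt in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        time.sleep(delay * (2 ** attempt))
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")

# Example (illustrative):
# run_with_retries(["s3cmd", "put", "model-checkpoint.tar",
#                   "s3://my-training-data/checkpoints/"])
```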

Configuring rclone

rclone is a versatile tool for managing files on cloud storage, and is particularly useful for syncing data and migrating objects from other cloud providers into Crusoe.

Installation

# Ubuntu/Debian
sudo apt-get install rclone

Configuration

Run rclone config and create a new remote, or manually add the following to ~/.config/rclone/rclone.conf:

[crusoe]
type = s3
provider = Other
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
endpoint = https://object.<location>.crusoecloudcompute.com
acl = private
force_path_style = true
Info: The force_path_style = true setting is required because Crusoe Object Storage does not support virtual-hosted-style URLs.
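To make the distinction concrete, here is how the two URL styles differ for the same object (the bucket, location, and key are examples):

```python
bucket = "my-training-data"
host = "object.us-east1-a.crusoecloudcompute.com"
key = "checkpoints/model-checkpoint.tar"

path_style = f"https://{host}/{bucket}/{key}"        # supported
virtual_hosted = f"https://{bucket}.{host}/{key}"    # not supported by Crusoe

print(path_style)
```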

Common Operations

# List all buckets
rclone lsd crusoe:

# List objects in a bucket
rclone ls crusoe:my-training-data

# Upload a file
rclone copy ./model-checkpoint.tar crusoe:my-training-data/checkpoints/

# Upload a directory
rclone copy ./dataset/ crusoe:my-training-data/datasets/

# Download a file
rclone copy crusoe:my-training-data/checkpoints/model-checkpoint.tar ./

# Sync a local directory to a bucket (mirror)
rclone sync ./dataset/ crusoe:my-training-data/datasets/

# Check data integrity
rclone check ./dataset/ crusoe:my-training-data/datasets/

# Get file info
rclone lsl crusoe:my-training-data/checkpoints/

Migrating Data from AWS S3

rclone can transfer data directly between cloud providers. To copy data from AWS S3 into Crusoe Object Storage:

  1. Configure an AWS S3 remote in rclone (named aws in this example).
  2. Run:
rclone copy aws:source-bucket/path/ crusoe:my-training-data/path/ \
--transfers 16 \
--checkers 8 \
--s3-upload-concurrency 4

Adjust --transfers and related flags based on your available bandwidth and the number of files.


Configuring boto3 (Python)

boto3 is the AWS SDK for Python, widely used in ML pipelines and data processing scripts.

Installation

# Ubuntu/Debian
sudo apt install python3-boto3

# Or inside a virtual environment
pip install boto3

Configuration

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://object.<location>.crusoecloudcompute.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)
Note: The region_name parameter is not required. If your S3 client requires one, you can set it to any placeholder value (e.g., us-east-1). The Crusoe S3 endpoint handles routing internally.

Common Operations

# List buckets
response = s3.list_buckets()
for bucket in response["Buckets"]:
    print(bucket["Name"])

# List objects in a bucket
response = s3.list_objects_v2(Bucket="my-training-data")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Upload a file
s3.upload_file(
    "model-checkpoint.tar",
    "my-training-data",
    "checkpoints/model-checkpoint.tar",
)

# Upload with multipart (automatic for large files)
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)
s3.upload_file(
    "large-dataset.tar",
    "my-training-data",
    "datasets/large-dataset.tar",
    Config=config,
)

# Download a file
s3.download_file(
    "my-training-data",
    "checkpoints/model-checkpoint.tar",
    "./model-checkpoint.tar",
)

# Delete an object
s3.delete_object(
    Bucket="my-training-data",
    Key="checkpoints/old-checkpoint.tar",
)

# Get object metadata
response = s3.head_object(
    Bucket="my-training-data",
    Key="checkpoints/model-checkpoint.tar",
)
print(f"Size: {response['ContentLength']}, Last Modified: {response['LastModified']}")

# Copy an object within the same bucket
s3.copy_object(
    Bucket="my-training-data",
    Key="checkpoints/model-checkpoint-backup.tar",
    CopySource="my-training-data/checkpoints/model-checkpoint.tar",
)
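The TransferConfig above implies a predictable part count: a multipart upload splits the file into ceil(size / chunk) parts, and standard S3 allows at most 10,000 parts per upload, so the chunk size bounds the maximum object size. A quick sketch of the arithmetic (the file size is a hypothetical example):

```python
import math

MB = 1024 * 1024
chunk = 64 * MB                # multipart_chunksize
file_size = 10 * 1024 * MB     # a hypothetical 10 GiB archive

parts = math.ceil(file_size / chunk)
print(parts)  # 160

# Standard S3 part-count limit caps the largest object per chunk size
max_object_size = 10_000 * chunk
```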

Using boto3 with a Session

For scripts that interact with multiple buckets or need credential management:

import boto3

session = boto3.Session(
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3 = session.resource(
    "s3",
    endpoint_url="https://object.<location>.crusoecloudcompute.com",
)

# Upload using the resource interface
bucket = s3.Bucket("my-training-data")
bucket.upload_file("local-file.bin", "remote-key/local-file.bin")

# Iterate all objects
for obj in bucket.objects.all():
    print(obj.key, obj.size)
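When scripts pass object locations around, it is convenient to split s3:// URIs into (bucket, key) pairs with the standard library; a small sketch (the URI is an example):

```python
from urllib.parse import urlparse

def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

print(parse_s3_uri("s3://my-training-data/checkpoints/model-checkpoint.tar"))
# ('my-training-data', 'checkpoints/model-checkpoint.tar')
```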

Benchmarking Object Storage

You can use standard S3 benchmarking tools to measure performance from within your Crusoe Cloud VMs. Below is an example using elbencho, a distributed storage benchmark for file systems, object stores, and block devices.

Configure S3 Credentials

Before running benchmarks, configure your S3 credentials using one of the methods below. Elbencho reads credentials from the standard AWS credential chain, so there is no need to pass keys directly on the command line.

Option A: AWS Credentials File

Using an AWS credentials file avoids exposing credentials in your shell history.

# Create the AWS config directory if it does not exist
mkdir -p ~/.aws

Add your Crusoe Object Storage credentials to ~/.aws/credentials:

[crusoe]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

Then export the profile and bucket name:

export AWS_PROFILE=crusoe
export AWS_ENDPOINT_URL_S3="https://object.<location>.crusoecloudcompute.com"
export BUCKET_NAME="YOUR_BUCKET_NAME"

Option B: Environment Variables

If you need to override credentials (for example, in CI pipelines), export them directly:

export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
export AWS_ENDPOINT_URL_S3="https://object.<location>.crusoecloudcompute.com"
export BUCKET_NAME="YOUR_BUCKET_NAME"
Info: Replace <location> with your Crusoe Cloud location (e.g., us-east1-a). Create a dedicated bucket for benchmarking to avoid impacting production data.

Install elbencho

Download the latest static binary from the elbencho GitHub releases page. The static executable includes S3 support and has no external dependencies.

# Detect architecture, download, and extract
ARCH=$(uname -m | sed 's/arm64/aarch64/')
wget https://github.com/breuner/elbencho/releases/latest/download/elbencho-static-${ARCH}.tar.gz
tar -xf elbencho-static-${ARCH}.tar.gz

sudo mv elbencho /usr/local/bin/elbencho

Run a Write Benchmark

This writes 10 objects of 1 GiB each across 1 directory using 128 threads:

elbencho \
--s3endpoints $AWS_ENDPOINT_URL_S3 \
--s3objprefix ${HOSTNAME}/ \
--write \
--size 1G \
--block 16M \
--dirs 1 \
--files 10 \
--threads 128 \
$BUCKET_NAME
Flag                                  Description
--s3endpoints $AWS_ENDPOINT_URL_S3    Crusoe S3 endpoint URL
--s3objprefix ${HOSTNAME}/            Per-host prefix; enables multi-VM runs without conflicts
--write                               Write mode
--size 1G                             Object size (1 GiB per object)
--block 16M                           Block size per request (16 MiB)
--dirs 1                              Number of directories
--files 10                            Number of objects per directory
--threads 128                         Number of concurrent threads
--lat                                 Report latency statistics (optional)
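The --size and --block values determine how many requests each object takes; a quick check of the arithmetic for the command above:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

size = 1 * GIB     # --size 1G
block = 16 * MIB   # --block 16M

requests_per_object = size // block
print(requests_per_object)  # 64
```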

Run a Read Benchmark

After writing objects, run a read benchmark against the same data:

elbencho \
--s3endpoints $AWS_ENDPOINT_URL_S3 \
--s3objprefix ${HOSTNAME}/ \
--read \
--size 1G \
--block 16M \
--dirs 1 \
--files 10 \
--threads 128 \
$BUCKET_NAME

Use the same --size, --files, --dirs, and --s3objprefix values as the write benchmark to ensure the read test targets the same objects.

Run a Mixed Workload Benchmark

To run a combined write and read benchmark:

elbencho \
--s3endpoints $AWS_ENDPOINT_URL_S3 \
--s3objprefix ${HOSTNAME}/ \
--write --read \
--size 64M \
--block 1M \
--dirs 1 \
--files 256 \
--threads 16 \
$BUCKET_NAME

This runs both write and read phases sequentially against the same set of objects, giving you throughput numbers for both operations in a single run.

Save Results to a File

Use the --resfile and --csvfile flags to persist benchmark results:

elbencho \
--s3endpoints $AWS_ENDPOINT_URL_S3 \
--s3objprefix ${HOSTNAME}/ \
--write \
--size 1G \
--block 16M \
--dirs 1 \
--files 10 \
--threads 128 \
--lat --nolive \
--resfile results.txt \
--csvfile results.csv \
$BUCKET_NAME

Clean Up Benchmark Objects

After benchmarking, remove the test objects:

elbencho \
--s3endpoints $AWS_ENDPOINT_URL_S3 \
--s3objprefix ${HOSTNAME}/ \
--dirs 1 \
--files 10 \
--threads 128 \
--delfiles \
--s3multidel 1 \
$BUCKET_NAME

Verify that the bucket is empty:

# List any remaining objects
aws s3 ls s3://$BUCKET_NAME --recursive

# Remove any leftovers under this host's prefix
aws s3 rm s3://$BUCKET_NAME/${HOSTNAME}/ --recursive

To remove the bucket, use the Crusoe Cloud Console. If you enabled versioning, you will need to delete all versions of the objects before deleting the bucket.