Using S3 Tools
Once you have an Object Storage API key and at least one bucket, you can manage objects with standard S3-compatible tools. You can run all of the tools below directly from your Crusoe Cloud VMs.
S3 Endpoint
Use the following endpoint format for your location:
https://object.<location>.crusoecloudcompute.com
For example:
https://object.us-east1-a.crusoecloudcompute.com
Configuring s3cmd
s3cmd is a command-line tool for interacting with S3-compatible storage.
Installation
# Ubuntu/Debian
sudo apt-get install s3cmd
Configuration
Create or edit ~/.s3cfg with the following:
[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
host_base = object.<location>.crusoecloudcompute.com
host_bucket = object.<location>.crusoecloudcompute.com
use_https = True
signature_v2 = False
Set host_bucket to the same value as host_base (without a %(bucket)s prefix) because Crusoe Object Storage uses path-style URLs only.
Alternatively, run s3cmd --configure and manually set the endpoint values.
Common Operations
# List all buckets
s3cmd ls
# List objects in a bucket
s3cmd ls s3://my-training-data
# Upload a file
s3cmd put model-checkpoint.tar s3://my-training-data/checkpoints/
# Upload a directory recursively
s3cmd put --recursive ./dataset/ s3://my-training-data/datasets/
# Download a file
s3cmd get s3://my-training-data/checkpoints/model-checkpoint.tar ./
# Download a directory recursively
s3cmd get --recursive s3://my-training-data/datasets/ ./local-datasets/
# Delete an object
s3cmd del s3://my-training-data/checkpoints/old-checkpoint.tar
# Get object info (metadata)
s3cmd info s3://my-training-data/checkpoints/model-checkpoint.tar
# Multipart upload (automatic for files > 15 MB)
# Adjust chunk size if needed:
s3cmd put --multipart-chunk-size-mb=64 large-dataset.tar s3://my-training-data/
# List active multipart uploads
s3cmd multipart s3://my-training-data
Configuring rclone
rclone is a versatile tool for managing files on cloud storage, and is particularly useful for syncing data and migrating objects from other cloud providers into Crusoe.
Installation
# Ubuntu/Debian
sudo apt-get install rclone
# Or install the latest release with the official script
curl https://rclone.org/install.sh | sudo bash
Configuration
Run rclone config and create a new remote, or manually add the following to ~/.config/rclone/rclone.conf:
[crusoe]
type = s3
provider = Other
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
endpoint = https://object.<location>.crusoecloudcompute.com
acl = private
force_path_style = true
The force_path_style = true setting is required because Crusoe Object Storage does not support virtual-hosted-style URLs.
Common Operations
# List all buckets
rclone lsd crusoe:
# List objects in a bucket
rclone ls crusoe:my-training-data
# Upload a file
rclone copy ./model-checkpoint.tar crusoe:my-training-data/checkpoints/
# Upload a directory
rclone copy ./dataset/ crusoe:my-training-data/datasets/
# Download a file
rclone copy crusoe:my-training-data/checkpoints/model-checkpoint.tar ./
# Sync a local directory to a bucket (mirror)
rclone sync ./dataset/ crusoe:my-training-data/datasets/
# Check data integrity
rclone check ./dataset/ crusoe:my-training-data/datasets/
# Get file info
rclone lsl crusoe:my-training-data/checkpoints/
Migrating Data from AWS S3
rclone can transfer data directly between cloud providers. To copy data from AWS S3 into Crusoe Object Storage:
- Configure an AWS S3 remote in rclone (named aws in this example).
- Run:
rclone copy aws:source-bucket/path/ crusoe:my-training-data/path/ \
--transfers 16 \
--checkers 8 \
--s3-upload-concurrency 4
Adjust --transfers and related flags based on your available bandwidth and the number of files.
Configuring boto3 (Python)
boto3 is the AWS SDK for Python, widely used in ML pipelines and data processing scripts.
Installation
# Ubuntu/Debian
sudo apt-get install python3-boto3
# Or install with pip
pip install boto3
Configuration
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://object.<location>.crusoecloudcompute.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)
The region_name parameter is not required. If your S3 client requires one, you can set it to any placeholder value (e.g., us-east-1). The Crusoe S3 endpoint handles routing internally.
Common Operations
# List buckets
response = s3.list_buckets()
for bucket in response["Buckets"]:
    print(bucket["Name"])

# List objects in a bucket
response = s3.list_objects_v2(Bucket="my-training-data")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Upload a file
s3.upload_file(
    "model-checkpoint.tar",
    "my-training-data",
    "checkpoints/model-checkpoint.tar",
)

# Upload with multipart (automatic for large files)
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)
s3.upload_file(
    "large-dataset.tar",
    "my-training-data",
    "datasets/large-dataset.tar",
    Config=config,
)

# Download a file
s3.download_file(
    "my-training-data",
    "checkpoints/model-checkpoint.tar",
    "./model-checkpoint.tar",
)

# Delete an object
s3.delete_object(
    Bucket="my-training-data",
    Key="checkpoints/old-checkpoint.tar",
)

# Get object metadata
response = s3.head_object(
    Bucket="my-training-data",
    Key="checkpoints/model-checkpoint.tar",
)
print(f"Size: {response['ContentLength']}, Last Modified: {response['LastModified']}")

# Copy an object within the same bucket
s3.copy_object(
    Bucket="my-training-data",
    Key="checkpoints/model-checkpoint-backup.tar",
    CopySource="my-training-data/checkpoints/model-checkpoint.tar",
)
Using boto3 with a Session
For scripts that interact with multiple buckets or need credential management:
import boto3

session = boto3.Session(
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)
s3 = session.resource(
    "s3",
    endpoint_url="https://object.<location>.crusoecloudcompute.com",
)

# Upload using the resource interface
bucket = s3.Bucket("my-training-data")
bucket.upload_file("local-file.bin", "remote-key/local-file.bin")

# Iterate all objects
for obj in bucket.objects.all():
    print(obj.key, obj.size)
Benchmarking Object Storage
You can use standard S3 benchmarking tools to measure performance from within your Crusoe Cloud VMs. Below is an example using elbencho, a distributed storage benchmark for file systems, object stores, and block devices.
Configure S3 Credentials
Before running benchmarks, configure your S3 credentials using one of the methods below. Elbencho reads credentials from the standard AWS credential chain, so there is no need to pass keys directly on the command line.
Option A: AWS Config File (Recommended)
Using an AWS config file avoids exposing credentials in shell history.
# Create the AWS config directory if it does not exist
mkdir -p ~/.aws
Add your Crusoe Object Storage credentials to ~/.aws/credentials:
[crusoe]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
Then export the profile and bucket name:
export AWS_PROFILE=crusoe
export AWS_ENDPOINT_URL_S3="https://object.<location>.crusoecloudcompute.com"
export BUCKET_NAME="YOUR_BUCKET_NAME"
Option B: Environment Variables
If you need to override credentials (for example, in CI pipelines), export them directly:
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
export AWS_ENDPOINT_URL_S3="https://object.<location>.crusoecloudcompute.com"
export BUCKET_NAME="YOUR_BUCKET_NAME"
Replace <location> with your Crusoe Cloud location (e.g., us-east1-a). Create a dedicated bucket for benchmarking to avoid impacting production data.
Install elbencho
Download the latest static binary from the releases page. The static executable includes S3 support and has no external dependencies.
# Detect architecture, download, and extract
ARCH=$(uname -m | sed 's/arm64/aarch64/')
wget https://github.com/breuner/elbencho/releases/latest/download/elbencho-static-${ARCH}.tar.gz
tar -xf elbencho-static-${ARCH}.tar.gz
sudo mv elbencho /usr/local/bin/elbencho
Run a Write Benchmark
This writes 10 objects of 1 GiB each across 1 directory using 128 threads:
elbencho \
--s3endpoints $AWS_ENDPOINT_URL_S3 \
--s3objprefix ${HOSTNAME}/ \
--write \
--size 1G \
--block 16M \
--dirs 1 \
--files 10 \
--threads 128 \
--lat \
$BUCKET_NAME
| Flag | Description |
|---|---|
| --s3endpoints $AWS_ENDPOINT_URL_S3 | Crusoe S3 endpoint URL |
| --s3objprefix ${HOSTNAME}/ | Per-host prefix; enables multi-VM runs without conflicts |
| --write | Write mode |
| --size 1G | Object size (1 GiB per object) |
| --block 16M | Block size per request (16 MiB) |
| --dirs 1 | Number of directories |
| --files 10 | Number of objects per directory |
| --threads 128 | Number of concurrent threads |
| --lat | Report latency statistics |
Run a Read Benchmark
After writing objects, run a read benchmark against the same data:
elbencho \
--s3endpoints $AWS_ENDPOINT_URL_S3 \
--s3objprefix ${HOSTNAME}/ \
--read \
--size 1G \
--block 16M \
--dirs 1 \
--files 10 \
--threads 128 \
$BUCKET_NAME
Use the same --size, --files, --dirs, and --s3objprefix values as the write benchmark to ensure the read test targets the same objects.
Run a Mixed Workload Benchmark
To run a combined write and read benchmark:
elbencho \
--s3endpoints $AWS_ENDPOINT_URL_S3 \
--s3objprefix ${HOSTNAME}/ \
--write --read \
--size 64M \
--block 1M \
--dirs 1 \
--files 256 \
--threads 16 \
$BUCKET_NAME
This runs both write and read phases sequentially against the same set of objects, giving you throughput numbers for both operations in a single run.
Save Results to a File
Use the --resfile and --csvfile flags to persist benchmark results:
elbencho \
--s3endpoints $AWS_ENDPOINT_URL_S3 \
--s3objprefix ${HOSTNAME}/ \
--write \
--size 1G \
--block 16M \
--dirs 1 \
--files 10 \
--threads 128 \
--lat --nolive \
--resfile results.txt \
--csvfile results.csv \
$BUCKET_NAME
Clean Up Benchmark Objects
After benchmarking, remove the test objects:
elbencho \
--s3endpoints $AWS_ENDPOINT_URL_S3 \
--s3objprefix ${HOSTNAME}/ \
--dirs 1 \
--files 10 \
--threads 128 \
--delfiles \
--s3multidel 1 \
$BUCKET_NAME
Remove any leftover objects under this host's prefix, then verify that the bucket is empty (the aws CLI picks up the AWS_ENDPOINT_URL_S3 variable exported earlier):
aws s3 rm s3://$BUCKET_NAME/${HOSTNAME}/ --recursive
aws s3 ls s3://$BUCKET_NAME --recursive
To remove the bucket, use the Crusoe Cloud Console as described here. If you enabled versioning, you must delete all versions of the objects before you can delete the bucket.