A Python library for collecting and archiving posts from the Bluesky social network using the Jetstream API. This tool connects to Bluesky's firehose and saves posts in an organized file structure.
- Connects to Bluesky's Jetstream websocket API
- Three archiving modes:
  - Posts only (default)
  - All records (posts, likes, follows, etc.)
  - Non-posts only (everything except posts)
- Archives data in JSONL format, organized by date and hour
- Optional real-time post text streaming to stdout
- Automatic reconnection on connection loss
- Efficient batch processing and disk operations
- Debug mode for detailed logging
- Optional handle resolution (disabled by default)
- Playback support from specific timestamps
To install:
- Clone the repository:
  ```bash
  git clone https://github.com/yourusername/bluesky-firehose-archiver.git
  ```
- Navigate to the project directory:
  ```bash
  cd bluesky-firehose-archiver
  ```
- Create and activate a virtual environment:
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
The archiver supports three distinct modes of operation:
- Posts Only (default):
  ```bash
  python src/main.py
  ```
  - Archives only posts (`app.bsky.feed.post` records)
  - Saves to the `data/` directory
  - Files named `posts_YYYYMMDD_HH.jsonl`
- All Records:
  ```bash
  python src/main.py --archive-all
  ```
  - Archives all record types (posts, likes, follows, etc.)
  - Saves to the `data_everything/` directory
  - Files named `records_YYYYMMDD_HH.jsonl`
  - Preserves complete record structure
- Non-Posts Only:
  ```bash
  python src/main.py --archive-non-posts
  ```
  - Archives everything except posts
  - Saves to the `data_non_posts/` directory
  - Files named `records_YYYYMMDD_HH.jsonl`
  - Useful for collecting only interactions and profile updates
Note: The `--archive-all` and `--archive-non-posts` modes cannot be used simultaneously.
```
python src/main.py [options]

Options:
  --username       Bluesky username (optional)
  --password       Bluesky password (optional)
  --debug          Enable debug output
  --stream         Stream post text to stdout in real-time
  --measure-rate   Track and display posts per minute rate
  --get-handles    Resolve handles while archiving (not recommended)
  --cursor         Unix microseconds timestamp to start playback from
```
You can use the archiver in your Python code:
```python
from archiver import BlueskyArchiver
import asyncio

async def main():
    # Initialize with desired options
    archiver = BlueskyArchiver(
        debug=True,          # Enable debug logging
        stream=True,         # Stream posts to stdout
        measure_rate=True,   # Show collection rate
        archive_all=False,   # Default: posts only
        get_handles=False    # Don't resolve handles
    )
    try:
        # Start archiving
        await archiver.archive_posts()
    finally:
        # Ensure clean shutdown
        archiver.stop()

if __name__ == "__main__":
    asyncio.run(main())
```
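The constructor keyword arguments mirror the command-line flags described above; for example, `archive_all=True` corresponds to `--archive-all`.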
Records are saved in JSONL (JSON Lines) format, organized by date and hour in different directories based on the archiving mode:
```
data/                 # Posts only mode (default)
└── YYYY-MM/
    └── DD/
        └── posts_YYYYMMDD_HH.jsonl

data_everything/      # Archive all mode
└── YYYY-MM/
    └── DD/
        └── records_YYYYMMDD_HH.jsonl

data_non_posts/       # Non-posts mode
└── YYYY-MM/
    └── DD/
        └── records_YYYYMMDD_HH.jsonl
```
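As a minimal sketch (assuming the default posts-only layout shown above), you can walk the archive and count how many posts each file holds:

```python
from pathlib import Path

# Walk data/YYYY-MM/DD/ and count archived posts per file.
# One JSONL line corresponds to one archived post.
archive_root = Path("data")
for path in sorted(archive_root.glob("*/*/posts_*.jsonl")):
    with path.open() as f:
        count = sum(1 for _ in f)
    print(f"{path}: {count} posts")
```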
- Posts Only Mode (default):
  ```json
  {
    "handle": "user.bsky.social",
    "record": {
      "text": "Post content",
      "createdAt": "2024-03-15T01:23:45.678Z",
      ...
    },
    "rkey": "unique-record-key",
    "did": "did:plc:abcd...",
    "time_us": 1234567890
  }
  ```
- Archive All & Non-Posts Modes:
{
"did": "did:plc:abcd...",
"time_us": 1234567890,
"kind": "commit",
"commit": {
"rev": "...",
"operation": "create",
"collection": "app.bsky.feed.like", // or other collection types
"rkey": "...",
"record": { ... }
}
}
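For example, here is a rough sketch (the file path is only an example) that tallies archived records by collection type in the all-records or non-posts modes:

```python
import json
from collections import Counter

# Tally records by collection type in one all-records archive file.
# Replace the path below with an actual file from data_everything/.
counts = Counter()
with open("data_everything/2024-03/15/records_20240315_01.jsonl") as f:
    for line in f:
        record = json.loads(line)
        collection = record.get("commit", {}).get("collection", "unknown")
        counts[collection] += 1

for collection, n in counts.most_common():
    print(f"{collection}: {n}")
```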
You can start archiving from a specific point in time using the cursor functionality:
```bash
python src/main.py --cursor 1725911162329308
```
The cursor should be a Unix timestamp in microseconds. Playback will start from the specified time and continue to real-time. You can find timestamps in the saved records' `time_us` field.
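As a small sketch, a cursor value for a given UTC datetime can be computed like this (the chosen date is arbitrary):

```python
from datetime import datetime, timezone

# Convert a UTC datetime into a Unix timestamp in microseconds,
# suitable for passing to --cursor.
start = datetime(2024, 9, 9, 12, 0, 0, tzinfo=timezone.utc)
cursor = int(start.timestamp() * 1_000_000)
print(cursor)
```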
Project structure:
```
├── src/
│   ├── main.py          # Entry point and CLI interface
│   └── archiver.py      # Core archiving logic
├── data/                # Archived posts storage
├── requirements.txt     # Project dependencies
└── README.md            # This file
```
MIT License