A tool to scrape, extract, and save conversation data from ChatGPT-like web pages for later analysis or offline access. This project helps automate the process of extracting conversation content, including HTML and associated JSON data, from ChatGPT-powered applications.
- Scrapes ChatGPT conversation data: Extracts the conversation content and associated metadata.
- Saves data locally: Stores the conversation content in HTML format and JSON metadata.
- Easy-to-use: Just provide the URL, and the tool will fetch and save the data.
- Rust: The project is written in Rust. Ensure you have it installed on your machine.
- Dependencies:
  - reqwest for making HTTP requests.
  - select for parsing HTML and selecting elements.
  - serde and serde_json for JSON parsing.
  - std::fs for file handling.
To get started, clone this repository to your local machine:
git clone https://github.com/yourusername/ScrapChat.git
cd ScrapChat
Next, build the project:
cargo build --release
To use the scraper, simply run the following command:
cargo run -- <URL>
Where <URL> is the web page containing the ChatGPT conversation you want to scrape.
For example:
cargo run -- https://chatgpt.com/share/676d8989-e548-8004-8e13-4a16c689d4b6
cargo run -- https://example.com/conversation
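As a rough illustration of how a command-line URL like the ones above could be read, here is a minimal sketch that uses only the standard library. The helper name `parse_url_arg` and its error messages are assumptions made for this example; they are not taken from the project's `main.rs`, which may handle arguments differently.

```rust
/// Hypothetical helper (not from the project): returns the first argument
/// after the program name if it looks like an http(s) URL.
fn parse_url_arg<I>(mut args: I) -> Result<String, String>
where
    I: Iterator<Item = String>,
{
    // The first item is the program name, so skip it.
    let _program = args.next();

    // The next item, if present, is the <URL> passed after `cargo run --`.
    let url = args
        .next()
        .ok_or_else(|| String::from("usage: cargo run -- <URL>"))?;

    // Minimal sanity check only; this does not fully validate the URL.
    if url.starts_with("http://") || url.starts_with("https://") {
        Ok(url)
    } else {
        Err(format!("not an http(s) URL: {}", url))
    }
}
```

From a `main` function, this could be called as `parse_url_arg(std::env::args())`, printing the error and exiting when it returns `Err`. Taking an iterator instead of reading `std::env::args()` directly keeps the helper easy to test.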
Running the command scrapes the conversation, extracts the data, and saves it in a local directory named after the conversation's title.
The data will be saved as:
- linear_conversation.json: The extracted JSON data that includes the conversation.
NOTE: This is a work in progress -- more data will be extracted soon.
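The save step described above (a folder named after the conversation's title, containing linear_conversation.json) might look roughly like the following sketch. It uses only `std::fs`. The function names `folder_name_from_title` and `save_conversation`, the character replacement rule, and the `untitled_conversation` fallback are all assumptions made for illustration; the project's actual naming rules may differ.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: turn a conversation title into a folder name by
/// replacing characters that are awkward in paths with underscores.
fn folder_name_from_title(title: &str) -> String {
    let cleaned: String = title
        .trim()
        .chars()
        .map(|c| {
            if c.is_alphanumeric() || c == '-' || c == '_' || c == ' ' {
                c
            } else {
                '_'
            }
        })
        .collect();

    // Assumed fallback for empty or whitespace-only titles.
    if cleaned.is_empty() {
        String::from("untitled_conversation")
    } else {
        cleaned
    }
}

/// Hypothetical helper: create `<base_dir>/<title folder>/` and write the
/// already-serialized JSON into linear_conversation.json inside it.
fn save_conversation(base_dir: &Path, title: &str, json: &str) -> io::Result<PathBuf> {
    let dir = base_dir.join(folder_name_from_title(title));
    fs::create_dir_all(&dir)?;

    let file = dir.join("linear_conversation.json");
    fs::write(&file, json)?;
    Ok(file)
}
```

Sanitizing the title first matters because page titles can contain characters such as `/` or `:` that are invalid or misleading in directory names on some platforms.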
chatgpt-conversation-scraper/
│
├── src/
│   ├── main.rs        # Main application logic
│   └── ...            # Other utility files and contents
│
├── Cargo.toml         # Cargo package manifest
└── README.md          # This file
Contributions are welcome! Please feel free to fork this repository and submit pull requests for any improvements, bug fixes, or new features.
To contribute:
- Fork this repository.
- Create a new branch.
- Make your changes.
- Submit a pull request.
New feature requests are highly appreciated.
This project is licensed under the MIT License. See the LICENSE file for details.
- This project uses the following libraries:
  - reqwest: HTTP client for fetching web pages.
  - select: HTML parsing and selection.
  - serde and serde_json: JSON parsing and serialization.