This project translates Reddit API responses into a PL/pgSQL script which loads the data into a Lemmy database.
In other words, it takes Reddit posts and comments and imports them into Lemmy.
Here's an example of a backup of the now-banned r/GenZhou up and running on a Lemmy test instance:
[Screenshots: community view and post view]
To get the JSON API response for a single post, you can call the proper Reddit API (requires an API key), or just append .json to the comments URL, like this:
HTML: https://www.reddit.com/r/GenZedong/comments/laucjl/china_usa/
https://www.reddit.com/r/GenZedong/comments/laucjl
JSON: https://www.reddit.com/r/GenZedong/comments/laucjl/china_usa/.json?limit=10000
https://www.reddit.com/r/GenZedong/comments/laucjl.json?limit=10000
Note that we've also added the limit parameter, because otherwise Reddit will aggressively prune the comment tree and replace the pruned branches with "Load more comments" links.
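The URL rewriting described above can be sketched in a few lines of Python. This is only an illustration and not part of RedditLemmyImporter; the function name to_json_url is hypothetical, and the default limit of 10000 is simply the value used in the examples above.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def to_json_url(comments_url: str, limit: int = 10000) -> str:
    """Append .json to a Reddit comments URL and set the limit parameter."""
    parts = urlsplit(comments_url)
    path = parts.path
    # Append .json to the path as-is, so a trailing slash is preserved
    # (".../china_usa/" becomes ".../china_usa/.json").
    if not path.endswith(".json"):
        path += ".json"
    # Any existing query string or fragment is replaced by the limit parameter.
    query = urlencode({"limit": limit})
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


print(to_json_url("https://www.reddit.com/r/GenZedong/comments/laucjl/china_usa/"))
print(to_json_url("https://www.reddit.com/r/GenZedong/comments/laucjl"))
```

The two calls print the same JSON URLs listed above, one for the long form of the comments URL and one for the short form.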
The response object contains the data for that one post and any replies. You can feed this directly into RedditLemmyImporter. However, if you want to import multiple posts, you can put multiple responses in the same input file, with each one separated by a newline. For example:
~ $ cat urls
https://www.reddit.com/r/GenZedong/comments/tpyft9/why_is_like_half_this_sub_made_of_trans_women/
https://www.reddit.com/r/GenZedong/comments/pet8zc/therapist_trans_stalin_isnt_real_she_cant_hurt/
https://www.reddit.com/r/GenZedong/comments/ttcyok/happy_trans_visibility_day_comrades/
https://www.reddit.com/r/GenZedong/comments/t9kbdm/women_of_genzedong_i_congratulate_you_for_your_day/
~ $ xargs -I URL curl --silent --user-agent "Subreddit archiver" --cookie "REDACTED" "URL.json?limit=10000" < urls > dump.json
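If you assemble the input file yourself rather than with curl, the one-response-per-line layout can be sketched in Python as follows. The helper names write_dump and read_dump are hypothetical, the sample data is a placeholder rather than a real Reddit response, and the check in read_dump is just a sanity test you could run before importing. It does not describe how RedditLemmyImporter itself parses the file.

```python
import io
import json
from typing import Any, Iterable, Iterator, TextIO


def write_dump(responses: Iterable[Any], out: TextIO) -> int:
    """Write each response as a single line of JSON; return how many were written."""
    count = 0
    for response in responses:
        # json.dumps without indent never emits a raw newline: newlines inside
        # strings are escaped as \n, so each response stays on one line.
        out.write(json.dumps(response))
        out.write("\n")
        count += 1
    return count


def read_dump(src: TextIO) -> Iterator[Any]:
    """Yield the parsed response from every non-blank line, failing loudly on bad JSON."""
    for line_number, line in enumerate(src, start=1):
        if not line.strip():
            continue  # this sketch skips blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err


# Placeholder objects standing in for two API responses.
placeholder_responses = [{"post": "first\nbody"}, {"post": "second"}]
buffer = io.StringIO()
written = write_dump(placeholder_responses, buffer)
buffer.seek(0)
print(written, "responses written;", len(list(read_dump(buffer))), "read back")
```

To check a real file, open it with open("dump.json", encoding="utf-8") and pass the file object to read_dump.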
If you need a complete scraping solution, check out this Python script. It pulls posts into a local MongoDB database, which means you can run it as a cron job to keep a local clone of posts as they're made. To export the collection to a dump file, try something like this:
mongoexport --uri="mongodb://localhost:27017/subredditArchiveDB" --collection=GenZedong --out=dump-wrapped.json
/r/GenZhou was scraped by @DongFangHong@lemmygrad.ml using this method. Data is available up to about a week before the subreddit was banned:
https://mega.nz/file/knBwmTJL#PpqO0I3Jv-xw-o7RBWSi0JSScjSV7-4Eb3JR5HzTc5w
Note that the script buries the data we need within a top-level property named json. RedditLemmyImporter can handle this directly using the --json-pointer option. For example:
java -jar redditLemmyImporter-0.3.jar -c genzhouarchive -u archive_bot -o import.sql --json-pointer=/json GenZhouArchive.json
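To see what a pointer such as /json selects, here is a minimal sketch of RFC 6901 pointer resolution in Python. It is illustrative only and is not the importer's own code. The function name resolve_pointer is hypothetical, and the shape of the wrapped line (an _id field next to json) is an assumption about what the export might look like.

```python
import json
from typing import Any


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON Pointer against an already-parsed JSON document."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape per RFC 6901: "~1" becomes "/" first, then "~0" becomes "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]  # array elements are addressed by index
        else:
            current = current[token]  # object members are addressed by name
    return current


# Assumed shape of one wrapped input line; the values are placeholders.
line = '{"_id": "placeholder", "json": [{"kind": "Listing"}, {"kind": "Listing"}]}'
wrapped = json.loads(line)
print(resolve_pointer(wrapped, "/json"))
```

With --json-pointer=/json, each input line is expected to carry the Reddit API response at that location rather than at the top level.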
Prerequisites: Java 8 or above
Download the jar file from the releases page and run it:
java -jar redditLemmyImporter-0.3.jar -c genzhouarchive -u archive_bot -o import.sql dump.json
In this case we're generating a PL/pgSQL script that will load the data from dump.json into the comm genzhouarchive under the user archive_bot. The script will be written to import.sql. Full command usage:
Usage: redditLemmyImporter [OPTIONS] dump
      dump                     Path to the JSON dump file from the Reddit API. Required.
                               Specify - to read from stdin.
  -c, --comm=name              Target community name. Required.
  -u, --user=name              Target user name. Required.
      --json-pointer=pointer   Locate the Reddit API response somewhere within the
                               top-level object in each input line.
                               See RFC 6901 for the JSON Pointer specification.
  -o, --output-file=file       Output file. Prints to stdout if this option isn't specified.
  -h, --help                   Show this help message and exit.
  -V, --version                Print version information and exit.
Prerequisites: JDK >=1.8, Maven 3.
Clone the repo and cd to the source tree. Run:
mvn compile
mvn exec:java -Dexec.args="-c genzhouarchive -u archive_bot -o import.sql path/to/dump.json"
(This will pull down dependencies from Maven Central so you must be connected to the internet during the compile step.)
You could also package a release and then follow the instructions from the previous section:
mvn clean package
java -jar target/redditLemmyImporter-0.3-SNAPSHOT.jar -c genzhouarchive -u archive_bot -o import.sql dump.json
Copy import.sql to the server running Postgres and run this:
psql --dbname=lemmy --username=lemmy --file=import.sql
Note that this uses the default values for the database name and database username. If you've changed them in your Lemmy configuration then update the values accordingly.
The target comm and target user must already exist in your Lemmy instance or the SQL script will do nothing.
Copy import.sql to the server running Docker and run this:
<import.sql docker exec -i $(docker ps -qf name=postgres) psql --dbname=lemmy --username=lemmy -
The target comm and target user must already exist in your Lemmy instance or the SQL script will do nothing.