This project enables you to fetch liked tweets from Twitter (using Selenium), save them to JSON and Excel files, and perform initial data analysis and image captioning.
This is part of the initial steps for a larger personal project involving Large Language Models (LLMs). Stay tuned for more updates!
A new free image embedding tool with a supporting frontend that does not require GPU support has been added. (Supports multiple languages, although the results are better in English.)
For example, here are the results for a search on "black cat" (in Chinese), but you can also search for "a group of people in a photo," "workflow graphs," or more abstract concepts like "sadness."
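At its core, this kind of semantic image search compares the embedding of a text query against the stored image embeddings and returns the closest match. A minimal sketch of the idea (the vectors, file names, and function are illustrative, not the app's actual code):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy index of (image_name, embedding) pairs; real embeddings come from the model
index = [
    ("black_cat.jpg", [0.9, 0.1, 0.0]),
    ("workflow_graph.png", [0.1, 0.9, 0.2]),
]
query = [0.8, 0.2, 0.1]  # embedding of a text query such as "black cat"

# The best match is the image whose embedding is most similar to the query
best = max(index, key=lambda item: cosine_similarity(query, item[1]))
print(best[0])  # black_cat.jpg
```

In practice the embeddings come from a multilingual text–image model, which is why queries in other languages work, just less reliably than English.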
How to run:
- First, make sure that the data has already been downloaded; image downloading requires previous Twitter data (including image URLs).
- Run the newly added `download_images` in the notebook.
- In the console, run `streamlit run image_search_webapp.py` and follow the prompts to automatically embed images. There is no need to embed repeatedly.
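The "no need to embed repeatedly" behavior implies embeddings are cached between runs. A hypothetical sketch of how already-embedded images can be skipped (the cache file name and structure are assumptions, not the app's actual code):

```python
import json
import os

CACHE_PATH = "embeddings_cache.json"  # hypothetical cache file name

def load_cache(path=CACHE_PATH):
    """Load previously computed embeddings, keyed by image filename."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def images_to_embed(image_files, cache):
    """Return only the images that do not yet have a cached embedding."""
    return [name for name in image_files if name not in cache]

cache = {"old.jpg": [0.1, 0.2]}
print(images_to_embed(["old.jpg", "new.jpg"], cache))  # ['new.jpg']
```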
Before running the code, ensure you have the following:
- Required Python libraries (listed in `requirements.txt`)
- Your Twitter auth token (not an API key). Quick text instructions:
  - Go to your already logged-in Twitter account.
  - Press F12 (open dev tools) -> Application -> Cookies -> twitter.com -> `auth_token`
  - Or follow the video demo in the FAQ section.
- OpenAI API key (optional, only needed if you want to try the image captioning feature)

Setup:
- Clone the repository or download the project files.
- Install the required Python libraries by running the following command:
  `pip install -r requirements.txt`
- Open the `config.py` file and replace the placeholders with your actual keys:
  - Set `TWITTER_AUTH_TOKEN` to your Twitter auth token.
  - Set `OPENAI_API_KEY` to your OpenAI API key.
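After editing, `config.py` might look like this (placeholder values shown; only the two variable names come from the project):

```python
# config.py — replace the placeholders with your actual values
TWITTER_AUTH_TOKEN = "paste_your_auth_token_here"  # from the auth_token cookie
OPENAI_API_KEY = "paste_your_openai_key_here"      # optional, for image captions
```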
To fetch data from Twitter and save it to JSON and Excel files, follow these steps:
- Open the `twitter_data_ingestion.py` file.
- Modify the `fetch_tweets` function call at the bottom of the script with your desired parameters:
  - Set the URL of the Twitter page you want to fetch data from (e.g., `https://twitter.com/ilyasut/likes`).
  - Specify the start and end dates for the data range (in YYYY-MM-DD format).
- Run the script by executing the following command (running it directly in an IDE is recommended):
  `python twitter_data_ingestion.py`
- The script will fetch the data from Twitter, save it to a JSON file, and then export it to an Excel file.
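Since the dates must be in YYYY-MM-DD format, it can help to sanity-check them before running the script. A small standalone helper (not part of the project) using only the standard library:

```python
from datetime import date

def validate_range(start, end):
    """Parse YYYY-MM-DD strings and ensure start is not after end."""
    s, e = date.fromisoformat(start), date.fromisoformat(end)
    if s > e:
        raise ValueError("start date must not be after end date")
    return s, e

# A malformed string or reversed range raises an error immediately,
# rather than failing mid-scrape.
print(validate_range("2024-01-01", "2024-03-31"))
```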
To perform initial data analysis on the fetched data, follow these steps:
- Open the `twitter_data_initial_exploration.ipynb` notebook in Jupyter Notebook or JupyterLab.
- Run the notebook cells sequentially to load the data from the JSON file and perform various data analysis tasks.
Some sample results:
- The notebook also demonstrates how to use the OpenAI API to generate image captions for tweet images (with tweet metadata).
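The notebook's load-and-analyze flow can be sketched with the standard library alone. The field names below are assumptions about the JSON shape, not the project's actual schema:

```python
import json
from collections import Counter

# Hypothetical records mimicking the exported JSON (field names are assumptions)
tweets = [
    {"author": "alice", "text": "hello world", "media": ["a.jpg"]},
    {"author": "bob", "text": "hi", "media": []},
    {"author": "alice", "text": "another tweet", "media": []},
]
with open("tweets.json", "w") as f:
    json.dump(tweets, f)

# Load the data back, as the notebook does, and run simple aggregations
with open("tweets.json") as f:
    data = json.load(f)

per_author = Counter(t["author"] for t in data)
with_images = sum(1 for t in data if t["media"])
print(per_author.most_common(), with_images)  # [('alice', 2), ('bob', 1)] 1
```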
The project includes sample output files for reference:
- `sample_output_json.json`: A sample JSON file containing the fetched Twitter data.
- `sample_exported_excel.xlsx`: A sample Excel file exported from the JSON data.
Feel free to explore and modify the code to suit your specific data analysis requirements.
FAQ:
- Will I get banned? Could this affect my account?
  - Selenium is one of the safest scraping methods out there, but it's still best to be cautious when using it for personal projects.
  - I've been using it for quite a while without any issues.
  - (Though, if you've got a spare/alt account, I'd recommend using that account's auth token instead.)
- How do I find the auth token?
  - Check out the video demo for a step-by-step guide!
Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
- Initial structure and parts of the Selenium code inspired by Twitter-Scrapper.
- The image captioning feature is powered by the OpenAI API. You should be able to achieve similar results using Gemini 1.0 or Claude Haiku.
For any questions or issues, please open an issue in the repository.