This project enables you to fetch liked tweets from Twitter (using Selenium), save them to JSON and Excel files, and perform initial data analysis and image captioning.
This is part of the initial steps for a larger personal project involving Large Language Models (LLMs). Stay tuned for more updates!
A new free image embedding tool with a supporting frontend that does not require GPU support has been added. (Supports multiple languages, although the results are better in English.)
For example, here are the results for a search on "black cat" (in Chinese), but you can also search for "a group of people in a photo," "workflow graphs," or more abstract concepts like "sadness."
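At its core, this kind of semantic image search compares the embedding of a text query against the stored image embeddings and returns the closest match. A minimal sketch of the idea (the vectors, file names, and function are illustrative, not the app's actual code):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy index of (image_name, embedding) pairs; real embeddings come from the model
index = [
    ("black_cat.jpg", [0.9, 0.1, 0.0]),
    ("workflow_graph.png", [0.1, 0.9, 0.2]),
]
query = [0.8, 0.2, 0.1]  # embedding of a text query such as "black cat"

# The best match is the image whose embedding is most similar to the query
best = max(index, key=lambda item: cosine_similarity(query, item[1]))
print(best[0])  # black_cat.jpg
```

In practice the embeddings come from a multilingual text–image model, which is why queries in other languages work, just less reliably than English.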
How to run:
- First, make sure that the data has already been downloaded; image downloading requires previous Twitter data (including image URLs).
- Run the newly added `download_images` in the notebook.
- In the console, run `streamlit run image_search_webapp.py` and follow the prompts to automatically embed images. There is no need to embed repeatedly.
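The "no need to embed repeatedly" behavior implies embeddings are cached between runs. A hypothetical sketch of how already-embedded images can be skipped (the cache file name and structure are assumptions, not the app's actual code):

```python
import json
import os

CACHE_PATH = "embeddings_cache.json"  # hypothetical cache file name

def load_cache(path=CACHE_PATH):
    """Load previously computed embeddings, keyed by image filename."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def images_to_embed(image_files, cache):
    """Return only the images that do not yet have a cached embedding."""
    return [name for name in image_files if name not in cache]

cache = {"old.jpg": [0.1, 0.2]}
print(images_to_embed(["old.jpg", "new.jpg"], cache))  # ['new.jpg']
```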
Before running the code, ensure you have the following:
- Required Python libraries (listed in `requirements.txt`)
- Your Twitter auth token (not an API key). Quick text instructions:
  - Go to your already logged-in Twitter account.
  - Press F12 (open dev tools) -> Application -> Cookies -> twitter.com -> `auth_token`
  - Or follow the video demo in the FAQ section.
- OpenAI API key (optional, only needed if you want to try the image captioning feature)

Setup:
- Clone the repository or download the project files.
- Install the required Python libraries by running the following command:
  `pip install -r requirements.txt`
- Open the `config.py` file and replace the placeholders with your actual keys:
  - Set `TWITTER_AUTH_TOKEN` to your Twitter auth token.
  - Set `OPENAI_API_KEY` to your OpenAI API key.
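After editing, `config.py` might look like this (placeholder values shown; only the two variable names come from the project):

```python
# config.py — replace the placeholders with your actual values
TWITTER_AUTH_TOKEN = "paste_your_auth_token_here"  # from the auth_token cookie
OPENAI_API_KEY = "paste_your_openai_key_here"      # optional, for image captions
```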
To fetch data from Twitter and save it to JSON and Excel files, follow these steps:
- Open the `twitter_data_ingestion.py` file.
- Modify the `fetch_tweets` function call at the bottom of the script with your desired parameters:
  - Set the URL of the Twitter page you want to fetch data from (e.g., `https://twitter.com/ilyasut/likes`).
  - Specify the start and end dates for the data range (in YYYY-MM-DD format).
- Run the script by executing the following command (running it directly in an IDE is recommended):
  `python twitter_data_ingestion.py`
- The script will fetch the data from Twitter, save it to a JSON file, and then export it to an Excel file.
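Since the dates must be in YYYY-MM-DD format, it can help to sanity-check them before running the script. A small standalone helper (not part of the project) using only the standard library:

```python
from datetime import date

def validate_range(start, end):
    """Parse YYYY-MM-DD strings and ensure start is not after end."""
    s, e = date.fromisoformat(start), date.fromisoformat(end)
    if s > e:
        raise ValueError("start date must not be after end date")
    return s, e

# A malformed string or reversed range raises an error immediately,
# rather than failing mid-scrape.
print(validate_range("2024-01-01", "2024-03-31"))
```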
To perform initial data analysis on the fetched data, follow these steps:
- Open the `twitter_data_initial_exploration.ipynb` notebook in Jupyter Notebook or JupyterLab.
- Run the notebook cells sequentially to load the data from the JSON file and perform various data analysis tasks.
Some sample results:
- The notebook also demonstrates how to use the OpenAI API to generate image captions for tweet images (with tweet metadata).
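The notebook's load-and-analyze flow can be sketched with the standard library alone. The field names below are assumptions about the JSON shape, not the project's actual schema:

```python
import json
from collections import Counter

# Hypothetical records mimicking the exported JSON (field names are assumptions)
tweets = [
    {"author": "alice", "text": "hello world", "media": ["a.jpg"]},
    {"author": "bob", "text": "hi", "media": []},
    {"author": "alice", "text": "another tweet", "media": []},
]
with open("tweets.json", "w") as f:
    json.dump(tweets, f)

# Load the data back, as the notebook does, and run simple aggregations
with open("tweets.json") as f:
    data = json.load(f)

per_author = Counter(t["author"] for t in data)
with_images = sum(1 for t in data if t["media"])
print(per_author.most_common(), with_images)  # [('alice', 2), ('bob', 1)] 1
```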
The project includes sample output files for reference:
- `sample_output_json.json`: A sample JSON file containing the fetched Twitter data.
- `sample_exported_excel.xlsx`: A sample Excel file exported from the JSON data.
Feel free to explore and modify the code to suit your specific data analysis requirements.
FAQ:
- Will I get banned? Could this affect my account?
  - Selenium is one of the safest scraping methods out there, but it's still best to be cautious when using it for personal projects.
  - I've been using it for quite a while without any issues.
  - (Though, if you've got a spare/alt account, I'd recommend using that account's auth token instead.)
- How do I find the auth token?
  - Check out the video demo for a step-by-step guide!
Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
- Initial structure and parts of the Selenium code inspired by Twitter-Scrapper.
- The image captioning feature is powered by the OpenAI API. You should be able to achieve similar results using Gemini 1.0 or Claude Haiku.
For any questions or issues, please open an issue in the repository.