News events remain of the utmost importance in the price discovery process of any asset. Some news events, such as press releases, have been found to potentially explain up to 24% of an asset's price movements. We seek to understand and classify what causes a news event to have a statistically significant effect on price. In particular, we aim to apply machine learning (ML) and Natural Language Processing (NLP) techniques to extract what traders intuitively know when they choose which news articles, publications, or news beats to follow and, consequently, base their trading decisions on.
The following represents a collaborative project for identifying, processing, predicting, and recommending relevant information from news events to sophisticated traders.
For this project we decided to split our overall scope into three steps.
1) Web Scraping and API utilization for pulling news articles
2) Metadata Extraction from the articles
3) Preprocessing and cleaning the articles
4) Collecting and cleaning Bitcoin price data
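As a rough illustration of the metadata extraction step, the sketch below pulls the title and `<meta>` tags out of an article's HTML using only the standard library. The class name `ArticleMetadataParser`, the helper `extract_metadata`, and the sample HTML are hypothetical and are not taken from the project notebooks, which may rely on different tools.

```python
from html.parser import HTMLParser


class ArticleMetadataParser(HTMLParser):
    """Collect the <title> text and <meta name/property=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Metadata keys appear under either "name" or "property".
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.metadata[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data.strip()


def extract_metadata(html):
    """Return a dict of metadata found in an HTML string."""
    parser = ArticleMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata


# Hypothetical article markup, used only to exercise the parser.
SAMPLE_HTML = """
<html><head>
<title>Bitcoin rallies after exchange announcement</title>
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2018-03-01T12:00:00Z">
</head><body><p>Body text.</p></body></html>
"""

meta = extract_metadata(SAMPLE_HTML)
print(meta)
```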
1) Filtering of relevant news content with relevant keywords
2) Preprocessing of content for modeling purposes (vectorization, text cleaning etc.)
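The filtering and vectorization steps above can be pictured with a minimal, standard-library sketch. The keyword list, stop-word list, function names, and sample headlines are all assumptions made for illustration; the notebooks may use a different keyword set and a library vectorizer instead of this hand-rolled bag-of-words.

```python
import re
from collections import Counter

KEYWORDS = {"bitcoin", "btc", "blockchain"}  # hypothetical keyword list
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is", "for"}


def tokenize(text):
    """Lowercase, keep alphanumeric tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def is_relevant(text, keywords=KEYWORDS):
    """An article is kept if any of its tokens is a keyword."""
    return any(tok in keywords for tok in tokenize(text))


def vectorize(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


# Hypothetical headlines, used only to exercise the functions.
articles = [
    "Bitcoin price surges as the exchange lists BTC futures",
    "Local bakery wins an award for sourdough",
    "Regulators discuss blockchain rules; Bitcoin dips",
]
relevant = [a for a in articles if is_relevant(a)]
vocab, vectors = vectorize(relevant)
print(len(relevant), "relevant articles;", len(vocab), "vocabulary terms")
```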
1) Perform Clustering Analysis on articles
2) Classify events based on sentiment analysis
3) Identify significant price movements through event study methodology
4) Establish a causal relationship between a news event and its effect on the price of Bitcoin
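One simple way to think about the event study step is a constant-mean-return model: a return counts as "significant" when it deviates from the mean of the preceding window by more than a chosen multiple of that window's standard deviation. The sketch below assumes this simplified model; the function names, the `window` and `threshold` values, and the price series are hypothetical and do not reproduce the project's actual methodology.

```python
from statistics import mean, stdev


def simple_returns(prices):
    """Period-over-period simple returns."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]


def significant_moves(prices, window=5, threshold=2.0):
    """Return indices into `prices` whose return is abnormal
    relative to the preceding `window` returns."""
    returns = simple_returns(prices)
    flagged = []
    for i in range(window, len(returns)):
        history = returns[i - window:i]      # estimation window
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue                         # no variation to compare against
        abnormal = returns[i] - mu           # abnormal return
        if abs(abnormal) > threshold * sigma:
            flagged.append(i + 1)            # returns[i] is the move into prices[i + 1]
    return flagged


# Hypothetical price series with one sharp jump at index 7.
prices = [100, 101, 100, 101, 100, 101, 100, 120, 121]
flagged = significant_moves(prices)
print(flagged)
```

Timestamps flagged this way could then be matched against article publication times, which is roughly the role the filtering-by-marker step plays in the run order listed later.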
<add text here once we are done modeling and have results on our efforts>
Further steps will involve the development of a learning recommender system built around both supervised and unsupervised learning methods. In particular, based on the foundation established by our team, we believe the following projects could be implemented to increase the strength and capacity of our model.
1. Achieve better granularity of data - create a series of crawlers that extract the desired granular data.
1.a Examples include: pricing data, Google Trends data, etc.
2. Create two tiers of data acquisition processes.
2.a Broad-spectrum crawlers should be scheduled to stay up to date with current events, while minimizing load.
2.b Event-driven crawlers should be developed around predictive modeling and/or event detection when a significant price movement occurs.
3. Implement a predictive pricing model for asset classes.
3.a Use pricing predictions to trigger searches for relevant articles, and evaluate pricing and recommender models.
3.b Predictive pricing models could also be used for arbitrage purposes, and optimization of business processes.
4. Implement a distributed computation architecture.
4.a In order to implement just-in-time analysis, future teams will need to move processes to the cloud.
5. Increase the quality of prediction by instituting a magnitude/directional component to our time series modeling.
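To make the magnitude/directional idea in item 5 concrete, a price move could be turned into a two-part label before modeling. The following is only a sketch: the function name `label_move` and the 1% and 5% cut-offs are hypothetical choices, not values used by the project.

```python
def label_move(price_before, price_after, small=0.01, large=0.05):
    """Label a price move with a (direction, magnitude) pair.

    `small` and `large` are hypothetical relative-change cut-offs.
    """
    change = (price_after - price_before) / price_before
    if abs(change) < small:
        return ("flat", "none")
    direction = "up" if change > 0 else "down"
    magnitude = "large" if abs(change) >= large else "small"
    return (direction, magnitude)


print(label_move(100, 103))  # a 3% rise
print(label_move(100, 90))   # a 10% drop
```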
Python was used for all data gathering, cleaning, and modeling purposes.
Multiple Python notebooks (Jupyter) were written to carry out the project. To reproduce our procedure, run the following notebooks in order:
-
Scraping (in order):
news_to_features(1o3)
url_to_contents(2o3)
many_df_to_one(3o3)
-
Time Series Modeling and Event Detection (in order):
price_change_marker
filter_news_by_marker
-
Natural Language Processing and Formulations (in order):
clean_articles_final(1o2)
clean_articles_final(2o2)
-
Classification and Modeling:
Word2Vec_and_KMeans_clustering
Final_Classification
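
For readers unfamiliar with the clustering step, the sketch below shows the basic k-means loop on toy 2-D vectors standing in for article embeddings. It is a plain-Python illustration under stated assumptions: the function name, the deterministic "first k points" initialization, and the sample points are hypothetical, and the notebook itself presumably relies on library implementations of Word2Vec and KMeans.

```python
def kmeans(points, k, iterations=10):
    """Minimal k-means: returns (assignments, centroids).

    Centroids are initialized to the first k points (an assumption
    made here for determinism, not a recommended strategy).
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [
                sum((x - y) ** 2 for x, y in zip(p, c)) for c in centroids
            ]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical 2-D "embeddings" forming two well-separated groups.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```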