Topic Modelling Financial News with Natural Language Processing

Natural Language Processing (NLP) is an area of data science that intersects computer science, artificial intelligence and linguistics and involves machine learning on unstructured text data. In essence, it is concerned with teaching machines to process, understand and even generate language.

There are many applications of natural language processing, including text classification, language identification and translation, sentiment analysis, text summarization, relationship extraction, information retrieval and many more, but for this piece of analysis I wanted to focus on topic modelling.

From Wikipedia:

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modelling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

The “topics” produced by topic modelling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.

Topic modelling is an unsupervised learning problem, where the end goal is to extract patterns or similarities in the data and create a number of document clusters that refer to the same underlying topic.

 

What’s moving the markets?

I’ve followed the financial markets on and off for around 10 years, and keeping on top of what’s driving global financial markets is an interesting – yet time-consuming – task. The markets react to a number of different factors, including geopolitics, economic news and corporate earnings, and the influence each has on asset prices at any given time is always changing.

Some topics can linger for long periods – sometimes months and even years – while others can come and go in a matter of weeks. It makes trying to keep on top of what’s happening a real challenge, particularly if it isn’t your day job and you don’t have the time to keep one eye on the headlines.

To that end, I wanted to see whether I could apply topic modelling to financial news articles and identify which topics have dominated the headlines over the last decade or so.

 

Data collection

I wasn’t interested in tabloid news, so I concentrated on getting articles from a small number of broadsheets and newswires, going back as far as I could. To ensure the articles covered the kinds of topics I was looking to extract, I only downloaded articles from the business, markets or economy sections of these websites.

In total, I managed to collect data for almost 85,000 different news articles.

 

Data pre-processing: tokenization

Text data is messy and ‘unstructured’ and requires quite a lot of cleaning before any machine learning or analysis can be applied to it. Again, from Wikipedia:

Unstructured data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

Tokenization is the task of chopping up a piece of text into pieces called tokens. From Stanford NLP:

A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

In simple terms, think of splitting a sentence into individual words, with each word being a token. Once you have tokenized your text, you can remove or transform these tokens in different ways.
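To make this concrete, here is a minimal tokenization sketch using spaCy (the library used later for stop words). The ‘en_core_web_sm’ model name is just an assumption for illustration; the post doesn’t say exactly which tokenizer was used for this step.

```python
# Minimal tokenization sketch with spaCy (model name is illustrative).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("European shares began the month with a gain.")
print([token.text for token in doc])
# ['European', 'shares', 'began', 'the', 'month', 'with', 'a', 'gain', '.']
```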

Stop words

Stop words are words that occur so frequently in text that they offer little value in explaining or distinguishing between different documents of text. Back to Wikipedia:

Stop words usually refer to the most common words in a language. There is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list.

Some example stop words are: the, as, be, about, that, from, will, all, it, by, when, do, an, only, has, for, this.

There are a number of stop word lists available to use ‘out of the box’, and I used the English stop word list from the Python library spaCy.

Here’s a comparison of what a passage of text looks like before and after stop word removal:

Original text:

European shares began the month with a gain, as BNP Paribas rose on relief it had settled a U.S. sanctions case and mining companies rallied after encouraging economic data came out of China, the world’s top metals consumer. The pan-European FTSEurofirst 300 index closed up 0.9 percent at 1,382.31 points – notching its biggest one-day percentage gain since May 8. BNP Paribas rose 3.6 percent in trading volume of almost twice its 90-day daily average. It had lost about 20 percent – or $21 billion of its market value – since Feb. 13 when it announced the provision for the fine. The French bank pleaded guilty to two criminal charges and agreed to pay almost $9 billion to resolve allegations that in many financial dealings it violated U.S. sanctions against Sudan, Cuba and Iran. Analysts and investors said the stock could now recover ground lost over the last few months.

After stop word removal:

European shares began month gain, BNP Paribas rose relief settled U.S. sanctions case mining companies rallied encouraging economic data came China, world’s top metals consumer. The pan-European FTSEurofirst 300 index closed 0.9 percent 1,382.31 points – notching biggest one-day percentage gain since May 8. BNP Paribas rose 3.6 percent trading volume almost twice 90-day daily average. It lost 20 percent – $21 billion market value – since Feb. 13 announced provision fine. The French bank pleaded guilty two criminal charges agreed pay almost $9 billion resolve allegations many financial dealings violated U.S. sanctions Sudan, Cuba Iran. Analysts investors said stock could recover ground lost last months.

A lot of the original text still remained, but the small, insignificant ‘filler’ words were removed. I felt that the stop word list that I used here was fairly conservative and there were additional words that could be removed, but more on that shortly.
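For illustration, here is a minimal sketch of this filtering step using spaCy’s built-in English stop word list; it is an approximation of the pipeline rather than the exact code used. Note that the membership test is case-sensitive, which is why capitalised words such as ‘The’ and ‘It’ survive in the output above.

```python
# Sketch of stop word removal using spaCy's English stop word list.
from spacy.lang.en.stop_words import STOP_WORDS

def remove_stop_words(text):
    # case-sensitive membership test, so capitalised 'The' and 'It' are kept
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(remove_stop_words("European shares began the month with a gain"))
# European shares began month gain
```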

Numbers and punctuation

I also chose to remove numbers and punctuation from the text. Financial news articles tend to have lots of prices and quotes in them, but these don’t really explain the topic being written about, so I decided to strip them out too.

Here’s the same passage of text after all the numbers and punctuation were removed:

European shares began month gain BNP Paribas rose relief settled US sanctions case mining companies rallied encouraging economic data came China world top metals consumer The pan European FTSEurofirst index closed percent points notching biggest one day percentage gain since May BNP Paribas rose percent trading volume almost twice day daily average It lost percent billion market value since Feb announced provision fine The French bank pleaded guilty two criminal charges agreed pay almost billion resolve allegations many financial dealings violated US sanctions Sudan Cuba Iran Analysts investors said stock could recover ground lost last months
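One simple way to do this with the Python standard library is sketched below; the author’s exact cleaning code may well have differed.

```python
# Strip digits and punctuation, then collapse the leftover whitespace.
import re
import string

def strip_numbers_and_punctuation(text):
    text = re.sub(r"\d+", " ", text)                                   # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    return " ".join(text.split())                                      # collapse whitespace

print(strip_numbers_and_punctuation("BNP Paribas rose 3.6 percent, at 1,382.31 points."))
# BNP Paribas rose percent at points
```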

Entity extraction

It’s also possible to extract entities (people, companies, places, etc.) from text and apply some transformations to them so that the words don’t get split up during the data processing stages. For example, an underscore could be added between ‘BNP’ and ‘Paribas’ to create ‘BNP_Paribas’.

I didn’t actually use this as part of my pre-processing pipeline; I tried it, but the entities it was underscoring seemed a bit hit and miss.
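For what it’s worth, a rough sketch of that idea using spaCy’s named entity recognizer is shown below; which entities get joined depends entirely on the model, which is presumably part of why the results felt hit and miss.

```python
# Sketch: join multi-word entities with underscores so they survive later processing.
import spacy

nlp = spacy.load("en_core_web_sm")   # model name is illustrative
doc = nlp("BNP Paribas rose after it settled a U.S. sanctions case.")

text = doc.text
for ent in reversed(doc.ents):                 # reversed so character offsets stay valid
    joined = ent.text.replace(" ", "_")        # e.g. 'BNP Paribas' -> 'BNP_Paribas'
    text = text[:ent.start_char] + joined + text[ent.end_char:]
print(text)
```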

 

Latent Dirichlet allocation (LDA)

Because I felt that the stop word removal step of my data pre-processing wasn’t comprehensive enough, I threw my data into an LDA model and generated a handful of topics. LDA takes a large corpus of text documents as input and maps each document from word space to a reduced topic space, giving the topic distribution of each document.

In doing this, I found that it was consistently picking words like ‘percent’, ‘index’, ‘said’ and ‘says’ that didn’t really offer any explanation about the articles, so I removed these words. I repeated this process a handful of times to find additional stop words that would be surplus to requirements.

After identifying these additional stop words, I added them to my original stop words list and re-ran the previous steps to prepare my data for the next stage in the process.
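As a rough illustration of this step, here is a hedged sketch using scikit-learn’s LatentDirichletAllocation (the post doesn’t say which LDA implementation was used). `cleaned_articles` is a placeholder for the list of pre-processed article strings, and the topic and feature counts are illustrative.

```python
# Fit a quick LDA model and print the dominant words per topic; words that dominate
# many topics without explaining them ('percent', 'said', ...) become stop word candidates.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

count_vec = CountVectorizer(max_features=10000)
counts = count_vec.fit_transform(cleaned_articles)   # placeholder: list of article strings

lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(counts)

terms = count_vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [terms[j] for j in topic.argsort()[::-1][:10]]
    print(f"Topic {i}: {', '.join(top_words)}")
```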

 

Stemming and lemmatizing

From the Stanford NLP website:

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

And from Wikipedia:

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

In other words, stemming and lemmatizing text is a normalisation process of reducing words to their word stem or root form. For example, the words “runner”, “runners” and “running” all essentially refer to the same thing, so it makes sense to group them together, and that’s the objective.

Comparison of stemmers

The nltk library in Python has a variety of stemming and lemmatizing options built-in and they generally achieve similar results. For example, below is the same passage of text after being stemmed or lemmatized:

Original (cleaned) text

European shares began month gain BNP Paribas rose relief settled US sanctions case mining companies rallied encouraging economic data came China world top metals consumer The pan European FTSEurofirst index closed percent points notching biggest one day percentage gain since May BNP Paribas rose percent trading volume almost twice day daily average It lost percent billion market value since Feb announced provision fine The French bank pleaded guilty two criminal charges agreed pay almost billion resolve allegations many financial dealings violated US sanctions Sudan Cuba Iran Analysts investors said stock could recover ground lost last months

Lancaster stemmed

europ shar beg mon gain bnp pariba ros reliev settl us sanct cas min company ral enco econom dat cam chin world top met consum the pan europ ftseurofirst index clos perc point notch biggest on day perc gain sint may bnp pariba ros perc trad volum almost twic day dai av it lost perc bil market valu sint feb annount provid fin the french bank plead guil two crimin charg agree pay almost bil resolv alleg many fin deal viol us sanct sud cub ir analyst invest said stock could recov ground lost last month

Porter stemmed

european share began month gain bnp pariba rose relief settl US sanction case mine compani ralli encourag econom data came china world top metal consum the pan european ftseurofirst index close percent point notch biggest one day percentag gain sinc may bnp pariba rose percent trade volum almost twice day daili averag It lost percent billion market valu sinc feb announc provis fine the french bank plead guilti two crimin charg agre pay almost billion resolv alleg mani financi deal violat US sanction sudan cuba iran analyst investor said stock could recov ground lost last month

Snowball stemmed

european share began month gain bnp pariba rose relief settl us sanction case mine compani ralli encourag econom data came china world top metal consum the pan european ftseurofirst index close percent point notch biggest one day percentag gain sinc may bnp pariba rose percent trade volum almost twice day daili averag it lost percent billion market valu sinc feb announc provis fine the french bank plead guilti two crimin charg agre pay almost billion resolv alleg mani financi deal violat us sanction sudan cuba iran analyst investor said stock could recov ground lost last month

Lemmatized

European share began month gain BNP Paribas rose relief settled US sanction case mining company rallied encouraging economic data came China world top metal consumer The pan European FTSEurofirst index closed percent point notching biggest one day percentage gain since May BNP Paribas rose percent trading volume almost twice day daily average It lost percent billion market value since Feb announced provision fine The French bank pleaded guilty two criminal charge agreed pay almost billion resolve allegation many financial dealing violated US sanction Sudan Cuba Iran Analysts investor said stock could recover ground lost last month

This was the final step in my pre-processing stage. Ultimately I chose to use Snowball stemmed text.
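The comparison above can be reproduced word by word with nltk; a small sketch follows (the WordNet corpus needs downloading once via nltk.download('wordnet')).

```python
# Compare nltk's stemmers and the WordNet lemmatizer on a word from the passage above.
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer, WordNetLemmatizer

word = "encouraging"
print(LancasterStemmer().stem(word))           # enco
print(PorterStemmer().stem(word))              # encourag
print(SnowballStemmer("english").stem(word))   # encourag
print(WordNetLemmatizer().lemmatize(word))     # encouraging
```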

 

Feature extraction: vectorizing the data

In order to feed predictive or clustering models with text data, you need to turn the text into vectors of numerical values suitable for statistical analysis. So with my data prepared and cleaned, the next step was to vectorize it into a matrix of token counts.

I tried out a simple count vectorizer, as well as binary counts (instead of counting how many times a token appears in your data set, it is set to 1 if it was there at least once, or 0 if not), but in the end I chose to use a TF-IDF vectorizer.

Vectorizing the data produces a document-term matrix: one row per document and one column for every token in the total vocabulary (all the different tokens found across all of the documents).

Term Frequency – Inverse Document Frequency (TF-IDF)

TF-IDF vectors are an extension to simple term-document matrices in that they weight token counts to reflect how important they are to a document.

The TF-IDF value increases proportionally to the number of times a token appears in a document, but is offset by the frequency of that token across all of the documents in your data set. The inverse document frequency therefore reduces the emphasis on common tokens (words/terms) that frequently appear across all of the documents.

The final TF-IDF specification I used had n-grams between 1 and 4, and filtered out tokens with a document frequency below 25 (a token had to appear in at least 25 of the ~85,000 documents) or above 98% (if a token appeared in more than 98% of the documents, it was removed).

What that means is that rather than just counting individual tokens, it also created counts for sequences of up to four tokens. For example, for the phrase European shares began month gain, it would create tokens (and count their occurrences across all documents) for:

  • European
  • shares
  • began
  • month
  • gain
  • European shares
  • shares began
  • began month
  • month gain
  • European shares began
  • shares began month
  • began month gain
  • European shares began month
  • shares began month gain

…but only as long as these sequences of tokens appeared in at least 25 different articles; this stopped the number of dimensions from exploding in size. The final outcome was a matrix with 157,000 dimensions.
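The specification described above maps fairly directly onto scikit-learn’s TfidfVectorizer; here is a sketch, with `snowball_articles` standing in for the cleaned, stemmed article texts.

```python
# TF-IDF with 1-4 grams, dropping tokens in fewer than 25 or more than 98% of documents.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 4), min_df=25, max_df=0.98)
tfidf = vectorizer.fit_transform(snowball_articles)   # placeholder: list of cleaned article strings
print(tfidf.shape)   # (documents, tokens) -- roughly 85,000 x 157,000
```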

 

Dimensionality reduction: Latent Semantic Analysis (LSA)

The last step before clustering was to reduce the dimensions via LSA.

Again, from Wikipedia:

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.

Starting with the TF-IDF matrix, a technique called singular value decomposition (SVD) was used to reduce the number of dimensions (in my instance, I chose to reduce down to 400 dimensions) while preserving the similarity structure among documents. The resulting matrix is a reduced term space and a document space.

From here, it’s possible to use these vectors to make conceptual comparisons or, in my case, for clustering.
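Continuing from the TF-IDF sketch above, the reduction to 400 dimensions could look something like this with scikit-learn’s TruncatedSVD (one common way of computing LSA; the exact implementation used isn’t stated).

```python
# Reduce the sparse TF-IDF matrix to 400 latent dimensions via truncated SVD.
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=400, random_state=0)
lsa_vectors = svd.fit_transform(tfidf)   # shape: (number of documents, 400)
```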

 

Clustering the latent vectors

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

The first clustering algorithm I chose to try was DBSCAN. With this particular algorithm, you don’t specify the number of clusters you want it to produce (but you are able to adjust other parameters that affect the way in which it produces clusters).

I tried clustering with DBSCAN a few times, and each time the vast majority of documents were labelled -1, meaning they were treated as ‘noise’ and didn’t fit into any cluster. Meanwhile, the clusters it did create were very specific, and there were 370 of them on the first attempt. For example, it found about 20 articles specifically talking about Bitcoin and grouped them into a cluster.

Because DBSCAN took a very long time to train (a couple of hours each time), and it was creating clusters that were far too specific for my liking, I decided to move on to K-Means clustering.
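For reference, here is a rough sketch of the DBSCAN attempt on the LSA vectors; the eps and min_samples values and the cosine metric are illustrative guesses rather than the settings actually used.

```python
# DBSCAN labels each document with a cluster id, or -1 if it is treated as noise.
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5, metric="cosine")
labels = dbscan.fit_predict(lsa_vectors)
print((labels == -1).sum(), "documents labelled as noise")
print(labels.max() + 1, "clusters found")
```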

K-Means

In comparison, K-Means is much faster to compute, and you have control over the number of clusters you want the algorithm to produce. I wasn’t sure what the right number would be, so I ran the algorithm over a variety of cluster sizes and plotted the inertia score at each size. Inertia is the sum of squared distances from each document to its nearest cluster centre, so low inertia means dense clusters, and plotting it can give you a feel for what number of clusters might work with your data.

[Chart: inertia score plotted against the number of clusters]

I decided to try 150 clusters.
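Here is a sketch of the inertia sweep and the final model, continuing from the earlier snippets; the range of cluster sizes tried is illustrative.

```python
# Fit K-Means at a range of cluster sizes, record inertia, then fit the final 150-cluster model.
from sklearn.cluster import KMeans

inertias = {}
for k in range(25, 301, 25):
    km = KMeans(n_clusters=k, random_state=0).fit(lsa_vectors)
    inertias[k] = km.inertia_
# plotting inertias against k produces the elbow-style chart described above

kmeans = KMeans(n_clusters=150, random_state=0).fit(lsa_vectors)
```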

 

Exploring and labelling the topic clusters

After clustering my documents into 150 different clusters, I wanted to explore each cluster and see how well it had grouped similar topics.

I wanted to look at the key topics in each of the clusters, so I took the cluster centroids from my K-Means model (which had the dimensions 150×400 – so, 150 clusters and 400 latent dimensions created via LSA) and applied the inverse transform method of my LSA model to expand the data back to the original 157,000 dimensions. I then examined the highest-weighted tokens in each cluster’s centroid.
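Continuing from the sketches above, that centroid inspection could look something like this:

```python
# Map the 150x400 centroids back to the ~157,000-token space and print the top tokens.
import numpy as np

centroids = svd.inverse_transform(kmeans.cluster_centers_)   # shape: (150, vocabulary size)
terms = vectorizer.get_feature_names_out()

for i, centroid in enumerate(centroids[:3]):                 # first few clusters only
    top = np.argsort(centroid)[::-1][:10]
    print(f"Cluster {i}: {', '.join(terms[j] for j in top)}")
```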

I was quite pleased with the clusters it had created. Here are a few examples:

Top cluster tokens

Cluster 0

  • spreadbett
  • financi spreadbett
  • seen
  • expect britain ftse
  • expect britain
  • spreadbett expect
  • financi spreadbett expect
  • britain ftse germani dax
  • britain ftse germani
  • spreadbett expect britain

One thing that was apparent, however, was that some clusters appeared to cover the same topic, especially when it came to words such as Nikkei, Topix, Dow Jones, Nasdaq, Hong Kong, Shanghai, mini futures, FTSEurofirst etc. It would seem that a very large proportion of the articles that I had downloaded were just market commentary – reports of what had happened that day in the major stock indices. I wasn’t interested in these. They were nearly all Reuters articles, so I probably should have been a bit more selective in the articles I downloaded.

Other clusters appeared to cover very similar, closely related topics. For example, clusters 21 and 34 both appeared to be covering the labour market. In instances such as this, I gave them the same topic label, so they would have been aggregated in my final analysis. However, looking at the spellings of ‘labor’ and ‘labour’, it’s possible that cluster 21 was predominantly concerned with the US labour market and cluster 34 with the UK labour market.

I also extracted a sample of 30 news article titles for each cluster because sometimes it wasn’t obvious what the cluster was about just by looking at the tokens.

 

Final result: the topics

After labelling each of the clusters, all that was left to do was analyse the results.

Looking at the published dates of the articles I had downloaded, the number of news articles pre-2007 was so small that I decided to concentrate on articles from 2008 onward, which gave me ten years’ worth of data to work with. Also, I wanted to look at trends in topics over time, so to smooth out the trend lines, I aggregated the counts into a weekly time series and then converted the counts to percentages of the total for that week.

Not all the topic clusters showed interesting trends, so I have picked out some of my favourites and grouped them into three broad categories: the economy; country- or region-specific news; and central banks, regulation and financial stability.

The economy

There were a handful of clusters that were clearly focused on economic news, covering topics such as tax, manufacturing and services PMIs, inflation and, as discussed already, the labour market.

Because I downloaded articles from UK and US websites, I don’t know which economy in particular the articles were referring to. However, there are some interesting peaks in the charts.

Take the house prices chart, for instance. There is a large peak in mid 2008, and a smaller peak around summer 2014. As a London home owner, I am well aware that in the midst of the global financial crisis in 2008, house prices in London (and the wider UK) fell for the first time since the 1990s, while in 2014 house prices in London went crazy and surged by double digit percentages in the space of a few months. These trends evidently made their way into the newspaper headlines.

Country / region specific news

I found looking at the country and region specific topics to be really interesting.

These are some interesting topics I identified in the charts:

  • UK: the peak in summer 2016 was obviously Brexit.
  • Russia: Russia annexed Crimea in March 2014, and in July 2014 Malaysia Airlines flight MH17 was shot down over Ukraine.
  • Scotland: Scotland went to the polls in September 2014 to vote on independence from the UK.
  • France: The prospect of Marine Le Pen potentially winning the French election grabbed the headlines in early 2017.
  • Italy: After a few months of uncertainty, the Italian state agreed in December 2016 to rescue Monte dei Paschi di Siena with a €20bn bailout.
  • US: I’m not actually sure what the big peak was in mid 2011, but I don’t think anybody would have any difficulty in figuring out why so much of the news was dominated with articles about the US in late 2016 onwards…
  • South Africa: I actually had to look this up when I saw this cluster. The peaks in 2014 and 2017 relate to South African Government debt. In early 2014, it was reported that outstanding debt had risen to its highest level since the mid-1980s, and in April 2017 South Africa’s credit rating was cut to junk status by S&P.
  • Greece & Eurozone: These two probably need no introduction. The Eurozone debt crisis intensified in 2010 and rumbled on for a number of years, with Greece the country in the most trouble and having to go through a series of bailouts. However, if you’re interested in what the large peak in Greece in 2015 was…that was when the country became the first developed country to miss a payment to the IMF.

Central banks, regulation and financial stability

The final set of topics were ones related to central banks, regulation and financial stability.

I think the Federal Reserve topic is pretty quiet from around mid-2013 and earlier because I couldn’t download articles from the US newspapers going back as far as I could with the other sources, so it’s perhaps a bias in the data. However, if you look at cluster number 147 in the table above, you’ll see that the top tokens in this cluster were actually related to the possibility of raising interest rates.

I think the peak in 2012 for the World Bank was in relation to the World Bank presidential election. This was unique in that, for the first time, the election featured the nomination of two non-United States candidates (from Nigeria and Colombia). Eventually, and amid controversy, the US nominee was announced as the new president in April 2012.

The peak in 2011 in the IMF chart relates to the appointment of Christine Lagarde as Managing Director.

I’ll leave it to the reader to figure out what all the other noteworthy subjects were in all the charts.

Next steps

I began this project with 85,000 uncategorized/unlabelled documents and ended with labelled articles that clearly show trends in what was being reported in the news. From here, I’d like to take this analysis further and look into the sentiment in these articles. For example, what’s the general tone in the articles about Donald Trump – is it positive, neutral or negative? Has it changed over time?

From there, it would also be interesting to do some analysis of how different asset classes tend to react to these positive and negative news topics.

Code

 

Comments

  • akhil

    August 29, 2017 at 5:29 pm

    how and where did you collect all the data ?

  • Matt

    August 29, 2017 at 6:19 pm

    Hi Akhil, I scraped websites for the articles.

  • Sandeep

    August 30, 2017 at 4:03 am

    Hi Matt,

    Can I get Python or R Codes which you used to implement this?

  • Matt

    August 30, 2017 at 7:18 am

    Hi Sandeep, the Python code I used is on my GitHub. The link is just above these comments.

  • Rui

    September 12, 2017 at 1:07 am

    Can I get ur raw data?
    Thanks

  • Matt

    September 14, 2017 at 4:27 pm

    Hi Rui, I’m afraid not…some of the news articles I downloaded were from websites where I pay for a subscription, and I think I’d be violating the terms of use if I shared the content.
