Data

Market quotations

After filtering out the quotations unrelated to the topic of interest, we are left with a total of 2,962,875 quotes. In this section we take a closer look at the characteristics of these quotes to gain a better understanding of the data at hand. First, the next figure presents a WordCloud chart in which the most frequent words (after removal of stopwords) are written in a larger font.

WordCloud chart of word frequency

The reader can appreciate the prevalence of the words “market”, “company”, and “investment”, which provides further evidence that the filtering of the quotes worked relatively well. Obviously, the selection strongly depends on the seed words used to filter the sentences. Nevertheless, we notice that none of the most frequent words is one of the seed words we used for filtering, which possibly indicates that the filter managed to capture the general financial semantic context.

On the other hand, it is quite surprising to see the word “will” appearing so often in the quotations, which probably reflects the forward-looking mindset that characterises the financial area. Also, simply scrolling through the most frequent words reveals an abundance of words carrying a positive connotation, such as “good”, “opportunity”, and “growth”. We will further analyse this topic later in the Sentiment Classification section.
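As a rough illustration of how the word counts behind such a chart can be obtained, here is a minimal sketch in plain Python. The tiny stopword list, the tokenisation rule, and the two sample quotes are assumptions made for the example; they are not the stopword list or the data used in the analysis.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# a real analysis would rely on a much fuller list.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "this"}

def word_frequencies(quotes, stopwords=STOPWORDS):
    """Count lowercase word tokens across all quotes, skipping stopwords."""
    counts = Counter()
    for quote in quotes:
        # Words are runs of letters, optionally with one internal apostrophe.
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", quote.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts

# Made-up sample quotes, for illustration only.
sample_quotes = [
    "The market will recover and the company will grow.",
    "This investment is a good opportunity for the company.",
]

freqs = word_frequencies(sample_quotes)
print(freqs.most_common(5))
```

Note that “will” survives here only because it is absent from the assumed stopword list; whether a word shows up in the chart depends directly on that choice.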


Data wrangling

For our analysis we consider the 100 people with the largest number of quotations in our filtered dataset. While counting the quotes for each speaker we encountered an issue: ‘President Donald Trump’ and ‘Donald Trump’ were categorized as different speakers. We therefore checked whether any other speaker had a similar problem. To do so, we first selected the 200 speakers with the largest number of quotations, then split the name of each speaker into its parts (name, surname, …), and finally checked whether any surname appeared more than once. With this procedure, we discovered that Trump was not the only speaker who appeared under different names. In fact, we found that President Obama appeared under three different names (‘President Obama’, ‘President Barack Obama’, ‘Barack Obama’) and Prime Minister Theresa May under two (‘Prime Minister Theresa May’, ‘Theresa May’). To overcome this problem we decided to delete the title in front of each name and to store every name using the name-surname structure:

  • ‘Prime Minister Theresa May’, ‘Theresa May’ => ‘Theresa May’
  • ‘President Obama’, ‘President Barack Obama’, ‘Barack Obama’ => ‘Barack Obama’
  • ‘President Trump’, ‘President Donald Trump’, ‘Donald Trump’ => ‘Donald Trump’
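
The duplicate check and the merging step described above could be sketched roughly as follows. This is a minimal sketch, not the code used for the analysis: the title list, the override table for names left without a first name, and the toy quote counts are all assumptions introduced for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical list of titles to strip (an assumption, not an exhaustive list).
TITLES = ("Prime Minister", "President")

# Hypothetical overrides for names that lack a first name once the title is removed.
CANONICAL = {"Obama": "Barack Obama", "Trump": "Donald Trump"}

def strip_title(name):
    """Remove a leading title such as 'President' from a speaker name."""
    for title in TITLES:
        if name.startswith(title + " "):
            return name[len(title) + 1:]
    return name

def surname_groups(names):
    """Group names by their last token and keep surnames shared by several names."""
    groups = defaultdict(set)
    for name in names:
        groups[name.split()[-1]].add(name)
    return {surname: found for surname, found in groups.items() if len(found) > 1}

def normalize(name):
    """Map a raw speaker name to the name-surname form."""
    stripped = strip_title(name)
    return CANONICAL.get(stripped, stripped)

# Toy quote counts, made up for illustration only.
quote_counts = Counter({
    "President Donald Trump": 5, "Donald Trump": 3, "President Trump": 2,
    "President Obama": 1, "President Barack Obama": 2, "Barack Obama": 4,
    "Prime Minister Theresa May": 2, "Theresa May": 1,
    "Jane Doe": 3,
})

# Step 1: flag surnames that occur under more than one speaker name.
duplicates = surname_groups(quote_counts)

# Step 2: merge the counts of all variants under the normalized name.
merged = Counter()
for name, n in quote_counts.items():
    merged[normalize(name)] += n

print(duplicates)
print(merged.most_common())
```

A shared surname only flags candidates: two different people can have the same surname, so the flagged groups still need a manual look before merging.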

In the graph below we present the twenty most quoted people in our financial dataset.