Big data experts talk about text, Twitter and turning quantamental
Tom Doris of OTAS, Peter Hafez of RavenPack and Gautham Sastri of iSentium discuss the context around NLP.
Using machines to read text as a way to enhance understanding of market movements is a topic of intense polarisation and debate.
Back in the 90s, work on natural language processing (NLP) involved teams of linguists and computer scientists attempting to code up rules of grammar. Recent work has focused on techniques like word embedding, the underlying idea that a word is characterised by the company it keeps; semantic similarities between words are based on their distribution in large samples of data.
The “bag of words” approach has been applied commercially in finance for more than 10 years. But it can depend on the source of information being analysed: a rule-based approach can work pretty well for news articles that follow certain editorial processes, while social media proves much more challenging.
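The bag-of-words idea can be illustrated with a minimal sketch: word order is discarded entirely and a text is scored by counting hits against a sentiment lexicon. The lexicon below is a hypothetical toy for illustration; commercial systems rely on large, curated, domain-specific word lists.

```python
import re
from collections import Counter

# Hypothetical toy lexicon -- real systems use curated, domain-specific
# word lists with thousands of entries.
POSITIVE = {"beat", "upgrade", "strong", "growth"}
NEGATIVE = {"miss", "downgrade", "weak", "lawsuit"}

def bag_of_words_score(text: str) -> float:
    """Score a text in [-1, 1] by counting lexicon hits, ignoring word order."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(bag_of_words_score("Acme posts strong growth, analysts upgrade"))  # 1.0
print(bag_of_words_score("Acme shares weak after earnings miss"))        # -1.0
```

The obvious weakness is also visible here: because word order is discarded, “growth did not beat expectations” scores as positive, which is one reason rule-based approaches fare better on edited news copy than on social media.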
Tom Doris, CEO at OTAS Technologies, takes a longer view of the technology’s potential; he thinks asking an AI for a better prediction of which stock will outperform may be the wrong question. In terms of efficiency, it certainly can be used to quickly reveal things that would otherwise take many years’ experience and a carefully curated set of information sources to put together.
He said: “What’s exciting is we finally have the techniques that can look at entire economies and start to understand where the ebbs and flows are, and anticipate where the potential downturns or the potential resource constraints are going to be in future.
“That’s quite different to the hype around social media, which has really been focused on the low latency play and being the first to identify when Carl Icahn tweets something, or when the CEO of a company says something stupid – or the President for that matter.”
Doris, who holds a PhD in computer science, believes Twitter can be useful, but in a more limited domain. He said it doesn’t contain the information to answer a lot of the questions that are interesting to traders, and it has proven extremely difficult to extract information from Twitter that eliminates enough of the noise for traders to be interested in it.
“Nothing really works very well with Twitter because basically there just isn’t that much information in Twitter,” said Doris. “I think where a lot of this stuff falls down – whether it’s natural language processing or AI in general – people expect it to be able to extract information where there just fundamentally isn’t that much information.”
One approach to event trading with Twitter is using a so-called white list, where only verified company or influencer accounts are used. Taking on the entire Twitter firehose may offer the wisdom of the crowd, but it’s extremely noisy and can easily become very expensive from a trading point of view.
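A white-list filter of the kind described can be sketched in a few lines. The account names and the stream record format here are illustrative assumptions, not any vendor’s actual schema:

```python
# Illustrative tweet records; a live system would consume the Twitter
# firehose or a filtered stream instead of a hard-coded list.
tweets = [
    {"author": "Carl_C_Icahn", "verified": True, "text": "We have a large position"},
    {"author": "random_user_42", "verified": False, "text": "stocks only go up"},
]

# Pre-vetted accounts considered potentially market-moving (hypothetical names).
WHITE_LIST = {"Carl_C_Icahn", "elonmusk"}

def whitelist_filter(stream, allowed):
    """Yield only tweets from verified accounts on the white list."""
    for tweet in stream:
        if tweet["verified"] and tweet["author"] in allowed:
            yield tweet

for tweet in whitelist_filter(tweets, WHITE_LIST):
    print(tweet["author"], "->", tweet["text"])
```

The trade-off is exactly the one the article describes: the white list throws away the crowd but keeps the noise, and the cost, manageable.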
However, some experts can demonstrate that Twitter-based indicators substantially outperform the market over extended periods of time. Sentiment analytics company iSentium extracts actionable indicators from large amounts of unstructured social content. It points to independent research commissioned by Nasdaq and carried out by Lucena Research.
Gautham Sastri, CEO of iSentium, emphasised that his company does not use a bag of words approach. “Our team of linguists, led by Dr Anna Maria di Sciullo [post-doc from MIT, Fellow of the Royal Society] all have PhDs and have been working since 2008 to build a system that seeks to understand social media messages in a human-like way.
“If short messages are less valuable because they are brief and to the point, then how does one explain the extensive references in Churchill’s History of the Second World War to telegrams that he sent and received? And why were so many resources expended at Bletchley Park on breaking the Enigma machine?
“I would argue the point that brevity is indeed the soul of wit; and low latency, combined with volume, can provide substantial edge when properly exploited.”
Sastri went on to say that his company processes more volume of social content in a second than the entire New York Stock Exchange produces in a day. “For each tweet that we process, we generate 24 different fields that can provide deep insights regarding demography, geolocation, contagion, etc.”
Even with replies and hashtags, Twitter is very disparate and lacks a strong sense of topic threading, says Doris. “You’d have much richer content if you looked at the archives of email within a company,” he said. “From that you build up a much richer picture about how the different parts of the organisation interact and where the connectivity is and what topics people are discussing.
“For instance, analysis of thread length would tell you whether things are languishing – you just don’t have that richness in Twitter. It’s been very successful in part because it’s so brief and to the point, but that makes it less valuable from an analysis point of view.”
Peter Hafez, chief data scientist of big data analytics firm RavenPack, said investors are finding value in Twitter, but not as much as some people might think, and not as much as one might find looking at more traditional sources such as news. “That said, I believe Twitter may get a second life as the knowledge-graph technology being applied becomes more advanced.
“Twitter is more about tracking consumer sentiment than about following the views of prophets. People tweet about the products they like or dislike, what they wish to buy, observed side effects caused by a given drug, etc.
“They don’t necessarily tweet about the companies that own the products or a given subsidiary. For example, I might say that I love the new Q5, but I might leave out Audi, the owner of the product; and surely I would leave out Volkswagen, the owner of Audi, which would be the stock that I would have to buy if I wanted to trade the equities markets. You could of course have traded Audi’s corporate bonds, skipping the link back to VW; tracking products in a point-in-time fashion is the hard part.”
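The product-to-issuer chain Hafez describes (Q5 → Audi → Volkswagen) amounts to a lookup through a knowledge graph. A minimal static sketch, with hypothetical mappings and deliberately ignoring the point-in-time versioning he flags as the hard part:

```python
# Hypothetical, static product -> brand -> listed-parent mappings.
# A production knowledge graph would be far larger and versioned over
# time, so that ownership links are correct as of each tweet's date.
PRODUCT_TO_BRAND = {"Q5": "Audi", "iPhone": "Apple"}
BRAND_TO_PARENT = {"Audi": "Volkswagen AG", "Apple": "Apple Inc."}

def listed_entity(product):
    """Resolve a product mention to the tradable parent company, if known."""
    brand = PRODUCT_TO_BRAND.get(product)
    if brand is None:
        return None
    # Fall back to the brand itself if no separate listed parent exists.
    return BRAND_TO_PARENT.get(brand, brand)

print(listed_entity("Q5"))      # Volkswagen AG
print(listed_entity("widget"))  # None
```

A tweet praising the Q5 thus maps, two hops later, to the equity an investor could actually trade.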
Internal content is now driving the discussion, and many financial firms are looking to sources such as email and instant messaging platforms, including Slack, Instant Bloomberg, Symphony and Skype, for a competitive advantage.
Hafez added: “More traditional firms have started turning their emails and internal investment notes into actionable data points that can be used more directly within their investment process, basically making internal content more easily accessible within the organisation. It allows deeper understanding of where an organisation has a true competitive advantage over publicly available information.”
Returning to a more macro picture of the world, Doris sees NLP technology dovetailing with human discretionary analysis, where the ability to quickly surface information algorithmically will be a fundamental sweet spot.
“It’s about what will be the ultimate truths of the macro environment; are interest rates going to change significantly; is the money supply going to change significantly; do we think that there is going to be a significant change in the socio-economic or the global geo-political environment?” he said.
“That’s the kind of stuff that you do need a human level of perception to understand whether it’s a fad or actually the establishment of a long term trend.”
And regarding firms publishing backtests of their signals, Doris pointed out that it’s really important to know whether trading costs have been taken into account. “If you don’t take into account trading costs, it is trivial to create a ‘signal’ that always makes money, e.g. by assuming you can instantly trade large volumes of stock whenever the main index future price moves and before the stocks catch up.”
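Doris’s warning can be made concrete with a toy calculation: a signal that reliably captures a few basis points per trade looks strongly profitable gross, yet turns negative once a plausible round-trip trading cost is subtracted. The edge and cost figures below are illustrative assumptions, not measured values:

```python
# Toy backtest arithmetic illustrating why cost-free backtests mislead.
per_trade_edge = 0.0005   # assumed 5 bps captured per trade
cost_per_trade = 0.0010   # assumed 10 bps round-trip cost (spread + impact)
n_trades = 1000

gross_return = per_trade_edge * n_trades                    # ignores costs
net_return = (per_trade_edge - cost_per_trade) * n_trades   # after costs

print(f"gross: {gross_return:+.1%}, net: {net_return:+.1%}")
# gross: +50.0%, net: -50.0%
```

The same mechanism explains the latency “signal” Doris mentions: trading stocks the instant the index future moves captures a tiny, reliable edge per trade, but only if execution is assumed to be instantaneous and free.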
Newsweek’s AI and Data Science in Capital Markets conference on December 6-7 in New York is the most important gathering of experts in Artificial Intelligence and Machine Learning in trading. Join us for two days of talks, workshops and networking sessions with key industry players.