The enormous influx of written data has pushed the need to analyze language to the forefront. To find relevant information in the ever-growing sea of words, companies have devised ways to model and crunch the data meaningfully. Yet the complexity of language makes the analysis difficult and time-consuming. More often than not, the process leaves you feeling like you are killing flies with a cannon. But is there a way to dodge the complexity? Would lists of words, sprinkled with statistics and trends, suffice for conducting thorough sentiment analysis? I’d say not quite.
The English linguist John Rupert Firth authored one of my most beloved quotes on linguistics: “You shall know a word by the company it keeps.” In text analytics, almost everything is context, at all levels. For example, depending on its context and position in a sentence, the word ‘left’ can be a verb, as in ‘I left home’; an adverb, as in ‘I turned left’; or an adjective, as in ‘my left foot.’ Syntax determines the class a word belongs to, which sets a blueprint for combining words to convey meaning in a way that makes sense to the listener.
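To make the idea concrete, here is a toy sketch in Python. The word lists and rules below are my own invention for illustration; a real part-of-speech tagger learns such patterns statistically from annotated text rather than from hand-written rules like these. Still, it shows the principle: the tag for ‘left’ comes entirely from its neighbors.

```python
# Toy disambiguator for the word "left" based on its immediate context.
# Illustrative only: the word lists and rules are invented, not a real tagger.

PRONOUNS = {"i", "you", "he", "she", "we", "they"}
MOTION_VERBS = {"turned", "went", "looked", "veered"}

def tag_left(tokens):
    """Return a part-of-speech guess for each occurrence of 'left'."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok.lower() != "left":
            continue
        prev = tokens[i - 1].lower() if i > 0 else ""
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if prev in MOTION_VERBS:
            tags.append("adverb")      # "I turned left"
        elif prev in PRONOUNS:
            tags.append("verb")        # "I left home"
        elif nxt:
            tags.append("adjective")   # "my left foot" (followed by a noun)
        else:
            tags.append("unknown")
    return tags

print(tag_left("I left home".split()))    # ['verb']
print(tag_left("I turned left".split()))  # ['adverb']
print(tag_left("my left foot".split()))   # ['adjective']
```

The same surface form receives three different tags purely from context, which is exactly why a context-free word list cannot capture what ‘left’ contributes to a sentence.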
But it gets even more complex. Compare your ‘left foot’ with the ‘left luggage’ office in Paddington. ‘Left’ functions as an adjective in both phrases, yet the meanings are entirely different. Welcome to the fascinating world of semantics! Indeed, the world of linguistics is much like a carnival experience, where nothing is what it seems – we must look beneath the levels of the obvious, past the masks and surface illusions.
From here on, our linguistic cavalcade brings us to the house of mirrors, where we take a peek at words that are their own opposites. Here ‘left’ can be ‘the departed’ or ‘the ones remaining,’ as in ‘the gentlemen left, and only the ladies were left.’ On another level, ‘awful’ sounds bad, but to make an ‘awful lot of sense’ is what I’m trying to do here. And while a ‘cheap’ handbag might be worth acquiring, is it still worth having if it looks ‘cheap’?
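This is where a bare word list with sentiment scores falls over. The following sketch (the tiny lexicon and the ‘looks cheap’ rule are invented for illustration, not taken from any real sentiment system) shows a naive scorer giving the same verdict to a flattering and an unflattering use of ‘cheap,’ and one minimal context rule telling them apart:

```python
# A naive word-list sentiment scorer, and why it stumbles on words whose
# polarity flips with context. Lexicon and rules are invented for illustration.

LEXICON = {"cheap": 1, "awful": -1, "great": 1, "bad": -1}

def naive_score(text):
    """Sum lexicon scores of the words, ignoring all context."""
    return sum(LEXICON.get(w.strip(".,"), 0) for w in text.lower().split())

# Both sentences score +1, although the second is clearly negative:
print(naive_score("I found a cheap handbag."))  # 1
print(naive_score("The handbag looks cheap."))  # 1

def context_score(text):
    """Same scorer, plus one context rule: 'looks cheap' is pejorative."""
    words = [w.strip(".,") for w in text.lower().split()]
    score = 0
    for i, w in enumerate(words):
        s = LEXICON.get(w, 0)
        if i > 0 and words[i - 1] == "looks" and w == "cheap":
            s = -s  # flip polarity when 'cheap' describes appearance
        score += s
    return score

print(context_score("The handbag looks cheap."))  # -1
```

One hand-written rule handles one word pattern; scaling this to a whole language is precisely the hard problem the rest of this piece is about.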
Moving to a bigger textual context, such as a news story, let’s consider a relatively recent piece of news:
“Hewlett-Packard Co. (HPQ) Chief Executive Officer Meg Whitman, the company’s board and former CEO Leo Apotheker were sued by an investor alleging that mismanagement and botched acquisitions have destroyed shareholder value.”
Through context analysis, I learn that this story concerns two people, Meg Whitman and Leo Apotheker, who hold or have held an executive position at Hewlett-Packard, a company whose ticker symbol is HPQ. I also learn that an investor sued them for mismanagement.
When I later read, “Executives have failed ‘in their most fundamental stewardship responsibilities owed to HP,’ resulting in a series of mishaps including bribery probes, the hiring and firing of Apotheker and an $8.8 billion write-down in 2012 of the Autonomy Corp. acquisition, shareholder A.J. Copeland said in a complaint in federal court in San Francisco,” I learn that a shareholder (a synonym of investor) named A.J. Copeland is complaining, among other things, about an $8.8 billion write-down of the acquisition of Autonomy Corp. (another company).
But can we get this information through text analytics? Yes, we can, if we are able to identify the actors and objects, classes, descriptions, facts, events and relations. But could you reach the same kind of insight by treating a document as a jumbled bucket of terms? Again, I don’t think so.
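The difference between the two views can be sketched in a few lines of Python. The single regular expression below stands in for a real entity-and-relation extractor (production systems use parsers and trained models, not one hand-written pattern), but it makes the contrast visible: the bag of words counts terms, while the pattern recovers who did what to whom.

```python
# Bag-of-words vs. a minimal relation extraction, on a simplified sentence.
# The regex is a hand-written stand-in for a real extractor, for illustration.
import re
from collections import Counter

sentence = ("Meg Whitman and Leo Apotheker were sued by an investor "
            "alleging mismanagement.")

# Bag-of-words view: term frequencies only; actors and relations are lost.
bag = Counter(w.lower().strip(".") for w in sentence.split())
print(bag["sued"], bag["investor"])  # 1 1  (counts, nothing more)

# Context-aware view: a passive-voice pattern recovers the relation.
pattern = re.compile(
    r"(?P<defendants>.+?) were (?P<action>\w+) by (?P<plaintiff>an? \w+)"
)
m = pattern.search(sentence)
if m:
    print(m.group("plaintiff"), "->", m.group("action"),
          "->", m.group("defendants"))
    # an investor -> sued -> Meg Whitman and Leo Apotheker
```

The bag of words knows that ‘sued’ and ‘investor’ occur; only the structured view knows that the investor is the one doing the suing.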
There is an even bigger context that transcends the document, akin to life itself, and just as elusive. To call ‘Chicago’ ‘Chiberia’ is both witty and informative, as it refers to the low temperatures that often characterize the city. But to know that it is witty, we must catch the play on words – to know that ‘Chiberia’ refers to ‘Siberia,’ and that Siberia has a reputation for being one of the coldest places on earth. But doesn’t it also have a reputation for being remote, isolated and home to prison work camps? Now what? We need to know that Chicago is experiencing very low temperatures before we can make the right connection. This is the hardest nut to crack: adding real-world knowledge to a system. Words, expressions and whole ideas do not come alone; they need context to be fully understood.
Reading comments on Twitter about a notorious low-cost airline, I stumbled upon one saying that ‘It would be safer to travel on the back of a shark.’ I’ve also read about an amusement park being compared to the Tokyo Metro. Siberia is cold and isolated, the Tokyo Metro is crowded, and sharks are unsafe for humans. Sometimes these associations are local, culture-bound, or shifting over time. It is hard to model this kind of knowledge, but who knows, perhaps one day we will have all of this figured out as well?
In sum, it is a long and winding trip through a text from A to Z, but if you want to take the ride, you had better have good tools handy to analyze it and contextualize the knowledge you acquire, or you might end up thinking you saw a white building in a quantity of one, when in fact you saw the Taj Mahal.
Senior Computational Linguist AKA The Context Doctor