Attensity Blog

  • Inside Automated Sentiment Analysis

    This post details Biz360’s automated sentiment analysis system, including our goals, how the system works, how we measure success, and the ways it can be used and misused. Before getting into the how or why, I want to start with the what. For our purposes, sentiment is the opinion of the author of an article toward the subject of that article. We classify sentiment into four possible categories.

    Positive

    Arguing for something, saying something is a good product, talking about good things a person or company has done, enjoying something, liking something, preferring something. If a mostly positive post has a small portion that is negative, it is still positive.

    Negative

    Arguing against something, saying something is a bad product, describing a bad experience, talking about bad things a person or company has done, disliking or having problems with something. If a mostly negative post has a small portion that is positive, it is still negative.

    Neutral

    If a post doesn’t express any opinion, doesn’t present anyone or anything in a favorable or unfavorable way, and wouldn’t lead someone to form an opinion for or against, it is neutral.

    Mixed

    If a post is both positive and negative, such as saying something was good in some ways but bad in others, or if the post talks about different subjects and is positive toward one subject but negative toward another, then rate the post as mixed.

    The first question is why you need automated sentiment analysis at all. The simple answer is that there’s just too much content. As conversations that used to take place over coffee and on street corners move to Twitter and forums, they become trackable. If a magazine with 100,000 readers mentions you in an article, you’ll read that article and discuss what it’s saying about you. If 10,000 people tell ten of their friends what they think of Kevin Smith vs. Southwest Air, you can’t hope to read more than a small sampling. It’s this latter use case that we cared about. In particular, we wanted to answer questions like these:

    1. What portion of my coverage is positive, negative, etc.?
    2. I got a spike in coverage on Monday. Was that spike positive or negative?
    3. What kinds of positive things are people saying? Negative things?

    We knew from the start that accuracy on the individual article level was never going to be that good. That is, if you want to know what the sentiment for some particular article is, the best thing to do is click on it, read it, and form your own opinion. With the help of Bill MacCartney, an NLP researcher from Stanford, we quickly homed in on the following design parameters:

    1. A statistical classification system using two classifiers to detect positive and negative, and another classifier to combine these results. We would start with a simple Naive Bayes classifier and a Decision Tree classifier to get everything working, and experiment with more advanced classifiers like the Linear MaxEnt classifier once we had a baseline to measure improvements.
    2. The system would be trained using lots of data from Mechanical Turk. Each item would be rated multiple times so we could throw out the results from raters who didn’t understand or were not taking enough care.
    3. Our training data would be real social media content, drawn from all the types of social media we process (blogs, micro-blogs, etc.).
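
    To make the first design parameter concrete, here is a minimal sketch of a Naive Bayes detector, with one instance answering "is it positive?" and another answering "is it negative?". It is an illustration only, not the Biz360 implementation: the class name `TinyNaiveBayes`, the single-word features, the add-one smoothing, and the four training clips are all assumptions made up for this example.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Binary multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def train(self, labeled_texts):
        # labeled_texts: iterable of (text, label) pairs, where label is True or False.
        for text, label in labeled_texts:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)

    def log_odds(self, text):
        # log P(True | text) - log P(False | text); above zero leans toward True.
        # Assumes both labels were seen at least once during training.
        total_docs = self.doc_counts[True] + self.doc_counts[False]
        score = 0.0
        for label, sign in ((True, 1.0), (False, -1.0)):
            total_words = sum(self.word_counts[label].values())
            log_prob = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                smoothed = self.word_counts[label][word] + 1  # add-one smoothing
                log_prob += math.log(smoothed / (total_words + len(self.vocab)))
            score += sign * log_prob
        return score


# Hypothetical training clips: (text, is_positive, is_negative).
clips = [
    ("love this keyboard", True, False),
    ("great layout", True, False),
    ("stupid layout", False, True),
    ("threw it out", False, True),
]

positive_detector = TinyNaiveBayes()
positive_detector.train((text, pos) for text, pos, neg in clips)

negative_detector = TinyNaiveBayes()
negative_detector.train((text, neg) for text, pos, neg in clips)

print(positive_detector.log_odds("love the layout"))  # above zero: leans positive
print(negative_detector.log_odds("love the layout"))  # below zero: leans not negative
```

    The two detectors are trained and scored independently, so a post can score high on both. The third classifier that combines their outputs is not shown here.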

    At a very high level, text classification systems get lumped into two groups: those based on statistical learning from data, and those based on hand-coded rules. Our system is solidly in the statistical camp. We were skeptical that a rule-based system could encompass the wide variety of topics and writing styles and the frequency of ungrammatical or misspelled content on the less formal parts of the Internet.

    Our sentiment engine turns each post into a set of features, like (“good”, “deal”) -> 2, meaning the word “good” followed by the word “deal” occurs twice. This gets fed into a two-stage system. First, everything gets flagged for how positive it is (regardless of how negative it is) and for how negative it is (regardless of how positive it is). Next, these get combined into the four categories that are displayed. So high positive sentiment and low negative sentiment would be positive, and high positive and high negative would be mixed.
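
    The featurization and the two-stage idea can be sketched roughly as follows. This is a sketch under assumptions, not the production engine: `bigram_features` and `combine` are hypothetical names, the punctuation handling is simplistic, and the fixed 0.5 threshold is a stand-in for the trained combining classifier described above.

```python
from collections import Counter


def bigram_features(text):
    """Count adjacent word pairs, e.g. ("good", "deal") -> 2."""
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    # Note: this simple version also pairs words across sentence boundaries.
    return Counter(zip(words, words[1:]))


def combine(positive_score, negative_score, threshold=0.5):
    """Map two independent scores in [0, 1] onto the four displayed categories.

    Assumption: a plain threshold rule standing in for a learned combiner.
    """
    is_positive = positive_score >= threshold
    is_negative = negative_score >= threshold
    if is_positive and is_negative:
        return "mixed"
    if is_positive:
        return "positive"
    if is_negative:
        return "negative"
    return "neutral"


features = bigram_features("Good deal on the keyboard. Really good deal!")
print(features[("good", "deal")])  # 2

print(combine(0.9, 0.1))  # positive
print(combine(0.9, 0.8))  # mixed
print(combine(0.1, 0.1))  # neutral
```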

    We really wanted a mixed category, because in terms of whether it’s a post worth reading, someone who is saying both good and bad things about you is even more interesting than positive or negative. Consider the following three clips:

    1. I love my Kinesis Maxim keyboard, it’s the best. My wrists feel great since I’ve been typing on it.
    2. Kinesis is stupid, the Maxim has a stupid layout. I had one for a while but I threw it out.
    3. I like my Kinesis Maxim, but the left alt key is too small and too far to the left.

    Sure, the first one is what you hope everyone is saying, but reading these doesn’t provide much value. The second one at least is an opportunity for damage control, but the third one is the real gold. In a system based on just a range from negative through neutral to positive, the positive and negative would cancel out and this kind of thing would get lumped into the neutral bucket.
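
    A small sketch of why a single negative-to-positive scale loses the third clip. The scores below are invented for the three clips above, and `single_scale_label` and its `margin` parameter are hypothetical; the point is only that subtracting one score from the other erases the difference between "neither" and "both".

```python
def single_scale_label(positive_score, negative_score, margin=0.2):
    """Collapse both scores onto one axis, as a negative..positive scale would."""
    net = positive_score - negative_score
    if net > margin:
        return "positive"
    if net < -margin:
        return "negative"
    return "neutral"


# Hypothetical (positive_score, negative_score) pairs for the three clips.
clip_scores = {
    "clip 1": (0.90, 0.05),  # loves the keyboard
    "clip 2": (0.05, 0.90),  # threw it out
    "clip 3": (0.80, 0.75),  # likes it, but the alt key is too small
}

for name, (pos, neg) in clip_scores.items():
    both_high = pos >= 0.5 and neg >= 0.5  # what two separate flags can still see
    print(name, single_scale_label(pos, neg), "both high:", both_high)
```

    On the single scale, clip 3 comes out as neutral even though both of its scores are high, which is exactly the case a separate mixed category is meant to catch.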

    This kind of statistical system isn’t any good without good data, so we used an approach that gives us lots of good data quickly and cheaply. We sent out thousands of clips to Mechanical Turk, Amazon’s “artificial artificial intelligence” service, where they were scored by ten humans each. The instructions they were given were exactly the definitions I gave above. Those aren’t just descriptions of what we think the system produces; they are the starting point. When the results came back, the humans didn’t always agree, and some agreed more than others. We threw out the raters who looked like they just didn’t understand the problem at all or were clicking randomly since payment was per item. On the remaining ratings, we still got disagreements, so we took the majority: if five people said positive, three said neutral and two said mixed, we’d use that clip as training data for positive. All of our data was real social media data. We evaluated one off-the-shelf solution which was trained on newspaper data, and when it said that “Comcast sucks!” was neutral, we gave up on that idea.
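
    The clean-up and majority vote could look something like the sketch below. The post does not say how unreliable raters were identified or how ties were handled, so both are assumptions here: raters are dropped when they match the per-item majority less than half the time, and tied items get no label. All function names and sample ratings are hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label, or None if there are no votes or a tie for first."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def drop_unreliable_raters(ratings, min_agreement=0.5):
    """ratings: list of (item_id, rater_id, label) tuples.

    Keeps only raters who match the per-item majority often enough.
    Raters who only rated tied items cannot be scored and are dropped too.
    """
    votes_by_item = {}
    for item_id, rater_id, label in ratings:
        votes_by_item.setdefault(item_id, []).append(label)
    majorities = {item: majority_label(v) for item, v in votes_by_item.items()}

    hits, totals = Counter(), Counter()
    for item_id, rater_id, label in ratings:
        if majorities[item_id] is None:
            continue
        totals[rater_id] += 1
        hits[rater_id] += int(label == majorities[item_id])

    keep = {r for r in totals if hits[r] / totals[r] >= min_agreement}
    return [row for row in ratings if row[1] in keep]


# The example from the text: five positive, three neutral, two mixed.
print(majority_label(["positive"] * 5 + ["neutral"] * 3 + ["mixed"] * 2))  # positive

# Hypothetical ratings: rater "d" disagrees with the majority every time.
ratings = [
    ("clip1", "a", "positive"), ("clip1", "b", "positive"),
    ("clip1", "c", "positive"), ("clip1", "d", "negative"),
    ("clip2", "a", "negative"), ("clip2", "b", "negative"),
    ("clip2", "c", "neutral"), ("clip2", "d", "positive"),
]
cleaned = drop_unreliable_raters(ratings)
print(sorted({rater for _, rater, _ in cleaned}))  # ['a', 'b', 'c']
```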

    To evaluate our accuracy, we looked at a whole slew of numbers. We used a technique called k-fold cross validation, which means that we’d split our human-annotated data into k parts and, for each part in turn, hold it back to evaluate a system trained on the rest. A big challenge was that most of the content we got was neutral or positive, not mixed or negative. This makes it hard to use simple accuracy as the only metric. That is, if I have 90 items that should be classified as A and 10 items that should be classified as B, I could be 90% accurate by just saying everything was A. So I looked at the accuracy rates for each of the categories separately, and tried to balance them. Given my example with 90 A and 10 B, if I could get 90% accuracy, I’d really prefer 81 out of 90 As classified correctly and 9 out of the 10 Bs.
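
    The 90/10 arithmetic is easy to check in code. The sketch below is illustrative only: `k_fold_indices` is a bare-bones round-robin split (real setups usually shuffle first), and `per_class_accuracy` is a hypothetical helper name for the per-category rates described above.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds, assigned round-robin."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


def overall_accuracy(true_labels, predicted_labels):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of each true category that was classified correctly."""
    correct, total = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + int(truth == guess)
    return {label: correct[label] / total[label] for label in total}


truth = ["A"] * 90 + ["B"] * 10

# Strategy 1: call everything A.
always_a = ["A"] * 100
# Strategy 2: 81 of the 90 As right, 9 of the 10 Bs right.
balanced = ["A"] * 81 + ["B"] * 9 + ["B"] * 9 + ["A"] * 1

for name, predictions in (("always A", always_a), ("balanced", balanced)):
    print(name, overall_accuracy(truth, predictions), per_class_accuracy(truth, predictions))

folds = list(k_fold_indices(10, 5))
print(folds[0])  # ([1, 2, 3, 4, 6, 7, 8, 9], [0, 5])
```

    Both strategies score 90% overall, but only the per-category view shows that the first one never finds a single B.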

    Predicted vs. Human Sentiment Chart


    Of course, there’s no “make mistakes evenly” button to press, but I think we found a combination that gives useful results. You can see in the attached chart of predicted vs. human-annotated sentiment that the errors are evenly spread across the categories. This illustrates what we mean when we say that sentiment, though it is only correct for about 2/3 of the individual items, is directionally accurate. If the system finds 100 articles for a topic, and says 50 of them are positive, a lot of those will be wrong. Maybe you go through them and you see that 10 of them were neutral, four negative and one mixed. But when you go to the other categories, you’ll find that the errors mostly balance out. Some of the neutral should have been positive, and so on. So maybe there should have been 52 positive.
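
    To see how per-item errors can mostly cancel in the totals, here is a sketch with a made-up confusion matrix. The numbers are hypothetical and are not taken from the chart; only the "predicted positive" column (35 right, 4 negative, 10 neutral, 1 mixed) follows the example in the paragraph above.

```python
LABELS = ["positive", "negative", "neutral", "mixed"]

# Rows: human-annotated label. Columns: predicted label. Same order as LABELS.
# Hypothetical counts for 100 articles.
CONFUSION = [
    [35, 2, 12, 3],   # human said positive
    [4, 7, 1, 0],     # human said negative
    [10, 1, 18, 1],   # human said neutral
    [1, 1, 1, 3],     # human said mixed
]

total = sum(sum(row) for row in CONFUSION)
item_accuracy = sum(CONFUSION[i][i] for i in range(len(LABELS))) / total

human_totals = {LABELS[i]: sum(CONFUSION[i]) for i in range(len(LABELS))}
predicted_totals = {
    LABELS[j]: sum(row[j] for row in CONFUSION) for j in range(len(LABELS))
}

print("per-item accuracy:", item_accuracy)  # 0.63
for label in LABELS:
    print(label, "human:", human_totals[label], "predicted:", predicted_totals[label])
```

    In this made-up case only 63 of the 100 individual calls are right, yet no category total is off by more than two articles.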

    There’s a strong temptation when building an automated sentiment system to treat neutral as “I’m not sure”. Computers make different kinds of mistakes than humans, and when the computer screws up something a human would have no trouble classifying correctly, it erodes confidence. The problem with this approach is that it focuses too much on not being wrong, and not enough on being right. If uncertain posts are rated as neutral, it changes the whole distribution of content. If you look at a topic and 75% of the content is “neutral”, how much is really neutral and how much is swept under the rug because it didn’t cross a confidence threshold? We treat neutral as just another category. To classify something that should be positive, negative, or mixed as neutral is just as incorrect as vice-versa.
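
    The distortion is easy to reproduce. In the sketch below, the per-category probabilities are invented, and the post does not say the engine exposes probabilities at all, so treat that as an assumption. `argmax_label` treats neutral as just another category, while `neutral_when_unsure` is the tempting fallback policy with a hypothetical confidence threshold of 0.6.

```python
from collections import Counter


def argmax_label(probabilities):
    """Pick the most likely category; neutral gets no special treatment."""
    return max(probabilities, key=probabilities.get)


def neutral_when_unsure(probabilities, min_confidence=0.6):
    """The tempting policy: anything below the confidence bar becomes neutral."""
    label = argmax_label(probabilities)
    return label if probabilities[label] >= min_confidence else "neutral"


# Hypothetical classifier outputs for four posts.
posts = [
    {"positive": 0.70, "negative": 0.10, "neutral": 0.10, "mixed": 0.10},
    {"positive": 0.45, "negative": 0.20, "neutral": 0.25, "mixed": 0.10},
    {"positive": 0.10, "negative": 0.50, "neutral": 0.30, "mixed": 0.10},
    {"positive": 0.10, "negative": 0.10, "neutral": 0.70, "mixed": 0.10},
]

as_category = Counter(argmax_label(p) for p in posts)
as_fallback = Counter(neutral_when_unsure(p) for p in posts)

print(as_category["neutral"] / len(posts))  # 0.25
print(as_fallback["neutral"] / len(posts))  # 0.75
```

    With the fallback policy, three quarters of these posts show up as neutral, although only one of them is most likely to be neutral.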

    I hope this has given you some insight into how Biz360’s sentiment engine works, and lets you make better sense of the numbers you are seeing, or, if you are still comparing solutions, gives you things to look for and questions to ask. I’ll be following this up in the future with another article explaining “entity” or topic-based sentiment.


    8 Responses to “Inside Automated Sentiment Analysis”

    1. Seth Grimes

      Any chance you could share the data from the Mechanical Turk ratings? Did you really get cases as indeterminate as “five people said positive, three said neutral and two said mixed”? What proportion?

      Regarding system capabilities, with statements such as “I like my Kinesis Maxim, but the left alt key is too small and too far to the left.”: You seem to be resolving sentiment at the sentence level. Can you separately resolve for each of the two clauses and for each of the entities (Kinesis Maxim and left alt key) that the sentence contains?

      On a related topic: Folks who are interested in the general field should think about attending the upcoming Sentiment Analysis Symposium, which I’m organizing: http://www.sentimentsymposium.com/

      • Maria Ogneva

        Hi Seth,

        Re: data from mechanical turk ratings… It just so happens that our sentiment engineer who wrote this post is out of office this week, and he is the best person to answer this question. I will ask him to follow up with you when he is back.

        Re: system capabilities… Yes, that was sentiment at the sentence level. In the case of Twitter, this is pretty much all you get, a sentence or two, so if this was a tweet, that would pretty much be the sentiment of the tweet. With longer articles and blog posts, however, we let you measure sentiment on two levels: topic (or entity) level, in this case Kinesis Maxim, and article level. When asked for an entity-level sentiment, the system looks for words in close proximity to the entity. On an article level, it looks at the entire article.

        Does that help?

        Cheers!

        - Maria

    2. Kevin Peterson

      Seth,

      I made up that example for the post, but I did see items like that. The 79% inter-annotator agreement does mean 8 out of 10 on average, but there’s a lot of variability, mainly among difficult items. I see either almost everyone agrees, or it’s all over the place. For example, some real numbers are:

      +, -, N, M
      1, 3, 1, 0
      0, 5, 1, 0
      0, 9, 0, 0
      0, 0, 7, 0
      8, 0, 6, 2

      The first four are about what you expect — mostly agreement with some disagreement, or everyone agreeing. The last one is less common, but there’s at least one like this on each screen I look at, so maybe up to 5 or 10%.

      Regarding sentence-level sentiment, we don’t do aggregation that way. We do featurization on the entire text and then feed it into the classifiers. I was just giving some examples to show what I mean by mixed. You should just imagine that those are separate tweets, totally unrelated.

      I’ll talk more about how we do entity-level sentiment in a future post. We wouldn’t consider left alt-key to be a named entity. Taking terminology from a previous product, we’d probably call it a feature, as in a component of a larger whole. We’d only extract Kinesis Maxim as a named entity. I don’t have things set up right now on this machine, so I can’t say whether it’s right or not, but the correct sentiment for that should be mixed overall, and also mixed for Kinesis Maxim as a named entity.

    3. Katie Delahaye Paine

      Having trained humans to accurately read and code social media items since 1996, I know a bit about accuracy in reading social media. While I applaud your transparency in how you set up your systems, my biggest gripe with your methodology is the use of Mechanical Turk for accurate reading. In my world, intercoder reliability scores of 88% or better are required, and I know that it takes a good six weeks to train readers to read any given subject to that level of accuracy. I question how 10 random readers without proper training could provide anything close to an acceptable level of accuracy. So if the system is based on inaccurate testing, I have to wonder how accurate it really is.

      • Maria Ogneva

        Thanks for your insight, Katie. First of all, there is the cost of training readers for six weeks. That is hard to do without affecting the price of the product, which is pretty aggressively priced compared to our peers. Secondly, we don’t use raw annotations; we use consensus.
