From Justin Bieber to Data Scientists: How Twitter Became a Hot Field of Study

Erika Fry | 2014-09-01 04:00

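Since Twitter's founding, scholars of every stripe have flocked to the microblogging platform, not to post but to do research. To academia, Twitter holds the richest, perhaps unprecedented, dataset. It amounts to a virtual Petri dish of real-time data, drawing scholars from every discipline to carry out studies of all kinds.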

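Articles in the “Contagion” series:

[Contagion, Part 1] Deadlier than SARS: How MERS, a bat virus, became a killer of humans

[Contagion, Part 2] How the “selfie” became a social epidemic

[Contagion, Part 3] How do market sell-offs happen?

[Contagion, Part 4] How M&A rumors spread

[Contagion, Part 5] From Justin Bieber to data scientists: How Twitter became a hot field of study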

A couple of years ago, Sherry Emery, a health economist at the University of Illinois at Chicago, found herself reading tweets about “smoking hot girls.” Also about “smoking ribs,” “smoking weed,” and the “smoking chimney” of the papal conclave. If she got lucky, they’d be about “smoking squares” or just “smoking,” in an easily decoded context that referred to cigarettes.

Emery has studied the impact of tobacco-related advertising for years. Until recently, that meant looking at TV and radio spots, tracking Nielsen Ratings and regional smoking rates. But then, one night watching Netflix in 2011, she had a thought: if she was on the web, so were many others—and they were likely leaving a trail of their attitudes towards smoking on social media platforms such as Twitter.

In September 2011, the National Cancer Institute awarded her a $7.2 million grant to look into it—and so she went, a pioneer (in her line of work) into the brave new world of Twitterology.

She’s hardly alone these days. Since Twitter was founded in 2006, academics have flocked to the micro-blogging platform, not to tweet messages (though some do that too) but to study them. With 225 million users issuing half a billion tweets per day, Twitter represents the richest dataset to hit academia... well, maybe ever: a virtual Petri dish of real-time data, attractive to scholars of all disciplines, for studies of all sorts. Physicists have used Twitter to study networks; psychologists to study narcissism; linguists to study regional language variation. There are research papers about what can be learned by using Twitter to track dental pain, air quality and public concern about flu outbreaks, as well as studies on Twitter’s potential to predict the outcome of NFL games, diagnose post-traumatic stress disorder, and measure worldwide happiness. In all, some 2,000 journal articles and 3,000 conference papers have been written about Twitter (or have at least contained the word in their title, keywords or abstract), according to Scopus, a database of academic publications. There’s even a paper, published in 2013 in the Journal of Documentation, entitled “What do people study when they study Twitter? Classifying Twitter related academic papers.”

The social networking site is not the most likely of tools to have caught fire in the Ivory Tower. How did Twitter, a site that traffics in 140-character-or-less messages and that counts two pop stars—Katy Perry (with 55.6 million followers) and Justin Bieber (with 53.6 million)—as its most influential users, become so hot among the academic set?

In this series on contagion, my FORTUNE colleagues and I set out to explore how things spread—from M&A rumors, to market panics, to the ‘selfie’. And for the final installment of this series, we decided to get especially meta. After all, how better to probe the anatomy of a social epidemic than to track how Twitter, one of the preferred tools for studying contagion these days, got so contagious among people studying it?

The story begins in the not-too-distant past with computer scientists. Even more than most academics, computer scientists need data, and for years they’ve mined whatever odd and interesting datasets have come their way. The Enron emails (some 600,000 messages belonging to 158 Enron employees, made public by the Federal Energy Regulatory Commission after its investigation of the company) became popular fodder in the field, for example, after they were released in 2003.

Social media may seem an obvious next frontier for data-minded academics, but when computer scientist Jennifer Golbeck first started studying such platforms in 2003 (she was inspired by MySpace), it was not considered particularly promising or serious work. Colleagues in her highly technical field dismissed it as “social science”; and in the nascent universe of online social networks, the largest was a hook-up site with a community of 20 million members called AdultFriendFinder.

Golbeck, a Ph.D. student at the time, saw greater potential in such platforms: “There was so much interesting computing to be done,” she says. But she was still battling to convince computer science departments of this when she completed her degree in 2005.

Now a professor at the University of Maryland, College Park, Golbeck heads up the school’s Human-Computer Interaction Lab and continues to study what can be learned about humans and relationships using social media. Her prolific output has included papers on “the sense and structure of community on YouTube,” how Congressional representatives use Twitter, and the dynamics of the human-pet relationship (across many platforms). That work makes her much in demand: her TED talk, “The Curly Fries Conundrum: Why social media likes say more than you might think,” has been viewed 1.2 million times since October 2013.

Eytan Adar, now an assistant professor of information and computer science at the University of Michigan, was another pioneer. Years ago he used blogs to study how memes spread and, in 2007, he co-founded the International Conference on Weblogs and Social Media in an effort to build community among researchers doing similar work. That year, the event drew 145 people, offered talks like “Building Trust on Corporate Blogs” and “Social Browsing on Flickr,” and featured Ev Williams, the founder of a then-fledgling start-up called Twitter, as the keynote speaker. (Like Twitter, the conference has grown a lot since then.)

The first academics to study Twitter tended to be computer scientists like Golbeck and Adar, who had both the savvy to understand Twitter and the tech skills to collect and manipulate its data, as well as physicists and information science and communications scholars who were particularly interested in network effects. Research from those early years tended to focus on Twitter itself: statistical analyses of how, and for what, the service was used. Then came more sophisticated studies focused on the mechanics of Twitter: the study of things like “unfollow dynamics,” “transient crowd discovery,” or “patterns in Twitter intra-topic user and message clustering.” Later to the party were social scientists like Emery, who dreamt up applications for the data (predicting the outcome of elections, for instance, or elucidating the narcissism of Twitter’s college-aged users) but tended to be less technically adept at collecting and manipulating it. (As a result, a number of interdisciplinary research efforts, like those that take place in Golbeck’s lab, have sprung up.)

According to the study “What do people study when they study Twitter?”, the number of Twitter-focused papers grew from 3 in 2007 to 8 in 2008 to 36 in 2009, and has risen considerably since then.

Chinese translation by Simon

“Early adopters in the social sciences of research data from Twitter were just mocked,” says Stuart Shulman, the CEO of Texifter, a developer of text analysis tools and a vendor of Twitter data that often licenses it to academics. Seasoned academics tended to be incredulous towards these (mostly) younger colleagues, he says. “Why would you do that? You can’t get tenure using that? Now there’s a whole generation coming through grad school that are going to write their master’s theses about social data.”

These days, becoming a doctor of social data looks like a secure line of work. Just as the number of papers based on Twitter research has soared, so has the number of conferences inviting academics to submit their findings. Indeed, Adar’s International Conference on Weblogs and Social Media now competes with a number of rival meetings.

What has made Twitter so popular with academics, though, isn’t just that it’s an enormous public dataset; it’s that it’s an enormous public dataset with a time scale, capturing thoughts from millions of people on all manner of subjects, recorded at a specific time (and, in some cases, in a specific place). You might think there’d be limitations to the things people would say, or tweet, on a public stage. Okay, scratch that: we all know better. You might think there’d be virtually no limitations to the things people would say, or tweet, on a public stage, and you’d be right: folks on Twitter are so unfiltered, in fact, that health researchers are using the platform to track food-poisoning outbreaks. (Take a moment to figure that one out...)

Such properties set Twitter apart from other data-rich social networking sites. Facebook, for example, has privacy issues and rolls out content not chronologically but according to the funky algorithm of its NewsFeed.

That’s not to say academic research with Twitter is particularly easy. While Twitter is a public platform, only a fraction of its data, or 1% of the Twitter stream—Twitter calls it the “spritzer”—is free and accessible to the public through Twitter’s application programming interface (API). Some select partners—some of whom are academics—have negotiated slightly more robust access via Twitter’s “garden hose” (10% of the stream). Complete access, via the Twitter firehose or even unlimited access to particular search queries, is costly and can be obtained only through a handful of vendors. (While the Library of Congress warehouses the whole Twitter archive, it does not have the capacity to address the many data requests it receives.)

Twitter, to much excitement and fanfare, announced a data grant program earlier this year to help academics shoulder the costs of such research. In truth, the company barely opened the spigot: of 1,300 applicants, just six, or roughly 0.5%, were awarded grants. Texifter is now making similar grants to a total of 36 research teams.

Academics using the platform for research are certainly getting better at it. Data filtering techniques are getting more precise and sophisticated. Meanwhile, scholars are learning what sort of research Twitter is good for. Adar says the platform’s data is best for understanding what’s going on in a particular place at a particular instant; it’s a less proven (yet more highly sought-after) tool for prediction.

There also remain concerns about just how representative the Twitter data sample is. As one scholar who dabbles in Twitter research told me, it’s hard to know how much you’re watching human behavior versus how much you’re watching human behavior on Twitter.

“Maybe it’s a fad, and maybe we’ll determine that studying the five million active users of Twitter, and talking about the whole world, is really kind of stupid,” says Texifter’s Shulman. “But I don’t think so. You’d be an idiot if you said Twitter doesn’t matter.”

Or, maybe Twitter does matter... but it’s still a fad. Adar has already seen signs that the platform isn’t as hot among academics as it used to be. “There’s still a lot of research on Twitter,” he says. “But some attention has shifted to other social media. When too many people are studying one thing, we have to move on to, you know, try to make novel contributions.”
