Social media analytics: a survey of techniques, tools and ...


Social media analytics: a survey of techniques, tools and platforms

Abstract

This paper is written for (social science) researchers seeking to analyze the wealth of social media now available. It presents a comprehensive review of software tools for social networking media, wikis, really simple syndication feeds, blogs, newsgroups, chat and news feeds. For completeness, it also includes introductions to social media scraping, storage, data cleaning and sentiment analysis. Although principally a review, the paper also provides a methodology and a critique of social media tools. Analyzing social media, in particular Twitter feeds for sentiment analysis, has become a major research and business activity due to the availability of web-based application programming interfaces (APIs) provided by Twitter, Facebook and News services. This has led to an ‘explosion’ of data services, software tools for scraping and analysis and social media analytics platforms. It is also a research area undergoing rapid change and evolution due to commercial pressures and the potential for using social media data for computational (social science) research. Using a simple taxonomy, this paper provides a review of leading software tools and how to use them to scrape, cleanse and analyze the spectrum of social media. In addition, it discusses the requirement of an experimental computational environment for social media research and presents as an illustration the system architecture of a social media (analytics) platform built by University College London. The principal contribution of this paper is to provide an overview (including code fragments) for scientists seeking to utilize social media scraping and analytics either in their research or business. The data retrieval techniques that are presented in this paper are valid at the time of writing this paper (June 2014), but they are subject to change since social media data scraping APIs are rapidly changing.

Introduction

Social media is defined as web-based and mobile-based Internet applications that allow the creation, access and exchange of user-generated content that is ubiquitously accessible (Kaplan and Haenlein 2010). Besides social networking media (e.g., Twitter and Facebook), for convenience, we will also use the term ‘social media’ to encompass really simple syndication (RSS) feeds, blogs, wikis and news, all typically yielding unstructured text and accessible through the web. Social media is especially important for research into computational social science that investigates questions (Lazer et al. 2009) using quantitative techniques (e.g., computational statistics, machine learning and complexity) and so-called big data for data mining and simulation modeling (Cioffi-Revilla 2010). This has led to numerous data services, tools and analytics platforms. However, this easy availability of social media data for academic research may change significantly due to commercial pressures. In addition, as discussed in Sect. 2, the tools available to researchers are far from ideal. They either give superficial access to the raw data or (for non-superficial access) require researchers to program analytics in a language such as Java.

Terminology

We start with definitions of some of the key techniques related to analyzing unstructured textual data:

Natural language processing (NLP): a field of computer science, artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages. Specifically, it is the process of a computer extracting meaningful information from natural language input and/or producing natural language output.
News analytics: the measurement of the various qualitative and quantitative attributes of textual (unstructured data) news stories. Some of these attributes are: sentiment, relevance and novelty.
Opinion mining: opinion mining (sentiment mining, opinion/sentiment extraction) is the area of research that attempts to make automatic systems to determine human opinion from text written in natural language.
Scraping: collecting online data from social media and other Web sites in the form of unstructured text; also known as site scraping, web harvesting and web data extraction.
Sentiment analysis: the application of natural language processing, computational linguistics and text analytics to identify and extract subjective information in source materials.
Text analytics: involves information retrieval (IR), lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization and predictive analytics.

Research challenges

Social media scraping and analytics provides a rich source of academic research challenges for social scientists, computer scientists and funding bodies. Challenges include:

Scraping: although social media data is accessible through APIs, due to the commercial value of the data, most of the major sources such as Facebook and Google are making it increasingly difficult for academics to obtain comprehensive access to their ‘raw’ data; very few social data sources provide affordable data offerings to academia and researchers. News services such as Thomson Reuters and Bloomberg typically charge a premium for access to their data. In contrast, Twitter has recently announced the Twitter Data Grants program, where researchers can apply to get access to Twitter’s public tweets and historical data in order to get insights from its massive set of data (Twitter has more than 500 million tweets a day).
Data cleansing: cleaning unstructured textual data (e.g., normalizing text), especially high-frequency streamed real-time data, still presents numerous problems and research challenges.
Holistic data sources: researchers are increasingly bringing together and combining novel data sources (social media data, real-time market & customer data and geospatial data) for analysis.
Data protection: once you have created a ‘big data’ resource, the data needs to be secured, ownership and IP issues resolved (i.e., storing scraped data is against most of the publishers’ terms of service), and users provided with different levels of access; otherwise, users may attempt to ‘suck’ all the valuable data from the database.
Data analytics: sophisticated analysis of social media data for opinion mining (e.g., sentiment analysis) still raises a myriad of challenges due to foreign languages, foreign words, slang, spelling errors and the natural evolution of language.
Analytics dashboards: many social media platforms require users to write APIs to access feeds or program analytics models in a programming language, such as Java. While reasonable for computer scientists, these skills are typically beyond most (social science) researchers. Non-programming interfaces are required for giving what might be referred to as ‘deep’ access to ‘raw’ data, for example, configuring APIs, merging social media feeds, combining holistic sources and developing analytical models.
Data visualization: the visual representation of data, whereby information is abstracted into some schematic form with the goal of communicating it clearly and effectively through graphical means. Given the magnitude of the data involved, visualization is becoming increasingly important.

Social media research and applications

Social media data is clearly the largest, richest and most dynamic evidence base of human behavior, bringing new opportunities to understand individuals, groups and society. Innovative scientists and industry professionals are increasingly finding novel ways of automatically collecting, combining and analyzing this wealth of data. Naturally, doing justice to these pioneering social media applications in a few paragraphs is challenging. Three illustrative areas are: business, bioscience and social science.

The early business adopters of social media analysis were typically companies in retail and finance. Retail companies use social media to harness their brand awareness, product/customer service improvement, advertising/marketing strategies, network structure analysis, news propagation and even fraud detection. In finance, social media is used for measuring market sentiment and news data is used for trading. As an illustration, Bollen et al. (2011) measured sentiment of a random sample of Twitter data, finding that Dow Jones Industrial Average (DJIA) prices are correlated with the Twitter sentiment 2–3 days earlier with 87.6 percent accuracy. Wolfram (2010) used Twitter data to train a Support Vector Regression (SVR) model to predict prices of individual NASDAQ stocks, finding ‘significant advantage’ for forecasting prices 15 min in the future.

In the biosciences, social media is being used to collect data on large cohorts for behavioral change initiatives and impact monitoring, such as tackling smoking and obesity or monitoring diseases. An example is Penn State University biologists (Salathé et al. 2012) who have developed innovative systems and techniques to track the spread of infectious diseases, with the help of news Web sites, blogs and social media.

Computational social science applications include: monitoring public responses to announcements, speeches and events, especially political comments and initiatives; insights into community behavior; social media polling of (hard to contact) groups; and early detection of emerging events, as with Twitter. For example, Lerman et al. (2008) use computational linguistics to automatically predict the impact of news on the public perception of political candidates. Yessenov and Misailovic (2009) use movie review comments to study the effect of various approaches in extracting text features on the accuracy of four machine learning methods: Naive Bayes, Decision Trees, Maximum Entropy and K-Means clustering. Lastly, Karabulut (2013) found that Facebook’s Gross National Happiness (GNH) exhibits peaks and troughs in-line with major public events in the USA.

Social media overview

For this paper, we group social media tools into:

Social media data: social media data types (e.g., social network media, wikis, blogs, RSS feeds and news, etc.) and formats (e.g., XML and JSON). This includes data sets and increasingly important real-time data feeds, such as financial data, customer transaction data, telecoms and spatial data.
Social media programmatic access: data services and tools for sourcing and scraping (textual) data from social networking media, wikis, RSS feeds, news, etc. These can be usefully subdivided into:
  Data sources, services and tools: where data is accessed by tools which protect the raw data or provide simple analytics. Examples include: Google Trends, SocialMention, SocialPointer and SocialSeek, which provide a stream of information that aggregates various social media feeds.
  Data feeds via APIs: where data sets and feeds are accessible via programmable HTTP-based APIs and return tagged data using XML or JSON, etc. Examples include Wikipedia, Twitter and Facebook.
Text cleaning and storage tools: tools for cleaning and storing textual data. Google Refine and DataWrangler are examples for data cleaning.
Text analysis tools: individual or libraries of tools for analyzing social media data once it has been scraped and cleaned. These are mainly natural language processing, analysis and classification tools, which are explained below.
  Transformation tools: simple tools that can transform textual input data into tables, maps, charts (line, pie, scatter, bar, etc.), timeline or even motion (animation over timeline), such as Google Fusion Tables, Zoho Reports, Tableau Public or IBM’s Many Eyes.
  Analysis tools: more advanced analytics tools for analyzing social data, identifying connections and building networks, such as Gephi (open source) or the Excel plug-in NodeXL.
Social media platforms: environments that provide comprehensive social media data and libraries of tools for analytics. Examples include: Thomson Reuters Machine Readable News, Radian 6 and Lexalytics.
  Social network media platforms: platforms that provide data mining and analytics on Twitter, Facebook and a wide range of other social network media sources.
  News platforms: platforms such as Thomson Reuters providing commercial news archives/feeds and associated analytics.

Social media methodology and critique

The two major impediments to using social media for academic research are firstly access to comprehensive data sets and secondly tools that allow ‘deep’ data analysis without the need to be able to program in a language such as Java. The majority of social media resources are commercial and companies are naturally trying to monetize their data. As discussed, it is important that researchers have access to open-source ‘big’ (social media) data sets and facilities for experimentation. Otherwise, social media research could become the exclusive domain of major companies, government agencies and a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Recently, there has been a modest response, as Twitter and Gnip are piloting a new program for data access, starting with 5 all-access data grants to select applicants.

Methodology

Research requirements can be grouped into: data, analytics and facilities.

Data

Researchers need online access to historic and real-time social media data, especially the principal sources, to conduct world-leading research:

Social network media: access to comprehensive historic data sets and also real-time access to sources, possibly with a (15 min) time delay, as with Thomson Reuters and Bloomberg financial data.
News data: access to historic data and real-time news data sets, possibly through the concept of ‘educational data licenses’ (cf. software license).
Public data: access to scraped and archived important public data, available through RSS feeds, blogs or open government databases.
Programmable interfaces: researchers also need access to simple application programming interfaces (APIs) to scrape and store other available data sources that may not be automatically collected.

Analytics

Currently, social media data is typically either available via simple general routines or requires the researcher to program their analytics in a language such as MATLAB, Java or Python. As discussed above, researchers require:

Analytics dashboards: non-programming interfaces are required for giving what might be termed as ‘deep’ access to ‘raw’ data.
Holistic data analysis: tools are required for combining (and conducting analytics across) multiple social media and other data sets.
Data visualization: researchers also require visualization tools whereby information that has been abstracted can be visualized in some schematic form with the goal of communicating information clearly and effectively through graphical means.

Facilities

Lastly, the sheer volume of social media data being generated argues for national and international facilities to be established to support social media research (cf. Wharton Research Data Services https://wrds-web.wharton.upenn.edu):

Data storage: the volume of social media data, current and projected, is beyond most individual universities and hence needs to be addressed at a national science foundation level. Storage is required not only for principal data sources (e.g., Twitter), but also for sources collected by individual projects and archived for future use by other researchers.
Computational facility: remotely accessible computational facilities are also required for: a) protecting access to the stored data; b) hosting the analytics and visualization tools; and c) providing computational resources such as grids and GPUs required for processing the data at the facility rather than transmitting it across a network.

Critique

Much needs to be done to support social media research. As discussed, the majority of current social media resources are commercial, expensive and difficult for academics to access in full.

Data

In general, access to important sources of social media data is frequently restricted and full commercial access is expensive.

Siloed data: most data sources (e.g., Twitter) have inherently isolated information, making it difficult to combine with other data sources.
Holistic data: in contrast, researchers are increasingly interested in accessing, storing and combining novel data sources (social media data, real-time financial market & customer data and geospatial data) for analysis. This is currently extremely difficult to do even for Computer Science departments.

Analytics

Analytical tools provided by vendors are often tied to a single data set, may be limited in analytical capability, and data charges make them expensive to use.

Facilities

There are an increasing number of powerful commercial platforms, such as the ones supplied by SAS and Thomson Reuters, but the charges are largely prohibitive for academic research. Either comparable facilities need to be provided by national science foundations or vendors need to be persuaded to introduce the concept of an ‘educational license.’

Social media data

Clearly, there is a large and increasing number of (commercial) services providing access to social networking media (e.g., Twitter, Facebook and Wikipedia) and news services (e.g., Thomson Reuters Machine Readable News). Equivalent major academic services are scarce. We start by discussing types of data and formats produced by these services.

Types of data

Although we focus on social media, as discussed, researchers are continually finding new and innovative sources of data to bring together and analyze. So when considering textual data analysis, we should consider multiple sources (e.g., social networking media, RSS feeds, blogs and news) supplemented by numeric (financial) data, telecoms data, geospatial data and potentially speech and video data. Using multiple data sources is certainly the future of analytics. Broadly, data subdivides into:

Historic data sets: previously accumulated and stored social/news, financial and economic data.
Real-time feeds: live data feeds from streamed social media, news services, financial exchanges, telecoms services, GPS devices and speech.

And into:

Raw data: unprocessed computer data straight from source that may contain errors or may be unanalyzed.
Cleaned data: correction or removal of erroneous (dirty) data caused by disparities, keying mistakes, missing bits, outliers, etc.
Value-added data: data that has been cleaned, analyzed, tagged and augmented with knowledge.

Text data formats

The four most common formats used to mark up text are: HTML, XML, JSON and CSV.

HTML: HyperText Markup Language (HTML) is, as is well known, the markup language for web pages and other information that can be viewed in a web browser. HTML consists of HTML elements, which include tags enclosed in angle brackets, within the content of the web page.
XML: Extensible Markup Language (XML) is the markup language for structuring textual data using … to define elements.
JSON: JavaScript Object Notation (JSON) is a text-based open standard designed for human-readable data interchange and is derived from JavaScript.
CSV: a comma-separated values (CSV) file contains the values in a table as a series of ASCII text lines organized such that each column value is separated by a comma from the next column’s value and each row starts a new line.

For completeness, HTML and XML are so-called markup languages (markup and content) that define a set of simple syntactic rules for encoding documents in a format both human readable and machine readable. A markup comprises start-tags, content text and end-tags.

Many feeds use JavaScript Object Notation (JSON), the lightweight data-interchange format, based on a subset of the JavaScript Programming Language. JSON is a language-independent text format that uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. JSON’s basic types are: Number, String, Boolean, Array (an ordered sequence of values, comma-separated and enclosed in square brackets) and Object (an unordered collection of key:value pairs). The JSON format is illustrated in Fig. 1 for a query on the Twitter API on the string ‘UCL,’ which returns two ‘text’ results from the Twitter user ‘uclnews.’

Fig. 1 JSON Example

Comma-separated values are not a single, well-defined format but rather refer to any text file that: (a) is plain text using a character set such as ASCII, Unicode or EBCDIC; (b) consists of text records (e.g., one record per line); (c) with records divided into fields separated by delimiters (e.g., comma, semicolon and tab); and (d) where every record has the same sequence of fields.

Social media providers

Social media data resources broadly subdivide into those providing:

Freely available databases: repositories that can be freely downloaded, e.g., Wikipedia (http://dumps.wikimedia.org) and the Enron e-mail data set available via http://www.cs.cmu.edu/~enron/.
Data access via tools: sources that provide controlled access to their social media data via dedicated tools, both to facilitate easy interrogation and also to stop users ‘sucking’ all the data from the repository. An example is Google’s Trends. These further subdivide into:
  Free sources: repositories that are freely accessible, but the tools protect or may limit access to the ‘raw’ data in the repository, such as the range of tools provided by Google.
  Commercial sources: data resellers that charge for access to their social media data. Gnip and DataSift provide commercial access to Twitter data through a partnership, and Thomson Reuters to news data.
Data access via APIs: social media data repositories providing programmable HTTP-based access to the data via APIs (e.g., Twitter, Facebook and Wikipedia).

Open-source databases

A major open source of social media is Wikipedia, which offers free copies of all available content to interested users (Wikimedia Foundation 2014). These databases can be used for mirroring, database queries and social media analytics. They include dumps from any Wikimedia Foundation project: http://dumps.wikimedia.org/, English Wikipedia dumps in SQL and XML: http://dumps.wikimedia.org/enwiki/, etc.

Another example of freely available data for research is the World Bank data, i.e., the World Bank Databank (http://databank.worldbank.org/data/databases.aspx), which provides over 40 databases, such as Gender Statistics, Health Nutrition and Population Statistics, Global Economic Prospects, World Development Indicators and Global Development Finance, and many others. Most of the databases can be filtered by country/region, series/topics or time (years and quarters). In addition, tools are provided to allow reports to be customized and displayed in table, chart or map formats.

Data access via tools

As discussed, most commercial services provide access to social media data via online tools, both to control access to the raw data and increasingly to monetize the data.

Freely accessible sources

Google with tools such as Trends and InSights is a good example of this category. Google is the largest ‘scraper’ in the world, but they do their best to ‘discourage’ scraping of their own pages. (For an introduction of how to surreptitiously scrape Google, and avoid being ‘banned,’ see http://google-scraper.squabbel.com.) Google’s strategy is to provide a wide range of packages, such as Google Analytics, rather than the programmable HTTP-based APIs that are more useful from a researcher’s viewpoint.

Figure 2 illustrates how Google Trends displays a particular search term, in this case ‘libor.’ Using Google Trends you can compare up to five topics at a time and also see how often those topics have been mentioned and in which geographic regions the topics have been searched for the most.

Fig. 2 Google Trends

Commercial sources

There is an increasing number of commercial services that scrape social networking media and then provide paid-for access via simple analytics tools. (The more comprehensive platforms with extensive analytics are reviewed in Sect. 8.) In addition, companies such as Twitter are both restricting free access to their data and licensing their data to commercial data resellers, such as Gnip and DataSift.

Gnip is the world’s largest provider of social data. Gnip was the first to partner with Twitter to make their social data available, and since then, it was the first to work with Tumblr, Foursquare, WordPress, Disqus, StockTwits and other leading social platforms. Gnip delivers social data to customers in more than 40 countries, and Gnip’s customers deliver social media analytics to more than 95 % of the Fortune 500. Real-time data from Gnip can be delivered as a ‘Firehose’ of every single activity or via PowerTrack, a proprietary filtering tool that allows users to build queries around only the data they need. PowerTrack rules can filter data streams based on keywords, geo boundaries, phrase matches and even the type of content or media in the activity. The company then offers enrichments to these data streams such as Profile Geo (to add significantly more usable geo data for Twitter), URL expansion and language detection to further enhance the value of the data delivered. In addition to real-time data access, the company also offers Historical PowerTrack and Search API access for Twitter, which give customers the ability to pull any Tweet since the first message on March 21, 2006.

Gnip provides access to premium (Gnip’s ‘Complete Access’ sources are publishers that have an agreement with Gnip to resell their data) and free data feeds (Gnip’s ‘Managed Public API Access’ sources provide access to normalized and consolidated free data from their APIs, although it requires Gnip’s paid services for the Data Collectors) via its dashboard (see Fig. 3). Firstly, the user only sees the feeds in the dashboard that were paid for under a sales agreement. To select a feed, the user clicks on a publisher and then chooses a specific feed from that publisher as shown in Fig. 3. Different types of feeds serve different types of use cases and correspond to different types of queries and API endpoints on the publisher’s source API. After selecting the feed, the user is assisted by Gnip to configure it with any required parameters before it begins collecting data. This includes adding at least one rule. Under ‘Get Data’ -> ‘Advanced Settings’ you can also configure how often your feed queries the source API for data (the ‘query rate’). Choose between the publisher’s native data format and Gnip’s Activity Streams format (XML for Enterprise Data Collector feeds).

Fig. 3 Gnip Dashboard, Publishers and Feeds

Data feed access via APIs

For researchers, arguably the most useful sources of social media data are those that provide programmable access via APIs, typically using HTTP-based protocols. Given their importance to academics, here, we review individually wikis, social networking media, RSS feeds, news, etc.

Wikimedia

Wikipedia (and wikis in general) provides academics with large open-source repositories of user-generated (crowd-sourced) content. What is not widely known is that Wikipedia provides HTTP-based APIs that allow programmable access and searching (i.e., scraping) and return data in a variety of formats including XML. In fact, the API is not unique to Wikipedia but part of MediaWiki’s (http://www.mediawiki.org/) open-source toolkit and hence can be used with any MediaWiki-based wikis.

The wiki HTTP-based API works by accepting requests containing one or more input arguments and returning strings, often in XML format, that can be parsed and used by the requesting client. Other formats supported include JSON, WDDX, YAML, or PHP serialized. Details can be found at: http://en.wikipedia.org/w/api.php?action=query&list=allcategories&acprop=size&acprefix=hollywood&format=xml. The HTTP request must contain: a) the requested ‘action,’ such as query, edit or delete operation; b) an authentication request; and c) any other supported actions. For example, the above request returns an XML string listing the first 10 Wikipedia categories with the prefix ‘hollywood.’ Vaswani (2011) provides a detailed description of how to scrape Wikipedia using an Apache/PHP development environment and an HTTP client capable of transmitting GET and PUT requests and handling responses.

Social networking media

As with Wikipedia, popular social networks, such as Facebook, Twitter and Foursquare, make a proportion of their data accessible via APIs. Although many social networking media sites provide APIs, not all sites (e.g., Bing, LinkedIn and Skype) provide API access for scraping data. While more and more social networks are shifting to publicly available content, many leading networks are restricting free access, even to academics. For example, Foursquare announced in December 2013 that it will no longer allow private check-ins on iOS 7, and has now partnered with Gnip to provide a continuous stream of anonymized check-in data. The data is available in two packages: the full Firehose access level and a filtered version via Gnip’s PowerTrack service. Here, we briefly discuss the APIs provided by Twitter and Facebook.

Twitter

The default account setting keeps users’ Tweets public, although users can protect their Tweets and make them visible only to their approved Twitter followers. However, less than 10 % of all the Twitter accounts are private. Tweets from public accounts (including replies and mentions) are available in JSON format through Twitter’s Search API for batch requests of past data and Streaming API for near real-time data.

Search API: query Twitter for recent Tweets containing specific keywords. It is part of the Twitter REST API v1.1 (it attempts to comply with the design principles of the REST architectural style, which stands for Representational State Transfer) and requires an authorized application (using OAuth, the open standard for authorization) before retrieving any results from the API.
Streaming API: a real-time stream of Tweets, filtered by user ID, keyword, geographic location or random sampling.
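
As a minimal sketch of the Search API workflow just described, the Python fragment below builds a keyword query URL and extracts the ‘text’ field from a JSON list of Tweet-like objects. The endpoint is the v1.1 search URL quoted in this paper (valid as of June 2014 and possibly no longer available); the function names and the hand-written sample payload are assumptions made for illustration, and no request is actually sent, since a real call would first need an OAuth-authorized application.

```python
import json
from urllib.parse import urlencode

# Endpoint as quoted in this paper (June 2014); it may have changed since.
SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(keywords, endpoint=SEARCH_ENDPOINT):
    """Return the GET URL for a keyword search (hypothetical helper).

    Only the URL is built; sending it would require an authorized
    (OAuth) application, which is outside this sketch.
    """
    return endpoint + "?" + urlencode({"q": keywords})


def extract_texts(payload):
    """Parse a JSON string holding a list of Tweet-like objects and
    return the value of each object's 'text' field (hypothetical helper)."""
    tweets = json.loads(payload)
    return [tweet["text"] for tweet in tweets if "text" in tweet]


# Hand-written sample, NOT real API output; the shape is an assumption.
SAMPLE_PAYLOAD = """
[
  {"created_at": "Mon Jun 02 10:00:00 +0000 2014", "text": "First sample tweet"},
  {"created_at": "Mon Jun 02 10:05:00 +0000 2014", "text": "Second sample tweet"}
]
"""

if __name__ == "__main__":
    print(build_search_url("APPLE"))
    print(extract_texts(SAMPLE_PAYLOAD))
```

Keeping URL construction separate from response parsing means the parsing step can be tried on stored JSON files without any network access or credentials.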

One may retrieve recent Tweets containing particular keywords through Twitter’s Search API (part of REST API v1.1) with the following API call: https://api.twitter.com/1.1/search/tweets.json?q=APPLE and real-time data using the streaming API call: https://stream.twitter.com/1/statuses/sample.json.

Twitter’s Streaming API allows data to be accessed via filtering (by keywords, user IDs or location) or by sampling of all updates from a select number of users. Default access level ‘Spritzer’ allows sampling of roughly 1 % of all public statuses, with the option to retrieve 10 % of all statuses via the ‘Gardenhose’ access level (more suitable for data mining and research applications). In social media, streaming APIs are often called Firehose: a syndication feed that publishes all public activities as they happen in one big stream. Twitter has recently announced the Twitter Data Grants program, where researchers can apply to get access to Twitter’s public tweets and historical data in order to get insights from its massive set of data (Twitter has more than 500 million tweets a day); research institutions and academics will not get the Firehose access level; instead, they will only get the data set needed for their research project. Researchers can apply for it at the following address: https://engineering.twitter.com/research/data-grants.

Twitter results are stored in a JSON array of objects containing the fields shown in Fig. 4. The JSON array consists of a list of objects matching the supplied filters and the search string, where each object is a Tweet and its structure is clearly specified by the object’s fields, e.g., ‘created_at’ and ‘from_user’. The example in Fig. 4 consists of the output of calling Twitter’s GET search API via http://search.twitter.com/search.json?q=financial%20times&rpp=1&include_entities=true&result_type=mixed where the parameters specify that the search query is ‘financial times,’ one result per page, each Tweet should have a node called ‘entities’ (i.e., metadata about the Tweet) and list ‘mixed’ results types, i.e., include both popular and real-time results in the response.

Fig. 4 Example Output in JSON for Twitter REST API v1

Facebook

Facebook’s privacy issues are more complex than Twitter’s, meaning that a lot of status messages are harder to obtain than Tweets, requiring ‘open authorization’ status from users. Facebook currently stores all data as objects (Footnote 1) and has a series of APIs, ranging from the Graph and Public Feed APIs to Keyword Insight API. In order to access the properties of an object, its unique ID must be known to make the API call. Facebook’s Search API (part of Facebook’s Graph API) can be accessed by calling https://graph.facebook.com/search?q=QUERY&type=page. The detailed API query format is shown in Fig. 5. Here, ‘QUERY’ can be replaced by any search term, and ‘page’ can be replaced with ‘post,’ ‘user,’ ‘page,’ ‘event,’ ‘group,’ ‘place,’ ‘checkin,’ ‘location’ or ‘placetopic.’ The results of this search will contain the unique ID for each object. When returning the individual ID for a particular search result, one can use https://graph.facebook.com/ID to obtain further page details such as number of ‘likes.’ This kind of information is of interest to companies when it comes to brand awareness and competition monitoring.

Fig. 5 Facebook Graph API Search Query Format

The Facebook Graph API search queries require an access token included in the request. Searching for pages and places requires an ‘app access token’, whereas searching for other types requires a user access token. Replacing ‘page’ with ‘post’ in the aforementioned search URL will return all public statuses containing this search term (Footnote 2). Batch requests can be sent by following the procedure outlined here: https://developers.facebook.com/docs/reference/api/batch/. Information on retrieving real-time updates can be found here: https://developers.facebook.com/docs/reference/api/realtime/. Facebook also returns data in JSON format, so it can be retrieved and stored using the same methods as used with data from Twitter, although the fields are different depending on the search type, as illustrated in Fig. 6.

Fig. 6 Facebook Graph API Search Results for q=’Centrica’ and type=’page’

RSS feeds

A large number of Web sites already provide access to content via RSS feeds. This is the syndication standard for publishing regular updates to web-based content based on a type of XML file that resides on an Internet server. For Web sites, RSS feeds can be created manually or automatically (with software). An RSS Feed Reader reads the RSS feed file, finds what is new, converts it to HTML and displays it. The program fragment in Fig. 7 shows the code for the control and channel statements for the RSS feed. The channel statements define the overall feed or channel, with one set of channel statements in the RSS file.

Fig. 7 Example RSS Feed Control and Channel Statements

Blogs, news groups and chat services

Blog scraping is the process of scanning through a large number of blogs, usually daily, searching for and copying content. This process is conducted through automated software. Figure 8 illustrates example code for Blog Scraping. This involves getting a Web site’s source code via Java’s URL Class, which can eventually be parsed via Regular Expressions to capture the target content.

Fig. 8 Example Code for Blog Scraping

News feeds

News feeds are delivered in a variety of textual formats, often as machine-readable XML documents, JSON or CSV files. They include numerical values, tags and other properties that tend to represent underlying news stories. For testing purposes, historical information is often delivered via flat files, while live data for production is processed and delivered through direct data feeds or APIs. Figure 9 shows a snippet of the software calls to retrieve filtered NY Times articles.

Fig. 9 Scraping New York Times Articles

Having examined the ‘classic’ social media data feeds, as an illustration of scraping innovative data sources, we will briefly look at geospatial feeds.

Geospatial feeds

Much of the ‘geospatial’ social media data come from mobile devices that generate location- and time-sensitive data. One can differentiate between four types of mobile social media feeds (Kaplan 2012):

Location and time sensitive: exchange of messages with relevance for one specific location at one specific point in time (e.g., Foursquare).
Location sensitive only: exchange of messages with relevance for one specific location, which are tagged to a certain place and read later by others (e.g., Yelp and Qype).
Time sensitive only: transfer of traditional social media applications to mobile devices to increase immediacy (e.g., posting Twitter messages or Facebook status updates).
Neither location nor time sensitive: transfer of traditional social media applications to mobile devices (e.g., watching a YouTube video or reading a Wikipedia entry).

With increasingly advanced mobile devices, notably smartphones, the content (photos, SMS messages, etc.) has geographical identification added, called ‘geotagged.’ These geospatial metadata are usually latitude and longitude coordinates, though they can also include altitude, bearing, distance, accuracy data or place names. GeoRSS is an emerging standard to encode the geographic location into a web feed, with two primary encodings: GeoRSS Geography Markup Language (GML) and GeoRSS Simple.

Example tools are GeoNetwork Opensource, a free comprehensive cataloging application for geographically referenced information, and FeedBurner, a web feed provider that can also provide geotagged feeds, if the specified feeds settings allow it.

As an illustration, Fig. 10 shows the pseudo-code for analyzing a geospatial feed.

Fig. 10 Pseudo-code for Analyzing a Geospatial Feed

Text cleaning, tagging and storing

The importance of ‘quality versus quantity’ of data in social media scraping and analytics cannot be overstated (i.e., garbage in and garbage out). In fact, many details of analytics models are defined by the types and quality of the data. The nature of the data will also influence the database and hardware used.

Naturally, unstructured textual data can be very noisy (i.e., dirty). Hence, data cleaning (or cleansing, scrubbing) is an important area in social media analytics. The process of data cleaning may involve removing typographical errors or validating and correcting values against a known list of entities. Specifically, text may contain misspelled words, quotations, program codes, extra spaces, extra line breaks, special characters, foreign words, etc. So in order to achieve high-quality text mining, it is necessary to conduct data cleaning at the first step: spell checking, removing duplicates, finding and replacing text, changing the case of text, removing spaces and non-printing characters from text, fixing numbers, number signs and outliers, fixing dates and times, transforming and rearranging columns, rows and table data, etc.

Having reviewed the types and sources of raw data, we now turn to ‘cleaning’ or ‘cleansing’ the data to remove incorrect, inconsistent or missing information. Before discussing strategies for data cleaning, it is essential to identify possible data problems (Narang 2009):

Missing data: when a piece of information existed but was not included for whatever reason in the raw data supplied. Problems occur with: a) numeric data when ‘blank’ or a missing value is erroneously substituted by ‘zero’ which is then taken (for example) as the current price; and b) textual data when a missing word (like ‘not’) may change the whole meaning of a sentence.
Incorrect data: when a piece of information is incorrectly specified (such as decimal errors in numeric data or a wrong word in textual data) or is incorrectly interpreted (such as a system assuming a currency value is in $ when in fact it is in £ or assuming text is in US English rather than UK English).
Inconsistent data: when a piece of information is inconsistently specified. For example, with numeric data, this might be using a mixture of formats for dates: 2012/10/14, 14/10/2012 or 10/14/2012. For textual data, it might be as simple as: using the same word in a mixture of cases, mixing English and French in a text message, or placing Latin quotes in an otherwise English text.

Cleansing data

A traditional approach to text data cleaning is to ‘pull’ data into a spreadsheet or spreadsheet-like table and then reformat the text. For example, Google Refine (Footnote 3) is a standalone desktop application for data cleaning and transformation to various formats. Transformation expressions are written in proprietary Google Refine Expression Language (GREL) or JYTHON (an implementation of the Python programming language written in Java). Figure 11 illustrates text cleansing.

Fig. 11 Text Cleansing Pseudo-code

Tagging unstructured data

Since most of the social media data is generated by humans and therefore is unstructured (i.e., it lacks a pre-defined structure or data model), an algorithm is required to transform it into structured data to gain any insight. Therefore, unstructured data need to be preprocessed, tagged and then parsed in order to quantify/analyze the social media data. Adding extra information to the data (i.e., tagging the data) can be performed manually or via rules engines, which seek patterns or interpret the data using techniques such as data mining and text analytics. Algorithms exploit the linguistic, auditory and visual structure inherent in all of the forms of human communication. Tagging the unstructured data usually involves tagging the data with metadata or part-of-speech (POS) tagging. Clearly, the unstructured nature of social media data leads to ambiguity and irregularity when it is being processed by a machine in an automatic fashion. Using a single data set can provide some interesting insights. However, combining more data sets and processing the unstructured data can result in more valuable insights, allowing us to answer questions that were impossible beforehand.

Storing data

As discussed, the nature of the social media data is highly influential on the design of the database and possibly the supporting hardware. It is also important to note that each social platform has very specific (and narrow) rules around how its data can be stored and used. These can be found in the Terms of Service for each platform. For completeness, databases comprise:

Flat file: a flat file is a two-dimensional database (somewhat like a spreadsheet) containing records that have no structured interrelationship, that can be searched sequentially.
Relational database: a database organized as a set of formally described tables to recognize relations between stored items of information, allowing more complex relationships among the data items. Examples are row-based SQL databases and column-based kdb+ used in finance.
noSQL databases: a class of database management system (DBMS) identified by its non-adherence to the widely used relational database management system (RDBMS) model. noSQL/newSQL databases are characterized as: being non-relational, distributed, open-source and horizontally scalable.

Apache (noSQL) databases and tools

The growth of ultra-large Websites such as Facebook and Google has led to the development of noSQL databases as a way of breaking through the speed constraints that relational databases incur. A key driver has been Google’s MapReduce, i.e., the software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers (Chandrasekar and Kowsalya 2011). It was developed at Google for indexing Web pages and replaced their original indexing algorithms and heuristics in 2004. The model is inspired by the ‘Map’ and ‘Reduce’ functions commonly used in functional programming. MapReduce (conceptually) takes as input a list of records, and the ‘Map’ computation splits them among the different computers in a cluster. The result of the Map computation is a list of key/value pairs. The corresponding ‘Reduce’ computation takes each set of values that has the same key and combines them into a single value. A MapReduce program is composed of a ‘Map()’ procedure for filtering and sorting and a ‘Reduce()’ procedure for a summary operation (e.g., counting and grouping). Figure 12 provides a canonical example application of MapReduce. This example is a process to count the appearances of each different word in a set of documents (MapReduce 2011).

Fig. 12 The Canonical Example Application of MapReduce

Apache open-source software

The research community is increasingly using Apache software for social media analytics. Within the Apache Software Foundation, three levels of software are relevant:

Cassandra/hive databases: Apache Cassandra is an open source (noSQL) distributed DBMS providing a structured ‘key-value’ store. Key-value stores allow an application to store its data in a schema-less way. Related noSQL database products include: Apache Hive, Apache Pig and MongoDB, a scalable and high-performance open-source database designed to handle document-oriented storage. Since noSQL databases are ‘structure-less,’ it is necessary to have a companion SQL database to retain and map the structure of the corresponding data.
Hadoop platform: a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. An application is broken down into numerous small parts (also called fragments or blocks) that can be run on systems with thousands of nodes involving thousands of terabytes of storage.
Mahout: provides implementations of distributed or otherwise scalable analytics (machine learning) algorithms running on the Hadoop platform. Mahout (Footnote 4) supports four classes of algorithms: a) clustering (e.g., K-Means, Fuzzy C-Means) that groups text into related groups; b) classification (e.g., Complementary Naive Bayes classifier) that uses supervised learning to classify text; c) frequent item set mining that takes a set of item groups and identifies which individual items usually appear together; and d) recommendation mining (e.g., user- and item-based recommenders) that takes users’ behavior and from that tries to find items users might like.
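The word-count process described above for MapReduce (Fig. 12) can be sketched on a single machine in plain Python. This is only an illustration of the map, group-by-key and reduce phases, not Hadoop or Google code; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count`, and the sample documents, are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values that share one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each document (or block) would be mapped on a different node;
    # here the map calls simply run one after another.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input, not taken from the paper.
docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(docs)
print(counts["the"], counts["fox"], counts["dog"])
```

Keeping the three phases as separate functions mirrors the description in the text: only `map_phase` and `reduce_phase` would be written by the developer, while the grouping step is normally supplied by the framework.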
Social media analytics techniques

As discussed, opinion mining (or sentiment analysis) is an attempt to take advantage of the vast amounts of user-generated text and news content online. One of the primary characteristics of such content is its textual disorder and high diversity. Here, natural language processing, computational linguistics and text analytics are deployed to identify and extract subjective information from source text. The general aim is to determine the attitude of a writer (or speaker) with respect to some topic or the overall contextual polarity of a document.

Computational science techniques

Automated sentiment analysis of digital texts uses elements from machine learning such as latent semantic analysis, support vector machines, bag-of-words model and semantic orientation (Turney 2002). In simple terms, the techniques employ three broad areas:

Computational statistics: refers to computationally intensive statistical methods including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation and principal components analysis.
Machine learning: a system capable of the autonomous acquisition and integration of knowledge learnt from experience, analytical observation, etc. (Murphy 2012). These sub-symbolic systems further subdivide into:
  Supervised learning such as Regression Trees, Discriminant Function Analysis, Support Vector Machines.
  Unsupervised learning such as Self-Organizing Maps (SOM), K-Means.
  Machine Learning aims to solve the problem of having huge amounts of data with many variables and is commonly used in areas such as pattern recognition (speech, images), financial algorithms (credit scoring, algorithmic trading) (Nuti et al. 2011), energy forecasting (load, price) and biology (tumor detection, drug discovery). Figure 13 illustrates the two learning types of machine learning and their algorithm categories.
  Fig. 13 Machine Learning Overview
Complexity science: complex simulation models of difficult-to-predict systems derived from statistical physics, information theory and nonlinear dynamics. The realm of physicists and mathematicians.
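As a concrete illustration of supervised learning over a bag-of-words representation, the sketch below trains a very small multinomial Naive Bayes text classifier in plain Python. It is a minimal example under stated assumptions, not the code of any tool discussed in this paper: the class name `TinyNaiveBayes`, the four training sentences and the use of add-one (Laplace) smoothing are all choices made for this illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (illustrative sketch)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # number of documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # log prior plus the sum of log word likelihoods for this class
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Add-one smoothing (an assumption) avoids log(0) for unseen words.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled training data.
texts = ["wonderful elegant phone", "amazing cool camera",
         "horrible poor battery", "disgusting flakey screen"]
labels = ["pos", "pos", "neg", "neg"]

model = TinyNaiveBayes().fit(texts, labels)
print(model.predict("cool elegant screen"))
```

Working in log space keeps the product of many small probabilities from underflowing. An unsupervised method such as K-Means would instead receive the same word counts without the `labels` list.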
These techniques are deployed in two ways:

Data mining: knowledge discovery that extracts hidden patterns from huge quantities of data, using sophisticated differential equations, heuristics, statistical discriminators (e.g., hidden Markov models), and artificial intelligence machine learning techniques (e.g., neural networks, genetic algorithms and support vector machines).
Simulation modeling: simulation-based analysis that tests hypotheses. Simulation is used to attempt to predict the dynamics of systems so that the validity of the underlying assumption can be tested.

Stream processing

Lastly, we should mention stream processing (Botan et al. 2010). Increasingly, analytics applications that consume real-time social media, financial ‘ticker’ and sensor networks data need to process high-volume temporal data with low latency. These applications require support for online analysis of rapidly changing data streams. However, traditional database management systems (DBMSs) have no pre-defined notion of time and cannot handle data online in near real time. This has led to the development of Data Stream Management Systems (DSMSs) (Hebrail 2008), which process data in main memory without storing it on disk, handle transient data streams on-line and process continuous queries on these data streams. Example commercial systems include: Oracle CEP engine, StreamBase and Microsoft’s StreamInsight (Chandramouli et al. 2010).

Sentiment analysis

Sentiment is about mining attitudes, emotions and feelings; it concerns subjective impressions rather than facts. Generally speaking, sentiment analysis aims to determine the attitude expressed by the text writer or speaker with respect to the topic or the overall contextual polarity of a document (Mejova 2009). Pang and Lee (2008) provide a thorough documentation on the fundamentals and approaches of sentiment classification and extraction, including sentiment polarity, degrees of positivity, subjectivity detection, opinion identification, non-factual information, term presence versus frequency, POS (parts of speech), syntax, negation, topic-oriented features and term-based features beyond term unigrams.

Sentiment classification

Sentiment analysis divides into specific subtasks:
Sentiment context: to extract opinion, one needs to know the ‘context’ of the text, which can vary significantly from specialist review portals/feeds to general forums where opinions can cover a spectrum of topics (Westerski 2008).
Sentiment level: text analytics can be conducted at the document, sentence or attribute level.
Sentiment subjectivity: deciding whether a given text expresses an opinion or is factual (i.e., without expressing a positive/negative opinion).
Sentiment orientation/polarity: deciding whether an opinion in a text is positive, neutral or negative.
Sentiment strength: deciding the ‘strength’ of an opinion in a text: weak, mild or strong.

Perhaps the most difficult analysis is identifying sentiment orientation/polarity and strength, i.e., positive (wonderful, elegant, amazing, cool), neutral (fine, ok) and negative (horrible, disgusting, poor, flakey, sucks), owing to slang. A popular approach is to assign orientation/polarity scores (+1, 0, −1) to all words: positive opinion (+1), neutral opinion (0) and negative opinion (−1). The overall orientation/polarity score of the text is the sum of orientation scores of all ‘opinion’ words found. However, there are various potential problems in this simplistic approach, such as negation (e.g., there is nothing I hate about this product). One method of estimating sentiment orientation/polarity of the text is pointwise mutual information (PMI), a measure of association used in information theory and statistics.

Supervised learning methods

There are a number of popular computational statistics and machine learning techniques used for sentiment analysis. For a good introduction, see (Khan et al. 2010). Techniques include:

Naïve Bayes (NB): a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions (when features are independent of one another within each class).
Maximum entropy (ME): the probability distribution that best represents the current state of knowledge is the one with largest information-theoretical entropy.
Support vector machines (SVM): supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.
Logistic regression (LR) model: a type of regression analysis used for predicting the outcome of a categorical (a variable that can take on a limited number of categories) criterion variable based on one or more predictor variables.
Latent semantic analysis: an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text (Kobayashi and Takeda 2000).

The bag-of-words model is a simplifying representation commonly used in natural language processing and IR, where a sentence or a document is represented as an unordered collection of words, disregarding grammar and even word order. This is a model traditionally applied to sentiment analysis thanks to its simplicity.

Naïve Bayes classifier (NBC)

As an example of sentiment analysis, we will describe briefly a Naive Bayes classifier (Murphy 2006). The Naive Bayes classifier is general purpose, simple to implement and works well for a range of applications. It classifies data in two steps:

Training step: using the training samples, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class.
Analysis/testing step: for any unseen test sample, the method computes the posterior probability of that sample belonging to each class. The method then classifies the test sample according to the largest posterior probability.

Using the Naïve Bayes classifier, the classifier calculates the probability for a text to belong to each of the categories you test against. The category with the highest probability for the given text wins:

$$\text{classify}(\text{word}_1, \text{word}_2, \ldots, \text{word}_n) = \underset{\text{cat}}{\arg\max}\; P(\text{cat}) \prod_{i=1}^{n} P(\text{word}_i \mid \text{cat})$$

Figure 14 provides an example of sentiment classification using a Naïve Bayes classifier in Python. There are a number of Naïve Bayes classifier programs available in Java, including the jBNC toolkit (http://jbnc.sourceforge.net), WEKA (www.cs.waikato.ac.nz/ml/weka) and Alchemy API (www.alchemyapi.com/api/demo.html).

Fig.
14 Sentiment Classification Example using Python

We next look at the range of Social Media tools available, starting with ‘tools’ and ‘toolkits,’ and in the subsequent chapter at ‘comprehensive’ social media platforms. Since there are a large number of social media textual data services, tools and platforms, we will restrict ourselves to examining a few leading examples.

Social media analytics tools

Opinion mining tools are crowded with (commercial) providers, most of which are skewed toward sentiment analysis of customer feedback about products and services. Fortunately, there is a vast spectrum of tools for textual analysis ranging from simple open-source tools to libraries, multi-function commercial toolkits and platforms. This section focuses on individual tools and toolkits for scraping, cleaning and analytics, and the next chapter looks at what we call social media platforms that provide both archive data and real-time feeds, as well as sophisticated analytics tools.

Scientific programming tools

Popular scientific analytics libraries and tools have been enhanced to provide support for sourcing, searching and analyzing text. Examples include: R, used for statistical programming; MATLAB, used for numeric scientific programming; and Mathematica, used for symbolic scientific programming (computer algebra). Data processing and data modeling, e.g., regression analysis, are straightforward using MATLAB, which provides time-series analysis, GUI and array-based statistics. MATLAB is significantly faster than the traditional programming languages and can be used for a wide range of applications. Moreover, the extensive built-in plotting functions make it a comprehensive analytics toolkit. More computationally powerful algorithms can be developed using it in conjunction with the packages (e.g., FastICA in order to perform independent component analysis). Python can be used for (natural) language detection, title and content extraction, query matching and, when used in conjunction with a module such as scikit-learn, it can be trained to perform sentiment analysis, e.g., using a Naïve Bayes classifier. Another example, Apache UIMA (Unstructured Information Management Architecture), is an open-source project that analyzes ‘big data’ and discovers information that is relevant to the user.

Business toolkits

Business Toolkits are commercial suites of tools that allow users to source, search and analyze text for a range of commercial purposes. SAS Sentiment Analysis Manager, part of the SAS Text Analytics program, can be used for scraping content sources, including mainstream Websites and social media outlets, as well as internal organizational text sources, and creates reports that describe the expressed feelings of consumers, customers and competitors in real time. RapidMiner (Hirudkar and Sherekar 2013) is a popular toolkit offering an open-source Community Edition released under the GNU AGPL and also an Enterprise Edition offered under a commercial license. RapidMiner provides data mining and machine learning procedures including: data loading and transformation (Extract, Transform, Load, a.k.a. ETL), data preprocessing and visualization, modeling, evaluation, and deployment. RapidMiner is written in Java and uses learning schemes and attribute evaluators from the Weka machine learning environment and statistical modeling schemes from the R project. Other examples are Lexalytics, which provides a commercial sentiment analysis engine for many OEM and direct customers, and IBM SPSS Statistics, one of the most used programs for statistical analysis in social science.

Social media monitoring tools

Social media monitoring tools are sentiment analysis tools for tracking and measuring what people are saying (typically) about a company or its products, or any topic across the web’s social media landscape. In the area of social media monitoring, examples include: Social Mention (http://socialmention.com/), which provides social media alerts similarly to Google Alerts; Amplified Analytics (http://www.amplifiedanalytics.com/), which focuses on product reviews and marketing information; Lithium Social Media Monitoring; and Trackur, which is an online reputation monitoring tool that tracks what is being said on the Internet. Google also provides a few useful free tools. Google Trends shows how often a particular search term is entered relative to the total search volume. Another tool built around Google Search is Google Alerts, a content change detection tool that provides notifications automatically. Google also acquired FeedBurner, an RSS feeds management service, in 2007.

Text analysis tools

Text analysis tools are broad-based tools for natural language processing and text analysis. Examples of companies in the text analysis area include: OpenAmplify and Jodange, whose tools automatically filter and aggregate thoughts, feelings and statements from traditional and social media. There are also a large number of freely available tools produced by academic groups and non-governmental organizations (NGO) for sourcing, searching and analyzing opinions. Examples include Stanford NLP group tools and LingPipe, a suite of Java libraries for the linguistic analysis of human language (Teufl et al. 2010). A variety of open-source text analytics tools are available, especially for sentiment analysis. A popular text analysis tool, which is also open source, is Python NLTK (Natural Language Toolkit, www.nltk.org/), which includes open-source Python modules, linguistic data and documentation for text analytics. Another one is GATE (http://gate.ac.uk/sentiment). We should also mention Lexalytics Sentiment Toolkit, which performs automatic sentiment analysis on input documents. It is powerful when used on a large number of documents, but it does not perform data scraping. Other commercial software for text mining includes: AeroText, Attensity, Clarabridge, IBM LanguageWare, SPSS Text Analytics for Surveys, Language Computer Corporation, STATISTICA Text Miner and WordStat.

Data visualization tools

The data visualization tools provide business intelligence (BI) capabilities and allow different types of users to gain insights from the ‘big’ data. The users can perform exploratory analysis through interactive user interfaces available on the majority of devices, with a recent focus on mobile devices (smartphones and tablets). The data visualization tools help the users identify patterns, trends and relationships in the data which were previously latent. Fast ad hoc visualization on the data can reveal patterns and outliers, and it can be performed on large-scale data set frameworks, such as Apache Hadoop or Amazon Kinesis. Two notable visualization tools are SAS Visual Analytics and Tableau.

Case study: SAS Sentiment Analysis and Social Media Analytics

SAS is a leading advanced analytics software suite for BI, data management and predictive analytics. SAS Sentiment Analysis (SAS Institute 2013) automatically rates and classifies opinions. It also performs data scraping from Websites, social media and internal file systems. Then, it processes the content in a unified format to evaluate relevance with regard to its pre-defined topics. SAS Sentiment Analysis identifies trends and emotional changes. Experts can refine the sentiment models through an interactive workbench. The tool automatically assigns sentiment scores to the input documents as they are retrieved in real time. SAS Sentiment Analysis combines statistical modeling and linguistics (rule-based natural language processing techniques) in order to output accurate sentiment analysis results. The tool monitors and evaluates sentiment changes over time; it extracts sentiments in real time as the scraped data is being retrieved and generates reports showing patterns and detailed reactions. The software identifies where (i.e., on what channel) the topic is being discussed and quantifies perceptions in the market as the software scrapes and analyzes both internal and external content about your organization (or the concept you are analyzing) and competitors, identifying positive, neutral, negative or ‘no sentiment’ texts in real time.

SAS Sentiment Analysis and SAS Social Media Analytics have a user-friendly interface for developing models; users can upload sentiment analysis models directly to the server in order to minimize the manual model deployment. More advanced users can use the interactive workbench to refine their models. The software includes graphics to illustrate instantaneously the text classification (i.e., positive, negative, neutral or unclassified) and point-and-click exploration in order to drill the classified text into detail. The tool also provides some workbench functionality through APIs, allowing for automatic/programmatic integration with other modules/projects. Figure
15 illustrates the SAS Social Media Analytics graphical reports, which provide user-friendly sentiment insights. The SAS software has crawling plugins for the most popular social media sites, including Facebook, Twitter, Bing, LinkedIn, Flickr and Google. It can also be customized to crawl any Website using the mark-up matcher; this provides a point-and-click interface to indicate what areas need to be extracted from an HTML or XML document. SAS Social Media Analytics gathers online conversations from popular networking sites (e.g., Facebook and Twitter), blogs and review sites (e.g., TripAdvisor and Priceline), and scores the data for influence and sentiment. It provides visualization tools for real-time tracking; it allows users to submit customized queries and returns a geographical visualization with brand-specific commentary from Twitter, as illustrated in Fig. 16.

Fig. 15 Graphical Reports with Sentiment Insights

Fig. 16 SAS Visualization of Real-Time Tracking via Twitter

Social media analytics platforms

Here, we examine comprehensive social media platforms that combine social media archives, data feeds, data mining and data analysis tools. Simply put, the platforms are different from tools and toolkits since platforms are more comprehensive and provide both tools and data. They broadly subdivide into:

News platforms: platforms such as Thomson Reuters providing news archives/feeds and associated analytics and targeting companies such as financial institutions seeking to monitor market sentiment in news.
Social network media platforms: platforms that provide data mining and analytics on Twitter, Facebook and a wide range of other social network media sources. Providers typically target companies seeking to monitor sentiment around their brands or products.
News platforms

The two most prominent business news feed providers are Thomson Reuters and Bloomberg. Computers read news in real time and automatically provide key indicators and meaningful insights. The news items are automatically retrieved, analyzed and interpreted in a few milliseconds. The machine-readable news indicators can potentially improve quantitative strategies, risk management and decision making. Examples of machine-readable news include: Thomson Reuters Machine Readable News, Bloomberg’s Event-Driven Trading Feed and AlphaFlash (Deutsche Börse’s machine-readable news feed). Thomson Reuters Machine Readable News (Thomson Reuters 2012a, b, c) has Reuters News content dating back to 1987, and comprehensive news from over 50 third parties dating back to 2003, such as PR Newswire, Business Wire and the Regulatory News Service (LSE). The feed offers full text and comprehensive metadata via streaming XML. Thomson Reuters News Analytics uses Natural Language Processing (NLP) techniques to score news items on tens of thousands of companies and nearly 40 commodities and energy topics. Items are measured across the following dimensions:

Author sentiment: metrics for how positive, negative or neutral the tone of the item is, specific to each company in the article.
Relevance: how relevant or substantive the story is for a particular item.
Volume analysis: how much news is happening on a particular company.
Uniqueness: how new or repetitive the item is over various time periods.
Headline analysis: denotes special features such as broker actions, pricing commentary, interviews, exclusives and wrap-ups.
Social network media platforms

Attensity, Brandwatch, Salesforce Marketing Cloud (previously called Radian6) and Sysomos MAP (Media Analysis Platform) are examples of social media monitoring platforms, which measure demographics, influential topics and sentiments. They include text analytics and sentiment analysis on online consumer conversations and provide user-friendly interfaces for customizing the search query, dashboards, reports and file export features (e.g., to Excel or CSV format). Most of the platforms scrape a range of social network media using a distributed crawler that targets: micro-blogging (Twitter via full Twitter Firehose), blogs (Blogger, WordPress, etc.), social networks (Facebook and MySpace), forums, news sites, images sites (Flickr) and corporate sites. Some of the platforms provide multi-language support for widely used languages (e.g., English, French, German, Italian and Spanish).

Sentiment analysis platforms use two main methodologies. One involves a statistical or model-based approach wherein the system learns to assess sentiment by analyzing large quantities of pre-scored material. The other method utilizes a large dictionary of pre-scored phrases.

RapidMiner (Footnote 5) is a platform which combines data mining and data analysis, which, depending on the requirements, can be open source. It uses the WEKA machine learning library and provides access to data sources such as Excel, Access, Oracle, IBM, MySQL, PostgreSQL and Text files.

Mozenda provides a point-and-click user interface for extracting specific information from the Websites and allows automation and data export to CSV, TSV or XML files.

DataSift provides access to both real-time and historical social data from the leading social networks and millions of other sources, enabling clients to aggregate, filter and gain insights and discover trends from the billions of public social conversations. Once the data is aggregated and processed (DataSift can filter and add context, such as enrichments like language processing, geodata and demographics, and categorization like spam detection, intent identification and machine learning), the customers can use pre-built integrations with popular BI tools, application and developer tools to deliver the data into their businesses, or use the DataSift APIs to stream real-time data into their applications.

There are a growing number of social media analytics platforms being founded nowadays. Other notable platforms that handle sentiment and semantic analysis of Web and Web 2.0-sourced material include Google Analytics, HP Autonomy IDOL (Intelligent Data Operating Layer), IBM SPSS Modeler, Adobe Social Analytics, GraphDive, Keen IO, Mass Relevance, Parse.ly, Viral Heat, Socialbakers, Dachis Group, evolve24, OpenAmplify and AdmantX. Recently, more and more specific social analytics platforms have emerged. One of them is iSpot.tv, which launched its own social media analytics platform that matches television ads with mentions on Twitter and Facebook. It provides real-time reports about when and where an ad appears, together with what people are saying about it on social networks (iSpot.tv monitors almost 80 different networks). Thomson Reuters has recently announced that it is now incorporating Twitter sentiment analysis for the Thomson Reuters Eikon market analysis and trading platform, providing visualizations and charts based on the sentiment data. In the previous year, Bloomberg incorporated tweets related to specific companies in a wider data stream.

Case study: Thomson Reuters News Analytics

Thomson Reuters News Analytics (TRNA) provides a huge news archive with analytics to read and interpret news, offering meaningful insights. TRNA scores news items on over 25,000 equities and nearly 40 topics (commodities and energy). The platform scrapes and analyzes news data in real time and feeds the data into other programs/projects or quantitative strategies. TRNA uses an NLP system from Lexalytics, one of the linguistics technology leaders, that can track news sentiment over time, and scores text across the various dimensions as mentioned in Sect. 8.1. The platform’s text scoring and metadata has more than 80 fields (Thomson Reuters 2010) such as:

Item type: stage of the story: Alert, Article, Updates or Corrections.
Item genre: classification of the story, i.e., interview, exclusive and wrap-up.
Headline: alert or headline text.
Relevance: varies from 0 to 1.0.
Prevailing sentiment: can be 1, 0 or −1.
Positive, neutral, negative: more detailed sentiment indication.
Broker action: denotes broker actions: upgrade, downgrade, maintain, undefined or whether it is the broker itself.
Price/market commentary: used to flag items describing pricing/market commentary.
Topic codes: describes what the story is about, i.e., RCH = Research, RES = Results, RESF = Results Forecast, MRG = Mergers and Acquisitions.

A snippet of the news sentiment analysis is illustrated in Fig. 17.

Fig. 17 Thomson Reuters News Discovery Application with Sentiment Analysis

In 2012, Thomson Reuters extended its machine-readable news offering to include sentiment analysis and scoring for social media. TRNA’s extension is called Thomson Reuters News Analytics (TRNA) for Internet News and Social Media, which aggregates content from over four million social media channels and 50,000 Internet news sites. The content is then analyzed by TRNA in real time, generating a quantifiable output across dimensions such as sentiment, relevance, novelty, volume, category and source ranks. This extension uses the same extensive metadata tagging (across more than 80 fields). TRNA for Internet News and Social Media is a powerful platform analyzing, tagging and filtering millions of public and premium sources of Internet content, turning big data into actionable ideas. The platform also provides a way to visually analyze the big data. It can be combined with Panopticon Data Visualization Software in order to reach meaningful conclusions more quickly with visually intuitive displays (Thomson Reuters 2012a, b, c), as illustrated in Fig. 18.

Fig. 18 Combining TRNA for Internet News and Social Media with Panopticon Data Visualization Software

Thomson Reuters also expanded the News Analytics service with MarketPsych Indices (Thomson Reuters 2012a, b, c), which allows for real-time psychological analysis of news and social media. The Thomson Reuters MarketPsych Indices (TRMI) service gains a quantitative view of market psychology as it attempts to identify human emotion and sentiment. It is a complement to TRNA and uses NLP processing created by MarketPsych (http://www.marketpsych.com), a leading company in behavioral psychology in financial markets. Behavioral economists have extensively investigated whether emotions affect markets in predictable ways, and TRMI attempts to measure the state of ‘emotions’ in real time in order to identify patterns as they emerge. TRMI has two key indicator types:

Emotional indicators (sentiments): emotions such as Gloom, Fear, Trust, Uncertainty, Innovation, Anger, Stress, Urgency, Optimism and Joy.
Buzz metrics: they indicate how much something is being discussed in the news and social media and include macroeconomic themes (e.g., Litigation, Mergers, Volatility, Financials sector, Airlines sector and Clean Technology sector).

The platform from Thomson Reuters allows news and social media to be exploited to spot opportunities and capitalize on market inefficiencies (Thomson Reuters 2013).

Experimental computational environment for social media

As we have discussed in Sect. 2 (methodology and critique), researchers arguably require a comprehensive experimental computational environment/facility for social media research with the following attributes:

Data

Data scraping: the ability through easily programmable APIs to scrape any type of social media (social networking media, RSS feeds, blogs, wikis, news, etc.).
Data streaming: to access and combine real-time feeds and archived data for analytics.
Data storage: a major facility for storing principal data sources and for archiving data collected for specific projects.
Data protection/security: the stored data needs to be protected to stop users attempting to ‘suck it out’ of the facility. Access to certain data sets may need to be restricted and charges may be levied on access (cf. Wharton Research Data Services).
Programmable interfaces: researchers need access to simple application programming interfaces (APIs) to scrape and store other available data sources that may not be automatically collected.

Analytics

Analytics dashboards: non-programming interfaces are required for giving what might be referred to as ‘deep’ access to ‘raw’ data.
Programmable analytics: programming interfaces are also required so users can deploy advanced data mining and computer simulation models using MATLAB, Java and Python.
Stream processing: facilities are required to support analytics on streamed real-time data feeds, such as Twitter feeds, news feeds and financial tick data.
High-performance computing: lastly, the environment needs to support non-programming interfaces to MapReduce/Hadoop, NoSQL databases and Grids of processors.
Decentralized analytics: if researchers are to combine social media data with highly sensitive/valuable proprietary data held by governments, financial institutions, retailers and other commercial organizations, then the environment needs in the future to support decentralized analytics across distributed data sources and in a highly secure way.

Realistically, such an environment is best facilitated at a national or international level. To provide some insight into the structure of an experimental computational environment for social media (analytics), below we present the system architecture of the UCL SocialSTORM analytics platform developed by Dr. Michal Galas and his colleagues (Galas et al. 2012) at University College London (UCL).

University College London’s social media streaming, storage and analytics platform (SocialSTORM) is a cloud-based ‘central hub’ platform, which facilitates the acquisition of text-based data from online sources such as Twitter, Facebook, RSS media and news. The system includes facilities to upload and run Java-coded simulation models to analyze the aggregated data, which may comprise scraped social data and/or users’ own proprietary data.

System architecture

Figure 19 shows the architecture of the SocialSTORM platform, and the following section outlines the key components of the overall system. The basic idea is that each external feed has a dedicated connectivity engine (API) and this streams data to the message bus, which handles internal communication, analytics and storage.

Fig. 19 SocialSTORM Platform Architecture

Connectivity engines: the connectivity modules communicate with the external data sources, including Twitter and Facebook’s APIs, financial blogs, and various RSS and news feeds. The platform’s APIs are continually being expanded to incorporate other social media sources as required. Data is fed into SocialSTORM in real time, including a random sample of all public updates from Twitter, providing gigabytes of text-based data every day.
Messaging bus: the message bus serves as the internal communication layer which accepts the incoming data streams (messages) from the various connectivity engines, parses these (from either JSON or XML format) to an internal representation of data in the platform, distributes the information across all the interested modules and writes the various data to the appropriate tables of the main database.
Data warehouse: the database supports terabytes of text-based entries, which are accompanied by various types of metadata to expand the potential avenues of research. Entries are organized by source and accurately time-stamped with the time of publication, as well as being tagged with topics for easy retrieval by simulation models. The platform currently uses HBase, but in future might use Apache Cassandra or Hive.
Simulation manager: the simulation manager provides an external API for clients to interact with the data for research purposes, including a web-based GUI whereby users can select various filters to apply to the data sets before uploading a Java-coded simulation model to perform the desired analysis on the data. This facilitates all client access to the data warehouse and also allows users to upload their own data sets for aggregation with UCL’s social data for a particular simulation. There is also the option to switch between historical mode (which mines data existing at the time the simulation is started) and live mode (which ‘listens’ to incoming data streams and performs analysis in real time).

Platform components

The platform comprises the following modules, which are illustrated in Fig. 20:

Fig. 20 Environment System Architecture and Modules

Back-end services: this provides the core of the platform functionalities. It is a set of services that allow connections to data providers, propagation, processing and aggregation of data feeds, execution and maintenance of models, as well as their management in a multiuser environment.
Front-end client APIs: this provides a set of programmatic and graphical interfaces that can be used to interact with the platform to implement and test analytical models. The programmatic access provides model templates to simplify access to some of the functionalities and defines the generic structure of every model in the platform. The graphical user interface allows visual management of analytical models. It enables the user to visualize data in various forms, provides data watch grid capabilities, provides a dynamic visualization of group behavior of data and allows users to observe information on events relevant to the user’s environment.
Connectivity engine: this functionality provides a means of communication with the outside world, with financial brokers, data providers and others. Each of the outside venues utilized by the platform has a dedicated connector object responsible for control of communication. This is possible because each of the outside institutions either provides a dedicated API or uses a communication protocol (e.g., the FIX protocol or a JSON/XML-based protocol). The platform provides a generalized interface to allow standardization of a variety of connectors.
Internal communication layer: the idea behind the use of the internal messaging system in the platform draws from the concept of event-driven programming. Analytical platforms utilize events as a main means of communication between their elements. The elements, in turn, are either producers or consumers of the events. The approach significantly simplifies the architecture of such a system while making it scalable and flexible for further extensions.
Aggregation database: this provides fast and robust DBMS functionality for an entry-level aggregation of data, which is then filtered, enriched, restructured and stored in big data facilities. Aggregation facilities enable analytical platforms to store, extract and manipulate large amounts of data. The storage capabilities of the Aggregation element not only allow replay of historical data for modeling purposes, but also enable other, more sophisticated tasks related to the functioning of the platform, including model risk analysis, evaluation of the performance of models and many more.
Client SDK: this is a complete set of APIs (Application Programming Interfaces) that enable development, implementation and testing of new analytical models with use of the developer’s favorite IDE (Integrated Development Environment). The SDK allows connection from the IDE to the server side of the platform to provide all the functionalities the user may need to develop and execute models.
Shared memory: this provides a buffer-type functionality that speeds up the delivery of temporal/historical data to models and the analytics-related elements of the platform (i.e., the statistical analysis library of methods), and, at the same time, reduces the memory usage requirement. The main idea is to have a central point in the memory (RAM) of the platform that will manage and provide temporal/historical data from the current point of time up to a specified number of timestamps back in history. Since the memory is shared, no model will have to keep and manage history by itself. Moreover, since the data is kept in RAM rather than in files or the DBMS, access to it is fast and bounded only by the performance of the hardware and the platform on which the buffers work.
Model templates: the platform supports two generic types of models: push and pull. The push type registers itself to listen to a specified set of data streams during initialization, and the execution of the model logic is triggered each time a new data feed arrives at the platform. This type is dedicated to very quick, low-latency, high-frequency models, and the speed is achieved at the cost of small shared memory buffers. The pull model template executes and requests data on its own, based on a schedule. Instead of using the memory buffers, it has a direct connection to the big data facilities and hence can request as much historical data as necessary, at the expense of speed.
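
The interplay between the internal communication layer, the shared memory and the two model templates can be sketched in a few lines of Python. This is only an illustrative sketch and not the SocialSTORM implementation, which is Java-based and not shown in this paper. All names here (SharedBuffer, MessageBus, PushModel, PullModel, the "score" field) are hypothetical, a plain list stands in for the big data facilities, and a single buffer is shared across all streams for brevity.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class SharedBuffer:
    """Bounded in-RAM history shared by all models (hypothetical name)."""

    def __init__(self, max_items: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._items: Deque[dict] = deque(maxlen=max_items)

    def append(self, message: dict) -> None:
        self._items.append(message)

    def history(self, n: int) -> List[dict]:
        """Return up to the n most recent messages, oldest first."""
        return list(self._items)[-n:] if n > 0 else []


class MessageBus:
    """Event-driven layer: stores each message, then notifies subscribers."""

    def __init__(self, buffer: SharedBuffer, archive: List[dict]) -> None:
        self._buffer = buffer
        self._archive = archive  # stand-in for the big data facilities
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, stream: str, handler: Callable[[dict], None]) -> None:
        self._subscribers.setdefault(stream, []).append(handler)

    def publish(self, stream: str, message: dict) -> None:
        self._buffer.append(message)
        self._archive.append(message)
        for handler in self._subscribers.get(stream, []):
            handler(message)


class PushModel:
    """Registers for streams at initialization; runs on every new message."""

    def __init__(self, bus: MessageBus, buffer: SharedBuffer,
                 streams: List[str], window: int) -> None:
        self._buffer = buffer
        self._window = window
        self.results: List[float] = []
        for stream in streams:
            bus.subscribe(stream, self.on_data)

    def on_data(self, message: dict) -> None:
        # Only the small shared buffer is consulted, which keeps latency low.
        recent = self._buffer.history(self._window)
        self.results.append(sum(m["score"] for m in recent) / len(recent))


class PullModel:
    """Runs when its schedule fires and reads the full archive directly."""

    def __init__(self, archive: List[dict]) -> None:
        self._archive = archive

    def run_once(self) -> Optional[float]:
        if not self._archive:
            return None
        return sum(m["score"] for m in self._archive) / len(self._archive)


buffer = SharedBuffer(max_items=2)
archive: List[dict] = []
bus = MessageBus(buffer, archive)
push = PushModel(bus, buffer, streams=["twitter"], window=2)
pull = PullModel(archive)

for score in (1.0, 2.0, 3.0):
    bus.publish("twitter", {"score": score})

print(push.results)     # [1.0, 1.5, 2.5]
print(pull.run_once())  # 2.0
```

The push model sees only the last two messages because the buffer is deliberately small, whereas the pull model averages over everything archived so far. That mirrors the trade-off described above: low latency with limited history versus unlimited history at the expense of speed.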

Conclusions

As discussed, the easy availability of APIs provided by Twitter, Facebook and News services has led to an ‘explosion’ of data services and software tools for scraping and sentiment analysis, and social media analytics platforms. This paper surveys some of the social media software tools, and for completeness introduces social media scraping, data cleaning and sentiment analysis.

Perhaps the biggest concern is that companies are increasingly restricting access to their data to monetize their content. It is important that researchers have access to computational environments and especially ‘big’ social media data for experimentation. Otherwise, computational social science could become the exclusive domain of major companies, government agencies and a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Arguably, what is required are public-domain computational environments and data facilities for quantitative social science, which can be accessed by researchers via a cloud-based facility.

Notes

An object may be a person, a page, a picture or an event.
Details of the information retrieved in status updates can be found here: https://developers.facebook.com/docs/reference/api/status/.
More information about Google Refine is found in its documentation wiki: https://github.com/OpenRefine/OpenRefine/wiki.
Apache Mahout project page: http://mahout.apache.org/.
http://rapid-i.com/.

References

Botan I et al. (2010) SECRET: a model for analysis of the execution semantics of stream processing systems. Proc VLDB Endow 3(1–2):232–243
Salathé M et al. (2012) Digital epidemiology. PLoS Comput Biol 8(7):1–5
Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. J Comput Sci 2(3):1–8
Chandramouli B et al. (2010) Data stream management systems for computational finance. IEEE Comput 43(12):45–52
Chandrasekar C, Kowsalya N (2011) Implementation of MapReduce Algorithm and Nutch Distributed File System in Nutch. Int J Comput Appl 1:6–11
Cioffi-Revilla C (2010) Computational social science. Wiley Interdiscip Rev Comput Statistics 2(3):259–271
Galas M, Brown D, Treleaven P (2012) A computational social science environment for financial/economic experiments. In: Proceedings of the Computational Social Science Society of the Americas, vol 1, pp 1–13
Hebrail G (2008) Data stream management and mining. In: Fogelman-Soulié F, Perrotta D, Piskorski J, Steinberger R (eds) Mining Massive Data Sets for Security. IOS Press, pp 89–102
Hirudkar AM, Sherekar SS (2013) Comparative analysis of data mining tools and techniques for evaluating performance of database system. Int J Comput Sci Appl 6(2):232–237
Kaplan AM (2012) If you love something, let it go mobile: mobile marketing and mobile social media 4x4. Bus Horiz 55(2):129–139
Kaplan AM, Haenlein M (2010) Users of the world, unite! the challenges and opportunities of social media. Bus Horiz 53(1):59–68
Karabulut Y (2013) Can Facebook predict stock market activity? SSRN eLibrary, pp 1–58. http://ssrn.com/abstract=2017099. Accessed 2 Feb 2014
Khan A, Baharudin B, Lee LH, Khan K (2010) A review of machine learning algorithms for text-documents classification. J Adv Inf Technol 1(1):4–20
Kobayashi M, Takeda K (2000) Information retrieval on the web. ACM Comput Surv CSUR 32(2):144–173
Lazer D et al. (2009) Computational social science. Science 323:721–723
Lerman K, Gilder A, Dredze M, Pereira F (2008) Reading the markets: forecasting public opinion of political candidates by news analysis. In: Proceedings of the 22nd international conference on computational linguistics 1:473–480
MapReduce (2011) What is MapReduce?. http://www.mapreduce.org/what-is-mapreduce.php. Accessed 31 Jan 2014
Mejova Y (2009) Sentiment analysis: an overview, pp 1–34. http://www.academia.edu/291678/Sentiment_Analysis_An_Overview. Accessed 4 Nov 2013
Murphy KP (2006) Naive Bayes classifiers. University of British Columbia, pp 1–8. http://www.ic.unicamp.br/~rocha/teaching/2011s1/mc906/aulas/naivebayes.pdf
Murphy KP (2012) Machine learning: a probabilistic perspective. In: Chapter 1: Introduction. MIT Press, pp 1–26
Narang RK (2009) Inside the black box. Hoboken, New Jersey
Nuti G, Mirghaemi M, Treleaven P, Yingsaeree C (2011) Algorithmic trading. IEEE Comput 44(11):61–69
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
SAS Institute Inc (2013) SAS sentiment analysis factsheet. http://www.sas.com/resources/factsheet/sas-sentiment-analysis-factsheet.pdf. Accessed 6 Dec 2013
Teufl P, Payer U, Lackner G (2010) From NLP (natural language processing) to MLP (machine language processing). In: Kotenko I, Skormin V (eds) Computer network security, Springer, Berlin Heidelberg, pp 256–269
Thomson Reuters (2010) Thomson Reuters news analytics. http://thomsonreuters.com/products/financial-risk/01_255/News_Analytics_-_Product_Brochure-_Oct_2010_1_.pdf. Accessed 1 Oct 2013
Thomson Reuters (2012) Thomson Reuters machine readable news. http://thomsonreuters.com/products/financial-risk/01_255/TR_MRN_Overview_10Jan2012.pdf. Accessed 5 Dec 2013
Thomson Reuters (2012) Thomson Reuters MarketPsych Indices. http://thomsonreuters.com/products/financial-risk/01_255/TRMI_flyer_2012.pdf. Accessed 7 Dec 2013
Thomson Reuters (2012) Thomson Reuters news analytics for internet news and social media. http://thomsonreuters.com/business-unit/financial/eurozone/112408/news_analytics_and_social_media. Accessed 7 Dec 2013
Thomson Reuters (2013) Machine readable news. http://thomsonreuters.com/machine-readable-news/?subsector=thomson-reuters-elektron. Accessed 18 Dec 2013
Turney PD (2002) Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp 417–424
Vaswani V (2011) Hook into Wikipedia information using PHP and the MediaWiki API. http://www.ibm.com/developerworks/web/library/x-phpwikipedia/index.html. Accessed 21 Dec 2012
Westerski A (2008) Sentiment analysis: introduction and the state of the art overview. Universidad Politecnica de Madrid, Spain, pp 1–9. http://www.adamwesterski.com/wpcontent/files/docsCursos/sentimentA_doc_TLAW.pdf. Accessed 14 Aug 2013
Wikimedia Foundation (2014) Wikipedia:Database download. http://en.wikipedia.org/wiki/Wikipedia:Database_download. Accessed 18 Apr 2014
Wolfram SMA (2010) Modelling the stock market using Twitter. Dissertation Master of Science thesis, School of Informatics, University of Edinburgh, pp 1–74. http://homepages.inf.ed.ac.uk/miles/msc-projects/wolfram.pdf. Accessed 23 Jul 2013
Yessenov K, Misailovic S (2009) Sentiment analysis of movie review comments, pp 1–17. http://people.csail.mit.edu/kuat/courses/6.863/report.pdf. Accessed 16 Aug 2013

Acknowledgments

The authors would like to acknowledge Michal Galas, who led the design and implementation of the UCL SocialSTORM platform with the assistance of Ilya Zheludev, Kacper Chwialkowski and Dan Brown. Dr. Christian Hesse of Deutsche Bank is also acknowledged for collaboration on News Analytics.

Author information

Affiliations: Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
Authors: Bogdan Batrinca & Philip C. Treleaven
Corresponding author: correspondence to Philip C. Treleaven.

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

About this article

Cite this article: Batrinca, B., Treleaven, P.C. Social media analytics: a survey of techniques, tools and platforms. AI & Soc 30, 89–116 (2015). https://doi.org/10.1007/s00146-014-0549-4
Received: 25 February 2014
Accepted: 04 July 2014
Published: 26 July 2014
Issue Date: February 2015
DOI: https://doi.org/10.1007/s00146-014-0549-4

Keywords

Social media, Scraping, Behavior economics, Sentiment analysis, Opinion mining, NLP, Toolkits, Software platforms

