Social media analytics: a survey of techniques, tools and ...


AI & Soc (2015) 30:89–116
DOI 10.1007/s00146-014-0549-4

OPEN FORUM

Social media analytics: a survey of techniques, tools and platforms

Bogdan Batrinca • Philip C. Treleaven

Received: 25 February 2014 / Accepted: 4 July 2014 / Published online: 26 July 2014
The Author(s) 2014. This article is published with open access at Springerlink.com

B. Batrinca • P. C. Treleaven (&)
Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK
e-mail: [email protected]
B. Batrinca
e-mail: [email protected]

Abstract

This paper is written for (social science) researchers seeking to analyze the wealth of social media now available. It presents a comprehensive review of software tools for social networking media, wikis, really simple syndication feeds, blogs, newsgroups, chat and news feeds. For completeness, it also includes introductions to social media scraping, storage, data cleaning and sentiment analysis. Although principally a review, the paper also provides a methodology and a critique of social media tools. Analyzing social media, in particular Twitter feeds for sentiment analysis, has become a major research and business activity due to the availability of web-based application programming interfaces (APIs) provided by Twitter, Facebook and News services. This has led to an ‘explosion’ of data services, software tools for scraping and analysis and social media analytics platforms. It is also a research area undergoing rapid change and evolution due to commercial pressures and the potential for using social media data for computational (social science) research. Using a simple taxonomy, this paper provides a review of leading software tools and how to use them to scrape, cleanse and analyze the spectrum of social media. In addition, it discusses the requirement of an experimental computational environment for social media research and presents as an illustration the system architecture of a social media (analytics) platform built by University College London. The principal contribution of this paper is to provide an overview (including code fragments) for scientists seeking to utilize social media scraping and analytics either in their research or business. The data retrieval techniques that are presented in this paper are valid at the time of writing this paper (June 2014), but they are subject to change since social media data scraping APIs are rapidly changing.

Keywords: Social media · Scraping · Behavior economics · Sentiment analysis · Opinion mining · NLP · Toolkits · Software platforms

1 Introduction

Social media is defined as web-based and mobile-based Internet applications that allow the creation, access and exchange of user-generated content that is ubiquitously accessible (Kaplan and Haenlein 2010). Besides social networking media (e.g., Twitter and Facebook), for convenience, we will also use the term ‘social media’ to encompass really simple syndication (RSS) feeds, blogs, wikis and news, all typically yielding unstructured text and accessible through the web. Social media is especially important for research into computational social science that investigates questions (Lazer et al. 2009) using quantitative techniques (e.g., computational statistics, machine learning and complexity) and so-called big data for data mining and simulation modeling (Cioffi-Revilla 2010). This has led to numerous data services, tools and analytics platforms. However, this easy availability of social media data for academic research may change significantly due to commercial pressures. In addition, as discussed in Sect. 2, the tools available to researchers are far from ideal. They either give superficial access to the raw data or (for non-superficial access) require researchers to program analytics in a language such as Java.

1.1 Terminology

We start with definitions of some of the key techniques related to analyzing unstructured textual data:
• Natural language processing (NLP): a field of computer science, artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages. Specifically, it is the process of a computer extracting meaningful information from natural language input and/or producing natural language output.
• News analytics: the measurement of the various qualitative and quantitative attributes of textual (unstructured data) news stories. Some of these attributes are: sentiment, relevance and novelty.
• Opinion mining: opinion mining (sentiment mining, opinion/sentiment extraction) is the area of research that attempts to build automatic systems that determine human opinion from text written in natural language.
• Scraping: collecting online data from social media and other Web sites in the form of unstructured text; also known as site scraping, web harvesting and web data extraction.
• Sentiment analysis: the application of natural language processing, computational linguistics and text analytics to identify and extract subjective information in source materials.
• Text analytics: involves information retrieval (IR), lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization and predictive analytics.
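As a rough illustration of two of the terms above, the following Python sketch computes a word frequency distribution (one ingredient of text analytics) and a very simple lexicon-based sentiment score. The word lists, function names and example posts are assumptions made purely for illustration; they are not taken from any tool surveyed in this paper, and practical sentiment analysis relies on far richer lexicons or trained classifiers.

```python
import re
from collections import Counter

# Hypothetical toy lexicon, assumed for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Count how often each token occurs across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def sentiment_score(text):
    """Return (positive tokens - negative tokens); above zero suggests positive sentiment."""
    tokens = tokenize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


# Made-up example posts.
posts = ["I love the new library, great work", "Terrible service and bad support"]
freqs = word_frequencies(posts)
scores = [sentiment_score(post) for post in posts]
```

Counting lexicon hits ignores negation, sarcasm and slang, which is one reason the paper lists data analytics among the open research challenges.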
1.2 Research challenges

Social media scraping and analytics provide a rich source of academic research challenges for social scientists, computer scientists and funding bodies. Challenges include:

• Scraping: although social media data is accessible through APIs, due to the commercial value of the data, most of the major sources such as Facebook and Google are making it increasingly difficult for academics to obtain comprehensive access to their ‘raw’ data; very few social data sources provide affordable data offerings to academia and researchers. News services such as Thomson Reuters and Bloomberg typically charge a premium for access to their data. In contrast, Twitter has recently announced the Twitter Data Grants program, where researchers can apply to get access to Twitter’s public tweets and historical data in order to get insights from its massive set of data (Twitter has more than 500 million tweets a day).
• Data cleansing: cleaning unstructured textual data (e.g., normalizing text), especially high-frequency streamed real-time data, still presents numerous problems and research challenges.
• Holistic data sources: researchers are increasingly bringing together and combining novel data sources: social media data, real-time market & customer data and geospatial data for analysis.
• Data protection: once you have created a ‘big data’ resource, the data needs to be secured, ownership and IP issues resolved (i.e., storing scraped data is against most of the publishers’ terms of service), and users provided with different levels of access; otherwise, users may attempt to ‘suck’ all the valuable data from the database.
• Data analytics: sophisticated analysis of social media data for opinion mining (e.g., sentiment analysis) still raises a myriad of challenges due to foreign languages, foreign words, slang, spelling errors and the natural evolution of language.
• Analytics dashboards: many social media platforms require users to write APIs to access feeds or program analytics models in a programming language, such as Java. While reasonable for computer scientists, these skills are typically beyond most (social science) researchers. Non-programming interfaces are required for giving what might be referred to as ‘deep’ access to ‘raw’ data, for example, configuring APIs, merging social media feeds, combining holistic sources and developing analytical models.
• Data visualization: visual representation of data, whereby information is abstracted in some schematic form with the goal of communicating it clearly and effectively through graphical means. Given the magnitude of the data involved, visualization is becoming increasingly important.

1.3 Social media research and applications

Social media data is clearly the largest, richest and most dynamic evidence base of human behavior, bringing new opportunities to understand individuals, groups and society. Innovative scientists and industry professionals are increasingly finding novel ways of automatically collecting, combining and analyzing this wealth of data. Naturally, doing justice to these pioneering social media applications in a few paragraphs is challenging. Three illustrative areas are: business, bioscience and social science.

The early business adopters of social media analysis were typically companies in retail and finance. Retail companies use social media to harness their brand awareness, product/customer service improvement, advertising/marketing strategies, network structure analysis, news propagation and even fraud detection. In finance, social media is used for measuring market sentiment and news data is used for trading. As an illustration, Bollen et al. (2011) measured the sentiment of a random sample of Twitter data, finding that Dow Jones Industrial Average (DJIA) prices are correlated with the Twitter sentiment 2–3 days earlier with 87.6 percent accuracy. Wolfram (2010) used Twitter data to train a Support Vector Regression (SVR) model to predict prices of individual NASDAQ stocks, finding ‘significant advantage’ for forecasting prices 15 min in the future.

In the biosciences, social media is being used to collect data on large cohorts for behavioral change initiatives and impact monitoring, such as tackling smoking and obesity or monitoring diseases. An example is Penn State University biologists (Salathé et al. 2012) who have developed innovative systems and techniques to track the spread of infectious diseases, with the help of news Web sites, blogs and social media.

Computational social science applications include: monitoring public responses to announcements, speeches and events, especially political comments and initiatives; insights into community behavior; social media polling of (hard to contact) groups; early detection of emerging events, as with Twitter. For example, Lerman et al. (2008) use computational linguistics to automatically predict the impact of news on the public perception of political candidates. Yessenov and Misailovic (2009) use movie review comments to study the effect of various approaches in extracting text features on the accuracy of four machine learning methods: Naive Bayes, Decision Trees, Maximum Entropy and K-Means clustering. Lastly, Karabulut (2013) found that Facebook’s Gross National Happiness (GNH) exhibits peaks and troughs in line with major public events in the USA.

1.4 Social media overview

For this paper, we group social media tools into:

• Social media data: social media data types (e.g., social network media, wikis, blogs, RSS feeds and news, etc.) and formats (e.g., XML and JSON). This includes data sets and increasingly important real-time data feeds, such as financial data, customer transaction data, telecoms and spatial data.
• Social media programmatic access: data services and tools for sourcing and scraping (textual) data from social networking media, wikis, RSS feeds, news, etc. These can be usefully subdivided into:
  • Data sources, services and tools: where data is accessed by tools which protect the raw data or provide simple analytics. Examples include: Google Trends, SocialMention, SocialPointer and SocialSeek, which provide a stream of information that aggregates various social media feeds.
  • Data feeds via APIs: where data sets and feeds are accessible via programmable HTTP-based APIs and return tagged data using XML or JSON, etc. Examples include Wikipedia, Twitter and Facebook.
• Text cleaning and storage tools: tools for cleaning and storing textual data. Google Refine and DataWrangler are examples for data cleaning.
• Text analysis tools: individual or libraries of tools for analyzing social media data once it has been scraped and cleaned. These are mainly natural language processing, analysis and classification tools, which are explained below.
  • Transformation tools: simple tools that can transform textual input data into tables, maps, charts (line, pie, scatter, bar, etc.), timeline or even motion (animation over timeline), such as Google Fusion Tables, Zoho Reports, Tableau Public or IBM’s Many Eyes.
  • Analysis tools: more advanced analytics tools for analyzing social data, identifying connections and building networks, such as Gephi (open source) or the Excel plug-in NodeXL.
• Social media platforms: environments that provide comprehensive social media data and libraries of tools for analytics. Examples include: Thomson Reuters Machine Readable News, Radian 6 and Lexalytics.
  • Social network media platforms: platforms that provide data mining and analytics on Twitter, Facebook and a wide range of other social network media sources.
  • News platforms: platforms such as Thomson Reuters providing commercial news archives/feeds and associated analytics.

2 Social media methodology and critique

The two major impediments to using social media for academic research are firstly access to comprehensive data sets and secondly tools that allow ‘deep’ data analysis without the need to be able to program in a language such as Java. The majority of social media resources are commercial and companies are naturally trying to monetize their data. As discussed, it is important that researchers have access to open-source ‘big’ (social media) data sets and facilities for experimentation. Otherwise, social media research could become the exclusive domain of major companies, government agencies and a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Recently, there has been a modest response, as Twitter and Gnip are piloting a new program for data access, starting with 5 all-access data grants to select applicants.

2.1 Methodology

Research requirements can be grouped into: data, analytics and facilities.
2.1.1 Data

Researchers need online access to historic and real-time social media data, especially the principal sources, to conduct world-leading research:

• Social network media: access to comprehensive historic data sets and also real-time access to sources, possibly with a (15 min) time delay, as with Thomson Reuters and Bloomberg financial data.
• News data: access to historic data and real-time news data sets, possibly through the concept of ‘educational data licenses’ (cf. software license).
• Public data: access to scraped and archived important public data; available through RSS feeds, blogs or open government databases.
• Programmable interfaces: researchers also need access to simple application programming interfaces (APIs) to scrape and store other available data sources that may not be automatically collected.

2.1.2 Analytics

Currently, social media data is typically either available via simple general routines or requires the researcher to program their analytics in a language such as MATLAB, Java or Python. As discussed above, researchers require:

• Analytics dashboards: non-programming interfaces are required for giving what might be termed as ‘deep’ access to ‘raw’ data.
• Holistic data analysis: tools are required for combining (and conducting analytics across) multiple social media and other data sets.
• Data visualization: researchers also require visualization tools whereby information that has been abstracted can be visualized in some schematic form with the goal of communicating information clearly and effectively through graphical means.

2.1.3 Facilities

Lastly, the sheer volume of social media data being generated argues for national and international facilities to be established to support social media research (cf. Wharton Research Data Services https://wrds-web.wharton.upenn.edu):

• Data storage: the volume of social media data, current and projected, is beyond most individual universities and hence needs to be addressed at a national science foundation level. Storage is required both for principal data sources (e.g., Twitter) and for sources collected by individual projects and archived for future use by other researchers.
• Computational facility: remotely accessible computational facilities are also required for: (a) protecting access to the stored data; (b) hosting the analytics and visualization tools; and (c) providing computational resources such as grids and GPUs required for processing the data at the facility rather than transmitting it across a network.

2.2 Critique

Much needs to be done to support social media research. As discussed, the majority of current social media resources are commercial, expensive and difficult for academics to access in full.

2.2.1 Data

In general, access to important sources of social media data is frequently restricted and full commercial access is expensive.

• Siloed data: most data sources (e.g., Twitter) have inherently isolated information, making it difficult to combine with other data sources.
• Holistic data: in contrast, researchers are increasingly interested in accessing, storing and combining novel data sources: social media data, real-time financial market & customer data and geospatial data for analysis. This is currently extremely difficult to do even for Computer Science departments.

2.2.2 Analytics

Analytical tools provided by vendors are often tied to a single data set, may be limited in analytical capability, and data charges make them expensive to use.

2.2.3 Facilities

There are an increasing number of powerful commercial platforms, such as the ones supplied by SAS and Thomson Reuters, but the charges are largely prohibitive for academic research. Either comparable facilities need to be provided by national science foundations or vendors need to be persuaded to introduce the concept of an ‘educational license.’
3 Social media data

Clearly, there is a large and increasing number of (commercial) services providing access to social networking media (e.g., Twitter, Facebook and Wikipedia) and news services (e.g., Thomson Reuters Machine Readable News). Equivalent major academic services are scarce. We start by discussing types of data and formats produced by these services.

3.1 Types of data

Although we focus on social media, as discussed, researchers are continually finding new and innovative sources of data to bring together and analyze. So when considering textual data analysis, we should consider multiple sources (e.g., social networking media, RSS feeds, blogs and news) supplemented by numeric (financial) data, telecoms data, geospatial data and potentially speech and video data. Using multiple data sources is certainly the future of analytics.

Broadly, data subdivides into:

• Historic data sets: previously accumulated and stored social/news, financial and economic data.
• Real-time feeds: live data feeds from streamed social media, news services, financial exchanges, telecoms services, GPS devices and speech.

And into:

• Raw data: unprocessed computer data straight from source that may contain errors or may be unanalyzed.
• Cleaned data: correction or removal of erroneous (dirty) data caused by disparities, keying mistakes, missing bits, outliers, etc.
• Value-added data: data that has been cleaned, analyzed, tagged and augmented with knowledge.
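The step from raw data to cleaned data can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the cleaning rules (Unicode normalization, URL removal, whitespace collapsing, lowercasing and duplicate removal) and the function names `clean_post` and `clean_dataset` are illustrative choices, not a procedure prescribed by this paper or by any particular tool.

```python
import re
import unicodedata


def clean_post(raw):
    """Normalize one raw post; return None when nothing usable remains."""
    text = unicodedata.normalize("NFKC", raw)    # unify equivalent Unicode forms
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text.lower() or None


def clean_dataset(raw_posts):
    """Clean all posts, dropping empty ones and exact duplicates while keeping order."""
    seen = set()
    cleaned = []
    for raw in raw_posts:
        text = clean_post(raw)
        if text is not None and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Made-up raw posts containing stray whitespace, a URL, an empty entry and a duplicate.
raw = ["  UCL  opens NEW lab http://example.org/x ", "ucl opens new lab", "   ", "Rain\tagain"]
cleaned = clean_dataset(raw)
```

Which rules are appropriate depends on the analysis: lowercasing, for instance, discards capitalization that some sentiment methods treat as a signal.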
3.2 Text data formats

The four most common formats used to mark up text are: HTML, XML, JSON and CSV.

• HTML: HyperText Markup Language (HTML) is the well-known markup language for web pages and other information that can be viewed in a web browser. HTML consists of HTML elements, which include tags enclosed in angle brackets (e.g., <div>), within the content of the web page.
• XML: Extensible Markup Language (XML) is the markup language for structuring textual data using <tag>…</tag> to define elements.
• JSON: JavaScript Object Notation (JSON) is a text-based open standard designed for human-readable data interchange and is derived from JavaScript.
• CSV: a comma-separated values (CSV) file contains the values in a table as a series of ASCII text lines organized such that each column value is separated by a comma from the next column’s value and each row starts a new line.

For completeness, HTML and XML are so-called markup languages (markup and content) that define a set of simple syntactic rules for encoding documents in a format both human readable and machine readable. A markup comprises start-tags (e.g., <tag>), content text and end-tags (e.g., </tag>).

Many feeds use JavaScript Object Notation (JSON), the lightweight data-interchange format, based on a subset of the JavaScript Programming Language. JSON is a language-independent text format that uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. JSON’s basic types are: Number, String, Boolean, Array (an ordered sequence of values, comma separated and enclosed in square brackets) and Object (an unordered collection of key:value pairs). The JSON format is illustrated in Fig. 1 for a query on the Twitter API on the string ‘UCL,’ which returns two ‘text’ results from the Twitter user ‘uclnews.’

Comma-separated values are not a single, well-defined format but rather refer to any text file that: (a) is plain text using a character set such as ASCII, Unicode or EBCDIC; (b) consists of text records (e.g., one record per line); (c) with records divided into fields separated by delimiters

[...]
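To make the JSON and CSV formats of Sect. 3.2 concrete, the Python sketch below parses a small JSON document and writes the same records out as CSV, using only the standard library. The payload is a made-up example loosely modelled on the description of a query for ‘UCL’ returning two ‘text’ results; the field names (`query`, `results`, `from_user`, `text`) are assumptions and do not reproduce the actual Twitter API response shown in Fig. 1.

```python
import csv
import io
import json

# Hypothetical payload; the field names are assumed, not the real Twitter API schema.
json_text = """
{"query": "UCL",
 "results": [
   {"from_user": "uclnews", "text": "First example post"},
   {"from_user": "uclnews", "text": "Second example post"}
 ]}
"""

# JSON objects become Python dicts and JSON arrays become Python lists.
data = json.loads(json_text)
texts = [result["text"] for result in data["results"]]

# Write the same records as CSV: one header row, then one line per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["from_user", "text"])
writer.writeheader()
writer.writerows(data["results"])
csv_text = buffer.getvalue()

# Read the CSV back into a list of dicts keyed by the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

JSON preserves nesting and basic types, whereas CSV flattens each record into text fields, so nested structures have to be flattened before export.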
Fig. 19 SocialSTORM Platform Architecture

[...] types of metadata to expand the potential avenues of research. Entries are organized by source and accurately time-stamped with the time of publication, as well as being tagged with topics for easy retrieval by simulation models. The platform currently uses HBase, but in future might use Apache Cassandra or Hive.

• Simulation manager: the simulation manager provides an external API for clients to interact with the data for research purposes, including a web-based GUI whereby users can select various filters to apply to the data sets before uploading a Java-coded simulation model to perform the desired analysis on the data. This facilitates all client access to the data warehouse and also allows users to upload their own data sets for aggregation with UCL’s social data for a particular simulation. There is also the option to switch between historical mode (which mines data existing at the time the simulation is started) and live mode (which ‘listens’ to incoming data streams and performs analysis in real time).

9.4 Platform components

The platform comprises the following modules, which are illustrated in Fig. 20:

• Back-end services: this provides the core of the platform functionalities. It is a set of services that allow connections to data providers, propagation processing and aggregation of data feeds, execution and maintenance of models, as well as their management in a multiuser environment.
• Front-end client APIs: this provides a set of programmatic and graphical interfaces that can be used to interact with the platform to implement and test analytical models. The programmatic access provides model templates to simplify access to some of the functionalities and defines the generic structure of every model in the platform. The graphic user interface allows visual management of analytical models. It enables the user to visualize data in various forms, provides data watch grid capabilities, provides a dynamic visualization of group behavior of data and allows users to observe information on events relevant to the user’s environment.
• Connectivity engine: this functionality provides a means of communication with the outside world, with financial brokers, data providers and others. Each of the outside venues utilized by the platform has a dedicated connector object responsible for control of communication. This is possible because each of the outside institutions either provides a dedicated API or uses a communication protocol (e.g., the FIX protocol and the JSON/XML-based protocol). The platform provides a generalized interface to allow standardization of a variety of connectors.
• Internal communication layer: the idea behind the use of the internal messaging system in the platform draws from the concept of event-driven programming. Analytical platforms utilize events as a main means of communication between their elements. The elements, in turn, are either producers or consumers of the events. The approach significantly simplifies the architecture of such a system while making it scalable and flexible for further extensions.
• Aggregation database: this provides a fast and robust DBMS functionality, for an entry-level aggregation of data, which is then filtered, enriched, restructured and stored in big data facilities. Aggregation facilities enable analytical platforms to store, extract and manipulate large amounts of data. The storage capabilities of the Aggregation element not only allow replay of historical data for modeling purposes, but also enable other, more sophisticated tasks related to the functioning of the platform, including model risk analysis, evaluation of performance of models and many more.
• Client SDK: this is a complete set of APIs (Application Programming Interfaces) that enable development, implementation and testing of new analytical models with use of the developer’s favorite IDE (Integrated Development Environment). The SDK allows connection from the IDE to the server side of the platform to provide all the functionalities the user may need to develop and execute models.
• Shared memory: this provides a buffer-type functionality that speeds up the delivery of temporal/historical data to models and the analytics-related elements of the platform (i.e., the statistical analysis library of methods) and, at the same time, reduces the memory usage requirement. The main idea is to have a central point in the memory (RAM) of the platform that will manage and provide temporal/historical data, from the current point in time up to a specified number of timestamps back in history. Since the memory is shared, no model will have to keep and manage history by itself. Moreover, since the memory is kept in RAM rather than in files or the DBMS, access to it is instant and bounded only by the performance of the hardware and the platform on which the buffers work.
• Model templates: the platform supports two generic types of models: push and pull. The push type registers itself to listen to a specified set of data streams during initialization, and the execution of the model logic is triggered each time a new data feed arrives at the platform. This type is dedicated to very quick, low-latency, high-frequency models, and the speed is achieved at the cost of small shared memory buffers. The pull model template executes and requests data on its own, based on a schedule. Instead of using the memory buffers, it has a direct connection to the big data facilities and hence can request as much historical data as necessary, at the expense of speed.

Fig. 20 Environment System Architecture and Modules
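The difference between the push and pull model templates, and the role of the shared memory buffer, can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the platform's implementation (which the text describes as Java-based): every class and method name here (`Platform`, `SharedBuffer`, `PushModel`, `PullModel`, `on_data`, `run`) is hypothetical, the "big data facility" is stood in for by a plain list, and the pull model's schedule is omitted so that `run` is called explicitly.

```python
from collections import deque


class SharedBuffer:
    """Bounded in-memory history shared by all push models (assumed design)."""

    def __init__(self, max_items):
        self._items = deque(maxlen=max_items)  # oldest entries are evicted automatically

    def append(self, item):
        self._items.append(item)

    def recent(self, n):
        """Return up to the n most recent entries, oldest first."""
        return list(self._items)[-n:] if n > 0 else []


class Platform:
    """Toy platform: keeps full history and notifies registered push models."""

    def __init__(self, buffer_size=3):
        self.buffer = SharedBuffer(buffer_size)
        self.store = []        # stand-in for the big data facility
        self.subscribers = {}  # stream name -> list of push models

    def register(self, stream, model):
        self.subscribers.setdefault(stream, []).append(model)

    def publish(self, stream, item):
        """Record a new data item, then trigger every push model on that stream."""
        self.store.append((stream, item))
        self.buffer.append((stream, item))
        for model in self.subscribers.get(stream, []):
            model.on_data(stream, item, self.buffer)

    def query(self, stream):
        """Return the complete history of one stream (slow path for pull models)."""
        return [item for name, item in self.store if name == stream]


class PushModel:
    """Registers for streams at initialization; logic runs on every new item."""

    def __init__(self, platform, streams):
        self.seen = []
        for stream in streams:
            platform.register(stream, self)

    def on_data(self, stream, item, buffer):
        # Only the small shared buffer is available here, which keeps latency low.
        self.seen.append((item, len(buffer.recent(10))))


class PullModel:
    """Requests data itself and can use the full history, at the expense of speed."""

    def __init__(self, platform):
        self.platform = platform

    def run(self, stream):
        history = self.platform.query(stream)
        return sum(history) / len(history) if history else None


platform = Platform(buffer_size=3)
push = PushModel(platform, ["prices"])
pull = PullModel(platform)
for value in [10, 20, 30, 40]:
    platform.publish("prices", value)
average = pull.run("prices")
```

In this sketch the push model never sees more than three items of history because the shared buffer is bounded, while the pull model averages over all four published values. That mirrors the trade-off described above: speed at the cost of small buffers versus unlimited history at the cost of speed.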
10 Conclusions

As discussed, the easy availability of APIs provided by Twitter, Facebook and News services has led to an ‘explosion’ of data services and software tools for scraping and sentiment analysis, and social media analytics platforms. This paper surveys some of the social media software tools and, for completeness, introduces social media scraping, data cleaning and sentiment analysis.

Perhaps the biggest concern is that companies are increasingly restricting access to their data to monetize their content. It is important that researchers have access to computational environments and especially ‘big’ social media data for experimentation. Otherwise, computational social science could become the exclusive domain of major companies, government agencies and a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Arguably, what is required are public-domain computational environments and data facilities for quantitative social science, which can be accessed by researchers via a cloud-based facility.

Acknowledgments

The authors would like to acknowledge Michal Galas, who led the design and implementation of the UCL SocialSTORM platform with the assistance of Ilya Zheludev, Kacper Chwialkowski and Dan Brown. Dr. Christian Hesse of Deutsche Bank is also acknowledged for collaboration on News Analytics.

Open Access

This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
References

Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. J Comput Sci 2(3):1–8
Botan I et al (2010) SECRET: a model for analysis of the execution semantics of stream processing systems. Proc VLDB Endow 3(1–2):232–243
Chandramouli B et al (2010) Data stream management systems for computational finance. IEEE Comput 43(12):45–52
Chandrasekar C, Kowsalya N (2011) Implementation of MapReduce Algorithm and Nutch Distributed File System in Nutch. Int J Comput Appl 1:6–11
Cioffi-Revilla C (2010) Computational social science. Wiley Interdiscip Rev Comput Statistics 2(3):259–271
Galas M, Brown D, Treleaven P (2012) A computational social science environment for financial/economic experiments. In: Proceedings of the Computational Social Science Society of the Americas, vol 1, pp 1–13
Hebrail G (2008) Data stream management and mining. In: Fogelman-Soulié F, Perrotta D, Piskorski J, Steinberger R (eds) Mining Massive Data Sets for Security. IOS Press, pp 89–102
Hirudkar AM, Sherekar SS (2013) Comparative analysis of data mining tools and techniques for evaluating performance of database system. Int J Comput Sci Appl 6(2):232–237
Kaplan AM (2012) If you love something, let it go mobile: mobile marketing and mobile social media 4x4. Bus Horiz 55(2):129–139
Kaplan AM, Haenlein M (2010) Users of the world, unite! the challenges and opportunities of social media. Bus Horiz 53(1):59–68
Karabulut Y (2013) Can Facebook predict stock market activity? SSRN eLibrary, pp 1–58. http://ssrn.com/abstract=2017099 or http://dx.doi.org/10.2139/ssrn.2017099. Accessed 2 Feb 2014
Khan A, Baharudin B, Lee LH, Khan K (2010) A review of machine learning algorithms for text-documents classification. J Adv Inf Technol 1(1):4–20
Kobayashi M, Takeda K (2000) Information retrieval on the web. ACM Comput Surv CSUR 32(2):144–173
Lazer D et al (2009) Computational social science. Science 323:721–723
Lerman K, Gilder A, Dredze M, Pereira F (2008) Reading the markets: forecasting public opinion of political candidates by news analysis. In: Proceedings of the 22nd international conference on computational linguistics 1:473–480
MapReduce (2011) What is MapReduce?. http://www.mapreduce.org/what-is-mapreduce.php. Accessed 31 Jan 2014
Mejova Y (2009) Sentiment analysis: an overview, pp 1–34. http://www.academia.edu/291678/Sentiment_Analysis_An_Overview. Accessed 4 Nov 2013
Murphy KP (2006) Naive Bayes classifiers. University of British Columbia, pp 1–8. http://www.ic.unicamp.br/~rocha/teaching/2011s1/mc906/aulas/naivebayes.pdf
Murphy KP (2012) Machine learning: a probabilistic perspective. In: Chapter 1: Introduction. MIT Press, pp 1–26
Narang RK (2009) Inside the black box. Hoboken, New Jersey
Nuti G, Mirghaemi M, Treleaven P, Yingsaeree C (2011) Algorithmic trading. IEEE Comput 44(11):61–69
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
Salathé M et al (2012) Digital epidemiology. PLoS Comput Biol 8(7):1–5
SAS Institute Inc (2013) SAS sentiment analysis factsheet. http://www.sas.com/resources/factsheet/sas-sentiment-analysis-factsheet.pdf. Accessed 6 Dec 2013
Teufl P, Payer U, Lackner G (2010) From NLP (natural language processing) to MLP (machine language processing). In: Kotenko I, Skormin V (eds) Computer network security, Springer, Berlin Heidelberg, pp 256–269
Thomson Reuters (2010) Thomson Reuters news analytics. http://thomsonreuters.com/products/financial-risk/01_255/News_Analytics_-_Product_Brochure-_Oct_2010_1_.pdf. Accessed 1 Oct 2013
Thomson Reuters (2012) Thomson Reuters machine readable news. http://thomsonreuters.com/products/financial-risk/01_255/TR_MRN_Overview_10Jan2012.pdf. Accessed 5 Dec 2013
Thomson Reuters (2012) Thomson Reuters MarketPsych Indices. http://thomsonreuters.com/products/financial-risk/01_255/TRMI_flyer_2012.pdf. Accessed 7 Dec 2013
Thomson Reuters (2012) Thomson Reuters news analytics for internet news and social media. http://thomsonreuters.com/business-unit/financial/eurozone/112408/news_analytics_and_social_media. Accessed 7 Dec 2013
Thomson Reuters (2013) Machine readable news. http://thomsonreuters.com/machine-readable-news/?subsector=thomson-reuters-elektron. Accessed 18 Dec 2013
Turney PD (2002) Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp 417–424
Vaswani V (2011) Hook into Wikipedia information using PHP and the MediaWiki API. http://www.ibm.com/developerworks/web/library/x-phpwikipedia/index.html. Accessed 21 Dec 2012
Westerski A (2008) Sentiment analysis: introduction and the state of the art overview. Universidad Politecnica de Madrid, Spain, pp 1–9. http://www.adamwesterski.com/wpcontent/files/docsCursos/sentimentA_doc_TLAW.pdf. Accessed 14 Aug 2013
Wikimedia Foundation (2014) Wikipedia: Database download. http://en.wikipedia.org/wiki/Wikipedia:Database_download. Accessed 18 Apr 2014
Wolfram SMA (2010) Modelling the stock market using Twitter. Dissertation Master of Science thesis, School of Informatics, University of Edinburgh, pp 1–74. http://homepages.inf.ed.ac.uk/miles/msc-projects/wolfram.pdf. Accessed 23 Jul 2013
Yessenov K, Misailovic S (2009) Sentiment analysis of movie review comments, pp 1–17. http://people.csail.mit.edu/kuat/courses/6.863/report.pdf. Accessed 16 Aug 2013