Diffusion of Lexical Change in Social Media - PLOS

Abstract
Computer-mediated communication is driving fundamental changes in the nature of written language. We investigate these changes by statistical analysis of a dataset comprising 107 million Twitter messages (authored by 2.7 million unique user accounts). Using a latent vector autoregressive model to aggregate across thousands of words, we identify high-level patterns in diffusion of linguistic change over the United States. Our model is robust to unpredictable changes in Twitter's sampling rate, and provides a probabilistic characterization of the relationship of macro-scale linguistic influence to a set of demographic and geographic predictors. The results of this analysis offer support for prior arguments that focus on geographical proximity and population size. However, demographic similarity, especially with regard to race, plays an even more central role, as cities with similar racial demographics are far more likely to share linguistic influence. Rather than moving towards a single unified “netspeak” dialect, language evolution in computer-mediated communication reproduces existing fault lines in spoken American English.
Citation: Eisenstein J, O'Connor B, Smith NA, Xing EP (2014) Diffusion of Lexical Change in Social Media. PLoS ONE 9(11): e113114.
https://doi.org/10.1371/journal.pone.0113114
Editor: Robert C. Berwick, Massachusetts Institute of Technology, United States of America
Received: June 27, 2014; Accepted: October 20, 2014; Published: November 19, 2014
Copyright: © 2014 Eisenstein et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The authors confirm that, for approved reasons, some access restrictions apply to the data underlying the findings. The text data in this paper was acquired from Twitter's streaming API, and redistribution of the raw text is prohibited by their terms of service (TOS). A complete word list and the associated annotations are provided as a supporting document. Public dissemination of the Tweet IDs will enable other researchers to obtain this data from Twitter's API, except for messages which have been deleted by their authors. Tweet IDs can be obtained by emailing the corresponding author.
Funding: This work was supported by National Science Foundation grants IIS-1111142 and IIS-1054319, by Google's support of the Reading is Believing project at Carnegie Mellon University, and by a computing resources award from Amazon Web Services. This work was supported by computing resources from the Open Source Data Cloud (OSDC), which is an Open Cloud Consortium (OCC)-sponsored project. OSDC usage was supported in part by grants from Gordon and Betty Moore Foundation and the National Science Foundation, and by major contributions from OCC members like the University of Chicago. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: BO and NAS were supported by Google's support of the Reading is Believing project at Carnegie Mellon University. This study was also supported by a computing resources award from Amazon Web Services. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.
Introduction
An increasing proportion of informal communication is conducted in written form, mediated by technology such as smartphones and social media platforms. Written language has been forced to adapt to meet the demands of synchronous conversation, resulting in a creative burst of new forms, such as emoticons, abbreviations, phonetic spellings, and other neologisms [1]–[3]. Such changes have often been considered as a single, uniform dialect, both by researchers [4], [5] and throughout the popular press [5], [6]. But despite the fact that social media facilitates instant communication between distant corners of the earth, the adoption of new written forms is often sharply delineated by geography and demographics [7]–[9]. For example, in our corpus of social media text from 2009 to 2012, the abbreviation ikr (I know, right?) occurs six times more frequently in the Detroit area than in the United States overall; one emoticon occurs four times more frequently in Southern California; the phonetic spelling suttin (something) occurs five times more frequently in New York City.
These differences raise questions about how language change spreads in online communication. What groups are influential, and which communities evolve together? Is written language moving toward global standardization or increased fragmentation? As language is a crucial constituent of personal and group identity, examination of the competing social factors that drive language change can shed new light on the hidden structures that shape society. This paper offers a new technique for inducing networks of linguistic influence and co-evolution from raw word counts. We then seek explanations for this network in a set of demographic and geographic predictors, using a logistic regression in which these predictors are used to explain the induced transmission pathways.
A wave of recent research has shown how social media datasets can enable large-scale analysis of patterns of communication [10], [11], sentiment [12]–[14], and influence [15]–[19]. Such work has generally focused on tracking the spread of discrete behaviors, such as using a piece of software [16], reposting duplicate or near-duplicate content [10], [20], [21], voting in political elections [17], or posting a hyperlink to online content [18], [19]. Tracking linguistic changes poses a significant additional challenge, as we are concerned not with the first appearance of a word, but with the bursts and lulls in its popularity over time [22]. In addition, the well-known “long-tail” nature of both word counts and city sizes [23] ensures that most counts for words and locations will be sparse, rendering simple frequency-based methods inadequate.
Language change has long been an active area of research, and a variety of theoretical models have been proposed. In the wave model, linguistic innovations spread through interactions over the course of an individual's life, so the movement of linguistic innovation from one region to another depends on the density of interactions [24]. In the simplest version of this model, the probability of contact between two individuals depends on their distance, so linguistic innovations should diffuse continuously through space. The gravity model combines population and geographical distance: starting from the premise that the likelihood of contact between individuals from two cities depends on the size of the cities as well as their distance, this model predicts that linguistic innovations will travel between large cities first [25]. The closely-related cascade model focuses on differences in population, arguing that linguistic changes will proceed from the largest cities to the next largest, passing over sparsely populated intermediate geographical areas [26]. Quantitative validation of these models has focused on edit-distance metrics of pronunciation differences amongst European dialects, with mixed findings on the relative importance of geography and population [27]–[29].
Cultural factors also play an important role in both the diffusion of, and resistance to, language change. Many words and phrases have entered the standard English lexicon from minority dialects [30]; conversely, there is evidence that minority groups in the United States resist regional sound changes associated with European American speakers [31], and that racial differences in speech persist even in conditions of very frequent social contact [32]. At present there are few quantitative sociolinguistic accounts of how geography and demographics interact [33]; nor are their competing roles explained in the menagerie of theoretical models of language change, such as evolutionary biology [34], [35], dynamical systems [36], Nash equilibria [37], Bayesian learners [38], and agent-based simulations [39]. In general, such research is concerned with demonstrating that a proposed theoretical framework can account for observed phenomena like geographical distribution of linguistic features and their rate of adoption over time. In contrast, this paper takes a data-driven approach, fitting a model to a large corpus of text data from individual language users, and analyzing the social meaning of the resulting parameters.
Research on reconstructing language phylogenies from cognate tables is also related [40]–[43], but rather than a phylogenetic process in which languages separate and then develop in relative independence, we have closely-related varieties of a single language, which are in constant interaction. Other researchers have linked databases of typological linguistic features (such as morphological complexity) with geographical and social properties of the languages' speech communities [44]. Again, our interest is in more subtle differences within the same language, rather than differences across the entire set of world languages. The typological atlases and cognate tables that are the basis of such work are inapplicable to our problem, requiring us to take a corpus-based approach [45], estimating an influence network directly from raw text.
The overall aim of this work is to build a computational model capable of identifying the demographic and geographic factors that drive the spread of newly popular words in online text. To this end, we construct a statistical procedure for recovering networks of linguistic diffusion from raw word counts, even as the underlying social media sampling rate changes unaccountably. We present a procedure for Bayesian inference in this model, capturing uncertainty about the induced diffusion network. We then consider a range of demographic and geographic factors that might explain the networks induced from this model, using a post hoc logistic regression analysis. This lends support to prior work on the importance of population and geography, but reveals a strong role for racial homophily at the level of city-to-city linguistic influence.
Materials and Methods
We conducted a statistical analysis of a corpus of public data from the microblog site Twitter, from 2009–2012. The corpus includes 107 million messages, mainly in English, from more than 2.7 million unique user accounts. Each message contains GPS coordinates of locations in the continental United States. The data was temporally aggregated into 165 week-long bins. After taking measures to remove marketing-oriented accounts, each user account was associated with one of the 200 largest Metropolitan Statistical Areas (MSA) in the United States, based on their geographical coordinates. The 2010 United States Census provides detailed demographics for MSAs. By linking this census data to changes in word frequencies, we can obtain an aggregate picture of the role of demographics in the diffusion of linguistic change in social media.
Empirical research suggests that Twitter's user base is younger, more urban, and more heavily composed of ethnic minorities, in comparison with the overall United States population [46], [47]. Our analysis does not assume that Twitter users are a representative demographic sample of their geographic areas. Rather, we assume that on a macro scale, the diffusion of words between metropolitan areas depends on the overall demographic properties of those areas, and not on the demographic properties specific to the Twitter users that those areas contain. Alternatively, the use of population-level census statistics can be justified on the assumption that the demographic skew introduced by Twitter (for example, towards younger individuals) is approximately homogeneous across cities. Table 1 shows the average demographics for the 200 MSAs considered in our study.
Table 1. Statistics of metropolitan statistical areas. https://doi.org/10.1371/journal.pone.0113114.t001
Linguistically, our analysis begins with the 100,000 most frequent terms overall. We narrow this list to 4,854 terms whose frequency changed significantly over time; the excluded terms have little dynamic range, so they would not substantially affect the model parameters, but would increase the computational cost if included. We then manually refine this list to 2,603 English words, by excluding names, hashtags, and foreign language terms. A complete list of terms is given in Appendix S1 in File S1, examples of each term are given in Appendix S2 in File S1, and more detailed procedures for data acquisition are given in Appendix S3 in File S1. Manual annotations of each term are given in Table S1 in File S1, and the software for our data preprocessing pipeline is given in Software S1 in File S1.
Figure 1 shows the geographical distribution of six words over time. The first row shows the word ion, which is a shortened form of I don't, as in ion even care. Systematically coding a random sample of 300 occurrences of the string ion in our dataset revealed two cases of the traditional chemistry sense of ion, and 294 cases that clearly matched I don't. This word displays increasing popularity over time, but remains strongly associated with the Southeast. In contrast, the second row shows the emoticon -_- (indicating annoyance), which spreads from its initial bases in coastal cities to nationwide popularity. The third row shows the abbreviation ctfu, which stands for cracking the fuck up (i.e., laughter). At the beginning of the sample it is active mainly in the Cleveland area; by the end, it is widely used in Pennsylvania and the mid-Atlantic, but remains rare in the large cities to the west of Cleveland, such as Detroit and Chicago. What explains the non-uniform spread of this term's popularity?
Figure 1. Change in frequency for six words: ion, -_-, ctfu, af, ikr, ard. Blue circles indicate cities where on average, at least 0.1% of users use the word during a week. A circle's area is proportional to the word's probability.
https://doi.org/10.1371/journal.pone.0113114.g001
While individual examples are intriguing, we seek an aggregated account of the spatiotemporal dynamics across many words, which we can correlate against geographic and demographic properties of metropolitan areas. Due to the complexity of drawing inferences about influence and demographics from raw word counts, we perform this process in stages. A block diagram of the procedure is shown in Figure 2. First, we model word frequencies as a dynamical system, using Bayesian inference over the latent spatiotemporal activation of each word. We use sequential Monte Carlo [48] to approximate the distribution over spatiotemporal activations with a set of samples. Within each sample, we induce a model of the linguistic dynamics between metropolitan areas, which we then discretize into a set of pathways. Finally, we perform logistic regression to identify the geographic and demographic factors that correlate with the induced linguistic pathways. By aggregating across samples, we can estimate the confidence intervals of the resulting logistic regression parameters.
Figure 2. Block diagram for our statistical modeling procedure. The dotted outline indicates repetition across samples drawn from sequential Monte Carlo.
https://doi.org/10.1371/journal.pone.0113114.g002
Modeling spatiotemporal lexical dynamics in social media data
This section describes our approach for modeling lexical dynamics in our data. We represent our data as counts $c_{w,r,t}$, which is the number of individuals who used the word $w$ at least once in MSA $r$ at time $t$ (i.e., one week). (Mathematical notation is summarized in Table 2. We do not consider the total number of times a word is used, since there are many cases of a single individual using a single word hundreds or thousands of times.) To capture the dynamics of these counts, we employ a latent vector autoregressive model, based on the binomial distribution with a logistic link function. The use of latent variable modeling is motivated by properties of the data that are problematic for simpler autoregressive models that operate directly on word counts and frequencies (without a latent variable). We begin by briefly summarizing these problems; we then present our model, describe the details of inference and estimation, and offer some examples of the inferences that our model supports.
Table 2. Table of mathematical notation. https://doi.org/10.1371/journal.pone.0113114.t002
Challenges for direct autoregressive models. The simplest modeling approach would be an autoregressive model that operates directly on the word counts or frequencies [49]. A major challenge for such models is that Twitter offers only a sample of all public messages, and the sampling rate can change in unclear ways [50]. For example, for much of the time span of our data, Twitter's documentation implies that the sampling rate is approximately 10%; but in 2010 and earlier, the sampling rate appears to be 15% or 5%. (This estimate is based on inspection of message IDs modulo 100, which appears to be how sampling was implemented at that time.) After 2010, the volume growth in our data is relatively smooth, implying that the sampling is fair (unlike findings of [50], which focus on a more problematic case involving query filters, which we do not use).
Raw counts are not appropriate for analysis, because the MSAs have wildly divergent numbers of users and messages. New York City has four times as many active users as the 10th largest metropolitan area (San Francisco-Oakland, CA), twenty times as many as the 50th largest (Oklahoma City, OK), and 200 times as many as the 200th largest (Yakima, WA); these ratios are substantially larger when we count messages instead of active users. This necessitates normalizing the counts to frequencies $p_{w,r,t} = c_{w,r,t} / s_{r,t}$, where $s_{r,t}$ is the number of individuals who have written at least one message in region $r$ at time $t$. The resulting frequency $p_{w,r,t}$ is the empirical probability that a random user in $(r, t)$ used the word $w$. Word frequencies treat large and small cities more equally, but suffer from several problems:
The frequency is not invariant to a change in the sampling rate: if, say, half the messages are removed, the probability of seeing a user use any particular word goes down, because $s_{r,t}$ will decrease more slowly than $c_{w,r,t}$ for any $w$. The changes to the global sampling rate in our data drastically impact $p_{w,r,t}$.
Users in different cities can be more or less actively engaged with Twitter: for example, the average New Yorker contributed 55 messages to our dataset, while the average user within the San Francisco-Oakland metropolitan area contributed 21 messages. Most cities fall somewhere in between these extremes, but again, this “verbosity” may change over time.
Word popularities can be driven by short-lived global phenomena, such as holidays or events in popular culture (e.g., TV shows, movie releases), which are not interesting from the perspective of persistent changes to the lexicon. We manually removed terms that directly refer to such events (as described in Appendix S3 in File S1), but there may be unpredictable second-order phenomena, such as an emphasis on words related to outdoor cooking and beach trips during the summer, and complaints about boredom during the school year.
Due to the long-tail nature of both word counts and city populations [51], many word counts in many cities are zero at any given point in time. This floor effect means that least squares models, such as Pearson correlations or the Kalman smoother, are poorly suited for this data, in either the $c$ or $p$ representations.
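The first problem in the list above, that per-user frequencies shift when the sampling rate changes, can be illustrated with a small calculation. The sketch below is not from the paper: it assumes a hypothetical population in which every user writes the same number of messages and each message is kept independently with probability `rate`. All function and parameter names are illustrative.

```python
def expected_frequency(rate, share_using_word, msgs_per_user, msgs_with_word):
    """Expected ratio c/s when each message is kept with probability `rate`.

    Hypothetical setup: every user writes `msgs_per_user` messages, and a
    fraction `share_using_word` of users include the word in exactly
    `msgs_with_word` of their messages.
    """
    # A user is counted in s if at least one of their messages survives.
    p_user_seen = 1.0 - (1.0 - rate) ** msgs_per_user
    # A user is counted in c if at least one word-bearing message survives.
    p_word_seen = 1.0 - (1.0 - rate) ** msgs_with_word
    return share_using_word * p_word_seen / p_user_seen


# 10% of users use the word once among 20 messages.
full = expected_frequency(1.0, 0.10, msgs_per_user=20, msgs_with_word=1)
half = expected_frequency(0.5, 0.10, msgs_per_user=20, msgs_with_word=1)
print(full, half)
```

Under these assumptions, halving the sampling rate roughly halves the observed frequency even though nobody changed how they write, because almost every user still has at least one surviving message while the single word-bearing message is lost half the time.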
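Latent vector autoregressive model. To address these issues, we build a latent variable model that controls for these confounding effects, yielding a better view of the underlying frequency dynamics for each word. Instead of working with raw frequencies $p_{w,r,t}$, we perform inference over latent variables $\eta_{w,r,t}$, which represent the underlying activation of word $w$ in MSA $r$ at time $t$. We can convert between these two representations using the logistic transformation, $p_{w,r,t} = \mathrm{Logistic}(\eta_{w,r,t})$, where $\mathrm{Logistic}(\eta) = 1 / (1 + e^{-\eta})$. We will estimate each $\eta_{w,r,t}$ by maximizing the likelihood of the observed count data $c_{w,r,t}$, which we treat as a random draw from a binomial distribution, with the number of trials equal to $s_{r,t}$, and the frequency parameter equal to $\mathrm{Logistic}(\eta_{w,r,t})$.
An $\eta$-only model, therefore, would be
$$c_{w,r,t} \sim \mathrm{Binomial}\big(s_{r,t},\ \mathrm{Logistic}(\eta_{w,r,t})\big) \qquad (1)$$
This is a very simple generalized linear model with a logit link function [52], in which the maximum likelihood estimate of $\eta_{w,r,t}$ would simply be a log-odds reparameterization of the probability of a user using the word, $p_{w,r,t}$. By itself, this model corresponds to directly using $p_{w,r,t}$, and has all the same problems as noted in the previous section; in addition, the estimate goes to negative infinity when $c_{w,r,t} = 0$.
The advantage of the logistic binomial parameterization is that it allows an additive combination of effects to control for confounds. To this end, we include two additional parameters $\nu_{w,t}$ and $\mu_{r,t}$:
$$c_{w,r,t} \sim \mathrm{Binomial}\big(s_{r,t},\ \mathrm{Logistic}(\eta_{w,r,t} + \nu_{w,t} + \mu_{r,t})\big) \qquad (2)$$
The parameter $\nu_{w,t}$ represents the overall activation of the word $w$ at time $t$, thus accounting for non-geographical changes, such as when a word becomes more popular everywhere at once. The parameter $\mu_{r,t}$ represents the “verbosity” of MSA $r$ at time $t$, which varies for the reasons mentioned above. These parameters control for global effects due to $t$, such as changes to the API sampling rate. (Because $\nu$ and $\mu$ both interact with $t$, it is unnecessary to introduce a main effect for $t$.) In this model, the $\eta$ variables still represent differences in log-odds, but after controlling for “base rate” effects; they can be seen as an adjustment to the base rate, and can be estimated with greater stability.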
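As a rough illustration of the observation model in Equation 2, the sketch below evaluates the binomial log-likelihood for a single word, region and week. It assumes scalar inputs; the names `eta`, `nu` and `mu` mirror the activation, word-level and verbosity terms described above, and the counts are made up for the example.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def binomial_loglik(c, s, eta, nu=0.0, mu=0.0):
    """Log-likelihood of c word users among s active users (Equation 2)."""
    p = logistic(eta + nu + mu)
    # log of the binomial coefficient "s choose c", via log-gamma
    log_choose = math.lgamma(s + 1) - math.lgamma(c + 1) - math.lgamma(s - c + 1)
    return log_choose + c * math.log(p) + (s - c) * math.log(1.0 - p)


# Made-up counts: 30 of 1000 active users used the word this week.
c, s = 30, 1000
eta_hat = logit(c / s)  # with nu = mu = 0, the MLE is the log-odds of c/s
print(eta_hat, binomial_loglik(c, s, eta_hat))
```

With the offsets set to zero this reduces to the activation-only model of Equation 1, whose maximizer is the log-odds of the raw frequency. A non-zero `nu` or `mu` shifts the maximizing `eta` by the same amount in the opposite direction, which is the sense in which the activation becomes an adjustment to a base rate.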
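We can now measure lexical dynamics in terms of the latent variable $\eta$ rather than the raw counts $c$. We take the simplest possible approach, modeling $\eta$ as a first-order linear dynamical system with Gaussian noise [53],
$$\eta_{w,r,t} \sim \mathrm{Normal}\Big(\sum_{r'} a_{r',r}\,\eta_{w,r',t-1},\ \sigma^2_{w,r}\Big) \qquad (3)$$
The dynamics matrix $A = \{a_{r',r}\}$ is shared over both words and time; we also assume homogeneity of variance within each metropolitan area (per word), using the variance parameter $\sigma^2_{w,r}$. These simplifying assumptions are taken to facilitate statistical inference, by keeping the number of parameters at a reasonable size. If it is possible to detect clear patterns of linguistic diffusion under this linear homoscedastic model, then more flexible models should show even stronger effects, if they can be estimated successfully; we leave this for future work. It is important to observe that this model does differentiate directionality: in general, $a_{r_1,r_2} \neq a_{r_2,r_1}$. The coefficient $a_{r_1,r_2}$ reflects the extent to which $\eta_{r_1,t}$ predicts $\eta_{r_2,t+1}$, and vice versa for $a_{r_2,r_1}$. In the extreme case that $r_1$ ignores $r_2$, while $r_2$ imitates $r_1$ perfectly, we will have $a_{r_1,r_2} = 1$ and $a_{r_2,r_1} = 0$. Note that both coefficients can be positive, in the case that $\eta_{r_1}$ and $\eta_{r_2}$ evolve smoothly and synchronously; indeed, such mutual connections appear frequently in the induced networks.
Equation 2 specifies the observation model, and Equation 3 specifies the dynamics model; together, they specify the joint probability distribution,
$$P(\eta, c \mid s; A, \sigma^2, \nu, \mu) = \prod_{w}\prod_{r}\prod_{t} P(\eta_{w,r,t} \mid \eta_{w,\cdot,t-1}; A, \sigma^2_{w,r})\, P(c_{w,r,t} \mid \eta_{w,r,t}, s_{r,t}; \nu_{w,t}, \mu_{r,t}) \qquad (4)$$
where we omit subscripts to indicate the probability of all $\eta$ and $c$, given all $s$ and the model parameters.
Because the observation model is non-Gaussian, the standard Kalman smoother cannot be applied. Inference under non-Gaussian distributions is often handled via second-order Taylor approximation, as in the extended Kalman filter [53], but a second-order approximation to the Binomial distribution is unreliable when the counts are small. In contrast, sequential Monte Carlo sampling permits arbitrary parametric distributions for both the observations and system dynamics [54]. Forward-filtering backward sampling [48] gives smoothed samples from the distribution $P(\eta \mid c, s)$, so for each word $w$, we obtain a set of sample trajectories $\eta^{(k)}_{w,\cdot,\cdot}$, where $k \in \{1, \ldots, K\}$ indexes the sample. Monte Carlo approximation becomes increasingly accurate as $K \to \infty$ [54], but we found little change in the overall results for larger values of $K$.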
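The directionality of the dynamics matrix can be seen in a toy simulation of the first-order system in Equation 3. This is only a sketch with two hypothetical regions; the indexing convention `a[src][dst]` (influence of the source region on the destination region at the next step) is an assumption made for the example, and the noise is switched off so that the pattern is exact.

```python
import random


def simulate(a, eta0, steps, sigma=0.0, seed=0):
    """Simulate eta_t[dst] = sum_src a[src][dst] * eta_{t-1}[src] + noise."""
    rng = random.Random(seed)
    n = len(eta0)
    path = [list(eta0)]
    for _ in range(steps):
        prev = path[-1]
        nxt = [
            sum(a[src][dst] * prev[src] for src in range(n)) + rng.gauss(0.0, sigma)
            for dst in range(n)
        ]
        path.append(nxt)
    return path


# Region 0 follows its own decaying trajectory and ignores region 1;
# region 1 copies whatever region 0 did one step earlier.
a = [
    [0.9, 1.0],  # influence of region 0 on (region 0, region 1)
    [0.0, 0.0],  # influence of region 1 on (region 0, region 1)
]
path = simulate(a, eta0=[1.0, 0.0], steps=5)
for t, state in enumerate(path):
    print(t, state)
```

Here the coefficient from region 0 to region 1 is one and the reverse coefficient is zero, matching the extreme case of one-way imitation described in the text. Setting `sigma` above zero adds the Gaussian noise term.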
Inference and estimation. The total dimension of $\eta$ is equal to the product of the number of MSAs (200), words (2,603), and time steps (165), requiring inference over 85 million interrelated random variables. To facilitate inference and estimation, we adopt a stagewise procedure. First we make estimates of the parameters $\nu$ (overall activation for each word) and $\mu$ (region-specific verbosity), assuming $\eta = 0$. Next, we perform inference over $\eta$, assuming a simplified dynamics matrix, which is diagonal. Last, we perform inference over the full dynamics matrix $A$, under the resulting distribution over $\eta$; this procedure is described in the next section. See Figure 2 for a block diagram of the inference and estimation procedure.
The parameters $\nu$ (global word activation) and $\mu$ (region-specific verbosity) are estimated first. We begin by computing a simplified $\nu_w$ as the inverse logistic function of the total frequency of word $w$, across all time steps. Next, we compute the maximum likelihood estimates of each $\mu_{r,t}$ via gradient descent. We then hold $\mu$ fixed, and compute the maximum likelihood estimates of each $\nu_{w,t}$.
Inference over the latent spatiotemporal activations $\eta$ is performed via Monte Carlo Expectation Maximization (MCEM) [55]. For each word $w$, we construct a diagonal dynamics matrix $A_w$. Given estimates of $A_w$ and $\sigma^2_w$, we use the sequential Monte Carlo (SMC) algorithm of forward-filtering backward sampling (FFBS) [48] to draw samples of $\eta_{w,\cdot,\cdot}$; this constitutes the E-step of the MCEM process. Next, we apply maximum-likelihood estimation to update $A_w$ and $\sigma^2_w$; this constitutes the M-step. These updates are repeated until either the parameters converge or we reach a limit of twenty iterations. We now describe each step in more detail:
E-step. The E-step consists of drawing samples from the posterior distribution over $\eta$. FFBS appends a backward pass to any SMC filter that produces a set of hypotheses $\eta^{(k)}$ and weights $\omega^{(k)}$, indexed by $k$. The weight $\omega^{(k)}$ represents the likelihood of the hypothesis $\eta^{(k)}$, so that the expected value $E[\eta] = \sum_k \omega^{(k)} \eta^{(k)}$. The role of the backward pass is to reduce variance by resampling the hypotheses according to the joint smoothing distribution. Our forward pass is a standard bootstrap filter [54]: by setting the proposal distribution equal to the transition distribution, the forward weights are equal to the recursive product of the observation likelihoods,
$$\omega^{(k)}_{w,r,t} \propto \omega^{(k)}_{w,r,t-1}\, P(c_{w,r,t} \mid \eta^{(k)}_{w,r,t}, s_{r,t}; \nu_{w,t}, \mu_{r,t}) \qquad (5)$$
The backward pass uses these weights, and returns a set of unweighted hypotheses that are drawn directly from the smoothing distribution. More complex SMC algorithms (such as resampling, annealing, and more accurate proposal distributions) did not achieve higher likelihood than the bootstrap filter.
M-step. The M-step consists of computing the average of the maximum likelihood estimates of $A_w$ and $\sigma^2_w$. Within each sample, maximum likelihood estimation is straightforward: the dynamics matrix is obtained by least squares, and $\sigma^2_w$ is set to the empirical variance.
Examples. Figure 3 shows the result of this modeling procedure for several example words. In the right panel, each sample of $\eta$ is shown with a light dotted line. In the left panel, the empirical word frequencies are shown with circles, and the smoothed frequencies for each sample are shown with dotted lines. Large cities generally have a lower variance over samples, because the variance of the maximum a posteriori estimate of the binomial decreases with the total event count. For example, in Figure 3(c), the samples of $\eta$ are tightly clustered for Philadelphia (the sixth-largest MSA in the United States), but are diffuse for Youngstown (the 95th largest MSA). Note also that the relationship between frequency and $\eta$ is not monotonic: for example, the frequency of ion increases in Memphis over the duration of the sample, but the value of $\eta$ decreases. This is because of the parameter for background word activation, $\nu_{w,t}$, which increases as the word attains more general popularity. The latent variable model is thus able to isolate MSA-specific activation from nuisance effects that include the overall word activation and Twitter's changing sampling rate.
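The forward pass of the E-step can be sketched for a single word in a single region. The code below is a simplified illustration rather than the authors' implementation: it assumes a scalar autoregressive coefficient `a`, a fixed `offset` standing in for the already-estimated word and verbosity terms, and made-up counts. As in the description of the bootstrap filter, particles are proposed from the transition distribution and the weights accumulate observation likelihoods; the backward sampling pass of FFBS is omitted.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_binomial(c, s, p):
    log_choose = math.lgamma(s + 1) - math.lgamma(c + 1) - math.lgamma(s - c + 1)
    return log_choose + c * math.log(p) + (s - c) * math.log(1.0 - p)


def bootstrap_filter(counts, users, a=1.0, sigma=0.3, offset=0.0,
                     n_particles=200, seed=0):
    """Forward pass only; returns final particles and normalised weights."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    log_w = [0.0] * n_particles
    for c, s in zip(counts, users):
        # Propose each particle from the AR(1) transition distribution.
        particles = [a * eta + rng.gauss(0.0, sigma) for eta in particles]
        # Multiply in the observation likelihood (a sum in log space).
        log_w = [lw + log_binomial(c, s, logistic(eta + offset))
                 for lw, eta in zip(log_w, particles)]
    # Normalise with the log-sum-exp trick to avoid underflow.
    top = max(log_w)
    unnorm = [math.exp(lw - top) for lw in log_w]
    total = sum(unnorm)
    return particles, [u / total for u in unnorm]


# Made-up weekly data: word users out of 1000 active users.
counts = [3, 5, 4, 8, 10]
users = [1000] * 5
particles, weights = bootstrap_filter(counts, users, offset=-5.0)
posterior_mean = sum(w * eta for w, eta in zip(weights, particles))
print(posterior_mean)
```

The weighted average at the end corresponds to the expected-value expression given for the hypotheses and weights. Keeping the weights in log space is a numerical convenience added for this sketch.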
Figure 3. Left: empirical term frequencies (circles) and their Monte Carlo smoothed estimates (dotted lines); Right: Monte Carlo smoothed estimates of η. https://doi.org/10.1371/journal.pone.0113114.g003
Constructing a network of linguistic diffusion
Having obtained samples from the distribution over latent spatiotemporal activations, we now estimate the system dynamics, which describes the pathways of linguistic diffusion. Given the simple Gaussian form of the dynamics model (Equation 3), the coefficients $A$ can be obtained by ordinary least squares. We perform this estimation separately within each of the $K$ sequential Monte Carlo samples, obtaining dense matrices $A^{(k)}$, for $k \in \{1, \ldots, K\}$.
The coefficients of $A$ are not in meaningful units, and their relationship to demographics and geography will therefore be difficult to interpret, model, and validate. Instead, we prefer to use a binarized, network representation. Given such a network, we can directly compare the properties of linked MSAs with the properties of randomly selected pairs of MSAs not in the network, offering face validation of the proposed link between macro-scale linguistic influence and the demographic and geographic features of cities.
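The least-squares estimation of the dynamics coefficients within one sample can be sketched as follows. This is a minimal pure-Python illustration on a hypothetical two-region, noise-free trajectory. The optional `lam` argument adds a ridge penalty to the normal equations (with `lam=0` giving ordinary least squares), and the `a[src][dst]` layout is an assumption of the example.

```python
def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(m, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_dynamics(path, lam=0.0):
    """Estimate a[src][dst] from path[t][region]; lam > 0 adds a ridge penalty."""
    n = len(path[0])
    x, y = path[:-1], path[1:]
    # Normal equations: (X'X + lam * I) a[:, dst] = X' y[:, dst]
    gram = [[sum(row[i] * row[j] for row in x) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    a = [[0.0] * n for _ in range(n)]
    for dst in range(n):
        rhs = [sum(xr[i] * yr[dst] for xr, yr in zip(x, y)) for i in range(n)]
        column = solve(gram, rhs)
        for src in range(n):
            a[src][dst] = column[src]
    return a


# Hypothetical noise-free trajectory generated from a known matrix.
a_true = [[0.8, 0.3], [0.1, 0.5]]
path = [[1.0, -0.5]]
for _ in range(8):
    prev = path[-1]
    path.append([sum(a_true[src][dst] * prev[src] for src in range(2))
                 for dst in range(2)])

a_ols = fit_dynamics(path)
a_ridge = fit_dynamics(path, lam=1.0)
print(a_ols)
print(a_ridge)
```

On noise-free data the unpenalized fit recovers the generating matrix, while the penalized fit shrinks each column of coefficients toward zero. With 200 regions one would use a linear algebra library rather than hand-written elimination.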
Specifically, we are interested in a set of pairs of MSAs, $(r_1, r_2)$, for which we are confident that $a_{r_1,r_2} > 0$, given the uncertainty inherent in estimation across sparse word counts. Monte Carlo inference enables this uncertainty to be easily quantified: we compute z-scores $z_{r_1,r_2}$ for each ordered city pair, using the empirical mean and standard deviation of $a^{(k)}_{r_1,r_2}$ across samples. We select pairs whose z-score exceeds a threshold $\bar{z}$, which defines the selected set of city pairs. To compute uncertainty around a large number of coefficients, we apply the Benjamini-Hochberg False Discovery Rate (FDR) correction for multiple hypothesis testing [56], which controls the expected proportion of false positives in the selected set as
$$\mathrm{FDR}(\bar{z}) = \frac{P_{\mathrm{null}}(z > \bar{z})}{\hat{P}(z > \bar{z})} \qquad (6)$$
where $P_{\mathrm{null}}(z > \bar{z})$ is the probability, under a one-sided hypothesis, that $z$ exceeds $\bar{z}$ under a standard normal distribution, which we would expect if the coefficient values were random; this has probability $1 - \Phi(\bar{z})$, where $\Phi$ is the Gaussian CDF. $\hat{P}$ is the simulation-generated empirical distribution over $z$ values. If high z-scores occur much more often under the model ($\hat{P}$) than we would expect by chance ($P_{\mathrm{null}}$), only a small proportion should be expected to be false positives; the Benjamini-Hochberg ratio is an upper bound on the expected proportion of false positives in the selected set. At our target false discovery rate, 510 dynamics coefficients survive the corresponding individual test threshold; these indicate high-probability pathways of linguistic diffusion.
Figure 4 shows a sparser network, induced using a more stringent threshold of FDR < 0.001. The role of geography is apparent from the figure: there are dense connections within regions such as the Northeast, Midwest, and West Coast, and relatively few cross-country connections. For example, we observe many connections among the West Coast cities of San Diego, Los Angeles, San Jose, San Francisco, Portland, and Seattle (from bottom to top on the left side of the map), but few connections from these cities to other parts of the country.
Figure 4. Induced network, showing significant coefficients among the 40 most populous MSAs (using an FDR < 0.001 threshold, yielding 254 links). Blue edges represent bidirectional influence, when there are directed edges in both directions; orange links are unidirectional.
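The z-score and false discovery rate calculation described above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: each coefficient's z-score is taken as its mean across Monte Carlo samples divided by its sample standard deviation, and the FDR estimate is the ratio of the standard normal tail probability to the empirical share of z-scores above the threshold. The input numbers are made up.

```python
from statistics import NormalDist, mean, stdev


def z_score(samples):
    """z-score of one coefficient from its values across Monte Carlo samples."""
    return mean(samples) / stdev(samples)


def fdr_estimate(z_scores, threshold):
    """Null tail probability divided by the empirical tail probability."""
    p_null = 1.0 - NormalDist().cdf(threshold)
    p_emp = sum(z > threshold for z in z_scores) / len(z_scores)
    return p_null / p_emp if p_emp > 0 else float("nan")


# Made-up coefficient draws for one ordered city pair across three samples.
print(z_score([1.0, 2.0, 3.0]))

# Made-up z-scores for four ordered pairs; two exceed the threshold of 3.
zs = [0.1, 0.5, 4.0, 5.0]
print(fdr_estimate(zs, threshold=3.0))
```

Raising the threshold lowers the null tail probability faster than the empirical one when many coefficients are genuinely large, which is why a stricter threshold yields a sparser network with a lower estimated share of false positives.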
https://doi.org/10.1371/journal.pone.0113114.g004
Practical details. To avoid overfitting and degeneracy in the estimation of $A$, we place a zero-mean Gaussian prior on each element $a_{r_1,r_2}$, tuning the variance by grid search on the log-likelihood of a held-out subset of time slices within $\eta$. The maximum a posteriori estimate of $A$ can be computed in closed form via ridge regression. Lags of length greater than one are accounted for by regressing the values of $\eta_t$ against the moving average from the previous ten time steps. Results without this smoothing are broadly similar.
Geographic and demographic correlates of linguistic diffusion
By analyzing the properties of pairs of metropolitan areas that are connected in the network, we can quantify the geographic and demographic drivers of online language change. Specifically, we construct a logistic regression to identify the factors that are associated with whether a pair of cities have a strong linguistic connection. The positive examples are pairs of MSAs with strong transmission coefficients; an equal number of negative examples is sampled randomly from a distribution $Q$, which is designed to maintain the same empirical distribution of MSAs as appears in the positive examples. This ensures that each MSA appears with roughly the same frequency in the positive and negative pairs, eliminating a potential confound.
The independent variables in this logistic regression include geographic and demographic properties of pairs of MSAs. We include the following demographic attributes: median age, log median income, and the proportions of, respectively, African Americans, Hispanics, individuals who live in urbanized areas, and individuals who rent their homes. The proportion of European Americans was omitted because of a strong negative correlation with the proportion of African Americans; the proportion of Asian Americans was omitted because it is very low for the overwhelming majority of the 200 largest MSAs. These raw attributes are then converted into both asymmetric and symmetric predictors, using the raw difference and its absolute value. The symmetric predictors indicate pairs of cities that are likely to share influence; besides the demographic attributes, we include the geographical distance. The asymmetric predictors are properties that may make an MSA likely to be the driver of online language change. Besides the raw differences of the six demographic attributes, we include the log difference in population. All variables are standardized.
For a given demographic attribute, a negative regression coefficient for the absolute difference would indicate that similarity is important; a positive regression coefficient for the (asymmetric) raw difference would indicate that regions with large values of this attribute tend to be senders rather than receivers of linguistic innovations. For example, a strong positive coefficient for the asymmetric log difference in population would indicate that larger cities usually lead smaller ones, as proposed in the gravity and cascade models.
To visually verify the geographic distance properties of our model, Figure 5 compares networks obtained by discretizing $A^{(k)}$ against networks of randomly-selected MSA pairs, sampled from $Q$. Histograms of these distances are shown in Figure 6, and their average values are shown in Table 3. The networks induced by our model have many more short-distance connections than would be expected by chance. Table 3 also shows that many other demographic attributes are more similar among cities that are linked in our model's network.
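The construction of symmetric and asymmetric predictors from per-city attributes can be sketched as below. The attribute names and values are hypothetical, and the raw difference is taken as sender minus receiver, following the description that a positive coefficient marks attributes typical of senders.

```python
from statistics import mean, pstdev


def pair_predictors(sender, receiver):
    """Raw (asymmetric) and absolute (symmetric) differences for an ordered pair."""
    feats = {}
    for name in sender:
        diff = sender[name] - receiver[name]
        feats["raw_" + name] = diff
        feats["abs_" + name] = abs(diff)
    return feats


def standardize(rows):
    """Rescale every predictor to zero mean and unit variance across pairs."""
    out = [dict() for _ in rows]
    for name in rows[0]:
        column = [row[name] for row in rows]
        m, sd = mean(column), pstdev(column)
        for target, value in zip(out, column):
            target[name] = (value - m) / sd if sd > 0 else 0.0
    return out


# Hypothetical attribute values for three cities.
city_a = {"pct_renters": 0.30, "log_population": 15.0}
city_b = {"pct_renters": 0.10, "log_population": 13.0}
city_c = {"pct_renters": 0.25, "log_population": 14.0}

rows = [
    pair_predictors(city_a, city_b),
    pair_predictors(city_b, city_a),
    pair_predictors(city_a, city_c),
]
for row in standardize(rows):
    print(row)
```

Reversing the order of a pair flips the sign of each raw difference but leaves the absolute difference unchanged, which is what lets the regression separate "who leads" from "who is similar".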
Figure 5. Top: two sample networks inferred by the model. (Unlike Figure 4, all 200 cities are shown.) Bottom: two “negative” networks, sampled from Q; these are samples from the non-linked pair distribution Q, which is constructed to have the same marginal distributions over senders and receivers as in the inferred network. A blue line indicates directed edges in both directions between the pair of cities; orange lines are unidirectional. https://doi.org/10.1371/journal.pone.0113114.g005
Figure 6. Histograms of distances between pairs of connected cities, in model-inferred networks (top), versus “negative” networks from Q (bottom). https://doi.org/10.1371/journal.pone.0113114.g006
Table 3. Differences between linked and (sampled) non-linked pairs of cities, summarized by their mean and its standard error. https://doi.org/10.1371/journal.pone.0113114.t003
A logistic regression can show the extent to which each of the above predictors relates to the dependent variable, the binarized linguistic influence. However, the posterior uncertainty of the estimates of the logistic regression coefficients depends not only on the number of instances (MSA pairs), but principally on the variance in the Monte Carlo-based estimates for $A$, which in turn depends on the sampling variance and the size of the observed spatiotemporal word counts. To properly account for this complex variance, we run the logistic regression separately within each Monte Carlo sample, and report the empirical standard errors of the logistic coefficients across the samples.
Practical details. This procedure requires us to discretize the dynamics network within each sample. One solution would be to simply take the largest values; alternatively, we could take the coefficients for which we are most confident that they are distinct from zero. We strike a balance between these two extremes by sorting the dynamics coefficients according to the lower bound of their 95% confidence intervals. This ensures that we get city pairs for which the coefficient is significantly distinct from zero, but that we also emphasize large values rather than small values with low variance. Per-sample confidence intervals are obtained by computing the closed form solution to the posterior distribution over each dynamics coefficient, which, in ridge regression, is normally distributed. We can then compute the 95% confidence interval of the coefficients in each sample, and sort them by the bottom of this confidence interval: the posterior mean minus z times the posterior standard deviation, where z is the inverse Normal cumulative distribution function evaluated at 0.975 (z ≈ 1.96). We set the number of retained coefficients to the number of coefficients that pass the false discovery rate threshold in the aggregated network, as described in the previous section. This procedure yields one discretized influence network per sample, each with identical density to the aggregated network. By comparing the logistic regression coefficients obtained within each of these networks, it is possible to quantify the effect of uncertainty about the dynamics on the substantive inferences that we would like to draw about the diffusion of language change.
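The discretization rule above (rank coefficients by the lower bound of their 95% interval and keep a fixed number) can be sketched as follows. The function name, the argument names, and the toy posterior means and standard deviations are hypothetical; only the ranking rule is taken from the text.

```python
from statistics import NormalDist
import numpy as np

def discretize_by_lower_bound(post_mean, post_sd, n_links):
    """Keep the n_links coefficients with the largest lower 95% bound.

    post_mean, post_sd: arrays of posterior means and standard deviations
    of the dynamics coefficients within one sample (same shape).
    Returns a boolean mask of the same shape.
    """
    z = NormalDist().inv_cdf(0.975)           # about 1.96
    lower = post_mean - z * post_sd           # bottom of the 95% interval
    top = np.argsort(lower, axis=None)[::-1][:n_links]
    mask = np.zeros(lower.size, dtype=bool)
    mask[top] = True
    return mask.reshape(lower.shape)

# Toy 3-city example (made-up numbers); rows are senders, columns receivers.
post_mean = np.array([[0.0, 0.9, 0.3],
                      [0.5, 0.0, 0.05],
                      [0.4, 0.2, 0.0]])
post_sd = np.array([[1.0, 0.4, 0.01],
                    [0.05, 1.0, 0.01],
                    [0.3, 0.02, 1.0]])
network = discretize_by_lower_bound(post_mean, post_sd, n_links=3)
```

In this toy example the coefficient with the largest posterior mean (0.9) is not selected, because its wide interval gives it a lower bound below those of three smaller but more certain coefficients. This is the balance between magnitude and confidence that the text describes.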
Results

Figure 7 shows the resulting logistic regression coefficients. While geographical distance is prominent, the absolute difference in the proportion of African Americans is the strongest predictor: the more similar two metropolitan areas are in terms of this demographic, the more likely that linguistic influence is transmitted between them. Absolute difference in the proportion of Hispanics, residents of urbanized areas, and median income are also strong predictors. This indicates that while language change does spread geographically, demographics play a central role, and nearby cities may remain linguistically distinct if they differ demographically, particularly in terms of race. In spoken language, African American English differs more substantially from other American varieties than any regional dialect [57]; our analysis suggests that such differences persist in the virtual and disembodied realm of social media. Examples of linguistically linked city pairs that are geographically distant but demographically similar include Washington D.C. and New Orleans (high proportions of African-Americans), Los Angeles and Miami (high proportions of Hispanics), and Boston and Seattle (relatively few minorities, compared with other large cities).

Figure 7. Logistic regression coefficients for predicting links between city (MSA) pairs. 95% confidence intervals are plotted; standard errors are in parentheses. Coefficient values are from standardized inputs; the mean and standard deviations are shown to the right.
https://doi.org/10.1371/journal.pone.0113114.g007

Of the asymmetric features, population is the most informative, as larger cities are more likely to transmit to smaller ones. In the induced network of linguistic influence, the three largest metropolitan areas (New York, Los Angeles, and Chicago) have 40 outgoing connections and only fifteen incoming connections. These findings are in accord with theoretical models offered by Trudgill [25] and Labov [26]. Wealthier and younger cities are also significantly more likely to lead than to follow. While this may seem to conflict with earlier findings that language change often originates from the working class, wealthy cities must be differentiated from wealthy individuals: wealthy cities may indeed be the home to the upwardly-mobile working class that Labov associates with linguistic creativity [58], even if they also host a greater-than-average number of very wealthy individuals.

Additional validation for the logistic regression is obtained by measuring its cross-validated predictive accuracy. For each of the samples, we randomly select 10% of the instances (positive or negative city pairs) as a held-out test set, and fit the logistic regression on the other 90%. For each city pair in the test set, the logistic regression predicts whether a link exists, and we check the prediction against whether the directed pair is present in that sample's discretized network. Results are shown in Table 4. Since the numbers of positive and negative instances are equal, a random baseline would achieve 50% accuracy. A classifier that uses only geography and population (the two components of the gravity model) gives 66.5% predictive accuracy. The addition of demographic features (both asymmetric and symmetric) increases this substantially, to 74.4%. While symmetric features obtain the most robust regression coefficients, adding the asymmetric features increases the predictive accuracy from 74.1% to 74.4%, a small but statistically significant difference.

Table 4. Average accuracy predicting links between MSA pairs, and its Monte Carlo standard error (calculated from simulation samples).
https://doi.org/10.1371/journal.pone.0113114.t004

Discussion

Language continues to evolve in social media. By tracking the popularity of words over time and space, we can harness large-scale data to uncover the hidden structure of language change. We find a remarkably strong role for demographics, particularly as our analysis is centered on a geographical grouping of individual users. Language change is significantly more likely to be transmitted between demographically-similar areas, especially with regard to race, although demographic properties such as socioeconomic class may be more difficult to assess from census statistics.

Language change spreads across social network connections, and it is well known that the social networks that matter for language change are often strongly homophilous in terms of both demographics and geography [58], [59]. This paper approaches homophily from a macro-level perspective: rather than homophily between individual speakers [60], we identify homophily between geographical communities as an important factor driving the observable diffusion of lexical change. Individuals who are geographically proximate will indeed be more likely to share social network connections [61], so the role of geography in our analysis is not difficult to explain. But more surprising is the role of demographics, since it is unclear whether individuals who live in cities that are geographically distant but demographically similar will be likely to share a social network connection. Previous work has shown that friendship links on Facebook are racially homophilous [62], but to our knowledge the interaction of urban demographics with geography has not been explored. In principle, a large-scale analysis of social network links on Twitter or some other platform could shed light on this question. Such platforms impose restrictions that make social networks difficult to acquire, but one possible approach would be to try to link the “reply trees” considered by Gonçalves et al. [63] with the geographic and demographic metadata considered here; while intriguing, this is outside the scope of the present paper. A methodological contribution of our paper is the demonstration that similar macro-scale social phenomena can be inferred directly from spatiotemporal word counts, even without access to individual social networks.

Our approach can be refined in several ways. We gain robustness by choosing metropolitan areas as the basic units of analysis, but measuring word frequencies among sub-communities or individuals could shed light on linguistic diversity within metropolitan areas. Similarly, estimation is facilitated by fitting a single first-order dynamics matrix across all words, but some regions may exert more or less influence for different types of words, and a more flexible model of temporal dynamics might yield additional insights. Finally, language change occurs at many different levels, ranging from orthography to syntax and pragmatics. This work pertains only to word frequencies, but future work might consider structural changes, such as the phonetical process resulting in the transcription of "i don't" into "ion".

It is inevitable that the norms of written language must change to accommodate the new ways in which writing is used. As with all language changes, innovation must be transmitted between real language users, ultimately grounding out in countless individual decisions (conscious or not) about whether to use a new linguistic form. Traditional sociolinguistics has produced many insights from the close analysis of a relatively small number of variables. Analysis of large-scale social media data offers a new, complementary methodology by aggregating the linguistic decisions of millions of individuals.

Supporting Information

File S1. Appendix S1-S3, Table S1 and Software S1.
Appendix S1. Term list. List of all words considered in our main analysis.
Appendix S2. Term examples. Examples for each term considered in our analysis.
Appendix S3. Data Procedures. Description of the procedures used for data processing, including Twitter data acquisition, geocoding, content filtering, word filtering, and text processing.
Table S1. Term annotations. Tab-separated file describing annotations of each term as entities, foreign-language, or acceptable for analysis.
Software S1. Preprocessing software. Source code for data preprocessing.
https://doi.org/10.1371/journal.pone.0113114.s001 (ZIP)

Acknowledgments

We thank Jeffrey Arnold, Chris Dyer, Lauren Hall-Lew, Scott Kiesling, Iain Murray, John Nerbonne, Bryan Routledge, Lauren Squires, and Anders Søgaard for comments on this work.

Author Contributions

Conceived and designed the experiments: JE BO NAS EPX. Performed the experiments: JE BO. Analyzed the data: JE BO. Wrote the paper: JE BO NAS.

References

1. Androutsopoulos JK (2000) Non-standard spellings in media texts: The case of German fanzines. Journal of Sociolinguistics 4: 514–533.
2. Anis J (2007) Neography: Unconventional spelling in French SMS text messages. In: Danet B, Herring SC, editors, The Multilingual Internet: Language, Culture, and Communication Online, Oxford University Press. pp. 87–115.
3. Herring SC (2012) Grammar and electronic communication. In: Chapelle CA, editor, The Encyclopedia of Applied Linguistics, Wiley.
4. Crystal D (2006) Language and the Internet. Cambridge University Press, second edition.
5. Squires L (2010) Enregistering internet language. Language in Society 39: 457–492.
6. Thurlow C (2006) From statistical panic to moral panic: The metadiscursive construction and popular exaggeration of new media language in the print media. Journal of Computer-Mediated Communication 11: 667–701.
7. Eisenstein J, O'Connor B, Smith NA, Xing EP (2010) A latent variable model for geographic lexical variation. In: Proceedings of Empirical Methods for Natural Language Processing (EMNLP). pp. 1277–1287.
8. Eisenstein J, Smith NA, Xing EP (2011) Discovering sociolinguistic associations with structured sparsity. In: Proceedings of the Association for Computational Linguistics (ACL). pp. 1365–1374.
9. Schwartz HA, Eichstaedt JC, Kern ML, Dziurzynski L, Ramones SM, et al. (2013) Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE 8: e73791.
10. Lotan G, Graeff E, Ananny M, Gaffney D, Pearce I, et al. (2011) The revolutions were tweeted: Information flows during the 2011 Tunisian and Egyptian revolutions. International Journal of Communication 5: 1375–1405.
11. Wu S, Hofman JM, Mason WA, Watts DJ (2011) Who says what to whom on twitter. In: Proceedings of the International World Wide Web Conference (WWW). pp. 705–714.
12. Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal patterns of happiness and information in a global social network: Hedonometrics and twitter. PLoS ONE 6: e26752.
13. Thelwall M (2009) Homophily in MySpace. Journal of the Association for Information Science and Technology 60: 219–231.
14. Mitchell L, Frank MR, Harris KD, Dodds PS, Danforth CM (2013) The geography of happiness: Connecting twitter sentiment and expression, demographics, and objective characteristics of place. PLoS ONE 8: e64417.
15. Lazer D, Pentland A, Adamic L, Aral S, Barabási AL, et al. (2009) Computational social science. Science 323: 721–723.
16. Aral S, Walker D (2012) Identifying influential and susceptible members of social networks. Science 337: 337–341.
17. Bond RM, Fariss CJ, Jones JJ, Kramer ADI, Marlow C, et al. (2012) A 61-million-person experiment in social influence and political mobilization. Nature 489: 295–298.
18. Gomez-Rodriguez M, Leskovec J, Krause A (2012) Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data 5: 21.
19. Bakshy E, Rosenn I, Marlow C, Adamic L (2012) The role of social networks in information diffusion. In: Proceedings of the International World Wide Web Conference (WWW). pp. 519–528.
20. Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news cycle. In: Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 497–506.
21. Cha M, Haddadi H, Benevenuto F, Gummadi PK (2010) Measuring user influence in twitter: The million follower fallacy. In: Proceedings of the International Workshop on Web and Social Media (ICWSM). pp. 10–17.
22. Altmann EG, Pierrehumbert JB, Motter AE (2009) Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS ONE 4: e7678.
23. Zipf GK (1949/2012) Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Martino Fine Books.
24. Bailey CJ (1973) Variation and Linguistic Theory. Arlington, Virginia: Center for Applied Linguistics.
25. Trudgill P (1974) Linguistic change and diffusion: Description and explanation in sociolinguistic dialect geography. Language in Society 3: 215–246.
26. Labov W (2003) Pursuing the cascade model. In: Britain D, Cheshire J, editors, Social Dialectology: In honour of Peter Trudgill, John Benjamins. pp. 9–22.
27. Nerbonne J, Heeringa W (2007) Geographic distributions of linguistic variation reflect dynamics of differentiation. Roots: linguistics in search of its evidential base 96: 267–298.
28. Heeringa W, Nerbonne J, van Bezooijen R, Spruit MR (2007) Geografie en inwoneraantallen als verklarende factoren voor variatie in het nederlandse dialectgebied. Nederlandse Taal- en Letterkunde 123: 70–82.
29. Nerbonne J (2010) Measuring the diffusion of linguistic change. Philosophical Transactions of the Royal Society B: Biological Sciences 365: 3821–3828.
30. Lee MG (1999) Out of the hood and into the news: Borrowed black verbal expressions in a mainstream newspaper. American Speech 74: 369–388.
31. Gordon MJ (2000) Phonological correlates of ethnic identity: Evidence of divergence? American Speech 75: 115–136.
32. Rickford JR (1985) Ethnicity as a sociolinguistic boundary. American Speech 60: 99–125.
33. Wieling M, Nerbonne J, Baayen RH (2011) Quantitative social dialectology: Explaining linguistic variation geographically and socially. PLoS ONE 6: e23613.
34. Zhang M, Gong T (2013) Principles of parametric estimation in modeling language competition. Proceedings of the National Academy of Sciences 110: 9698–9703.
35. Baxter GJ, Blythe RA, Croft W, McKane AJ (2006) Utterance selection model of language change. Physical Review E 73: 046118.
36. Niyogi P, Berwick RC (1997) A dynamical systems model for language change. Complex Systems 11: 161–204.
37. Trapa PE, Nowak MA (2000) Nash equilibria for an evolutionary language game. Journal of mathematical biology 41: 172–188.
38. Reali F, Griffiths TL (2010) Words as alleles: connecting language evolution with bayesian learners to models of genetic drift. Proceedings Biological sciences / The Royal Society 277: 429–436.
39. Fagyal Z, Swarup S, Escobar AM, Gasser L, Lakkaraju K (2010) Centers and peripheries: Network roles in language change. Lingua 120: 2061–2079.
40. Gray RD, Atkinson QD (2003) Language-tree divergence times support the anatolian theory of indo-european origin. Nature 426: 435–439.
41. Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in pacific settlement. Science 323: 479–483.
42. Bouckaert R, Lemey P, Dunn M, Greenhill SJ, Alekseyenko AV, et al. (2012) Mapping the origins and expansion of the indo-european language family. Science 337: 957–960.
43. Dunn M, Greenhill SJ, Levinson SC, Gray RD (2011) Evolved structure of language shows lineage-specific trends in word-order universals. Nature 473: 79–82.
44. Lupyan G, Dale R (2010) Language structure is partly determined by social structure. PLoS ONE 5: e8559.
45. Szmrecsanyi B (2011) Corpus-based dialectometry: a methodological sketch. Corpora 6: 45–76.
46. Mislove A, Lehmann S, Ahn YY, Onnela JP, Rosenquist JN (2011) Understanding the demographics of twitter users. In: Proceedings of the International Workshop on Web and Social Media (ICWSM). pp. 554–557.
47. Duggan M, Smith A (2013) Social media update 2013. Technical report, Pew Research Center.
48. Godsill SJ, Doucet A, West M (2004) Monte Carlo smoothing for non-linear time series. In: Journal of the American Statistical Association. pp. 156–168.
49. Wei WWS (1994) Time series analysis. Addison-Wesley.
50. Morstatter F, Pfeffer J, Liu H, Carley KM (2013) Is the sample good enough? comparing data from twitter's streaming api with twitter's firehose. In: Proceedings of the International Workshop on Web and Social Media (ICWSM). pp. 400–408.
51. Clauset A, Shalizi CR, Newman ME (2009) Power-law distributions in empirical data. SIAM review 51: 661–703.
52. Gelman A, Hill J (2006) Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 1st edition.
53. Gelb A (1974) Applied Optimal Estimation. MIT press.
54. Cappe O, Godsill SJ, Moulines E (2007) An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE 95: 899–924.
55. Wei GCG, Tanner MA (1990) A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association 85: 699–704.
56. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological): 289–300.
57. Wolfram W, Schilling-Estes N (2005) American English: Dialects and Variation. Wiley-Blackwell, 2nd edition.
58. Labov W (2001) Principles of Linguistic Change, Volume 2: Social Factors. Blackwell.
59. Milroy L (1991) Language and Social Networks. Wiley-Blackwell, 2nd edition.
60. Kwak H, Lee C, Park H, Moon S (2010) What is twitter, a social network or a news media? In: Proceedings of the International World Wide Web Conference (WWW). pp. 591–600.
61. Sadilek A, Kautz H, Bigham JP (2012) Finding your friends and following them to where you are. In: Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM). pp. 723–732.
62. Chang J, Rosenn I, Backstrom L, Marlow C (2010) epluribus: Ethnicity on social networks. In: Proceedings of the International Workshop on Web and Social Media (ICWSM). volume 10, pp. 18–25.
63. Gonçalves B, Perra N, Vespignani A (2011) Modeling users' activity on twitter networks: Validation of dunbar's number. PLoS ONE 6: e22656.