
WHEN Google began scanning books and allowing them to be searched online in 2004, publishers fretted that their literary treasure would be ransacked by internet pirates. Readers, meanwhile, revelled in the prospect of instant access to innumerable publications, some of them unavailable by other means. But Google Books is also responsible for another, quieter revolution: in the humanities.
For centuries, researchers interested in tracking cultural and linguistic trends were resigned to the laborious process of perusing volumes one by one. A single person, or indeed a team of people, can read only so many books. Large-scale number-crunching seemed an impossible task. Now, though, Jean-Baptiste Michel, of Harvard University, and his colleagues have used Google Books to do just that. They report their first results in this week’s Science.
So far Google has managed to digitise 15m of the estimated 130m titles printed since Johannes Gutenberg perfected the press in the 15th century. Dr Michel’s team whittled this down to just over 5m volumes for which reasonably accurate bibliographic data, in particular the date and place of publication, are available. They chose to focus mainly on English texts between 1800 and 2000, but also included some French, Spanish, German, Russian, Chinese and Hebrew ones.
That yielded a corpus of over 500 billion 1-grams, as Dr Michel calls any string of characters uninterrupted by a space. These include words, acronyms, numbers and dates, as well as typos (“becasue”) or misspellings (“abberation”). He also looked at combinations of 1-grams, from 2-grams (“The Economist”) to 5-grams (“the United States of America”). To minimise the risk of including random concatenations of words, rare spellings or mistakes, any word or expression had to appear in the corpus at least 40 times to merit inclusion in the final, chronologically ordered set.
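The bookkeeping behind such a table can be sketched in a few lines of Python. This is only an illustration of the idea, not the team’s actual pipeline: the one-file-per-book layout and the function names are assumptions; only the 40-occurrence cut-off comes from the description above.

from collections import Counter
from pathlib import Path

def ngrams(tokens, n):
    # Yield every run of n consecutive 1-grams as a single space-joined string.
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def build_table(book_dir, n=1, min_count=40):
    # Count n-grams across a directory of plain-text files (a hypothetical corpus layout),
    # then keep only those seen at least min_count times, mirroring the 40-occurrence threshold.
    counts = Counter()
    for path in Path(book_dir).glob("*.txt"):
        tokens = path.read_text(encoding="utf-8").split()  # a 1-gram: the characters between spaces
        counts.update(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}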
At this point, the number-crunching could begin in earnest. Dr Michel first used his data to estimate the total number of words in the English language. To do this, he and his team took a random sample from the corpus, checked what proportion were non-words and extrapolated that to the whole lot. He puts the figure at a smidgen above 1m.
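The estimate itself amounts to simple sampling and scaling, rendered below as a rough sketch: the is_real_word check stands in for the vetting the article implies was done by the team, and the sample size is an arbitrary assumption.

import random

def estimate_vocabulary(one_gram_counts, is_real_word, sample_size=1000, seed=0):
    # Draw a random sample of distinct 1-grams, find the share that are genuine words,
    # and scale that share up to the full set of distinct 1-grams in the corpus.
    grams = list(one_gram_counts)
    sample = random.Random(seed).sample(grams, min(sample_size, len(grams)))
    share_real = sum(1 for g in sample if is_real_word(g)) / len(sample)
    return round(share_real * len(grams))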
On their reckoning, even the most authoritative lexical repository, the “Oxford English Dictionary”, underrepresents this total by a factor of two. Also, after hardly budging in the first half of the 20th century, the English vocabulary expanded at a rate of 8,500 words a year in the second half, leading to a 70% increase in its size since 1950 (see chart).
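A back-of-envelope check, using round numbers rather than anything from the paper, shows that these figures hang together with the 1m estimate above.

words_per_year = 8_500            # reported growth rate, 1950-2000
years = 50
added = words_per_year * years    # 425,000 new words over the half-century
vocab_1950 = added / 0.70         # size implied by a 70% increase: roughly 607,000
vocab_2000 = vocab_1950 + added   # roughly 1.03m, a smidgen above 1m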
Amusingly, Dr Michel found that some words added to the “American Heritage Dictionary” in 2000, like “gypseous” or “amplidyne”, had been in widespread use a century earlier. What is more, by the time they did make it into the dictionary, they were becoming obsolete.
The researchers did not confine themselves to poking fun at lexicographers, though. They also looked at a range of cultural trends: how long it takes innovations to impinge on the popular consciousness (less and less time, it turns out), the age at which celebrities become famous (which is dropping, albeit at the price of shorter spells in the limelight), and many other questions, more or less frivolous.
Clearly, books do not exhaust the whole of human culture. In recent decades their relative importance has waned. Nor are the books Google has already chosen to scan necessarily a representative sample of literature across the ages. This means that any findings based on them ought to be treated with caution.
Still, Dr Michel and his team hope that their approach will spur a more rigorous, quantitative approach to the study of human culture. In fact, their paper doubles as a manifesto for a new discipline. They dub it “culturomics”, making them the first clutch of culturomists. More are sure to follow—whether or not this particular, clunking neologism survives.