Language Technologies for the Web:

“What are they saying about us?”


Some people contend that the Americans were already in possession of all the information required to prevent the attacks of September 11 before they occurred. After all, it’s well known that the National Security Agency (NSA) eavesdrops on and records much, if not all, electronic communication and broadcasting throughout the world. However, merely being in possession of recorded conversations between purported terrorists is clearly insufficient, not when those conversations lie buried in a colossal mountain of recorded data, likely took place in a language other than English, and involved people deliberately attempting to dissimulate their intentions. All this poses an enormous challenge to America’s intelligence agencies, and explains the huge investment they made in programs like GALE (Global Autonomous Language Exploitation) in the years following 9/11.

Call it the modern-day version of the needle-in-a-haystack conundrum. Broadly speaking, the goal is to extract manageable amounts of information on the planned actions of certain groups or individuals, and then cast that information in a form that is understandable to unilingual English speakers, so that it becomes ‘actionable intelligence’. To get some idea of the scope of the challenge, consider the following statistic: according to the Washington Post, the NSA intercepts and stores 1.7 billion e-mails, phone calls and other types of communications – every day! To sift through data on this scale, computer-based analysis programs offer the only realistic hope, and since the data in question is linguistic, language technologies will definitely have a preponderant role to play.(1)

Of course, interest in monitoring the Web and other electronic media in order to glean critical information is not limited to the military. Increasingly, businesses large and small also want to know what is being said about them on the Web and in the social media. Indeed, many of the same language technologies are now being employed in both domains, although perhaps not on the same scale. Let us consider one very simple example of a basic language technology that is necessary for extracting from unstructured text such basic information as who, where, when and how: anaphora resolution. Anaphora resolution involves determining the antecedent of a pronoun or other form of referring expression in a text, and it has long been studied in linguistics, generative or otherwise. Take the ‘it’ in the immediately preceding clause, for example: how do we know that, among all the candidate noun phrases in the sentence, ‘it’ refers to ‘anaphora resolution’? The answer is not at all obvious; often, the solution involves inferencing over extra-linguistic knowledge, which is why anaphora resolution was long considered a challenge requiring artificial intelligence.
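To see why the problem resists simple rules, consider a toy sketch of one classic baseline heuristic: scan backwards from the pronoun and pick the most recent candidate noun phrase that agrees with it in number. The candidate list and number tags below are hand-built for illustration; real resolvers bring far richer syntactic, semantic and world knowledge to bear.

```python
# Toy anaphora-resolution baseline: prefer the most recent candidate
# noun phrase that agrees with the pronoun in grammatical number.
# Candidates are hand-annotated here purely for illustration.

def resolve_pronoun(pronoun, candidates):
    """candidates: list of (noun_phrase, number) in order of appearance.
    Returns the most recent number-agreeing candidate, or None."""
    number = "plural" if pronoun.lower() in ("they", "them") else "singular"
    for np, np_number in reversed(candidates):
        if np_number == number:
            return np
    return None

# Candidate noun phrases from the sentence discussed above, in order:
candidates = [
    ("anaphora resolution", "singular"),
    ("the antecedent", "singular"),
    ("a pronoun", "singular"),
    ("a text", "singular"),
]
print(resolve_pronoun("it", candidates))  # -> 'a text'
```

Notice that the recency heuristic confidently returns the wrong answer here: every candidate is singular, so it picks ‘a text’ rather than ‘anaphora resolution’, which is precisely why extra-linguistic inference is needed.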

Now suppose you own a restaurant or some other type of small business, and you hire the services of a firm that promises to scan the most popular blogs and restaurant review sites to determine what people are saying about you. The following is an authentic example of the kind of input which that firm’s text analysis programs will have to cope with:

“We had a delightful wait-person and felt we had found a great place that we will return to… This is our third Michael White. He is terrific.”

The problem here is not just to establish whether the blogger’s attitude is positive or negative, but also to determine whom or what the expressed sentiment is being predicated of.(2) In other words, it is a problem of anaphora resolution, and though this type of referring expression may be somewhat unusual, this level of opacity certainly is not.
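The polarity half of the task can be caricatured in a few lines: count positive and negative words against a lexicon. The tiny lexicon below is invented for illustration, and the sketch deliberately stops short of the hard part the paragraph above describes, namely working out that the sentiment is being predicated of a restaurant named after its chef.

```python
# Minimal lexicon-based sentiment scoring. The word lists are toy
# examples; determining *what* the sentiment is about still requires
# reference resolution, which this sketch does not attempt.

POSITIVE = {"delightful", "great", "terrific"}
NEGATIVE = {"awful", "rude", "bland"}

def polarity(text):
    """Positive-minus-negative word count over a toy lexicon."""
    words = {w.strip(".,!?…").lower() for w in text.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

review = ("We had a delightful wait-person and felt we had found a great "
          "place that we will return to… This is our third Michael White. "
          "He is terrific.")
print(polarity(review))  # -> 3
```

The blogger’s review scores clearly positive, but the score alone says nothing about whether the praise attaches to the waiter, the place, or ‘Michael White’.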

Correct anaphora resolution is also critical for another widely used language technology: spelling and grammar checking. Consider the following simple text in French, a language more highly inflected than English, in which adjectives and past participles must agree in number and gender with the nouns they refer to.

“Il y avait une erreur dans le texte. Je l’ai corrigé, tu n’as pas besoin de t’en occuper.” (‘There was an error in the text. I corrected it; you don’t need to worry about it.’)

Is ‘corrigé’ written correctly here, or does the participle require an additional ‘e’? The only way to decide is to establish whether the clitic pronoun preceding ‘ai’ refers back to ‘erreur’ or to ‘texte’, the former being feminine and the latter masculine. Today, the best spelling and grammar checkers are capable of achieving impressive precision on problems such as this, and even more complex ones.
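The agreement check itself is mechanical once the antecedent is known; the hard part is the resolution step that precedes it. The sketch below hard-codes a two-word gender lexicon and takes the antecedent as given, purely to make the checker’s decision explicit.

```python
# Toy sketch of the agreement check behind a French grammar checker:
# with avoir, the past participle agrees with a *preceding* direct
# object, so the checker must first resolve what the clitic l' refers
# to. Gender lexicon and antecedent choice are hard-coded here.

GENDER = {"erreur": "f", "texte": "m"}

def participle_form(stem, antecedent_noun):
    """Return the past participle agreeing with the antecedent's gender."""
    return stem + ("e" if GENDER[antecedent_noun] == "f" else "")

# If l' refers back to 'erreur' (feminine), an extra -e is required:
print(participle_form("corrigé", "erreur"))  # -> corrigée
# If it refers to 'texte' (masculine), the bare form is correct:
print(participle_form("corrigé", "texte"))   # -> corrigé
```

Either spelling can be correct; only by resolving the anaphor can the checker decide which one is.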


(1) In fact, the three major components of the aforementioned GALE program were all language-based: transcription of spoken language, machine translation, and distillation, which is similar to information extraction.

(2) As it turns out, this ‘third Michael White’ is an Italian restaurant where the aforementioned Michael White is head chef.