Introduction to language technologies


What are language technologies?

The broad goal of natural language processing is to endow computers and computing systems with the same linguistic abilities that we as humans enjoy. Although we may not always be conscious of the fact, these linguistic abilities of ours are extremely varied and remarkably complex. Consider this last sentence, which you’ve just read and interpreted with little or no effort. What exactly went into this accomplishment? Abridging somewhat, you first had to distinguish certain of the marks on your screen and recognize them as letters of a particular alphabet. Then you grouped those letters into words belonging to the lexicon of a given language, abstracting perhaps from certain inflected forms; after which you attributed grammatical functions to those words and somehow assigned them and the full sentence a coherent meaning.

Now suppose we wanted to have a computer replicate this feat. What resources could we marshal today that might enable it to do so? In order to decipher the marks on a screen or a page, we would likely call on an optical character recognition (OCR) program, which would probably include a tokenization module to segment those characters into words and sentences. To determine which language we were dealing with, we could invoke an automatic language identification program, and then perhaps a lemmatizer to be able to look up the words of that language in a machine-readable dictionary. Determining the meaning of the sentences is a more daunting task, but in all likelihood we would begin by invoking some kind of parser to group the words into phrases and assign them a grammatical function.
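To make the early stages of this pipeline concrete, here is a minimal sketch in Python of two of the steps just mentioned: tokenization and automatic language identification. The stopword lists and the stopword-counting heuristic are illustrative simplifications; production language identifiers typically rely on character n-gram statistics trained on large corpora.

```python
import re

# Tiny stopword lists for two languages. These are illustrative
# simplifications: real language identifiers use character n-gram
# statistics trained on large corpora.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "french": {"le", "la", "et", "de", "les", "est", "dans", "que"},
}

def tokenize(text):
    """Segment a text into lowercase word tokens."""
    return re.findall(r"[a-zàâçéèêëîïôûùü']+", text.lower())

def identify_language(text):
    """Guess the language by counting stopword occurrences."""
    tokens = tokenize(text)
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(identify_language("The cat is in the garden."))    # english
print(identify_language("Le chat est dans le jardin."))  # french
```

Even this toy version shows why the steps come in a fixed order: language identification operates on the tokens, and any downstream dictionary lookup or parsing depends on knowing which language it is dealing with.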

All the programs highlighted above are examples of language technologies (LT). Each is a focussed application that seeks to automate a particular linguistic ability, and some (particularly those early in the processing chain) can today achieve impressive levels of performance. Needless to say, there are many other LTs, some of which will be mentioned below.


The broad sectors

AILIA distinguishes five broad sectors of human language technology: speech processing, machine and machine-aided translation, content management, language technologies for the Web, and language e-learning. Let us briefly examine each in turn.


Speech processing

Speech is the most natural medium of human linguistic exchange – many of the world’s languages still don’t have writing systems – and so it is only natural that we should want to speak to our computers and be understood by them. The enabling technology here is automatic speech recognition (ASR), an area in which impressive progress has been achieved in recent years. These days, none of us think twice about picking up the phone and having a conversation with an automated system, which asks what we want and usually interprets our spoken responses correctly. We can likewise issue voice commands to our computer’s operating system, to the cars we drive, or to our mobile phones. Automatic dictation systems (a.k.a. speech-to-text) are also gaining in popularity, even among finicky translators. For many applications, however, particularly those that operate in noisy environments, the basic trade-off in ASR is still between systems that are open to a large number of speakers but only handle a small vocabulary, and large-vocabulary systems that need to be tuned to each user’s pronunciation. As for text-to-speech, its applications are becoming more and more widespread, one of the best-known being the synthesized voice that gives us directions on our GPS.



Machine and machine-aided translation

Machine translation (MT) was one of the very first applications proposed for digital computers after the Second World War, during which early computing machines had been instrumental in deciphering secret German military codes. Researchers soon came to realize, however, that human language is far more complex and ambiguous than man-made ciphers, and the dream of a fully automatic translating machine was just that – a dream. At the same time, the world-wide demand for translation started to increase exponentially, due in large part to globalization; that demand now far outstrips the capacity of professional translators to meet it. Instead of seeking to supplant human translators, developers began turning their attention to machine-aided systems that can improve translator productivity by automating certain well-defined sub-tasks. Translation memory systems, for example, make it easier for translators to recover past translations of previously seen material. More recently, the adoption by MT researchers of the statistical methods that proved so successful in speech recognition has led to significant progress in this area as well. Given the requisite training corpora, new MT systems can now be developed in a fraction of the time required by the older rule-based systems. Furthermore, the quality of the output has improved substantially: not only is it sufficient today for gisting and information-gathering purposes, but it increasingly appears good enough to enable cost-effective post-editing for publication purposes.
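To illustrate the idea behind translation memory, here is a minimal sketch in Python: given a new sentence, it retrieves the stored translation whose source segment is most similar, using a simple string-similarity score. The memory contents and the 0.6 threshold are invented for illustration; commercial tools index millions of aligned segments and use more sophisticated fuzzy-matching measures.

```python
from difflib import SequenceMatcher

# A toy translation memory: English source segments paired with the
# French translations a translator produced for them in the past.
# (Invented examples; real memories hold millions of aligned segments.)
MEMORY = [
    ("Press the start button.", "Appuyez sur le bouton de démarrage."),
    ("Close the application.", "Fermez l'application."),
    ("Save the document before closing.",
     "Enregistrez le document avant de fermer."),
]

def best_match(sentence, threshold=0.6):
    """Return (source, translation, score) for the stored segment most
    similar to the new sentence, or None if nothing is close enough."""
    scored = [
        (SequenceMatcher(None, sentence.lower(), src.lower()).ratio(), src, tgt)
        for src, tgt in MEMORY
    ]
    score, src, tgt = max(scored)
    return (src, tgt, score) if score >= threshold else None

# A near-repetition of a past segment is retrieved for reuse.
match = best_match("Press the stop button.")
```

Editing the retrieved translation of a near-identical segment is typically much faster than translating from scratch, which is why fuzzy matching of this kind is central to translation memory tools.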



Content management

In the industrial society of the last century, information was relatively scarce – something that had to be diligently sought out and protected. In today’s post-industrial era, on the other hand, we are literally awash in information and in desperate need of ways to navigate and find our way through it. That is what content management systems (CMS) do: they provide their users with the means to efficiently create, store, retrieve and publish ever-growing amounts of digital information. Document management systems are one type of CMS, but increasingly the information such systems handle also includes audio, images, video and other types of multimedia files; whence the use of the more generic term content management. These systems make a sharp distinction between the informational content itself and the platform or manner in which it is eventually published, which encourages the recycling or reuse of content. Needless to say, language technologies play a critical role in all content management systems, whether it be for authoring assistance, document indexing, search, version control, summary generation or the automatic translation of the content into multilingual versions.
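The separation of content from presentation can be sketched very simply: the same content record, stored once, is rendered to two different output formats. The record and the rendering functions below are invented for illustration; real CMSs delegate this work to templating engines.

```python
# The same content record, stored once, rendered to two different
# presentation formats. The record and the renderers are invented
# for illustration; real CMSs use templating engines for this.
article = {
    "title": "Quarterly report",
    "body": "Sales rose 12% this quarter.",
}

def render_html(item):
    """Publish the content as an HTML fragment for the website."""
    return f"<article><h1>{item['title']}</h1><p>{item['body']}</p></article>"

def render_text(item):
    """Publish the same content as plain text, e.g. for an email digest."""
    return f"{item['title'].upper()}\n\n{item['body']}"

print(render_html(article))
print(render_text(article))
```

Because the renderers never touch the stored record, the same content can be reused across channels, versioned, or sent off for translation without ever being duplicated.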



Language technologies for the Web

A major factor in the transition to the information society we now live in has been the incredible expansion of the Internet and the World Wide Web; and here too, language technologies have played a crucial role. All major corporations today rely extensively on their Web sites, not just to promote their products or services, but increasingly to respond to requests for information from clients all over the globe. Some of their online knowledge bases are so large and are updated so frequently that machine translation offers the only hope of providing this information in the clients’ own languages. Of course, it’s not just large corporations that are increasingly exploiting the Web; nowadays, everybody and his uncle has his own Website – or at the very least a blog. In this ever-expanding parallel universe, how can we ensure that Internet users quickly locate the most pertinent information they seek without being inundated with irrelevant noise? This is the challenge of information retrieval, which of course relies heavily on language technologies, not just to lemmatize our queries but to suggest completions to them before we even finish typing! (The same sort of language models are at work when we send text messages on our cell phones.) The use of social media on the Internet has exploded in recent years and here too language technologies are being harnessed to provide insightful analyses of the reams of user-generated content, e.g. via sentiment analysis.
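The kind of language model that suggests query completions can be sketched very simply. The following Python example trains bigram counts on a tiny, invented query log and proposes the most likely next word; real completion systems learn from billions of logged queries and model much longer contexts.

```python
from collections import Counter, defaultdict

# A tiny, invented query log to train on; real completion systems
# learn from billions of logged queries.
QUERIES = [
    "machine translation software",
    "machine translation online",
    "machine learning course",
    "speech recognition software",
]

# Count which word follows which across the training queries.
bigrams = defaultdict(Counter)
for query in QUERIES:
    words = query.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def complete(prefix):
    """Suggest the most likely next word after the last word typed."""
    last = prefix.split()[-1]
    followers = bigrams.get(last)
    return followers.most_common(1)[0][0] if followers else None

print(complete("machine"))  # translation
print(complete("speech"))   # recognition
```

The same statistical principle, predicting the next word from the words already typed, underlies the predictive text on our phones.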



Language e-learning

Globalization has certainly accentuated the need to learn a second language, not just in Canada (an officially bilingual country, where demand has always been strong), but the world over. One has only to think of the rise of English as the de facto lingua franca of commerce and science in all non-English-speaking countries. The problem is that most adults no longer have the time to learn a second language in the way they used to. Indeed, the principal challenge facing the language-teaching industry today is not so much to invent new technologies that will miraculously expedite the learning process. Rather, it is to develop new ways of exploiting existing information technologies that give learners flexible access to focussed, individually adapted instruction whenever they find the time and wherever they happen to be, using their mobile phone, tablet computer or other similar device.



Further information is available from AILIA’s Language Technologies Committee.
