Johns Hopkins’ Center for Language and Speech Processing was among the research centers that helped develop translation and voice tools that have been finding wider use in commercial products in recent years. In a relatively close-knit field, the center stands out for its size and longevity.
Since the 1980s, researchers have worked to develop the technology now used in tools like Google Translate, Siri or Facebook’s button that spits out a post in a different language. Such tools grew from open source systems, with the grandeur of the undertaking illustrated by Biblical names like Moses and Joshua.
“All that technology ultimately started at research labs like ours,” said Philipp Koehn. The computer science professor, who is affiliated with the center, has worked on machine translation for 20 years. He notes that the tools now in wide use took decades of work, and that their success wasn’t a given until recently.
“It’s impressive to see that it’s good enough for real users,” Koehn said in a recent interview. “That is quite a threshold.”
Most of those tools involve translation between widely used languages that have lots of written material available. Google Translate, for instance, is available in the roughly 100 most prevalent languages.
Now, Koehn and other researchers are set to apply tools they’ve used in speech recognition, information retrieval and extraction of information from text to languages that aren’t as widely used.
He is leading a group of 20 that will look to develop a system that can respond to queries written in English about documents written in these “low resource” languages. The Office of the Director of National Intelligence awarded a $10.7 million grant for the project, which includes a mix of professors and about a dozen PhD students.
They’ll seek to start with languages like Swahili and Tagalog. Those are good examples of languages that have “millions and millions of speakers…but just don’t have that much of a presence on the internet or official communication.”
The challenge is to take documents written in one of the languages, and produce an algorithm that would help intelligence agents get a quick look at what happened. “We have to return back to them relevant Swahili documents with a summary,” Koehn said.
After building an initial tool for the first two languages, the team will be tasked with putting it to use. Intelligence agencies could use the tool to quickly analyze documents in these languages when a major event happens. Some of the languages of interest to that end include Kurdish, Serbo-Croatian, Khmer, Hmong and Somali.
While tools based on deep learning networks have come a long way, Koehn said there’s a new data challenge inherent in analyzing such languages. More widely used languages often have large datasets available for training tools. “Now it’s much, much smaller,” Koehn said. This will require new strategies to obtain data that can be translated, whether through context or linguistic analysis.
The four-year project is the beginning of a new phase of research for the field. As Koehn noted, there are 6,000 languages in the world. The resources may not be there to translate all of them, but it means there’s plenty left to explore.