Archive for the ‘i18n’ Category

Thursday, January 27th, 2005

The crazy thing with writing a book is that it only starts to have an effect on your life years after you write it. I was just informed that my information architecture book is being translated into Japanese - pretty cool!

Thursday, January 27th, 2005

3 Metaphors for culture we use in Consulting Work

Thursday, January 27th, 2005

MultiLingual Computing, Inc.: MultiLingual Computing & Technology, Article Detail: there are still quite a few languages that aren’t easily used on websites, due to font problems.

Friday, January 21st, 2005

Eric Scheid pointed me to Guidelines for Forming Language Equivalents: A Model Based on the Art & Architecture Thesaurus. I will report in more depth on this later.

Thursday, January 20th, 2005

Alex Wright has some follow-up musings on my i18n folksonomy post. Gene Smith also weighs in.

Monday, January 17th, 2005

Steve Arnold pointed out to me that “Languages that are synthetic do not fare well in automatic systems unless the source documents are highly technical.”

Which makes sense, once you understand what a synthetic language is. A synthetic language combines bits of meaning into really long words. For example, in Mohawk: Washakotya’tawitsherahetkvhta’se = “He ruined her dress” (strictly, “He made the thing that one puts on one’s body ugly for her”). One word is used for something that other languages need multiple words or a whole sentence for. You can see how that can mess with automated systems.

Languages are not simply synthetic or isolating; they fall on a spectrum, and some languages are just more synthetic than others. Examples of common, relatively synthetic languages are German, Russian, Turkish, Finnish, Japanese, Korean, and many more.

Monday, January 17th, 2005

I came across this other interesting paper on multi-lingual search.

Internet Searching and Browsing in a Multilingual World: An Experiment on the Chinese Business Intelligence Portal (CBizPort) (PDF), Journal of the American Society for Information Science and Technology, July 2004.

It gives some good practical background on a project to create a business search for Chinese. It also describes an approach to automatically summarizing and categorizing search results. Here is its description of the Chinese search landscape, illustrating some interesting language-locale subtleties.

“Chinese is the primary language used by people in mainland China, Taiwan, and Hong Kong. Language encoding, vocabularies, economies, and societies of the three regions differ significantly. Regional search engines, therefore, have been developed to provide Internet searching.

In mainland China, the major search engines include Sina and Baidu. Baidu currently powers over 80% of Internet search services in China, including ChinaRen, 163.net, etc. The database of Baidu stores over 60 million Web pages collected from mainland China, Hong Kong, Taiwan, and Singapore, and grows at a speed of several hundreds of thousands of Web pages per day. Sina is an Internet portal providing comprehensive services such as Web searching, e-mail, news, business directory, entertainment, weather forecast, etc. From our review of search engines in mainland China, we found that Baidu has better search capabilities than the others, as shown by its content coverage. Sina has a wider scope of functions than Baidu.

In Taiwan, the two major Internet search portals are Openfind and Yam. Openfind, established in 1998, is one of the largest portals in Taiwan. In addition to basic searching, Openfind suggests terms that are highly associated with users’ queries to help them refine their search. It also allows users to find more related items from each search result and highlights the query terms in the results. Established in 1995, Yam provides comprehensive online services. Its four major focuses are content, communication, community, and commerce (4C). Yam’s search engine allows users to search various media: Web sites, Web pages, news, Internet forum messages, and activities (in 18 Taiwan cities or regions). We found that Openfind has better functionality and content coverage, but Yam was better established in the local market (e.g., it powers the search function of the Taiwan government’s Web sites).

In Hong Kong, due to its bilingual culture, people rely on both English and Chinese when accessing and searching the Internet. Major search portals include Yahoo Hong Kong and Timway. Of these, Yahoo Hong Kong is one of the most popular. Yahoo Hong Kong’s search engine returns results in different categories, Web sites, Web pages, and news. Headquartered in Hong Kong, Timway provides services such as Web searching, Web directory, e-mail, news, forums, etc. Its database stores over 30,000 Hong Kong Web sites and over 10 million Web pages.”

In other words, just because Chinese is one language does not mean that one Chinese search engine will be enough for all the users who want to search in Chinese. There are different groups of people who search in Chinese, with different local requirements, and these local requirements have given rise to a number of different search engines for Chinese.

Monday, January 17th, 2005

A Framework for Multilingual Searching and Meta-information Extraction: what is “term isolation”? “Term isolations means extracting the individual terms from the text. This is necessary for languages such as Chinese and Japanese, that do not contain white space between individual terms. Term isolation is not a trivial task, and requires the software to understand the language’s grammar and have a complete dictionary. Term isolation is clearly a language-specific task (with different software modules for different languages).”
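To make the idea a bit more concrete, here is a minimal sketch of one very simple way to do term isolation: greedy "forward maximum matching" against a dictionary. This is not the method from the paper (which says real term isolation also needs grammar knowledge and a complete dictionary); the function name `segment`, the `max_len` parameter and the tiny `TOY_DICTIONARY` are all made up for illustration.

```python
def segment(text, dictionary, max_len=4):
    """Split text into terms by greedy forward maximum matching.

    At each position, take the longest dictionary entry that matches
    (up to max_len characters); if nothing matches, emit one character.
    """
    terms = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                terms.append(candidate)
                i += size
                break
    return terms


# Hypothetical toy dictionary; a real one would hold many thousands of entries.
TOY_DICTIONARY = {"中文", "搜索", "引擎", "搜索引擎"}

print(segment("中文搜索引擎", TOY_DICTIONARY))
```

Even this toy version shows why the task is language-specific: the result depends entirely on the dictionary, and a greedy match can easily pick the wrong split when several readings are possible.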

Multi-lingual search

Monday, January 17th, 2005

How do you know what language the query a user entered is in, and how do you search languages like Japanese, that don’t use spaces between the words? And how do you identify languages used in an unstructured, multi-lingual document? Multilingual search is a hard problem.

Most of the search players use the same basic technology, provided by Basistech. Check out their customer list: Google, Amazon, MSN, Yahoo, Endeca, Peoplesoft, Overture, you name it, they have them.

The technology, called the Rosette Linguistic Platform, “helps your applications unlock the meaning of unstructured text by determining the language and encoding of a given document, converting the text to Unicode so that it can be processed, identifying the basic linguistic features and structure, and locating key concepts like the names of people and places.”

In other words, it deals with Asian and Arabic language search problems, and does entity extraction (extracting names of people, places and companies and such). It identifies individual words for languages such as Japanese that do not use spaces between words, breaks compound words into their individual components, and identifies parts-of-speech such as verb, adjective, etc.

Once it has done its job making sense of the languages it finds, the search technology of the vendor takes over.

They have a pretty cool demo that explains what the technology does - here’s a screenshot:

rosetta

Sunday, January 16th, 2005

Nick Finch (?) writes about content and locales and the problems with continents, but I can’t seem to contact him.

I linked to another website in this post that I found via Technorati, and I now suspect that the other website was a fake blog meant to create PageRank, populated automatically from various feeds. Evil.

Emergent i18n effects in folksonomies

Saturday, January 15th, 2005

My series of posts on international information architecture:

  1. Translating taxonomies and categories
  2. Translating categories, translating terms
  3. Translating the Dewey Decimal Classification system
  4. Designing the relationship between content and locales
  5. Emergent i18n effects in folksonomies (this post)
  6. The Maori versus Dewey, and why limiting access can be culturally appropriate.

Folksonomies are taxonomies created by users who add tags to things. Folksonomies are messy and have a lot of problems, but their great merit is that they’re scalable and that they use the users’ terminology by definition. Not matching the users’ terminology is a serious problem with more classic taxonomies that are created by information architects or librarians.

There is a lot of innovation happening around folksonomies. One interesting area is internationalization. Users enter tags in many languages, but generally, the system does not know what language a tag is in. Have a look at a screenshot of this page from Technorati, showing popular tags:

technorati tags

I drew big red lines around them, so you’ll probably notice some tags in languages other than English. “Algemeen - Algemein - Algemeines” means “general” in Dutch and German. “Entertainment Entretenimento” is English-Spanish for entertainment. Notice that a misspelling of “Entretenimiento” is also used. But we are talking about languages today, not spelling. “Music - Musique - Música” are English, French and Spanish.

So what’s going on here? People are tagging things in many languages. Right now, Technorati displays the various languages mixed together on one page. That’s pretty interesting, especially if you’re interested in languages like me. But it might also be cool to see only tags in your language. Especially if you don’t speak English.

How can we do this? One way is to use a dictionary lookup to figure out what language a certain tag is in. It won’t be perfect, but this approach could be used to display a page of popular tags in mostly German, mostly English, mostly Hindi and so on. This will reduce the number of tags available to the user, but make them more relevant (because they are in the user’s language). If you speak English, the dominant language, seeing a few non-English tags won’t bother you, but this is not for English speakers. For all the people who don’t speak English, seeing tags in their language will be invaluable. If you do the dictionary lookup only for popular tags, it shouldn’t be too resource-expensive: a tag only has to be checked once against the dictionary and assigned a probable language.
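Here is a rough sketch of what the dictionary lookup could look like. Everything in it is an assumption for illustration: the `WORD_LISTS` are tiny made-up word lists (a real system would load full dictionaries), and the rule in `tags_mostly_in` (keep tags found in the user's language, plus tags that no dictionary recognizes) is just one possible reading of "mostly".

```python
from functools import lru_cache

# Hypothetical toy word lists, keyed by language code.
WORD_LISTS = {
    "en": {"music", "general", "entertainment", "cat", "chat"},
    "fr": {"musique", "chat"},
    "nl": {"algemeen", "muziek", "leuk", "konijn"},
}


@lru_cache(maxsize=None)  # each distinct tag is checked against the lists only once
def probable_languages(tag):
    """Return the languages whose word list contains the tag."""
    word = tag.strip().lower()
    return frozenset(lang for lang, words in WORD_LISTS.items() if word in words)


def tags_mostly_in(language, popular_tags):
    """Keep tags recognized in `language`, plus tags no word list recognizes."""
    kept = []
    for tag in popular_tags:
        languages = probable_languages(tag)
        if not languages or language in languages:
            kept.append(tag)
    return kept


popular = ["Music", "Musique", "Algemeen", "chat", "overflakkee"]
print(tags_mostly_in("fr", popular))
```

Note that “chat” shows up in two word lists, so a tag can have more than one probable language, and that unknown words such as place names pass through untouched.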

Another way is to look at the language of the source (RSS feed, …) and assume the tag is in the same language. Tricky - I’m not sure this will work. There might be other algorithms as well - if I do a Google search for “Música”, it knows this isn’t English, because it asks me if I want to “Search for English results only”, so there is some algorithm going on there I assume (unless they also use the dictionary approach).

Later: I realized something else. Displaying tags in mostly some language, as opposed to exclusively in that language, is not necessarily a bug, it might be a feature. Many user populations around the world incorporate words in multiple languages in their vocabulary. The language namespaces I am talking about might not map perfectly to a specific language, but include words in other languages, and slang and such, and in this way be a much better representation of the real language of a certain user population than if we were to just use one language. So it’s not so much about language namespaces, it’s more about user population namespaces. Language is just a starting point and might be an easy way to group user populations.

As an aside, I think the real innovation with folksonomies will come from creating algorithms. It’s all about scalability. The way Google’s superior search algorithm made them the number one search engine, someone will invent superior algorithms in tagging and this may make them the number one tagging engine.

Back to languages. The most interesting aspect of the screenshot above isn’t that there are tags in other languages; it is that the tags are the same in other languages. The tags in French and Spanish and such have their English translations right on the same page. This suggests that people seem to tag things similarly in different languages. Is there a way to create algorithms that take advantage of this fact? Also take into account that different people can tag the same things (pictures, bookmarks, …) in different languages.

An interesting language effect with tags was pointed out by Tanya, on this Flickr page for the tag chat. “Chat” means cat in French. Here’s a screenshot:

flickr tags chat

So different people have used “chat” and “cat” to tag similar items, and as a result Flickr knows that “chat” and “cat” are related tags. I’m not sure what that means for internationalization of folksonomies, but as an emergent i18n effect I think that’s pretty amazing.
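I don't know how Flickr actually computes related tags, but a simple co-occurrence count would already produce this effect. The sketch below is only that assumption written out: `related_tags` and the sample `photos` data are made up, and tags are ranked by how often they appear on the same item as the tag you start from.

```python
from collections import Counter


def related_tags(tagged_items, tag, top=3):
    """Rank other tags by how often they share an item with `tag`."""
    counts = Counter()
    for tags in tagged_items:
        unique = set(tags)
        if tag in unique:
            counts.update(unique - {tag})
    # Sort by count (highest first), then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:top]]


# Hypothetical tag sets, one per photo.
photos = [
    {"chat", "cat", "kitten"},
    {"chat", "cat"},
    {"chat", "noir"},
    {"cat", "dog"},
    {"dog", "park"},
]

print(related_tags(photos, "chat", top=2))
```

Nothing in this code knows about languages: “cat” ends up related to “chat” purely because French and English speakers tagged the same kind of pictures.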

A third and similar i18n effect I found when playing around with this in Flickr is that of language namespaces. If you start following related tags in Flickr in a certain language, you will see many tags in that language. Here’s a screenshot of the tag “leuk”, which means “funny” in Dutch.

leuk on flickr

The related tags are in English, but in the see also tags we see a whole bunch of Dutch words: “ik, konijn, middelharnis, mooi, oudetonge, overflakkee, plankje”. And if you follow those you’ll see more Dutch words in the related and see-also tags, creating a kind of Dutch namespace almost. Again, I’m not sure how to use this exactly, but it’s pretty amazing to me that, in this early stage, there are already interesting i18n effects happening in the tagging space.

Comments welcome!


Saturday, January 15th, 2005

I’m helping out a bit on the IA of ourmedia.org, and I’m trying to come up with a list of common languages. These will be expanded later. It’s tricky, and the list I finally came up with is just a list. WARNING: please don’t use this list for your project if you need common languages; it is fairly arbitrary.

It’s pretty hard to come up with a list of languages for a project like this. We want to allow anyone to upload stuff in any language, but we also wanted to have a short-ish list of languages (20 to 40) instead of just using the really long one (a decision that’s not necessarily the right one, but it was taken).

We didn’t have time to take into account which languages are producing more media that’s likely to be uploaded to ourmedia.org, or any factors like that. So this list is mostly based on the number of speakers these languages have, with some preference for European languages (target audience in terms of computer access right now).

- Top 10 Common Languages by amount of native speakers
- The UN uses these main languages on their website: English, Arabic, Chinese, French, Russian, Spanish.

This is the list I came up with, alphabetically. Users can use this to indicate the language of their media, and other things, or choose another language.

Arabic
Bengali
Chinese
Dutch
English
French
German
Hindi
Italian
Japanese
Javanese
Korean
Marathi
Polish
Portuguese
Russian
Spanish
Tamil
Telugu
Turkish
Urdu
Vietnamese

Again, don’t just use this list if you need common languages. It’s very arbitrary and it took me less than an hour to put this together. I share it here as a starting point only.

Please leave comments!

Wednesday, January 12th, 2005

Entertaining cultural differences in manual.

Designing the relationship between content and locales

Wednesday, January 12th, 2005

My series of posts on international information architecture:

  1. Translating taxonomies and categories
  2. Translating categories, translating terms
  3. Translating the Dewey Decimal Classification system
  4. Designing the relationship between content and locales (this post)
  5. Emergent i18n effects in folksonomies
  6. The Maori versus Dewey, and why limiting access can be culturally appropriate.

I am trying to gather thoughts on the various structures and patterns that occur when creating multiple versions of a website in multiple locales. I hope to create a model, a way of thinking about these structures that can help when making decisions about things like:

  • What content should be translated?
  • From which website to which website?

We can come up with many models (models schmodels), the question is: is this a useful one? Does it aid your thinking? This is early work, so any feedback (shoot it down!) is greatly appreciated.

Let’s get started.

When you are creating versions of a website in multiple locales, the simplest case is when you just translate everything on the website: one-on-one translation. You start with a master locale, and just translate it into another locale. This is rare though – very few translation projects are this simple.


One-on-one translation.

The second, more common case is selective translation: you have a master locale (often English), and other locales are partially translated.


Selective translation.

There are various types of selective translation: you can do a summation, where multiple pages or whole sections of the master website are replaced by just one page in the translated version. Or you can just not translate parts of the website: removal. Most projects do a bit of both.

A third case is when you have a master locale, but also original content in the translated locale: original content.


Original content

The original content in the translated language can be used as a master for translation into yet another language.

For example, your master locale is English-US, the translated locales are English-Canadian and French-Canadian. (French is an official language in Canada, and there are certain legal requirements to provide information in both official languages.) You might have a partial original content translation from English-US to English-Canadian, in other words, you take parts of the English-US content, and create parts of the content for English-Canadian from scratch. Then you might do a one-on-one translation from English-Canadian to French-Canadian.

This example can be described as a grouping of locales. Many countries or regions have legal requirements (and human needs) to provide content in various official languages. If you have an intranet in Canada, you must provide content in French even if you only have one employee in Montreal who speaks French. In Belgium, you should provide content in French and Dutch, since half the country speaks French and the other half Dutch. If your locale is South America, you had better provide content in both Spanish and Portuguese, and maybe a few other languages as well.

When you are creating your content-locale structure, grouping locales often makes sense, in that a certain locale can become the master of all local languages, as in our Canadian example. The content needed for this group of locales is the same or very close.

Finally, sometimes almost no content is directly shared. In this case, we’re just talking about separate websites. It is a valid option, but I won’t discuss it in much depth here.

So, to recap, we have 4 simple ways to connect locales:

  • One-on-one translation
  • Selective translation (summation and removal)
  • Original content
  • Grouping of locales.

These structures can help us design the relationship between content and locales for our websites.

When you start designing the relationship between content and locales, you’ll often find that the structure you come up with is different for different types of content. Technical support information may be translatable directly in a one-on-one translation. You might not sell exactly the same products or services in all markets, so marketing content might require a partial original content translation. And so on.
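One way to think about such a structure is as a small data model: per content type, a set of links from a source locale to a target locale, each labeled with the kind of relationship. The sketch below is just one possible way to write that down, not a finished tool. The names `Relation`, `Link`, `chain_to` and `STRUCTURE` are made up, and the locale codes (`en-US`, `en-CA`, `fr-CA`) are my own shorthand for the Canadian example above.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    ONE_ON_ONE = "one-on-one translation"
    SELECTIVE = "selective translation"
    ORIGINAL_CONTENT = "original content"


@dataclass(frozen=True)
class Link:
    source: str  # locale the content comes from
    target: str  # locale the content is created for
    relation: Relation


def chain_to(target, links):
    """Return the links leading from the master locale down to `target`."""
    by_target = {link.target: link for link in links}
    path, seen = [], set()
    while target in by_target:
        if target in seen:
            raise ValueError(f"circular locale relationship at {target}")
        seen.add(target)
        link = by_target[target]
        path.append(link)
        target = link.source
    return path[::-1]


# Hypothetical structure: a different set of links for each content type.
STRUCTURE = {
    "marketing": [
        Link("en-US", "en-CA", Relation.ORIGINAL_CONTENT),
        Link("en-CA", "fr-CA", Relation.ONE_ON_ONE),  # en-CA is master of the group
    ],
    "support": [
        Link("en-US", "en-CA", Relation.ONE_ON_ONE),
        Link("en-US", "fr-CA", Relation.ONE_ON_ONE),
    ],
}

for content_type, links in STRUCTURE.items():
    steps = [f"{l.source} -> {l.target} ({l.relation.value})"
             for l in chain_to("fr-CA", links)]
    print(content_type, steps)
```

Grouping of locales is not a separate relation here; it shows up as a locale (en-CA for marketing content) that is itself the source for other locales in its group.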

So here is an example of how you might design the relationship between content and locales for a public product website. We are distinguishing a few different content types, and for each content type we provide different relationships.


Monday, January 10th, 2005

James Tauber : Translations, Glosses, Tags and Folksonomies. I’ll parse this later and report back.

*Pixelcharmer: Field Notes: L’iChat

Friday, January 7th, 2005

Tanya points out how even uncontrolled vocabularies can show some nice multilingual effects.

Translating the Dewey Decimal Classification system

Friday, January 7th, 2005

My series of posts on international information architecture:

  1. Translating taxonomies and categories
  2. Translating categories, translating terms
  3. Translating the Dewey Decimal Classification system (this post)
  4. Designing the relationship between content and locales
  5. Emergent i18n effects in folksonomies
  6. The Maori versus Dewey, and why limiting access can be culturally appropriate.

The Dewey Decimal Classification system (used in libraries throughout the world to classify books) has been translated into many languages, so I figured I’d ask their Editor in Chief, Joan S. Mitchell, about their experience. I was particularly interested in their approach to categories as “concepts”, and their approach to developing a taxonomy of concepts, regardless of translatability.

Joan S. Mitchell: “Many DDC translations contain more detailed developments in selected areas than found in the English-language standard edition. For example, the province of Rovigo has one number (–4533) in Table 2 in the English-language full edition; there are nine subdivisions listed under this number (representing parts of Rovigo province) in the Italian translation of the full edition.”

In other words, local translations have different requirements - the English version may be content with providing geographical categories for Italy up to the level of a province, but people in Italy may be interested in having categories for subdivisions of provinces.

Here’s a case study describing this as well: Beall, J. 2003. Approaches to expansions: case studies from the German and Vietnamese translations. Presented at the Classification and Indexing - Workshop, World Library and Information Congress: 69th IFLA General Conference and Council. 1-9 August 2003, Berlin. http://www.ifla.org/IV/ifla69/papers/123e-Beall.pdf

I also asked her about terms or categories that are not easily translatable.

Joan S. Mitchell: “In nearly every translation, we have encountered terms that are not easily translated from one language to another. There are a number of strategies we employ to address these issues.

In some cases, the term is left in English, e.g., “land-grant colleges” appears in English in the German translation because it is part of the name of a category and there isn’t a suitable equivalent in German.

Sometimes, we adjust the definition of category itself to accommodate translation issues. For example, the popular German form of bowling is skittles (ninepins); in the US, the most popular form is tenpin bowling. After an inquiry from the German translators, we added a note to indicate that the category of bowling covers both types.

In the case of examples that illustrate categories, we routinely encourage translators to substitute examples that have meaning in the specific language group for those used in the English-language edition. We also encourage the inclusion of index terms that are meaningful to the language group and compatible with the definition of the category. The index to a translation can include different synthesized numbers and accompanying index terms from those found in the English-language standard edition.”

I was curious about their approach to the taxonomy as a “concept tree”. This assumes that all concepts can be expressed in all languages, although (as in the bowling example above), sometimes categories are adjusted with feedback from translators.

In my previous post, “Translating categories, translating terms”, I discussed the problems with translating categories. My conclusion was that sometimes, categories are just not translatable. People think too differently. The DDC has a different philosophy: they assume ALL concepts are translatable.

Joan S. Mitchell: “Translatability is not a criteria for adding new concepts; however, there has to be a certain threshold of published material (“literary warrant”) for the inclusion of a term or the expansion of a category. The literary warrant we use to develop the English-language standard edition is based primarily on the contents of OCLC’s WorldCat database, the world’s largest bibliographic database (over 57 million records).”

For non-library folks: “literary warrant” just means that, if you have a lot of books about a subdivision of a province in Italy, you should probably have that subdivision as a category, so people can find those books. The concept is the same with products you sell, or services, or whatever. Most companies have different products or services in different parts of the world, or have a different amount of information available (maybe not all technical manuals have been translated). Therefore, the categories should be of different granularity in different locales.


Saturday, January 1st, 2005

*Pixelcharmer: Field Notes: Mapping Problems: “A friend and I were discussing the similarity between multilingual thesauri problems and the difficulties you run into when mapping different taxonomies.”

Friday, December 31st, 2004

How I Learned French in One Year

Monday, December 27th, 2004

Passion is driving force behind translator’s work

Sunday, December 26th, 2004

Do you speak Frenglish? ~ Naked Translations:

“Coordinator: “Please write your ideas on the flip-chart.”
Celine: “Veuillez noter vos idees sur le… le…”

What’s flip-chart in French?? Don’t panic, don’t panic.

“Le… le…”

19 pairs of eyes are on me. I can feel drops of sweat slowing running down my cold forehead.

“Le… le…”

“TABLEAU DE CONFÉRENCE!”, I finally blurt out, a bit too loudly. I’m sure I can hear a crowd cheering and chanting my name in the distance.

French client: “Tableau de conference? C’est marrant, nous on dit paperboard.” (That’s funny, we say paperboard).”

Monday, December 6th, 2004

Topic Exchange: Channel ‘multilingual_blogging’

Monday, December 6th, 2004

Joel on the impossibility of translation.

Friday, December 3rd, 2004

More typical American words: condo and SUV. Yesterday, a Belgian colleague told me he was working on a site that had a “community” tab, in French “communauté”. In Flemish (= Dutch) though, “community” translates into “gemeenschap”, which has sexual connotations.

My Dutch readers, have you seen a good translation of “community”?