Language Detection

Carim asked an interesting question:

Do you think a UDF could determine the string language … ?

In Switzerland, there are three national languages (German, French, Italian) and it would be very handy to have an UDF return in which language a given string is written

My first thought was to create a group of common words in each language. Then match the contents of the string to those words. I guessed that if the input string had 10 words, I would only need 50 words in the group to predict the language. I have no idea if that’s really enough, but I’d like to know your opinion.

Before I could test my theory, I came to my senses and turned to Google. I came up with the following code:

Nice and clean, but doesn’t really tell you much. It uses the Google AJAX Language API. Instead of me comparing ten words to 50 words in three languages, Google probably compares every word to every word in every language. Here’s some of the CLanguage class:

When I set InputText to something, I create a new xmlhttp object, pass it some stuff, send it, then read back the ResponseText. I don’t know the ins and outs of the xmlhttp object, I just copied it from Juan Pablo. So I probably don’t have near the error checking in there that I need. The class has some properties that you can read too.

I’m parsing ResponseText to get at the piece I need. Here’s an example of what the ResponseText looks like:

{"responseData": {"language":"en","isReliable":true,"confidence":0.75503504}, "responseDetails": null, "responseStatus": 200}

From what I can tell, Google determines whether the detection was reliable based on a confidence of 50%. I decided to let the user determine how confident it should be and set the default at 25%. I think in Excel, you’re going to pass smaller strings and need to be more aggressive with the guessing. Here are some of the results.

Also note that I only pull the first two letters of the language. A few of the languages have more letters, but I was too lazy to do it right.

That 90 in the last example, is 9000% so it will never pass. I meant to pass it 90%, which would give the same result. The confidence level per the ResponseText was 75.5%.

Other sources:

9 thoughts on “Language Detection

  1. I’d figure the present tense forms of the verbs to be and to have along with the pronouns would be sufficient to identify any German or Latin root language. For that matter, I’d figure the frequency of the German articles der and das alone would be sufficient to identify German from either French or Italian. Then the frequency of French plural definite article les and the cojunction et should be sufficient to identify French from Italian.

    IOW, I’d be willing to bet the following would be sufficient in any noncontrived text of 50 words or more.

    German der, das, ein, und, oder
    French les, un, et, ou

    Tangent: I thought Rumantsch was still also a national language.

  2. I read an rss feed from WordPress.com for any post that contains the word ‘Excel’. It gets tons of foreign language blogs since Excel is the same in every language. So I run them through an rss filter that filters out based on one word. I just find a word in the post that couldn’t possibly appear in English and use that in my filter. It’s the complement to what we’re doing here; excluding rather than including languages. But I think it supports what fzz says; it won’t take many carefully chosen words to identify a language provided the input is long enough.

  3. Indeed, we Swiss people do have 4 official national languages and it includes Rumantsch.

    Another way crossed my mind, but it perhaps fulfills not quite the task that was asked for. I would simply retrieve this information by

    Application.LanguageSettings.LanguageID(msoLanguageIDUI)

    This way, you would get the language used. Of course, it is not a detection which you could guess a language independently from the used system (that’s maybe what was asked for).

    Anyway, it’s a very interesting solution you suggested.

    By the way, can anyone explain me what reason you could have to use Python for Excel?

  4. Hi Dick:

    The Google response is in JSON. Rather than parse the result yourself, get one http://www.ediy.co.nz/vbjson-json-parser-library-in-vb6-xidc55680.html or http://code.google.com/p/vba-json/

    I used the latter since it is Excel based VBA.

    Now, the function becomes

    Public Function DetectLanguage(sInput As String, Optional dConfidence As Double = 0.25) As String
       
        Dim clsLanguage As New cLanguage
        With clsLanguage
        .InputText = sInput
       
        DetectLanguage = IIf(.Confidence > dConfidence, .LanguageCode, “unknown”)
            End With
        End Function

    In the cLanguage class, modify:

    Private msInputText As String, ResponseText As String, ResponseObj
        If oHttp.Status = 200 Then
            ResponseText = oHttp.ResponseText
            Dim X As JSONLib: Set X = New JSONLib
            Set ResponseObj = X.parse(“[” & ResponseText & “]”)
        End If
    Public Property Get Confidence() As Double
        Confidence = ResponseObj(1)(“responseData”)(“confidence”)
        End Property
    Public Property Get LanguageCode() As String
        LanguageCode = ResponseObj(1)(“responseData”)(“language”)
        End Property

    and the code to use DetectLanguage

    Sub testDetectLang()
        MsgBox DetectLanguage(“this is one language”) _
            & “,” & DetectLanguage(“ich bien ein berliner”) _
            & “,” & DetectLanguage(“esta es una segunda lengua”) _
            & “,” & DetectLanguage(“eto mae hyn yn iaith arall”) _
            & “,” & DetectLanguage(“Bu baska bir dildir”, 0) _
            & “,” & DetectLanguage(“Dit is nog ‘n ander taal”)
            ‘Welsh is reported as en, Afrikaans is reported as nl, _
             tr confidence is lower than the default, don’t know what it is but I used 0
        End Sub
  5. Hi there,
    I know this is a very old post, but I am hoping there is an update which will work as it is exactly what I am looking for.
    Would anyone be able to give me some tips on how to get this working as I am getting a compile error on the third line of code in the original post.
    Niall

  6. Unfortunately that google api doesn’t work anymore. They have a different one that requires an API key, which makes it more complicated. I wasn’t able to find a simple one.


Posting code? Use <pre> tags for VBA and <code> tags for inline.

Leave a Reply

Your email address will not be published.