zhh007 / ude

Automatically exported from code.google.com/p/ude
Other
0 stars 0 forks source link

DESCRIPTION

Ude is a C# port of Mozilla Universal Charset Detector.

The original source code is available at: 

http://mxr.mozilla.org/mozilla/source/extensions/universalchardet/src/    

The article "A composite approach to language/encoding detection" describes
the algorithms of Universal Charset Detector and is available at: 

    http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
    http://mxr.mozilla.org/mozilla-central/source/extensions/universalchardet/doc/UniversalCharsetDetection.doc

Some data-structures used into this port have been adapted from the Java port 
"juniversalchardet", available at:

    http://code.google.com/p/juniversalchardet/

Another port I know of is "chardet" (in Python) available at: 

    http://chardet.feedparser.org/

USAGE

Import the library:

    using Ude;

and feed a stream or a byte array to the detector. Call DataEnd to notify the detector that 
you want back the result:

    ICharsetDetector cdet = new CharsetDetector();
    byte[] buff = new byte[1024];
    int read;
    while ((read = stream.Read(buff, 0, buff.Length)) > 0 && !done) {
        cdet.Feed(buff, 0, read);
    }
    cdet.DataEnd();
    Console.WriteLine("Charset: {0}, confidence: {1}, cdet.Charset, cdet.Confidence);

Alternatively, you can feed a Stream to the detector:

    using (FileStream fs = File.OpenRead(filename)) {
        ICharsetDetector cdet = new CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        Console.WriteLine("Charset: {0}, confidence: {1}, cdet.Charset, cdet.Confidence);
    }    

Or you can provide an alternative implementation of the interface - Ude.ICharsetDetector - 
that wraps the original nsUniversalDetector API. 

LICENSE

This library is subject to the Mozilla Public License Version 1.1 (the
"License").  Alternatively, the contents of this file may be used under the
terms of either the GNU General Public License Version 2 or later (the "GPL"),
or the GNU Lesser General Public License Version 2.1 or later (the "LGPL")