-----------Summary Charset Detector - as the name says - is a stand alone executable module for automatic charset detection of a given text. It can be useful for internationalisation support in multilingual applications such as web-script editors or Unicode editors. Given input buffer will be analysed to guess used encoding. The result can be used as control parameter for charset conversation procedure. Charset Detector can be compiled (and hopefully used) for MS Windows (as dll - dynamic link library) or Linux. Based on Mozilla's i18n component - http://www.mozilla.org/projects/intl/. -----------State Version 0.2.6 stable. The latest version can be found at http://chsdet.sourceforge.net. -----------Requirements Charset Detector doesn't need any external components. -----------Output As result you will get guessed charset as MS Windows Code Page id and charset name. -----------Licence Charset Detector is open source project and distributed under Lesser GPL. See the GNU Lesser General Public License for more details - http://www.opensource.org/licenses/lgpl-license.php -----------Supported charsets +-----------+---------------------------+------------------------+ | Code pade | Name | Note | +-----------+---------------------------+------------------------+ | 0 | ASCII | Pseudo code page. | | 855 | IBM855 | | | 866 | IBM866 | | | 932 | Shift_JIS | | | 950 | Big5 | | | 1200 | UTF-16LE | | | 1201 | UTF-16BE | | | 1251 | windows-1251 | | | 1252 | windows-1252 | | | 1253 | windows-1253 | | | 1255 | windows-1255 | | | 10007 | x-mac-cyrillic | | | 12000 | X-ISO-10646-UCS-4-2143 | | | 12000 | UTF-32LE | MS Windows hasn't CP.| | | | Try to use USC-4. | | 12001 | X-ISO-10646-UCS-4-3412 | | | 12001 | UTF-32BE | MS Windows hasn't CP.| | | | Try to use USC-4. | | 20866 | KOI8-R | | | 28595 | ISO-8859-5 | | | 28595 | ISO-8859-5 | | | 28597 | ISO-8859-7 | | | 28598 | ISO-8859-8 | | | 50222 | ISO-2022-JP | | | 50225 | ISO-2022-KR | | | 50227 | ISO-2022-CN | | | 51932 | EUC-JP | | | 51936 | x-euc-tw | | | 51949 | EUC-KR | | | 52936 | HZ-GB-2312 | | | 54936 | GB18030 | | | 65001 | UTF-8 | | +-----------+---------------------------+------------------------+ -----------Types Return values NS_OK = 0; NS_ERROR_OUT_OF_MEMORY = $8007000e; Returned types rCharsetInfo = record Name: pChar; // charset GNU canonical name CodePage: integer; // MS Windows CodePage id Language: pChar; // end; rAboutHolder = record MajorVersionNr: Cardinal; // Library's Major Version # MinorVersionNr: Cardinal; // Library's Minor Version # BuildVersionNr: Cardinal; // Library's Build/Release Version # About: pChar; // Copyleft information; end; -----------Exported functions procedure chsd_Reset; stdcall; Reset Charset Detector state. Prepare to new analyse. function chsd_HandleData(aBuf: PChar; aLen: integer): integer; stdcall; Analyse given buffer. Parameters aBuf - pointer to buffer with text. sLen - buffer length; Return value NS_ERROR_OUT_OF_MEMORY - failure. Unable to create internal objects. NS_OK - success. Note Function can be called more that one time to continue guessing. Charset Detector remember last state until chsd_Reset called. function chsd_Done: Boolean; stdcall; Return value TRUE - Charset Detector is sure about text encoding. FALSE - Overwise. Note If input buffer is smaller then 1K Charset Detector returns anyway FALSE. procedure chsd_DataEnd; stdcall; Signalise data end. If Charset Detector hasn't sure result (Done = FALSE) the best guessed encoding will be set as result. function chsd_GetDetectedCharset: rCharsetInfo; stdcall; Returns guessed charset. procedure chsd_GetKnownCharsets(var KnownCharsets: pChar); Fills the parameter with all supported charsets in form "CodePage - Name LineFeed". procedure chsd_GetAbout(var About: rAboutHolder); stdcall; Fills the parameter with version and copyleft information. -----------Sample The definition file "chsd_dll_intf.pas" can be found in the same direcory. Bellow is small usage sample. // WS: WideString; // Wide string which can be used in Unicode controls. // Get encoding of some buffer chsd_Reset; chsd_HandleData(aBuf, aLen); if not chsd_Done then chsd_DataEnd; ChSInfo := chsd_GetDetectedCharset(); // convert buffer to WideString OutputLength := MultiByteToWideChar(ChSInfo.CodePage, 0, aBuf, aLen, nil, 0); SetLength(WS, OutputLength); MultiByteToWideChar(ChSInfo.CodePage, 0, aBuf, aLen, PWideChar(WS), OutputLength); // If you using Unicode SynEdit SynEdit.Lines.Text := WS; Nikolaj Yakowlew © 2006-2008