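Yesterday, in the comments on my post announcing that the latest CYQ.Data V5 series (an ORM data layer with .NET Core support) had been open-sourced,

I happened to notice a comment that pointed to an open-source project.

I went to have a look at that address, and the look turned into a full day of tinkering.

Today I'd like to share what came out of it.

The repository offers the code in two versions, C++ and C#, and the overall coding style leans towards C++.

Most of my time went into detecting the encoding when there is no BOM, which also gave me a chance to revisit the basics of the various encodings.

Before reading on, it helps to be familiar with the concepts of character encodings and the BOM (byte order mark).

A file, or a byte stream, is just a pile of binary data.

If the data carries a BOM, that is, the first two to four bytes are fixed values such as 255, 254, then decoding is easy.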
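To make the "fixed leading bytes" concrete, here is a minimal sketch that asks .NET itself for the BOM of each encoding through `Encoding.GetPreamble()`. The class and method names (`BomDemo`, `PreambleHex`, `StartsWithBom`) are made up for illustration and are not part of IOHelper.

```csharp
using System;
using System.Text;

// Illustrative helper only; the names are assumptions, not part of CYQ.Data.
public static class BomDemo
{
    // Returns the BOM (preamble) of an encoding as a hex string such as "EF BB BF".
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble()).Replace('-', ' ');
    }

    // True when the buffer starts with the BOM of the given encoding.
    public static bool StartsWithBom(byte[] buffer, Encoding encoding)
    {
        byte[] bom = encoding.GetPreamble();
        if (bom.Length == 0 || buffer.Length < bom.Length)
        {
            return false;
        }
        for (int i = 0; i < bom.Length; i++)
        {
            if (buffer[i] != bom[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

With the built-in encodings, `PreambleHex` gives `EF BB BF` for `Encoding.UTF8`, `FF FE` for `Encoding.Unicode`, `FE FF` for `Encoding.BigEndianUnicode` and `FF FE 00 00` for `Encoding.UTF32`. Note that the UTF-32 LE BOM begins with the UTF-16 LE BOM, so the longer one has to be tested first.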

Previously, the file-reading code inside IOHelper was written like this:

/// <summary>
/// Auto-detects UTF8 first; otherwise falls back to the Default encoding.
/// </summary>
/// <returns></returns>
public static string ReadAllText(string fileName)
{
    return ReadAllText(fileName, DefaultEncoding);
}
public static string ReadAllText(string fileName, Encoding encoding)
{
    try
    {
        if (!File.Exists(fileName))
        {
            return string.Empty;
        }
        Byte[] buff = null;
        lock (GetLockObj(fileName.Length))
        {
            if (!File.Exists(fileName)) // re-check inside the lock for the multi-threaded case
            {
                return string.Empty;
            }
            buff = File.ReadAllBytes(fileName);
        }
        if (buff.Length == 0) { return ""; }
        // Length checks keep 1- and 2-byte files from indexing past the end of the buffer.
        if (buff.Length >= 3 && buff[0] == 239 && buff[1] == 187 && buff[2] == 191) // EF BB BF: UTF8
        {
            return Encoding.UTF8.GetString(buff, 3, buff.Length - 3);
        }
        else if (buff.Length >= 2 && buff[0] == 255 && buff[1] == 254) // FF FE: UTF16 LE or UTF-32 LE
        {
            // The UTF-32 LE BOM is FF FE 00 00, so it is tested before falling back to UTF16 LE.
            if (buff.Length > 3 && buff[2] == 0 && buff[3] == 0)
            {
                return Encoding.UTF32.GetString(buff, 4, buff.Length - 4);
            }
            return Encoding.Unicode.GetString(buff, 2, buff.Length - 2);
        }
        else if (buff.Length >= 2 && buff[0] == 254 && buff[1] == 255) // FE FF: UTF16 BE
        {
            return Encoding.BigEndianUnicode.GetString(buff, 2, buff.Length - 2);
        }
        return encoding.GetString(buff);
    }
    catch (Exception err)
    {
        Log.WriteLogToTxt(err);
    }
    return string.Empty;
}

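Put simply, the code checks the BOM header, identifies the encoding from it, and decodes with the matching encoding.

Chinese text all displays correctly.

On Windows, "Save As" for text files only offers ANSI, UTF8, Unicode (UTF16LE) and BigEndianUnicode (UTF16BE).

Of these four, the ones that carry a BOM are easy to detect.

But what if the file or the bytes have no BOM header? Decoding with the default encoding then has a certain chance of producing garbled text.

Given a byte stream with no BOM, working out the encoding is quite difficult.

It requires some familiarity with the rules of the various encodings.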
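As a small illustration of why guessing wrong matters, the sketch below writes a string with one encoding and reads it back with another. `MojibakeDemo` and `SurvivesRoundTrip` are hypothetical names used only for this example; ISO-8859-1 stands in for "some default single-byte encoding" and is an assumption, not what IOHelper uses.

```csharp
using System;
using System.Text;

// Illustrative helper only; the names are assumptions, not part of CYQ.Data.
public static class MojibakeDemo
{
    // Encodes text with one encoding, decodes it with another,
    // and reports whether the original text survived.
    public static bool SurvivesRoundTrip(string text, Encoding writer, Encoding reader)
    {
        byte[] bytes = writer.GetBytes(text);
        return reader.GetString(bytes) == text;
    }
}
```

For instance, `SurvivesRoundTrip("abc", Encoding.UTF8, Encoding.GetEncoding("iso-8859-1"))` holds because ASCII bytes mean the same thing in both encodings, whereas the same call with `"中文"` does not: each Chinese character becomes three unrelated single-byte characters.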

Let's first look at the original source code on GitHub that the commenter pointed to:

public Encoding DetectEncoding(byte[] buffer, int size)
{
    // First check if we have a BOM and return that if so
    Encoding encoding = CheckBom(buffer, size);
    if (encoding != Encoding.None)
    {
        return encoding;
    }

    // Now check for valid UTF8
    encoding = CheckUtf8(buffer, size);
    if (encoding != Encoding.None)
    {
        return encoding;
    }

    // Now try UTF16
    encoding = CheckUtf16NewlineChars(buffer, size);
    if (encoding != Encoding.None)
    {
        return encoding;
    }

    encoding = CheckUtf16Ascii(buffer, size);
    if (encoding != Encoding.None)
    {
        return encoding;
    }

    // ANSI or None (binary) then
    if (!DoesContainNulls(buffer, size))
    {
        return Encoding.Ansi;
    }

    // Found a null, return based on the preference in null_suggests_binary_
    return _nullSuggestsBinary ? Encoding.None : Encoding.Ansi;
}

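The flow of the code (and the intent behind it) translates to this:

1. Check for a BOM header, which is easy.

2. Check for UTF8 (a rather creative step): if the bytes fully conform to the UTF8 rules, the data is treated as UTF8.

3. Look for line-break characters in the bytes (the position of the 0 byte next to the line break tells UTF16 BE, big-endian, from LE, little-endian).

   Whether this succeeds depends on the length of the byte sample and on whether it contains a line break at all.

4. Measure how often 0 appears at odd and even byte positions and judge against a preset expected ratio (basically useless for Chinese). It was presumably written by a non-Chinese developer who only analyzed the probabilities for English text.

5. Check whether any 0 byte appears; if not, return the system default encoding (which differs between system environments).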
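Step 2 can be tried out without the repository code. The sketch below is not the author's `CheckUtf8`: it swaps in the framework's strict decoder, `new UTF8Encoding(false, true)`, which throws on any byte sequence that breaks the UTF8 rules. Unlike `CheckUtf8`, it accepts pure ASCII and 0 bytes as valid and rejects a multi-byte sequence cut off at the end of the buffer. `Utf8Probe` and `LooksLikeUtf8` are names invented for this example.

```csharp
using System;
using System.Text;

// Illustrative helper only; the names are assumptions, not part of CYQ.Data.
public static class Utf8Probe
{
    // Strict decoder: no BOM emitted, throws on invalid bytes instead of
    // silently substituting the replacement character.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // True when every byte sequence in the buffer is well-formed UTF8.
    public static bool LooksLikeUtf8(byte[] buffer)
    {
        try
        {
            Strict.GetString(buffer);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

The bytes `E4 B8 AD` (the character 中 in UTF8) pass, while `D6 D0 CE C4` (中文 in GBK) fail: `D6` announces a two-byte sequence, but `D0` is outside the continuation range 128-191. This is why the UTF8 check carries over to Chinese text so well.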

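First of all, I have to say the original author had some good ideas.

Apart from the rule-based UTF8 analysis, none of the other checks hold up when applied to a Chinese environment.

Still, the approach itself offers plenty of inspiration.

So it cost me the better part of a day to rewrite it, rework it, and test it in a Chinese environment.

The flow of the reworked code looks like this: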

public Encoding DetectWithoutBom(byte[] buffer, int size)
{
    // Now check for valid UTF8
    Encoding encoding = CheckUtf8(buffer, size);
    if (encoding != Encoding.None)
    {
        return encoding;
    }
    // No zero byte at all: ANSI; also record whether it looks like a Chinese encoding
    if (!ContainsZero(buffer, size))
    {
        CheckChinese(buffer, size);
        return Encoding.Ansi;
    }
    // Now try UTF16 / UTF-32: look for line-break characters first
    encoding = CheckByNewLineChar(buffer, size);
    if (encoding != Encoding.None)
    {
        return encoding;
    }
    // Last resort: make a rough guess from the ratio of zero bytes
    encoding = CheckByZeroNumPercent(buffer, size);
    if (encoding != Encoding.None)
    {
        return encoding;
    }
    // Zero bytes were found but nothing matched: unknown or binary
    return Encoding.None;
}

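In plain words, the flow is:

1. The UTF8 detection rules are generic and effective, so they are kept.
2. Reordered: first check whether there is any 0 byte; if there is none, add a check for Chinese encodings (GB2312, GBK, Big5).
   This comes in handy later.
3. Line-break detection: added detection of UTF-32 (the original approach only covered UTF16).
4. Probability estimate: reworked so that it also suits a Chinese environment.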
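The reason the 0-byte checks behave so differently for English and Chinese text shows up as soon as the zeros are counted by position. This is only an illustrative sketch; `ZeroByteDemo` and `CountZeros` are names chosen for the example, not code from the repository.

```csharp
using System;
using System.Text;

// Illustrative helper only; the names are assumptions, not part of CYQ.Data.
public static class ZeroByteDemo
{
    // Returns { zeros at even offsets, zeros at odd offsets }.
    public static int[] CountZeros(byte[] buffer)
    {
        int even = 0;
        int odd = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            if (buffer[i] != 0)
            {
                continue;
            }
            if (i % 2 == 0) { even++; } else { odd++; }
        }
        return new[] { even, odd };
    }
}
```

`Encoding.Unicode.GetBytes("Hi\n")` is `48 00 69 00 0A 00`, so all three zeros sit at odd offsets; with `Encoding.BigEndianUnicode` they all move to even offsets. `Encoding.Unicode.GetBytes("中文")` is `2D 4E 87 65` and contains no zero at all, which is why pure Chinese UTF16 text falls through to the "no 0 byte" branch in step 2.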

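The test results are as follows:

A. Pure Chinese text:

In this test, BigEndianUnicode input comes out garbled.

B. Text that is not purely Chinese:

All encodings are detected correctly.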

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
namespace CYQ.Data.Tool
{
    internal static class IOHelper
    {
        internal static Encoding DefaultEncoding = Encoding.Default;
        // Pool of lock objects; a file is mapped to one of them by the length of its name.
        private static List<object> tenObj = new List<object>(10);
        private static List<object> TenObj
        {
            get
            {
                if (tenObj.Count == 0)
                {
                    for (int i = 0; i < 10; i++)
                    {
                        tenObj.Add(new object());
                    }
                }
                return tenObj;
            }
        }
        private static object GetLockObj(int length)
        {
            int i = length % 9;
            return TenObj[i];
        }
        /// <summary>
        /// Auto-detects the encoding first; otherwise falls back to the Default encoding.
        /// </summary>
        /// <returns></returns>
        public static string ReadAllText(string fileName)
        {
            return ReadAllText(fileName, DefaultEncoding);
        }
        public static string ReadAllText(string fileName, Encoding encoding)
        {
            try
            {
                if (!File.Exists(fileName))
                {
                    return string.Empty;
                }
                Byte[] buff = null;
                lock (GetLockObj(fileName.Length))
                {
                    if (!File.Exists(fileName)) // re-check inside the lock for the multi-threaded case
                    {
                        return string.Empty;
                    }
                    buff = File.ReadAllBytes(fileName);
                    return BytesToText(buff, encoding);
                }
            }
            catch (Exception err)
            {
                Log.WriteLogToTxt(err);
            }
            return string.Empty;
        }
        public static bool Write(string fileName, string text)
        {
            return Save(fileName, text, false, DefaultEncoding, true);
        }
        public static bool Write(string fileName, string text, Encoding encode)
        {
            return Save(fileName, text, false, encode, true);
        }
        public static bool Append(string fileName, string text)
        {
            return Save(fileName, text, true, true);
        }
        internal static bool Save(string fileName, string text, bool isAppend, bool writeLogOnError)
        {
            // Forward isAppend rather than a hard-coded true.
            return Save(fileName, text, isAppend, DefaultEncoding, writeLogOnError);
        }
        internal static bool Save(string fileName, string text, bool isAppend, Encoding encode, bool writeLogOnError)
        {
            try
            {
                string folder = Path.GetDirectoryName(fileName);
                // GetDirectoryName returns an empty string for a bare file name; there is nothing to create then.
                if (!string.IsNullOrEmpty(folder) && !Directory.Exists(folder))
                {
                    Directory.CreateDirectory(folder);
                }
                lock (GetLockObj(fileName.Length))
                {
                    using (StreamWriter writer = new StreamWriter(fileName, isAppend, encode))
                    {
                        writer.Write(text);
                    }
                }
                return true;
            }
            catch (Exception err)
            {
                if (writeLogOnError)
                {
                    Log.WriteLogToTxt(err);
                }
                else
                {
                    Error.Throw("IOHelper.Save() : " + err.Message);
                }
            }
            return false;
        }
        internal static bool Delete(string fileName)
        {
            try
            {
                if (File.Exists(fileName))
                {
                    lock (GetLockObj(fileName.Length))
                    {
                        if (File.Exists(fileName))
                        {
                            File.Delete(fileName);
                            return true;
                        }
                    }
                }
            }
            catch
            {
            }
            return false;
        }
        public static bool IsLastFileWriteTimeChanged(string fileName, ref DateTime compareTimeUtc)
        {
            bool isChanged = false;
            IOInfo info = new IOInfo(fileName);
            if (info.Exists && info.LastWriteTimeUtc != compareTimeUtc)
            {
                isChanged = true;
                compareTimeUtc = info.LastWriteTimeUtc;
            }
            return isChanged;
        }
        public static string BytesToText(byte[] buff, Encoding encoding)
        {
            if (buff.Length == 0) { return ""; }
            TextEncodingDetect detect = new TextEncodingDetect();
            // Check for a BOM and skip it when decoding.
            switch (detect.DetectWithBom(buff))
            {
                case TextEncodingDetect.Encoding.Utf8Bom:
                    return Encoding.UTF8.GetString(buff, 3, buff.Length - 3);
                case TextEncodingDetect.Encoding.UnicodeBom:
                    return Encoding.Unicode.GetString(buff, 2, buff.Length - 2);
                case TextEncodingDetect.Encoding.BigEndianUnicodeBom:
                    return Encoding.BigEndianUnicode.GetString(buff, 2, buff.Length - 2);
                case TextEncodingDetect.Encoding.Utf32Bom:
                    return Encoding.UTF32.GetString(buff, 4, buff.Length - 4);
            }
            if (encoding != DefaultEncoding && encoding != Encoding.ASCII) // an explicitly chosen encoding takes priority
            {
                return encoding.GetString(buff);
            }
            switch (detect.DetectWithoutBom(buff, buff.Length > 1000 ? 1000 : buff.Length)) // auto-detect from at most 1000 bytes
            {
                case TextEncodingDetect.Encoding.Utf8Nobom:
                    return Encoding.UTF8.GetString(buff);
                case TextEncodingDetect.Encoding.UnicodeNoBom:
                    return Encoding.Unicode.GetString(buff);
                case TextEncodingDetect.Encoding.BigEndianUnicodeNoBom:
                    return Encoding.BigEndianUnicode.GetString(buff);
                case TextEncodingDetect.Encoding.Utf32NoBom:
                    return Encoding.UTF32.GetString(buff);
                case TextEncodingDetect.Encoding.Ansi:
                    if (IsChineseEncoding(DefaultEncoding) && !IsChineseEncoding(encoding))
                    {
                        if (detect.IsChinese)
                        {
                            return Encoding.GetEncoding("gbk").GetString(buff);
                        }
                        else // not a Chinese double-byte encoding: pick a default, UTF16 LE (pure Chinese UTF16 has no zero bytes)
                        {
                            return Encoding.Unicode.GetString(buff);
                        }
                    }
                    else
                    {
                        return encoding.GetString(buff);
                    }
                case TextEncodingDetect.Encoding.Ascii:
                    return Encoding.ASCII.GetString(buff);
                default:
                    return encoding.GetString(buff);
            }
        }
        private static bool IsChineseEncoding(Encoding encoding)
        {
            // Compare code pages rather than object references: 936 = GB2312/GBK, 950 = Big5.
            return encoding.CodePage == 936 || encoding.CodePage == 950;
        }
    }
    internal class IOInfo : FileSystemInfo
    {
        public IOInfo(string fileName)
        {
            base.FullPath = fileName;
        }
        public override void Delete()
        {
        }
        public override bool Exists
        {
            get
            {
                return File.Exists(base.FullPath);
            }
        }
        public override string Name
        {
            get
            {
                return null;
            }
        }
    }
    /// <summary>
    /// Detects the text encoding of a byte buffer.
    /// </summary>
    internal class TextEncodingDetect
    {
        private readonly byte[] _UTF8Bom =
        {
            0xEF,
            0xBB,
            0xBF
        };
        // UTF16 LE (Unicode)
        private readonly byte[] _UTF16LeBom =
        {
            0xFF,
            0xFE
        };
        // UTF16 BE (BigEndianUnicode)
        private readonly byte[] _UTF16BeBom =
        {
            0xFE,
            0xFF
        };
        // UTF-32 LE
        private readonly byte[] _UTF32LeBom =
        {
            0xFF,
            0xFE,
            0x00,
            0x00
        };
        // UTF-32 BE
        //private readonly byte[] _UTF32BeBom =
        //{
        //    0x00,
        //    0x00,
        //    0xFE,
        //    0xFF
        //};
        /// <summary>
        /// Whether the bytes were detected as a Chinese encoding.
        /// </summary>
        public bool IsChinese = false;
        public enum Encoding
        {
            None, // Unknown or binary
            Ansi, // 0-255
            Ascii, // 0-127
            Utf8Bom, // UTF8 with BOM
            Utf8Nobom, // UTF8 without BOM
            UnicodeBom, // UTF16 LE with BOM
            UnicodeNoBom, // UTF16 LE without BOM
            BigEndianUnicodeBom, // UTF16-BE with BOM
            BigEndianUnicodeNoBom, // UTF16-BE without BOM
            Utf32Bom, // UTF-32 LE with BOM
            Utf32NoBom // UTF-32 without BOM
        }
        public Encoding DetectWithBom(byte[] buffer)
        {
            if (buffer != null)
            {
                int size = buffer.Length;
                // Check for BOM. UTF-32 LE (FF FE 00 00) starts with the UTF16 LE BOM (FF FE),
                // so the longer BOM has to be tested first.
                if (size >= 4 && buffer[0] == _UTF32LeBom[0] && buffer[1] == _UTF32LeBom[1]
                    && buffer[2] == _UTF32LeBom[2] && buffer[3] == _UTF32LeBom[3])
                {
                    return Encoding.Utf32Bom;
                }
                if (size >= 2 && buffer[0] == _UTF16LeBom[0] && buffer[1] == _UTF16LeBom[1])
                {
                    return Encoding.UnicodeBom;
                }
                if (size >= 2 && buffer[0] == _UTF16BeBom[0] && buffer[1] == _UTF16BeBom[1])
                {
                    return Encoding.BigEndianUnicodeBom;
                }
                if (size >= 3 && buffer[0] == _UTF8Bom[0] && buffer[1] == _UTF8Bom[1] && buffer[2] == _UTF8Bom[2])
                {
                    return Encoding.Utf8Bom;
                }
            }
            return Encoding.None;
        }
        /// <summary>
        /// Automatically detects the Encoding type of a given byte buffer.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        /// <returns>The Encoding type or Encoding.None if unknown.</returns>
        public Encoding DetectWithoutBom(byte[] buffer, int size)
        {
            // Now check for valid UTF8
            Encoding encoding = CheckUtf8(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // No zero byte at all: ANSI; also record whether it looks like a Chinese encoding
            if (!ContainsZero(buffer, size))
            {
                CheckChinese(buffer, size);
                return Encoding.Ansi;
            }
            // Now try UTF16 / UTF-32: look for line-break characters first
            encoding = CheckByNewLineChar(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // Last resort: make a rough guess from the ratio of zero bytes
            encoding = CheckByZeroNumPercent(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // Zero bytes were found but nothing matched: unknown or binary
            return Encoding.None;
        }
        /// <summary>
        /// Checks if a buffer contains text that looks like utf16 by scanning for
        /// newline chars that would be present even in non-english text.
        /// Decides by looking for line-break characters.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        /// <returns>Encoding.None, Encoding.UnicodeNoBom, Encoding.BigEndianUnicodeNoBom or Encoding.Utf32NoBom.</returns>
        private static Encoding CheckByNewLineChar(byte[] buffer, int size)
        {
            if (size < 2)
            {
                return Encoding.None;
            }
            // Reduce size by 1 so we don't need to worry about bounds checking for pairs of bytes
            size--;
            int le16 = 0;
            int be16 = 0;
            int le32 = 0; // line breaks that look like UTF-32 LE
            int zeroCount = 0; // in UTF-32 LE the trailing bytes of each 4-byte unit are mostly 0
            uint pos = 0;
            while (pos < size)
            {
                byte ch1 = buffer[pos++];
                byte ch2 = buffer[pos++];
                if (ch1 == 0)
                {
                    if (ch2 == 0x0a || ch2 == 0x0d) // \n or \r: line-break check
                    {
                        ++be16;
                    }
                }
                if (ch2 == 0)
                {
                    zeroCount++;
                    if (ch1 == 0x0a || ch1 == 0x0d)
                    {
                        ++le16;
                        if (pos + 1 <= size && buffer[pos] == 0 && buffer[pos + 1] == 0)
                        {
                            ++le32;
                        }
                    }
                }
                // If we are getting both LE and BE control chars then this file is not utf16
                if (le16 > 0 && be16 > 0)
                {
                    return Encoding.None;
                }
            }
            if (le16 > 0)
            {
                if (le16 == le32 && buffer.Length % 4 == 0)
                {
                    return Encoding.Utf32NoBom;
                }
                return Encoding.UnicodeNoBom;
            }
            else if (be16 > 0)
            {
                return Encoding.BigEndianUnicodeNoBom;
            }
            else if (buffer.Length % 4 == 0 && zeroCount >= buffer.Length / 4)
            {
                return Encoding.Utf32NoBom;
            }
            return Encoding.None;
        }
        /// <summary>
        /// Checks if a buffer contains any nulls. Used to check for binary vs text data.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        private static bool ContainsZero(byte[] buffer, int size)
        {
            uint pos = 0;
            while (pos < size)
            {
                if (buffer[pos++] == 0)
                {
                    return true;
                }
            }
            return false;
        }
        /// <summary>
        /// Checks if a buffer contains text that looks like utf16. This is done based
        /// on the use of nulls which in ASCII/script like text can be useful to identify.
        /// Estimates from the proportion of zero bytes.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        /// <returns>Encoding.None, Encoding.UnicodeNoBom or Encoding.BigEndianUnicodeNoBom.</returns>
        private Encoding CheckByZeroNumPercent(byte[] buffer, int size)
        {
            // Zeros at odd offsets
            int oddZeroCount = 0;
            // Zeros at even offsets
            int evenZeroCount = 0;
            // Get even nulls
            uint pos = 0;
            while (pos < size)
            {
                if (buffer[pos] == 0)
                {
                    evenZeroCount++;
                }
                pos += 2;
            }
            // Get odd nulls
            pos = 1;
            while (pos < size)
            {
                if (buffer[pos] == 0)
                {
                    oddZeroCount++;
                }
                pos += 2;
            }
            double evenZeroPercent = evenZeroCount * 2.0 / size;
            double oddZeroPercent = oddZeroCount * 2.0 / size;
            // Some odd nulls, low number of even nulls (condition changed from the original)
            if (evenZeroPercent < 0.1 && oddZeroPercent > 0)
            {
                return Encoding.UnicodeNoBom;
            }
            // Some even nulls, low number of odd nulls (condition also changed from the original)
            if (oddZeroPercent < 0.1 && evenZeroPercent > 0)
            {
                return Encoding.BigEndianUnicodeNoBom;
            }
            // Don't know
            return Encoding.None;
        }
        /// <summary>
        /// Checks if a buffer contains valid utf8.
        /// Checks against the UTF8 byte ranges.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        /// <returns>
        /// Encoding type of Encoding.None (invalid UTF8), Encoding.Utf8Nobom (valid utf8 multibyte strings) or
        /// Encoding.Ascii (data in 0-127 range).
        /// </returns>
        private Encoding CheckUtf8(byte[] buffer, int size)
        {
            // UTF8 Valid sequences
            // 0xxxxxxx ASCII
            // 110xxxxx 10xxxxxx 2-byte
            // 1110xxxx 10xxxxxx 10xxxxxx 3-byte
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4-byte
            //
            // Width in UTF8
            // Decimal Width
            // 0-127 1 byte
            // 194-223 2 bytes
            // 224-239 3 bytes
            // 240-244 4 bytes
            //
            // Subsequent chars are in the range 128-191
            bool onlySawAsciiRange = true;
            uint pos = 0;
            while (pos < size)
            {
                byte ch = buffer[pos++];
                if (ch == 0)
                {
                    return Encoding.None;
                }
                int moreChars;
                if (ch <= 127)
                {
                    // 1 byte
                    moreChars = 0;
                }
                else if (ch >= 194 && ch <= 223)
                {
                    // 2 Byte
                    moreChars = 1;
                }
                else if (ch >= 224 && ch <= 239)
                {
                    // 3 Byte
                    moreChars = 2;
                }
                else if (ch >= 240 && ch <= 244)
                {
                    // 4 Byte
                    moreChars = 3;
                }
                else
                {
                    return Encoding.None; // Not utf8
                }
                // Check secondary chars are in range if we are expecting any
                while (moreChars > 0 && pos < size)
                {
                    onlySawAsciiRange = false; // Seen non-ascii chars now
                    ch = buffer[pos++];
                    if (ch < 128 || ch > 191)
                    {
                        return Encoding.None; // Not utf8
                    }
                    --moreChars;
                }
            }
            // If we get to here then only valid UTF-8 sequences have been processed
            // If we only saw chars in the range 0-127 then we can't assume UTF8 (the caller will need to decide)
            return onlySawAsciiRange ? Encoding.Ascii : Encoding.Utf8Nobom;
        }
        /// <summary>
        /// Checks whether the bytes look like a Chinese encoding (GB2312, GBK, Big5).
        /// </summary>
        private void CheckChinese(byte[] buffer, int size)
        {
            IsChinese = false;
            if (size < 2)
            {
                return;
            }
            // Reduce size by 1 so we don't need to worry about bounds checking for pairs of bytes
            size--;
            uint pos = 0;
            bool isCN = false;
            while (pos < size)
            {
                // GB2312: lead byte 0xB0-0xF7 (176-247), trail byte 0xA1-0xFE (161-254)
                // GBK:    lead byte 0x81-0xFE (129-254), trail byte 0x40-0xFE (64-254)
                // Big5:   lead byte 0x81-0xFE (129-254), trail byte 0x40-0x7E (64-126) or 0xA1-0xFE (161-254)
                // Every byte pair must match, so only text made up entirely of double-byte characters passes.
                byte ch1 = buffer[pos++];
                byte ch2 = buffer[pos++];
                isCN = (ch1 >= 176 && ch1 <= 247 && ch2 >= 161 && ch2 <= 254)
                    || (ch1 >= 129 && ch1 <= 254 && ch2 >= 64 && ch2 <= 254)
                    || (ch1 >= 129 && ch1 <= 254 && ((ch2 >= 64 && ch2 <= 126) || (ch2 >= 161 && ch2 <= 254)));
                if (!isCN)
                {
                    return;
                }
            }
            IsChinese = true;
        }
    }
}

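Later updates will be published at: https://github.com/cyq1162/cyqdata/blob/master/Tool/IOHelper.cs

1. UTF7 is obsolete, so I simply ignored it.

2. For pure Chinese text in UTF16, I have not yet thought of a good way to tell BE from LE, so the code defaults to the more common LE, i.e. Unicode.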
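One direction that might be worth trying for the pure-Chinese case, offered here only as an untested idea and not as part of IOHelper: characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF) all have a high byte between 0x4E and 0x9F. In LE that byte sits at odd offsets, in BE at even offsets, while the low bytes are spread over the whole 0-255 range. The sketch below simply compares the two counts; `CjkEndianGuess` and `Guess` are hypothetical names, and the heuristic assumes the text really is mostly common Chinese characters.

```csharp
using System;
using System.Text;

// Hypothetical heuristic for illustration; not part of CYQ.Data and not tuned on real data.
public static class CjkEndianGuess
{
    // Returns "LE", "BE" or "unknown" for a BOM-less UTF16 buffer of mostly Chinese text.
    public static string Guess(byte[] buffer)
    {
        int even = 0; // bytes at even offsets that could be a CJK high byte
        int odd = 0;  // bytes at odd offsets that could be a CJK high byte
        for (int i = 0; i + 1 < buffer.Length; i += 2)
        {
            if (IsCjkHighByte(buffer[i])) { even++; }
            if (IsCjkHighByte(buffer[i + 1])) { odd++; }
        }
        if (odd > even) { return "LE"; }
        if (even > odd) { return "BE"; }
        return "unknown";
    }

    // High byte of U+4E00..U+9FFF lies in 0x4E..0x9F.
    private static bool IsCjkHighByte(byte b)
    {
        return b >= 0x4E && b <= 0x9F;
    }
}
```

For the sample string "中文编码检测", all six high bytes fall in the range but only one low byte (0x87) does, so the LE bytes are guessed as "LE" and the BE bytes as "BE". Short inputs or text mixed with other scripts could easily tip the counts the wrong way, so a real implementation would need a margin and more testing.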

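3. Everything else works fine. As far as publicly available C# versions in China go, this is probably the only one.

Copyright notice: this is an original article by cyq1162, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://www.cnblogs.com/cyq1162/p/9183424.html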