You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 

11 KiB

Changelog

All notable changes to charset-normalizer will be documented in this file. This project adheres to Semantic Versioning. The format is based on Keep a Changelog.

2.1.0 (2022-06-19)

Added

  • Output the Unicode table version when running the CLI with --version (PR #194)

Changed

  • Re-use decoded buffer for single byte character sets from @nijel (PR #175)
  • Fixing some performance bottlenecks from @deedy5 (PR #183)

Fixed

  • Workaround potential bug in cpython with Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space (PR #175)
  • CLI default threshold aligned with the API threshold from @oleksandr-kuzmenko (PR #181)

Removed

  • Support for Python 3.5 (PR #192)

Deprecated

  • Use of backport unicodedata from unicodedata2 as Python is quickly catching up, scheduled for removal in 3.0 (PR #194)

2.0.12 (2022-02-12)

Fixed

  • ASCII miss-detection on rare cases (PR #170)

2.0.11 (2022-01-30)

Added

  • Explicit support for Python 3.11 (PR #164)

Changed

  • The logging behavior have been completely reviewed, now using only TRACE and DEBUG levels (PR #163 #165)

2.0.10 (2022-01-04)

Fixed

  • Fallback match entries might lead to UnicodeDecodeError for large bytes sequence (PR #154)

Changed

  • Skipping the language-detection (CD) on ASCII (PR #155)

2.0.9 (2021-12-03)

Changed

  • Moderating the logging impact (since 2.0.8) for specific environments (PR #147)

Fixed

  • Wrong logging level applied when setting kwarg explain to True (PR #146)

2.0.8 (2021-11-24)

Changed

  • Improvement over Vietnamese detection (PR #126)
  • MD improvement on trailing data and long foreign (non-pure latin) data (PR #124)
  • Efficiency improvements in cd/alphabet_languages from @adbar (PR #122)
  • call sum() without an intermediary list following PEP 289 recommendations from @adbar (PR #129)
  • Code style as refactored by Sourcery-AI (PR #131)
  • Minor adjustment on the MD around european words (PR #133)
  • Remove and replace SRTs from assets / tests (PR #139)
  • Initialize the library logger with a NullHandler by default from @nmaynes (PR #135)
  • Setting kwarg explain to True will add provisionally (bounded to function lifespan) a specific stream handler (PR #135)

Fixed

  • Fix large (misleading) sequence giving UnicodeDecodeError (PR #137)
  • Avoid using too insignificant chunk (PR #137)

Added

  • Add and expose function set_logging_handler to configure a specific StreamHandler from @nmaynes (PR #135)
  • Add CHANGELOG.md entries, format is based on Keep a Changelog (PR #141)

2.0.7 (2021-10-11)

Added

  • Add support for Kazakh (Cyrillic) language detection (PR #109)

Changed

  • Further, improve inferring the language from a given single-byte code page (PR #112)
  • Vainly trying to leverage PEP263 when PEP3120 is not supported (PR #116)
  • Refactoring for potential performance improvements in loops from @adbar (PR #113)
  • Various detection improvement (MD+CD) (PR #117)

Removed

  • Remove redundant logging entry about detected language(s) (PR #115)

Fixed

  • Fix a minor inconsistency between Python 3.5 and other versions regarding language detection (PR #117 #102)

2.0.6 (2021-09-18)

Fixed

  • Unforeseen regression with the loss of the backward-compatibility with some older minor of Python 3.5.x (PR #100)
  • Fix CLI crash when using --minimal output in certain cases (PR #103)

Changed

  • Minor improvement to the detection efficiency (less than 1%) (PR #106 #101)

2.0.5 (2021-09-14)

Changed

  • The project now comply with: flake8, mypy, isort and black to ensure a better overall quality (PR #81)
  • The BC-support with v1.x was improved, the old staticmethods are restored (PR #82)
  • The Unicode detection is slightly improved (PR #93)
  • Add syntax sugar __bool__ for results CharsetMatches list-container (PR #91)

Removed

  • The project no longer raise warning on tiny content given for detection, will be simply logged as warning instead (PR #92)

Fixed

  • In some rare case, the chunks extractor could cut in the middle of a multi-byte character and could mislead the mess detection (PR #95)
  • Some rare 'space' characters could trip up the UnprintablePlugin/Mess detection (PR #96)
  • The MANIFEST.in was not exhaustive (PR #78)

2.0.4 (2021-07-30)

Fixed

  • The CLI no longer raise an unexpected exception when no encoding has been found (PR #70)
  • Fix accessing the 'alphabets' property when the payload contains surrogate characters (PR #68)
  • The logger could mislead (explain=True) on detected languages and the impact of one MBCS match (PR #72)
  • Submatch factoring could be wrong in rare edge cases (PR #72)
  • Multiple files given to the CLI were ignored when publishing results to STDOUT. (After the first path) (PR #72)
  • Fix line endings from CRLF to LF for certain project files (PR #67)

Changed

  • Adjust the MD to lower the sensitivity, thus improving the global detection reliability (PR #69 #76)
  • Allow fallback on specified encoding if any (PR #71)

2.0.3 (2021-07-16)

Changed

  • Part of the detection mechanism has been improved to be less sensitive, resulting in more accurate detection results. Especially ASCII. (PR #63)
  • According to the community wishes, the detection will fall back on ASCII or UTF-8 in a last-resort case. (PR #64)

2.0.2 (2021-07-15)

Fixed

  • Empty/Too small JSON payload miss-detection fixed. Report from @tseaver (PR #59)

Changed

  • Don't inject unicodedata2 into sys.modules from @akx (PR #57)

2.0.1 (2021-07-13)

Fixed

  • Make it work where there isn't a filesystem available, dropping assets frequencies.json. Report from @sethmlarson. (PR #55)
  • Using explain=False permanently disable the verbose output in the current runtime (PR #47)
  • One log entry (language target preemptive) was not show in logs when using explain=True (PR #47)
  • Fix undesired exception (ValueError) on getitem of instance CharsetMatches (PR #52)

Changed

  • Public function normalize default args values were not aligned with from_bytes (PR #53)

Added

  • You may now use charset aliases in cp_isolation and cp_exclusion arguments (PR #47)

2.0.0 (2021-07-02)

Changed

  • 4x to 5 times faster than the previous 1.4.0 release. At least 2x faster than Chardet.
  • Accent has been made on UTF-8 detection, should perform rather instantaneous.
  • The backward compatibility with Chardet has been greatly improved. The legacy detect function returns an identical charset name whenever possible.
  • The detection mechanism has been slightly improved, now Turkish content is detected correctly (most of the time)
  • The program has been rewritten to ease the readability and maintainability. (+Using static typing)+
  • utf_7 detection has been reinstated.

Removed

  • This package no longer require anything when used with Python 3.5 (Dropped cached_property)
  • Removed support for these languages: Catalan, Esperanto, Kazakh, Baque, Volapük, Azeri, Galician, Nynorsk, Macedonian, and Serbocroatian.
  • The exception hook on UnicodeDecodeError has been removed.

Deprecated

  • Methods coherence_non_latin, w_counter, chaos_secondary_pass of the class CharsetMatch are now deprecated and scheduled for removal in v3.0

Fixed

  • The CLI output used the relative path of the file(s). Should be absolute.

1.4.1 (2021-05-28)

Fixed

  • Logger configuration/usage no longer conflict with others (PR #44)

1.4.0 (2021-05-21)

Removed

  • Using standard logging instead of using the package loguru.
  • Dropping nose test framework in favor of the maintained pytest.
  • Choose to not use dragonmapper package to help with gibberish Chinese/CJK text.
  • Require cached_property only for Python 3.5 due to constraint. Dropping for every other interpreter version.
  • Stop support for UTF-7 that does not contain a SIG.
  • Dropping PrettyTable, replaced with pure JSON output in CLI.

Fixed

  • BOM marker in a CharsetNormalizerMatch instance could be False in rare cases even if obviously present. Due to the sub-match factoring process.
  • Not searching properly for the BOM when trying utf32/16 parent codec.

Changed

  • Improving the package final size by compressing frequencies.json.
  • Huge improvement over the larges payload.

Added

  • CLI now produces JSON consumable output.
  • Return ASCII if given sequences fit. Given reasonable confidence.

1.3.9 (2021-05-13)

Fixed

  • In some very rare cases, you may end up getting encode/decode errors due to a bad bytes payload (PR #40)

1.3.8 (2021-05-12)

Fixed

  • Empty given payload for detection may cause an exception if trying to access the alphabets property. (PR #39)

1.3.7 (2021-05-12)

Fixed

  • The legacy detect function should return UTF-8-SIG if sig is present in the payload. (PR #38)

1.3.6 (2021-02-09)

Changed

  • Amend the previous release to allow prettytable 2.0 (PR #35)

1.3.5 (2021-02-08)

Fixed

  • Fix error while using the package with a python pre-release interpreter (PR #33)

Changed

  • Dependencies refactoring, constraints revised.

Added

  • Add python 3.9 and 3.10 to the supported interpreters