
📝 Documentations adjustments (#46)

TAHRI Ahmed R committed 1 year ago via GitHub (commit fa5e4a8636)
Changed files:

1. README.md (5 changes)
2. UPGRADE.md (4 changes)
3. charset_normalizer/api.py (8 changes)
4. docs/advanced_search.rst (27 changes)
5. docs/getstarted.rst (15 changes)
6. docs/handling_result.rst (8 changes)
7. docs/miscellaneous.rst (22 changes)
8. docs/support.rst (13 changes)

README.md (5 changes)

@@ -85,6 +85,11 @@ Or directly from dev-master for latest preview
 pip install git+https://github.com/Ousret/charset_normalizer.git
 ```
+If you want a more up-to-date `unicodedata` than the one available in your Python setup.
+```sh
+pip install charset-normalizer[unicode_backport]
+```
 ## 🚀 Basic Usage
 ### CLI
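Assuming the `unicode_backport` extra pulls in the `unicodedata2` package (the import name below is an assumption based on the extra's name), a quick sketch to check which `unicodedata` you ended up with:

```python
# Prefer the backport when installed, fall back to the stdlib otherwise.
try:
    import unicodedata2 as unicodedata  # assumed to be what the extra installs
except ImportError:
    import unicodedata

# Reports which Unicode Character Database version is in use.
print(unicodedata.unidata_version)
```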

UPGRADE.md (4 changes)

@@ -1,7 +1,9 @@
 Guide to upgrade your code from v1 to v2
 ----------------------------------------
 * If you are using the legacy `detect` function, that is it. You have nothing to do.
 ## Detection
 ### Before
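For anyone not on the legacy function, a minimal before/after sketch of the v1 to v2 call change, pieced together from the doc diffs below:

```python
# v1 style (removed in v2):
#   from charset_normalizer import CharsetNormalizerMatches as CnM
#   result = CnM.from_bytes(my_byte_str).best().first()

# v2 style: best() directly yields the single most probable match (or None).
from charset_normalizer import from_bytes

my_byte_str = '我没有埋怨,磋砣的只是一些时间。'.encode('gb18030')
result = from_bytes(my_byte_str).best()
```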

charset_normalizer/api.py (8 changes)

@@ -313,6 +313,10 @@ def from_fp(
         preemptive_behaviour: bool = True,
         explain: bool = False
 ) -> CharsetMatches:
+    """
+    Same as the from_bytes function but using a file pointer that is already ready.
+    Will not close the file pointer.
+    """
     return from_bytes(
         fp.read(),
         steps,
@@ -335,6 +339,10 @@ def from_path(
         preemptive_behaviour: bool = True,
         explain: bool = False
 ) -> CharsetMatches:
+    """
+    Same as the from_bytes function but with one extra step: opening and reading the given file path in binary mode.
+    Can raise IOError.
+    """
     with open(path, 'rb') as fp:
         return from_fp(fp, steps, chunk_size, threshold, cp_isolation, cp_exclusion, preemptive_behaviour, explain)
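Both helpers funnel into `from_bytes`, as the new docstrings state; a minimal usage sketch (the file path is hypothetical):

```python
from charset_normalizer import from_fp, from_path

# from_path opens and reads the file in binary mode for you (may raise IOError).
best_guess = from_path('./my_subtitle.srt').best()

# from_fp takes an already-open binary file pointer and will not close it.
with open('./my_subtitle.srt', 'rb') as fp:
    best_guess = from_fp(fp).best()

if best_guess is not None:
    print(best_guess.encoding)
```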

docs/advanced_search.rst (27 changes)

@@ -6,11 +6,11 @@ optional parameters that can be tweaked.
 As follow ::
-    from charset_normalizer import CharsetNormalizerMatches as CnM
+    from charset_normalizer import from_bytes
     my_byte_str = '我没有埋怨,磋砣的只是一些时间。'.encode('gb18030')
-    results = CnM.from_bytes(
+    results = from_bytes(
         my_byte_str,
         steps=10,  # Number of steps/block to extract from my_byte_str
         chunk_size=512,  # Set block size of each extraction
@@ -22,19 +22,19 @@ As follow ::
     )
-Using CharsetNormalizerMatches
+Using CharsetMatches
 ------------------------------
-Here, ``results`` is a ``CharsetNormalizerMatches`` object. It behave like a list.
-Initially it is not sorted. Be cautious when extracting ``first()`` result without calling method ``best()``.
+Here, ``results`` is a ``CharsetMatches`` object. It behaves like a list but does not implement all related methods.
+Initially, it is sorted. Calling ``best()`` is sufficient to extract the most probable result.
-.. autoclass:: charset_normalizer.CharsetNormalizerMatches
+.. autoclass:: charset_normalizer.CharsetMatches
     :members:
 List behaviour
 --------------
-Like said earlier, ``CharsetNormalizerMatches`` object behave like a list.
+As said earlier, a ``CharsetMatches`` object behaves like a list.
 ::
@@ -53,26 +53,25 @@ Like said earlier, ``CharsetNormalizerMatches`` object behave like a list.
 Using best()
 ------------
-Like said above, ``CharsetNormalizerMatches`` object behave like a list and it is not sorted after calling
+As said above, a ``CharsetMatches`` object behaves like a list and is sorted by default after getting results from
 ``from_bytes``, ``from_fp`` or ``from_path``.
-Using ``best()`` keep only the lowest chaotic results and in it the best coherent result if necessary.
-It produce also a ``CharsetNormalizerMatches`` object as return value.
+Using ``best()`` returns the most probable result, the first entry of the list (i.e. index 0).
+It returns a ``CharsetMatch`` object, or ``None`` if there are no results inside it.
 ::
-    results = results.best()
+    result = results.best()
 Calling first()
 ---------------
-This method is callable from a ``CharsetNormalizerMatches`` object. It extract the first match in list.
-This method return a ``CharsetNormalizerMatch`` object. See Handling result section.
+The very same thing as calling the method ``best()``.
 Class aliases
 -------------
-``CharsetNormalizerMatches`` is also known as ``CharsetDetector``, ``CharsetDoctor`` and ``EncodingDetector``.
+``CharsetMatches`` is also known as ``CharsetDetector``, ``CharsetDoctor`` and ``CharsetNormalizerMatches``.
 It is useful if you prefer short class name.
 Verbose output
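To make the list-like behaviour and `best()` concrete, a small sketch reusing the payload from these docs:

```python
from charset_normalizer import from_bytes

results = from_bytes('我没有埋怨,磋砣的只是一些时间。'.encode('gb18030'))

# List-like: iterate over every retained match, already sorted by probability.
for match in results:
    print(match.encoding)

# best() is the first (most probable) entry, or None when nothing was retained.
result = results.best()
if result is not None:
    print(result.encoding)
```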

docs/getstarted.rst (15 changes)

@@ -11,9 +11,9 @@ Charset Normalizer can be installed from pip::
     pip install charset-normalizer
-You may enable extra feature unicode data v12 backport as follow::
+You may retrieve the latest unicodedata backport as follows::
-    pip install charset-normalizer[UnicodeDataBackport]
+    pip install charset-normalizer[unicode_backport]
 From git via master
 -----------------------
@@ -31,20 +31,20 @@ The new way
 You may want to get right to it. ::
-    from charset_normalizer import CharsetNormalizerMatches as CnM
+    from charset_normalizer import from_bytes, from_path
     # This is going to print out your sequence once properly decoded
     print(
-        CnM.from_bytes(
+        from_bytes(
             my_byte_str
-        ).best().first()
+        ).best()
     )
     # You could also want the same from a file
     print(
-        CnM.from_path(
+        from_path(
             './data/sample.1.ar.srt'
-        ).best().first()
+        ).best()
     )
@@ -52,6 +52,7 @@ Backward compatibility
 ----------------------
 If you were used to python chardet, we are providing the very same ``detect()`` method as chardet.
+This function is mostly backward-compatible with Chardet. The migration should be painless.
 ::
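A minimal sketch of that chardet-style call, assuming the result dict mirrors chardet's keys (`encoding`, `language`, `confidence`):

```python
from charset_normalizer import detect

payload = '我没有埋怨,磋砣的只是一些时间。'.encode('gb18030')
guess = detect(payload)

# chardet-shaped result: a plain dict rather than a CharsetMatch.
print(guess['encoding'], guess['confidence'])
```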

docs/handling_result.rst (8 changes)

@@ -9,17 +9,17 @@ When initiating search upon a buffer, bytes or file you can assign the return value
     my_byte_str = '我没有埋怨,磋砣的只是一些时间。'.encode('gb18030')
     # Assign return value so we can fully exploit result
-    result = CnM.from_bytes(
+    result = from_bytes(
         my_byte_str
-    ).best().first()
+    ).best()
     print(result.encoding)  # gb18030
 Using CharsetNormalizerMatch
 ----------------------------
-Here, ``result`` is a ``CharsetNormalizerMatch`` object or ``None``.
+Here, ``result`` is a ``CharsetMatch`` object or ``None``.
-.. autoclass:: charset_normalizer.CharsetNormalizerMatch
+.. autoclass:: charset_normalizer.CharsetMatch
     :members:
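Putting the result handling together, a short sketch using only the accessors shown in these docs (`encoding` and `str()`):

```python
from charset_normalizer import from_bytes

result = from_bytes('我没有埋怨,磋砣的只是一些时间。'.encode('gb18030')).best()

if result is None:
    print('no plausible encoding was retained')
else:
    print(result.encoding)  # e.g. 'gb18030'
    print(str(result))      # the payload decoded to str
```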

docs/miscellaneous.rst (22 changes)

@@ -18,25 +18,3 @@ Any ``CharsetNormalizerMatch`` object can be transformed to exploitable ``str``
     # This should print '我没有埋怨,磋砣的只是一些时间。'
     print(str(result))
-Expect UnicodeDecodeError
--------------------------
-This package also offer you the possibility to reconfigure the way ``UnicodeDecodeError`` is raised.
-Charset Normalizer offer the possibility to extend the actual message inside it to provide a clue about what
-encoding it should actually be.
-::
-    import charset_normalizer  # Nothing else is needed
-    my_byte_str = '我没有埋怨,磋砣的只是一些时间。'.encode('gb18030')
-    my_byte_str.decode('utf_8')  # raise UnicodeDecodeError
-    # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte; you may want to consider gb18030 codec for this sequence.
-    # instead of
-    # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte
-Here, the addition is "you may want to consider gb18030 codec for this sequence.".
-Is does not work when using ``try`` .. ``except`` block.

docs/support.rst (13 changes)

@@ -9,7 +9,8 @@ may change depending of your python version.
 Supported Encodings
 -------------------
-Charset Normalizer is able to detect any of those encoding.
+Charset Normalizer is able to detect any of those encodings. This list is NOT static and depends heavily on what your
+current CPython version ships with. See https://docs.python.org/3/library/codecs.html#standard-encodings
 =============== ===============================================================================================================================
 IANA Code Page  Aliases
@@ -127,7 +128,6 @@ Japanese,
 Portuguese,
 Swedish,
 Chinese,
-Catalan,
 Ukrainian,
 Norwegian,
 Finnish,
@@ -141,7 +141,6 @@ Romanian,
 Farsi,
 Arabic,
 Danish,
-Esperanto,
 Serbian,
 Lithuanian,
 Slovene,
@@ -149,19 +148,11 @@ Slovak,
 Malay,
 Hebrew,
 Bulgarian,
-Kazakh,
-Baque,
 Volapük,
 Croatian,
-Hindi,
 Estonian,
-Azeri,
-Galician,
-Simple English,
-Nynorsk,
 Thai,
 Greek,
 Macedonian,
-Serbocroatian,
 Tamil,
 Classical Chinese.
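Since the supported list tracks whatever the interpreter ships, one way to get a rough inventory of your own build's codecs is the stdlib alias table; a sketch (it undercounts codecs that have no alias entry, and the canonical list is the codecs doc linked above):

```python
from encodings.aliases import aliases

# Map of alias -> canonical codec name; dedupe the values for a rough inventory.
codecs_shipped = sorted(set(aliases.values()))
print(len(codecs_shipped), 'codecs, e.g.', codecs_shipped[:5])
```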
