There is a draft PR for the lineinfile module that introduces a new module option, encoding, to add compatibility for target files that are not encoded in UTF-8.
Currently, the lineinfile module code does a binary read of the target file and puts the contents into a buffer as bytes. This buffer is assumed to contain UTF-8 encoded bytes, and regex matching and write operations are performed on it. If the target file is not UTF-8 encoded, regex matching does not work correctly because a UTF-8 encoded regex pattern is compared against non-UTF-8 bytes. Write operations are also affected: they add UTF-8 bytes to a buffer that does not contain UTF-8 encoded bytes, so when the buffer is written back, the resulting file contains characters in multiple encodings.
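The two failure modes described above can be reproduced in isolation. The following is a minimal sketch, not the module's actual code: the sample content, the choice of Latin-1 as the file encoding, and all variable names are assumptions made for illustration.

```python
import re

# Hypothetical file content, stored on disk as Latin-1 rather than UTF-8.
original_text = "name = café\nmode = 0644\n"
b_lines = original_text.encode("latin-1").splitlines(True)

# The regex pattern is encoded as UTF-8, as the bytes-based code path assumes.
b_pattern = re.compile("^name = café$".encode("utf-8"))

# Failure 1: "é" is b"\xe9" in the file but b"\xc3\xa9" in the pattern,
# so the line that should match does not.
matches = [bool(b_pattern.search(b_line.rstrip(b"\r\n"))) for b_line in b_lines]

# Failure 2: appending a UTF-8 encoded line to the Latin-1 buffer
# produces a byte string that mixes two encodings.
b_lines.append("owner = José\n".encode("utf-8"))
b_result = b"".join(b_lines)


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The buffer is no longer valid UTF-8, and reading it back as Latin-1
# turns the appended "é" into two characters.
mixed_is_utf8 = is_valid_utf8(b_result)
read_back_as_latin1 = b_result.decode("latin-1")
```

In this sketch no line matches the pattern, and the resulting buffer is neither valid UTF-8 nor clean Latin-1.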
The proposed change introduces a new module option, encoding. When it is specified, the file contents are read into a Unicode text buffer instead of bytes, so that regex matching is done in Unicode and write operations add Unicode characters to the buffer instead of UTF-8 bytes. Since Python 3 strings represent text as Unicode code points, all the Unicode operations are ordinary Python string operations. File reads and writes are done in text mode so that the optional encoding parameter can be specified when opening the file (docs for Python open function).
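As a rough illustration of the text-mode approach, the sketch below reads and writes a file with an explicit encoding and does all matching on Python strings. The helper name ensure_line, its simplified replace-or-append behavior, and the sample data are hypothetical and are not taken from the PR; only the built-in open() parameters encoding and newline are real API.

```python
import os
import re
import tempfile


def ensure_line(path, regexp, line, encoding="utf-8"):
    """Replace the last line matching regexp, or append line if none matches.

    Hypothetical helper for illustration only; the real module has more options.
    """
    pattern = re.compile(regexp)

    # Text mode with an explicit encoding; newline="" preserves line endings.
    with open(path, "r", encoding=encoding, newline="") as f:
        lines = f.readlines()

    match_index = None
    for i, current in enumerate(lines):
        if pattern.search(current.rstrip("\r\n")):
            match_index = i

    if match_index is not None:
        lines[match_index] = line + "\n"
    else:
        # Make sure the previous last line is terminated before appending.
        if lines and not lines[-1].endswith(("\n", "\r")):
            lines[-1] += "\n"
        lines.append(line + "\n")

    # Write back in the same encoding, so the file stays consistent.
    with open(path, "w", encoding=encoding, newline="") as f:
        f.writelines(lines)


# Usage on a temporary Latin-1 encoded file.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
try:
    with open(demo_path, "wb") as f:
        f.write("name = café\nmode = 0644\n".encode("latin-1"))

    ensure_line(demo_path, r"^name = ", "name = José", encoding="latin-1")
    ensure_line(demo_path, r"^owner = ", "owner = André", encoding="latin-1")

    with open(demo_path, "rb") as f:
        result_bytes = f.read()
finally:
    os.remove(demo_path)
```

Because both the read and the write go through the same encoding, the bytes on disk remain entirely Latin-1 after a replacement and an append.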
An alternative approach was explored in which the initial file read was done in text mode and the contents converted to UTF-8 bytes, so the remainder of the code could stay unchanged until the write operation at the very end, which also required a conversion. The Unicode approach seems cleaner overall, even though it requires more changes. See the code diff for the other approach here.
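For comparison, the alternative approach might look roughly like the following sketch, which converts at the file boundaries and leaves the bytes-based logic in the middle untouched. The function names read_as_utf8_bytes and write_from_utf8_bytes and the sample data are hypothetical and are not taken from the linked diff.

```python
import os
import re
import tempfile


def read_as_utf8_bytes(path, encoding):
    """Decode the file from its own encoding, then re-encode as UTF-8 bytes."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read().encode("utf-8")


def write_from_utf8_bytes(path, b_data, encoding):
    """Decode the UTF-8 buffer and write it back in the file's encoding."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(b_data.decode("utf-8"))


fd, alt_path = tempfile.mkstemp()
os.close(fd)
try:
    with open(alt_path, "wb") as f:
        f.write("name = café\n".encode("latin-1"))

    # The buffer really is UTF-8 now, so bytes-based matching works.
    b_lines = read_as_utf8_bytes(alt_path, "latin-1").splitlines(True)
    b_pattern = re.compile("^name = café$".encode("utf-8"))
    alt_matched = any(b_pattern.search(b.rstrip(b"\r\n")) for b in b_lines)

    # Appending UTF-8 bytes is consistent with the rest of the buffer.
    b_lines.append("owner = José\n".encode("utf-8"))
    write_from_utf8_bytes(alt_path, b"".join(b_lines), "latin-1")

    with open(alt_path, "rb") as f:
        alt_result = f.read()
finally:
    os.remove(alt_path)
```

The trade-off is visible here: fewer lines of the existing logic change, but every read and write pays for two conversions, and the buffer's encoding no longer matches the file's.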
See the code in the PR here.