backslashes in regex_replace filter

Hi - Does anyone (who understands how backslashes work in Ansible/YAML) know why both of the following tasks work:

(ansible2_15_8) rowagn@localhost:~#> cat d.yml

- hosts: all
  gather_facts: no
  vars:
    s: 'This is a string containing 1 and 2.'
    t:
      - p1_xyz
      - p2_xyz
      - p4_xyz

  tasks:

    - name: single backslash
      debug:
        msg: '{{ item }} is in s'
      loop: '{{ t }}'
      when: ( item | regex_replace('^p(\d+).*$', '\\1') ) in s

    - name: double backslash
      debug:
        msg: '{{ item }} is in s'
      loop: '{{ t }}'
      when: ( item | regex_replace('^p(\\d+).*$', '\\1') ) in s

(ansible2_15_8) rowagn@localhost:~#> ansible-playbook -i l d.yml

PLAY [all] ******************************************************************************************************************************************************

TASK [single backslash] *****************************************************************************************************************************************
ok: [localhost] => (item=p1_xyz) => {
    "msg": "p1_xyz is in s"
}
ok: [localhost] => (item=p2_xyz) => {
    "msg": "p2_xyz is in s"
}
skipping: [localhost] => (item=p4_xyz)

TASK [double backslash] *****************************************************************************************************************************************
ok: [localhost] => (item=p1_xyz) => {
    "msg": "p1_xyz is in s"
}
ok: [localhost] => (item=p2_xyz) => {
    "msg": "p2_xyz is in s"
}
skipping: [localhost] => (item=p4_xyz)

PLAY RECAP ******************************************************************************************************************************************************
localhost : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

The tasks are extracting the number from the strings in list t and then looking for that number in string s. What is strange is the second example at https://docs.ansible.com/ansible/latest/collections/ansible/builtin/regex_replace_filter.html#examples indicates the backslashes in both parameters need to be doubled, but the above testing shows double backslashes are not required in the first parameter (they are required in the second parameter).

Thanks
Rob

regex_replace('^p(\d+).*$', '\\1')

‘\1’ in the second argument is a “backref” (backwards reference) to the (\d+) in the first argument. It seems it is looking for an expression with digits and extracting the digits.

Your list ‘t’ contains the names p1_xyz, p2_xyz, and p4_xyz, so this regex would extract the digits 1, 2, and 4 from those strings.

Your string ‘s’ has digits 1 and 2. You are getting two lines of output as expected.
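For what it's worth, the same extraction can be reproduced in plain Python. This is only a sketch: it assumes regex_replace behaves like Python's re.sub, and the helper name extract_number is made up for illustration. Raw strings are used so that no backslash needs doubling at all:

```python
import re


def extract_number(name):
    # Hypothetical helper. Raw strings hand backslash-d and backslash-1
    # to the regex engine unchanged.
    return re.sub(r'^p(\d+).*$', r'\1', name)


s = 'This is a string containing 1 and 2.'
t = ['p1_xyz', 'p2_xyz', 'p4_xyz']

# Mirrors the `when:` condition: keep items whose extracted number appears in s.
matches = [item for item in t if extract_number(item) in s]
print(matches)  # ['p1_xyz', 'p2_xyz']
```

p4_xyz is dropped because the string s does not contain a 4, which matches the skipped item in the playbook output.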

Walter

This is a result of some normalization code in jinja2 that attempts to unescape strings:

https://github.com/pallets/jinja/blob/d594969d722ceb4e8f3da8861befc9c0ac87ae1b/src/jinja2/lexer.py#L647-L653

That code results in those becoming ‘^p(\d+).*$’ and ‘\1’.

Those 2 when statements, when processed by pyyaml, become the following (shown as Python string literals, so every literal backslash is written doubled):

["( item | regex_replace('^p(\\d+).*$', '\\\\1') ) in s",
 "( item | regex_replace('^p(\\\\d+).*$', '\\\\1') ) in s"]

Then if we apply the .encode/.decode:

"( item | regex_replace('^p(\\d+).*$', '\\\\1') ) in s".encode("ascii", "backslashreplace").decode("unicode-escape")
"( item | regex_replace('^p(\\d+).*$', '\\1') ) in s"

"( item | regex_replace('^p(\\\\d+).*$', '\\\\1') ) in s".encode("ascii", "backslashreplace").decode("unicode-escape")
"( item | regex_replace('^p(\\d+).*$', '\\1') ) in s"
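To make the effect of that round trip easier to see, here is a small standalone sketch that applies the same .encode/.decode step to just the regex arguments. The function name unescape is my own, and this only imitates the step quoted above; it is not jinja2's actual code path:

```python
import warnings


def unescape(literal_body):
    # Same encode/decode round trip as quoted above. Unknown escapes such as
    # backslash-d pass through unchanged (newer Pythons warn about them,
    # so the warning is silenced here).
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return literal_body.encode("ascii", "backslashreplace").decode("unicode-escape")


single = unescape(r"^p(\d+).*$")    # pattern as written in the first task
double = unescape(r"^p(\\d+).*$")   # pattern as written in the second task
backref = unescape(r"\\1")          # replacement as written in both tasks
bare = unescape(r"\1")              # what an undoubled backref would become

print(single == double)  # True: both spellings collapse to the same regex
print(repr(backref))     # '\\1'  (a backslash followed by the digit 1)
print(repr(bare))        # '\x01' (a single control character)
```

So the doubled and undoubled \d end up identical after unescaping, while an undoubled \1 would be consumed as an octal escape before the regex engine ever saw it.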

Thanks Matt, but I still don’t get why the first parameter (\d) MAY be double backslashed but the second parameter (\1) MUST be double backslashed. However, I’m starting to think it’s at the Python level. https://stackoverflow.com/a/33582215 says Python’s string parser causes both \d and \\d to become \d. But why? A little more searching takes me to https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences, where I think I see why \\1 becomes \1 and \1 becomes a non-printable character (octal 1). But then, by analogy, \\d should become \d (it does), so why doesn’t \d become an error, since it’s not listed as a valid escape sequence?

Maybe I’ll take this over to the Python list.

The \1 must be double-backslashed because the backref needs to be backslash-digit (\1). Doubling the backslash escapes the backslash.

Walter

Right, but why doesn’t the \d need to be double-backslashed? Backslash-d is regex for matching on a digit. I just don’t get why doubling the backslash is needed on the 1 but not on the d.

Perhaps because you have single quotes inside double quotes so everything inside the single quotes is automatically escaped?

Walter

But the \1 is also inside single and double quotes, so if that were the reason, I wouldn’t have to double backslash the 1

Part of the problem is also knowing which character sequences are escape sequences in Python.

\1 is an escape sequence, equivalent to \x01, and not equivalent to the literal \1. As such, a literal \1 needs to be represented in Python as \\1. \d is not an escape sequence and thus can be written as a literal \d without escaping the backslash.
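A quick way to check this is to let Python parse the three literals and look at what comes out. This is a minimal sketch; parse_literal is a made-up helper, and the warning filter is there because newer Python versions warn about unrecognized escapes like \d (and may reject them in a future release):

```python
import ast
import warnings


def parse_literal(source):
    # Evaluate a Python string literal given as source text, silencing the
    # invalid-escape warning that newer Pythons emit for sequences like \d.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


octal = parse_literal(r"'\1'")      # recognized octal escape
escaped = parse_literal(r"'\\1'")   # escaped backslash followed by 1
digit = parse_literal(r"'\d'")      # not a recognized escape

print(octal == "\x01", len(octal))  # True 1
print(list(escaped))                # ['\\', '1']
print(list(digit))                  # ['\\', 'd']
```

The literal written with \1 yields one character, while \\1 and \d each yield two characters starting with a real backslash.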

There is also a difference with quoting in YAML as mentioned above, between single quotes and double quotes. But note that the behavior of YAML with quotes only applies to quotes that surround the entire YAML value. So the single quotes you have in the middle of your string do not affect the YAML quoting differences. When not using quotes surrounding the full value in YAML, you are using “Plain Style” which has different rules than both single and double quoted values.

YAML single quotes are basically equivalent to Python raw strings, where a backslash is always treated as literal. Double quotes require escaping backslashes. You can read more about the flow scalar styles of YAML at https://yaml.org/spec/1.2.2/#73-flow-scalar-styles

Thanks everyone. I’m going to chalk this up to a Python anomaly. IMO, since \d is not a valid escape sequence, Python should raise an error rather than transparently converting it into \\d.