
WPSEO_Utils::sanitize_url fails on path with unencoded non-latin characters #22903

@pls78

Description

Context - Why was this issue created?

The integration tests for the WPSEO_Utils::sanitize_url() method (tests/WP/Inc/Utils_Test.php) fail on the following case:

'with_non_encoded_non_latin_url' => [
	'expected'        => 'https://example.com/%da%af%d8%b1%d9%88%d9%87-%d8%aa%d9%84%da%af%d8%b1%d8%a7%d9%85-%d8%b3%d8%a6%d9%88',
	'url_to_sanitize' => 'https://example.com/گروه-تلگرام-سئو',
]

The problem appears to be the wp_parse_url() call, which returns a corrupted string for the URL's path (گر��-ت�گرا�-سئ�_ instead of گروه-تلگرام-سئو).
Since this test case was written in March 2020, the implementation of wp_parse_url() may have changed since then.

What is the goal of this issue?

  • Restore the original WPSEO_Utils::sanitize_url() behaviour for URLs that have non-encoded non-Latin characters in their path.

What needs to be done to achieve the goal?

  • Investigate the current behaviour of wp_parse_url()
  • Change WPSEO_Utils::sanitize_url() accordingly
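
One way to start the investigation is to compare the parsed path with the original path byte by byte. The sketch below is only a diagnostic aid: diff_parsed_path() is a hypothetical helper, and plain PHP parse_url() stands in for wp_parse_url() so the snippet runs without WordPress. Results may differ between PHP versions and platforms, so no output is shown.

```php
<?php
/**
 * Hypothetical diagnostic helper (not part of Yoast SEO or WordPress).
 * Returns the byte offsets at which parse_url() altered the path, each
 * mapped to [ original byte, returned byte ] in hexadecimal.
 */
function diff_parsed_path( string $url, string $expected_path ): array {
	// parse_url() may return null or false for the path; cast to string.
	$actual = (string) parse_url( $url, PHP_URL_PATH );
	$diff   = [];
	$length = max( strlen( $expected_path ), strlen( $actual ) );

	for ( $i = 0; $i < $length; $i++ ) {
		$before = $expected_path[ $i ] ?? '';
		$after  = $actual[ $i ] ?? '';
		if ( $before !== $after ) {
			$diff[ $i ] = [ bin2hex( $before ), bin2hex( $after ) ];
		}
	}

	return $diff;
}

$diff = diff_parsed_path(
	'https://example.com/گروه-تلگرام-سئو',
	'/گروه-تلگرام-سئو'
);

// An empty array means parse_url() returned the path byte for byte.
var_dump( $diff );
```

If the array is not empty, the hex pairs show which bytes of the multibyte characters were replaced, which should help narrow down where the corruption happens.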

Does the issue still need UX or research?

No

If available, what are the tips for fixing the problem or possible solutions?

  • If wp_parse_url() needs its input to be encoded, change WPSEO_Utils::sanitize_url() accordingly. I would limit the change to the specific case covered by the test, i.e., when non-encoded non-Latin characters are present in the URL's path.
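
A minimal sketch of that idea, assuming that percent-encoding the non-ASCII bytes before parsing is enough to avoid the corruption. The helper encode_non_ascii_in_url() is hypothetical and is not the actual Yoast SEO implementation. Plain PHP parse_url() again stands in for wp_parse_url(), and the lowercase hex digits are chosen only to match the expected value in the test case above.

```php
<?php
/**
 * Hypothetical helper (not part of Yoast SEO or WordPress).
 * Percent-encodes every run of non-ASCII bytes so that the URL is pure
 * ASCII before it is handed to the parser. ASCII characters, including
 * any existing percent-encoded sequences, are left untouched.
 */
function encode_non_ascii_in_url( string $url ): string {
	$encoded = preg_replace_callback(
		// No /u flag: the class matches raw bytes in the range 0x80-0xFF.
		'/[^\x00-\x7F]+/',
		static function ( array $matches ): string {
			// rawurlencode() emits uppercase hex; the test expects lowercase.
			return strtolower( rawurlencode( $matches[0] ) );
		},
		$url
	);

	// preg_replace_callback() returns null on failure; fall back to the input.
	return $encoded ?? $url;
}

$url     = 'https://example.com/گروه-تلگرام-سئو';
$encoded = encode_non_ascii_in_url( $url );

// The path now contains only ASCII bytes before it reaches the parser.
$path = parse_url( $encoded, PHP_URL_PATH );
```

Because only bytes outside the ASCII range are rewritten, URLs that are already encoded or purely Latin pass through unchanged, which keeps the change limited to the case covered by the failing test.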

What is the expected result/behavior?

  • WPSEO_Utils::sanitize_url() should return a correctly encoded URL, as expected in the integration test.

Should documentation be added or updated for this change? And if so, where?

No
