FAQ about whitespace detection.
piotrczarnas committed Dec 2, 2024
1 parent e1f490c commit f0f1ec5
Showing 1 changed file with 102 additions and 1 deletion.
@@ -2,7 +2,7 @@
title: How to detect whitespace and null value placeholders
---
# How to detect whitespace and null value placeholders
Read this guide to learn how to detect data quality issues in text columns containing spaces, tabs, or special texts equivalent to a null value.
Read this guide to learn how to use SQL checks to detect whitespace, such as spaces and tabs, and special texts equivalent to a null value in text columns.

The data quality checks for detecting whitespace and empty value placeholders are configured in the `whitespace` category in DQOps.

@@ -233,6 +233,107 @@ The reference section provides YAML code samples that are ready to copy-paste to
the parameters reference, and samples of data source specific SQL queries generated by [data quality sensors](../dqo-concepts/definition-of-data-quality-sensors.md)
that are used by those checks.

## FAQ
Answers to common questions about detecting whitespace characters.

### What is a whitespace character
A whitespace character is any character that represents a blank space, such as a regular space, a tab, or a new line.
These characters are often invisible to the human eye but can affect how data is processed and interpreted by databases.
When users preview column values, values that end with whitespace look identical to trimmed values.
However, the stored values differ, so a query that filters on the trimmed value in a SQL `WHERE` clause can miss the rows that contain extra whitespace.
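
The mismatch can be reproduced with a minimal sketch. It assumes PostgreSQL, which compares `text` values character by character, and uses a hypothetical inline `customers` data set instead of a real table:

```sql
-- Hypothetical sample data: three values that look alike in a preview.
WITH customers(country) AS (
    VALUES ('US'), ('US '), (E'US\t')
)
SELECT country, length(country) AS value_length
FROM customers
WHERE country = 'US';  -- matches only the first value; the others end with a space or a tab
```

Databases that ignore trailing spaces in comparisons, such as SQL Server, would also treat the second value as a match, but not the value that ends with a tab.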

### What is the valid name, "whitespace" or "white space"
Both "whitespace" and "white space" are used, but "whitespace" is generally considered the more technically correct and contemporary term, especially in computing and data management contexts.
You'll find it used more frequently in technical documentation and programming.

### How to check if a column value has space in SQL
To check for spaces in a SQL column, use the `LIKE` operator (e.g., `column_name LIKE '% %'`) or remove surrounding spaces with `TRIM` and compare the result to the original column
(e.g., `WHERE TRIM(column_name) <> column_name`).
Remember that spaces are just one type of whitespace. For a thorough check, consider your database's specific functions or regular expressions.
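
As one possible regular expression check, the sketch below assumes PostgreSQL and its `~` match operator. The names `your_table` and `column_name` are placeholders:

```sql
-- Assumes PostgreSQL; your_table and column_name are placeholders.
SELECT column_name
FROM your_table
WHERE column_name ~ '^\s|\s$';  -- any whitespace character at the start or end of the value
```

Using the pattern `'\s'` on its own would instead find whitespace anywhere in the value, including single spaces between words.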

### How to check if a column value has space in SQL Server
In SQL Server, you can use the generic methods of detecting whitespace characters mentioned in the previous answer, with one caveat:
SQL Server ignores trailing spaces when it compares strings with `=` or `<>`, so the `TRIM` comparison finds only leading spaces.
SQL Server also offers specialized functions like `CHARINDEX` to locate the position of a space within a string, giving you more precise control.

The following query finds values that contain a space character using Transact-SQL in SQL Server:

```sql
SELECT column_name
FROM your_table
WHERE CHARINDEX(' ', column_name) > 0
```
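
Trailing spaces and tabs need a slightly different test in SQL Server, because `LEN` excludes trailing spaces and string comparisons ignore them. A possible sketch, assuming a `VARCHAR` or `NVARCHAR` column and the same placeholder names as above:

```sql
-- Assumes a VARCHAR or NVARCHAR column; your_table and column_name are placeholders.
SELECT column_name
FROM your_table
WHERE DATALENGTH(column_name) <> DATALENGTH(RTRIM(column_name))  -- trailing spaces
   OR CHARINDEX(CHAR(9), column_name) > 0                        -- a tab anywhere in the value
```

`DATALENGTH` returns the number of bytes stored, so it still counts the trailing spaces that `RTRIM` removes.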

### What is the difference between using IS NULL or finding a whitespace in SQL?
`IS NULL` checks if a column value is explicitly defined as *NULL*, meaning it has no value at all. This is a specific state recognized by the database.
On the other hand, finding a whitespace (using `LIKE`, `TRIM`, etc.) means looking for characters like spaces, tabs, or newlines that represent "blank" space.
These are actual characters stored in the database, even though they might appear invisible.

Here's why this difference matters:

* **Database Storage**: Some databases, notably **Oracle**, have a unique behavior. Oracle does not store zero-length strings in `VARCHAR2` columns.
Instead, it treats them as *NULL* values. This can lead to unexpected results: a filter such as `column_name = ''` never matches any row in Oracle, so you must use `IS NULL`.

* **Configuration**: SQL Server lets you configure how comparisons with *NULL* values are evaluated through the `SET ANSI_NULLS` option.
When the option is ON, which is the default for most client connections, a comparison such as `column_name = NULL` never matches any row, and you must use `IS NULL` instead.
Oracle has no equivalent setting and always treats empty strings as *NULL* values.

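* **Sorting and Indexing**: Empty strings and *NULL* values are treated differently in sorting operations and index construction.
For example, an empty string sorts before any non-empty text, while the position of *NULL* values depends on the database: PostgreSQL and Oracle place them last in ascending order by default, and SQL Server and MySQL place them first.
This can impact the performance and organization of your data.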
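
To see how many values of each kind a column holds, a query along these lines can help. It is a sketch that assumes PostgreSQL semantics, where `''` is distinct from *NULL* and trailing spaces matter in comparisons. The names `your_table` and `column_name` are placeholders:

```sql
-- Assumes PostgreSQL; your_table and column_name are placeholders.
SELECT value_kind, COUNT(*) AS row_count
FROM (
    SELECT
        CASE
            WHEN column_name IS NULL THEN 'null'
            WHEN column_name = '' THEN 'empty string'
            WHEN TRIM(column_name) = '' THEN 'spaces only'
            ELSE 'has text'
        END AS value_kind
    FROM your_table
) AS classified
GROUP BY value_kind;
```

In Oracle, the `'empty string'` branch would never be reached, because `''` is treated as *NULL* there.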

### How PostgreSQL handles NULL or empty value in GROUP BY?
PostgreSQL has a clear and consistent way of handling *NULL* and empty values in `GROUP BY`:

* *NULL* values are grouped together: If a column has multiple *NULL* values, they will be treated as a single group in the `GROUP BY` result.

* Empty strings are grouped together: Similar to *NULL*s, all empty strings (`''`) are considered the same and grouped into a single group.

* *NULL* and empty strings are distinct: Importantly, PostgreSQL distinguishes between *NULL* and empty strings. They are treated as separate groups in the `GROUP BY` clause.
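
The grouping rules above can be checked with a small query. This sketch uses a hypothetical inline `sample` data set rather than a real table:

```sql
-- Hypothetical sample data: one 'a', two empty strings, and three NULL values.
WITH sample(val) AS (
    VALUES ('a'), (''), (''), (NULL), (NULL), (NULL)
)
SELECT val, COUNT(*) AS row_count
FROM sample
GROUP BY val;
```

The result contains three groups: `'a'` with one row, the empty string with two rows, and *NULL* with three rows.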

### How to represent a tab character in SQL?
To represent a tab character in SQL, use the `CHAR(9)` or `CHR(9)` function, depending on the database. The ASCII code for a tab character is 9.
Some databases may allow you to directly insert a tab character using the Tab key or an escape sequence like `\t`, but building the character from its code is the most reliable method in databases
that support these functions.

Here are the functions to use in the most popular databases:

* **SQL Server**: `CHAR(9)`.
* **Oracle**: `CHR(9)`.
* **PostgreSQL**: `CHR(9)`. In PostgreSQL, `CHAR` is only the name of a data type.
* **MySQL**: `CHAR(9)`.
* **IBM DB2**: `CHR(9)`. The `CHAR()` function in DB2 converts a value to a fixed-length string instead.
* **SQLite**: `CHAR(9)`.
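
A typical use of these functions is finding values that contain a tab. The sketches below use the placeholder names `your_table` and `column_name`, and differ only in the function name and the string concatenation operator:

```sql
-- SQL Server: CHAR(9) with the + concatenation operator
SELECT column_name
FROM your_table
WHERE column_name LIKE '%' + CHAR(9) + '%';

-- PostgreSQL and Oracle: CHR(9) with the || concatenation operator
SELECT column_name
FROM your_table
WHERE column_name LIKE '%' || CHR(9) || '%';
```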

### How to remove whitespaces around a text in SQL?
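To remove whitespace around text in SQL, you can use the `TRIM()` function. By default, it removes only leading and trailing spaces in most databases, including PostgreSQL, MySQL, Oracle, and SQL Server. Tabs and newlines are left in place unless you specify the characters to trim, and the syntax for that differs between databases.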

Here's a simple example:

```sql
SELECT TRIM(' This has extra spaces. ')
```

This will return:

```text
This has extra spaces.
```
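
In PostgreSQL, a plain `TRIM()` call strips only spaces, so tabs and line breaks have to be listed explicitly. A minimal sketch, assuming PostgreSQL and its `E''` escape string syntax:

```sql
-- Assumes PostgreSQL: trims spaces, tabs, line feeds, and carriage returns from both ends.
SELECT TRIM(BOTH E' \t\n\r' FROM E'\t This has extra spaces. \n') AS trimmed;
```

This returns `This has extra spaces.` because every leading and trailing character in the listed set is removed.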

### How to remove all spaces inside a text using SQL?
While the `TRIM()` function removes spaces around text, removing all spaces within the text requires a different approach. Most databases offer a function like `REPLACE()`.
This function replaces all occurrences of a specified substring with another string. In this case, you'd replace all spaces (`' '`) with an empty string (`''`).

Here's how it works:

```sql
SELECT REPLACE(' This has extra spaces. ', ' ', '')
```

This will return:

```text
Thishasextraspaces.
```
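
`REPLACE()` with `' '` leaves tabs and line breaks untouched. To remove every whitespace character in one pass, a regular expression function is one option. The sketch below assumes PostgreSQL and its `regexp_replace` function with the `'g'` (replace all matches) flag:

```sql
-- Assumes PostgreSQL: removes spaces, tabs, and line breaks anywhere in the text.
SELECT regexp_replace(E' This\thas extra\nspaces. ', '\s+', '', 'g') AS compacted;
```

This returns `Thishasextraspaces.` as well. In databases without regular expression support, nesting several `REPLACE()` calls, one per whitespace character, achieves a similar result.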

## What's next
- Learn how to [run data quality checks](../dqo-concepts/running-data-quality-checks.md#targeting-a-category-of-checks) filtering by a check category name
- Learn how to [configure data quality checks](../dqo-concepts/configuring-data-quality-checks-and-rules.md) and apply alerting rules