Portmint Lighthouse

Cleaning Messy Data

Here's a truth every data person learns early: your answer is only as good as the data underneath it. Feed a sheet full of typos and blanks into a chart, and you get a confident-looking picture that's quietly wrong. Cleaning data means tidying it up so it's trustworthy before you do anything else with it.

Think of cooking. Before a good cook starts, they rinse the vegetables, pick out the bad ones, and lay everything out. The actual cooking is fast once the prep is done. Cleaning data is that prep — unglamorous, but the meal depends on it.

The usual messes

Raw data, especially when many hands have touched it, tends to arrive with the same handful of problems. Learn to spot these and you've won most of the battle.

Typos and stray spaces. "New York," "new york," and "New York " (with a hidden space at the end) look the same to you but count as three different things to a spreadsheet. When you later try to group or count them, they'll refuse to line up.

Blanks. Empty cells where a value should be. Sometimes a blank truly means zero; sometimes it means "nobody filled this in." Those are different, and pretending a missing value is a zero can skew every total.

Duplicates. The same row entered twice — one sale counted as two. Duplicates inflate your counts and totals without you noticing.

Mismatched formats. Dates written as "03/04/25" in one row and "March 4, 2025" in another. Prices written as "$40," "40," and "40.00." A spreadsheet treats a number stored as text differently from a real number, and that difference will break your math later.

How to fix them, calmly

You don't need fancy tools. A patient eye and a few built-in features handle most of it.

Standardize text. Pick one spelling for each thing and make every cell match — "New York," every time. Many spreadsheets have a Find and Replace feature: tell it to find every "new york" and replace it with "New York" in one sweep. There's also usually a TRIM function that strips those invisible trailing spaces.

Decide what blanks mean. For each empty cell, ask: is this a real zero, or is it missing? Fill in the true zeros. Leave genuinely-missing values blank (or mark them clearly), so you don't fool yourself later.

Remove duplicates. Most spreadsheets have a "Remove Duplicates" command. Before you run it, make a copy of the sheet — undo is your friend, but a backup is your safety net.

Make formats consistent. Choose one date format and one number format, and convert everything to match. If a column of prices is secretly stored as text, the spreadsheet often shows a little warning so you can convert it to real numbers.

Why this matters

Every other lesson in this course — sorting, filtering, summarizing, charting — quietly assumes the data is clean. Skip this step and the errors don't disappear; they hide, then surface in a final chart you'll present with confidence. Cleaning first is how you make sure the insight at the end is real.

Your turn

Take any spreadsheet and play data detective for ten minutes:

  1. Pick one text column and scan for the same thing spelled different ways. Standardize it to one spelling.
  2. Find a blank cell. Decide, honestly: real zero, or missing? Handle it accordingly.
  3. Look for two identical rows. If you find a true duplicate, remove one (after making a backup copy first).

You'll be surprised how much steadier the sheet feels afterward. Next, we'll put that clean data in order with sorting.

Stuck or curious?

Ask Pip about this lesson — tap the porthole bottom-right.