
After you convert a PDF bank statement into a CSV or Excel file, the data is structured but rarely perfect. There's stray whitespace inside descriptions. There are invisible characters hiding inside cells that break string comparisons. The same vendor shows up under three different spellings. None of this is wrong exactly. It just makes downstream work harder than it needs to be.
Three Excel functions handle most of it. They've been in Excel for decades. They're not exciting. But used together, in a small ritual after every conversion, they save more time over the course of a year than any pivot table you'll ever write.
TRIM: remove stray whitespace
TRIM removes leading and trailing whitespace from a string, and collapses any internal whitespace runs to single spaces. So " AMAZON MARKETPLACE " becomes "AMAZON MARKETPLACE". This matters because whitespace is invisible to you and significant to software. A categorization rule that matches "Walmart" won't match "Walmart " with a trailing space, and you won't be able to see why.
Usage: in a column next to your description column, write =TRIM(B2), drag down, then copy and paste-as-values back over the originals.
CLEAN: strip invisible characters
CLEAN removes non-printable characters. PDFs use various text encodings, and some of them embed tabs, line feeds, carriage returns, and other invisible junk in cells. You won't see them. You'll notice them when a VLOOKUP that should work doesn't, and you waste twenty minutes figuring out why. =CLEAN(B2) strips them all.
In practice, TRIM and CLEAN almost always run together. The combined formula is =TRIM(CLEAN(B2)). Many bookkeepers wrap every text column in this combination as a matter of habit, even when they don't think the data needs it. The cost is trivial and the cost of skipping it is occasionally large.
SUBSTITUTE: normalize vendor names
SUBSTITUTE replaces specific text with other text. This is where the cleanup work shifts from defensive to active. The classic use case is vendor name normalization. Amazon might appear in one month as "AMZN MKTP US," "AMAZON.COM*MKTPL," and "AMAZON MARKETPLACE", three different strings for one vendor. Categorization rules that work on exact match miss two of them. Reports that group by vendor show three rows instead of one.
=SUBSTITUTE(B2, "AMZN MKTP US", "Amazon Marketplace") collapses one variant. You can nest substitutions, though after two or three nestings it gets ugly. A cleaner pattern is to keep a small "vendor normalization" sheet in your workbook with two columns, raw name and canonical name, and use VLOOKUP against it: =IFERROR(VLOOKUP(B2, NormTable!A:B, 2, FALSE), B2). New variants get added to the table as you spot them, and the normalization improves over time.
The five-minute routine
A typical post-conversion cleanup runs about five minutes per statement once you've practiced it. Open the file. Apply =TRIM(CLEAN(...)) to your description column. Run the lookup against your normalization table. Paste-as-values. Done.
The reason this matters more than it looks is that every downstream step gets faster when the data is clean. Categorization rules fire correctly. Lookups join. Reports group properly. The pre-reconciliation checklist you run next goes faster too. You spend your professional time on the unusual transactions, not on fighting with whitespace.
If you're using Ledgertome to convert statements, build this cleanup into your standard post-conversion routine. After a month of using it consistently, you'll wonder how you ever worked from raw converted data.
Visit Ledger Tome