Pandas Pipelines Are Leaking Your Column Names to AI Providers—and One Rule Change Fixes It

Every other framework in PromptCape's AI privacy toolkit leaked through identifiers. Pandas leaks through strings—and that's forced a scoped, deliberate exception to the never-touch-strings rule that protected everything else.

The Rule That Breaks Here

Across Python and Django codebases, one principle held without exception: PromptCape never rewrites string literals. Strings are user-visible labels, error messages, template paths, and MIME types, and rewriting them is how you turn "Download CSV" into garbage on a button. Pandas breaks that assumption: column names like "churn_probability", "annual_salary", and "patient_diagnosis_code" live entirely inside string literals, and any real pipeline has dozens of them. Send that code to an AI provider unmodified and the business secret (the fact that this company scores customers on attrition risk) sits right there in the provider's logs.

Where Column Names Hide

Pandas accepts column names in a sprawling range of positions: df["name"] for single access, df[["a", "b"]] for lists, .loc and .at indexers, groupby() keys, merge(on=...) join conditions, sort_values(), pivot_table(), melt(id_vars=...), rename(columns={...}), named aggregation like df.agg(avg_pay=("salary", "mean")), query strings, and dtype specifications in read_csv. The list is long, and the subscript form is syntactically identical to dict subscripts, environment variable lookups, and JSON payload keys. Three of these positions are genuinely nasty. First, rename(columns={...}) carries column names in both the keys and the values of a dict literal; miss the values and you leak every renamed column. Second, named aggregation puts the output column name in a keyword-argument position while the source column sits in a string on the same line. Third, df.query("...") embeds column names inside pandas's expression mini-language, so the string has to be parsed rather than treated as opaque.
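To make two of the nasty positions concrete, here is a minimal sketch using Python's standard ast module. It is not PromptCape's detector: the function name column_literals and the sample source are made up for illustration, it only looks at rename(columns={...}) and keyword-style agg(...), and it matches on method name alone without checking that the receiver is a DataFrame.

```python
import ast

# Hypothetical sample input; it is only parsed, never executed.
SOURCE = '''
out = df.rename(columns={"annual_salary": "pay"})
summary = df.groupby("department").agg(avg_pay=("annual_salary", "mean"))
'''


def column_literals(source: str) -> set[str]:
    """Collect column names from rename(columns={...}) and named aggregation."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        method = node.func.attr
        for kw in node.keywords:
            if method == "rename" and kw.arg == "columns" and isinstance(kw.value, ast.Dict):
                # Both sides of the mapping are column names: old AND new.
                for part in kw.value.keys + kw.value.values:
                    if isinstance(part, ast.Constant) and isinstance(part.value, str):
                        found.add(part.value)
            elif method == "agg" and kw.arg is not None and isinstance(kw.value, ast.Tuple):
                # The keyword itself is the output column name...
                found.add(kw.arg)
                # ...and the first tuple element is the source column.
                first = kw.value.elts[0] if kw.value.elts else None
                if isinstance(first, ast.Constant) and isinstance(first.value, str):
                    found.add(first.value)
    return found


print(sorted(column_literals(SOURCE)))
```

Note what the sketch deliberately leaves alone: "mean" is an aggregation function name, not a column, and "department" is skipped only because groupby keys are outside this toy's scope. Requires Python 3.9 or later for the set[str] annotations.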

The Detection Problem: No Declaration Site

Every other detector had it easy: the names it needed to find were declared somewhere. Pydantic fields are in class bodies, Django models declare their fields, SQLAlchemy columns are Column(...) assignments. Pandas has no declaration site. A DataFrame's columns come from data (a CSV header, a SQL result set, a Parquet schema), none of which is in source code: df = pd.read_csv("s3://acme-hr/attrition_2026.csv") gives you 40 column names that appear nowhere in the source file. This splits column names into two populations: referenced columns, which appear as string literals and can be obfuscated via the AST, and dynamic-only columns, which are touched only through df.columns or for col in df.columns loops and are not statically reachable. PromptCape handles only the first population, and says so honestly, because a column never named in source can't leak through source.

The Type-Inference Trap

Here's the bug that defined the whole subsystem. The first version treated every string subscript as a column name: any x["..."] got rewritten. It worked on pandas code and corrupted everything else—os.environ["DATABASE_URL"], config["feature_flags"]["new_dashboard"], HTTP headers, all mangled into broken keys while the actual DataFrame columns were correctly obfuscated. The fix is DataFrame-variable inference: only rewrite subscript strings on variables the engine can prove are DataFrames by walking assignments from pd.read_csv(), pd.DataFrame(...), df.copy(), df[mask], df.groupby().agg(), and similar sources. Strings subscripted on anything unprovable stay untouched. PromptCape chooses silent under-obfuscation over silent corruption—a leaked column name is a privacy miss; a rewritten os.environ key is a broken app.
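The inference idea can be sketched with the standard ast module. This is a simplified illustration under stated assumptions, not the engine itself: it assumes pandas is imported as pd, it only follows top-level assignments, the two allow-lists are short stand-ins for a much longer catalogue, and it ignores reassignment (a name once proven to be a DataFrame stays one). The function names are hypothetical.

```python
import ast

# Assumptions for this sketch: pandas is imported as `pd`, and these short
# sets stand in for a longer catalogue of DataFrame-producing calls.
PD_CONSTRUCTORS = {"read_csv", "read_parquet", "DataFrame"}
FRAME_RETURNING_METHODS = {"copy", "dropna", "reset_index"}

# Hypothetical sample input; it is only parsed, never executed.
SAMPLE = '''
import os
import pandas as pd
df = pd.read_csv("data.csv")
recent = df.copy()
url = os.environ["DATABASE_URL"]
flag = config["feature_flags"]["new_dashboard"]
score = recent["churn_probability"]
pay = df["annual_salary"]
'''


def produces_dataframe(node: ast.expr, known: set[str]) -> bool:
    """True when `node` is a call this sketch can prove returns a DataFrame."""
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
        return False
    receiver = node.func.value
    if not isinstance(receiver, ast.Name):
        return False
    if receiver.id == "pd":
        return node.func.attr in PD_CONSTRUCTORS
    return receiver.id in known and node.func.attr in FRAME_RETURNING_METHODS


def dataframe_column_literals(source: str) -> set[str]:
    tree = ast.parse(source)
    known: set[str] = set()
    # Pass 1: walk top-level assignments in source order so that
    # `recent = df.copy()` is resolved after `df` is already known.
    for stmt in tree.body:
        if isinstance(stmt, ast.Assign) and produces_dataframe(stmt.value, known):
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    known.add(target.id)
    # Pass 2: accept string subscripts only on proven DataFrame names.
    columns: set[str] = set()
    for node in ast.walk(tree):
        if (isinstance(node, ast.Subscript)
                and isinstance(node.value, ast.Name)
                and node.value.id in known
                and isinstance(node.slice, ast.Constant)
                and isinstance(node.slice.value, str)):
            columns.add(node.slice.value)
    return columns


print(sorted(dataframe_column_literals(SAMPLE)))
```

The os.environ and config subscripts are never candidates because neither receiver was proven to be a DataFrame, which is the under-obfuscate-rather-than-corrupt trade described above. The node.slice check assumes Python 3.9 or later, where the slice is the expression itself rather than an ast.Index wrapper.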

Before and After

In an HR attrition pipeline, df["churn_probability"] becomes df["col_e2d4b7c9"] and groupby("department") becomes groupby("col_1f7b3d6a"). A column produced by named aggregation stays consistent too: it is created as a kwarg and referenced three lines later in sort_values, and both sites receive the same alias because the registry is keyed by real name. The AI sees a function that filters, divides, groups, and aggregates. It does not see attrition modelling or that employees have churn probability scores. Three things to notice: the aggregated column round-trips correctly across both its creation and its usage; "active" (a value, not a column) stays untouched; and pandas API names like read_csv, groupby, agg, and mean all survive unchanged.
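Why keying by real name gives consistency can be shown with a small sketch. Everything here is an assumption for illustration: the class name ColumnRegistry, the salt, and the choice to derive the alias from a truncated SHA-256 digest. The article only establishes that the registry is keyed by real name and that aliases look like col_ followed by eight hex characters.

```python
import hashlib


class ColumnRegistry:
    """Maps each real column name to one stable alias (hypothetical design)."""

    def __init__(self, salt: bytes) -> None:
        self._salt = salt
        self._aliases: dict[str, str] = {}

    def alias(self, real_name: str) -> str:
        # Keyed by real name: every site that mentions the same column,
        # whether a subscript, a kwarg, or a sort key, gets the same alias.
        if real_name not in self._aliases:
            digest = hashlib.sha256(self._salt + real_name.encode("utf-8")).hexdigest()
            # Assumption: eight hex characters. A real tool would need to
            # detect and resolve the rare collision at this length.
            self._aliases[real_name] = "col_" + digest[:8]
        return self._aliases[real_name]

    def reverse(self) -> dict[str, str]:
        """Alias -> real name, for restoring AI-edited code afterwards."""
        return {alias: real for real, alias in self._aliases.items()}


registry = ColumnRegistry(salt=b"per-project-secret")  # hypothetical salt
created = registry.alias("churn_probability")    # e.g. at the defining site
referenced = registry.alias("churn_probability")  # e.g. inside sort_values
print(created == referenced)
```

The reverse mapping stays on the developer's side, so suggestions that come back using col_... names can be translated to the real schema.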

The Data File Problem

Obfuscating df["churn_probability"] accomplishes nothing if the AI can open attrition_2026.csv and read: employee_id,department,annual_salary,churn_probability. Worse, there's a runtime conflict: the obfuscated code asks for col_e2d4b7c9 but the real CSV header calls it churn_probability, so the workspace won't even run. PromptCape resolves this in two ways. Bundled fixture CSVs get their header rows rewritten with the same column registry. Production data stays in the source project and never enters the AI-visible workspace; a thin pandas IO shim applies rename(columns=registry) immediately after each read, so the col_... names in the code line up with the freshly renamed frames. The real headers live only in the developer's source tree.
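The fixture half of that can be sketched with the standard csv module. This is an illustrative sketch, not PromptCape's implementation: the function name rewrite_header is hypothetical, the registry is modelled as a plain dict from real name to alias, and passing unmapped headers through unchanged is an assumption about policy that a real tool might well handle differently.

```python
import csv
import io
from typing import Mapping, TextIO


def rewrite_header(src: TextIO, dst: TextIO, registry: Mapping[str, str]) -> None:
    """Copy a CSV, replacing header names found in `registry` with aliases."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return  # empty file: nothing to rewrite
    # Assumption: headers missing from the registry are passed through as-is.
    writer.writerow([registry.get(name, name) for name in header])
    # Data rows are copied untouched; only the schema vocabulary changes.
    writer.writerows(reader)


# Hypothetical registry contents and fixture data.
registry = {"churn_probability": "col_e2d4b7c9", "department": "col_1f7b3d6a"}
fixture = io.StringIO("employee_id,department,churn_probability\n7,ops,0.8\n")
rewritten = io.StringIO()
rewrite_header(fixture, rewritten, registry)
print(rewritten.getvalue())
```

The production-data shim is the same mapping applied at read time: a wrapper around the read call that finishes with .rename(columns=registry), as described above, so the frame's columns already carry the aliases by the time the obfuscated code subscripts it.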

What This Does NOT Protect

Column names that appear only via df.columns or for col in df.columns loops aren't statically obfuscatable: they don't exist as literals to rewrite, and they leak only if the data file itself reaches the AI. Columns on DataFrames that arrive through un-inferrable paths (returned from third-party functions, stored in lists) also stay untouched by design; PromptCape won't corrupt non-pandas subscripts trying to find them. The threat boundary is honest: this removes schema vocabulary from what the AI sees, not the fact that you're doing data analysis or the actual numeric values.

Bottom Line

Pandas forced a fundamental inversion—the leak here IS the strings. DataFrame-variable inference is the whole ballgame; without it you either miss columns or corrupt dict keys and environment lookups. PromptCape ships with this pandas column detector, the DataFrame-inference sidecar, and data-file handling in the same release as the rest of the Python pipeline—free for three months at promptcape.com.
