What Is String Interning? The O(1) Comparison Behind the `is` Trap

June 18, 20268 min read
dsaalgorithmsinterview-prepdata-structures
What Is String Interning? The O(1) Comparison Behind the `is` Trap
TL;DR
  • String interning stores one canonical copy of each unique string value in a pool so identity comparison becomes O(1) instead of O(n).
  • CPython automatically interns identifier-like strings (letters, digits, underscores) but never runtime-generated strings from input, concatenation, or slicing.
  • The is trap: Python's is tests object identity, not value — use == for strings or face silent failures on any runtime-generated string.
  • Java's string pool interns literals automatically; new String("hello") bypasses the pool entirely, making == return false even for equal values.
  • sys.intern() in Python and String.intern() in Java force pool membership — useful when a small fixed vocabulary (status codes, field names) repeats across millions of records.
  • Memory cost is permanent: interning an unbounded stream of unique strings pins them forever, creating a silent memory leak for inputs like user IDs or URLs.
  • In interviews, default to == in Python and .equals() in Java — using is on Python strings or == on Java strings is a red flag interviewers catch.

You've probably seen this Python snippet, maybe in a code review, maybe in the wild, maybe in your own old code:

a = "hello" b = "hello" print(a is b) # True a = "hello world" b = "hello world" print(a is b) # False

Same operator. Same idea. Opposite results. If that feels inconsistent, it is. One space breaks the whole thing. The explanation is string interning, and it's also the source of one of the most reliable bugs in Python and Java interview code.

One Copy, One Object

String interning stores exactly one copy of each unique string value in a shared pool. Any variable that holds that value gets a reference to the same object in memory. Not a fresh allocation. Not a copy. The exact same object.

Two interned strings with the same value are literally the same object, so comparing them for equality becomes a single pointer comparison: O(1) regardless of string length.

Without interning, equality checking walks through each character: O(n) for strings of length n. For a 40-character key looked up a million times, that adds up fast.

The optimization is only safe because strings are immutable. If "hello" could be modified in place, sharing it between variables would corrupt both. Immutability is the prerequisite. Interning is what immutability enables.

Why CPython Decides Some Strings Are Special

CPython automatically interns strings that match the pattern of Python identifiers: letters, digits, and underscores, starting with a letter or underscore. These cover variable names, dictionary keys, attribute names, and most fixed strings your code handles.

a = "status" b = "status" a is b # True, automatically interned a = "error_code" b = "error_code" a is b # True, underscore is fine a = "error code" # space kills it b = "error code" a is b # False

That space in "error code" is why your two "hello world" strings returned False. The string doesn't look like an identifier, so CPython doesn't intern it. One space. The runtime just decides it's not worthy.

Runtime-generated strings are never automatically interned, no matter what they contain. This covers anything created through concatenation, slicing, string formatting, or reading from external input.

user_input = input("Status: ") # user types "status" a is "status" # False, always, even though the value matches

CPython also constant-folds some compile-time concatenations, so "hel" + "lo" might intern as "hello". Don't rely on that. It's an implementation detail that varies across Python versions and interpreters, and leaning on it in production code is how you spend a Tuesday debugging a mystery.

For explicit control, sys.intern() guarantees it:

import sys a = sys.intern("error code") b = sys.intern("error code") a is b # True, guaranteed

This is useful when you're building a dictionary from millions of rows where each row has one of a small set of category strings. Interning those category values once means the dictionary stores a handful of objects instead of millions of duplicates.

Java Interns Literals. Everything Else Is on Its Own.

Java has string interning built into the language specification. String literals go into the pool automatically, stored in the heap since Java 7.

String a = "hello"; String b = "hello"; System.out.println(a == b); // true, same pool object

The classic trap is new String():

String a = "hello"; String c = new String("hello"); System.out.println(a == c); // false

new String("hello") creates a fresh object on the heap, bypassing the pool entirely. You now have an object that is literally equal to "hello" in every way that should matter, and yet == disagrees. Java is not sorry.

The intern() method retrieves the pool copy:

String a = "hello"; String c = new String("hello").intern(); System.out.println(a == c); // true

Before Java 7, the string pool lived in the permanent generation, a fixed-size memory region separate from the heap. Aggressively interning large sets of strings could fill it and crash the JVM with OutOfMemoryError. Java 7 moved the pool to the regular heap, where interned strings can be garbage collected normally when nothing references them. Progress.

JavaScript and Go Don't Make It Your Problem

JavaScript engines like V8 handle interning internally. String literals and short strings are typically interned automatically, but there's no public API to control it. This doesn't matter for correctness because === performs value comparison on string primitives regardless of memory address.

Go interns compile-time string constants at the compiler level. For runtime interning, Go 1.23 introduced the unique package:

import "unique" a := unique.Make("status code") b := unique.Make("status code") a == b // true, same canonical handle

For most interview problems, JavaScript and Go are background knowledge. Python and Java are where this bites you explicitly.

The Bug That Passes Every Test You Write

The trap is using identity comparison when you mean value comparison.

In Python, is tests whether two names point to the same object. == tests whether two objects have the same value. For strings, you almost always want ==.

def process(status): if status is "active": # wrong ... if status == "active": # correct ...

The is version works when status was created from a string literal in the same file. It passes in tests, where you probably wrote process("active") directly. It fails silently when status came from a database read, an API response, a json.loads() call, or any runtime source.

Modern Python (3.8+) raises a SyntaxWarning when you use is with a literal: "is" with a literal. Did you mean "=="? That warning exists because this bug is everywhere. The language itself is trying to tell you.

In Java, == on strings tests reference identity, not value equality. .equals() tests value.

String input = scanner.nextLine(); if (input == "quit") { // wrong, always False for scanner input System.exit(0); } if (input.equals("quit")) { // correct System.exit(0); }

Scanner output is never in the string pool. The == check fails every time, silently. The program never quits. Your exit condition looks right, tests clean in isolation, and then does absolutely nothing in the real program.

https://assets.spacecomplexity.ai/blog/content-images/string-interning/1781771727715-workslocally.jpeg Your is check when the string comes from json.loads() instead of a string literal.

In an interview, default to == in Python and .equals() in Java. Using is in Python or == on Java strings is a red flag the interviewer will notice, and unlike most red flags, this one has no upside.

O(1) Is Real. The Memory Pool Is Forever.

The speed gain from interning is real for repeated comparisons. Pointer equality is one instruction. Character-by-character comparison is bounded by string length. For hash map lookups with string keys, hash computation still takes O(k) where k is the key length, but collision resolution comparisons become O(1) when keys are interned.

The trade-off: the pool holds a strong reference to every interned string for as long as the pool exists. In CPython, automatically interned strings last for the lifetime of the interpreter. Manually interning an unbounded stream of unique strings pins them in memory permanently.

Don't intern strings you can't bound. Interning a fixed vocabulary of status codes or field names is fine. Interning every user ID or URL you ever see is a memory leak in slow motion.

When You Actually Need to Care About This

Outside the is versus == trap, string interning shows up in three real scenarios.

High-volume repeated keys. A CSV parser reading 100 million rows where each row has one of 500 distinct category values. Interning those categories once means 500 string objects total instead of 100 million.

Protocol and schema parsing. JSON, protobuf, and similar formats have fixed field names that repeat across every record. Parsers commonly intern field names on first encounter so subsequent records reuse existing objects.

Comparison-heavy workloads. If your code compares the same string against a fixed set millions of times, interning the set converts O(n) comparisons to O(1).

For interview problems at standard scale, you rarely need to reach for sys.intern() explicitly. What you do need is to understand why the mechanism exists, what the is trap is, and why string comparison deserves a second of thought before you write it.

Talking through memory optimizations like this, out loud, under time pressure, is a skill in itself. SpaceComplexity runs voice-based mock interviews where you work through these concepts with a live AI interviewer and get rubric-based feedback on your reasoning, not just your code.

Further Reading