HashSet vs TreeSet: One Question Decides Which You Need

You need a collection of unique elements. You reach for a set. But which one? The HashSet vs TreeSet decision trips up more engineers than it should. HashSet gives you O(1). TreeSet gives you O(log n). Case closed. Pick the faster one.

Not so fast. The deciding question isn't speed. It's whether you'll ever need to ask "what's the nearest element to X?" If you will, TreeSet wins despite the slower asymptotic complexity. If you won't, HashSet wins and it's not close. The memory, the null handling, the iteration order? All secondary. They follow from that one fork in the road.

Data Structures Using Humans comic by System32Comics showing a queue, tree, and linked list made of people Somewhere in here is a HashSet, but nobody can find it because there's no ordering. (Credit: System32Comics)

What's Actually Inside a HashSet?

A HashSet is a hash table wearing a trench coat. In Java, it's literally a HashMap with every value set to a static dummy object called PRESENT. When you call add(x), it calls map.put(x, PRESENT) internally. The entire HashSet class in OpenJDK is shockingly thin. It's the software equivalent of three kids stacked up pretending to be an adult.

Diagram showing a key flowing through hashCode() then modulo capacity to land in a bucket array, with a collision chain on bucket 3 Your element becomes a number, that number becomes a slot, and collisions form a chain. The whole trick in one picture.

Your element's hashCode() gets mixed and mapped to a bucket index. The element lands in that bucket. If two elements hash to the same bucket (a collision), they form a short linked list. In Java 8+, if a single bucket accumulates 8 or more entries (TREEIFY_THRESHOLD) and the table has at least 64 buckets (MIN_TREEIFY_CAPACITY), that linked list converts into a red-black tree within the bucket itself. This caps worst-case lookup at O(log n) even under adversarial hash collisions. Java didn't trust you to write good hash functions. Fair.

The load factor controls when the table doubles. Java's default is 0.75: once 75% of buckets are occupied, the entire table resizes and every element rehashes to its new position. This rehash is O(n), but it happens rarely enough that the amortized cost per insertion stays O(1). Python's set is more aggressive, keeping load below 60% and quadrupling the table when it's small. Python's set has commitment issues.

Each HashMap.Node in Java costs about 32 bytes: 12 bytes for the object header, 4 bytes each for the hash, key reference, value reference, and next pointer, plus 4 bytes of alignment padding. The bucket array itself adds roughly 4 bytes per slot. With a 0.75 load factor, you're paying for about 1.33 slots per element. Total overhead: around 37 bytes per element before counting the element object itself.

What's Actually Inside a TreeSet?

A TreeSet is a self-balancing binary search tree. In Java, it's backed by a TreeMap, which implements a red-black tree. Every node holds your element as the key, maintains pointers to its left child, right child, and parent, and stores a single boolean for the node's color (red or black). It's carrying a lot of baggage, but it earns its keep.

Red-black tree diagram showing 7 nodes with colored indicators, invariant rules, and worst-case O(log n) guarantees A red-black tree: every path from root to leaf has the same number of black nodes. That one rule keeps the whole thing balanced.

The red-black invariants guarantee that the longest path from root to leaf is at most twice the shortest. That constraint keeps the tree height below 2·log₂(n+1). So every operation (insert, delete, search) runs in O(log n) worst case. Not amortized. Not on average. Guaranteed. TreeSet doesn't gamble.

Each TreeMap.Entry in Java costs about 40 bytes: 12 bytes for the header, 20 bytes for five references (key, value, left, right, parent), 1 byte for the color flag, and 7 bytes of alignment padding. Surprisingly close to HashMap's 32 bytes per node.

Daniel Lemire's benchmarks confirm this. For Integer keys, HashMap used 74.5 bytes per entry and TreeMap used 72.0 bytes per entry. The memory difference between HashSet and TreeSet is negligible. So the next time someone tells you "TreeSet uses way more memory," you can politely show them the numbers. The real gap is in what operations each structure makes fast.

The Operation That Changes Everything

HashSet answers exactly one question efficiently: "Is X in the set?" Membership testing. O(1). Done. Go home.

TreeSet answers that question too, in O(log n). But it also answers a whole family of questions that HashSet cannot answer at all without scanning every element:

floor(x): the largest element ≤ x
ceiling(x): the smallest element ≥ x
higher(x) / lower(x): strict neighbors
subSet(a, b): every element between a and b
first() / last(): the minimum and maximum

These aren't convenience methods. They're the reason TreeSet exists. Every one of them runs in O(log n) because the tree structure makes "find the nearest thing" a matter of walking a few nodes down a path.

Diagram tracing a ceiling(35) query through a binary search tree, visiting 3 nodes to find the answer 40 ceiling(35) walks exactly three nodes: 30 (too small, go right), 50 (candidate, go left), 40 (better candidate, go left, null, done). Three hops. Not seven.

Side-by-side comparison of HashSet and TreeSet operations, showing HashSet fails at floor, ceiling, range, and min/max queries The green checkmarks tell the story. HashSet is fast at what it does. TreeSet does more things.

In a HashSet, asking "what's the closest element to 35?" requires iterating over every element and tracking the best candidate. O(n). In a TreeSet, it's a single ceiling(35) call that touches at most 2·log₂(n) nodes. That's the difference between "let me check my filing cabinet" and "let me open every drawer in the building."

What Costs What?

Operation	HashSet	TreeSet
`add(x)`	O(1) avg	O(log n)
`remove(x)`	O(1) avg	O(log n)
`contains(x)`	O(1) avg	O(log n)
`floor(x)` / `ceiling(x)`	not available	O(log n)
`first()` / `last()` (min/max)	O(n)	O(log n)
`subSet(a, b)`	O(n)	O(log n + k)
Iteration (all elements)	O(n)	O(n), sorted

The "avg" qualifier on HashSet matters. Under pathological hash collisions (the 2011 28C3 hash DoS attack showed this is real), a hash table degrades to O(n) per operation. Java 8's treeification of long buckets caps this at O(log n), but the constant factors are worse than a purpose-built tree. Think of it as patching a tire versus buying a car that handles off-road.

TreeSet has no "average" qualifier. O(log n) is the guarantee. That predictability matters in real-time systems and interview discussions alike.

The Requirements Each Structure Demands

HashSet and TreeSet make different demands on your objects. Think of it like applying for different jobs.

HashSet requires hashCode() and equals(). Two objects that are .equals() must produce the same hash code. If you override one and forget the other, your set silently breaks: elements vanish during lookup because the hash points to the wrong bucket. This bug has ruined more afternoons than anyone wants to admit.

TreeSet requires Comparable (or an explicit Comparator). Your elements must define a total ordering. If they don't, you'll get a ClassCastException the moment you try to add the first element. Not the second element. The first. TreeSet checks your credentials at the door.

There's a subtle trap. TreeSet uses compareTo() for equality, not equals(). If a.compareTo(b) == 0 but !a.equals(b), TreeSet treats them as duplicates and drops one. This violates the Set contract and causes bugs that are genuinely hard to track down. Classic example: new BigDecimal("4.0") and new BigDecimal("4.00") return 0 from compareTo() but false from equals() (they differ in scale). A TreeSet silently drops one. The Java docs call this "consistent with equals" and strongly recommend making compareTo and equals agree. Strongly recommend. Not enforce. Welcome to Java.

One more: HashSet allows one null element. TreeSet throws NullPointerException. The tree needs to compare elements to find their position, and you can't call compareTo() on null. Null has no opinion about where it belongs.

A Worked Example: Contains Duplicate III

LeetCode 220 makes the choice obvious. Given an array, find whether there exist two indices i and j such that the values differ by at most valueDiff and the indices differ by at most indexDiff.

With a HashSet, you'd check every pair in the window. O(n · indexDiff). With a TreeSet acting as a sliding window, you ask one question per element:

from sortedcontainers import SortedList

def contains_nearby_almost_duplicate(nums, index_diff, value_diff):
    window = SortedList()
    for i, val in enumerate(nums):
        if i > index_diff:
            window.remove(nums[i - index_diff - 1])
        # Find nearest element in window within value_diff
        pos = window.bisect_left(val - value_diff)
        if pos < len(window) and window[pos] <= val + value_diff:
            return True
        window.add(val)
    return False

The bisect_left(val - value_diff) call is the TreeSet equivalent of ceiling(val - valueDiff). It finds the smallest element in the window that's at least val - valueDiff. If that element is also at most val + valueDiff, we have our pair. Each step is O(log k) where k is the window size, giving O(n log k) total.

The moment the problem says "find the nearest" or "within a range," you need a TreeSet. A HashSet can't even express the query. It's like asking a phone book sorted by phone number to find everyone named Smith.

Why Not Just Use a PriorityQueue?

A common interview confusion. Both PriorityQueue and TreeSet give you O(log n) insertion and O(log n) access to the minimum. But PriorityQueue.remove(element) is O(n) because it does a linear scan to find the element. TreeSet.remove(element) is O(log n) because the tree knows exactly where the element sits.

This distinction drives LeetCode 480 (Sliding Window Median). As the window slides, you need to remove the element that just fell out. With a PriorityQueue, each removal costs O(k). With a TreeSet, it costs O(log k). O(nk) versus O(n log k). That's the difference between getting a TLE and getting a green checkmark.

The other distinction: TreeSet rejects duplicates, PriorityQueue allows them. If your problem has repeated values and you need a heap, PriorityQueue is the tool. If you need fast arbitrary removal from a sorted collection, TreeSet wins.

LinkedHashSet vs HashSet: The Middle Ground

Java offers a third option: LinkedHashSet. It threads a doubly-linked list through the HashMap entries, preserving insertion order while keeping O(1) operations. The overachiever of the Set family.

LinkedHashSet diagram showing hash buckets with a separate insertion-order doubly-linked list threading through the entries Each entry lives in a bucket for O(1) lookup AND sits on a linked list for ordered iteration. Two data structures for the price of one (plus two extra pointers).

Use LinkedHashSet when you need O(1) operations AND deterministic iteration order. It costs slightly more memory (two extra pointers per entry for the linked list) but gives you the best of both: fast membership testing and predictable traversal.

Across languages, JavaScript's built-in Set behaves like a LinkedHashSet by default: it preserves insertion order and provides O(1) operations.

The Cross-Language Map

Every language has its own names for the same two structures:

Language	Hash-based set	Tree-based ordered set
Java	`HashSet`	`TreeSet`
C++	`std::unordered_set`	`std::set`
Python	`set`	`sortedcontainers.SortedSet`
Rust	`HashSet`	`BTreeSet`
C#	`HashSet<T>`	`SortedSet<T>`
Go	`map[T]struct{}`	(no stdlib equivalent)
JavaScript	`Set`	(no stdlib equivalent)

C++ reverses the naming convention. std::set is the tree-based one, and std::unordered_set is the hash table. This trips up people switching between Java and C++ regularly. Someone at the C++ standards committee chose violence that day.

Rust's ordered set uses a B-tree, not a red-black tree. B-trees pack multiple keys per node, which gives better cache locality than the one-key-per-node layout of red-black trees. Rust's BTreeMap uses a branching factor of B=6, storing up to 11 keys per node.

When Should You Use HashSet vs TreeSet?

Start here:

Do you need floor, ceiling, range queries, or sorted iteration? Yes: TreeSet. Done. Stop reading.
Do your elements have a natural ordering you'll exploit? Yes: consider TreeSet.
Do you need O(1) per operation? Hash tables don't guarantee O(1) worst case, but in practice they're faster than trees for pure membership testing.
Everything else? HashSet. It's the default for a reason.

For small collections (fewer than 100 elements), the choice barely matters. The constant factors dominate, and the CPU's cache can hold the entire structure regardless. The choice matters at scale, when n is in the thousands or millions and the problem requires ordering operations.

Recap

HashSet is a hash table. O(1) average add/remove/contains. No ordering. Needs hashCode() and equals(). Allows null.
TreeSet is a red-black tree. O(log n) guaranteed everything. Sorted iteration. Floor/ceiling/range queries. Needs Comparable or Comparator. No null.
Memory overhead is nearly identical (~32-40 bytes per node in Java).
The deciding question: do you need "nearest element" queries? If yes, TreeSet. If no, HashSet.
LinkedHashSet gives you O(1) with insertion-order iteration.
In interviews, "find within range" or "sliding window with ordering" screams TreeSet. Pure deduplication or membership testing screams HashSet.

If you're preparing for coding interviews and want to practice recognizing when a problem needs an ordered set versus a hash set, SpaceComplexity runs voice-based mock interviews where an AI interviewer probes exactly this kind of data structure choice in real time.