Data Structures

Introduction

Data structures are specialized formats for organizing, processing, retrieving, and storing data in a computer's memory. They are chosen to efficiently perform operations for various algorithms and applications. In computer science, data structures are divided into two main categories:
  • Primitive Data Types: These are the most basic types, representing simple values like integers, floating-point numbers, characters, and booleans. They are the building blocks for more complex structures.
  • Abstract Data Types (ADTs): These are more complex structures that consist of both a set of data and a set of operations that can be performed on that data. ADTs, like linked lists, trees, and graphs, are defined by their behavior (what they do) rather than their specific implementation (how they do it).

The choice of data structure is critical and depends on the specific problem being solved. Some structures offer fast access, some provide efficient insertion/deletion, and others are optimized for searching or memory usage.


Arrays

An array is a collection of elements, each identified by at least one array index or key. In most programming languages, arrays store elements of the same data type in contiguous memory locations, allowing for fast, indexed access.

Features:
  • Stores a homogeneous collection of elements.
  • Elements are stored in contiguous memory locations, accessible via a 0-based index.
  • Typically has a fixed size, defined upon creation.
Advantages:
  • O(1) time complexity for accessing elements by index.
  • Memory efficient as it requires no extra storage for pointers or metadata.
Use cases:
  • Storing and accessing a collection of a known size.
  • Implementing other data structures like stacks, queues, and hash tables.
Disadvantages:
  • Insertion and deletion operations are slow (O(n)) because subsequent items must be shifted.
  • Resizing is often an expensive operation that involves creating a new array and copying elements.

Linked Lists

A linked list is a linear collection of data elements whose order is not given by their physical placement in memory. Instead, each element (node) contains data and a pointer to the next node in the sequence.

Features:
  • Can store a homogeneous or heterogeneous collection of elements.
  • Elements are stored in non-contiguous memory locations.
  • Comes in several varieties, including singly-linked, doubly-linked, and circular lists.
Advantages:
  • Insertion and deletion are efficient (O(1)) if a pointer to the target node is already held.
  • Dynamic in size, growing and shrinking as needed.
Use cases:
  • Implementing dynamic data structures where the number of items is unknown.
  • Implementing stacks and queues.
Disadvantages:
  • Accessing an element by index has linear time complexity (O(n)) as it requires traversing the list from the beginning.
  • Requires extra memory to store pointers for each node.

Stacks

A stack is a linear data structure that follows the LIFO (Last-In, First-Out) principle. Elements can be added (pushed) or removed (popped) only from the top of the stack.

Features:
  • Elements are accessed using LIFO ordering.
  • Primary operations are `push` (add to top) and `pop` (remove from top).
Advantages:
  • Simple to implement and use.
  • Useful for reversing a sequence of elements.
Use cases:
  • Managing function calls (the call stack) in recursion.
  • Implementing undo/redo functionality in software.
  • Parsing expressions (e.g., checking for balanced parentheses).
Disadvantages:
  • Accessing elements other than the top is inefficient.
  • Stack size can be limited by available memory.

Queues

A queue is a linear data structure that follows the FIFO (First-In, First-Out) principle. Elements are inserted at the back (enqueue) and removed from the front (dequeue).

Features:
  • Elements are accessed using FIFO ordering.
  • Primary operations are `enqueue` (add to rear) and `dequeue` (remove from front).
Advantages:
  • Easy to implement and use.
  • Fairly orders tasks or elements.
Use cases:
  • Managing tasks for a printer or CPU scheduling.
  • Implementing a buffer for streaming data.
  • Breadth-First Search (BFS) algorithms in graphs.
Disadvantages:
  • Accessing elements other than the front is inefficient.

Heaps

A heap is a specialized tree-based data structure that satisfies the heap property. In a max-heap, for any given node, the value is greater than or equal to the values of its children. In a min-heap, it is less than or equal to.

Features:
  • Typically implemented as a complete binary tree.
  • Supports efficient insertion and removal of the minimum or maximum element.
Advantages:
  • Finding the minimum or maximum element is very fast (O(1)).
  • Insertion and deletion are efficient (O(log n)).
Use cases:
  • Implementing priority queues.
  • The Heapsort algorithm.
  • Graph algorithms like Dijkstra's algorithm.
Disadvantages:
  • Searching for an element other than the min/max is inefficient (O(n)).

Trees

A tree is a hierarchical data structure consisting of nodes connected by edges. It has a topmost node known as the root, and each node can have zero or more children nodes.

Features:
  • Represents hierarchical relationships.
  • Common types include Binary Trees, Binary Search Trees (BSTs), and balanced trees (e.g., AVL, Red-Black Trees).
Advantages:
  • In a balanced search tree, insertion, deletion, and search operations are very efficient (O(log n)).
  • Naturally represents hierarchical data like file systems or organizational charts.
Use cases:
  • Implementing search algorithms (e.g., Binary Search Tree).
  • Storing data for efficient retrieval (database indexing).
Disadvantages:
  • Can be complex to implement correctly.
  • In an unbalanced tree, performance can degrade to that of a linked list (O(n)).

Graphs

A graph is a non-linear data structure consisting of a set of vertices (or nodes) connected by edges. Edges can be directed (one-way) or undirected (two-way) and may have weights associated with them.

Features:
  • Represents complex network relationships.
  • Can be directed or undirected, weighted or unweighted, cyclic or acyclic.
Advantages:
  • Ideal for modeling real-world networks like social connections, road maps, and the internet.
  • Supports powerful traversal algorithms like Depth-First Search (DFS) and Breadth-First Search (BFS).
Use cases:
  • Representing social networks.
  • Pathfinding algorithms like Dijkstra's and A*.
  • Modeling dependencies in tasks or computer networks.
Disadvantages:
  • Can be difficult to implement and visualize.
  • Many graph algorithms have high time complexity.

Hash Tables

A hash table is a data structure that implements an associative array, which maps keys to values. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

Features:
  • Elements are accessed using key-value pairs.
  • Uses a hash function to map keys to array indices.
  • Requires a collision resolution strategy (e.g., chaining or open addressing).
Advantages:
  • Extremely efficient insertion, deletion, and search operations on average (O(1)).
Use cases:
  • Implementing dictionaries, maps, and sets in programming languages.
  • Database indexing and caching.
Disadvantages:
  • Worst-case time complexity can be slow (O(n)) if there are many hash collisions.
  • Choosing a good hash function is critical for performance.

Exotic Structures

Exotic data structures are less common structures designed to solve specific, complex problems with high efficiency. While not as widely used as arrays or lists, they are powerful tools for specialized applications.

Bloom filters

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It is designed to be extremely fast and memory-efficient.

Features:
  • Uses multiple hash functions to map an item to several bits in a bit array.
  • False positives are possible, but false negatives are not. If the filter says an item is not in the set, it is definitively not. If it says the item is in the set, it might be a false positive.
Use cases:
  • Checking for previously seen items, e.g., to avoid showing a user the same news article twice.
  • Network routers use them to block malicious websites without storing a giant list.

HyperLogLog

HyperLogLog is a probabilistic data structure used to estimate the number of distinct elements (the cardinality) in a set, using a very small and fixed amount of memory.

Features:
  • Provides a highly accurate cardinality estimate with minimal memory usage.
  • Uses a hash function and statistical analysis of the resulting bit patterns.
Use cases:
  • Counting unique visitors to a website in real-time.
  • Estimating the number of distinct words in a very large text corpus.

Interval Trees

An interval tree is a data structure for storing intervals and efficiently finding all intervals in the collection that overlap with a given query interval or point.

Features:
  • A tree structure where each node stores information about intervals.
  • Provides a fast way to find all overlapping intervals.
Use cases:
  • Finding scheduling conflicts in a list of appointments.
  • In genomics, finding all genes that overlap a specific DNA region.

Tries

A trie (also known as a prefix tree) is a tree-like data structure that stores a dynamic set of strings, where the keys are usually strings. Nodes do not store keys; instead, an element's position in the tree defines the key with which it is associated.

Features:
  • Each node represents a common prefix of a set of keys.
  • Very efficient for prefix-based searches.
Use cases:
  • Implementing autocomplete functionality in search engines.
  • Serving as a dictionary for spell-checking.

Wavelet trees

A wavelet tree is a compact data structure for storing a sequence of symbols, enabling efficient queries on that sequence. It compresses the data while still allowing for fast ranking, selection, and access operations.

Features:
  • Compresses data by exploiting character frequencies.
  • Answers complex queries (like "how many times does 'a' appear in the first 1000 characters?") in logarithmic time.
Use cases:
  • Bioinformatics for analyzing genomic sequences.
  • Full-text search and text indexing in large document databases.

String Representation

While strings are often a primitive type, their internal representation can be optimized using specialized data structures, especially for very large strings.

  • Rope: A binary tree used to efficiently store and manipulate large strings. Concatenation and substring operations are much faster than with simple character arrays.
  • Array of Strings: A simple and effective method for a collection of many small, fixed-size strings.
  • Trie: As mentioned before, a trie is ideal for storing large numbers of strings that share common prefixes, like in a dictionary.
  • Hash Table: Can be used for fast string retrieval and ensuring uniqueness in a collection of strings.
  • Suffix Array/Tree: Advanced structures used for complex substring searches. They store all suffixes of a string in a sorted manner, enabling extremely fast lookups. Useful in text compression and bioinformatics.

The choice of which data structure to use depends on the specific requirements of the application, such as the size and number of strings, the operations needed, and the available memory.


Read next: Algorithms