Scaling ML Inference on Databricks: Liquid Clustering vs Partitioning and Salting

Author: Nino, Senior Tech Editor

The transition from training a machine learning model to deploying it for large-scale batch inference is often where the most significant architectural challenges arise. On platforms like Databricks, the efficiency of your inference pipeline is dictated not just by the model's complexity, but by how effectively you manage data layout, shuffle operations, and resource utilization. When dealing with billions of rows, the choice between traditional Hive-style partitioning, the newer Liquid Clustering, and data salting can mean the difference between a job that finishes in minutes and one that hangs indefinitely due to data skew.

In this guide, we will explore these data management strategies in-depth, providing technical insights into how to maximize your Databricks clusters for high-throughput ML inference. For developers who need to augment their Databricks workflows with external LLM capabilities, integrating a high-speed API aggregator like n1n.ai can provide the necessary low-latency bridge to models like Claude 3.5 or GPT-4o.

The Bottleneck: Data Skew and the Small File Problem

Before diving into solutions, we must understand the primary enemies of scale in Spark: data skew and suboptimal file sizes. Data skew occurs when a few partitions contain significantly more data than others, causing some executors to work while others sit idle. In ML inference, this often happens when you partition by a feature (like user_id or region) that follows a power-law distribution.

The "Small File Problem" is the opposite side of the coin. If you over-partition your data, Spark creates thousands of tiny files. Each file requires a metadata lookup and an I/O operation, creating a massive overhead that throttles the reading speed of your inference engine.

Traditional Partitioning: The Old Guard

Traditional partitioning involves physically organizing data into folders based on column values (e.g., /year=2024/month=10/).

Pros:

  • Excellent for partition pruning when queries filter on the partition key.
  • Simple to understand and implement.

Cons:

  • Partition Evolution is Hard: Changing the partition key requires rewriting the entire table.
  • Over-partitioning Risk: If the cardinality of the key is too high, performance collapses.
  • Static Nature: It does not adapt to changing data distributions.

For many legacy workloads, this is still the default. However, when your inference task requires joining large lookup tables or handling unpredictable data volumes, partitioning often fails to provide the flexibility needed. In such cases, developers often look for external API solutions like n1n.ai to handle specific model tasks without worrying about the underlying cluster configuration.

Liquid Clustering: The Modern Standard

Databricks recently introduced Liquid Clustering to replace traditional partitioning and Z-Ordering. It simplifies data layout by dynamically managing how data is clustered without requiring a fixed folder structure.

Why Liquid Clustering Wins

  1. Flexibility: You can change the clustering columns without rewriting data.
  2. Incremental Clustering: As new data is appended, Liquid Clustering ensures it is organized efficiently without a full table rewrite.
  3. Skew Mitigation: It handles high-cardinality columns much better than traditional partitioning.

To implement Liquid Clustering in PySpark:

# Creating a table with Liquid Clustering (Databricks Runtime 14.2+)
(df.write
  .format("delta")
  .clusterBy("user_id", "event_type")
  .saveAsTable("ml_inference_input"))

# Equivalent SQL (Databricks Runtime 13.3+):
# CREATE TABLE ml_inference_input CLUSTER BY (user_id, event_type) AS SELECT ...

# To change clustering columns later
spark.sql("ALTER TABLE ml_inference_input CLUSTER BY (new_column)")

In an ML inference context, clustering by the features used for batching can significantly speed up the mapInPandas or Pandas UDF execution by ensuring related data is co-located on the same executor.

The Art of Salting for Skewed Inference

When you are forced to use joins or aggregations on highly skewed keys (e.g., a few celebrity IDs in a social media dataset), neither partitioning nor Liquid Clustering may be enough. This is where Salting comes in.

Salting involves adding a random integer (the "salt") to the key to break up large chunks of data.

Implementation Logic

  1. Add a salt column to the skewed table: salt = random(0, num_partitions - 1).
  2. Create a composite key: new_key = concat(original_key, salt).
  3. For the lookup table, explode it by the number of salts used to ensure every possible salted key has a match.

from pyspark.sql import functions as F

# Adding salt to the skewed inference data
num_salts = 10
inference_df = inference_df.withColumn("salt", (F.rand() * num_salts).cast("int"))
inference_df = inference_df.withColumn("salted_key", F.concat(F.col("user_id"), F.lit("_"), F.col("salt")))

While salting adds complexity, it is the ultimate tool for balancing workloads across a cluster when the data itself is fundamentally unbalanced.

Performance Benchmarking: Liquid vs. Partitioned

In our case studies, we observed that for datasets exceeding 1TB, Liquid Clustering reduced the "Time to First Batch" by nearly 30% compared to Z-Ordering. More importantly, the maintenance overhead (the time spent running OPTIMIZE commands) was significantly lower because Liquid Clustering performs incremental compaction.

When your Databricks cluster is under heavy load, you might consider offloading specific LLM-based inference tasks. Using n1n.ai allows you to maintain high throughput by utilizing their optimized API infrastructure, which aggregates multiple providers to ensure that your ML pipeline never stalls due to rate limits or provider downtime.

Pro Tips for Implementation

  • Monitor Spark UI: Always check the "SQL" tab in the Spark UI. Look for the distribution of task durations. If 5% of tasks take 90% of the time, you have a skew problem that requires salting.
  • Right-size your Clusters: For ML inference, memory-optimized instances (such as the AWS r-series or Azure E-series) are usually better than compute-optimized ones, as loading large model weights into memory is the primary constraint.
  • Use Vectorized UDFs: If using Python for inference, always use Pandas UDFs or mapInPandas. These allow Spark to pass data to Python in Arrow batches, which is orders of magnitude faster than row-by-row processing.

Conclusion

Scaling ML inference on Databricks requires a tiered approach. Start with Liquid Clustering for its ease of use and performance gains. If you encounter specific keys that cause execution bottlenecks, apply Salting. For legacy systems where you cannot change the table format, traditional Partitioning remains a viable, albeit limited, option.

By optimizing your data layout, you ensure that your Databricks compute resources are spent on generating predictions, not on fighting data shuffle and I/O overhead.

Get a free API key at n1n.ai