Before one can talk about how to implement partitioning and the features of partitioning one must first understand
the need; what are partitions and why might someone consider using them? When you create tables, you design those tables to
store information about an entity – i.e. customers or sales. Each table should have attributes that describe only that
entity and for customers and sales the historical premise is that all of your customers and all of your sales go into their
respective tables. While a single table for each entity is the easiest to design and understand, it may not be the best for
performance, scalability and manageability especially as the table grows large. Partitioning can provide benefits for both
large tables (and/or their indexes) and tables which have varying access patterns. More specifically, through partitioning
practices large tables have better scalability and manageability and the use of tables that have changing data is simplified
when adding or deleting large "chunks" (or ranges) of data.
So what constitutes a large table? The idea of VLDB (Very Large Database) is that the total size of the database
is measured in hundreds of gigabytes or even terabytes but the term does not necessarily specify individual table sizes. A
large table is one that does not perform as desired or one where the maintenance costs have gone beyond pre-defined maintenance
periods. Furthermore, a table can be considered large if one user’s activities significantly affect another or if maintenance
operations affect other user’s abilities. In effect, this even limits availability. Even though the server is available,
how can you consider your database available when the sales table’s performance is severely degraded or even inaccessible
during maintenance for 2 hours per day, per week, or even per month? In some cases, periodic downtime is acceptable yet it
is often possible to avoid or minimize downtime through better design.
A table whose access patterns vary may also be considered large when sets (or ranges) of rows have very different
usage patterns. Although usages patterns may not always vary (and this is not a requirement for partitioning), when usage
patterns do vary there can be additional gains. Again, thinking in terms of sales, the current month's data is read-write
while the previous month's data (and often the larger part of the table) is read-only. Large tables where the data usage varies
or large tables where the maintenance overhead is overwhelming can limit the table's ability to respond to varied user requests,
in turn limiting both availability and scalability. Moreover, especially when large sets of data are being used in different
ways maintenance operations can end up routinely maintaining static data. Performing maintenance operations on data which
does not truly need it – is costly. The costs can be seen in performance problems, blocking problems, backups (space,
time and operational costs) as well as negatively impacting the overall scalability of the server.
Furthermore, if a large table exists on a system with multiple CPUs, partitioning the table can lead to better
performance through parallel operations. Large-scale operations across extremely large data sets – typically many million
rows – can benefit by performing multiple operations against individual subsets in parallel. A simple example of performance
gains over partitions can be seen in previous releases with aggregations. For example, instead of aggregating a single large
table, SQL Server can work on partitions independently and then aggregate the aggregates. In SQL Server 2005, joins can benefit
directly from partitioning; SQL Server 2000 supported parallel join operations on subsets yet needed to create the subsets
on the fly. In SQL Server 2005, related tables (i.e. Order and OrderDetails) that are partitioned to the same partitioning
key and the same partitioning function are said to be aligned. When the optimizer detects that two partitioned and aligned
tables are joined, SQL Server 2005 can choose to join the data which resides on the same partitions first and then combine
the results. This allows SQL Server 2005 to more effectively use multiple-CPU machines.
So how can partitioning help? Where tables and indexes become very large, partitioning can help by splitting
large amounts of data into smaller more manageable chunks (i.e. partitions). The type of partitioning described in this paper
is termed horizontal partitioning. With horizontal partitioning, large chunks of rows will be stored in multiple separate
partitions. The definition of the partitioned set is customized, defined, and managed – by your needs. Partitioning
in SQL Server 2005 allows you to partition your tables based on specific data usage patterns using defined ranges. Finally,
SQL Server 2005 offers numerous options for the long-term management of partitioned tables and indexes by adding complementary
features designed around the new table and index structure.
The concept of partitioning is not new to SQL Server. In fact, forms of partitioning have been possible in every
release. However, without features to aid in creating and maintaining your partitioning scheme, partitioning has often been
cumbersome and underutilized. Typically, the design is misunderstood by users and developers and the benefits are diminished.
However, because of the significant performance gains inherent in the concept, SQL Server 7.0 began improving the features
enabling forms of partitioning through partitioned views and SQL Server 2005 now offers the largest advances through partitioned
tables.
In SQL Server 6.5 and earlier, partitioning had to be part of your design as well as built into all of your
data access coding and querying practices. By creating multiple tables and then managing access to the correct tables through
stored procedures, views or client applications you could often improve performance for some operations but at the cost of
complexity of design. Each user and developer needed to be aware of and properly reference the correct tables. Each partition
was created and managed separately and views were used to simplify access; however, this solution yielded few performance
gains. When a UNIONed view existed to simplify user/application access, the query processor had to access every underlying
table in order to determine if data was needed for the result-set. If only a limited subset of those underlying tables was
needed, then each user and developer needed to know the design in order to reference only the appropriate table(s).
The challenges faced by manually creating partitions in releases prior to SQL Server 7.0, were primarily related
to performance. While views simplified application design, user access and query writing, they did not offer performance gains.
With the release of SQL Server 7.0, views were combined with constraints to allow the query optimizer to remove irrelevant
tables from the query plan (i.e. partition elimination) and significantly reduce overall plan cost when a UNIONed view accessed
multiple tables.
In Figure 1, examine the YearlySales view. Instead of having all sales within one single, large table, you could
define twelve individual tables (SalesJanuary2003, SalesFebruary2003, etc¼) and then views for each quarter as well as a View for the entire
year – YearlySales.