Data structure

The best ways to organise data were first put forward in the early 1970s and have changed remarkably little since. The use of data analysis, and the resulting data tables, form the start point for every good database system. There is a standard way of splitting the data into separate tables, which is known as normalisation.

Excel, however, has always been different. The start point is an empty grid, and the user can put anything anywhere - there is nothing to lead the user towards any approach rather than any other. What can happen as a result is that data becomes scattered throughout a workbook and is mixed in with the analysis which is the purpose of the workbook. Such workbooks can become very difficult to extend and maintain.

However the advantages of good data design apply just as much to Excel as to any database system:

Separating the data from its subsequent analysis ensures that the data is easy to maintain.
Organising the data in separate (normalised) tables ensures that data is not duplicated.
Many of the features of Excel are designed to work with data in tables.

To give a simple example of how this would work: imagine a table of invoices, and each invoice is associated with a customer. Customers have codes, also names, addresses and other information. One could include all this information on the invoice table, but doing so would result in a great deal of duplication, as one would need to repeat the address etc. on every invoice for the same customer, and there would also be the risk of error in doing it this way. So instead one would just include the customer code on the invoice, then set up a separate customer table with the other customer details.

The end point of this process is data in a number of different tables. Possibly the most significant improvement to Excel was the introduction of support for Tables in Excel 2007, which among other features, for the first time allowed the data to be referenced by its name rather than its address. However there were still limitations - for example the traditional Pivot Table is designed to work from a single Excel table. And although it is easy to set up a formula to lookup a table based on one column, looking up two columns requires a slightly awkward array formula which can be slow with large tables.

Subsequent versions of Excel have improved on this, for example with the introduction of Power Pivot and Power Query, now known as Get and Transform. However these do not always fit easily into existing workbooks or the types of analysis which the user wants to carry out. And to use Get and Transform for anything beyond what is provided in the user interface, the user has to learn the M language which is unlike anything else in Excel.

Data Manipulation Tools

We use a toolset for manipulating tabular data efficiently in Excel VBA which greatly simplifies project development. The software we use is also available for user download on GitHub.

In addition to worksheets and databases, temporary, virtual tables can be created and manipulated in a succession of steps to produce the final desired result. The main features are:

There are extensive facilities for manipulating data on worksheets.
Databases can be created, maintained (including adding new tables) and queried with SQL with simple programming.
There is a wide range of data transformations.
There is a full object model for use in VBA where parameters are easy to change.
Methods can be called individually so can be placed where required in VBA code.
In addition to high level procedures (such as left join, unpivot, etc.) access is also provided to individual records and fields.
Calculation is done with Excel formulae. Regular Expressions are supported for data extraction.
User defined functions in VBA are supported.
It provides an easy way to maintain workbook data in an Access database or save it in other formats.
It is straightforward to create reports and ensure that formulae are protected and formatting is applied.
It runs on all versions of Excel from 2007 onwards.

The data transformations provided include:

Grouping and creation of totals, subtotals and other types of summary.
Pivot and unpivot.
Selection and filtering.
Match and join operations, to bring in data from other tables.
Sorting.
Cross product, which creates all combinations of two tables and is useful for evaluating multiple instances of the data.
Calculation using standard Excel formulae (unlike Get and Transform which requires the user to learn a completely different formula language).

There is a wide variety of output options. The results can be written back to the workbook as a table, a simple list or fed into a pivot table. Alternatively a lookup function is provided so individual results can be placed in any cell.

Use of SQL

In almost every database system, the main way data is handled is through SQL. And with good reason - almost any transformation of the data can be carried out in a single step. But although it is possible to use SQL with Excel, it is not straightforward as one can not directly reference the active workbook. We also provide a tool to set up the necessary infrastructure so that SQL can be used easily in a variety of ways:

At the simplest, the result of a SQL query can be returned as another Table (or simple list output) in the workbook.
It is also possible to use a UDF (user-defined function) which returns the result of a SQL query. This makes for a simple but very powerful way of producing summaries of the data, and by paramaterising the SQL, such summaries can be made to respond to user selections.
The output from a SQL query can form the data source for a pivot table.
Where the output requirements are more complex, the SQL query provides a start point for further analysis using the virtual table approach described above.

Also, sometimes Excel tables are not adequate for the storage of data and a true database is needed. Please see Database Options for our approach to this.