When creating any kind of data warehouse or lakehouse, you need at least three steps: you get the data from the source systems (often called ingestion or extraction), you clean the data and apply business logic, and you model the data for consumption. For data warehouses this is called ETL (Extract, Transform and Load) or ELT (Extract, Load, Transform). Not much is different with lakehouses; you might add more steps, but the general concepts are the same.
One of the most important steps is transforming the data, or applying business logic. Data rarely comes from the source system ready for analysis; some logic needs to be applied to transform and shape it. When creating traditional data warehouses in SQL Server, most developers used either stored procedures or views to store the transformation logic.
Options for storing business logic in a Microsoft Fabric lakehouse
Things are slightly different with lakehouses in Fabric. You still have multiple ways to store your transformation logic. Below I go through the most common ones and discuss the pros and cons of each.
Notebooks
You can store the logic in notebooks. This is similar to creating a stored procedure in SQL Server: you create one notebook per table and write the logic in there. One of the pros of this method is that it's very explicit; each notebook has one purpose and it's easy to understand where the logic is kept. Another pro is that you can write the logic in a variety of languages. The main con is that you might end up with a lot of notebooks if you have a lot of tables.
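To make this concrete, below is a minimal sketch of what a per-table transformation notebook could look like in PySpark. The table names (bronze_customers, silver_customers) and the transformation itself are hypothetical examples, not a prescribed pattern.

```python
# Minimal sketch of a per-table transformation notebook in PySpark.
# Table names (bronze_customers, silver_customers) are hypothetical examples.
from pyspark.sql import functions as F

# Read the raw table from the lakehouse
df = spark.read.table("bronze_customers")

# Apply the business logic for this one table
df_clean = (
    df.filter(F.col("is_deleted") == False)              # drop soft-deleted rows
      .withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))
      .withColumn("loaded_at", F.current_timestamp())    # add an audit column
)

# Write the result to the next stage as a Delta table
df_clean.write.mode("overwrite").format("delta").saveAsTable("silver_customers")
```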
Views in the SQL endpoint
You can store your logic in views in the SQL endpoint. This allows you to write your views in the familiar SQL language, exactly like using views in SQL Server. The pros of this method are that it's very explicit: you have one view per table and it's easy to see where the logic is. Another pro is that a lot of developers know SQL and find it comfortable to work with. One of the main pros is that you can have a generic notebook which loops through the views and applies the logic to the tables. Another pro is that you can use either notebooks or Data Factory when moving the data to the next stage.

One of the cons is that you will potentially have a lot of views if you have a lot of tables. Another major con at this point in time (September 2024) is that the only way you can query SQL views from a notebook is with Microsoft's connector for the SQL endpoint and warehouse (see more information here: https://learn.microsoft.com/en-us/fabric/data-engineering/spark-data-warehouse-connector). There are several limitations to the connector at present. One major limitation is that you can only use Scala to connect to and read from the SQL endpoints. Another is that you can only query the whole table or view; no custom queries are allowed. This last one shouldn't be too much of a problem for storing business logic, as the view should contain only what is necessary. At the time of this writing (September 2024) the connector is quite unstable, but that will hopefully get fixed very soon.
Spark views
Similar to SQL views, Spark has a concept of permanent views as well. You can write them in Spark SQL and they work the same as SQL views. If you prefer languages other than Spark SQL, you can build a dataframe in another language and then write the view using Spark SQL, giving you the flexibility of Spark and the convenience of SQL. One of the pros of this method is that it's very explicit: you have one view per table, making it easy to see where the logic is. Another pro is that you get the full flexibility of Spark. One con is that you might end up with a lot of views if you have a lot of tables. Another con is that these Spark views are not visible anywhere in the UI; the only way to discover them is via code.
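As a rough illustration, here is how a permanent Spark view holding the business logic could be created from a notebook with Spark SQL, together with a simple generic loop that writes each view to its target table. The view and table names are hypothetical examples.

```python
# Sketch: define a permanent Spark view holding the business logic,
# then materialise views into tables with a generic loop.
# View and table names are hypothetical examples.

# Create (or replace) a permanent view containing the transformation logic
spark.sql("""
    CREATE OR REPLACE VIEW vw_silver_customers AS
    SELECT customer_id,
           concat_ws(' ', first_name, last_name) AS full_name,
           current_timestamp() AS loaded_at
    FROM bronze_customers
    WHERE is_deleted = false
""")

# Generic loop: read each view and write the result to its target table
views_to_tables = {"vw_silver_customers": "silver_customers"}

for view_name, table_name in views_to_tables.items():
    df = spark.read.table(view_name)
    df.write.mode("overwrite").format("delta").saveAsTable(table_name)
```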
Files or database
You can store your business logic as code in files or in a database. You could, for example, store the logic in JSON files and query it from there at runtime. Similarly, you could store the logic in a column in a database and query that at runtime. One of the pros of this method is that you separate your workload from your logic; the logic can be edited outside of your Fabric environment. Done well, a subject matter expert without any Fabric knowledge can maintain the logic. The con is that maintaining complex logic in files or database columns can become unwieldy; it needs to be planned and executed well to work properly.
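As a sketch of the file-based approach, the example below assumes the logic is kept in a small JSON file in the lakehouse Files area and applied at runtime with PySpark. The file path, JSON structure and table names are all hypothetical.

```python
# Sketch: business logic stored as JSON and applied at runtime.
# The file path, JSON structure and table names are hypothetical examples.
import json

# Example contents of Files/config/silver_customers.json:
# {
#   "source": "bronze_customers",
#   "target": "silver_customers",
#   "filter": "is_deleted = false",
#   "columns": ["customer_id",
#               "concat_ws(' ', first_name, last_name) AS full_name"]
# }

# Assumes a default lakehouse is attached to the notebook
with open("/lakehouse/default/Files/config/silver_customers.json") as f:
    rule = json.load(f)

df = (
    spark.read.table(rule["source"])
         .filter(rule["filter"])           # filter expression stored as text
         .selectExpr(*rule["columns"])     # column expressions stored as text
)

df.write.mode("overwrite").format("delta").saveAsTable(rule["target"])
```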
What to choose
So, what to choose? There is no one correct answer; it depends a lot on your requirements. At the present time (September 2024) I would recommend using either notebooks or Spark views to store the logic. I would use either option if I want Data Factory to orchestrate the data load, and notebooks if I want to use MSSparkUtils to orchestrate it. When it becomes easier to work with SQL views in notebooks, I might consider using them, but the only reason I would use them at present is if I use Pipelines to read and write the data.
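For the notebook-plus-MSSparkUtils option, a driver notebook could orchestrate the per-table notebooks along these lines. This is only a sketch: the notebook names are hypothetical, and a real setup would add parameters and error handling.

```python
# Sketch: a driver notebook orchestrating per-table transformation notebooks
# with MSSparkUtils. The notebook names are hypothetical examples.
from notebookutils import mssparkutils  # usually pre-loaded in Fabric notebooks

notebooks = ["Transform Customers", "Transform Orders", "Transform Products"]

for name in notebooks:
    # Run each transformation notebook with a 30-minute timeout
    mssparkutils.notebook.run(name, 1800)
```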
I hope this article helps you understand the possibilities of where to store your business logic in Microsoft Fabric. Please leave comments if you have anything to add or get in touch if you want to hear how we can help you work with Microsoft Fabric.