.. This file is part of Sympathy for Data.
.. Copyright (c) 2021 Combine Control Systems
..
.. SYMPATHY FOR DATA COMMERCIAL LICENSE
.. You should have received a link to the License with Sympathy for Data.

Experimental feature guide: Tables/Images vs Datasets?
======================================================

In Sympathy 3.1.0, we introduce lazy-loaded datasets that increase the
flexibility and capacity of the existing data structures. Since this is a new
and experimental feature, it might not be immediately obvious when to use the
new Dataset nodes as opposed to the familiar Table / Image nodes. By
explaining the differences between these data structures, this guide aims to
help you find the most suitable choice for your dataset and use case.

Eager versus lazy loading
-------------------------

Typically, when a node linked to data sources is executed in Sympathy for
Data, the contents of the data files are loaded into memory (RAM). This makes
it easy to reuse, reshape, and explore the data as needed. However, it also
creates a large amount of overhead, particularly if the end goal of the
workflow is to train a machine learning (ML) model. As the data size grows
beyond the available RAM, keeping all the files in memory is no longer
feasible or desirable. Moreover, modern machine learning models rely on
multi-dimensional arrays for feature extraction and often on some form of
parallelised data loading to speed up the training process, which is not
currently possible with eagerly loaded nodes.

In our new paradigm, a dataset structure contains instructions rather than
the data itself. This means that we can effectively package a large set of
instructions to be performed at a later stage, e.g. at model training time.
This works well in machine learning settings, where pre-processing pipelines
are often well-defined for a given end goal such as image classification or
time-series forecasting.
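The difference between the two approaches can be sketched in plain Python.
This is an illustrative sketch only, not Sympathy's actual implementation:
``read_file``, ``eager_load`` and ``LazyDataset`` are hypothetical names
chosen for the example.

```python
from typing import Callable, Iterator, List


def read_file(path: str) -> List[int]:
    # Hypothetical loader used for illustration; in practice this would
    # read e.g. a CSV or image file from disk.
    return [len(path), len(path) * 2]


def eager_load(paths: List[str]) -> List[List[int]]:
    """Eager loading: every file is read into RAM up front."""
    return [read_file(p) for p in paths]


class LazyDataset:
    """Lazy loading: store *instructions* (paths and transforms), not data.

    Nothing is read from disk until the dataset is iterated, e.g. by a
    model's training loop, and only one sample is in memory at a time.
    """

    def __init__(self, paths: List[str]):
        self.paths = paths
        self.transforms: List[Callable] = []

    def map(self, fn: Callable) -> "LazyDataset":
        # Record the transform instead of applying it now.
        self.transforms.append(fn)
        return self

    def __iter__(self) -> Iterator[List[int]]:
        for path in self.paths:
            sample = read_file(path)
            for fn in self.transforms:
                sample = fn(sample)
            yield sample


# Usage: the transform is only executed when the data is consumed.
ds = LazyDataset(["a.dat", "bb.dat"]).map(lambda xs: [x + 1 for x in xs])
print(list(ds))
```

The lazy variant scales to datasets far larger than RAM, since the full list
of samples never needs to exist in memory at once.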
It is also beneficial for workflows where only a much smaller subset of a
larger dataset is important for analysis. By loading only the most pertinent
information, working with larger and richer datasets is now easier than ever
before.

.. image:: dataset_flow.svg
   :target: dataset_flow.svg
   :alt: Dataset workflow overview

New nodes
---------

Many of the classic tabular/image operations have simply been converted into
their dataset equivalents, so that they are easy to find and use. Examples
include
:ref:`com.sympathyfordata.advancedmachinelearning.transformtabledataset`,
:ref:`com.sympathyfordata.advancedmachinelearning.transformimagedataset`,
:ref:`com.sympathyfordata.advancedmachinelearning.convertcolumntypeintabledataset`,
:ref:`com.sympathyfordata.advancedmachinelearning.splitdataset`,
:ref:`com.sympathyfordata.advancedmachinelearning.fitdataset` and
:ref:`com.sympathyfordata.advancedmachinelearning.predict_dataset`.

There are, however, some entirely new nodes as well. These include
model-specific nodes such as
:ref:`com.sympathyfordata.advancedmachinelearning.binarytabularclassifier`
and
:ref:`com.sympathyfordata.advancedmachinelearning.binaryimageclassifier`.
These nodes work in the same way as other machine learning model nodes in
relation to fit and predict nodes as part of the machine learning pipeline.
Note that other machine learning models can also be used with the
:ref:`com.sympathyfordata.advancedmachinelearning.fitdataset` and
:ref:`com.sympathyfordata.advancedmachinelearning.predict_dataset` nodes;
however, the data will then be loaded into memory all at once, which may be
slow for larger datasets.

There are also conversion nodes for moving between datasets and
tables/images. These include
:ref:`com.sympathyfordata.advancedmachinelearning.datasettodict`, which
converts a Dataset object into a dictionary (the structure that underlies the
Dataset), exposing all of its components for later use.
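Conceptually, the dictionary underlying a Dataset can be pictured as a plain
mapping of components. The keys and values below are invented for
illustration and do not reflect Sympathy's actual internal names:

```python
# Illustrative only: a lazy dataset pictured as a dictionary of components.
dataset = {
    "sources": ["part-001.csv", "part-002.csv"],    # where the data lives
    "transforms": ["normalise", "one_hot_encode"],  # instructions, applied later
    "target_column": "label",                       # metadata for fit/predict
}

# A "dataset to dict" style conversion simply exposes these components, so
# downstream nodes can inspect or modify them before any data is read.
print(dataset["transforms"])
```

Because only instructions and metadata are stored, such a structure stays
tiny regardless of how large the underlying source files are.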
The :ref:`com.sympathyfordata.advancedmachinelearning.datasettotables` node
is useful when a much smaller subset of data remains after transformations
and pre-processing have been carried out, so that the data now fits within
the RAM constraints.

Performance comparison
----------------------

The following figure illustrates the clear potential of the dataset concept
in speeding up training pipelines. In this case, the focus is on image
classification, which requires neural network architectures that support
multi-dimensional arrays (representing localised feature relationships
between pixels in an image). We can also see a clear role reversal: eager
loading spends most of its time loading the data, whilst lazy loading allows
us to load and transform the data quite quickly and then leave the heavy
lifting to the model training stage.

.. figure:: speedup_comparison.svg
   :align: center
   :width: 500
   :height: 250

   CIFAR-10 dataset (50,000 images) used for image classification

When looking at tabular data, the comparison becomes more case-specific.
Since tables can scale both horizontally and vertically, it can be difficult
to know whether a particular workflow will be faster using datasets or
tables. Workflow requirements will also shape performance differences, for
example the number of files or the relationship between pre-processing and
training steps. To illustrate a sample comparison, the following plot shows
the performance of a workflow where the same machine learning model node
(`MLP Classifier`) is used in both cases. The only difference is that one
workflow uses Tables and the other Tabular Datasets.

.. figure:: speedup_comparison_tables.svg
   :align: center
   :width: 700
   :height: 345

   Tabular files (60,000 rows per file) used for row classification

As the plot above shows, increasing the data size leads to a crossover point
beyond which using datasets becomes increasingly faster.
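The memory behaviour behind this crossover can be illustrated with
partitioned (chunked) processing. The sketch below is a plain-Python
illustration of the general technique, not Sympathy's actual partitioning
code:

```python
def process_in_chunks(rows, chunk_size=1000):
    """Process data one partition at a time.

    Only `chunk_size` rows are held in RAM at any moment, so memory use
    stays bounded and time grows roughly linearly with data size.
    """
    total = 0.0
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            total += sum(chunk)  # stand-in for a real pre-processing step
            chunk.clear()
    if chunk:  # process the final, partially filled partition
        total += sum(chunk)
    return total


# Works identically for ten rows or a billion; only one chunk is in memory.
print(process_in_chunks(range(10), chunk_size=4))
```

An eager approach, by contrast, must materialise all rows first, which is why
its runtime deteriorates sharply once the data no longer fits in RAM.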
However, it is important to note that the difference is not significant until
we approach the limit of the available RAM of the host computer (16 GB in
this case). At that point, the dataset structure is able to partition the
data and manage memory effectively, maintaining a linear increase in time
spent as a function of data size. Using Tables, by contrast, leads to an
exponential increase in time spent on this workflow. There are also other
variables at play here, including the number of rows and columns per file,
the column data types, and whether the primary workflow focus is on data
exploration or on training machine learning models.

Decision framework
------------------

Our decision framework is by no means a "one-size-fits-all" guide, but
rather a quick reference and heuristic on where to start experimenting with
your Sympathy workflows. It aims to provide a simple guide for the most
common questions you will face when building your Sympathy flows.

The overall rule of thumb is that if your dataset is **1) larger than your
RAM size** and/or **2) ML focused**, then the new dataset data structure
might be a useful place to start. In addition to these key decisions, we
also suggest the following:

* If you are planning to make use of any deep learning models, you will need
  to use datasets due to their data loading requirements.
* Moving between Tables and Tabular Datasets is recommended only in cases
  where the amount of data is large, since the conversion itself incurs a
  large amount of overhead.

FAQ
---

**Q. If I have a workflow that currently uses Tables, can I convert this
into a workflow that uses datasets?**

**A:** This depends on the workflow, but we currently support this in a
limited way via the transformation node
:ref:`com.sympathyfordata.advancedmachinelearning.tablestodataset`. This is
typically used when the dataset is quite small but the model required is
more complex than standard ML models.

**Q.
Are you planning to phase out the use of Tables/Images in future?**

**A:** Not at this time. We view each data structure in light of its
purpose: datasets do not aim to replace Tables / Images but simply extend
their potential to larger datasets and models.

**Q. Will datasets be expanded to other data types, e.g. text data, in
future?**

**A:** Since the focus of this functionality is on customer use cases, this
is difficult to anticipate, but we hope to cover a wider range of data types
in future.

**Q. I have a particular use case in mind which is not yet covered by
datasets, could you add this to the functionality?**

**A:** As an Enterprise user, you are always welcome to send us your
feedback, and we are excited to incorporate it into future feature releases.

**Q. Are there any example flows that I can try out to see how the dataset
nodes are used?**

**A:** Yes. Within Sympathy for Data, you can find examples of dataset
workflows by opening the Help menu and choosing "Advanced Machine Learning",
which opens a folder containing example flows.

**Q. Can I add my own custom machine learning models?**

**A:** We currently only offer the
:ref:`com.sympathyfordata.advancedmachinelearning.binaryimageclassifier`,
:ref:`com.sympathyfordata.advancedmachinelearning.binarytabularclassifier`
and :ref:`com.sympathyfordata.advancedmachinelearning.tabularregressor`
models, to ensure consistent performance. In future, this will be expanded
to allow for more customisation.