Introducing: Dal.io
The Data Science tool-belt has grown significantly as the industry matured. Nowadays, professionals must be skilled with importing, cleaning, processing and graphing data, not to mention evaluating the statistical significance of findings and managing several data sources. While these are natural developments in this hot industry, they create barriers for other professionals who would like to dabble in it.
Dal.io was originally created to tackle this exact problem. We, at the University of Toronto Mississauga, were trying to migrate part of our fourth-year risk management curriculum from MS Excel to Python. However, we had trouble teaching students the Pandas workflows necessary to implement course material. Dal.io abstracts several of these workflows into “layers” that connect to each other to form graphs. The package does this while allowing users to create their own layers implementing features relevant to their expertise.
Architecture
Graphs consist of four base layer objects:
- Externals: used to import raw data from an external source.
- Translators: used to process and standardize raw external data.
- Pipes: used to transform a single input and output results.
- Models: used to combine multiple inputs into one output.
The code below creates a model that
- Imports financial data from Yahoo! Finance,
- Cleans the data,
- Indexes stock prices at 100, and
- Graphs the results.
stocks = YahooStockTranslator()(YahooDR())

close_indexed = DateSelect(start="2018")(stocks) + \
    Index(100, columns="close")

plot = PandasXYGrapher(legend="upper left") \
    .set_input("data_in", close_indexed) \
    .set_output("data_out", PyPlotGraph(figsize=(12, 8)))
I’ll note that at this point, no data has been imported yet. The graph we have built is just a representation of a data-processing pipeline. We can import the data and output results by calling
plot.run(ticker=["AAPL", "MSFT", "SPY"])
which yields a graph of the indexed closing prices for the three tickers.
This architecture allows graphs to be easily transported and modified, while keeping memory usage at a minimum.
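To make the deferred-execution idea concrete, here is a minimal sketch in plain Python. It is not Dal.io's implementation: the Source and Step classes, the fake_prices data and the index_at_100 helper are all hypothetical stand-ins, and plain dictionaries take the place of data frames. It only shows that connecting nodes fetches nothing, and that data flows through the graph once run is called.

```python
class Source:
    """Hypothetical stand-in for an external layer: fetches only when asked."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable that accepts keyword arguments
        self.calls = 0       # number of times data was actually fetched

    def run(self, **kwargs):
        self.calls += 1
        return self._fetch(**kwargs)


class Step:
    """Hypothetical stand-in for a pipe: one upstream node, one transformation."""

    def __init__(self, func, upstream):
        self._func = func
        self._upstream = upstream

    def run(self, **kwargs):
        # Pull from upstream at run time, then transform what comes back.
        return self._func(self._upstream.run(**kwargs))


def fake_prices(ticker):
    # Made-up numbers; a real external layer would query a remote service.
    table = {"AAPL": [50.0, 55.0, 60.0], "MSFT": [200.0, 210.0, 190.0]}
    return {name: table[name] for name in ticker}


def index_at_100(prices):
    # Rescale each series so that its first observation equals 100.
    return {name: [100.0 * p / series[0] for p in series]
            for name, series in prices.items()}


source = Source(fake_prices)
pipeline = Step(index_at_100, source)  # building the graph fetches nothing
calls_before_run = source.calls
result = pipeline.run(ticker=["AAPL", "MSFT"])  # data is requested only here
```

Because the pipeline object holds only references to functions and upstream nodes, it stays small no matter how much data eventually passes through it.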
Inheritance
The main objective of Dal.io is allowing developers to do what they do best, be it scraping, processing, mathematics or graphing. That is why the package includes several tools designed with ease of development in mind. For example, the _ColGeneration Pipe subclass implements several features to select, add, drop and replace columns. This way, subclasses can focus on the column transformation itself. This is how we implement the Index object from above:
class Index(_ColGeneration):

    def __init__(self, index_at, *args,
                 columns=None, new_cols=None,
                 drop=True, reintegrate=False,
                 **kwargs):
        self._index_at = index_at
        super().__init__(
            columns=columns, new_cols=new_cols,
            drop=drop, reintegrate=reintegrate
        )

    def copy(self, *args, **kwargs):
        return super().copy(
            self._index_at,
            *args, **kwargs
        )

    def _gen_cols(self, inter_df, **kwargs):
        return inter_df.apply(
            index_cols,
            axis=0,
            args=self._args,
            i=self._index_at,
            **self._kwargs
        )
While this is a simple transformation, we could have done anything else with the inter_df parameter (the selected portion of the input data frame) and returned it. The rest would have been taken care of by the inherited class.
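The division of labour between a base class and its subclasses can be sketched without Dal.io or Pandas. In the simplified example below, ColumnStep and Rebase are hypothetical names, dictionaries of lists stand in for data frames, and the meaning given to the drop flag is an assumption made for illustration rather than a description of how _ColGeneration treats it.

```python
class ColumnStep:
    """Hypothetical base class: selects columns and merges results back."""

    def __init__(self, columns=None, drop=True):
        self._columns = columns
        self._drop = drop  # assumed meaning: discard columns not selected

    def transform(self, table):
        # Select the requested columns (all of them when none are named).
        names = list(table) if self._columns is None else list(self._columns)
        selected = {name: table[name] for name in names}
        generated = self._gen_cols(selected)
        if self._drop:
            return generated
        # Keep the untouched columns and overwrite the transformed ones.
        merged = dict(table)
        merged.update(generated)
        return merged

    def _gen_cols(self, selected):
        # Subclasses supply only the column transformation itself.
        raise NotImplementedError


class Rebase(ColumnStep):
    """Hypothetical subclass: rescales each column to start at a given value."""

    def __init__(self, index_at, **kwargs):
        super().__init__(**kwargs)
        self._index_at = index_at

    def _gen_cols(self, selected):
        return {name: [self._index_at * v / values[0] for v in values]
                for name, values in selected.items()}


table = {"close": [50.0, 55.0, 60.0], "volume": [10, 20, 30]}
only_close = Rebase(100.0, columns=["close"]).transform(table)
with_rest = Rebase(100.0, columns=["close"], drop=False).transform(table)
```

The subclass contains nothing but its own parameter and the transformation, which is the same shape as the Index class above: the selection and bookkeeping live in one place and every new transformation inherits them.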
Conclusion
There are several more of these objects in Dal.io and several levels of abstraction you can explore. What I showed here was just the tip of the iceberg. If you’re interested in learning more, or seeing an example of how to create an optimized portfolio of stocks, check out our documentation; if you’d like to jump straight to the source, fork us on GitHub.