People who work with data spend most of their time on data pipelines, because the bulk of a machine learning project is data collection and cleaning. Loominus gives everyone the power to build the data pipelines critical to any machine learning project.
Teraport is a powerful tool within the Loominus product suite that ingests and stages data. In another post, we’ll discuss the data ingestion APIs. For now we’ll focus on building a powerful data pipeline for feature engineering.
We’re going to build a data pipeline that generates the average credit score of borrowers within a portfolio of loans. For added complexity, we’ll weight each credit score by the borrower’s outstanding revolving credit balance. Finally, we’ll group the loans by whether they are on time or delinquent and aggregate the weighted credit scores. The result will be one weighted average credit score for on-time loans and one for delinquent loans.
Teraport Reporting Table Designer
In Teraport, your raw data is collected into staging tables. The first step in building data pipelines with Teraport is to create a reporting table and design a data transformation pipeline that sources from a staging table.
Our staging table contains a status column that can be one of three values: “Delinquent”, “In Grace Period” or “Not Delinquent”. A status of “In Grace Period” means the borrower made the monthly payment within a 15-day grace period of the due date.
First we’ll use the Subset Columns Transform to select only the columns we need for this analysis.
Harness the Power of SQL
Of the many Transforms that Teraport provides for feature engineering, the most versatile is the SQL Select Transform. We’ll write some SQL to recategorize loans with status “In Grace Period” to “Not Delinquent”.
After you run the SQL Select Transform operation, check the output summary stats and you’ll see that there’s a new column, loan_status, with possible values “Delinquent” and “Not Delinquent”. We just engineered a new feature that lets us reframe a multiclass classification problem as a binary classification problem. Nice, right?
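For readers who want to see the idea outside of Teraport, here is a minimal sketch of the recategorization using Python’s built-in sqlite3 module. The table name staging, the loan_id column and the sample rows are assumptions made up for illustration; only the status values and the loan_status column come from the walkthrough above. The SQL you write inside the SQL Select Transform may look different.

```python
import sqlite3

# Hypothetical staging rows: (loan_id, status). The table name and loan_id
# are illustrative assumptions, not Teraport's actual schema.
rows = [
    (1, "Delinquent"),
    (2, "In Grace Period"),
    (3, "Not Delinquent"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (loan_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)

# Fold "In Grace Period" into "Not Delinquent"; other values pass through.
query = """
SELECT loan_id,
       status,
       CASE WHEN status = 'In Grace Period' THEN 'Not Delinquent'
            ELSE status
       END AS loan_status
FROM staging
ORDER BY loan_id
"""
recategorized = conn.execute(query).fetchall()
conn.close()

for row in recategorized:
    print(row)
```

A CASE expression keeps the original status column intact and adds loan_status next to it, so the raw value is still available if you need it later.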
A Common Pattern in Data Analysis
Many data analysis problems involve the application of a split-apply-combine strategy, where the data is broken up into manageable pieces. Each piece is then operated on independently and put back together.
In our case, we want to sum the revolving_credit_balance for each type of loan_status and append the result to our reporting table. We could have written SQL to do this too, but the Split Apply Combine Transform works just as well!
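To make the split-apply-combine pattern concrete, here is a rough sketch in plain Python. It is not how the Split Apply Combine Transform is implemented; the helper name append_group_sum, the output column total_balance_by_status and the sample loans are all hypothetical.

```python
from collections import defaultdict


def append_group_sum(rows, group_key, value_key, out_key):
    """Sum value_key within each group_key group and append that group
    total to every row under out_key. Returns new dicts; input is untouched."""
    totals = defaultdict(float)
    for row in rows:  # split the rows by group and apply the sum
        totals[row[group_key]] += row[value_key]
    # combine: attach each group's total back onto its rows
    return [{**row, out_key: totals[row[group_key]]} for row in rows]


# Illustrative data only.
loans = [
    {"loan_status": "Delinquent", "revolving_credit_balance": 1000.0},
    {"loan_status": "Not Delinquent", "revolving_credit_balance": 3000.0},
    {"loan_status": "Not Delinquent", "revolving_credit_balance": 1000.0},
]

with_totals = append_group_sum(
    loans, "loan_status", "revolving_credit_balance", "total_balance_by_status"
)
for row in with_totals:
    print(row)
```

Every row keeps its own balance and gains the total for its group, which is what the next step needs in order to compute a per-loan weight.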
Arithmetic on Columns
Next our data pipeline has to calculate the weight of each loan’s revolving credit balance and use that to determine each loan’s weighted average credit score. This can be done with two Arithmetic Combine Transforms.
The first adds a new column named
The second adds a new column named
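As a sketch of the arithmetic involved, assume the first new column is the loan’s share of its group’s total balance and the second is that share multiplied by the credit score. The names balance_weight, total_balance_by_status and credit_score are hypothetical placeholders chosen for this example; weighted_avg_credit_score is the column the final step of the pipeline sums.

```python
def add_weighted_score(rows):
    """Add a per-loan weight and a weighted credit score to each row.

    Assumes total_balance_by_status is non-zero for every row.
    """
    out = []
    for row in rows:
        # Step 1: this loan's share of the balance in its loan_status group.
        weight = row["revolving_credit_balance"] / row["total_balance_by_status"]
        # Step 2: scale the credit score by that share.
        out.append({
            **row,
            "balance_weight": weight,
            "weighted_avg_credit_score": weight * row["credit_score"],
        })
    return out


# Illustrative data only; the group totals are taken as already computed.
loans = [
    {"loan_status": "Delinquent", "credit_score": 620,
     "revolving_credit_balance": 1000.0, "total_balance_by_status": 1000.0},
    {"loan_status": "Not Delinquent", "credit_score": 700,
     "revolving_credit_balance": 3000.0, "total_balance_by_status": 4000.0},
    {"loan_status": "Not Delinquent", "credit_score": 600,
     "revolving_credit_balance": 1000.0, "total_balance_by_status": 4000.0},
]

weighted = add_weighted_score(loans)
for row in weighted:
    print(row["loan_status"], row["balance_weight"], row["weighted_avg_credit_score"])
```

Within each group the weights add up to 1, so summing the weighted scores per group later yields a proper weighted average.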
Completing the Data Pipeline
The remaining Transform in our data pipeline has to sum up the weighted_avg_credit_score for on-time loans and delinquent loans. This sounds like another job for the Split Apply Combine Transform.
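The final aggregation can be sketched the same way. The helper sum_by_group and the sample rows below are hypothetical, and the per-row weighted scores are assumed to have been computed already as each loan’s balance share times its credit score.

```python
from collections import defaultdict


def sum_by_group(rows, group_key, value_key):
    """Return {group: sum of value_key over the rows in that group}."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


# Illustrative data only: weighted_avg_credit_score = balance share * score.
rows = [
    {"loan_status": "Delinquent", "weighted_avg_credit_score": 620.0},
    {"loan_status": "Not Delinquent", "weighted_avg_credit_score": 525.0},
    {"loan_status": "Not Delinquent", "weighted_avg_credit_score": 150.0},
]

result = sum_by_group(rows, "loan_status", "weighted_avg_credit_score")
print(result)
```

With these sample numbers the on-time group comes out at 675.0, which matches the direct formula (700 × 3000 + 600 × 1000) / 4000.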
We expect our reporting table to now contain nine columns, which we can confirm by peeking at the output summary.
We just demonstrated how to build a repeatable data pipeline in Teraport. This data pipeline produced engineered features that will be used for modeling in Loominus.
The Reporting Table Designer is your blank canvas to slice and dice samples of your data. Design the data pipeline for the reporting table once; then, as new data arrives in the staging table, the raw data will go through the data pipeline and come out clean and ready for modeling!
There’s a Transform for That
Teraport supports many out-of-the-box Transforms used to clean and reshape data for exploration, analysis and engineering new features for machine learning.
- Arithmetic Combine
- Binarizer Combine
- Coalesce Combine
- Cut Combine
- Drop Duplicates
- Drop Rows
- Impute Missing Values
- ISO Date Normalizer
- Numeric Type Conversion
- Outlier Combine
- Pivot and Melt
- Rename Columns
- Rows Predicate
- Split Apply Combine
- SQL Select
- Subset Columns