Data pipelines are where people who work with data spend most of their time, because the bulk of a machine learning project is data collection and cleaning. Loominus gives everyone the power to build the data pipelines critical to any machine learning project.

Teraport is a powerful tool within the Loominus product suite that ingests and stages data. In another post, we’ll discuss the data ingestion APIs. For now we’ll focus on building a powerful data pipeline for feature engineering.

We’re going to build a data pipeline that generates the average credit score of borrowers within a portfolio of loans. For added complexity, we’ll weight the credit score by each borrower’s outstanding revolving credit balance. Finally, we’ll group the loans by whether they are on time or delinquent and aggregate the weighted credit scores. The result will be a weighted average credit score for on-time loans and another for delinquent loans.

Teraport Reporting Table Designer

In Teraport, your raw data is collected into staging tables. The first step in building data pipelines with Teraport is to create a reporting table and design a data transformation pipeline that sources from a staging table.

Teraport Staging Table Designer.

Our staging table contains a status column that can be one of three values: “Delinquent”, “In Grace Period” or “Not Delinquent”. A status of “In Grace Period” means the borrower made the monthly payment within a 15-day grace period of the due date.

Teraport displays the categorical data fields.

Subsetting Data

First we’ll use the Subset Columns Transform to select only the columns we need for this analysis.

Teraport Subset Transform

Harness the Power of SQL

Of the many Transforms that Teraport provides for feature engineering, the most versatile is the SQL Select Transform. We’ll write some SQL to recategorize loans with status “In Grace Period” to “Not Delinquent”.

Teraport SQL Transform is super versatile.

After you run the SQL Select Transform operation, check the output summary stats and you’ll see that there’s a new column, loan_status, with possible values “Delinquent” and “Not Delinquent”. We just engineered a new feature that lets us reframe a multiclass classification problem as a binary classification problem. Nice, right?

Teraport makes feature engineering easy.
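The post does not show the SQL itself, so here is a minimal sketch of the kind of statement that could do the recategorization, run against an in-memory SQLite table from Python. The table name staging, the loan_id column and the sample rows are assumptions for illustration; only the status values and the loan_status column name come from the walkthrough above, and the SQL dialect Teraport accepts may differ.

```python
import sqlite3

# Hypothetical staging rows: (loan_id, status). Only the three status
# values are taken from the post; the ids are made up.
rows = [
    (1, "Delinquent"),
    (2, "In Grace Period"),
    (3, "Not Delinquent"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (loan_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)

# Fold "In Grace Period" into "Not Delinquent" and leave every other
# status unchanged, exposing the result as a new loan_status column.
query = """
SELECT loan_id,
       CASE WHEN status = 'In Grace Period' THEN 'Not Delinquent'
            ELSE status
       END AS loan_status
FROM staging
ORDER BY loan_id
"""
recategorized = conn.execute(query).fetchall()
conn.close()

for loan_id, loan_status in recategorized:
    print(loan_id, loan_status)
```

A CASE expression keeps the mapping explicit, so any status value that appears later passes through untouched instead of being silently relabeled.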

A Common Pattern in Data Analysis

Many data analysis problems involve the application of a split-apply-combine strategy, where the data is broken up into manageable pieces. Each piece is then operated on independently and put back together.

In our case, we want to sum the revolving_credit_balance for each type of loan_status and append the result to our reporting table. We could have written SQL to do this too, but the Split Apply Combine Transform works just as well!

Teraport Split-Apply-Combine Transform: divide and conquer!
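To make the split-apply-combine idea concrete outside of Teraport, the sketch below sums revolving_credit_balance per loan_status and appends the group total to every row. It is plain Python, not Teraport's implementation; the helper append_group_sum, the output column total_revolving_credit_balance and the sample balances are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows; the balances are made up for illustration.
loans = [
    {"loan_status": "Not Delinquent", "revolving_credit_balance": 1000.0},
    {"loan_status": "Not Delinquent", "revolving_credit_balance": 3000.0},
    {"loan_status": "Delinquent", "revolving_credit_balance": 2000.0},
]


def append_group_sum(rows, group_col, value_col, out_col):
    """Return copies of rows with the per-group sum of value_col added."""
    # Split and apply: accumulate one running total per group value.
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_col]] += row[value_col]
    # Combine: attach each row's group total as a new column.
    return [{**row, out_col: totals[row[group_col]]} for row in rows]


with_totals = append_group_sum(
    loans,
    group_col="loan_status",
    value_col="revolving_credit_balance",
    out_col="total_revolving_credit_balance",  # hypothetical column name
)

for row in with_totals:
    print(row)
```

Appending the total to each row, rather than collapsing to one row per group, is what lets a later per-row arithmetic step divide each balance by its group total.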

Arithmetic on Columns

Next our data pipeline has to calculate the weight of each loan’s revolving credit balance and use that to determine each loan’s weighted average credit score. This can be done with two Arithmetic Combine Transforms.

The first adds a new column named weighted_revolving_credit_balance.

Teraport Arithmetic Combine Transform.

The second adds a new column named weighted_avg_credit_score.

Teraport Arithmetic Combine Transform.
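The exact operands of the two transforms are not spelled out in the text, so the following is a sketch under one plausible reading: the first step divides each balance by its loan_status group total to get a weight, and the second multiplies that weight by the credit score. The helper arithmetic_combine and the columns credit_score and total_revolving_credit_balance are assumptions; the two output column names are the ones used above.

```python
import operator

# Hypothetical rows that already carry the per-group balance total.
rows = [
    {"loan_status": "Not Delinquent", "credit_score": 700,
     "revolving_credit_balance": 1000.0, "total_revolving_credit_balance": 4000.0},
    {"loan_status": "Not Delinquent", "credit_score": 740,
     "revolving_credit_balance": 3000.0, "total_revolving_credit_balance": 4000.0},
    {"loan_status": "Delinquent", "credit_score": 600,
     "revolving_credit_balance": 2000.0, "total_revolving_credit_balance": 2000.0},
]

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def arithmetic_combine(rows, left, op, right, out_col):
    """Return copies of rows with out_col = row[left] <op> row[right]."""
    fn = OPS[op]
    return [{**row, out_col: fn(row[left], row[right])} for row in rows]


# Step 1 (assumed): weight = balance / group total.
step1 = arithmetic_combine(
    rows, "revolving_credit_balance", "/",
    "total_revolving_credit_balance", "weighted_revolving_credit_balance",
)
# Step 2 (assumed): per-loan contribution = weight * credit score.
step2 = arithmetic_combine(
    step1, "weighted_revolving_credit_balance", "*",
    "credit_score", "weighted_avg_credit_score",
)

for row in step2:
    print(row["loan_status"], row["weighted_revolving_credit_balance"],
          row["weighted_avg_credit_score"])
```

Under this reading, the weights within each loan_status group sum to 1, which is what makes the later per-group sum a true weighted average.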

Completing the Data Pipeline

The remaining Transform in our data pipeline has to sum up the weighted_avg_credit_score for on-time loans and delinquent loans. This sounds like another job for the Split Apply Combine Transform.

Teraport Split-Apply-Combine Transform in action.
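As a sanity check on why summing works, the sketch below adds up hypothetical per-loan weighted_avg_credit_score contributions for each loan_status and compares the result with the textbook weighted average, sum(score * balance) / sum(balance). The sample values and the credit_score column are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical rows: each weighted_avg_credit_score is
# credit_score * (balance / group balance total), computed upstream.
rows = [
    {"loan_status": "Not Delinquent", "credit_score": 700,
     "revolving_credit_balance": 1000.0, "weighted_avg_credit_score": 175.0},
    {"loan_status": "Not Delinquent", "credit_score": 740,
     "revolving_credit_balance": 3000.0, "weighted_avg_credit_score": 555.0},
    {"loan_status": "Delinquent", "credit_score": 600,
     "revolving_credit_balance": 2000.0, "weighted_avg_credit_score": 600.0},
]

# Pipeline route: sum the per-loan contributions within each group.
summed = defaultdict(float)
for row in rows:
    summed[row["loan_status"]] += row["weighted_avg_credit_score"]
result = dict(summed)

# Direct route: sum(score * balance) / sum(balance) per group.
numerator = defaultdict(float)
denominator = defaultdict(float)
for row in rows:
    status = row["loan_status"]
    numerator[status] += row["credit_score"] * row["revolving_credit_balance"]
    denominator[status] += row["revolving_credit_balance"]
direct = {status: numerator[status] / denominator[status] for status in numerator}

print(result)
print(direct)
```

Both routes agree on this sample because each contribution already has the group total in its denominator, so the final step only needs a sum.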

We expect our reporting table to now contain nine columns, which we can confirm by peeking at the output summary.

We're done!

Repeatable Process

We just demonstrated how to build a repeatable data pipeline in Teraport. This data pipeline resulted in engineered features that will be used for modeling in Loominus.

The Reporting Table Designer is your blank canvas to slice and dice samples of your data. Design the data pipeline for the reporting table once; then, as new data arrives in the staging table, the raw data will go through the data pipeline and come out clean and ready for modeling!

There’s a Transform for That

Teraport supports many out-of-the-box Transforms for cleaning and reshaping data for exploration, analysis and engineering new features for machine learning.

  • Arithmetic Combine
  • Binarizer Combine
  • Coalesce Combine
  • Cut Combine
  • Drop Duplicates
  • Drop Rows
  • Impute Missing Values
  • ISO Date Normalizer
  • Numeric Type Conversion
  • Outlier Combine
  • Pivot
  • Pivot and Melt
  • Rename Columns
  • Rows Predicate
  • Split Apply Combine
  • SQL Select
  • Subset Columns

About Loominus

Loominus is an end-to-end platform that helps teams ingest and stage data, build advanced machine learning models with no code and deploy them into production. Loominus makes it easy for individuals and teams without experience building machine learning pipelines to take advantage of machine learning faster. Loominus is equally great for experienced data scientists who need to focus on model selection and tuning.
