In a previous post, we talked about how you can use Loominus Teraport’s GUI Designer to build a powerful data pipeline that transforms raw data and engineers new features for machine learning. In this post, we’ll discuss how you can use the Teraport API for event ingestion to collect and store data for analytics in near real time. The API is tuned for streaming data and is well suited to use cases involving increasing volumes and varieties of data from numerous disparate sources such as social media, web, mobile and sensors (IoT data).



Streaming Ingestion

Unlike batch processing, stream processing deals with data that flows continuously, with events occurring frequently and close together in time. Businesses need to be able to derive insights from such events in seconds or minutes instead of hours or days.

The continuous stream of data poses two big challenges to organizations:

  1. keeping up with events that arrive faster than they can be ingested, and
  2. storing huge amounts of data efficiently for analytics

Loominus maximizes ingestion throughput by allowing events to be streamed untouched directly into Teraport; any transformations that need to run on the raw data execute asynchronously, with the processing offloaded to worker nodes. The transformed data are stored in the columnar Parquet format for query efficiency.

With Teraport, the only thing you need to concern yourself with is sending JSON-encoded events to an HTTPS endpoint.
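For example, sending a single event is just an HTTPS POST with a JSON body. In this minimal sketch the endpoint URL and the event fields are placeholders rather than the actual Teraport URL or schema; only the idea of POSTing a JSON document with an API key header comes from the platform.

```python
import requests

# Placeholder endpoint: substitute the ingestion URL for your Teraport instance.
TERAPORT_ENDPOINT = "https://teraport.example.com/api/events"

# Any JSON-encoded event will do; these fields are purely illustrative.
event = {
    "device_id": "sensor-42",
    "temperature_c": 21.7,
    "recorded_at": "2021-03-01T12:00:00Z",
}

resp = requests.post(
    TERAPORT_ENDPOINT,
    json=event,
    headers={"X-API-Key": "YOUR_API_KEY"},  # API key from the Loominus Platform
    timeout=10,
)
resp.raise_for_status()
```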

Example Use Case: Web Scraping

In this exercise, we’ll write some Python code to scrape public stock data from MarketWatch and send it to Teraport so we can establish our data pipeline. We’ll design our data pipeline to derive a couple of new features that capture a stock’s percentage off its 52-week high or low. For example, consider a stock that in the last year traded as high as $45, as low as $25, and is currently trading at $35. This means the stock is trading 22% (1 – (35/45)) below its 52-week high and 40% ((35/25) – 1) above its 52-week low.

Here’s our web scraping code.
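What follows is a minimal sketch of that scraper. The MarketWatch screener URL, the assumption that the first HTML table on the page holds the results, and the column handling are illustrative guesses you would adapt to the live page; only the JSON event format and the X-API-Key header (see below) come from Teraport.

```python
import io
import json

import pandas as pd
import requests

# Placeholder values: substitute your own Teraport ingestion endpoint and API key.
TERAPORT_ENDPOINT = "https://teraport.example.com/api/events"
API_KEY = "YOUR_API_KEY"

# MarketWatch screener for the most actively traded stocks (URL assumed).
SCREENER_URL = "https://www.marketwatch.com/tools/screener"


def scrape_most_active():
    """Return the screener results as a list of JSON-friendly dicts.

    pandas.read_html parses every <table> on the page; we assume the first
    table holds the screener rows, which may change if the page layout does.
    """
    html = requests.get(
        SCREENER_URL, timeout=10, headers={"User-Agent": "Mozilla/5.0"}
    ).text
    table = pd.read_html(io.StringIO(html))[0]
    # Normalize the column names into snake_case keys for the staging table.
    table.columns = [str(c).strip().lower().replace(" ", "_") for c in table.columns]
    # Round-trip through to_json so numpy scalars become plain JSON types.
    return json.loads(table.to_json(orient="records"))


def send_event(event):
    """POST one JSON-encoded event to the Teraport ingestion endpoint."""
    resp = requests.post(
        TERAPORT_ENDPOINT,
        json=event,
        headers={"X-API-Key": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    for row in scrape_most_active():
        send_event(row)
```

Each scraped row becomes one JSON event, so the columns of the staging table mirror the fields of the scraped table.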

To run this code yourself, you’ll need to log in to the Loominus Platform and get the Teraport API key to pass in the X-API-Key header.

As you’re running the code, you can refresh the Data Stream Dashboard to see a time-series plot of the number of events being streamed to Teraport over 15-second intervals. The scraped data starts appearing in its staging table about 15 minutes after the first events are sent (the default on Loominus Public). Checking the summary stats for the staging table marketwatch_stock_data, you’ll notice that we’ve ingested 25 rows and 31 columns. That’s plenty of data points to create our data pipeline.


Teraport API Streaming Data Ingestion

Recall from the previous post that to create a data pipeline, you first create a reporting table. (We also have videos on our Loominus AI YouTube channel that you can watch to familiarize yourself with the Loominus platform.)

Our data pipeline starts with an Input, which is the marketwatch_stock_data staging table. Note that we specify the table source as “api”.


Then we add a Sql Select Transform to derive our two new features: percent_below_52_week_high and percent_above_52_week_low.
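The exact SQL depends on the column names Teraport inferred from the scraped events; assuming the current price and the 52-week range landed in columns named price, fifty_two_week_low and fifty_two_week_high, the transform could look roughly like this:

```sql
SELECT
  *,
  1 - (price / fifty_two_week_high) AS percent_below_52_week_high,
  (price / fifty_two_week_low) - 1  AS percent_above_52_week_low
FROM marketwatch_stock_data
```

This mirrors the arithmetic from the earlier example: the $35 stock is 1 – (35/45) ≈ 22% below its high and (35/25) – 1 = 40% above its low, and these two new columns are what take the dataset from 31 to 33 columns.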


Click the Save button to persist the parameters for your Transform and then click the green play button on the Transform card to run the data pipeline in the Designer.

After the operation runs successfully, check the output summary stats and you’ll notice that our dataset now has 33 columns.


Next we’ll schedule our data pipeline to run automatically when new events are ingested. From the Designer screen, just click the name of the reporting table marketwatch_stock_report in the breadcrumbs near the top. This will take you to the reporting table’s Task Runs screen, where you’ll want to click the green clock icon near the upper right to get to the dialog window for configuring the data pipeline’s scheduler. Note that we specify the schedule as being “Event based” because we want our data pipeline to run whenever data is appended to its input, the marketwatch_stock_data staging table.


Click the Save button and confirm that we want to schedule our data pipeline. Now let’s rerun our scraping code to refresh our stream of the most actively traded stocks according to MarketWatch. After a few minutes, we’ll see a task run appear for the marketwatch_stock_report reporting table.


Summary

Teraport allows businesses to focus on their problems instead of boilerplate big data collection pipelines. Tuned for big data streaming ingestion, Teraport provides an API through which data can be streamed, stored, analyzed, and exposed to downstream systems.

About Loominus

Loominus is an end-to-end platform that helps teams ingest and stage data, build advanced machine learning models with no code, and deploy them into production. Loominus makes it easy for individuals and teams without experience building machine learning pipelines to take advantage of machine learning faster. Loominus is equally great for experienced data scientists who need to focus on model selection and tuning.

Use Loominus for Free

Help your business achieve machine learning success
