Running a data importer in Azure Container Instances
4 November 2022
Recently I’ve been working on a side project to display UK housing data: https://housepricewatch.com. The UK government publicly provides a history of every house sale going back to 1995. This is really cool data so I wanted to summarize it and stick it into some charts for insights about the UK housing market. I had to write a data importer.
Most of what back-end systems do is move data from one place to another. Ingesting data from a third party, transforming it and storing it in your own database is one of the most common jobs there is. There are loads of ETL tools available for this kind of work but sometimes it’s just easier to write your own.
There are a few considerations when writing a data importer: How often does it need to run? How big is the dataset and what transformations are going to be applied? These types of questions influence how we write the importer.
How often does it need to run?
The UK publishes a new housing data file every month. So it wouldn’t be disastrous to run it locally on my machine every month… But I always try to automate my side projects as much as I can. Hosting it in the cloud is the best option. It only runs once a month so it needs something serverless to minimize costs.
How big is the dataset and what transformations are going to be applied?
The dataset is 4.5GB. The transformation is to combine the individual house sales into monthly averages. Because of the size, doing this with an event-based model like Azure Functions could become complicated. It would be much easier to load all of the data into memory.
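The monthly-average transformation itself can be sketched in a few lines. This is a simplified illustration, assuming a CSV with the sale price in column 2 and a yyyy-mm-dd date in column 3 — the real Price Paid file has more columns, so the indices would differ:

```shell
# Sample rows in the simplified shape described above
cat > sales.csv <<'EOF'
"{GUID-1}",250000,2022-01-15
"{GUID-2}",350000,2022-01-20
"{GUID-3}",400000,2022-02-01
EOF

# Group sales by year-month and print the average price per month
awk -F',' '{ m = substr($3, 1, 7); sum[m] += $2; n[m]++ }
           END { for (k in sum) printf "%s %.0f\n", k, sum[k] / n[k] }' sales.csv | sort
# → 2022-01 300000
# → 2022-02 400000
```

The importer does the same grouping in memory over the full file, which is why loading the whole dataset at once is the simpler model here.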
Simple and cheap is what this importer needs. A Console App hosted as a serverless container in Azure Container Instances sounds pretty good to me.
A bit about Azure Container Instances
Hosting containers can come with two big overheads: Provisioning the infrastructure and orchestrating the containers. Large applications with 100+ containers need orchestration software like Kubernetes. Kubernetes makes sense for large applications but is way too much overhead if you only need to run a couple of containers.
Azure container instances logo
Azure container instances offer a very simple serverless solution for running containers. It’s perfect for running a single container data importer. There’s no infrastructure to provision or complex orchestration to set up. Just give it the container and it will run it.
It can run long-running containers that handle HTTPS requests to host a website. Or single-burst containers that automatically stop when they finish, saving money.
Create the container image
To run the importer as a container we need to package it as a container image. A container image contains all the stuff needed to create the container, including the app itself. Images are immutable so every version of the application needs to be packed into a new version of the image. It’s kind of like NPM or NuGet packages.
To create an image of our importer console app we need to add a dockerfile. This contains the instructions to create the image. There are lots of different instructions for all types of applications. I’ve just used the one for dotnet 6.0 apps from the Microsoft docs:
FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build-env
WORKDIR /App

# Copy everything
COPY . ./
# Restore as distinct layers
RUN dotnet restore
# Build and publish a release
RUN dotnet publish -c Release -o out

# Build runtime image
FROM mcr.microsoft.com/dotnet/aspnet:6.0
WORKDIR /App
COPY --from=build-env /App/out .
ENTRYPOINT ["dotnet", "TestApp.dll"]
Works perfectly for our simple console app. You just need to replace “TestApp.dll” on the last line with the name of your application. And place it right next to the csproj like this:
File structure of my project showing the dockerfile right next to the csproj file.
Now, this command will build the image and store it locally for you:
docker build -t test-app -f Dockerfile .
And this command will start a container from your image:
docker run test-app
Let’s try it:
Publish the container
Azure Container Instances doesn’t store container images. The container instance loads images from a container registry. For this example, we’re going to keep things in Azure and use the Azure Container Registry.
The Azure CLI can push an image from our local machine to the registry. Let’s try and push our test app.
First log into Azure:
az login
Then log into your registry:
az acr login --name yourcontainerregistryname
Then back to the Docker CLI to tag the container image with the registry URL:
docker tag test-app yourcontainerregistryname.azurecr.io/test-app
And finally, docker push to upload the image to the registry:
docker push yourcontainerregistryname.azurecr.io/test-app
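Since images are immutable, each new build of the importer really wants its own tag rather than overwriting the default latest. A sketch of pushing an explicit version alongside it (the version number here is just an example):

```shell
# Tag the local image with an explicit version and push that tag too
docker tag test-app yourcontainerregistryname.azurecr.io/test-app:1.0.0
docker push yourcontainerregistryname.azurecr.io/test-app:1.0.0
```

This makes it easy to roll the container instance back to a known-good image later.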
The container image is now uploaded :)
The Azure portal doesn’t reveal much about the images pushed to the registry, so we need the CLI to see them. Note that docker image ls only shows images on your local machine; to list what’s actually in the registry, use the Azure CLI:
az acr repository list --name yourcontainerregistryname --output table
Set up the Container Instance
A container instance cannot change its container image or registry. So it’s important to do the previous steps first.
Here’s the form to create the resource in the Azure Portal:
The instance needs access to your registry. For the Azure Container Registry, this means enabling the admin user, which is found under the registry’s ‘Access Keys’ menu.
The ‘Azure Container Registry’ image source option needs a service principal with the ‘image pull’ permission to work. For our purposes, it’s simpler to just use the username and password option.
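The same container group can also be created from the CLI with az container create. A sketch, assuming a resource group named my-rg and an instance named housing-importer (both names, the sizing, and the password placeholder are illustrative, not from the portal form above):

```shell
# Sketch: create the container group from the CLI (names and sizes are placeholders)
az container create \
  --resource-group my-rg \
  --name housing-importer \
  --image yourcontainerregistryname.azurecr.io/test-app:latest \
  --registry-username yourcontainerregistryname \
  --registry-password <admin-user-password> \
  --restart-policy Never \
  --cpu 2 --memory 8
```

A restart policy of Never suits a single-burst importer: the container runs once to completion and stops instead of being restarted, and the memory is sized generously because the importer loads the whole dataset.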
The container instance pulls and executes the container image straight away. And sure enough, from the logs under the Containers section, we can see our container has run!
The container running in azure container instances
Triggering the import
The importer is happy as a container living in container instances but it won’t trigger itself. We need an external trigger to tell the container group to start and perform the import.
A PowerShell Azure Function app could trigger the import, but I prefer something simpler: Logic Apps. Logic Apps have a built-in Container Instances integration that can perform any operation you want. Hook this up to a Logic App schedule trigger and it should do exactly what we want.
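Whatever does the scheduling, the operation itself boils down to starting the existing container group. For testing by hand, the same thing can be done from the CLI (the resource group and instance names here are placeholders):

```shell
# Manually trigger the import by starting the stopped container group
az container start --resource-group my-rg --name housing-importer
```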
Here’s what it looks like to trigger the importer at 6am every day:
Logic App setup to trigger the container
And that’s it! We have an automated serverless data importer running to keep my side project up to date. After it’s been running a while I’ll add a breakdown of how much it’s costing, so bookmark this page :)