Hi there 👋

This is Jilong, a software engineer. I write down my learning notes.

DeepSeek R1 on M1 Max

The newest hype in the AI race is probably DeepSeek R1 - o1-comparable reasoning at a much lower cost. There are lots of discussions right now on how to use this model, such as using R1 to plan and 3.5 Sonnet to generate. Those are super cool ideas. The one idea I want to explore in this article is running R1 fully on device to see what the user experience looks like. ...
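For a sense of what “fully on device” could look like in practice, here is a minimal sketch that assumes the model is served locally through Ollama with a distilled tag such as deepseek-r1:7b; both the runtime and the model tag are illustrative assumptions, not necessarily what the article uses.

```python
# Assumes a local Ollama server and a pulled model, e.g. `ollama pull deepseek-r1:7b`.
# The runtime (Ollama) and the model tag are illustrative assumptions.
import ollama

response = ollama.chat(
    model="deepseek-r1:7b",
    messages=[{"role": "user", "content": "Explain in two sentences why the sky is blue."}],
)
print(response["message"]["content"])
```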

January 23, 2025 · 4 min · 709 words · Jilong Liao

Intentionally using AI for a month

TL;DR: Over the holiday I started to intentionally see how much AI can impact my life, so I incorporated as many AI tools as possible into my daily routine, both at work and at home. I wrote down some of the scenarios where I used AI and shared my experience and thoughts. When do I ask what, why or how? Asking “what, why and how” happens a lot for me. I search things like “what is flood insurance, and why do people buy it”. My typical workflow is having one Chrome tab open and another tab for research. I search from the address bar (yes, I like doing it), then open more tabs to dive deep. The number of tab and window switches adds up. After years, my tab switching is seamless, but I’d still like to avoid it. I used the ChatGPT desktop app on my Mac, Perplexity, etc. They save a lot of the tabs I would have to open when I do research, and that’s where most of the back-and-forth Q&A happens. ...

January 10, 2025 · 5 min · 907 words · Jilong Liao

Hunting Down a Go Channel Synchronization Bug

One thing I like about working at Cruise is the super interesting technical problem space. This article covers a recent bug hunt in my team. At the core of Cruise are the Data Infrastructure systems that provide all the source data collected from the autonomous vehicles. Multiple PBs of data flow in and out of this system every single day. Data is stored in a data lake with a certain partitioning and layout, and an n-way data merging service written in Go is responsible for reconstructing the data into the format the user is requesting. ...

January 1, 2025 · 5 min · 908 words · Jilong Liao

Scalable Timer Job in Azure WebJob

Our team operates a collection of services that heavily rely on Azure App Service. In addition, there are more than 60 periodic jobs running in the Azure App Services to do all kinds of tasks. For example, one task uploads metadata documents in Parquet format to Azure Data Lake Storage so that the data pipeline can process the data. Some jobs run faster than others on each run, and some jobs run more frequently while others do not. We use the Azure WebJobs SDK to build those jobs and deploy them in the Azure App Service. ...

January 12, 2020 · 7 min · 1282 words · Jilong Liao

Splitting an ASP.NET Core Monolithic Service

Recently our team worked on a project to split a giant monolithic service into a few self-contained service areas. I did one area and learned a lot from the experience, which I plan to share in this blog. The original monolithic service our team has is a combination of 6 different areas. We follow the ASP.NET Core pattern to build component services that the different areas can re-use like a function call. To be honest, I really like the simplicity of dependency injection and calling a function from a shared class. However, we found the entire thing getting bigger and bigger, and different areas have different workloads as well. So we have seen a few severity 1 incidents where one REST API call took the entire service down. Small incidents were happening every week, so the productivity of the team dropped as well. In addition, it’s difficult to maintain a code base where people don’t know where to put their code, because different areas are contributed by a few teams with diverse patterns and folder structures. Finally, the developer experience sucks due to the slowness of everything, from opening the solution in Visual Studio to build, debug and deployment. So we made the call to split the giant thing so that everybody is happier. ...

September 3, 2019 · 7 min · 1446 words · Jilong Liao

Spark Performance Optimization

At the Spark+AI conference this year, Daniel Tomes from Databricks gave a deep-dive talk on Spark performance optimizations. After watching it, I felt it was super useful, so I decided to write down some important notes that address the most common performance issues from his talk. Here is the YouTube video in case you are interested. Partitions We often run into situations where partitioning is not optimal at different stages of our workflow, which slows down the entire job significantly. For example, six months ago I tried to analyze some telemetry data exported from Application Insights, but there were way too many JSON files (> 100,000 files) and each file was small (< 1MB each). This made a groupBy stage take an hour to finish on 8 machines. If this were a one-time workflow, I’d be okay not optimizing it. But it’s not. ...
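As a rough illustration of the small-files problem (a minimal PySpark sketch with made-up paths, column names and partition counts, not the exact pipeline from the talk or from my workflow), compacting the input into fewer, well-sized partitions before the shuffle is usually the first fix:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-tuning").getOrCreate()

# Hypothetical input: >100,000 tiny JSON files exported from Application Insights.
df = spark.read.json("/telemetry/app-insights-export/*.json")

# Tiny input files produce far too many tiny partitions; compact them first.
# 200 is only illustrative; aim for roughly 128-200 MB per partition.
compacted = df.repartition(200)

# The groupBy now shuffles a manageable number of well-sized partitions.
counts = compacted.groupBy("operation_Name").count()
counts.write.mode("overwrite").parquet("/telemetry/operation-counts")
```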

July 15, 2019 · 5 min · 884 words · Jilong Liao

Broadcast Variable in Spark SQL

Over the holiday I spent some time making progress on moving one of my machine learning projects into Spark. An important piece of the project is a data transformation library with pre-defined functions. The original implementation uses pandas DataFrames and runs on a single machine. As our data has grown much bigger, sometimes we have to use a giant 512GB-memory Azure VM to run the transformation, and either it takes a long time to run the entire transformation or I have to chunk the data and transform it in batches (which is not a good idea for column-based transformations such as feature normalization). Another blocking issue is that the intermediate memory consumption can be really high, up to 10x of the original data size. ...
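On the broadcast part specifically (a minimal PySpark sketch with hypothetical table and column names, not the article’s actual transformation library), broadcasting a small lookup table ships a copy of it to every executor so the join avoids shuffling the large table:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

# Hypothetical data: a large fact table and a small per-feature lookup table.
events = spark.read.parquet("/data/events")                 # large
feature_stats = spark.read.parquet("/data/feature_stats")   # small, fits in memory

# broadcast() hints Spark to copy the small table to every executor,
# turning the join into a map-side join with no shuffle of the large table.
joined = events.join(broadcast(feature_stats), on="feature_id", how="left")
joined.write.mode("overwrite").parquet("/data/events_with_stats")
```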

January 18, 2019 · 4 min · 665 words · Jilong Liao

Reduce Docker Image Size for Machine Learning

In my previous blog, I proposed a way to easily run large-scale machine learning tasks in the cloud using Docker containers and Azure Batch. I also use this approach at work for some of my projects. One thing I started realizing is that the size of the container image can grow very quickly as we add more functionality into the ML training task. Using open source tools such as scikit-learn, nltk, etc. brings additional dependencies into the container image. For example, some of us may use Miniconda, but it can easily introduce a few hundred MBs into the Docker container image. The Ubuntu 16.04 base image is about 120MB, and very quickly I started seeing my container image size go beyond 1GB, then 3GB after installing some other tools. ...

August 9, 2018 · 5 min · 930 words · Jilong Liao

Azure Batch, Containers and Machine Learning

I often encounter a situation where I need a big virtual machine to run some Python script to train a machine learning model because the training dataset is bigger than my machine’s memory. Most of the time, I would go to the Azure portal, create a big VM, run it and tear it down, because it is too expensive to leave the big VM running all the time. It is just too much overhead for the whole process, and it becomes very costly if I want to run multiple experiments. ...

May 28, 2018 · 5 min · 1008 words · Jilong Liao

Fun at Work: Short Name Service

My team at work runs an online service where a user can query their data by multiple dimension values. For example, the user wants to see a data visualization based on Dimension1=A, Dimension2=B and so on. We used to have a fixed 10 to 20 dimensions, so we put them in the URL hash when the user comes to the page to visualize their data. Usually, you will see a URL like this: ...
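To make the short-name idea concrete (a hypothetical Python sketch with an in-memory store; this is not the service’s actual implementation), the core of such a service is mapping a long dimension string to a short, stable key and resolving it back on request:

```python
import hashlib

# Hypothetical in-memory store; a real service would use a durable key-value store.
_short_to_long: dict[str, str] = {}

def shorten(dimension_query: str, length: int = 8) -> str:
    """Map a long dimension string to a short, stable key."""
    key = hashlib.sha256(dimension_query.encode("utf-8")).hexdigest()[:length]
    _short_to_long[key] = dimension_query
    return key

def resolve(short_key: str) -> str:
    """Return the original dimension string for a previously issued short key."""
    return _short_to_long[short_key]

# Example: a long URL hash of dimensions becomes a compact short name.
long_hash = "Dimension1=A&Dimension2=B&Dimension3=C&Dimension4=D"
short = shorten(long_hash)
print(short, "->", resolve(short))
```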

February 22, 2018 · 5 min · 944 words · Jilong Liao