System Design Stories: JioCinema and Hotstar

Shambhavi Shandilya
7 min read · May 27, 2024

--

For the past few days, I have been consuming content related to real-world system design problems. In this write-up, I summarize my learnings and the key highlights of that content.

TLDR: Taming Traffic Spikes in Live Streaming

  • Know your users: Focus on core features (streaming, ads) and prioritize the user journey.
  • Informative errors: Show clear, helpful error messages and suppress unnecessary ones.
  • Plan ahead: Load test, secure resources, and have breakpoints defined.
  • Be independent: Build internal solutions to reduce reliance on external resources.
  • Smart scaling: Consider custom scaling based on real-time needs.
  • Prepare for surges: Anticipate traffic spikes from external events (player performance) and internal marketing efforts.

📈 JioCinema

Reference: https://www.youtube.com/watch?v=36N1Bz7qW0A

JioCinema is an Indian over-the-top (OTT) media streaming service offering video streaming services. Since 2023, JioCinema has held the exclusive digital rights to broadcast the Indian Premier League (IPL) in India. JioCinema offers features specifically catering to the IPL experience. These include high-definition streaming (up to 4K), multiple language options, and interactive features like a dedicated fan feed and various camera angles.

This video discusses some of the design decisions JioCinema made to handle more than 25 million concurrent users while streaming the IPL.

📌 Understanding User Journey

The discussion dives into understanding how users navigate the application to prioritize features effectively.

The key is separating features crucial to the core experience from those “nice-to-haves.”

For example, during peak usage, when resources are stretched thin, personalized recommendations on the home page might be temporarily paused to ensure the smooth operation of essential features like video streaming and ad delivery. Feature flags are used to perform these operations.
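To make the idea concrete, here is a minimal feature-flag sketch in Python. It is purely illustrative (JioCinema's actual flag system is not public): non-critical features are gated behind flags that can be flipped off in one step when the system enters peak mode.

```python
# Illustrative feature-flag sketch; feature names and the flag store are invented.
CRITICAL = {"video_stream", "ad_delivery"}

flags = {
    "video_stream": True,
    "ad_delivery": True,
    "personalized_recommendations": True,
    "fan_feed": True,
}

def enter_peak_mode():
    """Disable every non-critical feature, leaving the core experience untouched."""
    for feature in flags:
        if feature not in CRITICAL:
            flags[feature] = False

def is_enabled(feature):
    return flags.get(feature, False)

enter_peak_mode()
```

In a real system the flag store would live in a config service so operators can flip flags without a deployment; the principle of separating "core" from "nice-to-have" stays the same.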

Another important point is the concept of graceful degradation: providing users with clear and helpful error messages when things go wrong. Imagine encountering an error message saying, "DB isn't responding." Not very informative, and it exposes internal details the user doesn't need.

Additionally, prioritizing critical functionalities and isolating non-critical failures is key. For instance, an error in a malfunctioning sticker pack shouldn't prevent a user from viewing a match. By following these strategies, developers can create a resilient user experience even when the system is under pressure.
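The failure-isolation idea above can be sketched as independently rendered widgets, where a non-critical failure is mapped to a friendly message while a core failure still surfaces. The widget names and messages here are hypothetical, not JioCinema's actual code.

```python
# Sketch of graceful degradation: each widget renders independently, and a
# non-critical failure becomes a clear, user-friendly message instead of a
# raw internal error. All names here are invented for illustration.

FRIENDLY_ERRORS = {
    "TimeoutError": "This is taking longer than usual. Please try again.",
}

def render_widget(name, renderer, critical=False):
    try:
        return renderer()
    except Exception as exc:
        if critical:
            raise  # a broken core feature must surface, not be hidden
        # Non-critical failure: degrade gracefully with a helpful message.
        return FRIENDLY_ERRORS.get(type(exc).__name__,
                                   f"The {name} feature is temporarily unavailable.")

def broken_sticker_pack():
    raise RuntimeError("sticker asset 404")  # simulated non-critical failure

page = {
    "stream": render_widget("stream", lambda: "live video", critical=True),
    "stickers": render_widget("stickers", broken_sticker_pack),
}
```

Note that the sticker failure never propagates: the match stream still renders, which is exactly the isolation property the paragraph describes.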

📌 Warming Up

In the weeks leading up to the IPL season, JioCinema engineers conduct rigorous load testing. This process simulates real-world user traffic patterns and helps identify potential bottlenecks within the system. By pinpointing areas that might struggle under heavy load, engineers can proactively address these issues before they impact user experience. Additionally, breakpoints are defined for various services. These breakpoints represent the traffic levels at which critical functionalities may begin to degrade.
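A breakpoint, in this sense, is the traffic level at which a service's latency (or error rate) crosses its objective. The toy sketch below ramps simulated load against a stubbed service and records the first level at which latency exceeds an SLO; the latency model and all numbers are invented for illustration.

```python
# Toy breakpoint-finding sketch: the latency curve is a made-up stub, standing
# in for measurements a real load test would collect.

def simulated_latency_ms(concurrent_users, capacity=1_000_000):
    """Stub: latency is flat at low load and grows sharply near capacity."""
    load = concurrent_users / capacity
    return 50 if load < 0.8 else 50 * (1 + 10 * (load - 0.8))

def find_breakpoint(slo_ms=100, step=100_000, max_users=2_000_000):
    """Return the first simulated user count whose latency breaches the SLO."""
    for users in range(step, max_users + 1, step):
        if simulated_latency_ms(users) > slo_ms:
            return users
    return None  # service held its SLO across the tested range

breakpoint_users = find_breakpoint()
```

In practice the curve comes from replaying realistic traffic patterns (tools like Locust or Gatling are common choices), but the output is the same kind of number: a defined threshold at which each service is known to degrade.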

While autoscaling, the automatic addition of resources to handle the increased load, is valuable, it might not provide an immediate solution during peak traffic periods. To address this, JioCinema engineers take a proactive approach by procuring additional resources (instances, CDN traffic etc.) well in advance of the IPL. This ensures that sufficient resources are readily available to handle the anticipated surge in traffic.

To safeguard against potential database outages during peak traffic, snapshots of core data are created regularly. These snapshots act as a readily available backup, allowing swift restoration of critical data in case of an outage. Traditional database backups are also maintained. While databases are harder to scale than stateless components, comprehensive backups provide an additional layer of protection against data loss and help ensure service continuity in the event of an unforeseen issue.

📌 Miscellaneous

Another interesting incident was troubleshooting a basic issue: "Users could not open the app." At first glance, this might appear to result from slow internet or a problem specific to a single user. However, upon closer examination through debugging, the engineers discovered a deeper problem. The root cause was an incompatibility with specific DNS resolvers provided by certain internet service providers (ISPs); users relying on those resolvers could not access the JioCinema app. The engineers therefore built an in-house DNS resolution mechanism to avoid complete dependency on external resources.
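One simple form this idea can take is a client-side fallback: try normal DNS resolution first, and if it fails, fall back to a lookup table shipped with the app. This is a guess at the shape of the fix, not JioCinema's actual implementation, and the hostname and IP below are reserved placeholder values.

```python
# Sketch of reducing dependency on ISP DNS resolvers. The ".invalid" TLD is
# guaranteed never to resolve (RFC 2606), and 203.0.113.10 is a
# documentation-range placeholder IP, not a real endpoint.
import socket

FALLBACK_IPS = {
    "api.streaming.invalid": "203.0.113.10",
}

def resolve(hostname):
    """Resolve via the system resolver, falling back to a bundled table."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        ip = FALLBACK_IPS.get(hostname)
        if ip is None:
            raise  # no fallback known for this host
        return ip

addr = resolve("api.streaming.invalid")
```

A production version would more likely query an alternate resolver (e.g. DNS-over-HTTPS) than a static table, but the principle is the same: don't let one external dependency be a single point of failure.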

📈 Hotstar

Reference: https://www.youtube.com/watch?v=9b7HNzBB3OQ

Hotstar, now known as Disney+ Hotstar, is a major player in video streaming, particularly in India and Southeast Asia. Hotstar is particularly renowned for its live sports streaming, especially cricket. By offering a combination of local and international content, live sports, and a user-friendly subscription model, Disney+ Hotstar has established itself as a major force in the video streaming market.

[Figure: Hotstar's concurrent-views graph, peaking at 25 million users]

In this video, Gaurav Kamboj, Cloud Architect Engineer at Hotstar, talks about some of the design decisions taken by the team to scale as high as 25 million concurrent views. I have highlighted the important points covered in this session.

📌 AutoScaling

Like JioCinema's engineers, Hotstar moved away from autoscaling to manage sudden traffic spikes. The primary reason stems from the limitations of autoscaling in time-sensitive scenarios. While autoscaling offers benefits in many situations, its Achilles' heel is the provisioning timeframe: launching new instances and registering them with the appropriate load balancers can take a significant amount of time, up to 30 minutes. That delay might be acceptable for non-critical operations, but it proves disastrous in the fast-paced world of live streaming, where even a brief interruption degrades user experience. The video explores other factors that hinder reliance on autoscaling for live streaming in more detail, from API throttling mechanisms that limit the rate of new resource creation to capacity limits within individual availability zones. By acknowledging these limitations and adopting alternative strategies, Hotstar's engineers ensured a more responsive and reliable platform for their live-streaming audience.

📌 Alternate Strategies

Expanding on these alternative strategies, the speaker discussed Hotstar's methods for handling concurrent traffic. Beyond simulating traffic beforehand through chaos engineering to estimate server requirements, Hotstar has developed a unique in-house scaling mechanism. It departs from traditional approaches that rely on metrics like CPU usage or network thresholds, and instead focuses on parameters directly relevant to Hotstar's business, such as request rate and concurrency limits. This tailored approach allows the platform to scale proactively in response to real-world usage patterns. By considering factors that directly impact user experience, such as the volume and frequency of incoming requests, Hotstar can meet the demands of its audience while also optimizing resource allocation, leading to a more cost-effective and efficient operation.
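The core of such a scaler can be sketched as a sizing function driven by business metrics rather than CPU. All thresholds and per-replica capacities below are invented; Hotstar's real numbers and mechanism are not public.

```python
# Hedged sketch of scaling on request rate and concurrency instead of
# CPU/network thresholds. Capacities per replica are made-up assumptions.

def desired_replicas(requests_per_sec, concurrent_users,
                     rps_per_replica=5_000, users_per_replica=50_000,
                     min_replicas=2):
    """Size the fleet to whichever business signal demands more capacity."""
    by_rps = -(-requests_per_sec // rps_per_replica)      # ceiling division
    by_users = -(-concurrent_users // users_per_replica)  # ceiling division
    return max(min_replicas, by_rps, by_users)

# Example: request rate alone would ask for 24 replicas, but concurrency
# dominates and drives the fleet to 80.
replicas = desired_replicas(requests_per_sec=120_000, concurrent_users=4_000_000)
```

Taking the maximum across signals is the key design choice: either metric alone can justify scaling up, so the fleet is sized for whichever is currently the bottleneck.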

📌 Miscellaneous

An interesting insight from this video was the role of external and internal factors in driving traffic peaks. External factors include a sudden rise in concurrent traffic when a popular cricketer (MS Dhoni, Virat Kohli) comes out to bat, and a sudden dip when they get out.

This can also be viewed from another angle: when a star batter gets out, the sudden dip in stream viewership becomes a surge of traffic on the home page. The home page APIs should therefore also be ready to handle such spikes.
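One common defense for this kind of correlated surge is a short-TTL cache in front of the home page API, so that thousands of near-simultaneous requests regenerate the page at most once per window. This is a generic pattern, not something the talk attributes to Hotstar specifically.

```python
# Sketch of a short-TTL cache: under a wicket-fall surge, only the first
# request per TTL window recomputes the home page; the rest hit the cache.
import time

class TTLCache:
    def __init__(self, ttl_seconds=5.0):
        self.ttl = ttl_seconds
        self._value = None
        self._stored_at = -float("inf")
        self.misses = 0  # how many times we actually recomputed

    def get(self, compute):
        now = time.monotonic()
        if now - self._stored_at > self.ttl:
            self._value = compute()  # regenerate at most once per TTL window
            self._stored_at = now
            self.misses += 1
        return self._value

cache = TTLCache(ttl_seconds=5.0)
# Simulate a burst of 10,000 home-page requests arriving within the TTL window.
responses = [cache.get(lambda: "home page payload") for _ in range(10_000)]
```

A real deployment would put this at the CDN or a shared cache like Redis rather than in-process, but the effect is the same: the surge is absorbed by cached responses instead of hitting the backend.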

Internal factors also play a significant role. The video discussed the importance of push notifications for audience engagement. While a marketing team might send a push notification during a match to attract a specific percentage of users (perhaps with a 10% conversion rate in mind), the actual impact can be quite different. The notification can trigger a surge in concurrent users, creating an unexpected traffic spike that the platform must be prepared for. By understanding the potential impact of internal and external factors, live streaming services can proactively manage their infrastructure and resources to ensure a smooth and uninterrupted viewing experience for their audience.
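A standard way to blunt a self-inflicted notification spike is to stagger the send over a window instead of blasting the whole audience at once. The sketch below splits an audience into evenly sized batches; the audience size, window, and interval are invented numbers, and the talk does not specify Hotstar's actual mechanism.

```python
# Sketch of staggering a push-notification campaign to smooth the resulting
# traffic spike. All figures are illustrative assumptions.

def stagger_schedule(total_users, window_seconds, batch_interval=10):
    """Split the audience into near-equal batches, one per interval."""
    n_batches = max(1, window_seconds // batch_interval)
    base, remainder = divmod(total_users, n_batches)
    # Spread the remainder so batch sizes differ by at most one user.
    return [base + (1 if i < remainder else 0) for i in range(n_batches)]

# 10M recipients spread over a 5-minute window: 30 batches of ~333k each,
# instead of one 10M-user thundering herd.
batches = stagger_schedule(total_users=10_000_000, window_seconds=300)
```

The platform still has to absorb each batch's conversions, but a bounded, predictable per-interval load is far easier to provision for than a single instantaneous spike.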

Key Takeaways for Managing Traffic Spikes in Live Streaming:

  • Prioritize User Journey and Core Functionalities: Understand user behaviour and prioritize critical features like video streaming and ad delivery. Non-essential features can be temporarily disabled during peak traffic to ensure smooth operation.
  • Graceful Degradation and User Communication: Implement informative error messages that explain issues clearly and offer solutions when possible. Suppress unnecessary or overly technical error messages.
  • Proactive Planning and Resource Management: Conduct load testing to identify potential bottlenecks and define thresholds for service breakpoints. Procure additional resources well before anticipated peaks to avoid reliance on reactive scaling mechanisms.
  • Reduce Reliance on External Resources: Explore building in-house solutions to reduce dependency on external factors like ISP-provided DNS resolvers and autoscaling by cloud providers. This can enhance overall system resilience and user experience.
  • Move Beyond Traditional Autoscaling: While autoscaling offers benefits, consider alternative strategies for live streaming due to its time-sensitive nature. Explore custom scaling mechanisms based on request rates and concurrency limits to meet specific platform needs.
  • Factor in External and Internal Traffic Fluctuations: Be prepared for sudden traffic surges due to external events like popular player performance. Anticipate the impact on other platform sections, like the home page, when viewership shifts due to external factors.
  • Internal Marketing Efforts: Understand the potential impact of internal actions like push notifications, which might trigger unexpected traffic spikes. By considering both internal and external factors, proactive infrastructure management can ensure a smooth viewing experience.
