Why Build a Data Lake?

The biggest problem with the previous pipeline was that it was highly inflexible, and the source of this inflexibility was twofold. First, the data loaded in its raw form was effectively unusable: it could not easily be retrieved and required heavy processing to be accessed. Use of the data we collected was therefore limited to the tables at the end of our ETL pipeline. This brings us to the second source of inflexibility. Because the only way to access our data was via Redshift tables, we were constrained by the schemas of those tables, and each schema was ultimately informed by the consuming application. As applications changed and new features were introduced, their data requirements changed as well, which meant we had to go back and modify the schemas of our source tables. Because those table schemas were tightly coupled to the ETL pipeline, we also had to modify our Spark jobs to ensure the original data was parsed correctly to support the new schemas. In short, small application changes could ripple through the entire ETL pipeline. This tight coupling of our ETL pipeline to our applications slowed the pace of change and meant that the same data could only be utilized by a small number of applications.
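
To make the coupling concrete, here is a rough, hypothetical sketch of what a legacy job like this might look like; every field name, path, and table below is illustrative rather than our actual code. Because the job projects the raw payload straight into the reporting table's schema, any change to that schema forces a change here as well.

```python
# Hypothetical sketch of the old, tightly coupled approach: the Spark job
# parses raw API payloads straight into the exact schema of the Redshift
# table that a downstream report consumes. Fields and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("legacy-etl").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/publisher_responses/")

# The select() below mirrors the reporting table's schema exactly.
# If the application later needs, say, engagement broken out by type,
# both this job and the Redshift DDL have to change in lockstep.
report_rows = raw.select(
    F.col("post.id").alias("post_id"),
    F.col("post.created_time").cast("timestamp").alias("posted_at"),
    (F.col("metrics.likes") + F.col("metrics.shares")).alias("engagement"),
)

# Written out in the shape Redshift expects, then loaded into the table.
report_rows.write.mode("overwrite").parquet(
    "s3://example-bucket/staging/report_table/"
)
```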

The second problem arose because our data warehouse was utilized by several reporting applications, some of which delivered reports directly to clients. Because our ETL pipeline required a fair amount of deduplication and processing inside the database, our ETL jobs were competing for access to the same tables as our reporting applications. If a large ETL job was running, it could slow down reporting and directly impact end users. Similarly, if internal applications or teams wanted to query a dataset for analysis, product development, or reporting, they could impact client-facing applications or simply suffer from poor performance. If, for example, our data science team needed to run a wide variety of ad hoc queries to better understand the shape and structure of a dataset for feature creation, they would have to worry about impacting the whole business.

The third problem was cost: data warehousing is expensive. All our data was stored in Redshift whether or not we were using it or even needed it, and as a storage layer Redshift was fairly costly compared to cheaper alternatives like S3. Moreover, because our compute engine was directly coupled to our data storage, we had to expand our Redshift cluster as our data volume increased. Increasing the cluster size meant we were effectively paying for additional storage and compute whether or not we were using them. As the company grew and the amount of data under management increased, we were starting to push the limits of our Redshift cluster.

Data lakes are best understood in contrast to data warehouses. A data warehouse is meant to serve as a central repository for all of an organization's data, regardless of where that data originated. Data warehouses, however, suffer from some key limitations that a data lake attempts to address. A data warehouse is generally a reflection of the business entities and reporting requirements defined by an organization: the needs of the data consumer drive the structure of the warehouse, and as a result considerable research and engineering time is invested in building tables and schemas that provide direct value to those consumers. By design, a data warehouse imposes a high degree of structure on an organization's data.

Whereas a data warehouse requires that data be schematized, a data lake does not enforce structure on data at all. Data requires no preprocessing to be loaded into a data lake, which can therefore serve as a repository for high volumes of raw data. After the data is loaded, an organization can decide on its best use and begin to impose structure in the form of various schematized datasets. Because they do not require structure, data lakes are highly flexible and can store data from a wide range of sources, and that same flexibility becomes powerful when reporting needs or business entities change. Similarly, data lakes are typically architected so that storage is cheap, which means all historical data can be preserved whether or not it is being used. Finally, data lakes can be utilized by a wide range of stakeholders within an organization. While most users will only need structured, transformed data, a subset of users will need to create new datasets and search for insights within the raw data; the data lake makes that raw data available to them as well.
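
As a rough illustration of this "load first, structure later" idea, the hypothetical PySpark sketch below derives two differently shaped datasets from the same untouched raw data; the paths and field names are assumptions made up for the example.

```python
# Minimal sketch of schema-on-read: raw JSON lands in the lake untouched,
# and structure is imposed later, per consumer. Paths and fields are
# hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schematize").getOrCreate()

raw = spark.read.json("s3://example-lake/raw/social/")

# One consumer wants post-level rows...
posts = raw.select(
    F.col("id").alias("post_id"),
    F.to_date("created_time").alias("created_date"),
    F.col("message"),
)

# ...another wants daily engagement totals. Both are derived from the same
# raw dataset, so neither constrains how the data was collected.
daily_engagement = raw.groupBy(F.to_date("created_time").alias("day")).agg(
    F.sum("metrics.likes").alias("likes"),
    F.sum("metrics.shares").alias("shares"),
)

posts.write.mode("overwrite").parquet("s3://example-lake/curated/posts/")
daily_engagement.write.mode("overwrite").parquet(
    "s3://example-lake/curated/daily_engagement/"
)
```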

Given the problems outlined above, a data lake made a lot of sense for Unified. Once completed, the data lake helped address the inflexibility problem by giving teams the ability to load raw data collected from social media publishers without requiring any transformation or schematization. This is critical not only because we work with so many different publishers, but also because if the structure of those API responses changes, our data collection process is unaffected. We could therefore build our data collection pipelines without any consideration of the final schema. With the raw data intact, whenever we want to build a new application that depends on a different view of the data, whether that means showing different fields or applying different transformations, it becomes relatively simple to build out a parallel ETL pipeline.
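
A minimal sketch of what such a collection step could look like is below, assuming a hypothetical publisher endpoint and bucket layout; the key point is that the response is stored byte-for-byte with no parsing.

```python
# Hypothetical sketch of the collection step: the publisher's API response
# is written to the lake as-is, with no parsing or schema applied, so
# changes to the response structure cannot break collection.
import datetime
import uuid

import boto3
import requests

s3 = boto3.client("s3")


def collect(publisher: str, endpoint: str) -> None:
    response = requests.get(endpoint, timeout=30)
    response.raise_for_status()

    today = datetime.date.today().isoformat()
    key = f"raw/{publisher}/date={today}/{uuid.uuid4()}.json"

    # Store the raw payload untouched; downstream ETL decides how to
    # interpret it later.
    s3.put_object(Bucket="example-lake", Key=key, Body=response.content)


collect("examplegram", "https://api.example.com/v1/posts")
```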

The data lake also helped address the performance issues, because all ETL processing and data transformations could now occur outside the data warehouse and thus independently of any reporting or customer-facing applications. This became especially important when we started to load extremely large proprietary datasets from third-party vendors to enrich our publisher data. The business could now ingest and transform terabytes of data completely independently of any unrelated ETL processes.
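
To illustrate, here is a hedged sketch of a deduplication job running entirely against the lake, with only the finished dataset handed to the warehouse via a bulk COPY; the dataset names and fields are invented for the example.

```python
# Illustrative sketch: deduplication and heavy transformation happen in
# Spark against the lake, and only the finished result is loaded into the
# warehouse, so reporting queries never compete with ETL. Names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dedupe-outside-warehouse").getOrCreate()

raw = spark.read.json("s3://example-lake/raw/vendor_feed/")

# Keep only the latest record per entity id, instead of deduplicating
# inside Redshift.
latest_first = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
deduped = (
    raw.withColumn("rn", F.row_number().over(latest_first))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

deduped.write.mode("overwrite").parquet("s3://example-lake/curated/vendor_feed/")

# The warehouse then only has to run a bulk load of the finished dataset, e.g.:
#   COPY vendor_feed FROM 's3://example-lake/curated/vendor_feed/'
#   IAM_ROLE '...' FORMAT AS PARQUET;
```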

Finally, because all storage moved to S3, the business now pays only for cheap object storage plus compute in the form of EMR. This is more efficient because we only spin up EMR clusters when we need to run a transformation. Compute and storage are thus fully decoupled.
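
As a rough sketch of what compute on demand can look like in practice, the example below launches a transient EMR cluster with boto3 that runs a single Spark step and terminates itself; the instance types, roles, and paths are placeholders, not our production configuration.

```python
# Rough sketch of decoupled compute: an EMR cluster exists only for the
# duration of the Spark step and terminates itself when it finishes, while
# the data stays in S3 the whole time. Cluster settings are illustrative.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="nightly-transform",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Terminate the cluster as soon as the step finishes.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "run transformation",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-lake/jobs/transform.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```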
