@adammetis · Nov 29, 2023 · Originally posted on www.metisdata.io
Collecting data from multiple sources poses multiple risks. Missing values can be represented differently. Dates may be parsed incorrectly. International strings may get the wrong encoding. In this post we are going to look at various issues that may occur during the process, and how to address them along the way.
We collect data in various ways depending on how we use it later on. For typical "online" scenarios we use so-called OLTP - Online Transactional Processing. This means that we focus on providing the best performance, but at the same time, we have transactions that typically modify a limited number of entities. We also have a large number of parallel transactions. We can think of any online application that handles interactive user input.
However, for "off-line" scenarios we have different requirements. These are typically related to bookkeeping: updating the database, recalculating values, or generating reports. These queries take much longer to complete, and they are not time-sensitive. It's okay to run them during the night and wait a couple of hours for the results. We also don't focus on database normalization, to avoid performance issues when joining tables. There are different designs for "off-line" scenarios, like the snowflake schema, and there is no need to use normal forms.
Yet another important term in this area is OLAP - On-Line Analytical Processing. These are technologies that organize multiple databases to perform analysis for reporting and enterprise management. They need to deal with multiple data sources that serve similar purposes but are implemented differently under the hood. This may lead to issues that are hard to find and can easily go unnoticed.
Another case is data ingestion, which is completely different for warehouses. We typically don't insert records one by one or within fine-grained transactions. We load them in bulk by importing batches of records that are later added to the table as partitions. We want to avoid locks, performance degradation, and sometimes even consistency checks to make the process faster.
Importing data from multiple sources is called Extract-Transform-Load (ETL for short). This process consists of three phases.
The first phase is called Extract. We need to get a connection to a remote data source to get the records. This can be a simple connection to the database over a well-known protocol, downloading files from storage like an FTP server, or calling a web service exposing data as XML or JSON. There are multiple ways to do that, and they depend on the implementation details of the data source. Sometimes it is fast enough to query records one by one; sometimes we prefer to export data to a dump file which we then load in a different place.
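As a rough sketch, an Extract step that calls a web service exposing JSON could look like the snippet below; the endpoint URL and the "records" field are made up for the example.

import json
import urllib.request

def extract_records(url: str) -> list[dict]:
    """Pull a batch of records from a hypothetical JSON endpoint."""
    with urllib.request.urlopen(url) as response:
        payload = json.loads(response.read().decode("utf-8"))
    # The shape of the payload depends entirely on the source system.
    return payload.get("records", [])

# Example usage (the URL is a placeholder):
# rows = extract_records("https://example.com/api/orders?page=1")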
The second phase - Transform - aims to translate data from the source format to the destination one. This includes changing schemas by joining tables, adding columns, or renaming properties. We also need to transform values between different representations. For instance, we may need to translate dates from MM/DD to ddmm form. We need to handle missing values, placeholders, encoding schemes, and various other things. All of that needs to be done to unify all data sources, so we can load them later into the data warehouse. We also need to handle duplicates, missing records, or broken entries.
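A minimal sketch of a Transform step, assuming the source sends dates as MM/DD strings and uses placeholder strings such as "N/A" for missing values (the field names are invented for the example):

from datetime import datetime

NULL_TOKENS = {"", "N/A", "NULL", "NONE"}  # assumed placeholders used by the source

def transform_record(raw: dict) -> dict:
    """Normalize one source record into the warehouse schema (illustrative only)."""
    # Treat the source's placeholder strings as real missing values.
    cleaned = {k: (None if str(v).strip().upper() in NULL_TOKENS else v)
               for k, v in raw.items()}
    # Reformat the order date from MM/DD (source) to ddMM (destination), if present.
    if cleaned.get("order_date") is not None:
        parsed = datetime.strptime(cleaned["order_date"], "%m/%d")
        cleaned["order_date"] = parsed.strftime("%d%m")
    return cleaned

print(transform_record({"order_date": "07/25", "discount": "N/A"}))
# {'order_date': '2507', 'discount': None}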
The last phase, called Load, focuses on loading the data into the data warehouse. Since we integrate multiple data sources, we may need to load data in bulk to achieve decent performance. Loading needs to be performed when the database is not overloaded, so we typically do it outside of working hours. There are other aspects to consider like errors, retries, or archiving.
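As a sketch of bulk loading, assuming a PostgreSQL warehouse and the psycopg2 driver (the table and column names are invented), a batch could be streamed in with COPY instead of row-by-row inserts:

import csv
import io

import psycopg2  # assumed driver; the bulk-load mechanism of your warehouse works similarly

def bulk_load(rows: list[dict], dsn: str) -> None:
    """Stream a batch of transformed rows into the warehouse with COPY."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([row["order_date"], row["discount"]])
    buffer.seek(0)

    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.copy_expert(
            "COPY warehouse.orders (order_date, discount) FROM STDIN WITH (FORMAT csv)",
            buffer,
        )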
Multiple things can go wrong when dealing with heterogeneous databases. We'll cover some of them in this section, and think about how to make sure we avoid them.
The problem with heterogeneous data sources is mostly about identifying issues during the Transform phase of ETL. The trouble is that they can easily go unnoticed. That's because values are not "wrong" per se, but they change their meaning in the process.
Another issue is the scale. It's feasible to manually verify one thousand rows and check that everything is fine. However, it becomes impossible once we run the process at a large scale and need to verify millions of rows. Since the process doesn't fail, issues may stay hidden in production for a long time.
Let's go through some categories of issues.
Dates and timestamps can easily become hard to manage and reason about. The first thing to notice is the lack of a standardized date format. There are multiple formats out there, like DD.MM.YYYY, MM.DD.YYYY, ddMMyyyy, MM/DD, and others. The issue here is that we may incorrectly change the format on the fly. For instance, the data source stores dates in MM/DD format, but we later process them as DD/MM. When we run an automated process, we may notice that July 25 (07/25) cannot be represented in DD/MM format, so we get an empty value, but we may miss that this problem doesn't apply to February 4, which silently becomes April 2.
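The asymmetry is easy to reproduce: parsing with the wrong pattern fails loudly only for days greater than 12. A small Python illustration:

from datetime import datetime

def parse_as_dd_mm(value: str):
    """Parse a date string assuming DD/MM, even though the source used MM/DD."""
    try:
        return datetime.strptime(value, "%d/%m")
    except ValueError:
        return None  # the mistake is visible only for days greater than 12

print(parse_as_dd_mm("07/25"))  # None -> July 25 cannot be a DD/MM date, so we notice
print(parse_as_dd_mm("02/04"))  # 1900-04-02 -> February 4 silently becomes April 2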
A similar issue applies to timestamps because they need to include a time zone specification. The data source may store timestamps in local time, but the destination database may use UTC. If we don't pay attention, we may miss that the timestamps are transformed incorrectly. This is even more likely to slip through when the local time zone is close to UTC.
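A sketch of the trap: a naive local timestamp that gets labelled as UTC instead of being converted to it (the Warsaw time zone is just an example):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

naive_local = datetime(2023, 11, 29, 10, 0, 0)  # stored by the source in local time, no zone info

# Wrong: stamping the value as UTC without converting shifts its real meaning.
mislabelled = naive_local.replace(tzinfo=timezone.utc)

# Right: attach the source's zone first, then convert to UTC.
converted = naive_local.replace(tzinfo=ZoneInfo("Europe/Warsaw")).astimezone(timezone.utc)

print(mislabelled.isoformat())  # 2023-11-29T10:00:00+00:00
print(converted.isoformat())    # 2023-11-29T09:00:00+00:00 (one hour off from the mislabelled value)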
How to avoid that? Unit tests may work well, but we would need to think about all the possible edge cases, so that won't be enough in practice.
We can analyze the histogram. Just count the dates falling in each month in the source dataset, and then compare the counts with the destination one. If the difference is significant, then something went wrong along the way.
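A minimal version of that check, assuming we can obtain the parsed dates from both sides, is to compare per-month counts and flag the buckets that differ:

from collections import Counter
from datetime import date

def monthly_histogram(dates: list[date]) -> Counter:
    """Count rows per (year, month) bucket."""
    return Counter((d.year, d.month) for d in dates)

def compare_histograms(source: list[date], destination: list[date], tolerance: int = 0) -> list:
    """Return the buckets whose counts differ by more than the tolerance."""
    src, dst = monthly_histogram(source), monthly_histogram(destination)
    buckets = set(src) | set(dst)
    return [b for b in sorted(buckets) if abs(src[b] - dst[b]) > tolerance]

# Any bucket returned here deserves a manual look at the rows behind it.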
Yet another issue is with "tricky" dates like February 29. These dates may get lost, so it's good to verify them manually.
Numbers can pose multiple challenges. It is common in some locales to write bigger numbers with separators. For instance, one thousand two hundred thirty-four can be written as 1234 or 1.234 or 1,234. Not only is a separator used, but the separator character differs: some languages prefer dots, some go with commas, yet others use spaces.
Now it's clear why errors may occur. We may read 1,234 as one with some fractional part (comma as the decimal separator), but use it as if the comma were a thousands separator, making it more than one thousand. Again, this may easily go unnoticed because both readings are technically "correct", but they have different meanings.
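Both readings can be made explicit by parsing the same string under the two conventions; a small sketch:

def parse_decimal_comma(value: str) -> float:
    """Interpret ',' as the decimal separator and '.' as a thousands separator."""
    return float(value.replace(".", "").replace(",", "."))

def parse_decimal_point(value: str) -> float:
    """Interpret '.' as the decimal separator and ',' as a thousands separator."""
    return float(value.replace(",", ""))

raw = "1,234"
print(parse_decimal_comma(raw))  # 1.234  -> roughly one and a quarter
print(parse_decimal_point(raw))  # 1234.0 -> more than one thousand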
How to avoid that? Again, unit tests will be of some help if we're aware of the issue. It may be required to analyze the data distribution. Just plot the distribution, or calculate some central metrics (like an average or a sum).
How to store the absence of a value in a database? NULL is one option, but there are others as well. The literal string "NULL" or "NONE" could be a solution, "N/A" or "Not applicable" as well, or maybe just an empty string "".
This gets even more interesting with various languages. For example, "n'est pas applicable" is "Not applicable" in French. Similarly, "nie dotyczy" is the same in Polish. All these values should be considered the same in the target system.
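One pragmatic approach is to keep an explicit, reviewed list of placeholder strings and map all of them to a real NULL; the list below is only an example and will differ per data source:

# Placeholder strings observed in the sources (example list, extend per data source).
MISSING_TOKENS = {
    "", "null", "none", "n/a", "not applicable",
    "n'est pas applicable",  # French
    "nie dotyczy",           # Polish
}

def normalize_missing(value):
    """Map any known 'missing' placeholder to None, keep everything else as-is."""
    if value is None:
        return None
    if str(value).strip().lower() in MISSING_TOKENS:
        return None
    return value

print(normalize_missing("nie dotyczy"))     # None
print(normalize_missing("Not applicable"))  # None
print(normalize_missing("42"))              # '42'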
However, the problem is not only on the input side. It can occur on the output side as well. The ETL system may emit empty values when it cannot parse a field for whatever reason. If that's the case, then the target database will contain far more empty values than the source systems.
How to avoid that? Check distributions, learn other languages, and make sure that you mark fields that can be empty.
Another issue is the length of the field. Your source dataset may have very long values that get truncated when loading into the destination system. This can also happen when we handle various encodings. For instance, EBCDIC had a special character for zł which represented the Polish currency (just like $ represents the dollar). However, there is no single character for zł in Unicode. If your target system allows for just one character to indicate the currency, then you may get an error.
How to avoid that? Run some analysis of the data size distribution. Make sure that your columns do not get silently truncated, but rather throw an error.
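A quick sanity check before loading is to compare value lengths per column against the destination column limits; the limits dictionary below is, of course, specific to your schema:

def oversized_values(rows: list[dict], limits: dict[str, int]) -> dict[str, int]:
    """Return, per column, how many values exceed the destination column length."""
    counts = {column: 0 for column in limits}
    for row in rows:
        for column, limit in limits.items():
            value = row.get(column)
            if value is not None and len(str(value)) > limit:
                counts[column] += 1
    return {column: n for column, n in counts.items() if n > 0}

# Example: a currency column that allows a single character would flag 'zł'.
print(oversized_values([{"currency": "zł"}, {"currency": "$"}], {"currency": 1}))
# {'currency': 1}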
Character escaping can be very hard to reason about because it's not enough to take a look at the data to understand if the problem is there. For instance, take the following JSON:
{\"field\": \"value\"}
Is this a proper JSON? It is if you escape the characters (so you use a backslash before each double-quote character). It is not if you don't escape the characters. However, can you tell just by looking at the value?
To solve this problem we need to see the logic that operates on the data. We need to go and check the source code of all the systems that may use the value. This may get even harder when we take the JSON from above and escape it again:
{\\\"field\\\": \\\"value\\\"}
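The escaped forms above typically appear when a JSON document is serialized again as a string inside another JSON document. A short demonstration of how the layers accumulate, and why only the consuming code knows how many times to decode:

import json

original = {"field": "value"}

once = json.dumps(original)      # '{"field": "value"}'
twice = json.dumps(once)         # the inner quotes are now escaped
three_times = json.dumps(twice)  # the backslashes themselves get escaped again

print(twice)
print(three_times)

# Decoding must be applied exactly as many times as encoding was:
assert json.loads(json.loads(twice)) == original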
Regular expressions are a standard feature of multiple programming languages, databases, and systems. However, there are differences between them. Let's take these expressions:
(?P<name>x)
(?<name>x)
(?'name'x)
\k{name}x
They all do the same - capture the character x and put it in a group named "name". However, the actual syntax for naming a group differs between languages. The first example is from Python, the next two come from C#, and the last one is from Perl.
If you just reuse expressions from other platforms, you may run into situations where the expressions work (i.e., don't throw errors) but return different results.
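The point is easy to see with Python's standard library: only the first syntax is accepted by the re module, while the other three are rejected because they belong to different engines:

import re

match = re.search(r"(?P<name>x)", "xyz")  # Python's syntax for a named group
print(match.group("name"))                 # 'x'

for pattern in (r"(?<name>x)", r"(?'name'x)", r"\k{name}x"):
    try:
        re.compile(pattern)
    except re.error as exc:
        # These syntaxes belong to other engines; recent versions of Python's re reject them.
        print(f"{pattern!r} rejected: {exc}")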
How to avoid that? We need to understand the code and the internals of the systems we use. There is no mechanical solution that we can apply here.
Some heterogeneous distributed databases make things even harder because of how they handle connectors. For instance, HBase or Spark (or any other system working on Hadoop) can be accessed from various technologies like Scala, Java, or Python. However, the code running in such an environment often needs to be sent between nodes in serialized form; Spark, for instance, serializes closures before shipping them to executors and can use Kryo to serialize the data itself. Problems may arise when things do not get cached properly and work differently when running on a single node versus multiple nodes. It's not uncommon for cached datasets in Spark to fail processing because of serialization issues.
How to avoid that? There is no simple answer here. We need to test the code with the actual frameworks and libraries, as plain unit tests won't capture the issues. One solution for Spark could be to run tests against an actual single-node cluster emulated locally.
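A sketch of that idea with PySpark: run the pipeline's functions on a local cluster so that closures and cached data actually go through serialization (the transformation below is a stand-in for your real logic):

from pyspark.sql import SparkSession

def normalize_amount(value: str) -> float:
    """Stand-in for a real transformation shipped to the executors."""
    return float(value.replace(",", ""))

def test_normalize_on_local_cluster():
    spark = (
        SparkSession.builder
        .master("local[2]")   # a local two-thread cluster, so the closure still gets serialized
        .appName("etl-smoke-test")
        .getOrCreate()
    )
    try:
        rdd = spark.sparkContext.parallelize(["1,234", "5,678"]).map(normalize_amount)
        rdd.cache()           # caching exercises the serialization path as well
        assert rdd.collect() == [1234.0, 5678.0]
    finally:
        spark.stop()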
Ingesting data from multiple data sources can be very hard. The biggest issue is not with running the process, but with making sure that we get correct output. Issues can go unnoticed for months, and the systems that use the transformed data may seem to continue working correctly. However, the quality of your solutions may decrease. For instance, if you run your machine learning algorithm on data that is invalid and has a completely different distribution, then your model may learn the wrong patterns and not work in production.