Data consolidation & merging datasets using addresses
Data consolidation is the process of combining an organization's data into a single dataset. Consolidating your entire organization's data requires taking data from multiple source files, cleaning and verifying it by removing errors, and storing it in a single integrated location.
In order to do this, you'll need to merge your datasets. Merging Datasets is the process of taking multiple original files, cleaning them up, and combining them into one master dataset of addresses and related data.
You may be wondering why are we even talking about data consolidation? Because if there's one thing every business does, it's gather data. As time goes on, and technology progresses, data collection only seems to grow and expand. Handling the high volume data can slow down your company down without useful resources and tools.
Merging data from multiple datasets requires one or more unique identifiers to indicate which records are candidates for merging. Standard unique identifiers can be phone numbers, email addresses, or mailing addresses.
We’ll discuss the pros and cons of different unique identifiers and show how a standardized, validated mailing address could be the best for you.
You can try some of our address standardization and validation tools here. Click the "TRY IT NOW" buttons to check them out and read on, to see how address standardization will help you in your dataset merging adventure!
Bulk Address Validation | US Address Validation API | International Address Validation API |
---|---|---|
In this article, we'll cover:
- What's a data silo?
- Why is data center consolidation important?
- Merging datasets techniques overview
- Why you should consolidate data based on street address and not other field values
- Address database consolidation challenges
- What are the merge process steps of address data consolidation?
- Conclusion
Nobody likes operating in the dark. Without the right information, it's hard to know what the right decision is. Is there room in the market for your organization to expand to another geographic area? Are your customers advocates or detractors of your product? Was that pricey billboard on Main Street worth it?
The answers to those questions lie in data gathered from many different sources. Unfortunately, big data can also lead to big problems, namely, duplicate entries, outdated information, data entry error, and last but not least, the dreaded data silos.
What's a data silo?
A data silo is a repository of raw data that s controlled by one department or business unit and isolated from the rest of an organization. Siloed data is often stored in a standalone system and may be incompatible with other data sets, making it hard for other departments to access or use the siloed data. The result is a lack of transparency, efficiency, and trust among departments.
This lack of transparency can increase your risks of inaccurate data, poor resource utilization, and decreased revenue.
The solution? Data consolidation.
Why is data center consolidation important?
While the merge process may sound painful, the result of data consolidation is worth it. Having all your critical data in one place provides a 360-degree view of your business, allowing you to efficiently plan and execute business strategies and solutions. Data consolidation improves data quality, speeds up process execution, and simplifies day-to-day information access.
Here at Smarty, we often see companies aggregating address data from multiple sources. Because the address lists come from different places, the address formats rarely have identical structures they can contain typos, addresses might be contained in a single field or broken into different components, and may contain blank values, among other problems.
By standardizing and validating the address to find matches across separate datasets, all of this information can be combined into an authoritative single dataset.
Merging address datasets is the process of taking multiple original files of addresses, cleaning them up, and combining them into one master dataset of addresses and related data.
Merging datasets techniques overview
The three most common types of merge techniques for merging source datasets are ETL, Data Virtualization, and Data Warehousing.
Type of merge 1: ETL (extract, transform, load)
ETL, which stands for extract, transform and load, is one of the most common data management techniques for data consolidation. Data is extracted from multiple data sources, transformed through data cleansing, aggregation, sorting, etc. into a single, consistent data store, and loaded into a data warehouse or other target file system.
Automation integration tools can carry out ETL in two ways:
Batch processing: suitable for running repetitive, high-volume data jobs.
Real-time ETL: uses Change Data Capture (CDC) to transfer updated data to the target file system in real time.
CDC is a set of software design patterns used to determine and track data that has changed so that action can be taken using the changed data without delay.
Type of merge 2: Data virtualization
Data virtualization uses a software abstraction layer to create a consolidated, virtual view of data without physically copying, transforming or moving the source data to a target system.
With data virtualization, organizations can make data available with real-time, self-service access for business users and data scientists.The view-based approach offers agility even when underlying sources are added, removed, or changed.
Type of merge 3: Data warehousing
Data warehousing is the process of integrating data from disparate sources and storing it in a central repository, or data warehouse (often abbreviated as DW or DWH). Once in the data warehouse, the data is ingested, transformed, processed, and made accessible for reporting, business intelligence, and other ad-hoc queries.
Data warehouses provide a broad, integrated view of all data assets, with relevant data clustered. There are three main types of data warehouses:
Enterprise data warehouse (EDW)
Operational data store (ODS)
Data mart
Which data consolidation technique you choose will depend on your company's requirements.
Why you should consolidate data based on street address and not other field values
So you've picked your data consolidation technique, and you're ready to merge two datasets or perhaps even more!
Now you need to decide which value you'll base your data consolidation on. You could use the value from any field that you collect: first name, phone number, email, street address, etc. However, ideally the value will be unique, something that doesn't change frequently, and can be standardized.
For those reasons, we HIGHLY RECOMMEND that you use the street address for data consolidation. Here's how street addresses stack up against other form fields when considering our criteria of unique, rarely changing, and standardizable.
Street addresses are unique
Street addresses are a unique value. Names are not. For example, a customer named Stephanie Smith might list her name as Steph when filling out one form, and Stephanie on another. Or she might be in a hurry, and accidentally type in just Step instead of Steph. A customer service agent might incorrectly input the name as Stephany, Stefanie, or any variation of these.
Next, consider phone numbers versus street addresses. People usually only have one street address, but they may provide you with multiple phone numbers: their home phone, their work phone, their cell phone, or even their spouse's phone number.
Street addresses rarely change
Street addresses don't change very often, but your customer's email address might change due to a new job, their company being acquired, their account being hacked, a forgotten password or due to having multiple email addresses. And with job changes come phone number changes in addition to email changes.
Street addresses can be standardized
With address validation tools, it's easy to standardize street addresses into the same file format, and know that they're all correct, valid addresses. You can create a unique identifier for any given address by taking the standardized version of the address (minus the ZIP Code) and presenting it as a string, like this:
742+Evergreen+Ter+Springfield+OR
This gives you one single source of truth for your data. It's more difficult to confirm the validity of phone numbers or emails in your database.
In addition to the above factors, customers have a better incentive for providing a correct street address. They want to be able to receive any packages or important letters you're sending them. Because of this, most customer records will include a street address. You can use other customer record fields as a secondary match for outliers.
Address database consolidation challenges
What's life without a few challenges here and there, right? Address database consolidation has its challenges, but the result is worth it. Let's talk about some of the more common stumbling blocks we've seen our customers encounter.
Duplicate customers
Anyone who's ever managed a list or database knows the pain of duplicates. Customers might create multiple accounts with multiple email addresses (who here hasn't signed up for a second free trial of something with a different email???), or interact with your company on different platforms, creating multiple activity logs.
On the bright side, if you have duplicates, you get to say the word dedupe a lot, and it's fun to say. Go ahead, try it.
Deduplication is the process of identifying redundant records in a data set referring to the same real-world entity and then merging those records.
Addresses That Don't Match (But Are Actually the Same)
One thing you might not realize until you start getting into address data consolidation and address deduplication is that there are a lot of ways to write the same address and have it still be a valid address.
Someone might enter an address and abbreviate street to st while another user might not, there may be typos or missing components like a unit number, among other small differences that you might never notice if each instance is spread out throughout your database.
For example, here are a few different ways an address could be written:
100 N Main Street, Anytown CA 90210
100 North Main Str, Anytown CA 90210
100 N Main ST, Anytown California
100 N Main Street, 90210
100 North Main ST, Anytown 90210
Each address above is the same place, but they're each going to show up as a different address in your database.
Alias addresses
This is another case where addresses don't match. Alias addresses are created when a street, city or other component of the address changes. Alias addresses are also created when geographic boundaries (like cities) change.
In order to prevent confusion and lost mail, postal providers keep the previous version of the address marked as a valid alternative address (alias) and specify the new version of the address as the primary address. This creates addresses that don't look like they would represent the same place, but actually do. These 3 addresses are all the same delivery point in New York.
200 W 31ST ST, NEW YORK, NY, 10001
370 7TH AVE, NEW YORK, NY, 10001
7 PENN PLZ, NEW YORK, NY, 10001
On a national scale, street names change frequently. Since the death of Martin Luther King Jr., over 1,000 streets have been renamed to honor the civil rights leader. That means every house on each of those streets now has at least two valid addresses.
Most cities have changed boundaries or have reason to provide multiple valid city names. For example, The ZIP Code 84121 in Utah has 13 valid city names:
Salt Lake City
Brighton
Cottonwd Hts
Cottonwood
Cottonwood Heights
Cottonwood Heights City
Holladay
Holladay Cottonwood
Holladay Ctwd
Murray
Salt Lake Cty
Slc
Solitude
Every single address in the 84121 area code has a minimum of 13 valid addresses. Add in a street name change and every house on that street now has 26 valid addresses according to the USPS, though some of the address variants look very different.
Large scale data integration projects
Here at Smarty, we've helped customers tackle some big datasets. We're talking billions of addresses across dozens of different datasets. Some commons reasons a company might need to consolidate large datasets include:
Mergers and acquisitions
Multiple sales platforms (eBay, Amazon, Etsy, etc.)
Merging legacy datasets into a new system
Using multiple databases to manage customer data
Merging datasets from disparate sources to enrich data
Some common issues pop up when you're trying to integrate data from multiple sources:
Issue #1: Each platform probably holds differences in datasets with how addresses are formatted. The address format might be standardized according to the USPS guidelines in one database but not another. You might discover the addresses are broken into 5 different fields in one database, but all squashed into a single field in another database.
Issue #2: If they've purchased from you on multiple platforms, the same customer might exist simultaneously in more than one database.
Before you can merge your databases, you'll need to overcome these address data consolidation challenges. Address standardization and validation software can help. We'll explain more in the next section.
What are the merge process steps of address data consolidation?
For this address data consolidation example, we're going to use the ETL (Extract, Transform, Load) data consolidation technique.
1. Extract addresses (extract)
The first step in the merge process is getting all the address data out of your different sources. These might include accounting software like QuickBooks, a customer relationship management tool like Hubspot or SalesForce, or an enterprise resource planning software like NetSuite.
Exporting your data from these tools to a spreadsheet is fairly simple. Isn't it nice to start with an easy task?
2. Validate
Step two is to standardize and validate your addresses. You can do this using a Bulk Address Tool, Command-Line Interface (CLI), or Application Programming Interface (API).
Standardizing involves parsing your different addresses and putting them in the one consistent format as defined by the USPS. This ensures each address is in the format with the correct syntax that will be most understandable for the next data consolidation steps.
It's important to note here that most address standardization tools do NOT validate that the addresses are real places. You can use a separate address validation tool to check addresses against the USPS database, or choose a tool like Smarty's US Address Verification that standardizes and validates addresses.
3. Find duplicates
Once the addresses are standardized and validated, it's easy to eliminate duplicates. If you did deduplication before standardizing and validating, many duplicate addresses would be missed.
You can take your standardized list and sort by address to find duplicates easily.
4. Merge records where applicable
Now, all those databases you're combining might include different bits of information for each item. For example, the info pulled from a CRM might include a customer's email address and phone, but not a street address.
Their street address is probably in your billing system if you send paper invoices. In cases like this, you'll want to perform a merge/purge process.
A merge/purge is the process of merging records from one data source or multiple data sources and eliminating duplicate records. Advanced data-matching techniques used in the merge/purge process generate one record that retains all of the valuable pieces of data and removes the duplicated data.
5. Delete redundancies (transform)
You might need to remove redundant information or transform data into a new format depending on its future destination.
6. Load data into new home
Now that you've performed all your data cleansing, aggregation, sorting, etc. and created a single, consistent data store, it's time to load it into a data warehouse or other target system.
Once you've done that, you can begin to utilize your one master dataset of address data to prevent misdeliveries, improve customer satisfaction, enrich data, and simplify workflows.
Conclusion
Whether you choose to consolidate your address data using ETL, Data Virtualization, or Data Warehousing, you will benefit from cleaner data, faster process execution and greater transparency of business information. Plus, automated tools like Smarty's US Address Verification can do a lot of the steps for you and make address data consolidation simpler. Try the live API today.