Best practices for building a PPI (Property Price Index)

Last updated: Aug 28, 2023
Kirill Lepchenkov,
Group Manager

The real estate industry is notoriously slow when it comes to getting on the innovation bandwagon, but in the past 20 years, thanks to proptech (short for “property technology”), huge strides have been made in sophisticated consumer-facing platforms that help users find homes. In particular, Zillow, Redfin, and Opendoor have risen from startup status to genuine trailblazers when it comes to real estate data collection and processing solutions — to the point where each company has become a household name.

At the heart of many proptech startups are PPIs (Property Price Indexes). PPIs function in different ways, depending on the context, but overall it’s helpful to think of them as demonstrating things like which asset classes in real estate (e.g., condominiums, single-family homes, offices, hotels, land) are going up in price, where the highest transaction volume is geographically, and how seasonality affects transaction volume in a particular market. 

PPIs are real estate big data. If you’re planning on hiring a team to build a PPI, here are a few things you need to know before you roll up your sleeves.


Layers behind PPIs 

My primary focus as a data scientist in proptech has been on stratified transaction-based property price indexes, which derive from real estate transactions that have already happened. To give you an idea of what my team does: We help build PPIs by building all of the application layers, verifying each data processing decision with the client’s subject matter experts. 

Building a transactional data ingestion layer demands researching all available data sources (with the help of data brokers and industry experts on our team), finding the ones that can get us where we need to go, and setting up data loading jobs. Naturally, given the recurrent nature of these jobs, appropriate orchestration and data warehousing are non-negotiable. 
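To make that concrete, here is a minimal sketch of what a single loading job in the ingestion layer might look like. The source URL, column names, and warehouse path are placeholders, not anything from a real project; in practice each source gets its own job, scheduled by an orchestrator such as Airflow or Dagster.

```python
import pandas as pd

# Hypothetical source URL and warehouse path for one data source.
SOURCE_URL = "https://example.com/county_sales_2023.csv"
WAREHOUSE_PATH = "warehouse/raw/county_sales_2023.parquet"


def load_transactions() -> pd.DataFrame:
    """Pull a raw transaction extract and land it in the warehouse untouched."""
    raw = pd.read_csv(SOURCE_URL, parse_dates=["sale_date"])
    # Store the extract as-is; cleaning and integration happen in later layers.
    raw.to_parquet(WAREHOUSE_PATH, index=False)
    return raw


if __name__ == "__main__":
    df = load_transactions()
    print(f"Loaded {len(df)} transactions")
```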

Then there’s the transactional data integration layer. This layer is notoriously difficult to implement because developing data integration pipelines (combining transactional attributes with real estate property attributes) requires deep domain understanding along with detailed, ongoing data quality assessment and anomaly detection. Not only that, but it requires domain-specific solutions for dealing with the absence of required information from the available sources. 
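As an illustration (not production code), a simplified integration step might join the two record sets and flag obvious quality problems. The parcel_id and sale_price columns, and the quality thresholds, are assumptions made for the sake of the example.

```python
import pandas as pd


def integrate(transactions: pd.DataFrame, properties: pd.DataFrame) -> pd.DataFrame:
    """Join transactional and property attributes, flagging basic quality issues."""
    merged = transactions.merge(properties, on="parcel_id", how="left", indicator=True)

    # Basic quality checks: unmatched parcels and implausible sale prices.
    merged["missing_property_attrs"] = merged["_merge"] == "left_only"
    merged["suspect_price"] = ~merged["sale_price"].between(1_000, 50_000_000)

    unmatched = merged["missing_property_attrs"].mean()
    if unmatched > 0.05:  # tolerate a small share of unmatched records
        print(f"Warning: {unmatched:.1%} of transactions lack property attributes")

    return merged.drop(columns="_merge")
```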

Absence of data is where AI comes in. For instance, imagine that you’re working with a set of recent sales of newly constructed houses, but the information you need about the attributes of these properties isn’t yet processed or accessible through government offices or other official channels (typically it takes about six months for that to happen). AI can bridge the gap and estimate these attributes, so we’re not held up by delays in processing elsewhere. 
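Here is a minimal sketch of that idea, assuming hypothetical attribute columns like living_area_sqft and lot_size_sqft: train a model on transactions whose attributes are already recorded, then use it to estimate the attribute for properties that haven’t yet made it into official records.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def estimate_living_area(df: pd.DataFrame) -> pd.DataFrame:
    """Estimate a missing property attribute (living area) from records that have it."""
    features = ["lot_size_sqft", "bedrooms", "bathrooms", "year_built"]
    df = df.copy()
    df["living_area_is_estimated"] = df["living_area_sqft"].isna()

    known = df[~df["living_area_is_estimated"]]
    missing = df[df["living_area_is_estimated"]]
    if missing.empty:
        return df

    # Train on transactions whose attributes have already been recorded,
    # then predict for newly built properties not yet in the registry.
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(known[features], known["living_area_sqft"])
    df.loc[missing.index, "living_area_sqft"] = model.predict(missing[features])
    return df
```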

Then there’s the transactional data processing layer, which filters out data based on business requirements and generates internal transactional attributes and meta-features. Typically, there are very specific requirements in terms of property types (like just residential sales or just commercial properties), and transaction types that determine things like whether or not you include sales out of foreclosure. 

Business rules also apply to dealing with things like batch and portfolio sales. Developing a codebase as a part of the transactional data processing layer that can accurately handle all these intricacies and variables demands a tremendous effort from everyone — both on our team and from the client — including business analysts and data analytics and data engineering experts. 
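A toy version of such a rule set is below; the property types, transaction types, and column names are illustrative assumptions rather than a universal index definition.

```python
import pandas as pd

# Hypothetical business rules: residential, arm's-length, single-parcel sales only.
ALLOWED_PROPERTY_TYPES = {"single_family", "condominium"}
EXCLUDED_TRANSACTION_TYPES = {"foreclosure", "sheriff_sale", "quitclaim"}


def apply_business_rules(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the transactions that the index definition allows."""
    keep = (
        df["property_type"].isin(ALLOWED_PROPERTY_TYPES)
        & ~df["transaction_type"].isin(EXCLUDED_TRANSACTION_TYPES)
        # Batch/portfolio sales bundle several parcels under one price,
        # so in this sketch they are excluded rather than apportioned.
        & (df["parcels_in_sale"] == 1)
    )
    return df[keep].copy()
```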

The index generation layer includes solutions for turning transactions into indexes and  typically includes a significant amount of time-series processing and trend-generation solutions, which are often proprietary and tailored to achieve a desired level of volatility.  (In this context, “volatility” refers to the degree of variation or fluctuation in the index value. Speaking in simple terms, PPIs may be more or less smooth. This smoothness depends on a variety of factors, such as the number of records in the transaction population, trend generation algorithms, etc.)
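The actual trend-generation algorithms are usually proprietary, but a stripped-down sketch of the idea (a median sale price per month, smoothed with a rolling window and rebased to 100) looks something like this; the column names and base period are assumptions.

```python
import pandas as pd


def build_index(df: pd.DataFrame, base_period: str = "2020-01") -> pd.Series:
    """Turn a cleaned transaction population into a simple smoothed price index."""
    monthly = (
        df.assign(period=df["sale_date"].dt.to_period("M"))
          .groupby("period")["sale_price"]
          .median()
    )

    # A three-month rolling median trades timeliness for a smoother, less
    # volatile trend; production trend generation is far more involved.
    smoothed = monthly.rolling(window=3, min_periods=1).median()

    # Rebase so the index equals 100 in the chosen base period.
    return 100 * smoothed / smoothed.loc[base_period]
```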

Finally there’s the index analytics layer, which turns real estate indexes into valuable insights. Generally speaking, a PPI itself is a valuable resource for gaining insights into the real estate market's performance. However, beyond the primary index, there are additional sub-indexes linked to it, which provide essential information on price fluctuations as well as  transaction volumes for various property types, property age, and various other measures. 

Due to their ability to reveal significant trends, these sub-indexes are highly sought after in the real estate industry because businesses can identify emerging patterns that can potentially reshape their strategies within the market.
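For illustration, a sub-index table can start as something as simple as a group-by over the same transaction population; again, the column names here are assumed for the example.

```python
import pandas as pd


def build_sub_indexes(df: pd.DataFrame) -> pd.DataFrame:
    """Produce per-property-type price and volume series alongside the main index."""
    grouped = df.assign(period=df["sale_date"].dt.to_period("M")).groupby(
        ["period", "property_type"]
    )
    return grouped.agg(
        median_price=("sale_price", "median"),
        transaction_volume=("sale_price", "size"),
    ).reset_index()
```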


Strategies for successfully building PPIs

Working on a PPI is both exciting and challenging at the same time, and if you’re considering building one, here are a few things you need to be aware of when you start planning and hiring your team.

Be prepared for the complexity of data research

Providing valid insights during the data research process is extremely challenging because it requires the technical ability to investigate a large number of diverse data sets (e.g., shapefiles, a geospatial vector data format; GeoJSON documents; CSV and .dat files; images; and aerial shots), as well as a degree of intuition and knowledge about the processes behind transforming land into houses and neighborhoods. 
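As a small, hedged example of what that investigation looks like in practice, here is how you might load a parcel shapefile and a subdivision GeoJSON with geopandas and join them spatially. The file paths and layer contents are hypothetical.

```python
import geopandas as gpd

# Hypothetical file paths; the same read_file API handles shapefiles and GeoJSON.
parcels = gpd.read_file("data/county_parcels.shp")
subdivisions = gpd.read_file("data/subdivisions.geojson")

# Attach subdivision attributes to each parcel with a spatial join, after
# making sure both layers share the same coordinate reference system.
subdivisions = subdivisions.to_crs(parcels.crs)
parcels_with_subdivisions = gpd.sjoin(
    parcels, subdivisions, how="left", predicate="within"
)
```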

Plat books, plat maps, subdivision registration, how subdivisions are divided into parcels and sold as vacant land, permits for new construction, demolition, and reconstruction — developers and data scientists working on a PPI need to know those sorts of things like the back of their hand. 

Also, be realistic about just how long it can take to identify, obtain, and wrangle the data that are at the very heart of proptech analytics. If, for example, you’re working on construction year analysis for a particular region, it might take literally months to deal with all sorts of data sets, get your arms around delays in reporting, and identify missing data for new construction, demolition, reconstruction, and county-wise reappraisals.

Adhere to data contracts

Data contract violation (the failure to stick to protocols around sharing data among parties) creates problems in any industry, but in proptech it typically manifests as changes to the format of the data, diminished data quality, or delays in delivering datasets. 

All too often in real estate data collection, these roadblocks occur in tandem, without any one party notifying another, so there’s a reason why data contract violation is one of the hottest topics in proptech right now.
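One practical defense is to make part of the contract executable: a lightweight schema check that runs before a delivery is accepted. The columns and dtypes below are a hypothetical contract for a single source, not a standard of any kind.

```python
import pandas as pd

# Hypothetical contract for one data source: expected columns and dtypes.
EXPECTED_SCHEMA = {
    "parcel_id": "object",
    "sale_date": "datetime64[ns]",
    "sale_price": "float64",
    "property_type": "object",
}


def check_contract(df: pd.DataFrame) -> list[str]:
    """Return a list of contract violations instead of failing silently downstream."""
    violations = []
    for column, dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            violations.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            violations.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    return violations
```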

Zero in on the right technical skill sets

Then there are the technical skill sets needed if you’re doing data research: proficiency with Python and its libraries for table processing (e.g., pandas and Spark), expertise in GIS, an understanding of coordinate reference systems, and the ability to produce advanced data analytics — proptech analytics — rapidly. Without those skills, devs and data scientists can’t get a comprehensive view of the data. 

Overall proficiency in tabular data manipulation at scale is a must-have for avoiding technical traps during data research and production pipeline development. Having a cloud development expert on the team also goes a long way toward keeping infrastructure costs under control.
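To give a flavor of what “tabular data manipulation at scale” means here, a quick profiling pass in PySpark might look like the sketch below; the warehouse path and column names are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ppi-data-research").getOrCreate()

# Hypothetical warehouse location; partitioned parquet keeps county-level
# extracts queryable without loading everything into memory.
transactions = spark.read.parquet("warehouse/raw/transactions/")

# Quick profiling pass: record counts and approximate median prices per county and year.
profile = (
    transactions
    .groupBy("county_fips", F.year("sale_date").alias("sale_year"))
    .agg(
        F.count("*").alias("n_transactions"),
        F.expr("percentile_approx(sale_price, 0.5)").alias("median_price"),
    )
)
profile.show()
```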

Make sure your code architecture is scalable

No matter what you build, you need to be able to scale it effectively across markets. There are country-level indexes, of course, but in my work, I’ve focused on county-level PPIs, which provide detailed insights into proptech analytics and local real estate markets. 

But there are 3,142 counties in the US, and each one has a non-standardized way of managing real estate transactions and storing information on properties (a proptech big data problem if there ever was one), so if you’re planning to scale your solution across counties, then you have to plan carefully how you’re going to support scaling with the right technology. 

For example, you need to develop a coherent approach to code development, with well-designed abstraction layers for inheritance. Data and backend engineers have to fight an ever-growing number of county-specific requirements, which makes maintaining a good codebase difficult.
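One pattern that can help (sketched below with hypothetical class, file, and column names) is a shared base class that owns the common pipeline steps, with one thin subclass per county that encapsulates that county’s quirks.

```python
import pandas as pd


class CountyTransactionSource:
    """Base class: shared pipeline steps live here, county quirks in subclasses."""

    county_fips: str

    def extract(self) -> pd.DataFrame:
        raise NotImplementedError

    def standardize(self, raw: pd.DataFrame) -> pd.DataFrame:
        """Map county-specific columns onto the shared schema."""
        raise NotImplementedError

    def run(self) -> pd.DataFrame:
        df = self.standardize(self.extract())
        return df.assign(county_fips=self.county_fips)


class CookCountySource(CountyTransactionSource):
    """Hypothetical example: one subclass per county data format."""

    county_fips = "17031"

    def extract(self) -> pd.DataFrame:
        return pd.read_csv("data/cook_county_sales.csv")

    def standardize(self, raw: pd.DataFrame) -> pd.DataFrame:
        return raw.rename(columns={"PIN": "parcel_id", "SALE_AMT": "sale_price"})
```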

Hire devs with great communication skills

Building and maintaining PPIs also demands being comfortable making decisions and working with subject matter experts who can validate them. Experienced engineers grasp the high-level requirements on their own and ask experts to validate only the details of implementation, such as exactly which property types to filter out of a data source. 

But the real challenge is that subject matter experts tend to not have a lot of time, so  you need to be really strict about your communication protocols. In proptech and PPI projects I’ve worked on, we’ve had specific guidelines for how discussions were structured and decisions were made. 

For example, if an engineer needed to validate a certain filtering decision, they provided detailed information on the filter’s function, its implications, and its consequences for the transaction population, and shared it on our online whiteboard (we use Miro) so everyone could see it before the decision was made. Reality check: Even with those protocols in place, it can take months to get what you need from subject matter experts.

Sound communication protocols are part and parcel of good soft skills overall. After all, devs always need to be good at explaining complex technical concepts to non-technical team members to make sure that everyone understands the implications of decisions. Projects like these are almost always a group effort, too, so healthy collaboration among researchers, engineers, and business people is vital.

Ultimately, as is the case with many software development projects, it’s all about being solution-oriented, rather than technology-oriented — and that’s easier said than done. For data scientists and data engineers who work in proptech science and proptech analytics and are passionate about the very latest technologies, architectures, and frameworks, it’s natural to be technology-oriented first, but that approach doesn’t always work well when it comes to actually solving business problems. 

A data team could, for example, be so enchanted with deep learning that they build a solution steeped in deep learning models and fail to consider how maintaining those technologies could be a burden to the business for years to come. 

Working on PPIs, especially in proptech startups, can feel like riding in a car while building it, figuring out what the missing parts are as you go, constantly moving forward yet routinely pumping the brakes, then putting your foot on the gas to test how fast you can go. That tension is the best part of building disruptive technology in real estate: It’s bumpy and chaotic but exciting.
