Dr Bill Mansfield
https://www.linkedin.com/in/billamansfield
Inaccuracies In SIC: So What?
The Standard Industrial Classification (SIC) system is designed to classify businesses and other enterprises within a standard framework, in order to record the activity undertaken by each entity. For businesses, SIC is held as a record for each individual firm at Companies House. The data is used for a wide range of important purposes, from informing UK economic policy to influencing access to grants, finance and business insurance at the individual firm level.
However, SIC codes are often inaccurate: SIC categories were last updated in 2007, and the self-selection nature of the system (businesses choose their own codes) lacks any real ‘stick’ or ‘carrot’ to ensure codes are accurate.
But do inaccuracies in SIC really matter? Is this merely a semantic issue of little interest outside the world of data science?
The reality is that SIC does matter. It has huge influence in UK corporate life, because it is used widely for pricing insurance, assessing credit applications in banks, selecting risk, and qualifying for grants and other support from public bodies (for example, SIC codes were used to support business grant payments during the pandemic). SIC is also important at the macro-economic level, where it is used widely for sectoral economic analysis and hence informs economic policy. SIC matters, and as such it needs to be accurate and up to date.
This paper reports on research designed to quantify the level of accuracy in official SIC records, a surprisingly under-researched area. We were pleased to support a Master's study at the University of Essex, in the School of Mathematics, Statistics and Actuarial Science (SMSAS). The research used advanced data science to test how closely the actual activities undertaken by firms in Essex, as described on their websites, matched their ‘official’ SICs.
This article adds empirical weight to the growing body of evidence that SIC is of severely limited accuracy. The research revealed that 30% of the Essex sample had SICs that differed from their actual activities at the broadest (Section) level, rising to just over 40% at the most detailed level. The finding highlights the need to improve the way we understand, classify and record enterprise activity.
How does inaccuracy in SIC arise?
How Big Is The Problem?
Inaccuracies in SIC have been noted by prominent bodies over the years (see text box below). But given the importance of SIC, it is surprising that there has not been louder criticism. Part of the problem is that there has been little effort to quantify the extent of the inaccuracy.
The research with the University of Essex gives a rare perspective on the issue. The research uses data science techniques to analyse enterprises across a UK region, in order to quantify the match between registered SIC codes and SIC codes implied by activities described in websites.
Official Recognition: Shortcomings in SIC
Over the last decade, SIC has received generally very little attention. However, two official developments stand out, both of which acknowledge shortcomings in SIC.
This assessed the UK’s future statistics needs, including addressing the challenges of measuring the modern economy. The review acknowledged that ‘the changing structure of the economy means that SIC will constantly lag reality, under-representing newer industries and over-representing ones that are declining in importance’. However, the recommendations of the review were mainly to do with governance and structure of the UK government’s statistical infrastructure, leaving SIC improvement largely unaddressed.
SIC in the UK has to fit into global (and European) business classification frameworks, to ensure that international aggregation and comparison of data is possible. ISIC (International Standard Industrial Classification) is the global industry classification that sits at the very top of the tree, maintained by the United Nations Statistics Division. This sets the global framework, within which NACE (Nomenclature statistique des activités économiques dans la Communauté européenne) fits as the EU framework. Our own national UK SIC framework in turn fits into NACE.
It was recognised by the UN some years ago that the ISIC framework was out of date, given that its last review was as long ago as 2006. The revised framework was eventually confirmed in 2022, designed to reflect:
In response, NACE identified an additional 36 detailed activities – taking the total from 615 to 651.
Given that the last revision was in 2006, 17 years ago, an additional 6% is hardly radical. Implementation, meanwhile, is projected for 2025 at the earliest.
Insofar as there is recognition that SIC is broken, the effort to fix it appears narrow in scope and to be moving very slowly. Moreover, whilst modernisation of the classification framework is happening to a limited degree, shortcomings in the way the framework is actually deployed remain ignored.
The Essex Research
The research with the University of Essex was simple in design. Per the graphic above, it identified independent companies in Essex, ‘read’ their websites, matched the services and activities they described to the ‘best fit’ SIC category, and then compared the result with the actual SIC held at Companies House (the ‘official’ source for SIC).
Input Data. B2B data were extracted, covering standard identifiers, official SIC code (at the most detailed as well as more general levels) and website addresses.
Web Scraping. This involved automated extraction of content and data from websites, using the ‘Beautiful Soup’ package in Python.
Activity Prediction. ‘Natural Language Processing’ (NLP) was used to summarise the scraped website data into descriptions of business activity, using ‘Hugging Face’ for the transformation.
Information Retrieval. Descriptions were tagged to the closest SIC code, using Universal Sentence Encoder (USE) to convert the text to vectors, from which the closest fit to a SIC description was calculated.
Predicted Data. The output was a list of predicted SICs for Essex enterprises based on their website information, scraped and classified via this four step methodology. This was compared with official SICs, and subjected to validation checks (see below).
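The scraping and retrieval steps can be illustrated with a minimal sketch. It is not the study's code: Python's standard-library HTML parser stands in for Beautiful Soup, and a simple bag-of-words cosine similarity stands in for the Hugging Face summarisation and Universal Sentence Encoder embeddings, so that the example runs without extra packages. The function names, the sample web page and the wording of the three SIC descriptions are illustrative assumptions.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Illustrative candidate codes; description wording is expanded for the demo.
SIC_DESCRIPTIONS = {
    "43110": "demolition of buildings and structures",
    "43310": "plastering of walls and ceilings",
    "43342": "glazing and installation of glass windows",
}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def vectorise(text):
    """Bag-of-words vector (a crude stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_sic(html):
    """Return the candidate SIC code whose description best fits the page."""
    page_vec = vectorise(extract_text(html))
    scores = {code: cosine(page_vec, vectorise(desc))
              for code, desc in SIC_DESCRIPTIONS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


sample_html = """<html><head><style>p {color: red}</style></head>
<body><h1>Smith and Sons</h1>
<p>Specialist plastering of walls and ceilings in Essex.</p>
<script>var glazing = 1;</script></body></html>"""

predicted_code, score = predict_sic(sample_html)
```

Word counts treat common words such as ‘and’ or ‘of’ as evidence of similarity, which is one reason a sentence encoder is the more robust choice for real website text.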
Sample Selection
The study selected independent, established businesses and not-for-profits in the UK county of Essex with more than 10 but fewer than 200 employees and turnover of at least £250k. This group was chosen because it represented a large sample (nearly 2,200 enterprises), with the size and standing to have built a website, a critical ‘shop window’ for the business describing its activities. Independent businesses were selected (as opposed to subsidiaries and affiliates of larger firms) in order to obtain activity data for the specific entity, avoiding the risk of ‘pulling’ data that related to the corporate parent or other parts of the group.
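As a rough illustration, the selection criteria can be written as a simple filter. The record layout, field names and example enterprises below are hypothetical; the B2B data feed used in the study will have had its own schema.

```python
from dataclasses import dataclass


@dataclass
class Enterprise:
    # Hypothetical fields, chosen only to express the stated criteria.
    name: str
    county: str
    employees: int
    turnover_gbp: int
    is_independent: bool


def in_sample(e: Enterprise) -> bool:
    """Apply the criteria: Essex, independent, 11-199 staff, turnover >= £250k."""
    return (
        e.county == "Essex"
        and e.is_independent
        and 10 < e.employees < 200
        and e.turnover_gbp >= 250_000
    )


records = [
    Enterprise("Alpha Plastering Ltd", "Essex", 25, 900_000, True),
    Enterprise("Beta Micro Ltd", "Essex", 8, 400_000, True),          # too few employees
    Enterprise("Gamma Subsidiary Ltd", "Essex", 50, 2_000_000, False),  # not independent
    Enterprise("Delta Glazing Ltd", "Kent", 40, 1_200_000, True),     # outside Essex
    Enterprise("Epsilon Charity", "Essex", 199, 250_000, True),
]

sample = [e.name for e in records if in_sample(e)]
```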
Matching Success Rate
The research succeeded in allocating Essex businesses to SIC codes using website data: over 2,000 businesses (93% of the total) had websites that yielded the data needed for SIC prediction. However, as described below, a number of website addresses were incorrect in the feed taken from a major B2B data vendor, which required correction in the results.
Accuracy Verification and Data Cleansing
Accuracy of the methodology was assessed by randomly validating matches. A number of issues were identified across various stages of the methodology:
Results quoted in this paper are cleansed and adjusted for these issues.
Results of The Research: Inaccurate SICs
Headlines from the research were:
This finding suggests that a concerningly large number of enterprises in the UK have official SICs that differ significantly from what their own websites say they actually do.
The research was particularly striking in suggesting that where SICs are incorrect they tend to be ‘way off target’ (30% of SICs in the wrong Section) rather than ‘just a bit off target’ (inaccuracy at the most detailed Class level added just over 10% more to the unmatched total, taking it to just over 40%).
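The distinction between a Section-level and a Class-level mismatch can be made concrete with a short sketch. It assumes that ‘Class’ means the four-digit level of UK SIC 2007 and derives the Section letter from the two-digit division using the standard section ranges. The four official/predicted pairs are invented for illustration and are not the study's data.

```python
# Division ranges for each UK SIC 2007 section letter.
SECTION_RANGES = {
    "A": (1, 3), "B": (5, 9), "C": (10, 33), "D": (35, 35), "E": (36, 39),
    "F": (41, 43), "G": (45, 47), "H": (49, 53), "I": (55, 56), "J": (58, 63),
    "K": (64, 66), "L": (68, 68), "M": (69, 75), "N": (77, 82), "O": (84, 84),
    "P": (85, 85), "Q": (86, 88), "R": (90, 93), "S": (94, 96), "T": (97, 98),
    "U": (99, 99),
}


def section_of(code):
    """Map a five-digit SIC code to its section letter via its division."""
    division = int(code[:2])
    for letter, (low, high) in SECTION_RANGES.items():
        if low <= division <= high:
            return letter
    raise ValueError(f"no section for division {division:02d}")


def mismatch_rates(pairs):
    """Return (section-level, class-level) mismatch shares for code pairs."""
    total = len(pairs)
    wrong_section = sum(section_of(o) != section_of(p) for o, p in pairs)
    wrong_class = sum(o[:4] != p[:4] for o, p in pairs)
    return wrong_section / total, wrong_class / total


# (official code, predicted code): invented examples
pairs = [
    ("43310", "43310"),  # exact match
    ("43310", "43342"),  # same Section (F), different Class
    ("62020", "43110"),  # different Section (J vs F)
    ("47110", "56101"),  # different Section (G vs I)
]

section_rate, class_rate = mismatch_rates(pairs)
```

Every Section-level mismatch is also a Class-level mismatch, so the Class-level rate can never be lower; the gap between the two rates is the share of codes that are ‘just a bit off target’.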
The research provided empirical evidence for the inaccuracy of SIC. Whilst it’s well-known that the SIC taxonomy itself is flawed, this study suggests that the way it is deployed is also flawed.
Patterns of Inaccuracy
SIC inaccuracy was more concentrated in certain industries and circumstances:
Does This Matter?
The answer is yes. For example, inaccurate SIC can lead to:
What’s Being Done About The Problem?
The honest answer is not enough. Whilst the issue has been recognised and certain actions initiated (see ‘Official Recognition’ text box above), plans are narrow in scope, timelines lengthy, and resources limited.
Organisations like The Data City (www.thedatacity.com) have stepped into the vacuum, providing real time activity classifications and modern taxonomies. But these private sector alternatives have limited resources, and tend to focus on specific ‘hot’ industry segments such as new economy and net zero related businesses.
More needs to be done:
In Conclusion…
The data science research undertaken with the University of Essex showed that a significant proportion of SIC codes in the sample do not reflect the reality of what these enterprises actually do according to their websites.
At the most aggregated level, 30% of SICs were wrong in this study. At the more detailed level, this level of inaccuracy increased to over 40%.
The taxonomy used for SIC is antiquated, and global and national authorities are ‘modernising’ the framework only slowly.
The deployment of the existing taxonomy tolerates inaccuracy, as there is no incentive or penalty for error.
Organisations such as The Data City have arisen to meet the need for up to date taxonomies and allocation of individual enterprises to real time classification.
The scale of the issue needs to be recognised and quantified at the national and international level.
The consequences of inaccurate SICs feed through to macro-economic analysis and hence policy, deployment of public funds and business support, and private sector access to finance and risk pricing.
Acknowledgements:
I’m indebted to the University of Essex research students, academics and management staff who enabled this research, which was supported with ESRC funding.
Swaroop Pagonda. MSc Applied Data Science, School of Mathematics, Statistics and Actuarial Science, University of Essex
Dr Felipe Maldonado. Lecturer in Data Science & Operational Research, School of Mathematics, Statistics and Actuarial Science, University of Essex
Husam Quteineh, Senior Research Officer at the Business and Local Government (BLG) Data Research Centre, University of Essex
Dr Smruti Bulsari. Research Fellow. Institute of Public Health and Wellbeing, University of Essex
Nigel Kirby. Project Manager, ESRC Business and Local Government Data Research Centre, University of Essex
Laura Brookes. Outreach and Publicity Officer. ESRC Business and Local Government Data Research Centre, University of Essex
I’m also grateful to Alex Craven, Co-Founder and CEO of The Data City, for his encouragement and feedback, and to Andrew Purdy, Data Analyst at The Data City for suggesting analytics additions to the paper.
Appendix
What is an SIC code?
The following short summary is taken from The Data City website:
A Standard Industrial Classification (SIC) code is a five-digit code that’s used to classify a business’ main area of economic activity or in short, what they do. When a company or business is incorporated, you must choose at least one SIC code, however in some cases you can choose as many as four.
SIC codes are broken down by category, with 21 (from A to U) main sections to choose from. These are broader industries such as Manufacturing (Section C), Construction (Section F) and Education (Section P) which are made up of several subcategories or codes. For example, Construction is made up of 25 SIC codes, which includes activities like Demolition (43110), Plastering (43310) and Glazing (43342).
With a total over 600 SIC codes, ranging from the mundane (64192 Building societies, 69203 Bookkeeping activities) to the more obscure (20510 Manufacture of explosives, 01230 Growing of tobacco) – the idea of this extensive segmented SIC code system is to capture the full breadth of the UK economy and give business owners the best possible chance to accurately classify their company.
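The structure described in this summary can be sketched in a few lines. The example assumes the usual UK SIC 2007 hierarchy, in which the first two digits of a five-digit code give the division, the first three the group, the first four the class, and all five the subclass. The function names are illustrative and are not part of any Companies House tool.

```python
import re


def split_sic(code):
    """Split a five-digit SIC code into its hierarchical levels."""
    if not re.fullmatch(r"[0-9]{5}", code):
        raise ValueError(f"not a five-digit SIC code: {code!r}")
    return {
        "division": code[:2],
        "group": code[:3],
        "class": code[:4],
        "subclass": code,
    }


def validate_company_codes(codes):
    """Check that a company lists between one and four well-formed codes."""
    if not 1 <= len(codes) <= 4:
        raise ValueError("a company must list between one and four SIC codes")
    return [split_sic(code) for code in codes]


levels = validate_company_codes(["43310", "43342"])
```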