Is SIC Sick? Applying Data Science to Quantify The Mis-Match Between Official and Real Enterprise Activities In The UK

Dr Bill Mansfield
https://www.linkedin.com/in/billamansfield

 

Inaccuracies In SIC: So What?

The Standard Industrial Classification (SIC) system is designed to classify businesses and other enterprises within a standard framework, in order to record the activity undertaken by each entity.  For businesses, SIC is held as a record for each individual firm at Companies House.  The data is used for a wide range of important purposes, from informing UK economic policy to influencing access to grants, finance and business insurance at the individual firm level.

However, SIC codes are often inaccurate: SIC categories were last updated in 2007, and the self-selection nature of the system (businesses choose their own codes) lacks any real ‘stick’ or ‘carrot’ to ensure codes are accurate.

But do inaccuracies in SIC really matter?  Is this merely a semantic issue of little interest outside the world of data science?  

The reality is that SIC does matter.  It has huge influence in UK corporate life, because it is used widely for pricing insurance, assessing credit applications in banks, selecting risk, and qualifying for grants and other support from public bodies (for example, SIC codes were used to support business grant payments during the pandemic).  SIC is also important at the macro-economic level, where it is used widely for sectoral economic analysis and hence informs economic policy.  SIC matters, and as such it needs to be accurate and up to date.

This paper reports on research designed to quantify the level of accuracy in official SIC records, a surprisingly under-researched area.  We were pleased to support a Masters study at the University of Essex, in the School of Mathematics, Statistics and Actuarial Science (SMSAS).  The research used advanced data science to test how closely the actual activities undertaken by firms in Essex, as described in their websites, matched to their ‘official’ SICs.  

This article adds empirical weight to the growing body of evidence that SIC is of severely limited accuracy.  The research revealed that 30% of the Essex sample had SICs different to their actual activities at the broadest level of SIC, rising to over 40% at the most detailed level.  The finding highlights the need to improve the way we understand, classify and record enterprise activity.

 

How does inaccuracy in SIC arise?

  • The current SIC classification framework dates to 2007 meaning many modern industries do not ‘have a home’ – witness the huge number of enterprises ‘lumped’ into ‘Other’ headings
  • When businesses are first set up and registered at Companies House, there can be inaccuracy on the part of the owner (or their accountant) in selecting the right category from the 600 or so available.  There is no support to select the right category, no verification of the code selected and no regulatory penalty if it is incorrect
  • As the business evolves its activities, the SIC code may not be updated in the annual returns to Companies House to reflect the new category.  Again, there is no official support, verification or penalty to ensure changes are made.

 

How Big Is The Problem?

Inaccuracies in SIC have been noted by prominent bodies over the years (see text box below).  But given the importance of SIC, it is surprising that there has not been louder criticism.  Part of the problem is that there has been little effort to quantify the extent of the inaccuracy.

The research with the University of Essex gives a rare perspective on the issue.  The research uses data science techniques to analyse enterprises across a UK region, in order to quantify the match between registered SIC codes and SIC codes implied by activities described in websites.   

 

Official Recognition: Shortcomings in SIC

Over the last decade, SIC has received generally very little attention.  However, two official developments stand out, both of which acknowledge shortcomings in SIC.

  • Independent Review of UK Economic Statistics (Professor Sir Charles Bean), March 2016. 

This assessed the UK’s future statistics needs, including addressing the challenges of measuring the modern economy.  The review acknowledged that ‘the changing structure of the economy means that SIC will constantly lag reality, under-representing newer industries and over-representing ones that are declining in importance’.  However, the recommendations of the review were mainly to do with governance and structure of the UK government’s statistical infrastructure, leaving SIC improvement largely unaddressed.

  • NACE/ISIC Revisions, 2021-2022 and ongoing.

SIC in the UK has to fit into global (and European) business classification frameworks, to ensure that international aggregation and comparison of data is possible.  ISIC (International Standard Industrial Classification of All Economic Activities) is the global industry classification that sits at the very top of the tree, maintained by the United Nations Statistics Division (UNSD).  This sets the global framework, within which NACE (Nomenclature statistique des activités économiques dans la Communauté européenne) fits as the EU framework.  Our own national UK SIC framework in turn fits into NACE.

It was recognised by the UN some years ago that the ISIC framework was out of date, given that its last review was as long ago as 2006.  The revised framework was eventually confirmed in 2022, designed to reflect:

  • new activities gaining importance while others had lost importance
  • rapid and dynamic changes in the information technology environment 
  • increased awareness of the impact of the economy on climate, creating specialized activities to protect the environment.

In response, NACE identified an additional 36 detailed activities – taking the total from 615 to 651.  

Given that the last revision was in 2006, 17 years ago, an additional 6% is hardly radical.  Implementation, meanwhile, is projected for 2025 at the earliest.  

Insofar as there is recognition that SIC is broken, the effort to fix it appears narrow in scope and to be moving very slowly.  Moreover, whilst modernisation of the classification framework is happening to a limited degree, shortcomings in the way the framework is actually deployed remain ignored.

 

The Essex Research

The research with the University of Essex was simple in design.  Per the graphic above, it identified independent companies in Essex, ‘read’ their websites, matched the services and activities they described to the ‘best fit’ SIC category, and then compared this predicted category with the SIC actually held at Companies House (the ‘official’ source for SIC).

Input Data.  B2B data were extracted, covering standard identifiers, official SIC code (at the most detailed as well as more general levels) and website addresses.

Web Scraping.  This involved automated extraction of content and data from websites, using the ‘Beautiful Soup’ package in Python.  
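The text extraction step can be illustrated with a minimal sketch.  The study used Beautiful Soup; to keep the example free of third-party installs, this sketch swaps in Python's standard-library `html.parser` instead, so it is not the study's actual code.  The class name, function name and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping script, style and noscript content."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_visible_text(html: str) -> str:
    """Return the human-readable text of an HTML page as one string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page for an invented firm; a real run would fetch the HTML first.
sample_html = """
<html><head><title>Acme Lifts</title><style>body{color:red}</style></head>
<body><h1>Lift installation and servicing</h1>
<script>trackVisitor();</script>
<p>We install and maintain passenger lifts across Essex.</p></body></html>
"""
page_text = extract_visible_text(sample_html)
print(page_text)
```

Even on a small page, navigation menus, cookie notices and similar text would pass through a filter this simple, which is consistent with the high volumes of non-activity data reported in the verification section.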

Activity Prediction.  ‘Natural Language Processing’ (NLP) was used to summarise the scraped website data into descriptions of business activity, using ‘Hugging Face’ for the transformation.

Information Retrieval.  Descriptions were tagged to the closest SIC code, using Universal Sentence Encoder (USE) to convert the text to vectors to calculate closest fit to SIC. 
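The ‘closest fit’ idea can be sketched as a nearest-neighbour search over vectors using cosine similarity.  The study used Universal Sentence Encoder embeddings; the sketch below swaps in simple word-count vectors so that it runs without any model download, which makes it far cruder than the study's approach.  The three codes appear in the appendix of this paper, but the descriptions are illustrative paraphrases rather than official SIC wording, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative paraphrases, NOT the official SIC 2007 descriptions.
SIC_DESCRIPTIONS = {
    "43110": "demolition of buildings and structures",
    "43310": "plastering of walls and ceilings",
    "43342": "glazing installation of glass windows and mirrors",
}


def vectorize(text: str) -> Counter:
    """Word-count vector: a stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_sic(description: str, sic_descriptions: dict) -> tuple:
    """Return (code, similarity) for the SIC description closest to the text."""
    query = vectorize(description)
    scores = {
        code: cosine(query, vectorize(text))
        for code, text in sic_descriptions.items()
    }
    best_code = max(scores, key=scores.get)
    return best_code, scores[best_code]


code, score = best_sic(
    "We supply and fit glass windows and mirrors for homes", SIC_DESCRIPTIONS
)
print(code, round(score, 2))
```

Word counts only match on shared vocabulary, whereas sentence embeddings aim to capture similar meaning expressed in different words; that difference matters when firms describe themselves in marketing language rather than SIC terminology.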

Predicted Data.  The output was a list of predicted SICs for Essex enterprises based on their website information, scraped and classified via the methodology described above.  This was compared with official SICs, and subjected to validation checks (see below).

 

Sample Selection

Independent, established businesses and not-for-profits in the UK county of Essex were selected for the study, limited to those with more than 10 but fewer than 200 employees and turnover of at least £250k.  This group was selected because it represented a large sample (nearly 2200 enterprises), with the size and standing to have built a website, a critical ‘shop window’ for the business describing its activities.  Independent businesses were selected (as opposed to subsidiaries and affiliates of larger firms) in order to obtain activity data for the specific entity, avoiding the risk of ‘pulling’ data that related to the corporate parent or other parts of the group.
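The selection criteria can be expressed as a simple filter.  This is a minimal sketch: the field names, the `Enterprise` record and the example firms are hypothetical, and the B2B data feed used in the study will have had its own schema.

```python
from dataclasses import dataclass


@dataclass
class Enterprise:
    name: str
    employees: int
    turnover_gbp: float
    is_independent: bool  # False for subsidiaries and affiliates of larger firms


def in_study_sample(e: Enterprise) -> bool:
    """Apply the stated criteria: independent, 11 to 199 staff, turnover >= £250k."""
    return (
        e.is_independent
        and 10 < e.employees < 200
        and e.turnover_gbp >= 250_000
    )


# Invented records, for illustration only.
candidates = [
    Enterprise("Hypothetical Lifts Ltd", 45, 3_200_000, True),
    Enterprise("Tiny Studio Ltd", 4, 300_000, True),         # too few employees
    Enterprise("Group Subsidiary Ltd", 120, 9_000_000, False),  # not independent
    Enterprise("Low Turnover CIC", 15, 120_000, True),       # turnover too low
]
sample = [e for e in candidates if in_study_sample(e)]
print([e.name for e in sample])
```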

 

Matching Success Rate

The research was successful in allocating Essex businesses to SIC codes using website data: over 2000 businesses (93% of the total) had websites that yielded the data needed for SIC prediction.  However, as described below, a number of website addresses were incorrect in the feed taken from a major B2B data vendor, which required correction in the results.

 

Accuracy Verification and Data Cleansing

Accuracy of the methodology was assessed by randomly validating matches.  A number of issues were identified across various stages of the methodology:

  • Input data.  A major global data provider was used to extract SIC and website data.  We were surprised to find that 22% of websites were incorrect in the feed.  Subsequent dialogue with experts in B2B data suggested that this rate of inaccuracy is not unusual in feeds from major B2B data providers.  
  • Web scraping.  This was not successful in all cases, and often resulted in high volumes of non-activity related data. This was not a surprise given the highly variable nature of websites, the large volume of textual data they carry and the time constraints in the study. 
  • Activity prediction.  Extraction of specific text related to the firm’s business activity was not always successful, again not surprising given factors just described.
  • Information retrieval.  Matching to SIC descriptors was misaligned in a number of cases: although an advanced NLP technique was used, there was not the time to refine and redeploy the analytics. 
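Random validation of matches, as described at the start of this section, can be made reproducible by fixing the random seed, so that the same records can be re-checked later.  The sketch below is illustrative only: the paper does not state how many matches were validated, so the sample size, seed and identifiers are all hypothetical.

```python
import random


def draw_validation_sample(match_ids, sample_size, seed=42):
    """Pick a reproducible random subset of matches for manual checking."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    ids = list(match_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


# Invented identifiers standing in for the matched enterprises.
match_ids = [f"ENT-{i:04d}" for i in range(1, 2001)]
to_review = draw_validation_sample(match_ids, sample_size=50)
print(to_review[:5])
```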

 

Results quoted in this paper are cleansed and adjusted for these issues.

 

 

Results of The Research: Inaccurate SICs

Headlines from the research were:

  • ‘Section Level’ of SIC.  At the ‘Broadest’ level of SIC, 30% of the Essex extract did not match the SIC implied by the activity defined in the website  
  • ‘Class Level’ of SIC.  At the ‘Detailed’ level of SIC, over 40% did not match. 

 

This finding suggests that a concerningly large number of enterprises in the UK have official SICs that differ significantly from the activities they describe on their own websites.

The research was particularly striking in suggesting that where SICs are incorrect they tend to be ‘way off target’ (30% of SICs in wrong Section) rather than ‘just a bit off target’ (inaccuracy at the most detailed Class level added just over 10% more to the unmatched total, to just over 40%).  
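The relationship between the two headline figures can be illustrated with a short sketch that compares official and predicted codes at both levels.  It assumes five-digit UK SIC 2007 codes, where the first two digits give the division (which determines the lettered Section) and the first four digits give the Class.  The code pairs are hypothetical and chosen only to exercise the logic; they are not data from the study.

```python
# UK SIC 2007 sections as inclusive division ranges.
SECTION_BY_DIVISION_RANGE = [
    ("A", 1, 3), ("B", 5, 9), ("C", 10, 33), ("D", 35, 35), ("E", 36, 39),
    ("F", 41, 43), ("G", 45, 47), ("H", 49, 53), ("I", 55, 56), ("J", 58, 63),
    ("K", 64, 66), ("L", 68, 68), ("M", 69, 75), ("N", 77, 82), ("O", 84, 84),
    ("P", 85, 85), ("Q", 86, 88), ("R", 90, 93), ("S", 94, 96), ("T", 97, 98),
    ("U", 99, 99),
]


def section_of(sic_code: str) -> str:
    """Map a five-digit SIC code to its section letter via the division."""
    division = int(sic_code[:2])
    for letter, low, high in SECTION_BY_DIVISION_RANGE:
        if low <= division <= high:
            return letter
    raise ValueError(f"No section for division {division:02d}")


def class_of(sic_code: str) -> str:
    """The four-digit Class is the first four digits of the code."""
    return sic_code[:4]


def mismatch_rates(pairs):
    """Return (section_rate, class_rate) for (official, predicted) pairs."""
    total = len(pairs)
    section_misses = sum(section_of(o) != section_of(p) for o, p in pairs)
    class_misses = sum(class_of(o) != class_of(p) for o, p in pairs)
    return section_misses / total, class_misses / total


# Hypothetical (official, predicted) pairs.
pairs = [
    ("47640", "93120"),  # different section (G vs R)
    ("28221", "43290"),  # different section (C vs F)
    ("43310", "43342"),  # same section F, different class
    ("43110", "43110"),  # exact match
    ("69201", "69202"),  # same class 6920, so a match at class level
]
section_rate, class_rate = mismatch_rates(pairs)
print(f"Section mismatch: {section_rate:.0%}, Class mismatch: {class_rate:.0%}")
```

Because a Class belongs to exactly one Section, every Section-level mismatch is also a Class-level mismatch.  The Class-level rate can therefore never be lower than the Section-level rate, and the gap between the two measures the codes that are ‘just a bit off target’.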

 

The research provided empirical evidence for the inaccuracy of SIC.  Whilst it’s well-known that the SIC taxonomy itself is flawed, this study suggests that the way it is deployed is also flawed. 

 

 

Patterns of Inaccuracy

SIC inaccuracy was more concentrated in certain industries and circumstances:

  • Longer established, ‘evolving’ firms, reinvented into ‘new economy’.  Traditional SICs have not been updated, either because of the effort involved or because appropriate new categories don’t exist.  Examples include:
    • An HR consultancy (business services) now delivering HR IT systems so clients can source their own data and insights
    • A butyl manufacturer reinvented into an environmental solutions company (still based on butyl) supporting net zero
    • A heating retailer now installing renewable energy solutions
    • A musical instrument maker now delivering environmental acoustics and noise reduction consultancy.  

 

  • ‘Once Upon a Time Manufacturers’ who now source externally.  These firms still deliver the product or service they always have, but in the modern era of global supply chains they no longer make their own goods or materials:
    • A furniture manufacturer who now only designs and fits out
    • An elevator manufacturer who now only installs and services 
    • The butyl manufacturer (above) now working with outsourced materials.

 

  • ‘Retailers who were never really retailers’. These firms appear to have ‘sloppy’ SIC selection – whilst services or products are ‘sold’ that doesn’t in itself make the firm a retailer:
    • A golf club that classifies itself as a sports equipment retailer
    • A hirer of silent disco equipment that classifies itself as a communication equipment shop
    • A stone floor and worksurface designer and installer that classifies itself as a retailer.

 

  • ‘Sloppy SIC selection, sometimes to the point of being surreal’.  
    • A dental equipment supplier that presents itself as a dental practice
    • An outdoor sit-on miniature steam railway that sees itself as offering professional business support activities.

 

 

Does This Matter?

The answer is yes.  For example, inaccurate SIC can lead to:

  • Inaccurate economic statistics, misrepresenting the true make-up of businesses and other enterprises
  • ‘Blind spots’ on crucial newer parts of the economy (eg advanced tech, environment and wider ESG based enterprises)
  • Waste in B2B marketing campaigns
  • Policy inattention in sectors that appear minor but in reality are important (eg in the pandemic, the events industry struggled to demonstrate to government its true extent in part because of SIC imprecision) 
  • Weakened data for economic policy decisions
  • Inaccurate risk assessment in banking and insurance, resulting in mispricing, even exclusion, for misclassified businesses.

 

 

What’s Being Done About The Problem?

The honest answer is not enough.  Whilst the issue has been recognised and certain actions initiated (see ‘Official Recognition’ text box above), plans are narrow in scope, timelines lengthy, and resources limited.

Organisations like The Data City (www.thedatacity.com) have stepped into the vacuum, providing real time activity classifications and modern taxonomies.   But these private sector alternatives have limited resources, and tend to focus on specific ‘hot’ industry segments such as new economy and net zero related businesses.

More needs to be done:

  • More comprehensive research needs to be undertaken as to the extent and pattern of the problem.  The study reported in this paper was based on an Essex sample; much deeper insight would be achieved if the research was scaled up to the national level.
  • The international perspective remains unexplored, and it is likely that similar issues exist in other territories, eg the USA with regard to NAICS data.
  • The SIC taxonomy used in the UK is out of date, fails to reflect important newer industries, and is being updated only very slowly.
  • There are no real incentives or penalties for ensuring SIC codes are correct, and updated in line with changing business activities.
  • The issue is largely ignored in the public domain, and where it has been acknowledged (eg in Sir Charles Bean’s report of 2016) there has been no follow through.

 

 

In Conclusion…

The data science research undertaken with the University of Essex showed that a significant proportion of SIC codes in the sample do not reflect the reality of what these enterprises actually do according to their websites.

At the most aggregated level, 30% of SICs were wrong in this study.  At the more detailed level, this level of inaccuracy increased to over 40%.

The taxonomy used for SIC is antiquated, and global and national authorities are ‘modernising’ the framework only slowly.

The deployment of the existing taxonomy tolerates inaccuracy, as there is no incentive or penalty for error.

Organisations such as The Data City have arisen to meet the need for up to date taxonomies and allocation of individual enterprises to real time classification.

The scale of the issue needs to be recognised and quantified at the national and international level.

The consequences of inaccurate SICs feed through to macro-economic analysis and hence policy, deployment of public funds and business support, and private sector access to finance and risk pricing. 

 

 

Acknowledgements:

I’m indebted to the University of Essex research students, academics and management staff who enabled this research, which was supported with ESRC funding. 

Swaroop Pagonda. MSc Applied Data Science, School of Mathematics, Statistics and Actuarial Science, University of Essex

Dr Felipe Maldonado. Lecturer in Data Science & Operational Research
School of Mathematics, Statistics and Actuarial Science, University of Essex.

Husam Quteineh, Senior Research Officer at the Business and Local Government (BLG) Data Research Centre, University of Essex

Dr Smruti Bulsari. Research Fellow. Institute of Public Health and Wellbeing, University of Essex

Nigel Kirby.  Project Manager, ESRC Business and Local Government Data Research Centre, University of Essex

Laura Brookes.  Outreach and Publicity Officer.  ESRC Business and Local Government Data Research Centre, University of Essex

I’m also grateful to Alex Craven, Co-Founder and CEO of The Data City, for his encouragement and feedback, and to Andrew Purdy, Data Analyst at The Data City for suggesting analytics additions to the paper.

 

 

Appendix

What is an SIC code?

The following short summary is taken from The Data City website:

A Standard Industrial Classification (SIC) code is a five-digit code that’s used to classify a business’ main area of economic activity or in short, what they do. When a company or business is incorporated, you must choose at least one SIC code, however in some cases you can choose as many as four.

SIC codes are broken down by category, with 21 (from A to U) main sections to choose from. These are broader industries such as Manufacturing (Section C), Construction (Section F) and Education (Section P) which are made up of several subcategories or codes. For example, Construction is made up of 25 SIC codes, which includes activities like Demolition (43110), Plastering (43310) and Glazing (43342).

With a total over 600 SIC codes, ranging from the mundane (64192 Building societies, 69203 Bookkeeping activities) to the more obscure (20510 Manufacture of explosives, 01230 Growing of tobacco) – the idea of this extensive segmented SIC code system is to capture the full breadth of the UK economy and give business owners the best possible chance to accurately classify their company.

 

Case study: mid-sized financial services firm, 100-150 staff

Client questions

  • How big is my market opportunity in the UK small business sector?
  • How does my target market break down into segments, so I can design the right marketing approaches?
  • Who are my best immediate sales prospects?

Beeline B2B provided:

  • Addressable market sizing, in terms of numbers of SMEs and their purchasing power for the client’s products
  • Segmentation of SMEs meeting the client’s target criteria – identifying the ‘sweet spots’ by industry, size, suppliers, financial strength and business maturity
  • Delivery of an extensive prospects list, with a range of profiling data, for use in the client’s direct marketing campaign

Case study: executive development training company

Client questions

  • Which types of organisations across the UK already use the kind of service I offer?
  • Which businesses, public bodies and not for profit enterprises are like these in my local area, so I can target them?

Beeline B2B provided:

  • Identification of organisations already taking the service
  • Analysis of characteristics for those most likely to buy
  • Listing of organisations fitting these characteristics in the client’s local region, with profiling and contact information