Parse Json in Below 2016 SQL Server

Dec 21, 2022

Abstract 

This article describes a TSQL JSON parser and provides the source. It is also designed to illustrate a number of string manipulation techniques and also eliminate the issues while dealing with the JSON document containing special symbols like (“/” , ”-”....) in T-SQL. With it you can do things like this to extract the data from a JSON file or document which contains noise and complexities. 

 

Summary For Implementation

The code for the JSON Parser will run in SQL Server 2005,  and even in SQL Server 2000 (note: some modifications are necessary).

First the function stores all strings in the temporary table, even the name of the elements, since they are 'escapes' in a different way, and may contain, unescaped, brackets, Special Characters which denote objects or lists. These are replaced in the json string by tokens which represent the strings.

After this fetch all the json keywords and values for further processing by using the regular expressions, various string functions and a list of SQL queries and variables to store the values for a particular object.

And at the last function will return a whole table which contains rows and columns with no noise in the values as the other tables in the particular database.  

Figure 1:- Json Input

 

Figure 2:- Function Output

Background

TSQL isn’t really designed for doing complex string parsing which contains special characters and particularly where strings represent nested data structures such as XML, JSON, or XHTML.  

You can do it but it is not a pretty sight; but If you ever want to do it anyway ?

(note You can now do this rather more easily using SQL Server 2016’s built-in JSON support.)

But If the SQL Server version is older or not compatible with the built-in JSON support then you can use this customized function to get the desired output by parsing any type of json document.

 There is so much stuff behind that all happens to you. For example, it could be that DBA doesn’t allow a CLR, or you lack the necessary skills with procedural scripting. Sometimes, there isn’t any application, or you want to run code unobtrusively across databases or servers. 

 The Traditional way for dealing with data like this is to let a separate business layer parse a JSON ‘document’ into some meaningful structure(Like Tree) and then update the database by making a series of calls and lots of sql procedures. This is pretty, but can get more complicated and headache if you need to ensure that the updates to the database are wrapped into one transaction so that if anything goes wrong or any issues occur, then the whole transaction can be rolled back. This is why a TSQL approach has advantages. 

Adjacency list tables have the same structure whatever the data in them. This means that you can define a single Table-Valued  Type and pass data structures around between stored procedures. 

Converting the data to Hierarchical table form will be different for each application, but is easy with a TSQL. You can, alternatively, convert the hierarchical table into JSON and interrogate that with SQL.

 

JSON format

JSON is one of the most popular lightweight markup languages, and is probably the best choice for transfer of object data from a web page.

JSON is designed to be as lightweight as possible and so it has only two structures. The first, delimited by curly brackets, is a collection of Key/value pairs, separated by commas. The key is followed by a colon.

The first snag for TSQL is that the curly or square brackets are not ‘escaped’ within a string, so that there is no way of partitioning a JSON ‘document’ simply. It is difficult to  differentiate a bracket used as the delimiter of an array or structure, and one that is within a string.

The second complication is that, unlike YAML, the datatypes of values can’t be explicitly declared. You have to pass them out from applying the rules from the JSON Specification.

 

Implementation

The JSON outputter is a great deal simpler, since one can be sure of the input, but essentially it does the reverse process, working from the root of the json document to the leaves. The only complication is working out the indent of the formatted output string.

In the implementation, you’ll see a fairly heavy use of PATINDEX.This uses a RegEx. However, it is all we have, and can be pressed into service by chopping the string it is searching (if only it had an optional third parameter like CHARINDEX that specified the index of the start position of the search!). The STUFF function is also important for this sort of string-manipulation work.

CREATE FUNCTION [Platform].[parseJSON] (@JSON NVARCHAR(MAX))
RETURNS @hierarchy TABLE (
	Element_ID INT IDENTITY(1, 1) NOT NULL /* internal surrogate primary key gives the order of parsing and the list order */
	,SequenceNo [int] NULL /* the place in the sequence for the element */
	,Parent_ID INT NULL /* if the element has a parent then it is in this column. The document is the ultimate parent, so you can get the structure from recursing from the document */
	,[Object_ID] INT NULL /* each list or object has an object id. This ties all elements to a parent. Lists are treated as objects here */
	,[Name] NVARCHAR(2000) NULL /* the Name of the object */
	,StringValue NVARCHAR(MAX) NOT NULL /*the string representation of the value of the element. */
	,ValueType VARCHAR(10) NOT NULL /* the declared type of the value represented as a string in StringValue*/
	)
AS
BEGIN
	DECLARE @FirstObject INT --the index of the first open bracket found in the JSON string
		,@OpenDelimiter INT --the index of the next open bracket found in the JSON string
		,@NextOpenDelimiter INT --the index of subsequent open bracket found in the JSON string
		,@NextCloseDelimiter INT --the index of subsequent close bracket found in the JSON string
		,@Type NVARCHAR(10) --whether it denotes an object or an array
		,@NextCloseDelimiterChar CHAR(1) --either a '}' or a ']'
		,@Contents NVARCHAR(MAX) --the unparsed contents of the bracketed expression
		,@Start INT --index of the start of the token that you are parsing
		,@end INT --index of the end of the token that you are parsing
		,@param INT --the parameter at the end of the next Object/Array token
		,@EndOfName INT --the index of the start of the parameter at end of Object/Array token
		,@token NVARCHAR(200) --either a string or object
		,@value NVARCHAR(MAX) -- the value as a string
		,@SequenceNo INT -- the sequence number within a list
		,@Name NVARCHAR(200) --the Name as a string
		,@Parent_ID INT --the next parent ID to allocate
		,@lenJSON INT --the current length of the JSON String
		,@characters NCHAR(36) --used to convert hex to decimal
		,@result BIGINT --the value of the hex symbol being parsed
		,@index SMALLINT --used for parsing the hex value
		,@Escape INT --the index of the next escape character
		/* in this temporary table we keep all strings, even the Names of the elements, since they are 'escaped' in a different way, and may contain, unescaped, brackets denoting objects or lists. These are replaced in the JSON string by tokens representing the string */
	DECLARE @Strings TABLE (
		String_ID INT IDENTITY(1, 1)
		,StringValue NVARCHAR(MAX)
		)

	IF ISNULL(@JSON, '') = ''
		RETURN

	SELECT @characters = '0123456789abcdefghijklmnopqrstuvwxyz' --initialise the characters to convert hex to ascii
		,@SequenceNo = 0 --set the sequence no. to something sensible.
		,@Parent_ID = 0;

	/* firstly we process all strings. This is done because [{} and ] aren't escaped in strings, which complicates an iterative parse. */
	WHILE 1 = 1 --forever until there is nothing more to do
	BEGIN
		SELECT @start = PATINDEX('%[^a-zA-Z]["]%', @json collate SQL_Latin1_General_CP850_Bin);--next delimited string

		IF @start = 0
			BREAK --no more so drop through the WHILE loop

		IF SUBSTRING(@json, @start + 1, 1) = '"'
		BEGIN --Delimited Name
			SET @start = @Start + 1;
			SET @end = PATINDEX('%[^\]["]%', RIGHT(@json, LEN(@json + '|') - @start) collate SQL_Latin1_General_CP850_Bin);
		END

		IF @end = 0 --either the end or no end delimiter to last string
		BEGIN -- check if ending with a double slash...
			SET @end = PATINDEX('%[\][\]["]%', RIGHT(@json, LEN(@json + '|') - @start) collate SQL_Latin1_General_CP850_Bin);

			IF @end = 0 --we really have reached the end 
			BEGIN
				BREAK --assume all tokens found
			END
		END

		SELECT @token = SUBSTRING(@json, @start + 1, @end - 1)

		--now put in the escaped control characters
		SELECT @token = REPLACE(@token, FromString, ToString)
		FROM (
			SELECT '\b'
				,CHAR(08)
			
			UNION ALL
			
			SELECT '\f'
				,CHAR(12)
			
			UNION ALL
			
			SELECT '\n'
				,CHAR(10)
			
			UNION ALL
			
			SELECT '\r'
				,CHAR(13)
			
			UNION ALL
			
			SELECT '\t'
				,CHAR(09)
			
			UNION ALL
			
			SELECT '\"'
				,'"'
			
			UNION ALL
			
			SELECT '\/'
				,'/'
			) substitutions(FromString, ToString)

		SELECT @token = Replace(@token, '\\', '\')

		SELECT @result = 0
			,@escape = 1

		--Begin to take out any hex escape codes
		WHILE @escape > 0
		BEGIN
			SELECT @index = 0 --find the next hex escape sequence
				,@escape = PATINDEX('%\x[0-9a-f][0-9a-f][0-9a-f][0-9a-f]%', @token collate SQL_Latin1_General_CP850_Bin)

			IF @escape > 0 --if there is one
			BEGIN
				WHILE @index < 4 --there are always four digits to a \x sequence   
				BEGIN
					SELECT --determine its value
						@result = @result + POWER(16, @index) * (CHARINDEX(SUBSTRING(@token, @escape + 2 + 3 - @index, 1), @characters) - 1)
						,@index = @index + 1;
				END

				-- and replace the hex sequence by its unicode value
				SELECT @token = STUFF(@token, @escape, 6, NCHAR(@result))
			END
		END

		--now store the string away 
		INSERT INTO @Strings (StringValue)
		SELECT @token

		-- and replace the string with a token
		SELECT @JSON = STUFF(@json, @start, @end + 1, '@string' + CONVERT(NCHAR(5), @@identity))
	END

	-- all strings are now removed. Now we find the first leaf.  
	WHILE 1 = 1 --forever until there is nothing more to do
	BEGIN
		SELECT @Parent_ID = @Parent_ID + 1

		--find the first object or list by looking for the open bracket
		SELECT @FirstObject = PATINDEX('%[{[[]%', @json collate SQL_Latin1_General_CP850_Bin) --object or array

		IF @FirstObject = 0
			BREAK

		IF (SUBSTRING(@json, @FirstObject, 1) = '{')
			SELECT @NextCloseDelimiterChar = '}'
				,@type = 'object'
		ELSE
			SELECT @NextCloseDelimiterChar = ']'
				,@type = 'array'

		SELECT @OpenDelimiter = @firstObject

		WHILE 1 = 1 --find the innermost object or list...
		BEGIN
			SELECT @lenJSON = LEN(@JSON + '|') - 1

			--find the matching close-delimiter proceeding after the open-delimiter
			SELECT @NextCloseDelimiter = CHARINDEX(@NextCloseDelimiterChar, @json, @OpenDelimiter + 1)

			--is there an intervening open-delimiter of either type
			SELECT @NextOpenDelimiter = PATINDEX('%[{[[]%', RIGHT(@json, @lenJSON - @OpenDelimiter) collate SQL_Latin1_General_CP850_Bin) --object

			IF @NextOpenDelimiter = 0
				BREAK

			SELECT @NextOpenDelimiter = @NextOpenDelimiter + @OpenDelimiter

			IF @NextCloseDelimiter < @NextOpenDelimiter
				BREAK

			IF SUBSTRING(@json, @NextOpenDelimiter, 1) = '{'
				SELECT @NextCloseDelimiterChar = '}'
					,@type = 'object'
			ELSE
				SELECT @NextCloseDelimiterChar = ']'
					,@type = 'array'

			SELECT @OpenDelimiter = @NextOpenDelimiter
		END

		---and parse out the list or Name/value pairs
		SELECT @contents = SUBSTRING(@json, @OpenDelimiter + 1, @NextCloseDelimiter - @OpenDelimiter - 1)

		SELECT @JSON = STUFF(@json, @OpenDelimiter, @NextCloseDelimiter - @OpenDelimiter + 1, '@' + @type + CONVERT(NCHAR(5), @Parent_ID))

		WHILE (PATINDEX('%[A-Za-z0-9@+.e]%', @contents collate SQL_Latin1_General_CP850_Bin)) <> 0
		BEGIN
			IF @Type = 'object' --it will be a 0-n list containing a string followed by a string, number,boolean, or null
			BEGIN
				SELECT @SequenceNo = 0
					,@end = CHARINDEX(':', ' ' + @contents) --if there is anything, it will be a string-based Name.

				SELECT @start = PATINDEX('%[^A-Za-z@][@]%', ' ' + @contents collate SQL_Latin1_General_CP850_Bin) --AAAAAAAA

				SELECT @token = RTrim(Substring(' ' + @contents, @start + 1, @End - @Start - 1))
					,@endofName = PATINDEX('%[0-9]%', @token collate SQL_Latin1_General_CP850_Bin)
					,@param = RIGHT(@token, LEN(@token) - @endofName + 1)

				SELECT @token = LEFT(@token, @endofName - 1)
					,@Contents = RIGHT(' ' + @contents, LEN(' ' + @contents + '|') - @end - 1)

				SELECT @Name = StringValue
				FROM @strings
				WHERE string_id = @param --fetch the Name
			END
			ELSE
				SELECT @Name = NULL
					,@SequenceNo = @SequenceNo + 1

			SELECT @end = CHARINDEX(',', @contents) -- a string-token, object-token, list-token, number,boolean, or null

			IF @end = 0
				--HR Engineering notation bugfix start
				IF ISNUMERIC(@contents) = 1
					SELECT @end = LEN(@contents) + 1
				ELSE
					--HR Engineering notation bugfix end 
					SELECT @end = PATINDEX('%[A-Za-z0-9@+.e][^A-Za-z0-9@+.e]%', @contents + ' ' collate SQL_Latin1_General_CP850_Bin) + 1

			SELECT @start = PATINDEX('%[^A-Za-z0-9@+.e][-A-Za-z0-9@+.e]%', ' ' + @contents collate SQL_Latin1_General_CP850_Bin)

			--select @start,@end, LEN(@contents+'|'), @contents  
			SELECT @Value = RTRIM(SUBSTRING(@contents, @start, @End - @Start))
				,@Contents = RIGHT(@contents + ' ', LEN(@contents + '|') - @end)

			IF SUBSTRING(@value, 1, 7) = '@object'
				INSERT INTO @hierarchy (
					[Name]
					,SequenceNo
					,Parent_ID
					,StringValue
					,[Object_ID]
					,ValueType
					)
				SELECT @Name
					,@SequenceNo
					,@Parent_ID
					,SUBSTRING(@value, 8, 5)
					,SUBSTRING(@value, 8, 5)
					,'object'
			ELSE IF SUBSTRING(@value, 1, 6) = '@array'
				INSERT INTO @hierarchy (
					[Name]
					,SequenceNo
					,Parent_ID
					,StringValue
					,[Object_ID]
					,ValueType
					)
				SELECT @Name
					,@SequenceNo
					,@Parent_ID
					,SUBSTRING(@value, 7, 5)
					,SUBSTRING(@value, 7, 5)
					,'array'
			ELSE IF SUBSTRING(@value, 1, 7) = '@string'
				INSERT INTO @hierarchy (
					[Name]
					,SequenceNo
					,Parent_ID
					,StringValue
					,ValueType
					)
				SELECT @Name
					,@SequenceNo
					,@Parent_ID
					,StringValue
					,'string'
				FROM @strings
				WHERE string_id = SUBSTRING(@value, 8, 5)
			ELSE IF @value IN ('true', 'false')
				INSERT INTO @hierarchy (
					[Name]
					,SequenceNo
					,Parent_ID
					,StringValue
					,ValueType
					)
				SELECT @Name
					,@SequenceNo
					,@Parent_ID
					,@value
					,'boolean'
			ELSE IF @value = 'null'
				INSERT INTO @hierarchy (
					[Name]
					,SequenceNo
					,Parent_ID
					,StringValue
					,ValueType
					)
				SELECT @Name
					,@SequenceNo
					,@Parent_ID
					,@value
					,'null'
			ELSE IF PATINDEX('%[^0-9-]%', @value collate SQL_Latin1_General_CP850_Bin) > 0
				INSERT INTO @hierarchy (
					[Name]
					,SequenceNo
					,Parent_ID
					,StringValue
					,ValueType
					)
				SELECT @Name
					,@SequenceNo
					,@Parent_ID
					,@value
					,'real'
			ELSE
				INSERT INTO @hierarchy (
					[Name]
					,SequenceNo
					,Parent_ID
					,StringValue
					,ValueType
					)
				SELECT @Name
					,@SequenceNo
					,@Parent_ID
					,@value
					,'int'

			IF @Contents = ' '
				SELECT @SequenceNo = 0
		END
	END

	INSERT INTO @hierarchy (
		[Name]
		,SequenceNo
		,Parent_ID
		,StringValue
		,[Object_ID]
		,ValueType
		)
	SELECT '-'
		,1
		,NULL
		,''
		,@Parent_ID - 1
		,@type

	RETURN
END

Code Snippet 1:- ParseJson Function

 

Closure

The so-called ‘impedance-mismatch’ between applications and databases is an illusion. if the developer has understood the data correctly then there is less complexity  while processing it.

But has been trickier with other formats such as JSON. By using techniques like this, it should be possible to liberate the application or website from having to do the mapping from the object model to the relational, and spraying the database with ad-hoc T-SQL  that uses the fact/dimension tables or updateable views.  If the database can be provided with the JSON, or the Table-Valued parameter, then there is a better chance of  maintaining full transactional integrity for the more complex updates.

The database developer already has the tools to do the work with XML, but why not the simpler, and more practical JSON? I hope these routines get you started with experimenting with all this for your requirements.

 

Hardik Dangar

About the Author

Hardik Dangar

Project Lead in Magnusminds

Hardik is working as Project Lead of MSBI in INDIA. Hardik started his career working on SQL Server and MSBI. Hardik is having 5+ years of experience. In the starting of his career he was working on SQL Server, SSIS and SSRS. Hardik likes to explore technical things on SQL Server.