Thursday, March 29, 2012

Combining text data rows

I am working with a database derived from text documents.One of the tables (TEXT001) contains the text of the documents with each paragraph of each document assigned to its own row with a paragraph number in a SectionNo column.I want the entire text of each document in a single row with its own unique number (so that I can do a full text search with SQL Server 2005 that will search and return the entire document as a result).How do I combine the rows with the same DocumentID into a single row of text data?This will put the entire text content of each document in its own row.

TEXT001 table as it is

DocumentID

SectionNo

SectionText

1

1

Paragraph 1 of Document 1

1

2

Paragraph 2 of Document 1

1

3

Paragraph 3 of Document 1

2

1

Paragraph 1 of Document 2

2

2

Paragraph 2 of Document 2

New TEXT table

DocumentID

SectionText

1

Entire text of Document 1

2

Entire text of Document 2

I realize that I can use “union” to combine tables with the same data type, but that is not what I am trying to do.Ideally, there is a way to create a new table and fill it with the combined SectionText data as a batch command.If anyone can tell how to do this, I would appreciate your help.

More modestly, I tried to use the “Group By” clause to combine the SectionText data using this query:

SELECT DocumentID, SectionText FROM TEXT001

GROUP BY DocumentID

And got this error message:

Msg 8120, Level 16, State 1, Line 5

Column 'TEXT001.SectionText' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

I figured that I could not contain the SectionText data as an aggregate function since it is text data and cannot be “summed”, so I tried including it in the GROUP BY clause:

SELECT DocumentID, SectionText FROM TEXT001

GROUP BY DocumentID, SectionText

And got his error message:

Msg 306, Level 16, State 2, Line 5

The text, ntext, and image data types cannot be compared or sorted, except when using IS NULL or LIKE operator.

Where do I go from here to accomplish my goal of combining the paragraphs of each document into one row per document?

Hi moonshadow, the following will create a stored procedure that will fullfill the requested task. run the sript, it will have only one constraint, is that i supposed that the maximum section length is hundreds of characters i.e: it will be great if you can use nvarchar instead of ntext, but if more characters are needed to be stored then comment and we will work around it.

IF EXISTS (SELECT name FROM sysobjects
WHERE name = 'myProc' AND type = 'P')
DROP PROCEDURE myProc
IF EXISTS (SELECT name FROM sysobjects
WHERE name = 'NewText' AND type = 'U')
DROP table NewText
CREATE TABLE NewText
( DocumentID int,
Paragraph1 ntext,
)
GO
CREATE PROCEDURE myProc
AS
declare @.a int
declare @.b nvarchar(3000)
declare @.c int
DECLARE myCursor CURSOR FOR
SELECT DocumentID,SectionText FROM Text001
OPEN myCursor
FETCH NEXT FROM myCursor into @.a,@.b
WHILE @.@.FETCH_STATUS = 0
BEGIN
if @.c = @.a
update NewText set Paragraph1 = cast(Paragraph1 as nvarchar) + @.b where DocumentID = @.a
else
insert into NewText(DocumentID,Paragraph1) values (@.a,@.b)
set @.c=@.a
FETCH NEXT FROM myCursor into @.a,@.b
END
CLOSE myCursor
DEALLOCATE myCursor
select * from NewText
go
--ToCall your procedure:
execute myProc

|||

Hi Mario. Thanks for script. This is exactly what I am looking for.

I created a new query, pasted in your script, and clicked Execute.

A new table ("NewText") was created with the appropriate columns.

However the data from "Text001" was not copied to "NewText". When I open the "NewText" table, the content of the rows is "null".

As a Newbie to SQL Server, I am wondering if I should be doing something else to call the "myProc" procedure. Do I need to do another step to combine the rows from "Text001" and copy the combined data to "NewText"?

|||

yes moonshadow,

first it is great that you copied the script and started its excecution.

i am sure that it will work, but look what you will have to do:

EITHER: change the data type of SectionText from ntext to nvarchar, and then test. if you are urged to use ntext, then we can figure it out... but as a first step, just for your test, do not use ntext or at least do not use large text in your records under SectionText.

OR change the datatype of DocumentID to int.

look, the Table Text001 is like the following (i created this table upon your specifications):

CREATE TABLE [Text001] (
[DocumentID] [int] NULL ,
[SectionText] [ntext] COLLATE SQL_1xCompat_CP850_CI_AS NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

|||

Thanks for your patience Mario. Here is what I tried:

1. I tried to change the data type of SectionText from ntext to nvarchar but it would only allow nvarchar(50) or nvarchar(max). I chose nvarchar(max) since the SectionText contents are likely be much more than 50 characters. NewText table was created but still empty.

2. I changed the DocumentID to int with the same result as before.

What next?

|||

Mario:To simplify and make it more concrete for working with your query, I created a new database called “Practice”

In that database I ran this query based on your example to create a table called “Text001”:

CREATE TABLE [Text001] (

[DocumentID] [int] NULL ,

[SectionNo] [int] NULL ,

[SectionText] [ntext]

COLLATE SQL_1xCompat_CP850_CI_AS NULL )

ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

I put data into five rows of the table Text001 as follows:

DocumentID

SectionNo

SectionText

1

1

Blue

1

2

Red

1

3

Green

2

1

White

2

2

Black

I ran the initial query that you provided which was “executed successfully”

IF EXISTS (SELECT name FROM sysobjects
WHERE name = 'myProc' AND type = 'P')
DROP PROCEDURE myProc
IF EXISTS (SELECT name FROM sysobjects
WHERE name = 'NewText' AND type = 'U')
DROP table NewText
CREATE TABLE NewText
( DocumentID int,
Paragraph1 ntext,
)
GO
CREATE PROCEDURE myProc
AS
declare @.a int
declare @.b nvarchar(3000)
declare @.c int
DECLARE myCursor CURSOR FOR
SELECT DocumentID,SectionText FROM Text001
OPEN myCursor
FETCH NEXT FROM myCursor into @.a,@.b
WHILE @.@.FETCH_STATUS = 0
BEGIN
if @.c = @.a
update NewText set Paragraph1 = cast(Paragraph1 as nvarchar) + @.b where DocumentID = @.a
else
insert into NewText(DocumentID,Paragraph1) values (@.a,@.b)
set @.c=@.a
FETCH NEXT FROM myCursor into @.a,@.b
END
CLOSE myCursor
DEALLOCATE myCursor
select * from NewText
go

The NewText table that was created looked like this:

DocumentID

Paragraph1

Null

Null

I think the NewText Table should have looked like this:

DocumentID

Paragraph1

1

Blue

Red

Green

2

White

Black

I am learning alot from working with this and am thankful for your help.

|||

moonshadow!can you try this please:

add the following, IN RED,

...

declare @.c int
DECLARE myCursor CURSOR FOR
SELECT DocumentID,SectionText FROM Text001
OPEN myCursor
FETCH NEXT FROM myCursor into @.a,@.b

IF @.@.FETCH_STATUS <> 0
PRINT " ERROR!"

WHILE @.@.FETCH_STATUS = 0
BEGIN
if @.c = @.a

...

test it, cos it seems it is not entering the loop, if there was no error, try to update the following line:

if @.c = @.a
update NewText set Paragraph1 = Paragraph1 + @.b where DocumentID = @.a
else...

also, replace all field datatypes in tables Text001, and NewText to nvarchar (i.e: do not use ntext)

...CREATE TABLE NewText
( DocumentID int,
Paragraph1 nvarchar(3000), or max
)...


|||

Mario:

1. I changed all "ntext" datatypes in tables Text001, and NewText to "nvarchar(max)"

2. I copied and pasted this

IF @.@.FETCH_STATUS <> 0
PRINT " ERROR!"

as you suggested and got this error message:

Msg 128, Level 15, State 1, Procedure myProc, Line 14 The name "ERROR!" is not permitted in this context. Valid expressions are constants, constant expressions, and (in some contexts) variables. Column names are not permitted.

3. I also tried to update the following line as you suggested:

if @.c = @.a
update NewText set Paragraph1 = Paragraph1 + @.b where DocumentID = @.a
else...

but got the same error message.


|||ok moon shadow, instead of "ERROR!" just put anything, an insert statement, or a Print 1, just to check if the fetching is occuring with/without errors, just further troubleshooting|||

Mario: I tried numerous things without luck until I typed in a constant expression without the quotations:

IF @.@.FETCH_STATUS <> 0

PRINT 10

The command completed successfully but the content of the "NewText" table was still Null.

I figured 10 (or any number) is a constant expression as per the error message and should be appropriate but may not have served your purpose to see if the fetching is occurring. Any other thoughts on how to proceed?

|||

try to insert a record instead of PRINT, like the following:

IF @.@.FETCH_STATUS <> 0

insert into NewText(DocumentID,Paragraph1) values (1,'Error')

and check the NewText table

good luck

|||

Mario:

I used your new script and the "query executed successfully" but the content of the NewText table was still "null"

|||

moonshadow! run the following:

execute myProc

and then check the result in NewTable

|||

Mario: Yes!! The NewText table is now filled with the combined data. I told you I was a Newbie.

Perhaps you can help me with one refinement. The data in the NewNext table SectionText column is run together (ie., "blueredgreen") which will not work too well with the paragraphs in my original database. Is there a way to automatically format the new data fields (in NewText) so that each row of the original table (Text001) remains on its own line with a space between paragraphs when it is read in an application? Such as this:

blue

red

green

Thanks for your patience and knowledge in getting this far.

|||

Great MoonShadow!!!

now you've got the concept, you can expand it as you like...

regarding your concern, you can add a carriage return while filling each paragraph, i.e: add what is in red to the sql script:

...

update NewText set Paragraph1 = Paragraph1 + char(13)+char(10) + @.b where DocumentID = @.a
...

good luck

No comments:

Post a Comment