TextFileDataSet revisited

The following text consists of parts of a mail-thread between Matt Krei and myself. The main subject is regarding how to use TextFileDataSet on certain lay-outs of text files. After a while it turned out to be a quite extensive elaboration on regular expressions. Enjoy.

Download textfiledataset_demo the original demo.
Download Extra demo project as described in this post.

Matt Krei wrote:
I downloaded your texttodataset project from codeproject.com. I have a csv file where each line ends in a comma. How do you deal with this? Also, what if there isn’t any data between commas. For example if the line reads

 1,,12/07/1972

then this will be a misread. I need it to just be an empty string or null. How do you take care of this? Thanks

Christiaan replied: 

Hi Matt,

Well, as soon as I got home after work I looked at it again (it is code that I wrote quite a long time ago). I downloaded my own demo-project from CodeProject and altered the demo-text-file into this:

 

1,Chris,12/07/1972,

2,,03/01/1974,

3,John,03/19/1980,Drummer,

4,Mark,12/02/1980,

5,Eric,09/18/1981,

As you can see, I end every line with a comma now and have left out one of the names.

When you run the demo-project, you can try to open the file using the following regex statement: ^(?<ID>[^,]+),(?<Name>[^,]+),(?<Date>[^,]+)$

And it is true, every line is a misread. Now, expand the Regex with a comma at the end (just before ‘$’)

^(?<ID>[^,]+),(?<Name>[^,]+),(?<Date>[^,]+),$

(the $ means end-of-line, as ^ means begin-of-line)

If you open the file again, only two lines are misread. Line 2 and 3.

Now for the name as being ‘blanc’ .
The name part in the Regex is: (?<Name>[^,]+) This means, take every character that is not a comma and let me call it name. Oh, this must be ‘one up to unlimited’ times. (that is what the ‘+’ means).

This plus-sign is a ‘greedy’ control. There is also a ‘?’, this one looks for zero or one time. And there is ‘*’, zero up to unlimited. And that is exactly the one you need.

Replace the plus ‘+’ with asterisks ‘*’ as so :

^(?<ID>[^,]+),(?<Name>[^,]*),(?<Date>[^,]+),$

Reopen the file and you will see that only line 3 is still being misread (which was intended from the beginning)

Matt wrote:

Thanks a bunch.  That answered my question.  One more thing;  Most of the .csv file I am dealing with have the standard comma delimited as you show below.  Some of the lines in a file may be like this, though:

1,”Chris, John”,12/07/1972

Do you know of a way I can parse out the “Chris, John” with a regular expression and then replace the double quotes with an empty string and the comma with either an empty string or dash or something.  I appreciate your help.

Christiaan replied:

Okay, before I answer this one do yourself a favor and download (maybe even

buy) the tool RegexBuddy (www.regexbuddy.com), this will help you create and better understand regular expressions. There is a 30-day trial, and no, I do not have stock options

As you describe, you want to do two separate things:

  • remove any comma inside a field surrounded by double quotes
  • erase the double quotes surrounding the fieldOr….

You might mean you want to have an expression to capture the “Chris, John”

value without the double quotes.

 

To start with the latter, the more simpler one:

If you know how the fields are made up, you can create the expression fairly simple. If the line is:

 

1,”Chris, John”,12/07/1972

You just need the following expression:

^(?<field1>[0-9]*),”(?<field2>[^”]*)”,(?<field3>.*)$

  • begin of line

  • field1 all numbers (zero or unlimited) followed by a comma

  • field2 any character except double quote (zero or unlimited) between double quotes, followed by comma

  • field3 any character (zero or unlimited) until line-break or end-of-file

But there are some caveats, I will try to describe these in the following (trying to solve the former).

Removing and/or replacing characters with regular expressions, I always use my favorite coding language C#. I use regular expressions just to identify the strings and then alter them in code.

 

First we need to make a regular expression in which we can identify a field surrounded by double quotes. Before I head right to creating the expression, let me extend the test case a bit more. What if there is not just a field that is surrounded by double quotes, what if there are fields that use double quotes as part of their contents. E.g.

 

        1,”Chris, John”,12/07/1972

        2,Dave,12/10/1970

        3,Mike “the stone”,10/3/1980

        4,”Stuart, Little “Joe”",11/5/1983

        5,”Reggy, Dan”,”Mike, Flannigan”,10/6/1986

In above test cases I tried to make it harder, I cannot just erase all double quotes and commas inside the double quotes.

Though this might be more complex than you asked for, If we get it right, it will work on your test case and we will be prepared for any other real-life ‘exceptions’.

 

Now, let’s make a regular expression in which we can identify the field.

,”(?<field>(?<content>(?!”,).)*)”,

The trick I used here is that I used a so called ‘negative lookahead’ –> (?!”,).)*

This means that I want to capture everything as long as it does not contain “, within the boundaries ,” and “, I had to use another named capturing group (I called it content), just so the expression is valid on different platforms.

 

Running this Regex gives me results on line 1, 4 and 5. In line 5 though, only the first match is found. I hoped for a second match on Mike Flannigan.

This is not the case because the comma between the fields is used in the previous match.

Maybe we could extend the Regex further, so we can handle this. I’m more pragmatic (or lazy) and will just run the Regex again and again until there are no further matches. The other lines do not have to match, because I will first edit the found lines.

In C# I wrote the following (the demo project is attached):

using System.Text.RegularExpressions;
using System.IO;          

…          

//define the regular expression
Regex fieldfinder = new
Regex(”,(?<whole>”(?<field>(?<content>(?!”,).)*)”),”);          

//open the file with the content
StreamReader sr = new StreamReader(”demo.txt”);
string allcontent = sr.ReadToEnd();
sr.Close();          

MatchCollection matches = fieldfinder.Matches(allcontent);
while (matches.Count > 0)
{
  foreach (Match match in matches)
  {
    //find the comma in the match and replace it with nothing
    string replacement =
    match.Groups[”field”].Value.Replace(”,”, string.Empty);
    //replace the found match in the complete content with the altered text
    allcontent = allcontent.Replace(match.Groups[”whole”].Value, replacement);
  }
  //make a complete new matchcollection in case some overlap in matches occurred
  matches = fieldfinder.Matches(allcontent);
}

As you can see, I expanded the regulare expression with another capturing group, named ‘whole’. This capturing group now includes the double quotes, which makes it easier for me to replace.

The result is now:

        1,Chris John,12/07/1972

        2,Dave,12/10/1970

        3,Mike “the stone”,10/3/1980

        4,Stuart Little “Joe”,11/5/1983

        5,Reggy Dan,Mike Flannigan,10/6/1986

You can put this result through the CSVDataSet to get your DataSet, mind that there are still double quotes possible in this scenario (look at lines 3 and 4). Also mind, that the demo-code I provided reads the entire content of the file into a string. This could cause a problem with larger files.

I tried to give an example in the demo project.

Sure, if you put some more effort in it, the performance might improve. But I am sure that this will do the trick.