Apr 28, 2022

Parsing Text File By Span and Memory

We will write a method that reads all of the text in a file and counts how many times each word occurs in it.

For example, we have a text file like this:

Murat Pikacu
Charmander Murat Pikacu

And the result should be:

Murat: 2
Pikacu: 2
Charmander: 1
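As a quick sanity check of those counts, here is a minimal sketch that counts the words of the sample text held in a string, before any file handling is involved. The names WordCountSketch and CountWords are made up for this illustration and are not part of the project.

```csharp
using System;
using System.Collections.Generic;

public static class WordCountSketch
{
    // Counts how many times each word occurs in the given text.
    // Words are separated by spaces, tabs, or line breaks.
    public static Dictionary<string, int> CountWords(string text)
    {
        var counts = new Dictionary<string, int>();
        var separators = new[] { ' ', '\t', '\r', '\n' };

        foreach (var word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Start at 1 for a new word, otherwise increment the stored count.
            counts[word] = counts.TryGetValue(word, out var n) ? n + 1 : 1;
        }

        return counts;
    }
}
```

Calling WordCountSketch.CountWords("Murat Pikacu\nCharmander Murat Pikacu") gives Murat = 2, Pikacu = 2 and Charmander = 1.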

So let’s start writing our project.

To support multiple parser implementations, we need one interface that returns the word/count pairs.

public interface IFileParser
{
    Task<Dictionary<string, int>> Parse(
            string filePath,
            CancellationToken cancellationToken = default);
}

Our classic parser:

Because StreamReader has a ReadLine method, we will use its async version to read the file line by line. Then we count the occurrences of each word in the line.

public class TextFileParser : IFileParser
{
    public async Task<Dictionary<string, int>> Parse(
            string filePath,
            CancellationToken cancellationToken = default)
    {
        var dic = new Dictionary<string, int>();
        string line;
        using (var file = new StreamReader(filePath))
        {
            while ((line = await file.ReadLineAsync()) != null)
            {
                // Stop early and return the partial counts if cancellation is requested.
                if (cancellationToken.IsCancellationRequested)
                {
                    break;
                }
                // Split on spaces and skip empty or white-space-only entries.
                var words = line.Split(" ").Where
                            (x => !string.IsNullOrWhiteSpace(x));
                foreach (var word in words)
                {
                    if (dic.ContainsKey(word))
                    {
                        dic[word] = dic[word] + 1;
                    }
                    else
                    {
                        dic[word] = 1;
                    }
                }
            }
        }

        return dic;
    }
}

And here is the second one. It does the same thing as the classic example. Because there is no ReadLine method that works with a Memory buffer, I wrote something that behaves like ReadLine.

public class TextFileMemoryParser : IFileParser
{
    public async Task<Dictionary<string, int>> Parse(
            string filePath,
            CancellationToken cancellationToken = default)
    {
        var dic = new Dictionary<string, int>();
        bool goon = true;
        var chars = new List<char>();
        using (var file = new StreamReader(filePath))
        {
            // One-character buffer: the file is consumed char by char.
            Memory<char> memory = new Memory<char>(new char[1]);

            while (goon)
            {
                int read = await file.ReadAsync(memory, cancellationToken);

                // A return value of 0 means the end of the file was reached.
                goon = read > 0;

                // Treat the end of the file like a final line break.
                char c = goon ? memory.Span[0] : '\n';

                if (c != '\n' && c != '\r')
                {
                    chars.Add(c);
                    continue;
                }

                // End of line: build a string from the buffered chars.
                string line = string.Create(chars.Count, chars, (x, y) =>
                {
                    for (int i = 0; i < x.Length; i++)
                    {
                        x[i] = y[i];
                    }
                });
                foreach (var word in line.Split(" ").Where
                        (x => !string.IsNullOrWhiteSpace(x)))
                {
                    if (dic.ContainsKey(word))
                    {
                        dic[word] = dic[word] + 1;
                    }
                    else { dic[word] = 1; }
                }
                chars.Clear();
            }
        }
        return dic;
    }
}

As you can see above, we have a Memory<char> and we read the text file char by char until we reach a new line, then we create a string from the chars we have read. After that, we count the words just as we do in our classic example. A Split method that works directly on a span of chars would be awesome too.
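To illustrate what splitting on a span could look like, here is a minimal sketch that walks a ReadOnlySpan<char> and counts its words without creating a string array per line. SpanWordCounter and CountLine are hypothetical names, not part of the parsers above, and this is only one possible approach. Unlike the parsers, it treats any white-space character as a separator, not just a space.

```csharp
using System;
using System.Collections.Generic;

public static class SpanWordCounter
{
    // Adds the words found in one line to the running counts.
    // Only word.ToString() allocates; no string[] is created for the line.
    public static void CountLine(ReadOnlySpan<char> line, Dictionary<string, int> counts)
    {
        while (true)
        {
            // Skip the white space in front of the next word.
            line = line.TrimStart();
            if (line.IsEmpty)
            {
                break;
            }

            // Find the end of the current word (the next white-space char).
            int end = 0;
            while (end < line.Length && !char.IsWhiteSpace(line[end]))
            {
                end++;
            }

            string word = line.Slice(0, end).ToString();
            counts[word] = counts.TryGetValue(word, out var n) ? n + 1 : 1;

            // Continue with the rest of the line.
            line = line.Slice(end);
        }
    }
}
```

A string converts implicitly to ReadOnlySpan<char>, so SpanWordCounter.CountLine("Charmander Murat Pikacu", counts) works as is. Because CountLine is a regular synchronous method, it can hold span locals, which an async method like Parse cannot declare in older C# versions.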

So let's see how they perform on files.

public async Task OnPostUploadAsync()
{
    if (Upload == null) return;

    _cts = new CancellationTokenSource();

    var file = Path.Combine(
            _environment.WebRootPath,
            "uploads",
            Upload.FileName);
    using (var fileStream = new FileStream(file, FileMode.Create))
    {
        await Upload.CopyToAsync(fileStream, _cts.Token);
    }

    foreach (var _fileParser in _fileParsers)
    {
        Stopwatch sw = Stopwatch.StartNew();
        WordsWithCount = await _fileParser.Parse(file, _cts.Token);
        sw.Stop();
        if (_fileParser is TextFileParser)
        {
            DefaultParser = sw.Elapsed.TotalSeconds.ToString();
        }
        else if (_fileParser is TextFileMemoryParser)
        {
            MemoryParser = sw.Elapsed.TotalSeconds.ToString();
        }
    }
}

After injecting the list of IFileParser implementations, we ran them on a text file larger than 50MB, and MemoryParser took an average of 7 seconds.

Also, if you look at the diagnostic tools, you can see that its CPU usage is lower than the classic parser's. If you try a file larger than 450MB, it takes about 50% less time than the classic one. Note that these results were measured on a Debug build, not on a Release build.
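Since a single Stopwatch reading can vary from run to run, averaging a few runs gives a steadier number. Below is a minimal sketch of such a helper. TimingSketch and AverageSecondsAsync are hypothetical names, and this is not how the figures above were collected.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class TimingSketch
{
    // Runs the given async action several times and returns
    // the average elapsed time in seconds.
    public static async Task<double> AverageSecondsAsync(Func<Task> action, int runs)
    {
        if (runs <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(runs));
        }

        double total = 0;
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            await action();
            sw.Stop();
            total += sw.Elapsed.TotalSeconds;
        }

        return total / runs;
    }
}
```

With a parser instance at hand, a call could look like await TimingSketch.AverageSecondsAsync(() => parser.Parse(file), 5), where parser and file stand for whatever instance and path you are measuring.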