Examiness hints and tips from the trenches part 8 - Custom indexing

As you may or may not be aware, a number of updates were made recently to the our.umbraco.org website (full list of changes here). One of those updates was searchable documentation.

The documentation section of the our.umbraco site lives on GitHub. A scheduled job pulls down a set of .md files from GitHub and copies them to the documentation folder.

To make the documentation section searchable, I wrote a custom indexer to index the .md files. I also updated the search configuration files so that the documentation index is included in any search.

There are basically two ways of writing an indexer in Umbraco Examine: an easy way and a hard way. Both have their advantages and disadvantages.

Method                                     Advantage                                     Disadvantage
Create SimpleDataIndexer                   Quick and easy                                Runs on a scheduled basis
Create new indexer based on LuceneIndexer  Full control over configuration and indexing  More involved coding required

Because the documentation does not change very often, I went down the first route, namely to create a new indexer based on SimpleDataIndexer.

The process that pulls down the documentation files has a webhook that is triggered after the files have been successfully retrieved. I tap into this webhook and fire an event to indicate that all files are ready for indexing. The code for the indexer looks like this:

using System;
using System.Collections.Generic;
using System.IO;
using System.Web;
using Examine.LuceneEngine;
using umbraco.BusinessLogic;

public class FileIndexDataService : ISimpleDataService
{
    // Called by SimpleDataIndexer whenever the whole index is built.
    public IEnumerable<SimpleDataSet> GetAllData(string indexType)
    {
        var config = FileIndexerConfig.Settings;
        var fullPath = HttpContext.Current.Server.MapPath(config.DirectoryToIndex);

        var directory = new DirectoryInfo(fullPath);

        var files = config.Recursive
            ? directory.GetFiles(config.SupportedFileTypes, SearchOption.AllDirectories)
            : directory.GetFiles(config.SupportedFileTypes);

        var dataSets = new List<SimpleDataSet>();
        var i = 1; // unique id for each doc

        foreach (var file in files)
        {
            try
            {
                var simpleDataSet = new SimpleDataSet { NodeDefinition = new IndexedNode(), RowData = new Dictionary<string, string>() };

                simpleDataSet = ExamineHelper.MapFileToSimpleDataIndexItem(file, simpleDataSet, i, indexType);

                dataSets.Add(simpleDataSet);
            }
            catch (Exception ex)
            {
                // Log the failure and carry on with the remaining files.
                Log.Add(LogTypes.Error, i, "error processing file " + file.FullName + " " + ex);
            }

            i++;
        }

        return dataSets;
    }
}

// Method on the static ExamineHelper class that creates the data set for one file.
public static SimpleDataSet MapFileToSimpleDataIndexItem(FileInfo file, SimpleDataSet simpleDataSet, int index, string indexType)
{
    var lines = new List<string>(File.ReadAllLines(file.FullName));
    var headLine = string.Empty;
    var body = string.Empty;
    if (lines.Count > 0)
    {
        // The first line of the .md file is used as the title, the rest as the body.
        headLine = RemoveSpecialCharacters(lines[0]);
        lines.RemoveAt(0);
        // Join with a space so the last word of one line does not run into the first word of the next.
        body = RemoveSpecialCharacters(string.Join(" ", lines));
    }

    simpleDataSet.NodeDefinition.NodeId = index;
    simpleDataSet.NodeDefinition.Type = indexType;
    simpleDataSet.RowData.Add("Body", body);
    simpleDataSet.RowData.Add("Title", headLine);
    simpleDataSet.RowData.Add("dateCreated", file.CreationTime.ToString("yyyy-MM-dd-HH:mm:ss"));
    simpleDataSet.RowData.Add("dateCreatedSearchAble", file.CreationTime.SerializeForLucene());
    simpleDataSet.RowData.Add("Path", file.FullName);
    simpleDataSet.RowData.Add("searchAblePath", file.FullName.Replace("\\", " ").Replace(":", ""));
    simpleDataSet.RowData.Add("nodeTypeAlias", "document");
    simpleDataSet.RowData.Add("url", BuildUrl(file.FullName));

    return simpleDataSet;
}
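
The listing above calls two helpers, RemoveSpecialCharacters and BuildUrl, that are not shown in this post. As a rough idea of what they might do, here is a minimal sketch. Everything in it is an assumption on my part: the class name MarkdownIndexHelpers, the extra docsRoot and urlPrefix parameters, and the exact cleaning rules are hypothetical and are not the code that runs on our.umbraco.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-ins for the helpers used in the listing above.
public static class MarkdownIndexHelpers
{
    // Replaces every character that is not a letter, digit or whitespace
    // (markdown markers such as #, * and backticks) with a space,
    // then collapses runs of whitespace into a single space.
    public static string RemoveSpecialCharacters(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;
        var cleaned = Regex.Replace(input, @"[^\p{L}\p{Nd}\s]", " ");
        return Regex.Replace(cleaned, @"\s+", " ").Trim();
    }

    // Turns a physical file path into a site-relative url.
    // docsRoot and urlPrefix are assumed parameters; the original BuildUrl
    // takes only the file path and presumably reads these from config.
    public static string BuildUrl(string fullPath, string docsRoot, string urlPrefix)
    {
        if (!fullPath.StartsWith(docsRoot, StringComparison.OrdinalIgnoreCase))
            throw new ArgumentException("File is not below the documentation root.", "fullPath");

        // Keep only the part of the path below the documentation root.
        var relative = fullPath.Substring(docsRoot.Length);

        // Drop the .md extension if present.
        if (relative.EndsWith(".md", StringComparison.OrdinalIgnoreCase))
            relative = relative.Substring(0, relative.Length - 3);

        // Use url separators and remove leading or trailing slashes.
        relative = relative.Replace('\\', '/').Trim('/');

        return urlPrefix.TrimEnd('/') + "/" + relative + "/";
    }
}
```

With this sketch, a heading line such as "# Custom *indexing* with `Examine`" would be indexed as "Custom indexing with Examine". Stripping the markdown markers before indexing keeps them out of the searchable text.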

As you can see, not much code is needed to write a custom indexer. The disadvantage of an indexer based on SimpleDataIndexer is that you only get a method to index everything in one go. There is no method to update an individual item, so if just one item is created or updated, how do we update the index?

One possible approach would be to use the FileSystemWatcher class on the documentation directory: on a change, determine which file has changed and then look for that file in the index.

If it does not exist, add it to the index. If it does exist, remove it and then re-add it with the updated data. The single-node indexing code would look something like this:

/// <summary>
/// Removes the entry in the search index that is related to the document provided by the parameter.
/// </summary>
/// <param name="doc">The document whose index entry should be removed.</param>
private void DeleteIndex(MdDocument doc)
{
    ExamineManager.Instance.IndexProviderCollection["CustomIndexer"].DeleteFromIndex(doc.Id.ToString(CultureInfo.InvariantCulture));
}

/// <summary>
/// Updates the entry in the search index that is related to the document provided by the parameter.
/// </summary>
/// <param name="doc">The document whose index entry should be added or refreshed.</param>
private void UpdateIndex(MdDocument doc)
{
    var examineNode = doc.ToSimpleDataSet().RowData.ToExamineXml(doc.Id, "CustomData");

    ExamineManager.Instance.IndexProviderCollection["CustomIndexer"].ReIndexNode(examineNode, "CustomData");
}
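
To make the FileSystemWatcher idea a little more concrete, here is a minimal sketch of how the watcher could be wired up to methods like UpdateIndex and DeleteIndex. The DocumentationWatcher class and its reindex and remove callbacks are hypothetical names of my own, and I have not run this against the our.umbraco index. The callbacks receive a file path, so turning that path into an MdDocument with a stable id is left to the caller. The running counter used in GetAllData would not be enough for that, because a file's number can change when files are added or removed.

```csharp
using System;
using System.IO;

// Hypothetical wrapper: watches a folder of .md files and forwards each
// change to caller-supplied callbacks.
public sealed class DocumentationWatcher : IDisposable
{
    private readonly FileSystemWatcher _watcher;

    public DocumentationWatcher(string directory, Action<string> reindex, Action<string> remove)
    {
        _watcher = new FileSystemWatcher(directory, "*.md")
        {
            IncludeSubdirectories = true,
            NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite
        };

        _watcher.Created += (s, e) => Route(e.ChangeType, e.FullPath, null, reindex, remove);
        _watcher.Changed += (s, e) => Route(e.ChangeType, e.FullPath, null, reindex, remove);
        _watcher.Deleted += (s, e) => Route(e.ChangeType, e.FullPath, null, reindex, remove);
        _watcher.Renamed += (s, e) => Route(e.ChangeType, e.FullPath, e.OldFullPath, reindex, remove);

        // Nothing is raised until this is switched on.
        _watcher.EnableRaisingEvents = true;
    }

    // Decides what to do with the index for one file system change.
    // Kept static and separate from the event wiring so it can be
    // exercised without touching the disk.
    public static void Route(WatcherChangeTypes change, string path, string oldPath,
                             Action<string> reindex, Action<string> remove)
    {
        switch (change)
        {
            case WatcherChangeTypes.Created:
            case WatcherChangeTypes.Changed:
                // Re-indexing covers both "add" and "remove then re-add".
                reindex(path);
                break;
            case WatcherChangeTypes.Deleted:
                remove(path);
                break;
            case WatcherChangeTypes.Renamed:
                // A rename is treated as delete of the old path plus add of the new one.
                if (oldPath != null) remove(oldPath);
                reindex(path);
                break;
        }
    }

    public void Dispose()
    {
        _watcher.Dispose();
    }
}
```

FileSystemWatcher can raise Changed more than once for a single save, so the reindex callback should be safe to call repeatedly for the same file. Re-indexing an existing entry rather than only adding new ones fits that requirement.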

If you were using SimpleDataIndexer to index database tables, then for per-record updates you could use SQL Server triggers: on insert, update or delete, the trigger would call a .NET assembly that contains your Umbraco Examine index update code. (Note: I have not tried this myself, but this is how I would try to do it.)

I previously posted about using tuples to replace switch statements. I used that technique in the search results control for our.umbraco as part of the documentation search updates.