author Andreas Baumann <mail@andreasbaumann.cc> 2017-04-12 20:50:46 +0200
committer Andreas Baumann <mail@andreasbaumann.cc> 2017-04-12 20:50:46 +0200
commit 07a3e29f397b4a8bd5ca124a1d020e050d20e131
tree 349e0f0e71382168e1f9ac2d7bc4ff0f61ba348a
parent 5ced26f343a43cb485c0f9cf76e64a5a76466fa7
some fixes and published strus web search article
-rw-r--r-- content/blog/web-search-homepage.md 39
-rw-r--r-- static/images/blog/web-search-homepage/search.png bin 0 -> 83416 bytes
-rw-r--r-- static/images/blog/web-search-homepage/strus.jpg bin 36406 -> 0 bytes
3 files changed, 19 insertions, 20 deletions
diff --git a/content/blog/web-search-homepage.md b/content/blog/web-search-homepage.md
index e7083bb..3a24ce5 100644
--- a/content/blog/web-search-homepage.md
+++ b/content/blog/web-search-homepage.md
@@ -1,14 +1,13 @@
+++
-draft = true
title = "Web search for my homepage"
date = "2017-04-12T15:49:11+01:00"
-categories = [ "Strus", "Search", "Information Retrieval" ]
-thumbnail = "/images/blog/web-search-homepage/strus.jpg"
+categories = [ "Strus", "Search", "Information Retrieval", "Web" ]
+thumbnail = "/images/blog/web-search-homepage/search.png"
+++
## Intro
I wanted to add a search function to my web page.
-As the website is build with Hugo as a set of static
+As the website is built with Hugo as a set of static
HTML pages onto a read-only web server, standard
approaches didn't work like a LIKE-query in Mysql
as many CMS are implementing search.
@@ -19,16 +18,16 @@ project.
The basic idea is that the author of the web pages
can build a search index locally with the markdown version
-of his content and then push it to a webservice dedicated
+of his content and then push it to a web service dedicated
to search only. Again, the files making up the search index
can be set to read-only after an update, leaving the system
open to only DOSA or DDOSA (but which public system isn't).
## Installing strus for content indexing
-So, I installed the packages 'strusutilties' for ArchLinux
+So, I installed the 'strusutilities' package for ArchLinux
on my local machine from the
-[OpenBuildService](https://software.opensuse.org/download.html?project=home:andreas_baumann&package=strusutilities)
+[Open Build Service](https://software.opensuse.org/download.html?project=home:andreas_baumann&package=strusutilities)
with:
```
@@ -43,8 +42,9 @@ The command line tools consist of tools to analyze the document,
apply some basic parsing and normalization of search terms.
The tools take XML, JSON or TSV (tab-separated-values) currently.
-My Hugo documents have their metadata in TOML and the content in
-Markdown:
+My Hugo documents have their metadata in
+[TOML](https://en.wikipedia.org/wiki/TOML) and the content in
+[Markdown](https://de.wikipedia.org/wiki/Markdown):
```
+++
@@ -55,7 +55,7 @@ thumbnail = "/images/blog/web-search-homepage/strus.jpg"
+++
I wanted to add a search function to my web page.
-As the website is build with Hugo as a set of static
+As the website is built with Hugo as a set of static
...
```
@@ -67,9 +67,9 @@ file using:
* [pandoc](http://pandoc.org/): convert markdown to
tons of formats
-I choose to convert to a DocBook style of XML and put all
-the posts into one big file called `posts.xml`. The metadata is
-embedded as a JSON value into the XML file in a tag `<meta>`.
+I choose to convert to a [DocBook](http://docbook.org/whatis) style
+of XML and put all the posts into one big file called `posts.xml`.
+The metadata is embedded as a JSON value into the XML file in a tag `<meta>`.
The final XML file looks like:
@@ -90,7 +90,7 @@ The final XML file looks like:
<body>
<para>
I wanted to add a search function to my
- web page. As the website is build with
+ web page. As the website is built with
Hugo as a set of static
...
```
@@ -108,10 +108,10 @@ I packaged this whole ugly conversion step into a script like that:
## Configuring the document analysis and indexing process
-Now we define the configuration for the text analysis. Basically
+Now we define the configuration for the text analysis. Basically,
we tell the system where to split the document into retrievable
items, which features we want to be able to search for and what
-attributes and text we want to show in the ranklist.
+attributes and text we want to show in the rank list.
The file `document.ana` contains a configuration which describes
how Strus should analyze and index the documents:
@@ -170,7 +170,7 @@ when presenting the hit in the ranlist.
The forward index stores the document almost verbatim as a sequence
of title and text tokens. So when we get a hit in a search result
we can present a selection of them (usually a sentence containing
-the matches) in the ranklist.
+the matches) in the rank list.
Finally, we need to count the number of words per document,
this is needed or the retrieval function:
@@ -192,9 +192,8 @@ can copy to the server running the `strusWebService`.
## Installing the strusWebService for querying
-On a publicly available server I installed the 'strusWebService':
-
-[OpenBuildService](https://software.opensuse.org/download.html?project=home:andreas_baumann&package=struswebservice)
+On a publicly available server I installed the 'strusWebService' package from the
+[Open Build Service](https://software.opensuse.org/download.html?project=home:andreas_baumann&package=struswebservice)
with:
```
diff --git a/static/images/blog/web-search-homepage/search.png b/static/images/blog/web-search-homepage/search.png
new file mode 100644
index 0000000..77db107
--- /dev/null
+++ b/static/images/blog/web-search-homepage/search.png
Binary files differ
diff --git a/static/images/blog/web-search-homepage/strus.jpg b/static/images/blog/web-search-homepage/strus.jpg
deleted file mode 100644
index 70c0776..0000000
--- a/static/images/blog/web-search-homepage/strus.jpg
+++ /dev/null
Binary files differ