TfT Performance: Methodology

This is the first article in a series looking at the performance of current applications for Personal Knowledge Management - so-called "Tools for Thought."

Alexander Rink
Dec 31, 2021

Introduction

If you want realistic results from performance tests, you have to create realistic conditions. And if you want to compare different applications, you have to run them on the same data. To ensure both, I've chosen synthetic test data over a copy of one of my personal knowledge graphs.

The synthetic test data is generated by a lightweight Python script I developed.

The Python script for generating the test data

The program flow is straightforward and shown below; start reading in the upper-left.

Simplified program flow

To create realistic-looking pages, we need:

  • A dictionary with English words and an algorithm to select them

  • Some probabilistic functions that introduce a certain amount of fuzziness into the lengths of paragraphs and sentences

  • An algorithm to generate page links to other documents

Creating Words

I use a dictionary from Wikiquote containing the 10,000 most common words in Project Gutenberg. Conveniently, this dictionary already supplies each word's frequency across the different books and texts, providing a perfect basis for the word-generation algorithm.
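
To make that concrete, here is a minimal sketch of loading such a list into parallel word and weight lists. The file name and the two-column TSV format are my assumptions for illustration, not the actual format of the Wikiquote source:

```python
import csv

# Hypothetical loader for the frequency list, assuming it was saved as a
# two-column TSV of word and count (the real source format may differ).
def load_word_frequencies(path: str = "gutenberg_10000.tsv"):
    words, weights = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for word, count in csv.reader(f, delimiter="\t"):
            words.append(word)
            weights.append(int(count))
    return words, weights
```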

As you can see, the frequency drops significantly even within the first ten words.

Top Ten Words with their respective frequency

Across all 10,000 words, the frequency decays roughly inverse-exponentially.

Rapidly decreasing frequency across all 10,000 words

When combining the selected words into sentences, we sample according to the frequencies shown above. That way, an "and" or a "that" is much more likely to end up in a sentence than a "purified" or a "canton".
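
In Python, this frequency-weighted selection can be done with `random.choices`, which samples with replacement proportionally to the given weights; a sketch, not necessarily the original script's exact approach:

```python
import random

def random_words(words, weights, n):
    # Sample n words with replacement, proportional to their corpus
    # frequency, so "and"/"that" appear far more often than rare words.
    return random.choices(words, weights=weights, k=n)
```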

We don't want every sentence to have the same length, so we need a probabilistic function to create some randomness.

Gaussian Distribution

I've chosen a Gaussian distribution with a mean of seven and a standard deviation of one. As you can see in the graph, most of the values fall between 4 and 10, with seven as the most frequent value.

Gaussian distribution with mu = 7 and sigma = 1

So each time we create a sentence, we randomize the number of words. Each time we make a block, we randomize the number of sentences (and sometimes indent some of them). And each time we create a page, we randomize the number of blocks. In total, we created 5,000 such pages.
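
A sketch of how those Gaussian draws might be used. The sentence-length parameters (mean 7, standard deviation 1) come from the text above; the block-level parameters are placeholder assumptions of mine:

```python
import random

def sentence_length(mu: float = 7.0, sigma: float = 1.0) -> int:
    # Draw from N(7, 1) and clamp to at least one word; most draws land
    # between 4 and 10, with 7 as the most frequent value.
    return max(1, round(random.gauss(mu, sigma)))

def make_sentence(words, weights) -> str:
    picked = random.choices(words, weights=weights, k=sentence_length())
    return " ".join(picked).capitalize() + "."

def make_block(words, weights) -> str:
    # Sentences per block are randomized the same way; the mean and
    # standard deviation here are assumed, not taken from the original.
    n_sentences = max(1, round(random.gauss(3, 1)))
    return " ".join(make_sentence(words, weights) for _ in range(n_sentences))
```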

Creating Links

I've chosen a uniform distribution for creating links. The chance of inserting one while generating a sentence is 1 in 50, split equally between two-word and one-word links.

I didn't want links on the most common words, nor too many distinct link targets. Looking at my own graphs shows that there are always a few terms that gather many backlinks.

So I used the frequency distribution above but shifted the words by 1,500 places. The most frequent link word is therefore no longer "the" but "dressed," followed by "lifted" and "hopes."
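
Putting the two link rules together, a sketch might look like this. The double-bracket syntax matches Roam/Logseq page links, and the exact shifting mechanics are my reading of the description above:

```python
import random

def maybe_link(words, weights):
    # 1-in-50 uniform chance of producing a link at this position.
    if random.randint(1, 50) != 1:
        return None
    # Shift the vocabulary by 1,500 places but keep the frequency shape,
    # so "dressed" (rank 1,501) becomes the most likely link target.
    link_words = words[1500:]
    link_weights = weights[: len(link_words)]
    n = random.choice([1, 2])  # equally likely one- or two-word link
    target = " ".join(random.choices(link_words, weights=link_weights, k=n))
    return f"[[{target}]]"     # double-bracket page link (Roam/Logseq style)
```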

Putting it all together

When we combine all of this, we get a decent set of test data, either as Markdown files or as a Roam Research-compatible JSON file.

The final test data set
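
As a final sketch, assuming the helpers from the previous sections, the Markdown output could be assembled roughly like this; the block-count parameters and the file-naming scheme are placeholders I've made up:

```python
import random
from pathlib import Path

def generate_markdown(n_pages: int = 5000, out_dir: str = "testdata"):
    words, weights = load_word_frequencies()   # from "Creating Words"
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i in range(n_pages):
        # Blocks per page are randomized too; these parameters are assumed.
        n_blocks = max(1, round(random.gauss(4, 1)))
        blocks = [make_block(words, weights) for _ in range(n_blocks)]
        body = "\n".join(f"- {b}" for b in blocks)  # outliner-style bullets
        (out / f"page-{i:04d}.md").write_text(body, encoding="utf-8")
```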

This is how the generated pages look in Roam Research and in Logseq.

Roam Research to the left, Logseq to the right

Downloads

You can download the files here if you want to experiment with my test data.

Markdown

JSON

We'll look at the competitors and the test tasks in the next post.


If you have any questions or suggestions, please leave a comment.

If you want to support my work, you can do so by becoming a paid member:

Or you can buy me a coffee. Thanks in advance.
